Did you set "hbase.status.published" to true? If you enable it, the master publishes the dead server list to clients every 10s by default, and each client then removes its cached regions for those servers. So there must be something wrong on dn29; please find the first related failure occurrence. You could also pastebin the dn29 regionserver log.

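For reference, a minimal sketch of how that setting would look in hbase-site.xml. Only the property name comes from the question above; the value is for illustration, and as far as I know the feature stays off unless it is set explicitly.

```xml
<!-- hbase-site.xml fragment (illustrative sketch, not taken from the cluster in this thread) -->
<property>
  <name>hbase.status.published</name>
  <!-- true: the master publishes the dead server list to clients -->
  <value>true</value>
</property>
```

If this property is absent or false, the repeated cache removals in the client log are more likely just the client's normal retries against a region that is not online.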
On Mon, Aug 11, 2014 at 8:17 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. it's host dn29.manage.com,60020,1407600154728 is dead but not processed yet
>
> Can you look back (from 22:50:51) in master log to see what happened to dn29 ?
>
> Thanks
>
>
> On Sun, Aug 10, 2014 at 2:51 PM, Thomas Kwan <thomas.k...@manage.com> wrote:
>
> > Thanks for your help Ted.
> >
> > From the master's log, I see
> >
> > 2014-08-09 22:50:51,176 DEBUG [827019302@qtp-63557232-287]
> > client.HBaseAdmin: Trying to compact {ENCODED =>
> > 12c9a609765ad0bbd6468d93368f860a, NAME =>
> > 'm_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.',
> > STARTKEY => '2fd811c2b1d7476efb16499ccb2b823d', ENDKEY =>
> > '3328d07989225a29067b7b7981150052'}:
> > org.apache.hadoop.hbase.NotServingRegionException:
> > org.apache.hadoop.hbase.NotServingRegionException: Region
> > m_hashes,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
> > is not online
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
> >     at org.apache.hadoop.hbase.regionserver.HRegionServer.compactRegion(HRegionServer.java:3750)
> >     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:19803)
> >     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> >     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
> >
> >     at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source)
> >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> >     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> >     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> >     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
> >     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1647)
> >     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1623)
> >     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1504)
> >     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1491)
> >     at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:111)
> >     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
> >     at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
> >     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> >     at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
> >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
> >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> >     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >     at org.mortbay.jetty.Server.handle(Server.java:326)
> >     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> >     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> >     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> >     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> >     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > ...
> > 2014-08-09 23:11:29,846 INFO [AM.-pool1-t3] master.AssignmentManager:
> > Skip assigning {ENCODED => d5887dd2b5897d14a6d2a041fc2ace1f, NAME =>
> > 'm_data,2f03f0fa374de8af4880ba49401cd441,1406839342141.d5887dd2b5897d14a6d2a041fc2ace1f.',
> > STARTKEY => '2f03f0fa374de8af4880ba49401cd441', ENDKEY =>
> > '2fd811c2b1d7476efb16499ccb2b823d'}, we couldn't close it:
> > {d5887dd2b5897d14a6d2a041fc2ace1f state=FAILED_CLOSE,
> > ts=1407651089846, server=dn05.manage.com,60020,1407649977124}
> > ...
> > 2014-08-10 07:49:17,589 INFO [RpcServer.handler=237,port=60000]
> > master.AssignmentManager: Skip assigning
> > m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.,
> > it's host dn29.manage.com,60020,1407600154728 is dead but not
> > processed yet
> >
> > And I checked dn29 via hbase UI running at
> > http://dn29.manage.com:60030/rs-status, looks like there is no regions
> > on dn29.
> >
> > thanks
> > thomas
> >
> >
> > On Sun, Aug 10, 2014 at 12:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > Can you check master log to see why
> > > 'm_data,2fd811c2b1d7476efb16499ccb2b823d' went offline ?
> > >
> > > Thanks
> > >
> > >
> > > On Sun, Aug 10, 2014 at 12:13 PM, Thomas Kwan <thomas.k...@manage.com> wrote:
> > >
> > >> Hi Ted,
> > >>
> > >> Hbase version is 0.96.0.2.0
> > >>
> > >> Nothing interesting in the hbase log on dn29 and confirmed that region
> > >> server is running on dn29
> > >>
> > >> When I do 'get', i see
> > >>
> > >> hbase(main):001:0> get 'm_data','2fd811c2b1d7476efb16499ccb2b823d'
> > >>
> > >> COLUMN                CELL
> > >>
> > >> ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region
> > >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
> > >> is not online
> > >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
> > >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
> > >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
> > >>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26925)
> > >>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> > >>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
> > >>
> > >> On Sun, Aug 10, 2014 at 10:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >> > bq. if I can just rmr stuff under /hbase-unsecure/splitWAL/...
> > >> >
> > >> > Please don't.
> > >> >
> > >> > Have you checked region server log on dn29.manage.com ?
> > >> >
> > >> > What hbase version are you using ?
> > >> >
> > >> > Cheers
> > >> >
> > >> >
> > >> > On Sun, Aug 10, 2014 at 10:27 AM, Thomas Kwan <thomas.k...@manage.com> wrote:
> > >> >
> > >> >> And I have a program that do some read operations and it hangs. And I am
> > >> >> seeing
> > >> >>
> > >> >> 2014-08-10 12:22:05,359 DEBUG [main]
> > >> >> client.HConnectionManager$HConnectionImplementation: Removed all
> > >> >> cached region locations that map to
> > >> >> dn29.manage.com,60020,1407600154728
> > >> >> 2014-08-10 12:22:06,173 DEBUG [main]
> > >> >> client.HConnectionManager$HConnectionImplementation: Removed
> > >> >> dn29.manage.com:60020 as a location of
> > >> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
> > >> >> for tableName=m_data from cache
> > >> >> 2014-08-10 12:22:07,180 DEBUG [main]
> > >> >> client.HConnectionManager$HConnectionImplementation: Removed
> > >> >> dn29.manage.com:60020 as a location of
> > >> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
> > >> >> for tableName=m_data from cache
> > >> >> 2014-08-10 12:22:09,193 DEBUG [main]
> > >> >> client.HConnectionManager$HConnectionImplementation: Removed
> > >> >> dn29.manage.com:60020 as a location of
> > >> >> m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.
> > >> >> for tableName=m_data from cache
> > >> >> 2014-08-10 12:22:09,196 DEBUG [main]
> > >> >> client.HConnectionManager$HConnectionImplementation: Removed all
> > >> >> cached region locations that map to
> > >> >> dn29.manage.com,60020,1407600154728
> > >> >> 2014-08-10 12:22:13,208 DEBUG [main]
> > >> >> client.HConnectionManager$HConnectionImplementation: Removed all
> > >> >> cached region locations that map to
> > >> >> dn29.manage.com,60020,1407600154728
> > >> >>
> > >> >> I am seeing the following in the hbase master also
> > >> >>
> > >> >> 2014-08-10 10:22:25,016 INFO
> > >> >> [master02.manage.com,60000,1407690402682.splitLogManagerTimeoutMonitor]
> > >> >> master.SplitLogManager: total tasks = 1 unassigned = 0
> > >> >> tasks={/hbase-unsecure/splitWAL/WALs%2Fdn29.manage.com%2C60020%2C1407600154728-splitting%2Fdn29.manage.com%252C60020%252C1407600154728.1407621759364=last_update
> > >> >> = 1407690428226 last_version = 53 cur_worker_name =
> > >> >> dn21.manage.com,60020,1407650188526 status = in_progress incarnation =
> > >> >> 3 resubmits = 3 batch = installed = 1 done = 0 error = 0}
> > >> >>
> > >> >> I wonder if I can just rmr stuff under /hbase-unsecure/splitWAL/...
> > >> >>
> > >> >> thanks
> > >> >> thomas