bq. it's host dn29.manage.com,60020,1407600154728 is dead but not processed yet

Can you look back (from 22:50:51) in master log to see what happened to dn29 ?

Thanks

On Sun, Aug 10, 2014 at 2:51 PM, Thomas Kwan <thomas.k...@manage.com> wrote:

> Thanks for your help Ted.
>
> From the master's log, I see
>
> 2014-08-09 22:50:51,176 DEBUG [827019302@qtp-63557232-287] client.HBaseAdmin: Trying to compact {ENCODED => 12c9a609765ad0bbd6468d93368f860a, NAME => 'm_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a.', STARTKEY => '2fd811c2b1d7476efb16499ccb2b823d', ENDKEY => '3328d07989225a29067b7b7981150052'}: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region m_hashes,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a. is not online
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.compactRegion(HRegionServer.java:3750)
>     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:19803)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
>
>     at sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
>     at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:277)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1647)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1623)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1504)
>     at org.apache.hadoop.hbase.client.HBaseAdmin.compact(HBaseAdmin.java:1491)
>     at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:111)
>     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>     at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>     at org.mortbay.jetty.Server.handle(Server.java:326)
>     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> ...
> 2014-08-09 23:11:29,846 INFO [AM.-pool1-t3] master.AssignmentManager: Skip assigning {ENCODED => d5887dd2b5897d14a6d2a041fc2ace1f, NAME => 'm_data,2f03f0fa374de8af4880ba49401cd441,1406839342141.d5887dd2b5897d14a6d2a041fc2ace1f.', STARTKEY => '2f03f0fa374de8af4880ba49401cd441', ENDKEY => '2fd811c2b1d7476efb16499ccb2b823d'}, we couldn't close it: {d5887dd2b5897d14a6d2a041fc2ace1f state=FAILED_CLOSE, ts=1407651089846, server=dn05.manage.com,60020,1407649977124}
> ...
> 2014-08-10 07:49:17,589 INFO [RpcServer.handler=237,port=60000] master.AssignmentManager: Skip assigning m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a., it's host dn29.manage.com,60020,1407600154728 is dead but not processed yet
>
> And I checked dn29 via the HBase UI running at http://dn29.manage.com:60030/rs-status, and it looks like there are no regions on dn29.
>
> thanks
> thomas
>
>
> On Sun, Aug 10, 2014 at 12:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > Can you check master log to see why 'm_data,2fd811c2b1d7476efb16499ccb2b823d' went offline ?
> >
> > Thanks
> >
> >
> > On Sun, Aug 10, 2014 at 12:13 PM, Thomas Kwan <thomas.k...@manage.com> wrote:
> >
> >> Hi Ted,
> >>
> >> HBase version is 0.96.0.2.0
> >>
> >> Nothing interesting in the HBase log on dn29, and I confirmed that the region server is running on dn29
> >>
> >> When I do 'get', I see
> >>
> >> hbase(main):001:0> get 'm_data','2fd811c2b1d7476efb16499ccb2b823d'
> >>
> >> COLUMN CELL
> >>
> >> ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a. is not online
> >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2585)
> >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3952)
> >>     at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
> >>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26925)
> >>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2175)
> >>     at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879)
> >>
> >> On Sun, Aug 10, 2014 at 10:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> > bq. if I can just rmr stuff under /hbase-unsecure/splitWAL/...
> >> >
> >> > Please don't.
> >> >
> >> > Have you checked region server log on dn29.manage.com ?
> >> >
> >> > What hbase version are you using ?
> >> >
> >> > Cheers
> >> >
> >> >
> >> > On Sun, Aug 10, 2014 at 10:27 AM, Thomas Kwan <thomas.k...@manage.com> wrote:
> >> >
> >> >> And I have a program that does some read operations and it hangs. And I am seeing
> >> >>
> >> >> 2014-08-10 12:22:05,359 DEBUG [main] client.HConnectionManager$HConnectionImplementation: Removed all cached region locations that map to dn29.manage.com,60020,1407600154728
> >> >> 2014-08-10 12:22:06,173 DEBUG [main] client.HConnectionManager$HConnectionImplementation: Removed dn29.manage.com:60020 as a location of m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a. for tableName=m_data from cache
> >> >> 2014-08-10 12:22:07,180 DEBUG [main] client.HConnectionManager$HConnectionImplementation: Removed dn29.manage.com:60020 as a location of m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a. for tableName=m_data from cache
> >> >> 2014-08-10 12:22:09,193 DEBUG [main] client.HConnectionManager$HConnectionImplementation: Removed dn29.manage.com:60020 as a location of m_data,2fd811c2b1d7476efb16499ccb2b823d,1406512331699.12c9a609765ad0bbd6468d93368f860a. for tableName=m_data from cache
> >> >> 2014-08-10 12:22:09,196 DEBUG [main] client.HConnectionManager$HConnectionImplementation: Removed all cached region locations that map to dn29.manage.com,60020,1407600154728
> >> >> 2014-08-10 12:22:13,208 DEBUG [main] client.HConnectionManager$HConnectionImplementation: Removed all cached region locations that map to dn29.manage.com,60020,1407600154728
> >> >>
> >> >> I am seeing the following in the hbase master also
> >> >>
> >> >> 2014-08-10 10:22:25,016 INFO [master02.manage.com,60000,1407690402682.splitLogManagerTimeoutMonitor] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fdn29.manage.com%2C60020%2C1407600154728-splitting%2Fdn29.manage.com%252C60020%252C1407600154728.1407621759364=last_update = 1407690428226 last_version = 53 cur_worker_name = dn21.manage.com,60020,1407650188526 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 1 done = 0 error = 0}
> >> >>
> >> >> I wonder if I can just rmr stuff under /hbase-unsecure/splitWAL/...
> >> >>
> >> >> thanks
> >> >> thomas
> >> >>
> >> >
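For anyone reading the logs quoted in this thread, the server names and the splitWAL task name are easier to follow once decoded. The following is a minimal Python sketch, not part of HBase: `parse_server_name` and `decode_split_task` are hypothetical helper names. It assumes the third field of a `host,port,startcode` server name is the region server's start time in epoch milliseconds, and that the task name under /hbase-unsecure/splitWAL/ is a URL-encoded WAL path. The `task` string is the name from the SplitLogManager line above, rejoined across the mail line wraps.

```python
from datetime import datetime, timezone
from urllib.parse import unquote


def parse_server_name(server_name):
    """Split 'host,port,startcode' into (host, port, start time in UTC).

    Assumption: startcode is epoch milliseconds; sub-second precision is dropped.
    """
    host, port, startcode = server_name.split(",")
    started = datetime.fromtimestamp(int(startcode) // 1000, tz=timezone.utc)
    return host, int(port), started


def decode_split_task(task_name):
    """Undo one level of URL encoding on a splitWAL task name."""
    return unquote(task_name)


# Task name copied from the SplitLogManager log line, rejoined across line wraps.
task = (
    "WALs%2Fdn29.manage.com%2C60020%2C1407600154728-splitting"
    "%2Fdn29.manage.com%252C60020%252C1407600154728.1407621759364"
)

decoded_task = decode_split_task(task)
dead_server = parse_server_name("dn29.manage.com,60020,1407600154728")
split_worker = parse_server_name("dn21.manage.com,60020,1407650188526")

print(decoded_task)
print(dead_server)
print(split_worker)
```

Under those assumptions, the pending task refers to a WAL under the `-splitting` directory of the dn29 instance with start code 1407600154728, which is the same instance the master reports as "dead but not processed yet". The timestamps are printed in UTC, while the log lines use each machine's local time.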