[ https://issues.apache.org/jira/browse/HBASE-19710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312248#comment-16312248 ]
Ted Yu commented on HBASE-19710:
--------------------------------

The build I used corresponded to this commit:

HBASE-19667 Get rid of MasterEnvironment#supportGroupCPs

The cluster has 13 nodes, running hadoop 3.

> hbase:namespace table was stuck in transition
> ---------------------------------------------
>
>                 Key: HBASE-19710
>                 URL: https://issues.apache.org/jira/browse/HBASE-19710
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Priority: Critical
>
> ITBLL with chaos monkey failed due to namespace table getting stuck in transition.
> From hbase-hbase-master-ctr-e137-1514896590304-3629-01-000006.hwx.site.log, we can see that master closed namespace table on 000009:
> {code}
> 2018-01-04 17:24:35,067 DEBUG [main-EventThread] zookeeper.ZKWatcher: master:20000-0x160c222710c0028, quorum=ctr-e137-1514896590304-3629-01-000011.hwx.site:2181,ctr-e137-1514896590304-3629-01-000014.hwx.site:2181,ctr-e137-1514896590304-3629-01-000009.hwx.site:2181,ctr-e137-1514896590304-3629-01-000006.hwx.site:2181,ctr-e137-1514896590304-3629-01-000003.hwx.site:2181,ctr-e137-1514896590304-3629-01-000007.hwx.site:2181,ctr-e137-1514896590304-3629-01-000013.hwx.site:2181,ctr-e137-1514896590304-3629-01-000002.hwx.site:2181,ctr-e137-1514896590304-3629-01-000012.hwx.site:2181,ctr-e137-1514896590304-3629-01-000008.hwx.site:2181,ctr-e137-1514896590304-3629-01-000010.hwx.site:2181, baseZNode=/hbase-unsecure Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/hbase-unsecure/rs
> 2018-01-04 17:24:35,067 INFO [ProcExecWrkr-5] assignment.RegionStateStore: pid=643 updating hbase:meta row=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., regionState=CLOSING, regionLocation=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872
> ...
> 2018-01-04 17:24:35,246 INFO [ProcExecWrkr-12] procedure.MasterProcedureScheduler: pid=647, ppid=642, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=a95ed2d7434a43390fbec73abeeb9fd9 hbase:namespace hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:25:17,041 DEBUG [ctr-e137-1514896590304-3629-01-000006:20000.masterManager] procedure2.ProcedureExecutor: Loading pid=641, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure hri=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., source=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872, destination=
> {code}
> For the move operation, from ctr-e137-1514896590304-3629-01-000009.hwx.site log:
> {code}
> 2018-01-04 17:24:34,855 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] coprocessor.CoprocessorHost: Stop coprocessor org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint
> 2018-01-04 17:24:34,855 INFO [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] regionserver.HRegion: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:24:34,856 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] handler.CloseRegionHandler: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> ...
> 2018-01-04 17:25:47,607 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=18,queue=0,port=16020] ipc.RpcServer: callId: 16 service: ClientService methodName: Get size: 103 connection: 172.27.13.80:36738 deadline: 1515086837568
> org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9. is not online on ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3289)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1354)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2360)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:403)
> {code}
> We can see that the region server was not serving the region.
> After that, the masters kept thinking namespace table was on 0009, leading to prolonged downtime.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
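
As a reading aid for the logs quoted above, here is a minimal Python sketch (not part of the original report) that compares the region location the master recorded in hbase:meta with the server that raised NotServingRegionException. It assumes the usual HBase server-name layout of `host,port,startcode`. The helper names `parse_server_name`, `same_address`, and `same_instance` are hypothetical, introduced only for illustration. The two strings are copied from the log excerpts.

```python
from typing import NamedTuple


class ServerName(NamedTuple):
    """One HBase server identity, assuming the 'host,port,startcode' layout."""
    host: str
    port: int
    start_code: int


def parse_server_name(text: str) -> ServerName:
    # Hypothetical helper: split "host,port,startcode" into typed fields.
    host, port, start_code = text.strip().split(",")
    return ServerName(host, int(port), int(start_code))


def same_address(a: ServerName, b: ServerName) -> bool:
    # True when both names point at the same host and port.
    return (a.host, a.port) == (b.host, b.port)


def same_instance(a: ServerName, b: ServerName) -> bool:
    # True only when host, port, and start code all match.
    return a == b


# regionLocation written by the master (RegionStateStore line in the master log).
master_view = parse_server_name(
    "ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872"
)
# Server named in the NotServingRegionException (region server log).
rs_reply = parse_server_name(
    "ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163"
)

print("same host and port:", same_address(master_view, rs_reply))
print("same server instance:", same_instance(master_view, rs_reply))
```

The two names share a host and port but carry different start codes. That would be consistent with the region server process on 000009 having been restarted between the two log excerpts, although the report itself does not state this.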