[ 
https://issues.apache.org/jira/browse/HBASE-19710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312248#comment-16312248
 ] 

Ted Yu commented on HBASE-19710:
--------------------------------

The build I used corresponded to this commit:

HBASE-19667 Get rid of MasterEnvironment#supportGroupCPs

The cluster has 13 nodes and runs Hadoop 3.

> hbase:namespace table was stuck in transition
> ---------------------------------------------
>
>                 Key: HBASE-19710
>                 URL: https://issues.apache.org/jira/browse/HBASE-19710
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Priority: Critical
>
> ITBLL (IntegrationTestBigLinkedList) with chaos monkey failed because the
> hbase:namespace table got stuck in transition.
> From hbase-hbase-master-ctr-e137-1514896590304-3629-01-000006.hwx.site.log,
> we can see that the master closed the namespace table on 000009:
> {code}
> 2018-01-04 17:24:35,067 DEBUG [main-EventThread] zookeeper.ZKWatcher: master:20000-0x160c222710c0028, quorum=ctr-e137-1514896590304-3629-01-000011.hwx.site:2181,ctr-e137-1514896590304-3629-01-000014.hwx.site:2181,ctr-e137-1514896590304-3629-01-000009.hwx.site:2181,ctr-e137-1514896590304-3629-01-000006.hwx.site:2181,ctr-e137-1514896590304-3629-01-000003.hwx.site:2181,ctr-e137-1514896590304-3629-01-000007.hwx.site:2181,ctr-e137-1514896590304-3629-01-000013.hwx.site:2181,ctr-e137-1514896590304-3629-01-000002.hwx.site:2181,ctr-e137-1514896590304-3629-01-000012.hwx.site:2181,ctr-e137-1514896590304-3629-01-000008.hwx.site:2181,ctr-e137-1514896590304-3629-01-000010.hwx.site:2181, baseZNode=/hbase-unsecure Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/hbase-unsecure/rs
> 2018-01-04 17:24:35,067 INFO  [ProcExecWrkr-5] assignment.RegionStateStore: pid=643 updating hbase:meta row=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., regionState=CLOSING, regionLocation=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872
> ...
> 2018-01-04 17:24:35,246 INFO  [ProcExecWrkr-12] procedure.MasterProcedureScheduler: pid=647, ppid=642, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=a95ed2d7434a43390fbec73abeeb9fd9 hbase:namespace hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:25:17,041 DEBUG [ctr-e137-1514896590304-3629-01-000006:20000.masterManager] procedure2.ProcedureExecutor: Loading pid=641, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure hri=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., source=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872, destination=
> {code}
> For the move operation, from ctr-e137-1514896590304-3629-01-000009.hwx.site 
> log:
> {code}
> 2018-01-04 17:24:34,855 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] coprocessor.CoprocessorHost: Stop coprocessor org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint
> 2018-01-04 17:24:34,855 INFO  [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] regionserver.HRegion: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:24:34,856 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] handler.CloseRegionHandler: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> ...
> 2018-01-04 17:25:47,607 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=18,queue=0,port=16020] ipc.RpcServer: callId: 16 service: ClientService methodName: Get size: 103 connection: 172.27.13.80:36738 deadline: 1515086837568
> org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9. is not online on ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3289)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1354)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2360)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:403)
> {code}
> We can see that the region server was not serving the region.
> After that, the masters kept assuming the namespace table was on 000009, leading to
> prolonged downtime.
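
One detail in the logs above may be worth checking when reading them: HBase server names have the form host,port,startcode, and the start code in the regionLocation written to hbase:meta (1515086643872) differs from the one in the NotServingRegionException (1515086719163). The sketch below only compares the two strings taken from the logs. The helper names (ServerName, parse_server_name, same_instance) are hypothetical and not HBase APIs, and the reading that 000009 was restarted between the close and the Get is an assumption, not something the report states.

```python
from typing import NamedTuple


class ServerName(NamedTuple):
    """Hypothetical holder for an HBase-style 'host,port,startcode' name."""
    host: str
    port: int
    start_code: int  # timestamp in milliseconds identifying one server process


def parse_server_name(text: str) -> ServerName:
    """Split 'host,port,startcode' into its three fields."""
    host, port, start_code = text.strip().split(",")
    return ServerName(host, int(port), int(start_code))


def same_instance(a: ServerName, b: ServerName) -> bool:
    """Same host and port but a different start code means a different process."""
    return a == b


# Location recorded in hbase:meta by pid=643 (master log).
meta_location = parse_server_name(
    "ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872")
# Server that threw NotServingRegionException (region server log).
serving = parse_server_name(
    "ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163")

if not same_instance(meta_location, serving):
    delta_ms = serving.start_code - meta_location.start_code
    print(f"start codes differ by {delta_ms} ms on {serving.host}")
```

If the two names do refer to different processes, the location the master kept for the region would point at a server instance that no longer exists, which would be consistent with the region never being reassigned.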



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
