[ 
https://issues.apache.org/jira/browse/HBASE-20552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469113#comment-16469113
 ] 

stack commented on HBASE-20552:
-------------------------------

Thanks for report. We throw UnexpectedStateException when we meet a condition 
we do not know how to handle. I expect there are a few of these lucking in AMv2.

Did all recover after the RS ABORTED?

The pid=465 on M2 is the original move done back up on M1 being replayed and 
thinking it is done. It seems to have succeeded in getting the region assigned 
(to 0002 it seems).

A move is composed of an unassign followed by an assign. The unsassign seems to 
have completed (a sub-procedure of pid=465) but what happened to the assign 
that was a sub-procedure of pid=465? It looks like we create a new assign when 
processing the crashed server. This is pid=475. Now there are two assigns for 
the region. Only one should prevail (the second when it notices the other 
assign should give up... There is a lock on region during the running of the 
assign so only one can run at a time).

What happens to pid=475? It succeeds? It gave up because it saw the 
subprocedure of pid=465 had succeeded, the assign?

What is pid=507? Bulk assign? Or crash processing?

We need to figure what state went without an update (an update of hbase:meta) 
or what procedure went to run presuming a state that wasn't true for some 
reason.

Looks like a good one. Thanks.



> HBase RegionServer was shutdown due to UnexpectedStateException
> ---------------------------------------------------------------
>
>                 Key: HBASE-20552
>                 URL: https://issues.apache.org/jira/browse/HBASE-20552
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Romil Choksi
>            Priority: Critical
>
> This was observed during cluster testing (source code sync'ed with hbase-2.0, 
> built May 2nd):
> {code}
> 2018-05-02 05:44:10,089 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
> master.MasterRpcServices: Region server 
> ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 reported 
> a fatal error:
> ***** ABORTING region server 
> ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-     
>         1518143905142-279227-01-000007.hwx.site,16020,1525239609353, 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-      
>     1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state has 
> otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
> rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>                          table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
>   ... 7 more
>  *****
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>        table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
> rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>                          table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
>   ... 7 more
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> {code}
> [~elserj] and I (Ted) did some initial analysis.
> In the following description, M1 refers to 
> master-ctr-e138-1518143905142-279227-01-000005 and M2 refers to 
> master-ctr-e138-1518143905142-279227-01-000003.
> Let's follow region 94f6ca283dbb4445b2bcdc321b734d28 .
> Master 1 was moving the region to 000007:
> {code}
> 2018-05-02 05:38:59,017 INFO  
> [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
> master.HMaster: balance hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> 2018-05-02 05:38:59,126 DEBUG 
> [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
> procedure2.ProcedureExecutor: Stored pid=465, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> {code}
> Region was closed on 000002:
> {code}
> 2018-05-02 05:38:59,327 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  regionserver.HRegion: Closed 
> test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
> 2018-05-02 05:38:59,327 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  regionserver.HRegionServer: Adding 94f6ca283dbb4445b2bcdc321b734d28 move to 
> ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 record 
> at close sequenceid=13395
> 2018-05-02 05:38:59,329 DEBUG 
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  handler.CloseRegionHandler: Closed 
> test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
> {code}
> The master saw the destination crash:
> {code}
> 2018-05-02 05:39:44,792 INFO  [main-EventThread] master.RegionServerTracker: 
> RegionServer ephemeral node deleted, processing expiration 
> [ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502]
> 2018-05-02 05:39:44,793 INFO  [main-EventThread] master.ServerManager: 
> Processing expiration of 
> ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 on 
> ctr-e138-1518143905142-279227-01-000005.hwx.site,20000,1525238791399
> {code}
> Processed it:
> {code}
> 2018-05-02 05:39:45,616 DEBUG [PEWorker-16] procedure.ServerCrashProcedure: 
> Done splitting WALs pid=471, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
> ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=true
> 2018-05-02 05:39:45,620 INFO  [PEWorker-16] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=474, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=670f6b815d2acac905130e5440d59304}, {pid=475, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28}, {pid=476, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=1595f38ee901be7c67b997fe2fc95951}, {pid=477, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=a0620fc83de532a37f6a9bb8f99cc6c4}, {pid=478, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=e75a388bc2011feed75bdc1a0e99a9a9}, {pid=479, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=c963eb77dbdc6dbab886dbe4eebba5ad}, {pid=480, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=f3db9f9879ed03f488dcb89bea834237}, {pid=481, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=b5180eee96b616afdf79578309c66a11}, {pid=482, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=a6e0d7561c4f19e78f94d37462588281}, {pid=483, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=1d954f21d711345a9587d995cecea136}, {pid=484, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=c078deb2474e9c19b85b5fdb9efaa47d}, {pid=485, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=91f73e76bbe7bc8a61b1b1299d34c6ab}, {pid=486, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=ATLAS_ENTITY_AUDIT_EVENTS, region=8dc6fd2022c2fdf8c065fbd16cadaaca}]
> {code}
> Master 1 tried to assign the region back on the original RS (that we tried to 
> move it off of):
> {code}
> 2018-05-02 05:39:46,136 INFO  [PEWorker-1] 
> assignment.RegionTransitionProcedure: Dispatch pid=475, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure                  
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474
> {code}
> Then Master 1 went down, Master 2 picked up the region.
> M2 ran a MoveProcedure where it thought region was on 007:
> {code}
> 2018-05-02 05:43:33,876 DEBUG 
> [master/ctr-e138-1518143905142-279227-01-000003:20000] 
> procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; 
> MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> {code}
> and assigned region to 007:
> {code}
> 2018-05-02 05:44:08,464 INFO  [PEWorker-6] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28}]
> 2018-05-02 05:44:08,472 INFO  [PEWorker-6] 
> procedure.MasterProcedureScheduler: pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28 checking lock on 
> 94f6ca283dbb4445b2bcdc321b734d28
> 2018-05-02 05:44:08,480 INFO  [PEWorker-6] assignment.AssignProcedure: 
> Starting pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OFFLINE, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502;
>  forceNewPlan=false, retain=true
> 2018-05-02 05:44:08,646 INFO  
> [master/ctr-e138-1518143905142-279227-01-000003:20000] 
> balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart 
> assignment. 
> 2018-05-02 05:44:08,659 INFO  [PEWorker-11] assignment.RegionStateStore: 
> pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
> regionState=OPENING, 
> regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:08,727 INFO  [PEWorker-11] 
> assignment.RegionTransitionProcedure: Dispatch pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:08,768 INFO  
> [master/ctr-e138-1518143905142-279227-01-000003:20000] master.HMaster: Master 
> has completed initialization 41.430sec
> 2018-05-02 05:44:09,213 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
> assignment.RegionTransitionProcedure: Received report OPENED seqId=13402, 
> pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,213 DEBUG [PEWorker-12] 
> assignment.RegionTransitionProcedure: Finishing pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,214 INFO  [PEWorker-12] assignment.RegionStateStore: 
> pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
> regionState=OPEN, openSeqNum=13402, 
> regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=507, state=RUNNABLE:SERVER_CRASH_HANDLE_RIT2; 
> ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=false; resume parent processing.
> 2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished pid=508, ppid=507, state=SUCCESS; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28 in 764msec
> 2018-05-02 05:44:09,273 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
> Finished pid=507, state=SUCCESS; ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=false in 975msec
> {code}
> So when 000002 reported region to master 2, we got the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to