[ 
https://issues.apache.org/jira/browse/HBASE-20552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479767#comment-16479767
 ] 

Josh Elser commented on HBASE-20552:
------------------------------------

Trying to help pick this one up too..
{quote}bq. On M005, pid=471 is SCP for R007 which also hosts meta. Meta is 
re-assigned with pid=472 to R002 which is followed by other region assignments
{quote}
I'm coming to think this is our problem, too.

pid=471 is an SCP for r007 from pv2-004.log which finished at  05:39:47,288 on 
m005. When m003 takes over and reads the tracker from pv2-002.log, the largest 
pid we have is pid=467.

My hunch (which I need to back up with code) is that because m003 never sees 
the completed SCP, it thinks that r002 is holding this region (overriding what 
meta say, maybe?), claiming it to be on r007 instead. The following is the 
"largest" proc from the pv2-004 log that m003 reads.
{noformat}
2018-05-02 05:43:33,876 DEBUG 
[master/ctr-e138-1518143905142-279227-01-000003:20000] 
procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; 
MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, 
source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502{noformat}
Then, m003 initializes RegionStateStore, saying:
{noformat}
2018-05-02 05:44:08,236 INFO  
[master/ctr-e138-1518143905142-279227-01-000003:20000] 
assignment.RegionStateStore: Load hbase:meta entry 
region=94f6ca283dbb4445b2bcdc321b734d28, regionState=OPEN, 
lastHost=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
{noformat}
This makes me wonder if Umesh's findings about pid=507 (SCP for r007 putting 
the region back on r007) are related...

You get anywhere on a repro, [~uagashe]? I have some nodes running through this 
internal scenario which has triggered this before. Might try my hand at 
repro'ing in an IT, but unsure how hard that will be ;)

> HBase RegionServer was shutdown due to UnexpectedStateException
> ---------------------------------------------------------------
>
>                 Key: HBASE-20552
>                 URL: https://issues.apache.org/jira/browse/HBASE-20552
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Romil Choksi
>            Assignee: Umesh Agashe
>            Priority: Critical
>         Attachments: 
> 102143-master-ctr-e138-1518143905142-279227-01-000003.hwx.site.log, 
> 102143-master-ctr-e138-1518143905142-279227-01-000005.hwx.site.log, 
> 102143-regionserver-ctr-e138-1518143905142-279227-01-000002.hwx.site.log, 
> 102143-regionserver-ctr-e138-1518143905142-279227-01-000007.hwx.site.log, 
> 102143-regionserver-ctr-e138-1518143905142-279227-01-000008.hwx.site.log
>
>
> This was observed during cluster testing (source code sync'ed with hbase-2.0, 
> built May 2nd):
> {code}
> 2018-05-02 05:44:10,089 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
> master.MasterRpcServices: Region server 
> ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 reported 
> a fatal error:
> ***** ABORTING region server 
> ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, location=ctr-e138-     
>         1518143905142-279227-01-000007.hwx.site,16020,1525239609353, 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on server=ctr-e138-      
>     1518143905142-279227-01-000002.hwx.site,16020,1525239334474 but state has 
> otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
> rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>                          table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
>   ... 7 more
>  *****
> Cause:
> org.apache.hadoop.hbase.YouAreDeadException: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>        table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1065)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:987)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:459)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:13118)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException: 
> rit=OPEN, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353,
>                          table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28reported OPEN on 
> server=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474 
> but state  has otherwise.
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1037)
>   ... 7 more
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> {code}
> [~elserj] and I (Ted) did some initial analysis.
> In the following description, M1 refers to 
> master-ctr-e138-1518143905142-279227-01-000005 and M2 refers to 
> master-ctr-e138-1518143905142-279227-01-000003.
> Let's follow region 94f6ca283dbb4445b2bcdc321b734d28 .
> Master 1 was moving the region to 000007:
> {code}
> 2018-05-02 05:38:59,017 INFO  
> [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
> master.HMaster: balance hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> 2018-05-02 05:38:59,126 DEBUG 
> [master/ctr-e138-1518143905142-279227-01-000005:20000.Chore.1] 
> procedure2.ProcedureExecutor: Stored pid=465, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> {code}
> Region was closed on 000002:
> {code}
> 2018-05-02 05:38:59,327 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  regionserver.HRegion: Closed 
> test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
> 2018-05-02 05:38:59,327 INFO  
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  regionserver.HRegionServer: Adding 94f6ca283dbb4445b2bcdc321b734d28 move to 
> ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 record 
> at close sequenceid=13395
> 2018-05-02 05:38:59,329 DEBUG 
> [RS_CLOSE_REGION-regionserver/ctr-e138-1518143905142-279227-01-000002:16020-1]
>  handler.CloseRegionHandler: Closed 
> test_hbase_ha_load_test_tool_hbase,1eb851e9,1525238855422.94f6ca283dbb4445b2bcdc321b734d28.
> {code}
> The master saw the destination crash:
> {code}
> 2018-05-02 05:39:44,792 INFO  [main-EventThread] master.RegionServerTracker: 
> RegionServer ephemeral node deleted, processing expiration 
> [ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502]
> 2018-05-02 05:39:44,793 INFO  [main-EventThread] master.ServerManager: 
> Processing expiration of 
> ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502 on 
> ctr-e138-1518143905142-279227-01-000005.hwx.site,20000,1525238791399
> {code}
> Processed it:
> {code}
> 2018-05-02 05:39:45,616 DEBUG [PEWorker-16] procedure.ServerCrashProcedure: 
> Done splitting WALs pid=471, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
> ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=true
> 2018-05-02 05:39:45,620 INFO  [PEWorker-16] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=474, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=670f6b815d2acac905130e5440d59304}, {pid=475, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28}, {pid=476, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=1595f38ee901be7c67b997fe2fc95951}, {pid=477, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=a0620fc83de532a37f6a9bb8f99cc6c4}, {pid=478, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=e75a388bc2011feed75bdc1a0e99a9a9}, {pid=479, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=c963eb77dbdc6dbab886dbe4eebba5ad}, {pid=480, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=f3db9f9879ed03f488dcb89bea834237}, {pid=481, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=b5180eee96b616afdf79578309c66a11}, {pid=482, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=a6e0d7561c4f19e78f94d37462588281}, {pid=483, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=1d954f21d711345a9587d995cecea136}, {pid=484, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=c078deb2474e9c19b85b5fdb9efaa47d}, {pid=485, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_namenode, 
> region=91f73e76bbe7bc8a61b1b1299d34c6ab}, {pid=486, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=ATLAS_ENTITY_AUDIT_EVENTS, region=8dc6fd2022c2fdf8c065fbd16cadaaca}]
> {code}
> Master 1 tried to assign the region back on the original RS (that we tried to 
> move it off of):
> {code}
> 2018-05-02 05:39:46,136 INFO  [PEWorker-1] 
> assignment.RegionTransitionProcedure: Dispatch pid=475, ppid=471, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure                  
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474
> {code}
> Then Master 1 went down, Master 2 picked up the region.
> M2 ran a MoveProcedure where it thought region was on 007:
> {code}
> 2018-05-02 05:43:33,876 DEBUG 
> [master/ctr-e138-1518143905142-279227-01-000003:20000] 
> procedure2.ProcedureExecutor: Completed pid=465, state=SUCCESS; 
> MoveRegionProcedure hri=94f6ca283dbb4445b2bcdc321b734d28, 
> source=ctr-e138-1518143905142-279227-01-000002.hwx.site,16020,1525239334474, 
> destination=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502
> {code}
> and assigned region to 007:
> {code}
> 2018-05-02 05:44:08,464 INFO  [PEWorker-6] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28}]
> 2018-05-02 05:44:08,472 INFO  [PEWorker-6] 
> procedure.MasterProcedureScheduler: pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28 checking lock on 
> 94f6ca283dbb4445b2bcdc321b734d28
> 2018-05-02 05:44:08,480 INFO  [PEWorker-6] assignment.AssignProcedure: 
> Starting pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_QUEUE; 
> AssignProcedure table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OFFLINE, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502;
>  forceNewPlan=false, retain=true
> 2018-05-02 05:44:08,646 INFO  
> [master/ctr-e138-1518143905142-279227-01-000003:20000] 
> balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart 
> assignment. 
> 2018-05-02 05:44:08,659 INFO  [PEWorker-11] assignment.RegionStateStore: 
> pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
> regionState=OPENING, 
> regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:08,727 INFO  [PEWorker-11] 
> assignment.RegionTransitionProcedure: Dispatch pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:08,768 INFO  
> [master/ctr-e138-1518143905142-279227-01-000003:20000] master.HMaster: Master 
> has completed initialization 41.430sec
> 2018-05-02 05:44:09,213 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=28,queue=1,port=20000] 
> assignment.RegionTransitionProcedure: Received report OPENED seqId=13402, 
> pid=508, ppid=507, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,213 DEBUG [PEWorker-12] 
> assignment.RegionTransitionProcedure: Finishing pid=508, ppid=507, 
> state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28; rit=OPENING, 
> location=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,214 INFO  [PEWorker-12] assignment.RegionStateStore: 
> pid=508 updating hbase:meta row=94f6ca283dbb4445b2bcdc321b734d28, 
> regionState=OPEN, openSeqNum=13402, 
> regionLocation=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525239609353
> 2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=507, state=RUNNABLE:SERVER_CRASH_HANDLE_RIT2; 
> ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=false; resume parent processing.
> 2018-05-02 05:44:09,258 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
> Finished pid=508, ppid=507, state=SUCCESS; AssignProcedure 
> table=test_hbase_ha_load_test_tool_hbase, 
> region=94f6ca283dbb4445b2bcdc321b734d28 in 764msec
> 2018-05-02 05:44:09,273 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
> Finished pid=507, state=SUCCESS; ServerCrashProcedure 
> server=ctr-e138-1518143905142-279227-01-000007.hwx.site,16020,1525238558502, 
> splitWal=true, meta=false in 975msec
> {code}
> So when 000002 reported region to master 2, we got the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to