[jira] [Commented] (HBASE-21623) ServerCrashProcedure can stomp on a RIT for a wrong server

Sergey Shelukhin (JIRA) Fri, 15 Feb 2019 11:39:44 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769625#comment-16769625
 ]


Sergey Shelukhin commented on HBASE-21623:
------------------------------------------

[~wchevreuil] the problem is the "for (RegionInfo region : regions) {" loop.
The "regions" are the regions assumed to be on the server; they are obtained in 
a previous procedure step, without any locks. So, if between getting "regions" 
and running the loop the RIT makes a change (transitions the region from OPENING 
on server1 to OPENING on server2), SCP still has this region in its list and 
processes it as if it were still on the crashed server.
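
To illustrate the race, here is a minimal, self-contained sketch. It is not the HBase code and not the attached patch: RegionNode, markUnchecked and markChecked are hypothetical names, and regions and servers are plain strings. It only shows the difference between trusting a snapshot taken without locks and re-checking the region's location under the region lock before touching it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class StaleRegionListSketch {

    /** Hypothetical stand-in for per-region assignment state; not an HBase class. */
    static final class RegionNode {
        final ReentrantLock lock = new ReentrantLock();
        String location; // server the region is assigned to or opening on
        String state;    // e.g. "OPENING", "ABNORMALLY_CLOSED"

        RegionNode(String location, String state) {
            this.location = location;
            this.state = state;
        }
    }

    /** Trusts the snapshot: every listed region is marked, even if it has moved. */
    static List<String> markUnchecked(List<String> snapshot, Map<String, RegionNode> nodes) {
        List<String> touched = new ArrayList<>();
        for (String region : snapshot) {
            RegionNode node = nodes.get(region);
            if (node == null) {
                continue;
            }
            node.lock.lock();
            try {
                node.state = "ABNORMALLY_CLOSED";
                touched.add(region);
            } finally {
                node.lock.unlock();
            }
        }
        return touched;
    }

    /** Re-checks the location under the region lock and skips regions that moved. */
    static List<String> markChecked(String crashedServer, List<String> snapshot,
                                    Map<String, RegionNode> nodes) {
        List<String> touched = new ArrayList<>();
        for (String region : snapshot) {
            RegionNode node = nodes.get(region);
            if (node == null) {
                continue;
            }
            node.lock.lock();
            try {
                if (!crashedServer.equals(node.location)) {
                    // The snapshot is stale: the region now belongs to another server.
                    continue;
                }
                node.state = "ABNORMALLY_CLOSED";
                touched.add(region);
            } finally {
                node.lock.unlock();
            }
        }
        return touched;
    }
}
```

With a snapshot of [region1, region2] taken for oldServer, and region1 moved to newServer before the loop runs, markUnchecked overwrites region1's state, while markChecked leaves it alone and only marks region2.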

> ServerCrashProcedure can stomp on a RIT for a wrong server
> ----------------------------------------------------------
>
>                 Key: HBASE-21623
>                 URL: https://issues.apache.org/jira/browse/HBASE-21623
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21623.patch
>
>
> A server died while some region was being opened on it; eventually the open 
> failed, and the RIT procedure started retrying on a different server.
> However, by then SCP for the dying server had already obtained the region 
> from the list of regions on the old server, and proceeded to overwrite 
> whatever the RIT was doing with a new server.
> {noformat}
> 2018-12-18 23:06:03,160 INFO  [PEWorker-14] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=151404, ppid=151104, state=RUNNABLE, 
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> ...
> 2018-12-18 23:06:38,208 INFO  [PEWorker-10] procedure.ServerCrashProcedure: 
> Start pid=151632, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; 
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true, 
> meta=false
> ...
> 2018-12-18 23:06:41,953 WARN  [RSProcedureDispatcher-pool4-t115] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=151404, 
> ppid=151104, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
> {ENCODED => region1, ... } to server oldServer,17020,1545202098577 failed
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: 
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server 
> oldServer,17020,1545202098577 aborting
> 2018-12-18 23:06:42,485 INFO  [PEWorker-5] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=151104, ppid=150875, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; resume parent 
> processing.
> 2018-12-18 23:06:42,485 INFO  [PEWorker-13] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=151104, ppid=150875, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, 
> location=oldServer,17020,1545202098577
> 2018-12-18 23:06:42,500 INFO  [PEWorker-13] 
> assignment.TransitRegionStateProcedure: Starting pid=151104, ppid=150875, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, 
> location=null; forceNewPlan=true, retain=false
> 2018-12-18 23:06:42,657 INFO  [PEWorker-2] assignment.RegionStateStore: 
> pid=151104 updating hbase:meta row=region1, regionState=OPENING, 
> regionLocation=newServer,17020,1545202111238
> ...
> 2018-12-18 23:06:43,094 INFO  [PEWorker-4] procedure.ServerCrashProcedure: 
> pid=151632, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true; 
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true, 
> meta=false found RIT  pid=151104, ppid=150875, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, 
> location=newServer,17020,1545202111238, table=t1, region=region1
> 2018-12-18 23:06:43,094 INFO  [PEWorker-4] assignment.RegionStateStore: 
> pid=151104 updating hbase:meta row=region1, regionState=ABNORMALLY_CLOSED
> {noformat}
> Later, the RIT overwrote the state again, it seems, and then the region got 
> stuck in OPENING state forever, but I'm not sure yet if that's just due to 
> this bug or if there was another bug after that. For now this can be 
> addressed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
