[ https://issues.apache.org/jira/browse/HBASE-21623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769318#comment-16769318 ]
Wellington Chevreuil commented on HBASE-21623:
----------------------------------------------

Thanks for the explanation [~sershe], however I still think the locks should have prevented it. Maybe my reading of this code path is mistaken, but here is my interpretation of which pieces of code are involved:

{quote}
SCP: server1 crashed, what's on server1? looks like r1
{quote}

This would correspond to this part of the SCP code:

{noformat}
for (RegionInfo region : regions) {
  RegionStateNode regionNode = am.getRegionStates().getOrCreateRegionStateNode(region);
  regionNode.lock();
  try {
    if (regionNode.getProcedure() != null) {
      LOG.info("{} found RIT {}; {}", this, regionNode.getProcedure(), regionNode);
      regionNode.getProcedure().serverCrashed(env, regionNode, getServerName());
    } else {
      if (env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLING, TableState.State.DISABLED)) {
        continue;
      }
      TransitRegionStateProcedure proc = TransitRegionStateProcedure.assign(env, region, null);
      regionNode.setProcedure(proc);
      addChildProcedure(proc);
    }
  } finally {
    regionNode.unlock();
  }
}
{noformat}

For this step:

{quote}
RIT: open failed, OPENING r1 on server2 now
{quote}

The related code would be in TRSP execute -> executeFromState -> openRegion, where the execute method is enclosed by the region node lock:

{noformat}
protected Procedure[] execute(MasterProcedureEnv env)
    throws ProcedureSuspendedException, ProcedureYieldException, InterruptedException {
  RegionStateNode regionNode =
    env.getAssignmentManager().getRegionStates().getOrCreateRegionStateNode(getRegion());
  regionNode.lock();
  try {
    return super.execute(env);
  } finally {
    regionNode.unlock();
  }
}
{noformat}

So the last step of the sequence (quoted at the end of this comment) could only happen if the TRSP had already finished executing and the regionNode lock had been released, couldn't it? But in that case the SCP's *regionNode.getProcedure()* call should return null rather than the previous RIT that had already completed, shouldn't it?
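To make the two cases in that question concrete, here is a minimal single-threaded sketch. It is a simplified model and not HBase code: the class and the names Node, trspStep and scpVisit are hypothetical, and it assumes that a RIT holds the node lock only for the duration of one execute() step and detaches itself from the node only when it completes.

```java
import java.util.concurrent.locks.ReentrantLock;

// Simplified model, NOT HBase code: Node, trspStep and scpVisit are hypothetical names.
class RegionNodeLockSketch {

  /** Stand-in for RegionStateNode: a lock, an attached procedure, and a location. */
  static final class Node {
    final ReentrantLock lock = new ReentrantLock();
    String procedure; // null means no RIT is attached
    String location;

    Node(String procedure, String location) {
      this.procedure = procedure;
      this.location = location;
    }
  }

  /** One execute() step of the RIT: the lock is held for this step only. */
  static void trspStep(Node node, String newLocation, boolean finished) {
    node.lock.lock();
    try {
      node.location = newLocation;
      if (finished) {
        node.procedure = null; // assumption: a completed RIT detaches itself
      }
    } finally {
      node.lock.unlock();
    }
  }

  /** The SCP branch for one region, taken under the same lock. */
  static String scpVisit(Node node) {
    node.lock.lock();
    try {
      if (node.procedure != null) {
        return "found RIT " + node.procedure + " at " + node.location;
      }
      return "no RIT";
    } finally {
      node.lock.unlock();
    }
  }

  public static void main(String[] args) {
    Node r1 = new Node("pid=151104", "oldServer");
    trspStep(r1, "newServer", false);  // retry on another server, RIT still attached
    System.out.println(scpVisit(r1));  // prints "found RIT pid=151104 at newServer"
    trspStep(r1, "newServer", true);   // RIT completes and detaches
    System.out.println(scpVisit(r1));  // prints "no RIT"
  }
}
```

In this model the result of the SCP check depends only on whether a procedure is still attached when the lock is taken, so the question comes down to whether the TRSP was still attached to the node between its steps. The log in the issue description shows the SCP finding pid=151104 in WAITING state with location=newServer.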
{quote}
SCP: looks like a RIT on r1; hey RIT for r1, your server crashed! (*)
{quote}

> ServerCrashProcedure can stomp on a RIT for a wrong server
> ----------------------------------------------------------
>
>                 Key: HBASE-21623
>                 URL: https://issues.apache.org/jira/browse/HBASE-21623
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>   Affects Versions: 3.0.0, 2.2.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21623.patch
>
>
> A server died while some region was being opened on it; eventually the open
> failed, and the RIT procedure started retrying on a different server.
> However, by then SCP for the dying server had already obtained the region
> from the list of regions on the old server, and proceeded to overwrite
> whatever the RIT was doing with a new server.
> {noformat}
> 2018-12-18 23:06:03,160 INFO [PEWorker-14] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=151404, ppid=151104, state=RUNNABLE,
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> ...
> 2018-12-18 23:06:38,208 INFO [PEWorker-10] procedure.ServerCrashProcedure:
> Start pid=151632, state=RUNNABLE:SERVER_CRASH_START, hasLock=true;
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true,
> meta=false
> ...
> 2018-12-18 23:06:41,953 WARN [RSProcedureDispatcher-pool4-t115]
> assignment.RegionRemoteProcedureBase: The remote operation pid=151404,
> ppid=151104, state=RUNNABLE, hasLock=false;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
> {ENCODED => region1, ... } to server oldServer,17020,1545202098577 failed
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> oldServer,17020,1545202098577 aborting
> 2018-12-18 23:06:42,485 INFO [PEWorker-5] procedure2.ProcedureExecutor:
> Finished subprocedure(s) of pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; resume parent
> processing.
> 2018-12-18 23:06:42,485 INFO [PEWorker-13]
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647;
> pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=oldServer,17020,1545202098577
> 2018-12-18 23:06:42,500 INFO [PEWorker-13]
> assignment.TransitRegionStateProcedure: Starting pid=151104, ppid=150875,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=null; forceNewPlan=true, retain=false
> 2018-12-18 23:06:42,657 INFO [PEWorker-2] assignment.RegionStateStore:
> pid=151104 updating hbase:meta row=region1, regionState=OPENING,
> regionLocation=newServer,17020,1545202111238
> ...
> 2018-12-18 23:06:43,094 INFO [PEWorker-4] procedure.ServerCrashProcedure:
> pid=151632, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true;
> ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true,
> meta=false found RIT pid=151104, ppid=150875,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true;
> TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING,
> location=newServer,17020,1545202111238, table=t1, region=region1
> 2018-12-18 23:06:43,094 INFO [PEWorker-4] assignment.RegionStateStore:
> pid=151104 updating hbase:meta row=region1, regionState=ABNORMALLY_CLOSED
> {noformat}
> Later, the RIT overwrote the state again, it seems, and then the region got
> stuck in OPENING state forever, but I'm not sure yet if that's just due to
> this bug or if there was another bug after that. For now this can be
> addressed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)