Sergey Shelukhin created HBASE-21786:
----------------------------------------

             Summary: RIT for a region without a lock can mess up the RIT that has the lock
                 Key: HBASE-21786
                 URL: https://issues.apache.org/jira/browse/HBASE-21786
             Project: HBase
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Sergey Shelukhin


I cannot find in the log where the 2nd RIT is coming from, the first line I see for it is Waiting for the lock. It has no parent procedure.

One RIT, restored from WAL, with a retry manages to restore the region to some server.
{noformat}
2019-01-25 10:56:21,878 INFO  [master/master:17000:becomeActiveMaster] procedure.MasterProcedureScheduler: Took xlock for pid=1738, ppid=3, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; TransitRegionStateProcedure table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN
2019-01-25 10:56:22,055 INFO  [master/master:17000:becomeActiveMaster] assignment.AssignmentManager: Attach pid=1738, ppid=3, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; TransitRegionStateProcedure table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN to rit=OFFLINE, location=null, table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e to restore RIT
2019-01-25 10:56:51,362 INFO  [master/master:17000:becomeActiveMaster] assignment.RegionStateStore: Load hbase:meta entry region=27f7ab2a05d9d730b2ab2339d1531b8e, regionState=OFFLINE, lastHost=server2,17020,1548290445704, regionLocation=server1,17020,1548442302645, openSeqNum=120108
2019-01-25 10:57:26,842 INFO  [PEWorker-7] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=1738, ppid=3, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; TransitRegionStateProcedure table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; resume parent processing.
2019-01-25 10:57:26,842 INFO  [PEWorker-12] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=1738, ppid=3, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; rit=OFFLINE, location=server1,17020,1548442302645
2019-01-25 10:57:26,902 INFO  [PEWorker-12] assignment.TransitRegionStateProcedure: Starting pid=1738, ppid=3, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; rit=OFFLINE, location=null; forceNewPlan=true, retain=false
2019-01-25 10:57:33,817 INFO  [PEWorker-7] assignment.RegionStateStore: pid=1738 updating hbase:meta row=27f7ab2a05d9d730b2ab2339d1531b8e, regionState=OPENING, regionLocation=server3,17020,1548442571056
{noformat}

The other RIT appears out of nowhere.. there's no "to restore RIT" line for it. I wonder if it could be a side effect of the region being offline, or the retry above?
Regardless, it cannot get the lock.
{noformat}
2019-01-25 10:57:46,255 INFO  [PEWorker-15] procedure.MasterProcedureScheduler: Waiting on xlock for pid=4351, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN held by pid=1738
{noformat}

However, when the server responds that region is opened, the new RIT 4351 takes the notification and discards it.
{noformat}
2019-01-25 10:58:23,263 WARN  [RpcServer.default.FPBQ.Fifo.handler=19,queue=4,port=17000] assignment.TransitRegionStateProcedure: Received report OPENED transition from server3,17020,1548442571056 for rit=OPENING, location=server3,17020,1548442571056, table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e, pid=4351, but the TRSP is not in REGION_STATE_TRANSITION_CONFIRM_OPENED state, should be a retry, ignore
{noformat}

Region is stuck in OPENING forever.

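
The sequence in the logs can be modelled with a small, self-contained sketch. This is not HBase code: the class and method names (RitRaceSketch, RegionNode, attachUnchecked, attachChecked, reportOpened) are hypothetical. It assumes that each region tracks a single attached procedure that receives server reports. The logs suggest this (the ignored OPENED report names pid=4351) but do not prove it. Under that assumption, a procedure that attaches without holding the lock displaces the lock owner and the report is dropped; refusing the attach lets the owner consume it.

```java
/** Hypothetical model of the reported race; names do not come from HBase. */
final class RitRaceSketch {
    enum State { GET_ASSIGN_CANDIDATE, CONFIRM_OPENED }

    static final class Proc {
        final long pid;
        final State state;
        Proc(long pid, State state) { this.pid = pid; this.state = state; }
    }

    static final class RegionNode {
        Proc lockOwner; // procedure holding the exclusive region lock
        Proc attached;  // procedure that receives server reports (assumed single slot)

        /** Attaches unconditionally, even if another procedure owns the lock. */
        void attachUnchecked(Proc p) { attached = p; }

        /** Attaches only if nobody else owns the lock; returns whether it attached. */
        boolean attachChecked(Proc p) {
            if (lockOwner != null && lockOwner != p) {
                return false;
            }
            attached = p;
            return true;
        }

        /** Returns the pid that consumed the OPENED report, or -1 if it was ignored. */
        long reportOpened() {
            if (attached == null || attached.state != State.CONFIRM_OPENED) {
                return -1L; // analogous to the "should be a retry, ignore" branch
            }
            return attached.pid;
        }
    }

    /** Replays the scenario with or without the lock check on attach. */
    static long simulate(boolean checkLockOnAttach) {
        RegionNode node = new RegionNode();
        Proc owner = new Proc(1738L, State.CONFIRM_OPENED);      // waiting for OPENED
        Proc late = new Proc(4351L, State.GET_ASSIGN_CANDIDATE); // never got the lock
        node.lockOwner = owner;
        node.attached = owner;
        if (checkLockOnAttach) {
            node.attachChecked(late);
        } else {
            node.attachUnchecked(late);
        }
        return node.reportOpened();
    }

    public static void main(String[] args) {
        System.out.println("unchecked attach -> consumed by pid " + simulate(false));
        System.out.println("checked attach   -> consumed by pid " + simulate(true));
    }
}
```

In this model the unchecked attach returns -1 (the report is ignored and the region would stay in OPENING), while the checked attach returns 1738. Whether the real fix belongs in the attach path or in preventing the second procedure from being created at all is not something the logs above settle.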