Sergey Shelukhin created HBASE-21786:
----------------------------------------

             Summary: RIT for a region without a lock can mess up the RIT that 
has the lock
                 Key: HBASE-21786
                 URL: https://issues.apache.org/jira/browse/HBASE-21786
             Project: HBase
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Sergey Shelukhin


I cannot find in the log where the second RIT comes from. The first line I see 
for it is "Waiting on xlock", and it has no parent procedure.

One RIT (pid=1738), restored from the WAL, retries and manages to assign the 
region to a server.
{noformat}
2019-01-25 10:56:21,878 INFO  [master/master:17000:becomeActiveMaster] 
procedure.MasterProcedureScheduler: Took xlock for pid=1738, ppid=3, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN
2019-01-25 10:56:22,055 INFO  [master/master:17000:becomeActiveMaster] 
assignment.AssignmentManager: Attach pid=1738, ppid=3, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN to rit=OFFLINE, location=null, 
table=table, region=27f7ab2a05d9d730b2ab2339d1531b8e to restore RIT
2019-01-25 10:56:51,362 INFO  [master/master:17000:becomeActiveMaster] 
assignment.RegionStateStore: Load hbase:meta entry 
region=27f7ab2a05d9d730b2ab2339d1531b8e, regionState=OFFLINE, 
lastHost=server2,17020,1548290445704, 
regionLocation=server1,17020,1548442302645, openSeqNum=120108
2019-01-25 10:57:26,842 INFO  [PEWorker-7] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=1738, ppid=3, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=false; 
TransitRegionStateProcedure table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; resume parent processing.
2019-01-25 10:57:26,842 INFO  [PEWorker-12] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=1738, 
ppid=3, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; rit=OFFLINE, 
location=server1,17020,1548442302645
2019-01-25 10:57:26,902 INFO  [PEWorker-12] 
assignment.TransitRegionStateProcedure: Starting pid=1738, ppid=3, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN; rit=OFFLINE, location=null; 
forceNewPlan=true, retain=false
2019-01-25 10:57:33,817 INFO  [PEWorker-7] assignment.RegionStateStore: 
pid=1738 updating hbase:meta row=27f7ab2a05d9d730b2ab2339d1531b8e, 
regionState=OPENING, regionLocation=server3,17020,1548442571056
{noformat}
The other RIT (pid=4351) appears out of nowhere; there is no "to restore RIT" 
line for it. I wonder if it could be a side effect of the region being offline, 
or of the retry above. 
Regardless, it cannot get the lock.
{noformat}
2019-01-25 10:57:46,255 INFO  [PEWorker-15] procedure.MasterProcedureScheduler: 
Waiting on xlock for pid=4351, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, ASSIGN held by pid=1738
{noformat}
However, when the server reports that the region is opened, the new RIT 4351 
receives the notification and discards it. 
{noformat}
2019-01-25 10:58:23,263 WARN  
[RpcServer.default.FPBQ.Fifo.handler=19,queue=4,port=17000] 
assignment.TransitRegionStateProcedure: Received report OPENED transition from 
server3,17020,1548442571056 for rit=OPENING, 
location=server3,17020,1548442571056, table=table, 
region=27f7ab2a05d9d730b2ab2339d1531b8e, pid=4351, but the TRSP is not in 
REGION_STATE_TRANSITION_CONFIRM_OPENED state, should be a retry, ignore
{noformat}

The region is stuck in OPENING forever.
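
To make the suspected sequence easier to discuss, here is a toy model of it. It 
is only a sketch of one possible reading of the logs: it assumes the second 
procedure replaces the first as the target of server reports without holding 
the region lock, which the logs do not confirm. Every class and method name 
below (RitRaceSketch, Proc, RegionNode, attach, tryLock, reportOpened) is 
hypothetical and is not HBase code.

```java
/** Toy model only; none of these types or methods are HBase APIs. */
public class RitRaceSketch {
    enum Step { GET_ASSIGN_CANDIDATE, CONFIRM_OPENED }

    static final class Proc {
        final int pid;
        Step step;
        Proc(int pid, Step step) { this.pid = pid; this.step = step; }
    }

    static final class RegionNode {
        String state = "OFFLINE";
        Proc attached;      // procedure that receives server reports
        Integer lockOwner;  // pid holding the exclusive lock, or null
    }

    /** Assumption: attaching does not check who holds the lock. */
    static void attach(RegionNode node, Proc p) {
        node.attached = p;
    }

    /** Grants the lock only if it is free or already held by p. */
    static boolean tryLock(RegionNode node, Proc p) {
        if (node.lockOwner == null) {
            node.lockOwner = p.pid;
            return true;
        }
        return node.lockOwner == p.pid;
    }

    /** Server reports OPENED; dropped unless the attached proc expects it. */
    static boolean reportOpened(RegionNode node) {
        Proc p = node.attached;
        if (p == null || p.step != Step.CONFIRM_OPENED) {
            return false; // "should be a retry, ignore"
        }
        node.state = "OPEN";
        return true;
    }

    static String simulate() {
        RegionNode node = new RegionNode();

        // pid=1738 is attached, takes the lock, and starts opening the region.
        Proc p1738 = new Proc(1738, Step.GET_ASSIGN_CANDIDATE);
        attach(node, p1738);
        tryLock(node, p1738);
        node.state = "OPENING";
        p1738.step = Step.CONFIRM_OPENED;

        // pid=4351 shows up, is attached (assumed), but cannot take the lock.
        Proc p4351 = new Proc(4351, Step.GET_ASSIGN_CANDIDATE);
        attach(node, p4351);
        boolean locked = tryLock(node, p4351);

        // The OPENED report now reaches 4351, which is in the wrong step.
        boolean accepted = reportOpened(node);
        return node.state + " locked=" + locked + " accepted=" + accepted;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

In this model the report is dropped because it is routed to the procedure that 
never got the lock, while the lock holder waits in CONFIRM_OPENED for a report 
it never receives, so the region stays in OPENING.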



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)