[ https://issues.apache.org/jira/browse/HBASE-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580691#comment-16580691 ]
stack commented on HBASE-21050:
-------------------------------
Is this a failure near the end of PE#execProcedure....
{code}
...
    // Submit the new subprocedures
    if (subprocs != null && !procedure.isFailed()) {
      submitChildrenProcedures(subprocs);
    }
    // <<<= BEFORE HERE
    // we need to log the release lock operation before waking up the parent procedure, as there
    // could be race that the parent procedure may call updateStoreOnExec ahead of us and remove all
    // the sub procedures from store and cause problems...
    releaseLock(procedure, false);

    // if the procedure is complete and has a parent, count down the children latch.
    // If 'suspended', do nothing to change state -- let other threads handle unsuspend event.
    if (!suspended && procedure.isFinished() && procedure.hasParent()) {
      countDownChildren(procStack, procedure);
    }
{code}
... so the child of the parent has completed with SUCCESS and we are exiting the
execution of the child. On our way out we are about to release the lock and then
call countDownChildren, which makes the parent RUNNABLE again, BUT we fail after
the child completes and BEFORE we get to the release lock?
If so, I can make a test for this. The machinery added to test HBASE-20978 will
work here. Let me know if you think this is what's up [~allan163] and I'll give
the test a go.
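
To make the window in question concrete, here is a minimal, self-contained
sketch of the ordering I am asking about. It is not HBase code: StoredProc,
SimulatedCrash, finishProcedure and run are hypothetical names, and the model
assumes the procedure's final state is persisted before the lock release is.

```java
public class CrashWindowSketch {
    // Hypothetical persisted view of a procedure; not an HBase class.
    static final class StoredProc {
        String state = "RUNNABLE";
        boolean lockHeld = false;
    }

    // Thrown to simulate the master dying part-way through the step.
    static final class SimulatedCrash extends RuntimeException {}

    // Models the tail of one execution step under the assumption above:
    // completion is persisted first, the lock release only afterwards.
    static void finishProcedure(StoredProc stored, boolean crashBeforeRelease) {
        stored.state = "SUCCESS";        // completion is persisted
        if (crashBeforeRelease) {
            throw new SimulatedCrash();  // <<<= the failure lands here
        }
        stored.lockHeld = false;         // stands in for releaseLock(procedure, false)
    }

    // Runs one step and returns what a restarted master would find in the store.
    static StoredProc run(boolean crashBeforeRelease) {
        StoredProc stored = new StoredProc();
        stored.lockHeld = true;          // lock was acquired before execution
        try {
            finishProcedure(stored, crashBeforeRelease);
        } catch (SimulatedCrash e) {
            // The master is gone; only the persisted fields survive.
        }
        return stored;
    }

    public static void main(String[] args) {
        StoredProc clean = run(false);
        StoredProc crashed = run(true);
        System.out.println("clean:   state=" + clean.state + ", lockHeld=" + clean.lockHeld);
        System.out.println("crashed: state=" + crashed.state + ", lockHeld=" + crashed.lockHeld);
    }
}
```

In this toy model the crashed run leaves a record that is both SUCCESS and
still marked as holding the lock, which is the combination a test would need
to reproduce.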
> Exclusive lock may be held by a SUCCESS state procedure forever
> ---------------------------------------------------------------
>
> Key: HBASE-21050
> URL: https://issues.apache.org/jira/browse/HBASE-21050
> Project: HBase
> Issue Type: Sub-task
> Components: amv2
> Affects Versions: 2.1.0, 2.0.1
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-21050.branch-2.0.001.patch
>
>
> After HBASE-20846, we restore lock info for procedures. But there is a case
> where the lock can be held by an already successful procedure. Since the
> procedure won't execute again, the lock will be held by the procedure forever.
> 1. All children of pid=1208 had finished, but the master was killed before
> procedure 1208 was woken up
> {code}
> 2018-08-05 02:20:14,465 INFO [PEWorker-8] procedure2.ProcedureExecutor(1659): Finished subprocedure(s) of pid=1208, ppid=1206, state=RUNNABLE, hasLock=true; MoveRegionProcedure hri=c2a23a735f16df57299dba6fd4599f2f, source=e010125050127.bja,60020,1533403109034, destination=e010125050127.bja,60020,1533403109034; resume parent processing.
> 2018-08-05 02:20:14,466 INFO [PEWorker-8] procedure2.ProcedureExecutor(1296): Finished pid=1232, ppid=1208, state=SUCCESS, hasLock=false; AssignProcedure table=IntegrationTestBigLinkedList, region=c2a23a735f16df57299dba6fd4599f2f, target=e010125050127.bja,60020,1533403109034 in 1.5060sec
> {code}
> 2. The master restarts; since procedure 1208 held the lock before the restart,
> the lock was restored for it
> {code}
> 2018-08-05 02:20:30,803 DEBUG [Thread-15] procedure2.ProcedureExecutor(456): Loading pid=1208, ppid=1206, state=SUCCESS, hasLock=false; MoveRegionProcedure hri=c2a23a735f16df57299dba6fd4599f2f, source=e010125050127.bja,60020,1533403109034, destination=e010125050127.bja,60020,1533403109034
> 2018-08-05 02:20:30,818 DEBUG [Thread-15] procedure2.Procedure(898): pid=1208, ppid=1206, state=SUCCESS, hasLock=false; MoveRegionProcedure hri=c2a23a735f16df57299dba6fd4599f2f, source=e010125050127.bja,60020,1533403109034, destination=e010125050127.bja,60020,1533403109034 held the lock before restarting, call acquireLock to restore it.
> 2018-08-05 02:20:30,818 INFO [Thread-15] procedure.MasterProcedureScheduler(631): pid=1208, ppid=1206, state=SUCCESS, hasLock=false; MoveRegionProcedure hri=c2a23a735f16df57299dba6fd4599f2f, source=e010125050127.bja,60020,1533403109034, destination=e010125050127.bja,60020,1533403109034 checking lock on c2a23a735f16df57299dba6fd4599f2f
> {code}
> 3. Since procedure 1208 is already in SUCCESS state, it won't execute again,
> so it will hold the lock forever
> We need to check the state of the procedure before restoring locks. If the
> procedure is already finished (success or rollback), we do not need to
> acquire the lock for it.
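
The check proposed in the description can be illustrated with a small
stand-alone model. This is only a sketch, not the attached patch and not the
HBase API: Proc, State, heldLockBeforeRestart, locks and restoreLock are all
hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class LockRestoreSketch {
    // Simplified stand-in for procedure states; not the HBase enum.
    enum State { RUNNABLE, WAITING, SUCCESS, ROLLEDBACK }

    // Hypothetical, stripped-down procedure record loaded from the store.
    static final class Proc {
        final long pid;
        final State state;
        final boolean heldLockBeforeRestart; // assumed persisted lock flag
        final String resource;               // e.g. an encoded region name

        Proc(long pid, State state, boolean heldLockBeforeRestart, String resource) {
            this.pid = pid;
            this.state = state;
            this.heldLockBeforeRestart = heldLockBeforeRestart;
            this.resource = resource;
        }

        // Finished means success or rollback, as in the description.
        boolean isFinished() {
            return state == State.SUCCESS || state == State.ROLLEDBACK;
        }
    }

    // Toy lock table: resource -> pid of the owning procedure.
    static final Map<String, Long> locks = new HashMap<>();

    // Restores a lock on load, skipping procedures that will never run again.
    static void restoreLock(Proc p) {
        if (!p.heldLockBeforeRestart) {
            return; // nothing to restore
        }
        if (p.isFinished()) {
            return; // proposed check: a finished procedure must not re-acquire
        }
        locks.put(p.resource, p.pid);
    }

    public static void main(String[] args) {
        restoreLock(new Proc(1208L, State.SUCCESS, true, "region-a"));
        restoreLock(new Proc(1300L, State.RUNNABLE, true, "region-b"));
        System.out.println("restored locks: " + locks);
    }
}
```

In this model only the RUNNABLE procedure gets its lock back; the SUCCESS one
is skipped, so nothing is left holding "region-a" after the restart.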