[ 
https://issues.apache.org/jira/browse/HBASE-20973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665162#comment-16665162
 ] 

Duo Zhang commented on HBASE-20973:
-----------------------------------

Talked with [~allan163] offline, and finally we found the root cause of this 
problem. The problem is that, when rolling back the whole procedure stack, if a 
sub procedures has been successfully rolled back, i.e, its stack indexes are 
empty, we will delete it. But later when we finish rolling back all the 
procedures, we will call store.delete(procId, subProcIds), which will delete 
the sub procedures again.

In the implementation of BitSetNode, we assume that when deleting, the 
BitSetNode for this procedure should already been there. But here we do a 
double delete, if after the first deletion, the BitSetNode for this procedure 
is empty then we will delete the BitSetNode, so when the second time we call 
the delete, it will lead to an ArrayIndexOutOfBoundsException. In fact, in a UT 
where the assert is enabled, we will have an AssertionError at this line:
{code}
assert node.contains(procId) : "expected procId=" + procId + " in the node";
{code}
If we comment it out(as what have done in the UT patch), we will have exactly 
an ArrayIndexOutOfBoundsException

{noformat}
2018-10-26 21:16:11,512 WARN  [PEWorker-2] 
procedure2.ProcedureExecutor$WorkerThread(2121): Worker terminating UNNATURALLY 
null
java.lang.ArrayIndexOutOfBoundsException: 1
        at 
org.apache.hadoop.hbase.procedure2.store.BitSetNode.updateState(BitSetNode.java:399)
        at 
org.apache.hadoop.hbase.procedure2.store.BitSetNode.delete(BitSetNode.java:158)
        at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:158)
        at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:150)
        at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:780)
        at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:730)
        at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.delete(WALProcedureStore.java:642)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.cleanupAfterRollbackOneStep(ProcedureExecutor.java:1703)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1741)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1657)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1524)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$7(ProcedureExecutor.java:1501)
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2105)
{noformat}

> ArrayIndexOutOfBoundsException when rolling back procedure
> ----------------------------------------------------------
>
>                 Key: HBASE-20973
>                 URL: https://issues.apache.org/jira/browse/HBASE-20973
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2
>    Affects Versions: 2.1.0, 2.0.1
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Critical
>             Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.3
>
>         Attachments: HBASE-20973-UT.patch, HBASE-20973.branch-2.0.001.patch, 
> HBASE-20973.branch-2.0.002.patch
>
>
> Find this one while investigating HBASE-20921. After the root 
> procedure(ModifyTableProcedure  in this case) rolled back, a 
> ArrayIndexOutOfBoundsException was thrown
> {code}
> 2018-07-18 01:39:10,241 ERROR [PEWorker-8] procedure2.ProcedureExecutor(159): 
> CODE-BUG: Uncaught runtime exception for pid=5973, 
> state=FAILED:MODIFY_TABLE_REOPEN_ALL_REGIONS, exception=java.lang.NullPo
> interException via CODE-BUG: Uncaught runtime exception: pid=5974, ppid=5973, 
> state=RUNNABLE:REOPEN_TABLE_REGIONS_CONFIRM_REOPENED; 
> ReopenTableRegionsProcedure table=IntegrationTestBigLinkedList:java.l
> ang.NullPointerException; ModifyTableProcedure 
> table=IntegrationTestBigLinkedList
> java.lang.UnsupportedOperationException: unhandled 
> state=MODIFY_TABLE_REOPEN_ALL_REGIONS
>         at 
> org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.rollbackState(ModifyTableProcedure.java:147)
>         at 
> org.apache.hadoop.hbase.master.procedure.ModifyTableProcedure.rollbackState(ModifyTableProcedure.java:50)
>         at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:864)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
> 2018-07-18 01:39:10,243 WARN  [PEWorker-8] 
> procedure2.ProcedureExecutor(1756): Worker terminating UNNATURALLY null
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker$BitSetNode.updateState(ProcedureStoreTracker.java:405)
>         at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker$BitSetNode.delete(ProcedureStoreTracker.java:178)
>         at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:513)
>         at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:505)
>         at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:741)
>         at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:691)
>         at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.delete(WALProcedureStore.java:603)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1387)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1741)
> {code}
> This is a very serious condition, After this exception thrown, the exclusive 
> lock held by ModifyTableProcedure was never released. All the procedure 
> against this table were blocked. Until the master restarted, and since the 
> lock info for the procedure won't be restored, the other procedures can go 
> again, it is quite embarrassing that a bug save us...(this bug will be fixed 
> in HBASE-20846)
> I tried to reproduce this one using the test case in HBASE-20921 but I just 
> can't reproduce it.
> A easy way to resolve this is add a try catch, making sure no matter what 
> happens, the table's exclusive lock can always be relased.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to