Xiaolin Ha created HBASE-23079:
----------------------------------

             Summary: SCP fails when PE executes rollback of ORP 
                 Key: HBASE-23079
                 URL: https://issues.apache.org/jira/browse/HBASE-23079
             Project: HBase
          Issue Type: Bug
            Reporter: Xiaolin Ha


In our test environment, we found that, when OpenRegionProcedure failed by 
timeout, though it doesn't support rollback, ProcedureExcutor will delete the 
OPR node in the store in its own rollback steps and make the RIT stay there 
looping rollback exception. ServerCrashProcedure will detect this RIT but 
cannot deal with it, because no node in the store and it will encounters NPE, 
as a result, SCP aborts.

 

Logs are as follows,
{quote}2019-08-29,07:27:35,710 INFO [PEWorker-15] 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206, 
state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure 
server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true 
found RIT pid=56043, ppid=55625, state=ROLLEDBACK; TransitRegionStateProcedure 
table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; 
rit=OPEN, location=c3-hadoop-srv-st297.bj,21600,1567012736142, 
table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae
2019-08-29,07:27:35,711 ERROR [PEWorker-15] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, 
splitWal=true, meta=true
java.lang.NullPointerException
 at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
 at 
org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
 at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
 at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
 at 
org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
 at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
 at 
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
 at 
org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
 at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
 at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
2019-08-29,07:27:35,714 ERROR [PEWorker-15] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS, 
locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, 
splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure 
server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
 at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
 at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
 at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
2019-08-29,07:27:35,716 ERROR [PEWorker-15] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS, 
locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
ServerCrashProcedure server=c3-hadoop-srv-st297.bj,21600,1567012736142, 
splitWal=true, meta=true:java.lang.NullPointerException; ServerCrashProcedure 
server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_SPLIT_LOGS
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
 at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
 at 
org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
 at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
{quote}
 

-------------
{quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
hbase.4500.log:2019-08-29,01:40:31,103 INFO 
[master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
for pid=56659, ppid=56653, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
region=413a5199ff68f3505f490c8341efe459, ASSIGN
hbase.4500.log:2019-08-29,01:40:31,107 INFO 
[master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661, 
ppid=56659, state=WAITING_TIMEOUT; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006, 
timestamp=1567014012995
hbase.4500.log:2019-08-29,01:40:31,186 INFO 
[master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=56659, 
ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE, location=null, 
table=c3prck8smonitorc3:tsdb-service, region=413a5199ff68f3505f490c8341efe459 
to restore RIT
hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT 
rollback...pid=56661, ppid=56659, state=FAILED, 
exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
Operation timed out after 19.6860sec; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, 
exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
Operation timed out after 19.6860sec; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, 
exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
Operation timed out after 19.6860sec; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception for pid=56659, ppid=56653, 
state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
region=413a5199ff68f3505f490c8341efe459, ASSIGN
hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7] 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN; 
TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
region=413a5199ff68f3505f490c8341efe459, ASSIGN
hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7] 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
runtime exception for pid=56659, ppid=56653, 
state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
region=413a5199ff68f3505f490c8341efe459, ASSIGN
{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to