[ 
https://issues.apache.org/jira/browse/HBASE-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939283#comment-16939283
 ] 

Xiaolin Ha commented on HBASE-23079:
------------------------------------

No logs can help to find out what happened on RS. Will be back. 

> SCP fails when PE executes rollback of ORP 
> -------------------------------------------
>
>                 Key: HBASE-23079
>                 URL: https://issues.apache.org/jira/browse/HBASE-23079
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>
> In our test environment, we found that, when OpenRegionProcedure failed by 
> timeout, though it doesn't support rollback, ProcedureExcutor will delete the 
> OPR node in the store in its own rollback steps and make the RIT stay there 
> looping rollback exception. ServerCrashProcedure will detect this RIT but 
> cannot deal with it, because no node in the store and it will encounters NPE, 
> as a result, SCP aborts.
>  
> Logs are as follows,
> {quote}2019-08-29,07:27:35,710 INFO [PEWorker-15] 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true 
> found RIT pid=56043, ppid=55625, state=ROLLEDBACK; 
> TransitRegionStateProcedure table=c3srv_galaxy:tsdb, 
> region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN, 
> location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb, 
> region=c1ccd94593bf1b87269cec98d6ffaaae
> 2019-08-29,07:27:35,711 ERROR [PEWorker-15] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
>  at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
>  at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
>  at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
>  at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
>  at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
>  at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
>  at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
>  at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
>  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,714 ERROR [PEWorker-15] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS, 
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, 
> meta=true:java.lang.NullPointerException; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>  at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>  at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,716 ERROR [PEWorker-15] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS, 
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, 
> meta=true:java.lang.NullPointerException; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled 
> state=SERVER_CRASH_SPLIT_LOGS
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>  at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>  at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> {quote}
>  
> -------------
> {quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
> hbase.4500.log:2019-08-29,01:40:31,103 INFO 
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
> for pid=56659, ppid=56653, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:31,107 INFO 
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661, 
> ppid=56659, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006, 
> timestamp=1567014012995
> hbase.4500.log:2019-08-29,01:40:31,186 INFO 
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach 
> pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE, 
> location=null, table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459 to restore RIT
> hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT 
> rollback...pid=56661, ppid=56659, state=FAILED, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 19.6860sec; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 19.6860sec; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 19.6860sec; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56659, ppid=56653, 
> state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
> for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56659, ppid=56653, 
> state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to