[ https://issues.apache.org/jira/browse/HBASE-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940986#comment-16940986 ]
Hudson commented on HBASE-23079: -------------------------------- Results for branch master [build #1488 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1488/]: (/) *{color:green}+1 overall{color}* ---- details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1488//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1488//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1488//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > RegionRemoteProcedureBase should override setTimeoutFailure > ----------------------------------------------------------- > > Key: HBASE-23079 > URL: https://issues.apache.org/jira/browse/HBASE-23079 > Project: HBase > Issue Type: Bug > Components: amv2 > Reporter: Xiaolin Ha > Assignee: Duo Zhang > Priority: Blocker > Fix For: 3.0.0, 2.3.0, 2.2.2 > > > In our test environment, we found that, when OpenRegionProcedure failed by > timeout, though it doesn't support rollback, ProcedureExcutor will delete the > OPR node in the store in its own rollback steps and make the RIT stay there > looping rollback exception. ServerCrashProcedure will detect this RIT but > cannot deal with it, because no node in the store and it will encounters NPE, > as a result, SCP aborts. > > Logs are as follows, > {quote}2019-08-29,07:27:35,710 INFO [PEWorker-15] > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206, > state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure > server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true > found RIT pid=56043, ppid=55625, state=ROLLEDBACK; > TransitRegionStateProcedure table=c3srv_galaxy:tsdb, > region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN, > location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb, > region=c1ccd94593bf1b87269cec98d6ffaaae > 2019-08-29,07:27:35,711 ERROR [PEWorker-15] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, > locked=true; ServerCrashProcedure > server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true > java.lang.NullPointerException > at > org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140) > at > org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737) > at > org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182) > at > org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239) > at > org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189) > at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962) > 2019-08-29,07:27:35,714 ERROR [PEWorker-15] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS, > locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught > runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, > locked=true; ServerCrashProcedure > server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, > meta=true:java.lang.NullPointerException; ServerCrashProcedure > server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true > java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208) > at > org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962) > 2019-08-29,07:27:35,716 ERROR [PEWorker-15] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS, > locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught > runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, > locked=true; ServerCrashProcedure > server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, > meta=true:java.lang.NullPointerException; ServerCrashProcedure > server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true > java.lang.UnsupportedOperationException: unhandled > state=SERVER_CRASH_SPLIT_LOGS > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208) > at > org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979) > {quote} > > ------------- > {quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' * > hbase.4500.log:2019-08-29,01:40:31,103 INFO > [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] > org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock > for pid=56659, ppid=56653, > state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; > TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, > region=413a5199ff68f3505f490c8341efe459, ASSIGN > hbase.4500.log:2019-08-29,01:40:31,107 INFO > [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] > org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661, > ppid=56659, state=WAITING_TIMEOUT; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006, > timestamp=1567014012995 > hbase.4500.log:2019-08-29,01:40:31,186 INFO > [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] > org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach > pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; > TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, > region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE, > location=null, table=c3prck8smonitorc3:tsdb-service, > region=413a5199ff68f3505f490c8341efe459 to restore RIT > hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT > rollback...pid=56661, ppid=56659, state=FAILED, > exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via > ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: > Operation timed out after 19.6860sec; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, > exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via > ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: > Operation timed out after 19.6860sec; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, > exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via > ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: > Operation timed out after 19.6860sec; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception for pid=56659, ppid=56653, > state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true; > TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, > region=413a5199ff68f3505f490c8341efe459, ASSIGN > hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7] > org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock > for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN; > TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, > region=413a5199ff68f3505f490c8341efe459, ASSIGN > hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7] > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught > runtime exception for pid=56659, ppid=56653, > state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; > TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, > region=413a5199ff68f3505f490c8341efe459, ASSIGN > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)