[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664740#comment-16664740 ]

Hudson commented on HBASE-21344:

Results for branch branch-2.0 [build #1015 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1015/]: (/) *{color:green}+1 overall{color}*

details (if available):
(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1015//General_Nightly_Build_Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1015//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1015//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.

> hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
>
> Key: HBASE-21344
> URL: https://issues.apache.org/jira/browse/HBASE-21344
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Reporter: Ankit Singhal
> Assignee: Ankit Singhal
> Priority: Major
> Fix For: 2.0.3
> Attachments: HBASE-21344-branch-2.0.patch, HBASE-21344-branch-2.0_v2.patch, HBASE-21344-branch-2.0_v3.patch, HBASE-21344.branch-2.0.001.patch, HBASE-21344.branch-2.0.003-addendum.patch, HBASE-21344.branch-2.0.003.patch
>
> [~elserj] has already summarized it well.
> 1. hbase:meta was on RS8
> 2. RS8 crashed, SCP was queued for it, meta first
> 3. meta was marked OFFLINE
> 4. meta marked as OPENING on RS3
> 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue
> 6. We attempt the openRegion/assignment 10 times, failing each time
> 7. We start rolling back the procedure:
> {code:java}
> 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: Usually this should not happen, we will release the lock before if the procedure is finished, even if the holdLock is true, arrive here means we have some holes where we do not release the lock. And the releaseLock below may fail since the procedure may have already been deleted from the procedure store.
> 2018-10-08 06:51:24,543 INFO [PEWorker-9] procedure.MasterProcedureScheduler: pid=48, ppid=47, state=FAILED:REGION_TRANSITION_QUEUE, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=47, state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; ServerCrashProcedure server=,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
> 	at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
> 	at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8168 ms ago, cancelled=false, msg=Meta region is in state OPENING, details=row 'backup:system' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=, seqNum=-1, exception=java.io.IOException: Meta region is in state OPENING
> 	at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> 	at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> {code}
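The UnsupportedOperationException in the second log block is worth pausing on: a proc-v2 StateMachineProcedure is rolled back by invoking rollbackState() for each completed state in reverse order, so a procedure that only implements an undo for some of its states cannot be rolled back once it has advanced past them. A minimal sketch of that failure shape (simplified, hypothetical types — not the actual ServerCrashProcedure code):

```java
// Sketch of the rollback hole behind the CODE-BUG log above: rollbackState()
// has an undo only for the first state, so once the procedure has advanced
// past SERVER_CRASH_START a failed meta assignment cannot be undone -- the
// executor throws, and the meta-location znode is left saying OPENING, which
// clients and the Master then refuse to touch.
enum ServerCrashState {
    SERVER_CRASH_START,
    SERVER_CRASH_GET_REGIONS,
    SERVER_CRASH_ASSIGN_META
}

class ServerCrashProcedureSketch {
    void rollbackState(ServerCrashState state) {
        switch (state) {
            case SERVER_CRASH_START:
                return; // nothing executed yet, nothing to undo
            default:
                // every later state: no undo implemented, mirroring the
                // "unhandled state=SERVER_CRASH_GET_REGIONS" in the log
                throw new UnsupportedOperationException("unhandled state=" + state);
        }
    }
}
```

Under this reading, the fix is less about adding undo logic everywhere and more about not letting a failed meta AssignProcedure leave the znode in a transient state that nothing will ever clear.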
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664328#comment-16664328 ]

Josh Elser commented on HBASE-21344:

Thanks, Stack. Doing so now.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664192#comment-16664192 ]

stack commented on HBASE-21344:

+1 on push [~elserj] and [~an...@apache.org] Thanks.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664159#comment-16664159 ]

Josh Elser commented on HBASE-21344:

Me being the silent instigator :) Tried to run the test and I'm getting timeouts when I run the whole test class. Seems like when both methods run in the same JVM, testMetaAssignmentFailure fails for me. I don't know why QA didn't fail though (but it happens every time for me).
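The pattern Josh describes — each method passing alone but testMetaAssignmentFailure timing out when the whole class runs in one JVM — is the usual signature of state leaking between test methods that share a mini-cluster. A contrived sketch (hypothetical names, not the actual HBase test class) of how that order dependence arises:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Contrived illustration of an order-dependent test failure: the first
// "test" mutates static state shared by all methods in the same JVM, and
// nothing resets it, so the second "test" sees a cluster it did not set up.
class SharedMiniCluster {
    static final AtomicBoolean metaOffline = new AtomicBoolean(false);

    // First test method: deliberately knocks meta over and leaves it down.
    static void testKillMetaServer() {
        metaOffline.set(true);
    }

    // Second test method: expects a healthy cluster; fails (here, returns
    // false) only when run after testKillMetaServer in the same JVM.
    static boolean testMetaAssignmentSucceeds() {
        return !metaOffline.get();
    }
}
```

If QA forks a fresh JVM per test (or the methods happen to run in a different order there), the leak is hidden, which would be one plausible explanation for precommit passing while a local whole-class run fails every time.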
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664155#comment-16664155 ]

stack commented on HBASE-21344:

[~an...@apache.org] Why the addendum sir?
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663600#comment-16663600 ]

Hudson commented on HBASE-21344:

Results for branch branch-2.0 [build #1011 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1011/]: (/) *{color:green}+1 overall{color}*

details (if available):
(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1011//General_Nightly_Build_Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1011//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1011//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663156#comment-16663156 ]

Hadoop QA commented on HBASE-21344:

| (/) *{color:green}+1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} branch-2.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 16s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 47s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 6m 13s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 29s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 10s{color} | {color:green} branch-2.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 6m 13s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 14m 33s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}120m 0s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}178m 17s{color} | {color:black} {color} |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-21344 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945501/HBASE-21344.branch-2.0.001.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux cb7e928ccd95 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | branch-2.0 / 8c97a869a1 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14851/testReport/ |
| Max. process+thread count | 4223 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/14851/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663053#comment-16663053 ] Hadoop QA commented on HBASE-21344: --- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 24s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} branch-2.0 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 7s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 17s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 40s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 20s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} branch-2.0 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 26s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 7m 56s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green}185m 35s{color} | {color:green} hbase-server in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}221m 27s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 | | JIRA Issue | HBASE-21344 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945333/HBASE-21344.branch-2.0.003.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux fd226421b253 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 07:31:43 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | branch-2.0 / 8c97a869a1 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14848/testReport/ | | Max. process+thread count | 4411 (vs. ulimit of 1) | | modules | C: hbase-server U: hbase-server | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/14848/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message w
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663050#comment-16663050 ] stack commented on HBASE-21344: --- Thanks for the offer, [~elserj]. I tried the patch locally. The test is failing for me, so I pushed up a patch to retry a run and to see whether the failed test is related. > hbase:meta location in ZooKeeper set to OPENING by the procedure which > eventually failed but precludes Master from assigning it forever > --- > > Key: HBASE-21344 > URL: https://issues.apache.org/jira/browse/HBASE-21344 > Project: HBase > Issue Type: Bug > Components: proc-v2 > Reporter: Ankit Singhal > Assignee: Ankit Singhal > Priority: Major > Fix For: 2.0.3 > > Attachments: HBASE-21344-branch-2.0.patch, > HBASE-21344-branch-2.0_v2.patch, HBASE-21344-branch-2.0_v3.patch, > HBASE-21344.branch-2.0.001.patch, HBASE-21344.branch-2.0.003.patch > > > [~elserj] has already summarized it well. > 1. hbase:meta was on RS8 > 2. RS8 crashed, SCP was queued for it, meta first > 3. meta was marked OFFLINE > 4. meta marked as OPENING on RS3 > 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue > 6. We attempt the openRegion/assignment 10 times, failing each time > 7. We start rolling back the procedure: > {code:java} > 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: > Usually this should not happen, we will release the lock before if the > procedure is finished, even if the holdLock is true, arrive here means we > have some holes where we do not release the lock. And the releaseLock below > may fail since the procedure may have already been deleted from the procedure > store. 
> 2018-10-08 06:51:24,543 INFO [PEWorker-9] > procedure.MasterProcedureScheduler: pid=48, ppid=47, > state=FAILED:REGION_TRANSITION_QUEUE, > exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via > AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max > attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 > checking lock on 1588230740 > {code} > {code:java} > 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: > CODE-BUG: Uncaught runtime exception for pid=47, > state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, > exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via > AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max > attempts exceeded; ServerCrashProcedure > server=,16020,1538974612843, splitWal=true, meta=true > java.lang.UnsupportedOperationException: unhandled > state=SERVER_CRASH_GET_REGIONS > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203) > at > org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981) > {code} > {code:java} > { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, > retries=7, started=8168 ms 
ago, cancelled=false, msg=Meta region is in state > OPENING, details=row 'backup:system' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, hostname=, seqNum=-1, > exception=java.io.IOException: Meta region is in state OPENING > at > org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at > java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) > at > org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(Co
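The `ZKAsyncRegistry` stack trace above shows the client side of the deadlock: once the failed `AssignProcedure` leaves `OPENING` behind in the meta-location znode, every lookup fails identically, and with no procedure left running the state never advances. A minimal sketch of that pattern follows; the class and method names are illustrative stand-ins, not the actual HBase code.

```java
import java.io.IOException;

// Hedged sketch (hypothetical names): why clients spin forever once a failed
// AssignProcedure leaves OPENING in the ZooKeeper meta-location znode.
public class StuckMetaSketch {
    public enum RegionState { OFFLINE, OPENING, OPEN }

    // Mirrors the check visible in the ZKAsyncRegistry trace above: a meta
    // lookup only succeeds when the state stored with the location is OPEN.
    public static String getMetaLocation(RegionState stateInZk, String server)
            throws IOException {
        if (stateInZk != RegionState.OPEN) {
            throw new IOException("Meta region is in state " + stateInZk);
        }
        return server;
    }

    public static void main(String[] args) {
        // The rolled-back procedure never reset the znode, so every retry reads
        // the same stale OPENING state and fails the same way (cf. "tries=7,
        // retries=7" in the DEBUG log above).
        RegionState leftBehind = RegionState.OPENING;
        int failures = 0;
        for (int attempt = 1; attempt <= 7; attempt++) {
            try {
                getMetaLocation(leftBehind, "rs3.example.com,16020");
            } catch (IOException e) {
                failures++;
            }
        }
        System.out.println("failures=" + failures); // prints failures=7
    }
}
```

Since nothing server-side ever rewrites the znode after the rollback, the retry budget is irrelevant: the caller exhausts it and surfaces `RetriesExhaustedException`, matching the procedure-side logs quoted earlier.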
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663011#comment-16663011 ] Josh Elser commented on HBASE-21344: bq. Thanks, stack, for the pointer. I didn't go down that path, as the problem started because we start the tableStateManager without waiting for meta assignment by the SCPs. I think we can just remove this from here, as we already start it after waiting for meta to come online. (Attached a patch for the same.) Great find, Ankit! Sorry I ended up leading you on a goose chase here. The explanation makes sense given what I saw on the live system. [~stack], I can merge this in as an insufficient means of "thanks" for the reviews, if you're good with it.
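The fix Ankit describes in the comment above is an ordering change: a component that reads hbase:meta (the tableStateManager) must not start until the SCP-driven meta assignment has finished, otherwise it reads the stale OPENING location and wedges master startup. A rough sketch of that guard, with purely illustrative names rather than the actual HMaster startup code:

```java
// Hedged sketch of the ordering fix described above: defer components that
// read hbase:meta until the SCP-driven assignment completes. All names here
// are illustrative, not the actual HBase master startup code.
public class StartupOrderSketch {
    public interface MetaAssignment { boolean isMetaOnline(); }

    /**
     * Returns true only when it is safe to start the table-state manager.
     * Starting it before meta is online triggers the meta read that got stuck
     * on the stale OPENING state -- which is what the duplicated early start
     * in HMaster caused, and what removing that start avoids.
     */
    public static boolean canStartTableStateManager(MetaAssignment assignment) {
        return assignment.isMetaOnline();
    }

    public static void main(String[] args) {
        // Before the SCP finishes, the early start must be refused...
        System.out.println(canStartTableStateManager(() -> false)); // false
        // ...and only the start that runs after wait-on-meta proceeds.
        System.out.println(canStartTableStateManager(() -> true));  // true
    }
}
```

The patch itself simply deletes the premature start rather than adding a guard; the sketch only illustrates the invariant the deletion restores.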
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661788#comment-16661788 ] Hadoop QA commented on HBASE-21344: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} branch-2.0 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 15s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 22s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 3s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 21s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} branch-2.0 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 4s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 8m 44s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}119m 36s{color} | {color:red} hbase-server in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}158m 9s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hbase.master.procedure.TestMasterFailoverWithProcedures | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 | | JIRA Issue | HBASE-21344 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945333/HBASE-21344.branch-2.0.003.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux df63925f8666 4.4.0-134-generic #160~14.04.1-Ubuntu SMP Fri Aug 17 11:07:07 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | branch-2.0 / 169e3bafc8 | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC3 | | unit | https://builds.apache.org/job/PreCommit-HBASE-Build/14839/artifact/patchprocess/patch-unit-hbase-server.txt | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/14839/testReport/ | | Max. process+thread count | 4057
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661666#comment-16661666 ] stack commented on HBASE-21344: --- You need to add .branch-2.0. into the name of your patch, [~an...@apache.org]. Also, this is an important patch, because the left-over start will prevent us from getting to the wait-on-meta holding pattern.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661658#comment-16661658 ] Hadoop QA commented on HBASE-21344: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 6s{color} | {color:red} HBASE-21344 does not apply to branch-2. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HBASE-21344 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945321/HBASE-21344-branch-2.0_v3.patch | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/14837/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661655#comment-16661655 ]

stack commented on HBASE-21344:
---

The change in HBaseTestingUtility is just formatting? Otherwise, patch seems good.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661580#comment-16661580 ]

stack commented on HBASE-21344:
---

Make new patch [~an...@apache.org]? Thanks.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661554#comment-16661554 ]

Ankit Singhal commented on HBASE-21344:
---

{quote}
1096 Optional optProc = this.procedureExecutor.getProcedures().stream()
1097     .filter(p -> p instanceof ServerCrashProcedure).map(o -> (ServerCrashProcedure) o)
1098     .filter(s -> s.hasMetaTableRegion()).findAny();

You are reporting SCPs only if they have meta on them? Isn't this method more generic than just meta searches?
{quote}

Yes, my bad, we don't need this particular change.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661548#comment-16661548 ]

stack commented on HBASE-21344:
---

bq. Actually, I'm working against branch-2.0 only, here you can see tableStateManager is started 2 times,

My bad. Indeed, 2.0 has this. 2.1 does not.

What are you doing here in the patch?

1096 Optional optProc = this.procedureExecutor.getProcedures().stream()
1097     .filter(p -> p instanceof ServerCrashProcedure).map(o -> (ServerCrashProcedure) o)
1098     .filter(s -> s.hasMetaTableRegion()).findAny();

You are reporting SCPs only if they have meta on them? Isn't this method more generic than just meta searches?
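The stream pipeline under review is easy to sandbox. Below is a self-contained sketch contrasting the meta-only filter from the patch with the generic any-SCP search stack is asking about; the `Procedure`/`ServerCrashProcedure` classes here are minimal stand-ins, not the real `org.apache.hadoop.hbase` types:

```java
import java.util.List;
import java.util.Optional;

// Stand-ins for the real procedure classes, just enough for the filter shape.
class Procedure {}

class ServerCrashProcedure extends Procedure {
    private final boolean carryingMeta;
    ServerCrashProcedure(boolean carryingMeta) { this.carryingMeta = carryingMeta; }
    boolean hasMetaTableRegion() { return carryingMeta; }
}

public class ScpSearch {
    // The shape in the patch: report an SCP only if it carries hbase:meta.
    static Optional<ServerCrashProcedure> findMetaScp(List<Procedure> procs) {
        return procs.stream()
            .filter(p -> p instanceof ServerCrashProcedure)
            .map(p -> (ServerCrashProcedure) p)
            .filter(ServerCrashProcedure::hasMetaTableRegion)
            .findAny();
    }

    // The generic variant stack is asking about: any SCP, meta or not.
    static Optional<ServerCrashProcedure> findAnyScp(List<Procedure> procs) {
        return procs.stream()
            .filter(p -> p instanceof ServerCrashProcedure)
            .map(p -> (ServerCrashProcedure) p)
            .findAny();
    }

    public static void main(String[] args) {
        // One non-SCP procedure and one SCP that does not carry meta.
        List<Procedure> procs = List.of(new Procedure(), new ServerCrashProcedure(false));
        System.out.println(findMetaScp(procs).isPresent()); // the SCP exists but carries no meta
        System.out.println(findAnyScp(procs).isPresent());
    }
}
```

The difference matters for a general "are there pending SCPs?" check: the extra `hasMetaTableRegion()` filter silently drops crash procedures for servers that were not hosting meta.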
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661544#comment-16661544 ]

Ankit Singhal commented on HBASE-21344:
---

bq. You don't seem to be working against the tip of branch-2.0 or branch-2.1. You seem to be working in your own branch? Is that so? If so, startup has changed pretty radically since 2.0.0.

Actually, I'm working against branch-2.0 only. Here you can see tableStateManager is started 2 times. At this instance, we only wait for IMP (which will be ok during the first start after deploy) but not when there are SCPs:
https://github.com/apache/hbase/blob/branch-2.0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L929

TableStateManager is started after meta is actually online (which is correct):
https://github.com/apache/hbase/blob/branch-2.0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L958
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661513#comment-16661513 ]

stack commented on HBASE-21344:
---

[~an...@apache.org] You don't seem to be working against the tip of branch-2.0 or branch-2.1. You seem to be working in your own branch? Is that so? If so, startup has changed pretty radically since 2.0.0.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661395#comment-16661395 ]

Ankit Singhal commented on HBASE-21344:
---

bq. This should be happening already. We wait on meta assign. If SCPs, they'll run and recover meta if one of them was holding it. If no assign for meta in the procedure store, then something untoward and at least for now, operator needs to figure what happened until we fix the bug. Operator can schedule an assign with hbck2
bq. branch-2.0 will go into a holding pattern if hbase:meta is not assigned (ditto if hbase:namespace is not assigned) waiting on operator intervention to clear the lack-of-assign.

Thanks [~stack] for the pointer. I didn't go down that path, as the problem started because we are starting tableStateManager without waiting for meta assignment by the SCPs. I think we can just remove this from here, as we already start it after waiting for meta to get online (attached patch for the same):
{code}
 if (initMetaProc != null) {
   initMetaProc.await();
 }
-tableStateManager.start();
{code}

bq. That said, I see some value in this patch. In particular the bit around resetting hbase:meta state if failure.

We shouldn't offline the meta if we are failing the assignment, as that will start the InitMetaProcedure (which we don't want, as the SCP needs to take care of recovering meta).
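The stuck-OPENING failure mode, and the "reset hbase:meta state on failure" idea being discussed, can be modeled in a few lines. This is a deliberately simplified state machine, not HBase code; every name in it (`MetaLocation`, `onAssignFailure`, the enum) is illustrative only — the real logic lives around AssignProcedure and the meta-location znode:

```java
// Simplified model of the meta-location state kept in ZooKeeper.
enum RegionStateZk { OFFLINE, OPENING, OPEN }

class MetaLocation {
    RegionStateZk state = RegionStateZk.OFFLINE;

    void markOpening() { state = RegionStateZk.OPENING; }

    // The bug: a failed assign rolled back without touching the znode, leaving
    // state == OPENING forever, so the Master could never re-assign meta.
    // Modeled fix: on failure, put meta back to OFFLINE so a later assign
    // (e.g. driven by an SCP) can proceed.
    void onAssignFailure() {
        if (state == RegionStateZk.OPENING) {
            state = RegionStateZk.OFFLINE;
        }
    }
}

public class MetaStateReset {
    public static void main(String[] args) {
        MetaLocation meta = new MetaLocation();
        meta.markOpening();     // step 4 of the summary: meta marked OPENING on RS3
        meta.onAssignFailure(); // steps 6-7: retries exhausted, procedure rolls back
        System.out.println(meta.state);
    }
}
```

Without the `onAssignFailure()` reset, the model ends in `OPENING` — exactly the state the issue title describes as precluding the Master from assigning meta forever.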
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661184#comment-16661184 ] stack commented on HBASE-21344: --- [~an...@apache.org] bq. ...ok, so during master initialization(restart or standby becoming active), do I search for SCP in procedure queue(and wait on it) and see if it is holding meta and can recover from splitting of logs to the assignment of Meta? This should be happening already. We wait on meta assign. If SCPs, they'll run and recover meta if one of them was holding it. If no assign for meta in the procedure store, then something untoward and at least for now, operator needs to figure what happened until we fix the bug. Operator can schedule an assign with hbck2 https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2#master-startup-cannot-progress-in-holding-pattern-until-region-onlined In fact, here we see difference between your hbase version and tip of branch-2.0. branch-2.0 will go into a holding pattern if hbase:meta is not assigned (ditto if hbase:namespace is not assigned) waiting on operator intevention to clear the lack-of-assign. I filed a sub-issue so we keep retrying rather than giving up after 10 as we currently do as per [~allan163] suggestion so we avoid the rollback and the meta in an OPENING state. That said, I see some value in this patch. In particular the bit around resetting hbase:meta state if failure. Thanks. > hbase:meta location in ZooKeeper set to OPENING by the procedure which > eventually failed but precludes Master from assigning it forever > --- > > Key: HBASE-21344 > URL: https://issues.apache.org/jira/browse/HBASE-21344 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Reporter: Ankit Singhal >Assignee: Ankit Singhal >Priority: Major > Attachments: HBASE-21344-branch-2.0.patch > > > [~elserj] has already summarized it well. > 1. hbase:meta was on RS8 > 2. RS8 crashed, SCP was queued for it, meta first > 3. 
meta was marked OFFLINE > 4. meta marked as OPENING on RS3 > 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue > 6. We attempt the openRegion/assignment 10 times, failing each time > 7. We start rolling back the procedure: > {code:java} > 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: > Usually this should not happen, we will release the lock before if the > procedure is finished, even if the holdLock is true, arrive here means we > have some holes where we do not release the lock. And the releaseLock below > may fail since the procedure may have already been deleted from the procedure > store. > 2018-10-08 06:51:24,543 INFO [PEWorker-9] > procedure.MasterProcedureScheduler: pid=48, ppid=47, > state=FAILED:REGION_TRANSITION_QUEUE, > exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via > AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max > attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 > checking lock on 1588230740 > {code} > {code:java} > 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: > CODE-BUG: Uncaught runtime exception for pid=47, > state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, > exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via > AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max > attempts exceeded; ServerCrashProcedure > server=,16020,1538974612843, splitWal=true, meta=true > java.lang.UnsupportedOperationException: unhandled > state=SERVER_CRASH_GET_REGIONS > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203) > at > 
org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981) > {code} > {code:java} > { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call except
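The operator intervention stack mentions above is driven with HBCK2, per the README linked in the comment. As a sketch only (the jar path and invocation shape vary by HBCK2 version and install layout; check the linked README for your release), scheduling an assign of hbase:meta looks roughly like:

```shell
# Schedule an assign of hbase:meta; 1588230740 is the well-known encoded
# region name of the meta region. The jar path here is an assumption.
hbase hbck -j ./hbase-hbck2.jar assigns 1588230740
```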
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660875#comment-16660875 ] Josh Elser commented on HBASE-21344: {quote} > Is this 2.0.0 or branch-2.0? I checked with branch-2.0; may need to check other branches as well. {quote} Meant to state this earlier -- this was on an internal branch, but it was based on bc12f38b4689953f0eea50c6a19d3bdeb7468f5a from branch-2.0 (which was HBASE-21261, Mon Oct 1 18:08:37 2018).
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659239#comment-16659239 ] Josh Elser commented on HBASE-21344: {quote}This is not right: the meta table may be on a crashed server and in RIT state; if we assign it directly, it may lose some data, since the WAL may not have been replayed. {quote} Ah, I think I see what you're saying, Allan; I didn't consider that. I was just thinking about it from the context of the failed AP at the end of an (otherwise) successful SCP. We don't necessarily know that there isn't an SCP that needs to happen (and that the SCP wouldn't be re-running when this new master comes up). I was just thinking about this problem knowing that there was no SCP to run and assuming IMP was the one who should be assigning meta. {quote}I suggest we can set hbase.assignment.maximum.attempts to Long.MAX, so the AP will retry forever, until we resolve the problem that causes the region assignment to fail. {quote} {quote}We can't afford for the AP to fail (and thus the SCP to roll back), leaving some region unassigned without our knowing about it {quote} I get what you're saying, but I'm not excited about that being our answer, especially given the seriousness of meta being offline and that we know/expect such problems to happen. Thinking out loud... the goal is that IMP is just doing a "normal" assignment of meta _or_ creating it for the first time. SCP is the one responsible, if meta was on a failed server, for getting it back onto the cluster. We can't re-run an SCP, so we're left with these hosed regions. I want to suggest something that can make sure we get Regions re-assigned (otherwise, huge user pain-point), but I need to read some more code to think about that some more :) Thanks for your help (and Stack's too). Great to come back to!
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658540#comment-16658540 ] Allan Yang commented on HBASE-21344: For the problem that SCP or AP may fail due to all kinds of issues (a security issue like this one), I suggest we set hbase.assignment.maximum.attempts to Long.MAX, so the AP will retry forever, until we resolve the problem that causes the region assignment to fail. We can't afford for the AP to fail (and thus the SCP to roll back), leaving some region unassigned without our knowing about it (since the corresponding procedures are rolled back, the region won't show as RIT in the WebUI). I think branch-1.x also has this kind of issue: the assign fails but shows nothing in the WebUI, and we don't know a region is not online until a customer reaches us.
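Allan's suggestion above is a configuration change. As a sketch only (hbase.assignment.maximum.attempts is the branch-2.0 property being discussed, but whether retrying effectively forever is advisable is exactly what the thread is debating), the hbase-site.xml entry would look like:

```xml
<property>
  <!-- Retry region assignment effectively forever (Long.MAX_VALUE)
       instead of rolling back after the default 10 attempts. -->
  <name>hbase.assignment.maximum.attempts</name>
  <value>9223372036854775807</value>
</property>
```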
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658515#comment-16658515 ] Allan Yang commented on HBASE-21344:
{code}
-    if (assignmentManager.getRegionStates().getRegionState(RegionInfoBuilder.FIRST_META_REGIONINFO)
-        .isOffline()) {
+    RegionState metaRegionState =
+        assignmentManager.getRegionStates().getRegionState(RegionInfoBuilder.FIRST_META_REGIONINFO);
+    if (!metaRegionState.isOpened()) {
       Optional<Procedure<?>> optProc = procedureExecutor.getProcedures().stream()
           .filter(p -> p instanceof InitMetaProcedure).findAny();
-      if (optProc.isPresent()) {
+      // check that we are not loading a successful procedure from the last master, as meta is
+      // still not in OPEN state
+      // this also avoids unnecessary waiting on the latch (and getting stuck), as the countdown
+      // was reset and will never reach zero because the procedure is not running
+      if (optProc.isPresent() && !((InitMetaProcedure) optProc.get()).isSuccess()) {
         initMetaProc = (InitMetaProcedure) optProc.get();
       } else {
         // schedule an init meta procedure if meta has not been deployed yet
{code}
This is not right: the meta table may be on a crashed server and in RIT state; if we assign it directly, it may lose some data, since the WAL may not have been replayed.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658426#comment-16658426 ] Ankit Singhal commented on HBASE-21344: ---
bq. To what are you referring? Looking in AP and SCP, there is no rollback -- but you are referring to a specific location. Looking at the patch, it looks like you are working in AP#FAILED_OPEN.
Yes, I meant AP#FAILED_OPEN. Thank you for checking.
bq. (quoting the snippet)
{code}
400   try {
401     handleFailure(env, regionNode);
402   } catch (IOException e) {
403     return false;
404   }
{code}
bq. Previously we'd let IOEs out; now you are catching them and converting them to false. Maybe catch more locally around undoRegionAsOpening since this is the new source of IOE.
Previously there was no IOException thrown by handleFailure but, agreed, I'll catch locally around undoRegionAsOpening and add a warning.
bq. Did you change decrementMinRegionServerCount to public for tests? If so, add @VisibleForTesting...
Yes, sure, will make the change.
bq. Yes. Philosophy is that all recovery is done via SCP since it has the means for splitting WALs (or figuring this step can be skipped).
OK, so during master initialization (restart or standby becoming active), do I search for an SCP in the procedure queue (and wait on it) to see if it is holding meta and can recover from splitting of logs through to the assignment of meta? Thank you so much for the review; let me know once you have more comments. (I'll be happy to work on them.)
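The review feedback above (catch the IOException locally around undoRegionAsOpening and warn, rather than converting any failure of the whole handler into `return false`) can be sketched as a toy, self-contained model. This is not HBase source; all names are stand-ins for the patch's methods:

```java
import java.io.IOException;

// Illustrative sketch of scoping the try/catch to the new IOException source.
public class FailedOpenSketch {
    static boolean warned = false;

    // Stand-in for the step that resets the meta location znode: the new IOE source.
    static void undoRegionAsOpening() throws IOException {
        throw new IOException("zk unavailable");
    }

    static void handleFailure() {
        try {
            undoRegionAsOpening();
        } catch (IOException e) {
            // Narrow catch: warn and continue instead of failing the whole handler.
            warned = true;
        }
        // ...the rest of the original failure handling runs regardless...
    }
}
```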
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658410#comment-16658410 ] stack commented on HBASE-21344: ---
bq. Here actually, we are correcting the rollback of assign procedure for Meta in both IMP and SCP.
Do you mean rollback of AssignProcedure? (IMP has one step, making an AssignProcedure as a subprocedure for meta.) Rollback and AP don't really go together; AP doesn't support rollback. When you say...
bq. Earlier, rollback of assign corrects the meta region node (by moving it to offline state)
...to what are you referring? Looking in AP and SCP, there is no rollback -- but you are referring to a specific location. Looking at the patch, it looks like you are working in AP#FAILED_OPEN. If so, your changes in here look good... I wonder about this one though:
{code}
400   try {
401     handleFailure(env, regionNode);
402   } catch (IOException e) {
403     return false;
404   }
{code}
Previously we'd let IOEs out; now you are catching them and converting them to false. Maybe catch more locally around undoRegionAsOpening since this is the new source of IOE. I think this addition of yours looks good down here in undoRegionAsOpening. Did you change decrementMinRegionServerCount to public for tests? If so, add the @VisibleForTesting... annotation sir. Let me get back to you after I study your tests more. They look good.
bq. Do you think scheduling IMP without checking whether meta logs were split or not will cause any problem?
Yes. Philosophy is that all recovery is done via SCP since it has the means for splitting WALs (or figuring this step can be skipped).
bq. HBASE-21035 looks quite similar but it seems that the handling is more related to the case where the procedure WALs are accidentally/intentionally cleared.
That is correct, but the general practice suggested is that when unsure, then for now at least, we fall back to operator intervention. Please be patient with me Ankit. I'm generally slow.
It takes me a while to understand. Thanks for the help.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658335#comment-16658335 ] Ankit Singhal commented on HBASE-21344: --- Thanks [~stack] for taking a look. Please find my responses below.
{quote}So, you are trying to figure the case where the assign in IMP failed to succeed -- where the region is stuck in the OPENING state -- and if you can find this condition, you'd reschedule an IMP (the body of which happens to be an assign of meta)? {quote}
Here, actually, we are correcting the rollback of the assign procedure for meta in both IMP and SCP. We are not re-scheduling the IMP until the master restarts (or a standby becomes active) and finds that meta is still not OPEN. Earlier, the rollback of the assign corrected the meta region node (by moving it to the offline state), but for meta we also store state in another znode (/hbase/meta-region-server), which was not cleared or set back to offline. (The patch is trying to fix this.)
{quote}What you think of the discussion over in HBASE-21035 where we decide to punt on auto-assign for now at least (IMP only assigns, doesn't do recovery of meta WALs if any). {quote}
HBASE-21035 looks quite similar, but it seems the handling there is more related to the case where the procedure WALs are accidentally/intentionally cleared. In our case, splitting of the meta logs completed but the assign in SCP failed, which left the meta region node in the OPENING state even after rollback; now meta is never assigned (even after a restart), resulting in the SCP never kicking in and the cluster getting stuck.
So to fix this we are doing two things:
* fixing the meta region-server znode (back to the OFFLINE state) during the rollback (undoRegionOpening)
* during master initialization, checking meta assignment; if it is still not OPEN, we schedule another IMP for assignment and wait on its completion

(Do you think scheduling the IMP without checking whether the meta logs were split will cause any problem?)
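The two-part fix described above can be sketched with a toy model. The class, enum, and method names below are hypothetical stand-ins, not the actual HBase classes; the sketch only illustrates the state-machine bug: a rollback that forgets the meta-location znode leaves it stuck in OPENING, while the fixed rollback resets it to OFFLINE.

```java
// Toy model of the /hbase/meta-region-server znode state (hypothetical
// names; not the real HBase code). A failed assign marks the znode
// OPENING; the buggy rollback never touched it, while the fixed rollback
// (undoRegionOpening in the patch) moves it back to OFFLINE.
public class MetaZnodeModel {
    public enum State { OFFLINE, OPENING, OPEN }

    private State znode = State.OFFLINE;

    // Assign attempt: znode goes to OPENING, then the openRegion RPC
    // fails and retries are exhausted.
    public void failedAssign() {
        znode = State.OPENING;
    }

    // Buggy rollback: the meta region node is corrected elsewhere, but
    // the meta-location znode is left as-is.
    public void rollbackBuggy() {
        // no-op on the znode
    }

    // Fixed rollback: also reset the meta-location znode to OFFLINE so
    // a later master can schedule assignment again.
    public void rollbackFixed() {
        znode = State.OFFLINE;
    }

    public State state() {
        return znode;
    }
}
```

In the real patch, the second half of the fix additionally schedules an IMP at master startup whenever meta is found in any state other than OPEN.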
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657699#comment-16657699 ] stack commented on HBASE-21344: --- [~an...@apache.org] So, you are trying to figure the case where the assign in IMP failed to succeed -- where the region is stuck in the OPENING state -- and if you can find this condition, you'd reschedule an IMP (the body of which happens to be an assign of meta)? What do you think of the discussion over in HBASE-21035, where we decided to punt on auto-assign for now at least (IMP only assigns; it doesn't do recovery of meta WALs, if any)? (I've been trying to respond to this all day [~an...@apache.org] -- I owe you more -- but let me send what I have so far.)
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657691#comment-16657691 ] Ankit Singhal commented on HBASE-21344: --- bq. You need something for 2.0.0? 2.0.0 is tough because hbck2 only starts working in 2.0.3 (not yet released) or tip of branch-2.0. bq. If you can go to the tip of branch-2.0, you can use hbck2 to schedule an assign of hbase:meta. [~stack], do you think what we did in the attached patch can help in this particular use case? I'll also look at the hbck2 code to see whether it takes care of meta when it is not assigned due to a failure in IMP/SCP.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656264#comment-16656264 ] Josh Elser commented on HBASE-21344: I think Ankit also found (via unit tests in the patch) that you can {{move}} hbase:meta in the shell to fix this without mucking in ZK directly. Trying to distill it down to the simplest explanation: the RegionState for meta in ZK gets left as OPENING after a failed AP for meta (for us, due to a krb issue, but it could be any cause) and then the master is restarted. Rollback of this AP doesn't undo the OPENING RegionState for meta back to OFFLINE (or is it CLOSED? whatever). That's issue #1. Issue #2 is that the IMP submitted by the new active Master doesn't get scheduled, because we only check that meta is not OFFLINE, not that its state is anything but OPEN (a similar problem to the assignment issues we had in earlier 2.0.x). Thus, we just sit and wait for meta to come online, but nothing is actually working on assigning meta. Ensuring we submit that IMP fixes the issue, but setting the state back to OFFLINE during the rollback of the AP would also have helped us avoid this. Ankit's patch should fix both of these. One big open question Ankit and I both have is around the resiliency of that IMP. What happens if, on master startup and IMP submission or replay, the IMP fails? Should we resubmit the IMP? Or do we expect the AM to come along and do this for us? Reading the code, it seems to imply that only SCP or IMP will assign meta, which we think means we need to be more careful watching for the IMP's successful completion. Hope this helps, gents. I reviewed Ankit's patch before he posted it here, and would say I'm +0.99 on it already (want to hear from you smart fellows, obviously). I'm offline on Friday, but will be back Monday if there are any questions for me.
Ankit thoroughly has a handle on the problem, so you probably don't need me further :)
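Issue #2 described above boils down to a one-line predicate change. The sketch below uses made-up names (not the actual master startup code) to show why a stale OPENING state slipped past the old check.

```java
// Sketch of the master-startup decision (hypothetical names). The old
// check scheduled meta assignment only when the znode said OFFLINE, so
// a stale OPENING state passed the check and nothing assigned meta.
// The fixed check schedules assignment whenever meta is not OPEN.
public class MetaAssignCheck {
    public enum State { OFFLINE, OPENING, OPEN }

    // Old behavior: only an OFFLINE meta triggers an InitMetaProcedure.
    public static boolean shouldScheduleOld(State s) {
        return s == State.OFFLINE;
    }

    // Fixed behavior: anything but OPEN triggers assignment.
    public static boolean shouldScheduleFixed(State s) {
        return s != State.OPEN;
    }
}
```

Both predicates agree for OFFLINE and OPEN; they differ only on the stuck OPENING state, which is exactly the hole this issue fell into.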
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656254#comment-16656254 ] stack commented on HBASE-21344: --- I was going to say try removing the znode manually [~an...@apache.org], but you already figured this. You need something for 2.0.0? 2.0.0 is tough because hbck2 only starts working in 2.0.3 (not yet released) or tip of branch-2.0. If you can go to the tip of branch-2.0, you can use hbck2 to schedule an assign of hbase:meta.
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656248#comment-16656248 ] Ankit Singhal commented on HBASE-21344: --- bq. Is this 2.0.0 or branch-2.0? I checked with branch-2.0; other branches may need to be checked as well. bq. If Master is restarted, what happens? The master will not get initialized and gets stuck, because during startup the master only checks whether the meta region is OFFLINE to decide whether to schedule an InitMetaProcedure. {code} java.net.SocketTimeoutException: callTimeout=120, callDuration=1213908: Meta region is in state OPENING row 'test_table' on table 'hbase:meta' at null at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:159) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:386) at org.apache.hadoop.hbase.client.HTable.get(HTable.java:360) at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1066) at org.apache.hadoop.hbase.master.TableStateManager.readMetaState(TableStateManager.java:258) at org.apache.hadoop.hbase.master.TableStateManager.getTableState(TableStateManager.java:213) at org.apache.hadoop.hbase.master.TableStateManager.migrateZooKeeper(TableStateManager.java:338) at org.apache.hadoop.hbase.master.TableStateManager.start(TableStateManager.java:267) at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:914) at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2090) at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:553) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Meta region is in state OPENING at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$ZKTask$1.exec(ReadOnlyZKClient.java:165) at org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient.run(ReadOnlyZKClient.java:323) ... 1 more {code} bq. I have seen similar issues, a Assign procedure of SCP failed and rolled back, and the whole SCP rolled back, left some regions unassigned. Are the user regions not assigned eventually by the balancer or after a master restart? We saw this specifically with the meta region, which is never assigned even after a master restart until we delete the meta znode.
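The retry loop in the stack trace above can be modeled in a few lines. The class and method here are illustrative only, not the real ZKAsyncRegistry API: any cached state other than OPEN makes every meta lookup throw, so callers retry until their timeout.

```java
// Simplified model of the ZooKeeper-based meta lookup (illustrative
// names only). While the znode says OPENING, every lookup fails, which
// is why both clients and master initialization spin on
// "Meta region is in state OPENING".
public class MetaLookupModel {
    public enum State { OFFLINE, OPENING, OPEN }

    public static String locateMeta(State znodeState, String server) {
        if (znodeState != State.OPEN) {
            // The real code surfaces this as an IOException and the
            // caller retries with backoff until callTimeout expires.
            throw new IllegalStateException("Meta region is in state " + znodeState);
        }
        return server;
    }
}
```

Nothing in this loop ever changes the znode, which is why the cluster cannot recover without the fix (or manual znode surgery).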
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656195#comment-16656195 ] Allan Yang commented on HBASE-21344: I have seen similar issues: an Assign procedure of an SCP failed and rolled back, the whole SCP rolled back, and some regions were left unassigned. It is expected behavior but not very user friendly… Maybe we have to schedule another SCP for the failed one?
[jira] [Commented] (HBASE-21344) hbase:meta location in ZooKeeper set to OPENING by the procedure which eventually failed but precludes Master from assigning it forever
[ https://issues.apache.org/jira/browse/HBASE-21344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655979#comment-16655979 ] stack commented on HBASE-21344: --- Is this 2.0.0 or branch-2.0? If Master is restarted, what happens? > hbase:meta location in ZooKeeper set to OPENING by the procedure which > eventually failed but precludes Master from assigning it forever > --- > > Key: HBASE-21344 > URL: https://issues.apache.org/jira/browse/HBASE-21344 > Project: HBase > Issue Type: Bug > Components: proc-v2 >Affects Versions: 2.0.0 >Reporter: Ankit Singhal >Assignee: Ankit Singhal >Priority: Major > > [~elserj] has already summarized it well. > 1. hbase:meta was on RS8 > 2. RS8 crashed, SCP was queued for it, meta first > 3. meta was marked OFFLINE > 4. meta marked as OPENING on RS3 > 5. Can't actually send the openRegion RPC to RS3 due to the krb ticket issue > 6. We attempt the openRegion/assignment 10 times, failing each time > 7. We start rolling back the procedure: > {code:java} > 2018-10-08 06:51:24,440 WARN [PEWorker-9] procedure2.ProcedureExecutor: > Usually this should not happen, we will release the lock before if the > procedure is finished, even if the holdLock is true, arrive here means we > have some holes where we do not release the lock. And the releaseLock below > may fail since the procedure may have already been deleted from the procedure > store. 
> 2018-10-08 06:51:24,543 INFO [PEWorker-9] procedure.MasterProcedureScheduler: pid=48, ppid=47, state=FAILED:REGION_TRANSITION_QUEUE, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> {code}
> {code:java}
> 2018-10-08 06:51:30,957 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=47, state=FAILED:SERVER_CRASH_ASSIGN_META, locked=true, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; ServerCrashProcedure server=,16020,1538974612843, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_GET_REGIONS
> at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:254)
> at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:58)
> at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
> at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:960)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1577)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1539)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1418)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1981)
> {code}
> {code:java}
> { DEBUG [PEWorker-2] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8168 ms
> ago, cancelled=false, msg=Meta region is in state OPENING, details=row 'backup:system' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=, seqNum=-1, exception=java.io.IOException: Meta region is in state OPENING
> at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$null$1(ZKAsyncRegistry.java:154)
> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
> at org.apache.hadoop.hbase.client.ZKAsyncRegistry.lambda$getAndConvert$0(ZKAsyncRegistry.java:77)
> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
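The CODE-BUG in the quoted logs is a rollback path with no handler for the state being rolled back: the ServerCrashProcedure reached SERVER_CRASH_GET_REGIONS during rollback and its rollbackState switch had no case for it. A minimal sketch of that failure pattern (hypothetical State enum and CrashRollback class, not the real HBase code):

```java
// Hypothetical sketch of a StateMachineProcedure-style rollback. Only some
// states have an undo action; any other state falls through to an
// UnsupportedOperationException, which is exactly what the log above shows
// for SERVER_CRASH_GET_REGIONS. The uncaught exception aborts the rollback,
// so the meta znode state written earlier (OPENING) is never cleaned up.
class CrashRollback {
    enum State { GET_REGIONS, ASSIGN_META, SPLIT_LOGS, FINISH }

    static void rollbackState(State state) {
        switch (state) {
            case ASSIGN_META:
                // undo the meta assignment attempt (elided in this sketch)
                break;
            default:
                throw new UnsupportedOperationException("unhandled state=" + state);
        }
    }

    public static void main(String[] args) {
        rollbackState(State.ASSIGN_META); // has an undo action, returns normally
        try {
            rollbackState(State.GET_REGIONS);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage()); // prints "unhandled state=GET_REGIONS"
        }
    }
}
```

The sketch suggests why a crash mid-procedure is so damaging here: rollback dies partway through, leaving externally visible state (the ZooKeeper meta location) half-updated.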
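The client-side symptom in the trace above is ZKAsyncRegistry refusing to return a meta location whose recorded state is not OPEN. A simplified model of that check, using hypothetical names (MetaLocationCache, RegionState) rather than the real ZKAsyncRegistry internals:

```java
import java.io.IOException;

// Hypothetical model of the lookup behind the stack trace above: the meta
// location read from ZooKeeper carries the state the assignment procedure
// last wrote there. If a failed procedure leaves the state at OPENING and
// nothing ever resets it, every client lookup fails with this IOException,
// and no amount of retrying helps -- matching the retry loop in the logs.
class MetaLocationCache {
    enum RegionState { OFFLINE, OPENING, OPEN }

    private final RegionState state;
    private final String serverName; // may be empty, as in the log above

    MetaLocationCache(RegionState state, String serverName) {
        this.state = state;
        this.serverName = serverName;
    }

    String getMetaRegionLocation() throws IOException {
        if (state != RegionState.OPEN) {
            // Mirrors the message in the quoted trace.
            throw new IOException("Meta region is in state " + state);
        }
        return serverName;
    }

    public static void main(String[] args) throws IOException {
        MetaLocationCache stuck = new MetaLocationCache(RegionState.OPENING, "");
        try {
            stuck.getMetaRegionLocation();
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints "Meta region is in state OPENING"
        }
        MetaLocationCache healthy =
            new MetaLocationCache(RegionState.OPEN, "rs3,16020,1538974612843");
        System.out.println(healthy.getMetaRegionLocation());
    }
}
```

Under this model, the only ways out are the fix on this issue (correcting the rollback so the state is reset) or operator intervention on the znode, which is why the title says the Master is precluded from assigning meta "forever".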