[jira] [Commented] (HBASE-21222) [amv2] Closing region on a non-existent server creates STUCK regions

2018-09-24 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625816#comment-16625816
 ] 

stack commented on HBASE-21222:
---

Yes. A workaround is clearing out the old location. That seems to work. I'll 
write it up.

> [amv2] Closing region on a non-existent server creates STUCK regions
> 
>
> Key: HBASE-21222
> URL: https://issues.apache.org/jira/browse/HBASE-21222
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Reporter: stack
> Assignee: stack
> Priority: Major
>
> Ran into this one where a Region had been on a server but after a bunch of 
> crashing and meddling in Master Proc WALs, any attempt at unassign has the 
> procedure fail (see below) and then report the region as STUCK.
> I broke the lock w/ new hbck2 tooling and then tried to offline again but 
> same thing happened. Bug. Fix.
> {code}
> 2018-09-22 18:36:41,900 INFO 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch 
> pid=138650, ppid=121871, state=RUNNABLE:REGION_TRANSITION_DISPATCH, 
> locked=true; UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180614072614, 
> region=51cdade76ca7217ec191f39e5f56c61c, 
> server=vd0637.halxg.cloudera.com,22101,1537397969558; rit=CLOSING, 
> location=vd0637.halxg.cloudera.com,22101,1537397969558
> 2018-09-22 18:36:41,899 INFO 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: 
> pid=138646, ppid=121871, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=IntegrationTestBigLinkedList_20180614072614, 
> region=0780467efe4c5901887fb12bfa406fa7, 
> server=vc1228.halxg.cloudera.com,22101,1537578279837 checking lock on 
> 0780467efe4c5901887fb12bfa406fa7
> 2018-09-22 18:36:41,900 WARN 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Remote 
> call failed vd0637.halxg.cloudera.com,22101,1537397969558; pid=138650, 
> ppid=121871, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; 
> UnassignProcedure table=IntegrationTestBigLinkedList_20180614072614, 
> region=51cdade76ca7217ec191f39e5f56c61c, 
> server=vd0637.halxg.cloudera.com,22101,1537397969558; rit=CLOSING, 
> location=vd0637.halxg.cloudera.com,22101,1537397969558; 
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> vd0637.halxg.cloudera.com,22101,1537397969558; pid=138650, ppid=121871, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180614072614, 
> region=51cdade76ca7217ec191f39e5f56c61c, 
> server=vd0637.halxg.cloudera.com,22101,1537397969558
>     at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:177)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:277)
>     at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:202)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:370)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:924)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1684)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1471)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:77)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1983)
> 2018-09-22 18:36:41,903 WARN 
> org.apache.hadoop.hbase.master.assignment.UnassignProcedure: Expiring 
> vd0637.halxg.cloudera.com,22101,1537397969558, pid=138650, ppid=121871, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure 
> table=IntegrationTestBigLinkedList_20180614072614, 
> region=51cdade76ca7217ec191f39e5f56c61c, 
> server=vd0637.halxg.cloudera.com,22101,1537397969558 rit=CLOSING, 
> location=vd0637.halxg.cloudera.com,22101,1537397969558; 
> exception=NoServerDispatchException
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21222) [amv2] Closing region on a non-existent server creates STUCK regions

2018-09-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625812#comment-16625812
 ] 

Duo Zhang commented on HBASE-21222:
---

Got it. So we need a tool in HBCK2 to handle this case.



[jira] [Commented] (HBASE-21222) [amv2] Closing region on a non-existent server creates STUCK regions

2018-09-24 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625801#comment-16625801
 ] 

stack commented on HBASE-21222:
---

In this case it is because WALs were deleted.

Been thinking about this. It should not happen during usual operation, but we 
should have some defense in place just in case it does manage to bubble up.
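
A minimal sketch of what such a defense might look like, written as a toy model and not as HBase code. Every name here (`UnassignSketch`, `RegionNode`, `liveServers`, `unassign`) is hypothetical, and treating the region as closed once its stale location is cleared is an assumption for illustration; a real fix would also have to account for WAL recovery of the missing server, which this ignores.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only: none of these types are HBase classes. */
public class UnassignSketch {
    enum State { OPEN, CLOSING, CLOSED }

    /** Hypothetical stand-in for the master's view of one region. */
    static final class RegionNode {
        State state = State.OPEN;
        String location; // server name, or null when no server hosts the region

        RegionNode(String location) {
            this.location = location;
        }
    }

    // Servers the (toy) master currently knows about.
    private final Set<String> liveServers = new HashSet<>();

    void addServer(String serverName) {
        liveServers.add(serverName);
    }

    /**
     * Returns true if a close request would be dispatched to the hosting
     * server, false if the recorded location was stale and got cleared.
     */
    boolean unassign(RegionNode region) {
        if (region.location == null || !liveServers.contains(region.location)) {
            // Assumption: a server the master does not know cannot be hosting
            // the region, so clear the old location instead of dispatching to
            // it and failing over and over.
            region.location = null;
            region.state = State.CLOSED;
            return false;
        }
        // A real system would send the close RPC here.
        region.state = State.CLOSING;
        return true;
    }

    public static void main(String[] args) {
        UnassignSketch master = new UnassignSketch();
        master.addServer("live.example.org,22101,1");

        RegionNode stale = new RegionNode("gone.example.org,22101,1");
        boolean dispatched = master.unassign(stale);
        System.out.println(dispatched + " " + stale.state + " " + stale.location);
    }
}
```

The point of the sketch is only the guard before dispatch: a region whose recorded location is not a known server takes the "clear the old location" path, which mirrors the manual workaround, and never reaches the remote call.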



[jira] [Commented] (HBASE-21222) [amv2] Closing region on a non-existent server creates STUCK regions

2018-09-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625769#comment-16625769
 ] 

Duo Zhang commented on HBASE-21222:
---

Is this because you deleted all the master proc WALs? Or could it happen after 
crashing and failover? If the latter, I think there are critical bugs.
