[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-26 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Pushed to branch-2. Thanks all for reviewing.

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095-v1.patch, 
> HBASE-21095-v2.patch, HBASE-21095.branch-2.0.001.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-24 Thread stack (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-21095:
--
Fix Version/s: (was: 2.1.1)
   (was: 2.0.2)

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095-v1.patch, 
> HBASE-21095-v2.patch, HBASE-21095.branch-2.0.001.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-23 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21095:
---
Attachment: HBASE-21095.branch-2.0.001.patch

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.2
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095-v1.patch, 
> HBASE-21095-v2.patch, HBASE-21095.branch-2.0.001.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-23 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Attachment: HBASE-21095-v2.patch

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.2
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095-v1.patch, 
> HBASE-21095-v2.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-23 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Attachment: HBASE-21095-v1.patch

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0, 2.1.1, 2.0.2
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095-v1.patch, 
> HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Issue Type: Sub-task  (was: Bug)
Parent: HBASE-20828

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Attachment: HBASE-21095.patch

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
> Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Attachment: HBASE-21095-branch-2.0.patch

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
> Attachments: HBASE-21095-branch-2.0.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Assignee: Duo Zhang
  Status: Patch Available  (was: Open)

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
> Attachments: HBASE-21095-branch-2.0.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Priority: Critical  (was: Major)

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Description: For TRSP, and also RTP in branch-2.0 and branch-2.1, if we 
fail to assign or unassign a region, we will set the procedure to 
WAITING_TIMEOUT state, and rely on the ProcedureEvent in RegionStateNode to 
wake us up later. But after restarting, we do not suspend the ProcedureEvent in 
RSN, and also do not add the procedure to the ProcedureEvent's suspending 
queue, so we will hang there forever as no one will wake us up.  (was: It also 
uses TRSP as sub procedure so probably we should set killIfHasParent to true, 
but the log is a bit interesting, that we just hang there without executing any 
procedures after a restart, but for other tests where we need to set 
killIfHasParent to true, we will keep executing procedures but do not make any 
progress.

Need to dig more.)

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Fix Version/s: 2.1.1
   2.2.0
   2.0.2
   3.0.0

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.0.2, 2.2.0, 2.1.1
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or 
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and 
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after 
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add 
> the procedure to the ProcedureEvent's suspending queue, so we will hang there 
> forever as no one will wake us up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts

2018-08-22 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21095:
--
Summary: The timeout retry logic for several procedures are broken after 
master restarts  (was: The timeout retry logic for several procedures are 
broken)

> The timeout retry logic for several procedures are broken after master 
> restarts
> ---
>
> Key: HBASE-21095
> URL: https://issues.apache.org/jira/browse/HBASE-21095
> Project: HBase
>  Issue Type: Bug
>  Components: amv2, proc-v2
>Reporter: Duo Zhang
>Priority: Major
>
> It also uses TRSP as sub procedure so probably we should set killIfHasParent 
> to true, but the log is a bit interesting, that we just hang there without 
> executing any procedures after a restart, but for other tests where we need 
> to set killIfHasParent to true, we will keep executing procedures but do not 
> make any progress.
> Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)