[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Resolution: Fixed
    Hadoop Flags: Reviewed
    Status: Resolved  (was: Patch Available)

Pushed to branch-2. Thanks all for reviewing.

> The timeout retry logic for several procedures are broken after master restarts
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-21095
>                 URL: https://issues.apache.org/jira/browse/HBASE-21095
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, proc-v2
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 3.0.0, 2.2.0
>
>         Attachments: HBASE-21095-branch-2.0.patch, HBASE-21095-v1.patch,
>                      HBASE-21095-v2.patch, HBASE-21095.branch-2.0.001.patch,
>                      HBASE-21095.patch
>
>
> For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or
> unassign a region, we will set the procedure to WAITING_TIMEOUT state, and
> rely on the ProcedureEvent in RegionStateNode to wake us up later. But after
> restarting, we do not suspend the ProcedureEvent in RSN, and also do not add
> the procedure to the ProcedureEvent's suspending queue, so we will hang there
> forever as no one will wake us up.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
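The hang described in the issue can be illustrated with a small, self-contained Java sketch. It is a toy model only: `ToyProcedure`, `ToyEvent`, and `simulate` are hypothetical names invented for this illustration and are not the HBase `Procedure`, `ProcedureEvent`, or `RegionStateNode` classes. The `reRegisterOnRestart` branch shows one conceivable remedy and is not necessarily what the attached patches do. The point is that a wake-up only reaches procedures that sit in the event's waiting queue, so a procedure restored in WAITING_TIMEOUT state without being re-queued is never woken.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class WaitingTimeoutSketch {

  enum State { RUNNABLE, WAITING_TIMEOUT }

  /** Hypothetical stand-in for a procedure; not the HBase Procedure class. */
  static final class ToyProcedure {
    State state = State.RUNNABLE;
  }

  /** Hypothetical stand-in for the event kept in a region state node. */
  static final class ToyEvent {
    private boolean suspended = false;
    private final Queue<ToyProcedure> waiters = new ArrayDeque<>();

    void suspend() {
      suspended = true;
    }

    /** Queues the procedure only while the event is suspended. */
    boolean suspendIfNotReady(ToyProcedure proc) {
      if (!suspended) {
        return false;
      }
      waiters.add(proc);
      return true;
    }

    /** Wakes every queued procedure; a procedure that is not queued is never touched. */
    void wake() {
      suspended = false;
      ToyProcedure proc;
      while ((proc = waiters.poll()) != null) {
        proc.state = State.RUNNABLE;
      }
    }
  }

  /** Returns true if the restored procedure is woken after the simulated restart. */
  public static boolean simulate(boolean reRegisterOnRestart) {
    // Before the restart: the assign fails and the procedure waits on the event.
    ToyProcedure beforeRestart = new ToyProcedure();
    ToyEvent eventBefore = new ToyEvent();
    eventBefore.suspend();
    eventBefore.suspendIfNotReady(beforeRestart);
    beforeRestart.state = State.WAITING_TIMEOUT;

    // Restart: only the persisted procedure state survives; the in-memory
    // event and its waiting queue are rebuilt empty.
    ToyProcedure restored = new ToyProcedure();
    restored.state = beforeRestart.state;
    ToyEvent eventAfter = new ToyEvent();

    if (reRegisterOnRestart) {
      // Assumed remedy for illustration: suspend the event again and re-queue.
      eventAfter.suspend();
      eventAfter.suspendIfNotReady(restored);
    }

    // The timeout fires and wakes the event.
    eventAfter.wake();
    return restored.state == State.RUNNABLE;
  }

  public static void main(String[] args) {
    System.out.println("woken without re-registration: " + simulate(false)); // false
    System.out.println("woken with re-registration: " + simulate(true));     // true
  }
}
```

In the first case the wake-up finds an empty queue, which corresponds to the "no one will wake us up" situation in the description.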
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-21095:
--------------------------
    Fix Version/s:     (was: 2.1.1)
                       (was: 2.0.2)
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allan Yang updated HBASE-21095:
-------------------------------
    Attachment: HBASE-21095.branch-2.0.001.patch
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Attachment: HBASE-21095-v2.patch
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Attachment: HBASE-21095-v1.patch
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: HBASE-20828
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Attachment: HBASE-21095.patch
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Attachment: HBASE-21095-branch-2.0.patch
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Assignee: Duo Zhang
      Status: Patch Available  (was: Open)
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Priority: Critical  (was: Major)
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Description: For TRSP, and also RTP in branch-2.0 and branch-2.1, if we fail to assign or unassign a region, we will set the procedure to WAITING_TIMEOUT state, and rely on the ProcedureEvent in RegionStateNode to wake us up later. But after restarting, we do not suspend the ProcedureEvent in RSN, and also do not add the procedure to the ProcedureEvent's suspending queue, so we will hang there forever as no one will wake us up.  (was: It also uses TRSP as sub procedure so probably we should set killIfHasParent to true, but the log is a bit interesting, that we just hang there without executing any procedures after a restart, but for other tests where we need to set killIfHasParent to true, we will keep executing procedures but do not make any progress.

Need to dig more.)
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Fix Version/s: 2.1.1
                   2.2.0
                   2.0.2
                   3.0.0
[jira] [Updated] (HBASE-21095) The timeout retry logic for several procedures are broken after master restarts
[ https://issues.apache.org/jira/browse/HBASE-21095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21095:
------------------------------
    Summary: The timeout retry logic for several procedures are broken after master restarts  (was: The timeout retry logic for several procedures are broken)

> The timeout retry logic for several procedures are broken after master restarts
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-21095
>                 URL: https://issues.apache.org/jira/browse/HBASE-21095
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2, proc-v2
>            Reporter: Duo Zhang
>            Priority: Major
>
> It also uses TRSP as sub procedure so probably we should set killIfHasParent
> to true, but the log is a bit interesting, that we just hang there without
> executing any procedures after a restart, but for other tests where we need
> to set killIfHasParent to true, we will keep executing procedures but do not
> make any progress.
> Need to dig more.