[jira] [Created] (STORM-2043) Nimbus should not make assignments crazy when Pacemaker down
Yuzhao Chen created STORM-2043: -- Summary: Nimbus should not make assignments crazy when Pacemaker down Key: STORM-2043 URL: https://issues.apache.org/jira/browse/STORM-2043 Project: Apache Storm Issue Type: Improvement Components: storm-core Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0 Environment: CentOS 6.5 Reporter: Yuzhao Chen Fix For: 1.0.2 When Pacemaker goes down, all the worker heartbeats are lost. Even if Pacemaker comes back up immediately, the heartbeats can take a long time to recover when they occupy dozens of GB of memory. While the worker heartbeats are incomplete, Nimbus will think the workers are dead (heartbeat timeout) and reassign them crazily. But the workers are actually healthy, so the reassignment will go in cycles until the Pacemaker heartbeats recover. During this time, the throughput of every topology will drop. We should avoid this, because Pacemaker has no HA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (STORM-2044) Nimb
Yuzhao Chen created STORM-2044: -- Summary: Nimb Key: STORM-2044 URL: https://issues.apache.org/jira/browse/STORM-2044 Project: Apache Storm Issue Type: Improvement Components: storm-core Affects Versions: 1.0.2 Environment: CentOS 6.5 Reporter: Yuzhao Chen Fix For: 1.1.0 Pacemaker is currently a stand-alone service with no HA. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned and throughput will drop sharply. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazy when Pacemaker down
[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuzhao Chen updated STORM-2044: --- Summary: Nimbus should not make assignments crazy when Pacemaker down (was: Nimb) > Nimbus should not make assignments crazy when Pacemaker down > > > Key: STORM-2044 > URL: https://issues.apache.org/jira/browse/STORM-2044 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Affects Versions: 1.0.2 > Environment: CentOS 6.5 > Reporter: Yuzhao Chen > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 672h > Remaining Estimate: 672h > > Pacemaker is currently a stand-alone service with no HA. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned and throughput will drop sharply. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down
[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuzhao Chen updated STORM-2044: --- Summary: Nimbus should not make assignments crazily when Pacemaker goes down (was: Nimbus should not make assignments crazy when Pacemaker down) > Nimbus should not make assignments crazily when Pacemaker goes down > --- > > Key: STORM-2044 > URL: https://issues.apache.org/jira/browse/STORM-2044 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Affects Versions: 1.0.2 > Environment: CentOS 6.5 > Reporter: Yuzhao Chen > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 672h > Remaining Estimate: 672h > > Pacemaker is currently a stand-alone service with no HA. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned and throughput will drop sharply. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down
[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuzhao Chen updated STORM-2044: --- Description: Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. This is not acceptable. So I think Pacemaker is not suitable for production while this problem exists. I see several ways to solve it: 1. Pacemaker HA 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers was: Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. This is not acceptable. So I think Pacemaker is not suitable for production while this problem exists. I see several ways to solve it: 1. Pacemaker HA 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers > Nimbus should not make assignments crazily when Pacemaker goes down > --- > > Key: STORM-2044 > URL: https://issues.apache.org/jira/browse/STORM-2044 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Affects Versions: 1.0.2 > Environment: CentOS 6.5 > Reporter: Yuzhao Chen > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 672h > Remaining Estimate: 672h > > Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. > This is not acceptable. > So I think Pacemaker is not suitable for production while this problem exists. > I see several ways to solve this problem: > 1. Pacemaker HA > 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down
[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuzhao Chen updated STORM-2044: --- Description: Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. This is not acceptable. So I think Pacemaker is not suitable for production while this problem exists. I see several ways to solve it: 1. Pacemaker HA 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers was: Pacemaker is currently a stand-alone service with no HA. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned and throughput will drop sharply. > Nimbus should not make assignments crazily when Pacemaker goes down > --- > > Key: STORM-2044 > URL: https://issues.apache.org/jira/browse/STORM-2044 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Affects Versions: 1.0.2 > Environment: CentOS 6.5 > Reporter: Yuzhao Chen > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 672h > Remaining Estimate: 672h > > Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. > This is not acceptable. > So I think Pacemaker is not suitable for production while this problem exists. > I see several ways to solve this problem: > 1. Pacemaker HA > 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
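Proposal 2 above (hold off reassignment while the heartbeat store is down) can be sketched in a few lines. This is a hypothetical illustration, not the actual Nimbus scheduler code: the class and method names (`HeartbeatStore`, `Scheduler`, `schedule_round`, the grace period) are all invented for the example.

```python
import time

# Hypothetical sketch of proposal 2: the scheduler skips reassignment rounds
# while the heartbeat store (standing in for Pacemaker) is unreachable or
# still within a post-restart grace period. Illustrative names throughout.

class HeartbeatStore:
    """Stand-in for a Pacemaker client; not a real Storm API."""
    def __init__(self):
        self.available = True
        self.recovered_at = 0.0  # epoch seconds of the last restart

    def is_healthy(self, grace_seconds=60.0):
        # Healthy only if the store is up AND has been up long enough
        # for workers to have re-sent their heartbeats.
        return self.available and (time.time() - self.recovered_at) >= grace_seconds

class Scheduler:
    def __init__(self, store):
        self.store = store
        self.reassignments = []

    def schedule_round(self, timed_out_workers):
        if not self.store.is_healthy():
            # Heartbeats may be missing only because the store is down:
            # do nothing rather than reassign healthy workers.
            return "skipped"
        self.reassignments.extend(timed_out_workers)
        return "reassigned"

store = HeartbeatStore()
sched = Scheduler(store)

# The store crashes: every worker looks timed out, but the round is skipped.
store.available = False
r1 = sched.schedule_round(["worker-1", "worker-2"])

# The store restarts: still inside the grace period, keep skipping.
store.available = True
store.recovered_at = time.time()
r2 = sched.schedule_round(["worker-1"])

# After the grace period, timeouts are trustworthy again.
store.recovered_at = time.time() - 120.0
r3 = sched.schedule_round(["worker-3"])

print(r1, r2, r3)            # skipped skipped reassigned
print(sched.reassignments)   # ['worker-3']
```

The grace period matters: declaring the store healthy the instant it restarts would reintroduce the bug, since the heartbeat backlog has not yet been repopulated.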
[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down
[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuzhao Chen updated STORM-2044: --- Description: Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. This is not acceptable. So I think Pacemaker is not suitable for production while this problem exists. I see several ways to solve it: 1. Pacemaker HA 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers was: Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. This is not acceptable. So I think Pacemaker is not suitable for production while this problem exists. I see several ways to solve it: 1. Pacemaker HA 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers > Nimbus should not make assignments crazily when Pacemaker goes down > --- > > Key: STORM-2044 > URL: https://issues.apache.org/jira/browse/STORM-2044 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Affects Versions: 1.0.2 > Environment: CentOS 6.5 > Reporter: Yuzhao Chen > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 672h > Remaining Estimate: 672h > > Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. > This is not acceptable. > So I think Pacemaker is not suitable for production while this problem exists. > I see several ways to solve this problem: > 1. Pacemaker HA > 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425987#comment-15425987 ] Yuzhao Chen commented on STORM-1949: I have tested Storm 1.0.1 ABP in our staging cluster [10 machine nodes and a topology with 32 workers, 120 executors]; the topology will very likely stall within 10 minutes. I agree with @Roshan Naik > Backpressure can cause spout to stop emitting and stall topology > > > Key: STORM-1949 > URL: https://issues.apache.org/jira/browse/STORM-1949 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Attachments: 1.x-branch-works-perfect.png > > > The problem can be reproduced with this [Word count topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] within an IDE. > I ran it with 1 spout instance, 2 splitter bolt instances, and 2 counter bolt instances. > The problem is more easily reproduced with the WC topology because it causes an explosion of tuples by splitting each sentence tuple into word tuples. As the bolts have to process more tuples than the spout is producing, the spout needs to operate more slowly. > The amount of time it takes for the topology to stall can vary, but it is typically under 10 minutes. > *My theory:* I suspect there is a race condition in the way ZK is used to enable/disable backpressure. When congested (i.e. pressure exceeds the high water mark), the bolt's worker records this congested situation in ZK by creating a node. Once the congestion drops below the low water mark, it deletes this node. > The spout's worker has set up a watch on the parent node, expecting a callback whenever the child nodes change. On receiving the callback, the spout's worker lists the parent node to check whether there are 0 or more child nodes; it is essentially trying to figure out the nature of the state change in ZK to determine whether to throttle or not. It then sets up another watch in ZK to keep an eye on future changes. > When there are multiple bolts, these ZK nodes can be created and deleted rapidly. Between the time the worker receives a callback and sets up the next watch, many changes may have occurred in ZK that go unnoticed by the spout. > As a result, the condition that the bolts are no longer congested may not get noticed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
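The race in the theory above can be reproduced without ZooKeeper at all: if a one-shot watch is re-armed only after the callback has been handled, any change that lands in the gap is silently lost. The following is a toy model in plain Python; `OneShotWatchStore` and its methods are invented stand-ins for a znode parent and its children, not real ZK client calls.

```python
# Toy model of the one-shot-watch race described above (no real ZK involved;
# all names are illustrative). A set stands in for the backpressure znodes,
# and the watch is consumed when it fires, like a ZooKeeper watch.

class OneShotWatchStore:
    def __init__(self):
        self.children = set()
        self.watcher = None

    def set_watch(self, callback):
        self.watcher = callback

    def _fire(self):
        # One-shot semantics: the watch is consumed when it fires.
        watcher, self.watcher = self.watcher, None
        if watcher:
            watcher()

    def create(self, name):
        self.children.add(name)
        self._fire()

    def delete(self, name):
        self.children.discard(name)
        self._fire()

store = OneShotWatchStore()
seen = []

def on_change():
    # Spout-side logic: inspect the child count to decide throttling.
    # The watch is deliberately NOT re-armed here, modeling the gap
    # between the callback and the next list-and-watch call.
    seen.append(len(store.children))

store.set_watch(on_change)
store.create("bolt-1")      # watch fires: spout sees 1 child -> throttle on
store.delete("bolt-1")      # no watch armed: "congestion cleared" is missed
store.set_watch(on_change)  # re-armed only after the deletion already happened

print(seen)                  # [1] -> the spout still believes it must throttle
print(len(store.children))   # 0  -> actual state: no congested bolts
```

The spout is left throttled forever even though no bolt is congested, which matches the stall behavior reported in the comment.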
[jira] [Comment Edited] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology
[ https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425987#comment-15425987 ] Yuzhao Chen edited comment on STORM-1949 at 8/18/16 7:01 AM: - I have tested Storm 1.0.1 ABP in our staging cluster [10 machine nodes and a topology with 32 workers, 120 executors]; the topology will very likely stall within 10 minutes. I see that when all the topology's tasks have throttled off, the backpressure root node in ZK still has child nodes, which causes the "stall". If I delete the child nodes manually, the spouts resume consuming normally (but stall again within 10 minutes), so I agree with @Roshan Naik was (Author: danny0405): I have tested Storm 1.0.1 ABP in our staging cluster [10 machine nodes and a topology with 32 workers, 120 executors]; the topology will very likely stall within 10 minutes. I agree with @Roshan Naik > Backpressure can cause spout to stop emitting and stall topology > > > Key: STORM-1949 > URL: https://issues.apache.org/jira/browse/STORM-1949 > Project: Apache Storm > Issue Type: Bug > Reporter: Roshan Naik > Attachments: 1.x-branch-works-perfect.png > > > The problem can be reproduced with this [Word count topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java] within an IDE. > I ran it with 1 spout instance, 2 splitter bolt instances, and 2 counter bolt instances. > The problem is more easily reproduced with the WC topology because it causes an explosion of tuples by splitting each sentence tuple into word tuples. As the bolts have to process more tuples than the spout is producing, the spout needs to operate more slowly. > The amount of time it takes for the topology to stall can vary, but it is typically under 10 minutes. > *My theory:* I suspect there is a race condition in the way ZK is used to enable/disable backpressure. When congested (i.e. pressure exceeds the high water mark), the bolt's worker records this congested situation in ZK by creating a node. Once the congestion drops below the low water mark, it deletes this node. > The spout's worker has set up a watch on the parent node, expecting a callback whenever the child nodes change. On receiving the callback, the spout's worker lists the parent node to check whether there are 0 or more child nodes; it is essentially trying to figure out the nature of the state change in ZK to determine whether to throttle or not. It then sets up another watch in ZK to keep an eye on future changes. > When there are multiple bolts, these ZK nodes can be created and deleted rapidly. Between the time the worker receives a callback and sets up the next watch, many changes may have occurred in ZK that go unnoticed by the spout. > As a result, the condition that the bolts are no longer congested may not get noticed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
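One way to make such a watcher robust (a sketch only, not the actual Storm fix) is to make it level-triggered: treat the watch purely as a wake-up signal, and read the full child list at the same moment the next watch is registered, the way ZooKeeper's getChildren-with-watch couples listing and registration. Continuing the toy-store idea, with all names invented for illustration:

```python
# Sketch of a level-triggered fix for the one-shot-watch race: the child
# listing and the watch re-registration happen in one step, so the decision
# is always based on current state, never on a possibly stale event.
# (Illustrative code; the real Storm fix may differ.)

class OneShotWatchStore:
    def __init__(self):
        self.children = set()
        self.watcher = None

    def get_children_and_watch(self, callback):
        # Mirrors the shape of ZooKeeper getChildren(path, watch=True):
        # listing and watch registration are atomic here, so no change
        # can fall into a gap between them.
        self.watcher = callback
        return set(self.children)

    def _fire(self):
        watcher, self.watcher = self.watcher, None  # one-shot: consumed on fire
        if watcher:
            watcher()

    def create(self, name):
        self.children.add(name)
        self._fire()

    def delete(self, name):
        self.children.discard(name)
        self._fire()

store = OneShotWatchStore()
throttle_states = []

def on_change():
    # Wake-up only: re-read current state and re-arm in a single call,
    # then decide throttling from the snapshot.
    snapshot = store.get_children_and_watch(on_change)
    throttle_states.append(bool(snapshot))

store.get_children_and_watch(on_change)  # initial arm
store.create("bolt-1")   # callback sees {'bolt-1'} -> throttle on, re-armed
store.delete("bolt-1")   # callback sees {}         -> throttle off

print(throttle_states)   # [True, False] -> the cleared state is not lost
```

Because every callback re-reads the whole list, intermediate create/delete events can still be missed, but the final observed state always converges to the true one, which is exactly what the throttle decision needs.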
[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down and up
[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuzhao Chen updated STORM-2044: --- Summary: Nimbus should not make assignments crazily when Pacemaker goes down and up (was: Nimbus should not make assignments crazily when Pacemaker goes down) > Nimbus should not make assignments crazily when Pacemaker goes down and up > -- > > Key: STORM-2044 > URL: https://issues.apache.org/jira/browse/STORM-2044 > Project: Apache Storm > Issue Type: Improvement > Components: storm-core > Affects Versions: 1.0.2 > Environment: CentOS 6.5 > Reporter: Yuzhao Chen > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 672h > Remaining Estimate: 672h > > Pacemaker is currently a stand-alone service and no HA is supported. When it goes down, all the workers' heartbeats are lost. Even if Pacemaker comes back up immediately, recovery can take a long time when there are dozens of GB of heartbeats. While the worker heartbeats are not fully restored, Nimbus will think these workers are dead because of heartbeat timeouts and will reassign these "dead" workers continuously until the heartbeats return to normal. So, during the recovery time, many topologies will be reassigned continuously and throughput will drop sharply. > This is not acceptable. > So I think Pacemaker is not suitable for production while this problem exists. > I see several ways to solve this problem: > 1. Pacemaker HA > 2. when Pacemaker goes down, notify Nimbus not to reassign any more until it recovers -- This message was sent by Atlassian JIRA (v6.3.4#6332)