[jira] [Created] (STORM-2043) Nimbus should not make assignments crazy when Pacemaker down

2016-08-17 Thread Yuzhao Chen (JIRA)
Yuzhao Chen created STORM-2043:
--

 Summary: Nimbus should not make assignments crazy when Pacemaker 
down
 Key: STORM-2043
 URL: https://issues.apache.org/jira/browse/STORM-2043
 Project: Apache Storm
  Issue Type: Improvement
  Components: storm-core
Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0
 Environment: CentOS 6.5 
Reporter: Yuzhao Chen
 Fix For: 1.0.2


When Pacemaker goes down, all of the workers' heartbeats are lost. Even if 
Pacemaker comes back up immediately, the heartbeats can take a long time to 
recover when they occupy dozens of GB of memory. While the worker heartbeats 
are incomplete, Nimbus will consider the workers dead (heartbeat timeout) and 
reassign them crazily. But the workers are actually healthy, so the 
reassignment cycles until the Pacemaker heartbeats recover, and during this 
time the throughput of every topology drops. We should avoid this, because 
Pacemaker has no HA.
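For context, here is a minimal sketch of the kind of heartbeat-timeout check 
that makes Nimbus declare a worker dead. The names and structure are 
hypothetical (storm-core's 1.x Nimbus is actually written in Clojure); only 
the referenced config name, nimbus.task.timeout.secs, is real:

{code:java}
import java.util.Map;

// Hypothetical sketch: a heartbeat-timeout check like the one Nimbus
// applies. When Pacemaker is down or still reloading, the last heartbeat
// is missing even for healthy workers, so they are wrongly declared dead
// and their executors get reassigned.
public class HeartbeatTimeoutSketch {
    static final long TASK_TIMEOUT_SECS = 30; // cf. nimbus.task.timeout.secs

    /** heartbeats: executor id -> last heartbeat time, in epoch seconds. */
    static boolean isExecutorDead(Map<String, Long> heartbeats,
                                  String executor, long nowSecs) {
        Long lastBeat = heartbeats.get(executor);
        return lastBeat == null || nowSecs - lastBeat > TASK_TIMEOUT_SECS;
    }
}
{code}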





[jira] [Created] (STORM-2044) Nimb

2016-08-17 Thread Yuzhao Chen (JIRA)
Yuzhao Chen created STORM-2044:
--

 Summary: Nimb
 Key: STORM-2044
 URL: https://issues.apache.org/jira/browse/STORM-2044
 Project: Apache Storm
  Issue Type: Improvement
  Components: storm-core
Affects Versions: 1.0.2
 Environment: CentOS 6.5
Reporter: Yuzhao Chen
 Fix For: 1.1.0


Pacemaker is currently a stand-alone service with no HA. When it goes down, 
all of the workers' heartbeats are lost. Even if Pacemaker comes back up 
immediately, recovery will take a long time if there are dozens of GB of 
heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned and throughput 
will drop sharply.





[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazy when Pacemaker down

2016-08-17 Thread Yuzhao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
---
Summary: Nimbus should not make assignments crazy when Pacemaker down  
(was: Nimb)

> Nimbus should not make assignments crazy when Pacemaker down
> 
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service with no HA. When it goes down, 
> all of the workers' heartbeats are lost. Even if Pacemaker comes back up 
> immediately, recovery will take a long time if there are dozens of GB of 
> heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
> consider these workers dead because of heartbeat timeout and will reassign 
> these "dead" workers continuously until the heartbeats return to normal. So, 
> during the recovery time, many topologies will be reassigned and throughput 
> will drop sharply.





[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down

2016-08-17 Thread Yuzhao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
---
Summary: Nimbus should not make assignments crazily when Pacemaker goes 
down  (was: Nimbus should not make assignments crazy when Pacemaker down)

> Nimbus should not make assignments crazily when Pacemaker goes down
> ---
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service with no HA. When it goes down, 
> all of the workers' heartbeats are lost. Even if Pacemaker comes back up 
> immediately, recovery will take a long time if there are dozens of GB of 
> heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
> consider these workers dead because of heartbeat timeout and will reassign 
> these "dead" workers continuously until the heartbeats return to normal. So, 
> during the recovery time, many topologies will be reassigned and throughput 
> will drop sharply.





[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down

2016-08-17 Thread Yuzhao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
---
Description: 
Pacemaker is currently a stand-alone service and no HA is supported. When it 
goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes 
back up immediately, recovery will take a long time if there are dozens of GB 
of heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned continuously and 
throughput will drop sharply.
This is not acceptable.
So I think Pacemaker is not suitable for production as long as the problem 
above exists.
I can think of several ways to solve this problem:
  1. Pacemaker HA
  2. when Pacemaker goes down, notify Nimbus not to reassign anything until it 
recovers

  was:
Pacemaker is currently a stand-alone service and no HA is supported. When it 
goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes 
back up immediately, recovery will take a long time if there are dozens of GB 
of heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned continuously and 
throughput will drop sharply.
This is not acceptable.
So I think Pacemaker is not suitable for production as long as the problem 
above exists.
I can think of several ways to solve this problem:
1. Pacemaker HA
2. when Pacemaker goes down, notify Nimbus not to reassign anything until it 
recovers


> Nimbus should not make assignments crazily when Pacemaker goes down
> ---
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service and no HA is supported. When 
> it goes down, all of the workers' heartbeats are lost. Even if Pacemaker 
> comes back up immediately, recovery will take a long time if there are 
> dozens of GB of heartbeats. While the worker heartbeats are not fully 
> restored, Nimbus will consider these workers dead because of heartbeat 
> timeout and will reassign these "dead" workers continuously until the 
> heartbeats return to normal. So, during the recovery time, many topologies 
> will be reassigned continuously and throughput will drop sharply.
> This is not acceptable.
> So I think Pacemaker is not suitable for production as long as the 
> problem above exists.
> I can think of several ways to solve this problem:
>   1. Pacemaker HA
>   2. when Pacemaker goes down, notify Nimbus not to reassign anything 
> until it recovers (sketched below)
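Option 2 could look roughly like the guard sketched here. This is a hedged 
illustration with hypothetical names: PacemakerMonitor and this mkAssignments 
hook do not exist in storm-core, whose 1.x Nimbus is written in Clojure.

{code:java}
// Hypothetical sketch of option 2: pause reassignment while Pacemaker is
// unreachable or still reloading heartbeats, so heartbeat timeouts are not
// treated as worker deaths. All names are illustrative only.
public class PauseSchedulingSketch {

    /** Assumed probe, e.g. a successful ping/connection to Pacemaker. */
    interface PacemakerMonitor {
        boolean isHealthy();
    }

    private final PacemakerMonitor monitor;

    PauseSchedulingSketch(PacemakerMonitor monitor) {
        this.monitor = monitor;
    }

    /** Called periodically by the (hypothetical) scheduling loop. */
    void mkAssignments() {
        if (!monitor.isHealthy()) {
            // The heartbeat store is down or incomplete: any timeout now is
            // a false positive, so skip this round instead of reassigning
            // healthy workers.
            return;
        }
        // ... normal heartbeat-timeout checks and (re)assignment go here ...
    }
}
{code}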





[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down

2016-08-17 Thread Yuzhao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
---
Description: 
Pacemaker is currently a stand-alone service and no HA is supported. When it 
goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes 
back up immediately, recovery will take a long time if there are dozens of GB 
of heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned continuously and 
throughput will drop sharply.
This is not acceptable.
So I think Pacemaker is not suitable for production as long as the problem 
above exists.
I can think of several ways to solve this problem:
1. Pacemaker HA
2. when Pacemaker goes down, notify Nimbus not to reassign anything until it 
recovers

  was:Pacemaker is currently a stand-alone service with no HA. When it goes 
down, all of the workers' heartbeats are lost. Even if Pacemaker comes back up 
immediately, recovery will take a long time if there are dozens of GB of 
heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned and throughput 
will drop sharply.


> Nimbus should not make assignments crazily when Pacemaker goes down
> ---
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service and no HA is supported. When 
> it goes down, all of the workers' heartbeats are lost. Even if Pacemaker 
> comes back up immediately, recovery will take a long time if there are 
> dozens of GB of heartbeats. While the worker heartbeats are not fully 
> restored, Nimbus will consider these workers dead because of heartbeat 
> timeout and will reassign these "dead" workers continuously until the 
> heartbeats return to normal. So, during the recovery time, many topologies 
> will be reassigned continuously and throughput will drop sharply.
> This is not acceptable.
> So I think Pacemaker is not suitable for production as long as the 
> problem above exists.
> I can think of several ways to solve this problem:
> 1. Pacemaker HA
> 2. when Pacemaker goes down, notify Nimbus not to reassign anything until 
> it recovers





[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down

2016-08-17 Thread Yuzhao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
---
Description: 
Pacemaker is currently a stand-alone service and no HA is supported. When it 
goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes 
back up immediately, recovery will take a long time if there are dozens of GB 
of heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned continuously and 
throughput will drop sharply.
This is not acceptable.
So I think Pacemaker is not suitable for production as long as the problem 
above exists.
I can think of several ways to solve this problem:
  1. Pacemaker HA
  2. when Pacemaker goes down, notify Nimbus not to reassign anything until it 
recovers

  was:
Pacemaker is currently a stand-alone service and no HA is supported. When it 
goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes 
back up immediately, recovery will take a long time if there are dozens of GB 
of heartbeats. While the worker heartbeats are not fully restored, Nimbus will 
consider these workers dead because of heartbeat timeout and will reassign 
these "dead" workers continuously until the heartbeats return to normal. So, 
during the recovery time, many topologies will be reassigned continuously and 
throughput will drop sharply.
This is not acceptable.
So I think Pacemaker is not suitable for production as long as the problem 
above exists.
I can think of several ways to solve this problem:
  1. Pacemaker HA
  2. when Pacemaker goes down, notify Nimbus not to reassign anything until it 
recovers


> Nimbus should not make assignments crazily when Pacemaker goes down
> ---
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service and no HA is supported. When 
> it goes down, all of the workers' heartbeats are lost. Even if Pacemaker 
> comes back up immediately, recovery will take a long time if there are 
> dozens of GB of heartbeats. While the worker heartbeats are not fully 
> restored, Nimbus will consider these workers dead because of heartbeat 
> timeout and will reassign these "dead" workers continuously until the 
> heartbeats return to normal. So, during the recovery time, many topologies 
> will be reassigned continuously and throughput will drop sharply.
> This is not acceptable.
> So I think Pacemaker is not suitable for production as long as the 
> problem above exists.
> I can think of several ways to solve this problem:
>   1. Pacemaker HA
>   2. when Pacemaker goes down, notify Nimbus not to reassign anything 
> until it recovers





[jira] [Commented] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology

2016-08-17 Thread Yuzhao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425987#comment-15425987
 ] 

Yuzhao Chen commented on STORM-1949:


I have tested Storm 1.0.1 with automatic backpressure (ABP) on in our staging 
cluster [10 machine nodes and a topology with 32 workers, 120 executors]; the 
topology will very likely stall within 10 minutes. I agree with @Roshan Naik.

> Backpressure can cause spout to stop emitting and stall topology
> 
>
> Key: STORM-1949
> URL: https://issues.apache.org/jira/browse/STORM-1949
> Project: Apache Storm
>  Issue Type: Bug
>Reporter: Roshan Naik
> Attachments: 1.x-branch-works-perfect.png
>
>
> The problem can be reproduced by this [Word count 
> topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
> within an IDE.
> I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt 
> instances.
> The problem is more easily reproduced with the WC topology because it causes 
> an explosion of tuples by splitting a sentence tuple into word tuples. As 
> the bolts have to process more tuples than the spout is producing, the spout 
> needs to operate slower.
> The amount of time it takes for the topology to stall can vary, but it is 
> typically under 10 mins.
> *My theory:* I suspect there is a race condition in the way ZK is being 
> utilized to enable/disable back pressure. When congested (i.e. pressure 
> exceeds the high water mark), the bolt's worker records this congested 
> situation in ZK by creating a node. Once the congestion drops below the low 
> water mark, it deletes this node.
> The spout's worker has set up a watch on the parent node, expecting a 
> callback whenever there is a change in the child nodes. On receiving the 
> callback, the spout's worker lists the parent node to check whether there 
> are 0 or more child nodes; it is essentially trying to figure out the nature 
> of the state change in ZK to determine whether to throttle or not. 
> Subsequently it sets up another watch in ZK to keep an eye on future 
> changes.
> When there are multiple bolts, there can be rapid creation/deletion of these 
> ZK nodes. Between the time the worker receives a callback and sets up the 
> next watch, many changes may have occurred in ZK that will go unnoticed by 
> the spout.
> As a result, the condition that the bolts are no longer congested may not 
> get noticed.
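The one-shot nature of ZK watches makes this gap concrete. Below is a minimal 
sketch using the plain ZooKeeper client; the /backpressure path and class 
names are assumptions for illustration, not Storm's actual code:

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the spout-side watch loop described above. ZooKeeper watches
// are one-shot: between the callback firing and the next getChildren()
// re-arming the watch, any number of child creations/deletions can happen
// unobserved -- which is exactly the suspected race.
public class BackpressureWatchSketch implements Watcher {
    private static final String PARENT = "/backpressure/topo-1"; // assumed path
    private final ZooKeeper zk;
    private volatile boolean throttleOn;

    public BackpressureWatchSketch(String connectString) throws IOException {
        this.zk = new ZooKeeper(connectString, 30_000, this);
    }

    boolean isThrottleOn() {
        return throttleOn;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                // Listing the children re-arms the watch, but only once this
                // call returns; deletions in the gap are silently missed, so
                // throttleOn can get stuck at true and stall the spout.
                List<String> congested = zk.getChildren(PARENT, this);
                throttleOn = !congested.isEmpty();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (KeeperException e) {
                // A real implementation would retry / re-arm the watch here.
            }
        }
    }
}
{code}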





[jira] [Comment Edited] (STORM-1949) Backpressure can cause spout to stop emitting and stall topology

2016-08-18 Thread Yuzhao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425987#comment-15425987
 ] 

Yuzhao Chen edited comment on STORM-1949 at 8/18/16 7:01 AM:
-

I have tested Storm 1.0.1 with automatic backpressure (ABP) on in our staging 
cluster [10 machine nodes and a topology with 32 workers, 120 executors]; the 
topology will very likely stall within 10 minutes.
I see that even when all of the topology's tasks have turned throttling off, 
the backpressure root node on ZK still has child nodes, which causes the 
"stall". If I delete the child nodes manually, the spouts continue to consume 
normally (but the topology will stall again within 10 minutes), so I agree 
with @Roshan Naik.
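The manual cleanup described above can be scripted. A hedged sketch with the 
plain ZooKeeper client follows; the znode path layout is an assumption, so 
verify it against your cluster (e.g. with zkCli.sh) before running anything 
like this:

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the workaround above: delete leftover backpressure child
// znodes so the throttled spouts resume. The path is assumed, not taken
// from Storm's code; double-check it on your cluster first.
public class ClearBackpressureNodes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
        String parent = "/storm/backpressure/topo-1"; // assumed path
        List<String> children = zk.getChildren(parent, false);
        for (String child : children) {
            zk.delete(parent + "/" + child, -1); // -1 ignores znode version
        }
        zk.close();
    }
}
{code}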


was (Author: danny0405):
I have tested Storm 1.0.1 with automatic backpressure (ABP) on in our staging 
cluster [10 machine nodes and a topology with 32 workers, 120 executors]; the 
topology will very likely stall within 10 minutes. I agree with @Roshan Naik.

> Backpressure can cause spout to stop emitting and stall topology
> 
>
> Key: STORM-1949
> URL: https://issues.apache.org/jira/browse/STORM-1949
> Project: Apache Storm
>  Issue Type: Bug
>Reporter: Roshan Naik
> Attachments: 1.x-branch-works-perfect.png
>
>
> The problem can be reproduced by this [Word count 
> topology|https://github.com/hortonworks/storm/blob/perftopos1.x/examples/storm-starter/src/jvm/org/apache/storm/starter/perf/FileReadWordCountTopo.java]
> within an IDE.
> I ran it with 1 spout instance, 2 splitter bolt instances, 2 counter bolt 
> instances.
> The problem is more easily reproduced with the WC topology because it causes 
> an explosion of tuples by splitting a sentence tuple into word tuples. As 
> the bolts have to process more tuples than the spout is producing, the spout 
> needs to operate slower.
> The amount of time it takes for the topology to stall can vary, but it is 
> typically under 10 mins.
> *My theory:* I suspect there is a race condition in the way ZK is being 
> utilized to enable/disable back pressure. When congested (i.e. pressure 
> exceeds the high water mark), the bolt's worker records this congested 
> situation in ZK by creating a node. Once the congestion drops below the low 
> water mark, it deletes this node.
> The spout's worker has set up a watch on the parent node, expecting a 
> callback whenever there is a change in the child nodes. On receiving the 
> callback, the spout's worker lists the parent node to check whether there 
> are 0 or more child nodes; it is essentially trying to figure out the nature 
> of the state change in ZK to determine whether to throttle or not. 
> Subsequently it sets up another watch in ZK to keep an eye on future 
> changes.
> When there are multiple bolts, there can be rapid creation/deletion of these 
> ZK nodes. Between the time the worker receives a callback and sets up the 
> next watch, many changes may have occurred in ZK that will go unnoticed by 
> the spout.
> As a result, the condition that the bolts are no longer congested may not 
> get noticed.





[jira] [Updated] (STORM-2044) Nimbus should not make assignments crazily when Pacemaker goes down and up

2016-08-18 Thread Yuzhao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
---
Summary: Nimbus should not make assignments crazily when Pacemaker goes 
down and up  (was: Nimbus should not make assignments crazily when Pacemaker 
goes down)

> Nimbus should not make assignments crazily when Pacemaker goes down and up
> --
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
>  Issue Type: Improvement
>  Components: storm-core
>Affects Versions: 1.0.2
> Environment: CentOS 6.5
>Reporter: Yuzhao Chen
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service and no HA is supported. When 
> it goes down, all of the workers' heartbeats are lost. Even if Pacemaker 
> comes back up immediately, recovery will take a long time if there are 
> dozens of GB of heartbeats. While the worker heartbeats are not fully 
> restored, Nimbus will consider these workers dead because of heartbeat 
> timeout and will reassign these "dead" workers continuously until the 
> heartbeats return to normal. So, during the recovery time, many topologies 
> will be reassigned continuously and throughput will drop sharply.
> This is not acceptable.
> So I think Pacemaker is not suitable for production as long as the 
> problem above exists.
> I can think of several ways to solve this problem:
>   1. Pacemaker HA
>   2. when Pacemaker goes down, notify Nimbus not to reassign anything 
> until it recovers


