[ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
-------------------------------
    Summary: Nimbus should not make assignments crazily when Pacemaker goes 
down and up  (was: Nimbus should not make assignments crazily when Pacemaker 
goes down)

> Nimbus should not make assignments crazily when Pacemaker goes down and up
> --------------------------------------------------------------------------
>
>                 Key: STORM-2044
>                 URL: https://issues.apache.org/jira/browse/STORM-2044
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 1.0.2
>         Environment: CentOS 6.5
>            Reporter: Yuzhao Chen
>              Labels: patch
>             Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
>         Pacemaker is currently a stand-alone service with no HA support. When 
> it goes down, all worker heartbeats are lost. If there are dozens of GB of 
> heartbeats, recovery takes a long time even if Pacemaker comes back up 
> immediately. Until the worker heartbeats are completely restored, Nimbus 
> treats these workers as dead because their heartbeats have timed out, and it 
> keeps reassigning these "dead" workers until the heartbeats return to 
> normal. So, during the recovery period, many topologies are reassigned 
> continuously and throughput drops sharply.
>         This is not acceptable. 
>         As long as this problem exists, I think Pacemaker is not suitable 
> for production.
>         I see several ways to solve this problem:
>               1. Pacemaker HA
>               2. When Pacemaker goes down, notify Nimbus not to make any 
> more reassignments until it recovers
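
Option 2 above could be sketched roughly as follows. This is only an illustration of the idea, not Storm code: the class `ReassignmentGuard`, its methods, and the `recovery_grace_s` parameter are all hypothetical names, and the sketch assumes Nimbus can be told when the heartbeat store goes down and comes back up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReassignmentGuard:
    """Hypothetical guard, not part of Apache Storm.

    Suppresses dead-worker decisions while the heartbeat store is down
    and for a grace period after it comes back up.
    """
    heartbeat_timeout_s: float          # worker counts as dead after this much silence
    recovery_grace_s: float             # time allowed for heartbeats to be restored
    store_up: bool = True
    recovered_at: Optional[float] = None

    def on_store_down(self) -> None:
        # Heartbeat store is unreachable: stop trusting heartbeat ages.
        self.store_up = False
        self.recovered_at = None

    def on_store_up(self, now: float) -> None:
        # Store is reachable again: start the recovery grace period.
        self.store_up = True
        self.recovered_at = now

    def in_recovery(self, now: float) -> bool:
        if not self.store_up:
            return True
        if self.recovered_at is None:
            return False
        return now - self.recovered_at < self.recovery_grace_s

    def workers_to_reassign(self, last_heartbeat: Dict[str, float],
                            now: float) -> List[str]:
        # During an outage or recovery, report no dead workers at all.
        if self.in_recovery(now):
            return []
        return sorted(worker for worker, seen in last_heartbeat.items()
                      if now - seen > self.heartbeat_timeout_s)
```

With a 30-second timeout and a 120-second grace period, a worker last seen 70 seconds ago is reported as dead in normal operation, but nothing is reported while the store is down or within 120 seconds of its return. The trade-off is that workers that really died during the outage are only detected after the grace period ends.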



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
