[ https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuzhao Chen updated STORM-2044:
-------------------------------
    Summary: Nimbus should not make assignments crazily when Pacemaker goes down and up  (was: Nimbus should not make assignments crazily when Pacemaker goes down)

> Nimbus should not make assignments crazily when Pacemaker goes down and up
> --------------------------------------------------------------------------
>
>                 Key: STORM-2044
>                 URL: https://issues.apache.org/jira/browse/STORM-2044
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 1.0.2
>         Environment: CentOS 6.5
>            Reporter: Yuzhao Chen
>              Labels: patch
>             Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service with no HA support. When it
> goes down, all of the workers' heartbeats are lost. Even if Pacemaker comes
> back up immediately, recovery takes a long time when there are dozens of GB
> of heartbeats. Until the worker heartbeats are completely restored, Nimbus
> treats these workers as dead because their heartbeats have timed out, and it
> keeps reassigning these "dead" workers until the heartbeats return to
> normal. As a result, many topologies are reassigned continuously during the
> recovery period and throughput drops sharply.
> This is not acceptable.
> So I think Pacemaker is not suitable for production as long as the problem
> above exists.
> I can think of several ways to solve this problem:
> 1. Pacemaker HA
> 2. When Pacemaker goes down, notify Nimbus not to make any more
> reassignments until it recovers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
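
The second proposal in the description (hold reassignment while Pacemaker is down and while heartbeats are still being restored) can be illustrated with a minimal sketch. This is not Storm code: `HeartbeatMonitor`, `store_down`, `store_up`, and `recovery_grace_secs` are hypothetical names invented for this example, and the fixed grace period after recovery is an assumption rather than something the issue specifies. Time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical stand-in for the Nimbus-side worker liveness check."""

    heartbeat_timeout_secs: float  # a worker counts as dead past this heartbeat age
    recovery_grace_secs: float     # assumed hold-off after the heartbeat store returns
    last_heartbeat: Dict[str, float] = field(default_factory=dict)
    store_available: bool = True
    store_recovered_at: Optional[float] = None

    def record_heartbeat(self, worker_id: str, now: float) -> None:
        self.last_heartbeat[worker_id] = now

    def store_down(self) -> None:
        # Called when the heartbeat store (Pacemaker in the issue) is unreachable.
        self.store_available = False

    def store_up(self, now: float) -> None:
        # Called when the store is reachable again; heartbeats may still be stale.
        self.store_available = True
        self.store_recovered_at = now

    def reassignment_suppressed(self, now: float) -> bool:
        if not self.store_available:
            return True
        if self.store_recovered_at is None:
            return False
        return now - self.store_recovered_at < self.recovery_grace_secs

    def workers_to_reassign(self, now: float) -> List[str]:
        # While suppressed, stale heartbeats are not treated as worker deaths.
        if self.reassignment_suppressed(now):
            return []
        return sorted(
            worker
            for worker, seen in self.last_heartbeat.items()
            if now - seen > self.heartbeat_timeout_secs
        )


monitor = HeartbeatMonitor(heartbeat_timeout_secs=30, recovery_grace_secs=120)
monitor.record_heartbeat("w1", now=0)
monitor.record_heartbeat("w2", now=0)

monitor.store_down()
print(monitor.workers_to_reassign(now=100))  # [] : store is down, nothing is reassigned

monitor.store_up(now=110)
print(monitor.workers_to_reassign(now=200))  # [] : still inside the grace period

monitor.record_heartbeat("w1", now=230)
print(monitor.workers_to_reassign(now=240))  # ['w2'] : only the worker that never reported back
```

With this kind of gate, a worker is declared dead only if it still has no fresh heartbeat once the grace period has elapsed, so a Pacemaker restart alone does not trigger a wave of reassignments.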