[ https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rick Kellogg updated STORM-397:
-------------------------------
    Component/s: storm-core

> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
>                 Key: STORM-397
>                 URL: https://issues.apache.org/jira/browse/STORM-397
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.9.2-incubating
>         Environment: 2 topologies, 3 supervisors
>            Reporter: Simon Cooper
>            Priority: Critical
>         Attachments: nimbus.log, storm.png
>
>
> We're running two topologies on a cluster with 3 supervisors. By default,
> both topologies are assigned onto the same supervisor. If that supervisor
> dies, storm reassigns one topology to another supervisor but not the other,
> leaving the second topology inactive.
> There are various symptoms/possible causes of this problem. In the nimbus
> logs, from when the topologies are initially submitted, nimbus is continually
> trying to reassign the second topology to the same supervisor every 10
> seconds:
> {noformat}
> 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
> 2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
> 2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{:master-code-dir "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
> {noformat}
> These log messages continue after the supervisor it's running on dies -
> nimbus continually tries to reassign to a dead supervisor.
> Note that the other topology is reassigned elsewhere without problems.
> If the broken topology is rebalanced, only then does nimbus assign the
> topology to a working supervisor.
> Another symptom of this is that, when the machines running storm are started,
> only one topology is running on startup. The second topology is not assigned
> to a supervisor. Again, it takes a rebalance for nimbus to actually assign
> the topology somewhere.
> A couple of possibly related bugs are STORM-256 and STORM-341, but I don't
> understand those bugs well enough to link them to these problems.
> This is a major issue for us. One of the reasons for using storm is that if a
> supervisor were to die, storm would automatically fail over to another
> supervisor. This does not happen, leaving our cluster with a single point of
> failure (SPOF).
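
The symptom described above (the same assignment being re-emitted for one topology every 10 seconds) can be checked mechanically against a Nimbus log. The following is only a rough sketch, assuming each "Setting new assignment" entry sits on a single line in the format quoted in this report. The function names `parse_assignments` and `stuck_topologies` and the `min_repeats` threshold are hypothetical helpers, not part of Storm.

```python
import re
from collections import defaultdict

# Matches "Setting new assignment" entries in the format quoted in this report
# (an assumption; other Storm versions may log assignments differently).
ASSIGNMENT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Setting new assignment for topology id (?P<topology>[^\s:]+):"
    r".*?:node->host \{(?P<hosts>[^}]*)\}"
)
# Each :node->host entry is a quoted supervisor id followed by a quoted hostname.
HOST_PAIR_RE = re.compile(r'"([^"]+)"\s+"([^"]+)"')


def parse_assignments(lines):
    """Return {topology_id: [(timestamp, frozenset_of_supervisor_ids), ...]}."""
    history = defaultdict(list)
    for line in lines:
        match = ASSIGNMENT_RE.search(line)
        if match is None:
            continue  # not an assignment entry
        nodes = frozenset(
            node for node, _host in HOST_PAIR_RE.findall(match.group("hosts"))
        )
        history[match.group("topology")].append((match.group("ts"), nodes))
    return dict(history)


def stuck_topologies(history, min_repeats=3):
    """Return topologies whose most recent assignments all name the same supervisors.

    The result maps topology id to (sorted supervisor ids, length of the
    trailing run of identical assignments).
    """
    stuck = {}
    for topology, entries in history.items():
        last_nodes = entries[-1][1]
        run = 0
        # Walk backwards until the set of supervisors changes.
        for _ts, nodes in reversed(entries):
            if nodes != last_nodes:
                break
            run += 1
        if run >= min_repeats:
            stuck[topology] = (sorted(last_nodes), run)
    return stuck
```

For example, passing the lines of the attached nimbus.log to `parse_assignments` and the result to `stuck_topologies` would list Sync-1-1404911509 together with supervisor 9f5f2ddd-40ee-4ac1-b705-2957089af330, provided the log matches the excerpt above. A repeated assignment only shows that Nimbus keeps choosing the same supervisor; whether that supervisor is actually dead has to be confirmed separately.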