[ 
https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kellogg updated STORM-397:
-------------------------------
    Component/s: storm-core

> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
>                 Key: STORM-397
>                 URL: https://issues.apache.org/jira/browse/STORM-397
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.9.2-incubating
>         Environment: 2 topologies, 3 supervisors
>            Reporter: Simon Cooper
>            Priority: Critical
>         Attachments: nimbus.log, storm.png
>
>
> We're running two topologies on a cluster with 3 supervisors. By default, 
> both topologies are assigned to the same supervisor. If that supervisor 
> dies, Storm reassigns one topology to another supervisor but not the other, 
> leaving the second topology inactive.
> There are various symptoms/possible causes of this problem. In the Nimbus 
> logs, from the moment the topologies are first submitted, Nimbus keeps 
> reassigning the second topology to the same supervisor every 10 seconds:
> {noformat}
> 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9 
> 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
> 2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9 
> 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
> 2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9 
> 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
> {noformat}
> These log messages continue after the supervisor that the topology is 
> running on dies: Nimbus keeps trying to reassign it to the dead supervisor. 
> Note that the other topology is reassigned elsewhere without problems.
> Only when the broken topology is rebalanced does Nimbus assign it to a 
> working supervisor.
> Another symptom is that, when the machines running Storm are started, only 
> one topology comes up. The second topology is not assigned to a supervisor. 
> Again, it takes a rebalance for Nimbus to actually assign the topology 
> somewhere.
> A couple of possibly related bugs are STORM-256 and STORM-341, but I don't 
> understand those bugs well enough to link them to these problems.
> This is a major issue for us. One of the reasons for using Storm is that if a 
> supervisor were to die, Storm would automatically fail over to another 
> supervisor. This does not happen, leaving our cluster with a single point of 
> failure (SPOF).
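
The repeated-assignment symptom described above can be looked for in a nimbus.log with a short script. The following is only a minimal sketch, not part of Storm: the function names and the `min_repeats` threshold are made up for illustration, and it assumes each log entry sits on a single line in the format shown in the excerpt (not wrapped as in this email). It flags topologies whose assignment is set several times in a row to the same set of nodes.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches one "Setting new assignment" entry, assuming the line format
# shown in the log excerpt (one entry per line in the raw nimbus.log).
ASSIGNMENT = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s.*?"
    r"Setting new assignment for topology id (?P<topo>[^\s:]+):.*?"
    r":node->host \{(?P<hosts>[^}]*)\}"
)
# Pairs of "node-id" "hostname" inside the :node->host map.
NODE_HOST = re.compile(r'"([^"]+)"\s+"([^"]+)"')


def parse_assignments(log_text):
    """Yield (timestamp, topology id, frozenset of (node id, host)) per entry."""
    for line in log_text.splitlines():
        m = ASSIGNMENT.search(line)
        if m is None:
            continue  # not an assignment entry
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        nodes = frozenset(NODE_HOST.findall(m.group("hosts")))
        yield ts, m.group("topo"), nodes


def repeated_assignments(log_text, min_repeats=3):
    """Return {topology id: [timestamps]} for topologies whose assignment was
    set at least `min_repeats` times in a row to an unchanged set of nodes.

    `min_repeats` is an arbitrary, hypothetical threshold for this sketch.
    """
    runs = defaultdict(list)  # topology id -> timestamps of the current run
    last_nodes = {}           # topology id -> node set of the current run
    flagged = {}
    for ts, topo, nodes in parse_assignments(log_text):
        if last_nodes.get(topo) != nodes:
            # The node set changed, so start a new run for this topology.
            runs[topo] = []
            last_nodes[topo] = nodes
        runs[topo].append(ts)
        if len(runs[topo]) >= min_repeats:
            flagged[topo] = list(runs[topo])
    return flagged
```

Calling `repeated_assignments(open("nimbus.log").read())` returns the flagged topology ids with the timestamps of each repeated assignment, so the interval between them can be compared with the 10-second period seen in the excerpt. A flagged topology is only a hint that it may be affected; it does not show by itself that the target supervisor is dead.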



