[ https://issues.apache.org/jira/browse/STORM-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249648#comment-16249648 ]
Robert Joseph Evans commented on STORM-2809:
--------------------------------------------

This might be related. In my supervisor logs I see that, for a specific topology, it failed to download the jar, etc. for it.

{code}
AsyncLocalizer Executor - 2 [WARN] Failed to download blob LOCAL TOPO BLOB TOPO_CONF TumblingWindowTestt2-20-1510353392 will try again in 100 ms
{code}

Looking at the nimbus log for that topology I see:

{code}
2017-11-10 16:36:32.412 o.a.s.d.n.Nimbus pool-14-thread-28 [INFO] Received topology submission for TumblingWindowTestt2
...
2017-11-10 16:36:41.834 o.a.s.d.n.Nimbus timer [INFO] Setting new assignment for topology id TumblingWindowTestt2-20-1510353392
...
2017-11-10 16:38:12.529 o.a.s.d.n.Nimbus pool-14-thread-51 [INFO] Delaying event REMOVE for 30 secs for TumblingWindowTestt2-20-1510353392
2017-11-10 16:38:12.529 o.a.s.d.n.Nimbus pool-14-thread-51 [INFO] TRANSITION: TumblingWindowTestt2-20-1510353392 KILL null true
2017-11-10 16:38:12.531 o.a.s.d.n.Nimbus pool-14-thread-51 [INFO] Adding topo to history log: TumblingWindowTestt2-20-1510353392
...
2017-11-10 16:38:42.019 o.a.s.d.n.Nimbus timer [INFO] Executor TumblingWindowTestt2-20-1510353392:[1, 1] not alive
...
2017-11-10 16:38:42.020 o.a.s.d.n.Nimbus timer [INFO] Setting new assignment for topology id TumblingWindowTestt2-20-1510353392 //No assignments
...
2017-11-10 16:38:42.533 o.a.s.d.n.Nimbus timer [INFO] Killing topology: TumblingWindowTestt2-20-1510353392
...
{code}

Only after the topology was totally killed did the supervisor start doing anything about it.

{code}
2017-11-10 16:39:36.639 o.a.s.l.AsyncLocalizer AsyncLocalizer Executor - 2 [WARN] Failed to download blob LOCAL TOPO BLOB TOPO_CONF TumblingWindowTestt2-20-1510353392 will try again in 100 ms
{code}

The supervisor did not start looking at the topology until just under 3 minutes after the worker was scheduled. I will try to see what the supervisor was doing during that time.

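To put a number on that gap, the two timestamps quoted in the logs above can be subtracted. This is only a rough sketch in Python; it assumes the Nimbus and supervisor hosts have synchronized clocks and log in the same timezone, and the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used at the start of the Storm log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from `start` to `end` (hypothetical helper)."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Nimbus: "Setting new assignment for topology id TumblingWindowTestt2-20-1510353392"
assigned = "2017-11-10 16:36:41.834"
# Supervisor: first "Failed to download blob" warning for the same topology
first_download_attempt = "2017-11-10 16:39:36.639"

delay = seconds_between(assigned, first_download_attempt)
print(f"{delay:.3f} s")  # prints "174.805 s", i.e. about 2 min 55 s
```

Measured from the first assignment rather than from the submission, the delay comes out at roughly 2 minutes 55 seconds, which matches the "just under 3 mins" estimate.
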
> Integration test is failing consistently and topologies sometimes fail to start workers
> ---------------------------------------------------------------------------------------
>
>                 Key: STORM-2809
>                 URL: https://issues.apache.org/jira/browse/STORM-2809
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Stig Rohde Døssing
>            Assignee: Robert Joseph Evans
>             Fix For: 2.0.0
>
>
> The integration test has been failing fairly consistently since
> https://github.com/apache/storm/pull/2363. I tried running the test outside a
> VM with a locally installed Storm setup, and it has failed every time for me.
> Most runs seem to fail in ways that make it look like the integration test is
> just flaky (e.g. tuple windows not matching the calculated window), but in at
> least a few tests I saw the topology get submitted to Nimbus followed by
> about 3 minutes of nothing happening. The workers never started and the
> supervisor didn't seem aware of the scheduling. The only evidence that the
> topology was submitted was in the Nimbus log. This still happens even if the
> test topologies are killed with a timeout of 0, so there should be slots free
> for the next test immediately.
> I tried reverting https://github.com/apache/storm/pull/2363 and it seems to
> make the integration test pass much more often. Over 5 runs there was still
> an instance of a supervisor failing to start the workers, but the other 4
> passed.
> We should try to fix whatever is causing the supervisor to fail to start
> workers, and get the integration test more stable.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)