GitHub user srdo commented on the issue:
https://github.com/apache/storm/pull/2363
@revans2 I think something in this PR is causing topology deployment to
occasionally fail or be very slow.
The integration test has been failing fairly consistently since
cef450064fa20e2194ef3f51a21c8e6693a285e3. I tried running the test outside a VM
with a locally installed Storm setup, and it has failed every time for me.
Most runs seem to fail in ways that make it look like the integration test
is just flaky (e.g. tuple windows not matching the calculated window), but in
at least a few tests I saw the topology get submitted to Nimbus followed by
about 3 minutes of nothing happening. The workers never started and the
supervisor didn't seem aware of the scheduling. The only evidence that the
topology was submitted was in the Nimbus log. This still happens even if the
test topologies are killed with a timeout of 0, so there should be slots free
for the next test immediately.
I tried reverting cef450064fa20e2194ef3f51a21c8e6693a285e3, and that seems to
make the integration test pass much more often. Over 5 runs, one still had a
supervisor fail to start the workers, but the other 4 passed.
---