[jira] [Commented] (STORM-2809) Integration test is failing consistently and topologies sometimes fail to start workers

Robert Joseph Evans (JIRA) Mon, 13 Nov 2017 06:18:46 -0800

    [ 
https://issues.apache.org/jira/browse/STORM-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249648#comment-16249648
 ]


Robert Joseph Evans commented on STORM-2809:
--------------------------------------------

This might be related.  In my supervisor logs I see that, for one specific 
topology, the supervisor failed to download the jar, etc.

{code}
AsyncLocalizer Executor - 2 [WARN] Failed to download blob LOCAL TOPO BLOB 
TOPO_CONF TumblingWindowTestt2-20-1510353392 will try again in 100 ms
{code}

Looking at the Nimbus log for that topology, I see:

{code}
2017-11-10 16:36:32.412 o.a.s.d.n.Nimbus pool-14-thread-28 [INFO] Received 
topology submission for TumblingWindowTestt2
...
2017-11-10 16:36:41.834 o.a.s.d.n.Nimbus timer [INFO] Setting new assignment 
for topology id TumblingWindowTestt2-20-1510353392
...
2017-11-10 16:38:12.529 o.a.s.d.n.Nimbus pool-14-thread-51 [INFO] Delaying 
event REMOVE for 30 secs for TumblingWindowTestt2-20-1510353392
2017-11-10 16:38:12.529 o.a.s.d.n.Nimbus pool-14-thread-51 [INFO] TRANSITION: 
TumblingWindowTestt2-20-1510353392 KILL null true
2017-11-10 16:38:12.531 o.a.s.d.n.Nimbus pool-14-thread-51 [INFO] Adding topo 
to history log: TumblingWindowTestt2-20-1510353392
...
2017-11-10 16:38:42.019 o.a.s.d.n.Nimbus timer [INFO] Executor 
TumblingWindowTestt2-20-1510353392:[1, 1] not alive
...
2017-11-10 16:38:42.020 o.a.s.d.n.Nimbus timer [INFO] Setting new assignment 
for topology id TumblingWindowTestt2-20-1510353392 //No assignments
...
2017-11-10 16:38:42.533 o.a.s.d.n.Nimbus timer [INFO] Killing topology: 
TumblingWindowTestt2-20-1510353392
...
{code}
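For anyone repeating this check on their own logs, here is a minimal sketch of 
pulling a single topology's events out of a log. It assumes each log entry is 
one line laid out as timestamp, logger, thread, [LEVEL], message, which is 
inferred from the excerpts above (the lines there are only wrapped by the 
mail client). The names LINE_RE and events_for are hypothetical helpers, not 
part of Storm.

```python
import re

# Layout assumed from the excerpts above:
#   <timestamp> <logger> <thread name> [<LEVEL>] <message>
# Thread names may contain spaces, so the thread group is matched lazily.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<logger>\S+) (?P<thread>.+?) \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def events_for(lines, topology):
    """Yield (timestamp, thread, message) for entries mentioning the topology."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and topology in match.group("msg"):
            yield match.group("ts"), match.group("thread"), match.group("msg")


# Two entries from the Nimbus excerpt, unwrapped onto single lines.
sample = [
    "2017-11-10 16:36:41.834 o.a.s.d.n.Nimbus timer [INFO] Setting new "
    "assignment for topology id TumblingWindowTestt2-20-1510353392",
    "2017-11-10 16:38:42.533 o.a.s.d.n.Nimbus timer [INFO] Killing topology: "
    "TumblingWindowTestt2-20-1510353392",
]

timeline = list(events_for(sample, "TumblingWindowTestt2-20-1510353392"))
for ts, thread, msg in timeline:
    print(ts, thread, msg)
```

Running the same filter over both the Nimbus and supervisor logs and sorting 
by timestamp gives a combined timeline for the topology.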

Only after the topology was totally killed did the supervisor start doing 
anything about it.

{code}
2017-11-10 16:39:36.639 o.a.s.l.AsyncLocalizer AsyncLocalizer Executor - 2 
[WARN] Failed to download blob LOCAL TOPO BLOB TOPO_CONF 
TumblingWindowTestt2-20-1510353392 will try again in 100 ms
{code}

It was just under 3 minutes from the time the worker was scheduled until the 
supervisor started to look at it.
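The gap can be checked directly from the timestamps quoted above. This is a 
small sketch that assumes the first "Setting new assignment" entry in the 
Nimbus log marks when the worker was scheduled, and that the supervisor 
warning shown above is its first activity for the topology. The helper name 
elapsed_seconds is hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the log excerpts (millisecond precision).
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Assumption: "Setting new assignment" is the scheduling event.
assigned = "2017-11-10 16:36:41.834"
# Nimbus "Killing topology" entry.
killed = "2017-11-10 16:38:42.533"
# Supervisor "Failed to download blob" warning.
first_supervisor_activity = "2017-11-10 16:39:36.639"

schedule_gap = elapsed_seconds(assigned, first_supervisor_activity)
kill_gap = elapsed_seconds(killed, first_supervisor_activity)

print(f"assignment -> supervisor activity: {schedule_gap:.3f} s")
print(f"kill -> supervisor activity: {kill_gap:.3f} s")
```

Under those assumptions the first gap comes out at roughly 174.8 seconds, and 
the supervisor's first activity is roughly 54 seconds after the kill.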

I will try to see what the supervisor was doing at that time.

> Integration test is failing consistently and topologies sometimes fail to 
> start workers
> ---------------------------------------------------------------------------------------
>
>                 Key: STORM-2809
>                 URL: https://issues.apache.org/jira/browse/STORM-2809
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Stig Rohde Døssing
>            Assignee: Robert Joseph Evans
>             Fix For: 2.0.0
>
>
> The integration test has been failing fairly consistently since 
> https://github.com/apache/storm/pull/2363. I tried running the test outside a 
> VM with a locally installed Storm setup, and it has failed every time for me.
> Most runs seem to fail in ways that make it look like the integration test is 
> just flaky (e.g. tuple windows not matching the calculated window), but in at 
> least a few tests I saw the topology get submitted to Nimbus followed by 
> about 3 minutes of nothing happening. The workers never started and the 
> supervisor didn't seem aware of the scheduling. The only evidence that the 
> topology was submitted was in the Nimbus log. This still happens even if the 
> test topologies are killed with a timeout of 0, so there should be slots free 
> for the next test immediately.
> I tried reverting https://github.com/apache/storm/pull/2363 and it seems to 
> make the integration test pass much more often. Over 5 runs there was still 
> an instance of a supervisor failing to start the workers, but the other 4 
> passed.
> We should try to fix whatever is causing the supervisor to fail to start 
> workers, and get the integration test more stable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
