I still haven't had a chance to try this on a large Storm 0.9.6/0.10.0
setup, so any similar experience would be of interest to me. Thanks.

2016-01-15 20:01 GMT+03:00 Yury Ruchin <[email protected]>:

> Hi,
>
> I'm facing an issue where a Storm 0.9.5 topology looks alive but
> effectively stops processing tuples. These are the steps to reproduce:
>
> 0. Have a large topology with dozens of workers. The topology reads data
> from a Kafka spout and has topology.max.spout.pending set to a finite value.
> 1. Deploy the topology so that all the worker slots are occupied.
> 2. Take note of two worker processes; let's call them worker A and worker
> B. Assume worker B occupies slot (N+P), where N is the node name and P is
> the port.
> 3. Kill worker A.
> 4. Wait for Nimbus to detect A's death. Nimbus will initiate a restart of A.
> 5. Wait for A to establish a Netty client connection to B.
> 6. Kill B. From that point on, A's connection to B is stale. Nevertheless,
> it will remain in the ":cached-node+port->socket" map unless it is closed
> in a later refresh-connections() call.
> 7. If B restarts before the next scheduled refresh-connections() call
> fires, A's stale connection to (N+P) will never be reestablished, since B
> is restarted in the same slot it occupied before death, so the assignment
> does not change with regard to the (N+P) slot.
> 8. Worker A hangs in the not-yet-started state (its storm-active flag is
> false), but from Nimbus's perspective it is alive, so other workers' spouts
> keep sending data to A, exhaust their topology.max.spout.pending limit, and
> stop emitting as well.
>
> This may look like a contrived case, but I hit it several times a day in
> my setup. Probably because ZK is slow, I observe massive worker restarts
> triggered by heartbeat timeouts at nearly the same time, which leads to
> the scenario above. I actually do have some free slots in the cluster, but
> that does not prevent workers from being reassigned to the same slot in
> rapid succession.
>
> Something very similar is described in this issue:
> https://issues.apache.org/jira/browse/STORM-946.
>
> Has anyone ever seen this? Maybe it's somehow fixed or alleviated in Storm
> 0.9.6/0.10.0?
>
> Thanks,
> Yury
>
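
To illustrate the mechanism in steps 6-7, here is a minimal, hypothetical
sketch (Python rather than Storm's actual Clojure; all names are my own
invention, not Storm's API) of why a refresh that only diffs slot
assignments never replaces a connection to a peer that dies and restarts in
the same slot:

```python
class Worker:
    """Toy model of a worker's outbound-connection cache."""

    def __init__(self):
        # Maps (node, port) -> connection object, playing the role of
        # Storm's :cached-node+port->socket map.
        self.cached_connections = {}

    def refresh_connections(self, assignment):
        # Close connections to slots that are no longer assigned.
        for slot in list(self.cached_connections):
            if slot not in assignment:
                del self.cached_connections[slot]
        # Open connections to newly assigned slots.
        for slot in assignment:
            if slot not in self.cached_connections:
                self.cached_connections[slot] = object()  # stand-in connection

# Worker A connects to worker B in slot (N+P).
a = Worker()
assignment = {("N", 6700)}          # B's slot never changes
a.refresh_connections(assignment)
stale = a.cached_connections[("N", 6700)]

# B dies here, making `stale` unusable. B then restarts in the SAME slot
# before the next scheduled refresh, so the assignment is unchanged and
# the refresh finds nothing to close or reopen:
a.refresh_connections(assignment)
assert a.cached_connections[("N", 6700)] is stale  # dead connection kept
```

Since the restarted B reuses slot (N+P), the assignment set is identical,
the diff is empty, and the stale connection lives on until something other
than an assignment change evicts it.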
