Hi,

I'm facing an issue where a Storm 0.9.5 topology looks alive but has
effectively stopped processing tuples. These are the steps to reproduce:

0. Have a large topology with dozens of workers. The topology reads data
from a Kafka spout and has topology.max.spout.pending set to a finite value.
1. Deploy topology so that all the worker slots are occupied.
2. Take note of two worker processes, let's call them worker A and worker
B. Let's assume worker B occupies slot (N+P), where N is a node name and P
is a port.
3. Kill worker A.
4. Wait for Nimbus to detect A's death. Nimbus will initiate restart of A.
5. Wait for A to establish Netty client connection to B.
6. Kill B. From that point on, A's connection to B is stale. Nevertheless,
it remains in the ":cached-node+port->socket" map until it is closed by a
later refresh-connections() call.
7. If B restarts before the next scheduled refresh-connections() call
fires, A's stale connection to (N+P) will never be re-established, since B
is restarted in the same slot it occupied before its death, so the
assignment does not change with respect to (N+P).
8. Worker A hangs in the not-yet-started state (storm-active-flag is
false), but from Nimbus's perspective it is alive, so other workers' spouts
keep sending data to A, exhaust their topology.max.spout.pending budget,
and stop emitting as well.
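To make the race in steps 6-7 concrete, here is a minimal sketch (not
Storm's actual code; `refresh_connections`, `Socket`, and the slot values
are hypothetical stand-ins) of a connection cache that is reconciled purely
by diffing (node, port) pairs against the current assignment. If the
assignment is unchanged, the diff is empty and the dead socket is never
replaced:

```python
# Illustrative sketch of the stale-connection race: the cache is only
# reconciled against the (node, port) assignment, not against liveness.

class Socket:
    def __init__(self, target):
        self.target = target
        self.alive = True   # flips to False when the remote worker dies


def refresh_connections(cached, assignment):
    """Mimics worker A's periodic refresh, keyed purely on (node, port) diffs."""
    # Close connections to slots that left the assignment.
    for slot in list(cached):
        if slot not in assignment:
            cached.pop(slot)
    # Open connections to newly assigned slots.
    for slot in assignment:
        if slot not in cached:
            cached[slot] = Socket(slot)


# Worker A connects to worker B in slot (N+P).
slot_b = ("node-1", 6700)
assignment = {slot_b}
cached = {}
refresh_connections(cached, assignment)

cached[slot_b].alive = False   # step 6: B is killed, A's socket goes stale
# Step 7: B restarts in the same slot before the next refresh, so the
# assignment A sees is identical and the diff below is empty.
refresh_connections(cached, assignment)

print(cached[slot_b].alive)    # still the dead socket: False
```

Under this model, nothing short of an assignment change (or an explicit
liveness check on the cached socket) ever evicts the stale entry, which
matches the hang described above.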

This may look like a contrived case, but I run into it several times a day
in my setup. Probably because ZK is slow, I observe massive worker restarts
triggered by heartbeat timeouts at nearly the same time, which leads to the
scenario above. I actually do have some free slots in the cluster, but that
does not prevent workers from being reassigned to the same slot in rapid
succession.

Something very similar is described in this issue:
https://issues.apache.org/jira/browse/STORM-946.

Has anyone ever seen this? Maybe it's somehow fixed or alleviated in Storm
0.9.6/0.10.0?

Thanks,
Yury
