Hi, I'm facing an issue where a Storm 0.9.5 topology looks alive but effectively stops processing tuples. Here are the steps to reproduce:
0. Have a large topology with dozens of workers. The topology reads data from a Kafka spout and has topology.max.spout.pending set to a finite value.
1. Deploy the topology so that all worker slots are occupied.
2. Take note of two worker processes, call them worker A and worker B. Assume worker B occupies slot (N+P), where N is a node name and P is a port.
3. Kill worker A.
4. Wait for Nimbus to detect A's death and initiate A's restart.
5. Wait for A to establish a Netty client connection to B.
6. Kill B. From that point A's connection to B is stale. Nevertheless, it remains in the ":cached-node+port->socket" map unless it is closed by a later refresh-connections() call.
7. If B restarts before the next scheduled refresh-connections() call fires, A's stale connection to (N+P) is never reestablished: B is restarted in the same slot it occupied before its death, so the assignment does not change with regard to (N+P).
8. Worker A hangs in the not-yet-started state (storm-active-flag is false), but from Nimbus's perspective it is alive, so other workers' spouts keep sending data to A, exhaust their topology.max.spout.pending budget, and stop emitting as well.

This may look like a contrived case, but I hit it several times a day in my setup. Probably because ZK is slow, I observe massive worker restarts by heartbeat timeout at nearly the same time, which leads me to the scenario above. I actually do have some free slots in the cluster, but that does not prevent workers from being reassigned to the same slot in rapid succession.

Something very similar is described in this issue: https://issues.apache.org/jira/browse/STORM-946. Has anyone ever seen this? Maybe it's somehow fixed / alleviated in Storm 0.9.6/0.10.0?

Thanks,
Yury
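P.S. In case it helps, here is how I understand the race in steps 6-7, as a minimal Python sketch. This is not actual Storm code (the real logic is in Clojure); the Worker/Connection names are mine, and the only assumption carried over from Storm is that the connection cache is keyed by (node, port) and that refresh-connections() compares only those keys against the current assignment:

```python
# Sketch (not Storm code) of the stale-cache race: a worker caches peer
# connections keyed by (node, port), and refresh only compares keys against
# the current assignment, so a dead-but-cached connection to a reused slot
# is never replaced.

class Connection:
    def __init__(self, node, port):
        self.node, self.port = node, port
        self.alive = True  # becomes False when the remote worker dies

class Worker:
    def __init__(self):
        # analogous to :cached-node+port->socket
        self.cache = {}  # (node, port) -> Connection

    def refresh_connections(self, assignment):
        # drop connections to slots no longer in the assignment
        for key in list(self.cache):
            if key not in assignment:
                del self.cache[key]
        # open connections to newly assigned slots
        for key in assignment:
            if key not in self.cache:
                self.cache[key] = Connection(*key)

a = Worker()
a.refresh_connections({("node1", 6700)})   # step 5: A connects to B's slot
a.cache[("node1", 6700)].alive = False     # step 6: B dies, connection goes stale
# step 7: B restarts in the same slot before the next refresh; the
# assignment is unchanged, so refresh keeps the stale connection
a.refresh_connections({("node1", 6700)})
print(a.cache[("node1", 6700)].alive)      # still False
```

If the cache were also keyed by something that changes on restart (e.g. a worker/session id), the second refresh would replace the stale entry.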
