Hi all,

I'm currently investigating worker node failures, where a failure kills both the worker process and the supervisor. The documentation states the following: "What happens when a node dies? The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines." This somewhat implies that only the tasks that were running on the failed node are restarted elsewhere on the cluster.

During my tests I see quite different behavior. Each time a node fails or rejoins the cluster, the topology freezes for about 10 to 15 seconds (the spouts/bolts do not emit or receive any events). During this time all tasks are reshuffled and may end up on different machines. After the reshuffle, the spouts/bolts resume emitting and receiving data without any problems. I'm working with test data to rule out problems caused by external data sources.
Is this how Storm is designed to handle node failures and cluster growth? If not, what am I missing? Is this related to the configuration option `nimbus.reassign`?

I kept heartbeats and timeouts very low so that a failure is detected faster:

    nimbus.task.timeout.secs: 2
    nimbus.supervisor.timeout.secs: 2
    supervisor.worker.start.timeout.secs: 5
    supervisor.worker.timeout.secs: 2
    supervisor.monitor.frequency.secs: 2
    supervisor.heartbeat.frequency.secs: 1

And I set the following properties for my topology:

    conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
    conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
    conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
    conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
    conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 2000);
    conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 1);

Any ideas on how to reduce the topology freeze during node failures would be highly appreciated.

Best,
Nicolas
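P.S. In case it helps, here is a minimal sketch of the kind of test topology I'm submitting. The spout/bolt wiring, class names, and parallelism below are simplified placeholders (Storm's built-in `TestWordSpout` plus a pass-through bolt), not my actual components; the configuration values are the ones listed above.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class FailoverTestTopology {

    // Minimal pass-through bolt; input tuples are anchored and acked
    // automatically by the basic-bolt executor.
    public static class PassThroughBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getString(0)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new TestWordSpout(), 2);
        builder.setBolt("bolt", new PassThroughBolt(), 4).shuffleGrouping("spout");

        Config conf = new Config();
        // Small internal queues so a stall in the data flow shows up quickly.
        conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
        conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
        conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
        conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
        // Cap in-flight tuples and time them out after one second.
        conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 2000);
        conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 1);

        StormSubmitter.submitTopology("failover-test", conf, builder.createTopology());
    }
}
```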