To answer your questions:

1. There are no plans to try to recover more gracefully. Storm is a fail-fast system. We believe that if your bolt/spout knew how to recover from the exception gracefully, it would have done so (or should have done so) already. By the time the system sees the exception, we assume that the entire worker is in a possibly unknown state (or at least one that the bolt developers did not expect). So the best possible way to recover is to restart everything from a known state. If a bolt you don't own is causing issues, and you know the exceptions it throws are safe to swallow, you are free to wrap it in another bolt and catch/recover from the exceptions yourself (a rough sketch of such a wrapper is below).
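For illustration, a minimal sketch of such a wrapper, assuming the inner bolt implements IRichBolt (the class name is made up, not an existing Storm utility):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Tuple;

// Delegates all work to the wrapped bolt, but catches runtime exceptions in
// execute() so they do not propagate and tear down the whole worker.
public class ExceptionGuardBolt implements IRichBolt {
    private final IRichBolt delegate;
    private OutputCollector collector;

    public ExceptionGuardBolt(IRichBolt delegate) {
        this.delegate = delegate;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        delegate.prepare(conf, context, collector);
    }

    @Override
    public void execute(Tuple input) {
        try {
            delegate.execute(input);
        } catch (RuntimeException e) {
            // Surface the error in the UI and fail the tuple (so it can be
            // replayed if acking is on) instead of killing the worker.
            collector.reportError(e);
            collector.fail(input);
        }
    }

    @Override
    public void cleanup() {
        delegate.cleanup();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        delegate.declareOutputFields(declarer);
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return delegate.getComponentConfiguration();
    }
}

Whether failing the tuple, acking it, or just logging is the right recovery is up to you; the point is only that the exception never escapes execute().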
2. That is actually up to the scheduler, and the scheduler is pluggable. The scheduler runs periodically (with a 10-second sleep in between by default), and there is nothing that prevents it from moving things around. In general, Storm will not move things around if what the topology has requested is being met. Most schedulers simply check that all of the bolts/spouts are scheduled on a worker and that the topology has the number of workers it requested. There are some odd corner cases when a cluster fills up where a topology may have all of its bolts and spouts scheduled on fewer workers than requested. In those cases, when more resources become available, the scheduler will "fix" what it did. But in all other cases I know of it does not, which is why we have the rebalance command (a sample invocation appears after the quoted message below). We have been looking at elasticity (especially for the resource-aware scheduler) and expect to put something in place in the future where, by default, we will not move things around just to give you a better scheduling, but expose some sort of config that would let you modify that on a per-topology basis. We have not worked out the details of this yet, though.

- Bobby

On Tuesday, July 25, 2017, 9:23:13 AM CDT, Petr Janeček <[email protected]> wrote:

Hello gentlemen,

I find the Fault Tolerance documentation slightly … incomplete. I read both http://storm.apache.org/releases/1.1.0/Daemon-Fault-Tolerance.html and http://storm.apache.org/releases/1.1.0/Guaranteeing-message-processing.html, and these questions don't seem to be mentioned much:

1. A surprising thing is that if a bolt throws, the exception tears down the whole worker. Are there any plans to change this so that only the bolt in question is restarted? (As an explicit configuration option, simply logging and ignoring the exception and rolling on might also be a viable strategy for some, too...)

2. If a node dies, its tasks are reassigned to other nodes, and this works fine. But what if the node comes back up? Currently I believe Storm does not rebalance the topology automatically, and it must be triggered manually. We’re not using stateful bolts nor message acking, as it is okay for us to drop messages occasionally. Does Storm recover tasks to a revived worker when the topology is stateful and message acking is enabled? Because in that case the rebalance seems to be a safe operation… If not, is this somewhere on the roadmap?

Thank you for any clarifications; once all this is crystal clear to me I intend to enhance the documentation.

Petr Janeček
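P.S. For point 2, the rebalance mentioned above is issued from the storm CLI, along these lines (topology and component names here are just illustrative):

    storm rebalance mytopology -w 30 -n 4 -e mybolt=8

This waits 30 seconds, then redistributes "mytopology" across 4 workers and runs the "mybolt" component with 8 executors.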
