To answer your questions:

1. There are no plans to try to recover more gracefully. Storm is a fail-fast system. We believe that if your bolt/spout knew how to recover from the exception gracefully, it would have done so (or should have done so) already. By the time the system sees the exception, we assume that the entire worker is in a possibly unknown state (or at least one that the bolt developers did not expect). So the best possible way to recover is to restart everything from a known state. If a bolt you don't own is causing issues, and you know the exceptions it throws are safe to swallow, you are free to wrap it in another bolt and catch/recover from the exceptions yourself (a rough sketch of such a wrapper is below).
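For illustration, a minimal sketch of such a wrapper, assuming the inner bolt implements IRichBolt (the class name is made up, not an existing Storm utility):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Tuple;

// Delegates all work to the wrapped bolt, but catches runtime exceptions in
// execute() so they do not propagate and tear down the whole worker.
public class ExceptionGuardBolt implements IRichBolt {
    private final IRichBolt delegate;
    private OutputCollector collector;

    public ExceptionGuardBolt(IRichBolt delegate) {
        this.delegate = delegate;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        delegate.prepare(conf, context, collector);
    }

    @Override
    public void execute(Tuple input) {
        try {
            delegate.execute(input);
        } catch (RuntimeException e) {
            // Surface the error in the UI and fail the tuple (so it can be
            // replayed if acking is on) instead of killing the worker.
            collector.reportError(e);
            collector.fail(input);
        }
    }

    @Override
    public void cleanup() {
        delegate.cleanup();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        delegate.declareOutputFields(declarer);
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return delegate.getComponentConfiguration();
    }
}

Whether failing the tuple, acking it, or just logging is the right recovery is up to you; the point is only that the exception never escapes execute().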
2. That is actually up to the scheduler, and the scheduler is pluggable. The scheduler runs periodically (with a 10-second sleep in between by default), and there is nothing that prevents it from moving things around. In general, Storm will not move things around if what the topology has requested is being met. Most schedulers simply check that all of the bolts/spouts are scheduled on a worker and that the topology has the number of workers it requested. There are some odd corner cases when a cluster fills up where a topology may have all of its bolts and spouts scheduled on fewer workers than requested. In those cases, when more resources become available, the scheduler will "fix" what it did. But in all other cases I know of it does not, which is why we have the rebalance command (a sample invocation appears after the quoted message below). We have been looking at elasticity (especially for the resource-aware scheduler) and expect to put something in place in the future where, by default, we will not move things around just to give you a better scheduling, but expose some sort of config that would let you modify that on a per-topology basis. We have not worked out the details of this yet, though.

- Bobby

On Tuesday, July 25, 2017, 9:23:13 AM CDT, Petr Janeček <[email protected]> wrote:

Hello gentlemen,

I find the Fault Tolerance documentation slightly … incomplete. I read both http://storm.apache.org/releases/1.1.0/Daemon-Fault-Tolerance.html and http://storm.apache.org/releases/1.1.0/Guaranteeing-message-processing.html, and these questions don't seem to be mentioned much:

1. A surprising thing is that if a bolt throws, the exception tears down the whole worker. Are there any plans to change this so that only the bolt in question is restarted? (As an explicit configuration option, simply logging and ignoring the exception and rolling on might also be a viable strategy for some, too...)

2. If a node dies, its tasks are reassigned to other nodes, and this works fine. But what if the node comes back up? Currently I believe Storm does not rebalance the topology automatically, and it must be triggered manually. We’re not using stateful bolts nor message acking, as it is okay for us to drop messages occasionally. Does Storm recover tasks to a revived worker when the topology is stateful and message acking is enabled? Because in that case the rebalance seems to be a safe operation… If not, is this somewhere on the roadmap?

Thank you for any clarifications; once all this is crystal clear to me I intend to enhance the documentation.

Petr Janeček
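P.S. For point 2, the rebalance mentioned above is issued from the storm CLI, along these lines (topology and component names here are just illustrative):

    storm rebalance mytopology -w 30 -n 4 -e mybolt=8

This waits 30 seconds, then redistributes "mytopology" across 4 workers and runs the "mybolt" component with 8 executors.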
