Sure, but that's really for individual replays etc.  I don't want to
have to set up an infrastructure to monitor health per Itai's
suggestions if I don't have to, i.e. count the number of failed tuples,
track which task they're coming from, and restart the task if some
"dead" criterion is met.
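
Just to be concrete, the kind of bookkeeping I mean is roughly the sketch
below; the class, threshold, and reaction are all made up, and note that
fail() doesn't even tell you which task the failure came from, so that part
would need extra plumbing (e.g. encoding routing info in the message id):

    import java.util.concurrent.atomic.AtomicInteger;

    import backtype.storm.topology.base.BaseRichSpout;

    // Hypothetical sketch only: count consecutive tuple failures in the spout
    // and react once some "dead" threshold is crossed.
    public abstract class FailureCountingSpout extends BaseRichSpout {

        private static final int DEAD_THRESHOLD = 100;  // made-up criterion
        private final AtomicInteger consecutiveFailures = new AtomicInteger();

        @Override
        public void ack(Object msgId) {
            consecutiveFailures.set(0);  // progress observed, reset the count
        }

        @Override
        public void fail(Object msgId) {
            // fail() only gives the message id; mapping a failure back to the
            // task that dropped it means encoding routing info in msgId.
            if (consecutiveFailures.incrementAndGet() >= DEAD_THRESHOLD) {
                // Alert an operator, or kill/rebalance the topology externally.
                System.err.println("Downstream looks dead after "
                        + DEAD_THRESHOLD + " consecutive failures");
            }
        }
    }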

Best,
Marc

On Fri, Oct 17, 2014 at 09:32:36AM +0300, Onur Ünlü wrote:
> Hi Marc,
> 
> Did you override the "fail" method of BaseRichSpout?
> 
> Regards,
> Onur
> 
> 
> On Tue, Oct 14, 2014 at 6:24 PM, Marc Vaillant <[email protected]> wrote:
> 
>     How about:
> 
>     https://storm.incubator.apache.org/apidocs/backtype/storm/Config.html#NIMBUS_TASK_TIMEOUT_SECS
> 
>     "How long without heartbeating a task can go before nimbus will consider
>     the task dead and reassign it to another location."
> 
>     and
> 
>     https://storm.incubator.apache.org/apidocs/backtype/storm/Config.html#NIMBUS_REASSIGN
> 
>     "Whether or not nimbus should reassign tasks if it detects that a task
>     goes down. Defaults to true, and it's not recommended to change this
>     value."
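> 
>     These are nimbus-side settings, so as far as I understand they would
>     normally go in storm.yaml on the nimbus node rather than in the topology
>     conf, e.g. (the values shown are, I believe, the defaults):
> 
>         # storm.yaml on the nimbus node
>         nimbus.task.timeout.secs: 30
>         nimbus.reassign: true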
>    
>     Marc
>    
>     On Sat, Oct 11, 2014 at 08:42:16PM +0000, Itai Frenkel wrote:
>     > AFAIK the task.heartbeat.frequency.secs is related to zookeeper
>     > bookkeeping. Not sure if it is being acted upon.
>     >
>     > See also https://github.com/apache/storm/pull/286 (related to 2.3 below).
>     >
>     > Itai
>     >
>     > ________________________________________
>     > From: Marc Vaillant <[email protected]>
>     > Sent: Friday, October 10, 2014 7:32 PM
>     > To: [email protected]
>     > Subject: Re: Why won't storm restart failing tasks?
>     >
>     > Hi Itai,
>     >
>     > Thanks very much for your response and suggestions.  My first naive
>     > thought is: why is this so hard? Isn't this the most basic fault
>     > tolerance provided by the simplest load balancers (which is essentially
>     > what a shuffle grouping is)?  My second question: I see config
>     > parameters like task.heartbeat.frequency.secs; what is the purpose of a
>     > task heartbeat if not to detect this kind of situation?
>     >
>     > Thanks,
>     > Marc
>     >
>     > On Fri, Oct 10, 2014 at 09:22:38AM +0000, Itai Frenkel wrote:
>     > > Hi Marc,
>     > >
>     > > (I'm a storm newbie myself).
>     > >
>     > > As I understand it, your question has two parts:
>     > > 1. Can Storm detect a Bolt that is "stuck"? AFAIK the answer is no.
>     > > 2. Can Storm do anything about it? AFAIK the answer is no.
>     > >
>     > > What I would do in your situation:
>     > > 1. I would write a delegate bolt that delegates all execute/ack/fail
>     > >    calls to the "problematic bolt" (a rough sketch appears after this
>     > >    list). I would then connect a tick tuple to that delegate bolt
>     > >    (http://kitmenke.com/blog/2014/08/04/tick-tuples-within-storm/),
>     > >    which would update a TTL data structure (such as expireAfterWrite
>     > >    and removalListener, as described in
>     > >    http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilder.html).
>     > >    Each time a tick arrives it writes to the Cache, which refreshes the
>     > >    TTL. When the TTL expires you get an event on a separate thread. The
>     > >    delegate bolt of course must not forward any of the tick tuples to
>     > >    the "problematic bolt".
>     > >
>     > > 2. There are four choices:
>     > >     2.1 Try to interrupt the execute() thread. This is an anti-pattern
>     > >         in Java, since many libraries swallow InterruptedException
>     > >         without restoring the interrupt status, so it will most likely
>     > >         not work.
>     > >     2.2 Have the "problematic bolt" implement a RestartableBolt
>     > >         interface whose restart() method sets everything back to
>     > >         normal. This is also not recommended, as trying to recover from
>     > >         an unknown bad state in code is always risky. But if there is a
>     > >         specific place you know the bolt gets stuck, restart() could
>     > >         just flip an AtomicBoolean that the stuck code checks. An
>     > >         AtomicBoolean has a cost in terms of CPU caching, so make sure
>     > >         you don't check it too deep inside the inner loops.
>     > >     2.3 Have the "problematic bolt" run in a separate process. This
>     > >         would require implementing the multilang protocol in Java (it
>     > >         is already implemented in Python/Node.js/Ruby), and the bolt
>     > >         would pay serialization/deserialization CPU cycles on every
>     > >         emit. Once you have implemented the multilang protocol, run the
>     > >         problematic bolt using ShellBolt, which knows the PID of the
>     > >         child process and can kill it.
>     > >     2.4 Throw an exception that crashes the entire worker process. The
>     > >         supervisor will then restart the process, but all of the other
>     > >         executors on that worker get restarted too (and you get a few
>     > >         tens of seconds of downtime).
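>     > >
>     > > A rough sketch of the delegate/watchdog idea from 1. above. It assumes
>     > > Guava on the classpath and Storm 0.9.x (backtype.storm) APIs; the class
>     > > name, timeout, tick frequency, and the reaction on expiry are all just
>     > > placeholders. Tick tuples are processed on the same executor thread as
>     > > execute(), so a wedged bolt stops refreshing the TTL entry:
>     > >
>     > >     import java.util.Map;
>     > >     import java.util.concurrent.Executors;
>     > >     import java.util.concurrent.ScheduledExecutorService;
>     > >     import java.util.concurrent.TimeUnit;
>     > >
>     > >     import com.google.common.cache.Cache;
>     > >     import com.google.common.cache.CacheBuilder;
>     > >     import com.google.common.cache.RemovalListener;
>     > >     import com.google.common.cache.RemovalNotification;
>     > >
>     > >     import backtype.storm.Config;
>     > >     import backtype.storm.Constants;
>     > >     import backtype.storm.task.OutputCollector;
>     > >     import backtype.storm.task.TopologyContext;
>     > >     import backtype.storm.topology.OutputFieldsDeclarer;
>     > >     import backtype.storm.topology.base.BaseRichBolt;
>     > >     import backtype.storm.tuple.Tuple;
>     > >
>     > >     // Wraps the "problematic bolt" and watches for it getting stuck.
>     > >     public class WatchdogBolt extends BaseRichBolt {
>     > >
>     > >         private static final int TIMEOUT_SECS = 30; // placeholder "dead" threshold
>     > >         private static final int TICK_SECS = 5;     // placeholder tick frequency
>     > >
>     > >         private final BaseRichBolt wrapped;
>     > >         private transient OutputCollector collector;
>     > >         private transient Cache<String, Long> heartbeat;
>     > >         private transient ScheduledExecutorService cleaner;
>     > >
>     > >         public WatchdogBolt(BaseRichBolt wrapped) {
>     > >             this.wrapped = wrapped;
>     > >         }
>     > >
>     > >         @Override
>     > >         public void prepare(Map conf, TopologyContext context,
>     > >                 OutputCollector collector) {
>     > >             this.collector = collector;
>     > >             // TTL entry: if no tick refreshes it within TIMEOUT_SECS,
>     > >             // the removal listener fires on the cleanup thread.
>     > >             heartbeat = CacheBuilder.newBuilder()
>     > >                     .expireAfterWrite(TIMEOUT_SECS, TimeUnit.SECONDS)
>     > >                     .removalListener(new RemovalListener<String, Long>() {
>     > >                         @Override
>     > >                         public void onRemoval(RemovalNotification<String, Long> n) {
>     > >                             if (n.wasEvicted()) {
>     > >                                 // Bolt looks stuck: log/alert here, or
>     > >                                 // crash the worker (see 2.4 above).
>     > >                                 System.err.println("Wrapped bolt appears stuck");
>     > >                             }
>     > >                         }
>     > >                     })
>     > >                     .build();
>     > >             heartbeat.put("alive", System.currentTimeMillis());
>     > >             // Guava evicts lazily, so poke the cache from another thread.
>     > >             cleaner = Executors.newSingleThreadScheduledExecutor();
>     > >             cleaner.scheduleAtFixedRate(new Runnable() {
>     > >                 public void run() { heartbeat.cleanUp(); }
>     > >             }, 1, 1, TimeUnit.SECONDS);
>     > >             wrapped.prepare(conf, context, collector);
>     > >         }
>     > >
>     > >         @Override
>     > >         public void execute(Tuple tuple) {
>     > >             boolean tick =
>     > >                     Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
>     > >                     && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
>     > >             if (tick) {
>     > >                 heartbeat.put("alive", System.currentTimeMillis()); // refresh TTL
>     > >                 collector.ack(tuple);
>     > >                 return; // never forward ticks to the wrapped bolt
>     > >             }
>     > >             wrapped.execute(tuple);
>     > >         }
>     > >
>     > >         @Override
>     > >         public void declareOutputFields(OutputFieldsDeclarer declarer) {
>     > >             wrapped.declareOutputFields(declarer);
>     > >         }
>     > >
>     > >         @Override
>     > >         public Map<String, Object> getComponentConfiguration() {
>     > >             // Ask Storm to send this bolt a tick tuple every TICK_SECS.
>     > >             Config conf = new Config();
>     > >             conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, TICK_SECS);
>     > >             return conf;
>     > >         }
>     > >     }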
>     > >
>     > > Regards,
>     > > Itai
>     > >
>     > >
>     > > ________________________________________
>     > > From: Marc Vaillant <[email protected]>
>     > > Sent: Thursday, October 9, 2014 8:05 PM
>     > > To: [email protected]
>     > > Subject: Why won't storm restart failing tasks?
>     > >
>     > > We have now replicated, in the exclamation topology, a situation that
>     > > is occurring for us in production.  In short, one of the tasks gets
>     > > into a hung state so that all tuples sent to it fail.  We expected
>     > > Storm to detect this failing state (task heartbeat, etc.) and then
>     > > start a new task/executor.  But that does not happen; Storm just
>     > > continues to send tuples to the failing task.  How can we get Storm to
>     > > tear down the executor or task and spawn a new one?
>     > >
>     > > To replicate the issue in the exclamation topology, we modified
>     > > TestWordSpout to emit the word "dummy" only once, and then emit from
>     > > the standard list as usual:
>     > >
>     > >     public void nextTuple() {
>     > >         Utils.sleep(_sleepTime);
>     > >
>     > >         // Emit the poison "dummy" tuple exactly once, then emit the
>     > >         // standard words as usual.
>     > >         if (!_isSent) {
>     > >             _collector.emit(new Values("dummy"), "dummy" + _counter);
>     > >             _isSent = true;
>     > >         } else {
>     > >             final String[] words = new String[] {"nathan", "mike",
>     > >                     "jackson", "golda", "bertels"};
>     > >             final Random rand = new Random();
>     > >             final String word = words[rand.nextInt(words.length)] + _counter;
>     > >             _collector.emit(new Values(word), word);
>     > >         }
>     > >         _counter++;
>     > >     }
>     > >
>     > > In the ExclamationBolt's execute method we go into an infinite loop if
>     > > the tuple received is "dummy":
>     > >
>     > >
>     > >     public void execute(Tuple tuple) {
>     > >         // Simulate a hung task: spin forever when the poison tuple arrives.
>     > >         if ("dummy".equals(tuple.getString(0))) {
>     > >             while (true) {}
>     > >         }
>     > >         _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
>     > >         _collector.ack(tuple);
>     > >     }
>     > >
>     > > The topology has a parallelism of 3 for the ExclamationBolt, so the
>     > > task that receives the "dummy" tuple fails every subsequent tuple it
>     > > receives, because it is stuck in an infinite loop, while the other 2
>     > > tasks continue to process tuples normally.  While Storm eventually
>     > > fails all the tuples that are sent to the "dummy" task, it never tears
>     > > the task down and starts a new one.
>     > >
>     > > Any help greatly appreciated.
>     > >
>     > > Thanks,
>     > > Marc
> 
> 
> 
> 
> --
> Onur Ünlü
