Hi folks, I have a signal bolt that signals a downstream polling bolt to read from a queue. To achieve parallel processing, the polling bolt just reads from the queue and emits to its downstream processing bolt. The processing bolt then processes each message and commits it to a data sink (Kafka).
The problem with this design is that if Kafka is down due to some network issue, only the processing bolt sees the failure. Although it can retry under some policy (e.g. exponential backoff), the messages have already been drained out of the queue by the polling bolt. This means that if Kafka stays down, we never get a chance to replay those messages. How should I handle this case?

I could certainly put all the processing logic into the polling bolt and fail the bolt fast upon seeing failures committing to Kafka, then have a separate monitoring spout recheck all the queues. But I hope there is a more elegant way to handle this. I also looked into Storm's replay mechanism, but that means I would have to anchor all the messages (millions) from the queue, which seems like heavy overhead for the system. Besides, it's hard to tell whether a failure is real or just a timeout.

Any input appreciated. Thanks, Chen
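For what it's worth, here is a minimal sketch of the exponential-backoff retry mentioned above, written as a plain Python helper rather than actual Storm bolt code. The function name `commit_with_retry` and the injectable `commit`/`sleep` callables are hypothetical, purely for illustration; the idea is that the processing step keeps the tuple un-acked until the commit either succeeds or exhausts its attempts, at which point the caller can fail the tuple (or surface the error) so the message is not silently lost:

```python
import time


def commit_with_retry(commit, payload, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Try to commit `payload` to the sink; back off exponentially on failure.

    Returns True on success, False once all attempts are exhausted, at which
    point the caller can fail the tuple / raise so the message gets replayed
    instead of being dropped.
    """
    for attempt in range(max_attempts):
        try:
            commit(payload)
            return True
        except Exception:
            # Sleep base_delay * 2^attempt between tries: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * (2 ** attempt))
    return False
```

Example usage with a commit function that fails twice before succeeding (the `sleep` parameter is injected here so the example runs instantly):

```python
calls = []

def flaky_commit(msg):
    calls.append(msg)
    if len(calls) < 3:
        raise RuntimeError("kafka unreachable")

ok = commit_with_retry(flaky_commit, "msg-1", sleep=lambda d: None)
# ok is True after two failures and one success
```

Of course this only bounds how long one message is retried; it does not solve the replay problem by itself, since a message that exhausts its retries still needs somewhere durable to go (a fail-and-replay path, or a dead-letter queue).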