Hi folks, I have a signal bolt that signals a downstream polling bolt to read from a queue. To achieve parallel processing, the polling bolt just reads from the queue and emits to its downstream processing bolt. The processing bolt then processes each message and commits it to a data sink (Kafka).
The problem with this design is that if Kafka is down due to some network issue, only the processing bolt sees the failure. Although it can retry under some policy (e.g. exponential backoff), the messages have already been drained out of the queue by the polling bolt. This means that if Kafka stays down, we never get a chance to replay those messages. How should I handle this case?

I could certainly put all the processing logic into the polling bolt and fail the bolt fast upon seeing failures committing to Kafka, then have a separate monitoring spout recheck all the queues. But I hope there is a more elegant way to handle this. I also looked into Storm's replay mechanism, but that means I would have to anchor all the messages (millions) from the queue, which seems like heavy overhead for the system. Besides, it's hard to tell whether a failure is real or just a timeout.

Any input appreciated. Thanks, Chen
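For what it's worth, here is a minimal sketch of the exponential-backoff retry mentioned above, written as a plain Python helper rather than actual Storm bolt code. The function name `commit_with_retry` and the injectable `commit`/`sleep` callables are hypothetical, purely for illustration; the idea is that the processing step keeps the tuple un-acked until the commit either succeeds or exhausts its attempts, at which point the caller can fail the tuple (or surface the error) so the message is not silently lost:

```python
import time


def commit_with_retry(commit, payload, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Try to commit `payload` to the sink; back off exponentially on failure.

    Returns True on success, False once all attempts are exhausted, at which
    point the caller can fail the tuple / raise so the message gets replayed
    instead of being dropped.
    """
    for attempt in range(max_attempts):
        try:
            commit(payload)
            return True
        except Exception:
            # Sleep base_delay * 2^attempt between tries: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * (2 ** attempt))
    return False
```

Example usage with a commit function that fails twice before succeeding (the `sleep` parameter is injected here so the example runs instantly):

```python
calls = []

def flaky_commit(msg):
    calls.append(msg)
    if len(calls) < 3:
        raise RuntimeError("kafka unreachable")

ok = commit_with_retry(flaky_commit, "msg-1", sleep=lambda d: None)
# ok is True after two failures and one success
```

Of course this only bounds how long one message is retried; it does not solve the replay problem by itself, since a message that exhausts its retries still needs somewhere durable to go (a fail-and-replay path, or a dead-letter queue).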