Hi everyone, I'm working on a system that uses Kafka and Storm to transcribe phone calls that can be up to 4 hours long. Each call is transcribed in chunks, and the chunks slowly stream into Kafka as the call progresses.
Now to the problem. As we transcribe chunks, they are put into Kafka, and we use Storm to join all the chunks for a given call into one complete transcript. We do this by field grouping on the call id, keeping an in-memory map of the chunks per call in the bolt, and emitting the complete transcript once the call is done. To guarantee that we process the complete call, I want to delay acking the chunk tuples until we have received every single tuple for the call, at which point we emit the complete transcript anchored to every chunk tuple for the call.

Because tuples could be waiting in the bolt for up to 4 hours, we need to guarantee that Storm doesn't replay tuples that are still good, while also guaranteeing that if something fails further down the topology, all the chunks for the call are replayed. I've had a few ideas on how to solve this, but none of them seems right:

1) Up the tuple timeout to 4 hours. This would technically work, but given the original 30 second timeout, changing it to 4 hours seems crazy.
2) Periodically call resetTimeout on the tuples to keep them alive. This would also work, but there is reportedly a significant performance cost to doing so.
3) Store the chunks in a data store outside of Storm and join from there once the call is done.

Any help would be greatly appreciated.

Thanks,
Morten
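P.S. For concreteness, here's a rough, Storm-agnostic sketch of the join logic our bolt does. The field names, the sequence number, and the final-chunk flag are assumptions about our chunk format, not Storm APIs; in the real bolt, a non-null return is where we'd emit the transcript anchored to all buffered tuples and then ack them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Storm-agnostic sketch of the per-call chunk join. Assumes each chunk
// carries a call id, a zero-based sequence number, and a flag marking
// the final chunk of the call (hypothetical fields for illustration).
public class CallAggregator {
    // callId -> chunks received so far, ordered by sequence number
    private final Map<String, SortedMap<Integer, String>> chunks = new HashMap<>();
    // callId -> sequence number of the final chunk, once it has arrived
    private final Map<String, Integer> lastSeq = new HashMap<>();

    // Buffers one chunk. Returns the complete transcript once every
    // chunk for the call has arrived, otherwise null (the bolt would
    // keep the corresponding tuples un-acked in the meantime).
    public String addChunk(String callId, int seq, String text, boolean isLast) {
        chunks.computeIfAbsent(callId, k -> new TreeMap<>()).put(seq, text);
        if (isLast) {
            lastSeq.put(callId, seq);
        }
        Integer last = lastSeq.get(callId);
        if (last != null && chunks.get(callId).size() == last + 1) {
            // All chunks present: concatenate in sequence order and
            // drop the per-call state.
            StringBuilder transcript = new StringBuilder();
            for (String part : chunks.remove(callId).values()) {
                transcript.append(part);
            }
            lastSeq.remove(callId);
            return transcript.toString();
        }
        return null;
    }
}
```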
