For #2 it is a question of what you optimize for. Storm typically assumes that failures should be rare, so we optimize for that case. We keep around the minimal information needed to replay the message, but not much more. If you are getting lots of failures, you should be more concerned about why you are getting failures than about how the failures affect the performance of the spout. Or to put it another way: the best performance optimization is to stop doing the thing that is slow. If failures are the slowest part, the best thing for performance is to stop failing.
- Bobby

On Monday, July 10, 2017, 9:10:38 AM CDT, Stig Rohde Døssing <stigdoess...@gmail.com> wrote:

Hi Chandan,

I'm going to assume we're talking about the storm-kafka-client spout, and not the storm-kafka spout.

1) Yes, we do polling and committing in the spout thread. I'm not aware of why that would be against spout best practices? Simplicity is definitely a reason, but I don't think anyone has actually checked how the throughput of a threaded implementation would compare to the current one. Keep in mind that the KafkaConsumer is not thread safe, so a threaded implementation would still need to do all consumer interaction via a single thread (see the first sketch at the end of this message). Also, as far as I know the KafkaConsumer prefetches messages (https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records#KIP-41:KafkaConsumerMaxRecords-Prefetching), so most calls to poll should not take very long. What kind of threading did you have in mind?

2) No, that seems correct. When a message fails, the spout remembers the offset of the failed message but does not keep the message itself in memory. When the message becomes ready for retry, the spout seeks the consumer back to the failed message's offset and fetches it again. The consumer will fetch that message plus any later messages before it catches back up to where it was before the message failed (see the second sketch at the end of this message). We chose to poll for the failed message again because keeping it in memory could be a problem if there are many failed messages, since they'd potentially need to hang around for a while depending on the configured backoff. There may be a potential optimization here: we could try to seek directly to the latest emitted offset instead of fetching everything from the failed message forward, but it's not something we've looked at implementing. I haven't checked whether this would work, or what kind of edge cases there might be. Do you have suggestions for improving/replacing this loop?

2017-07-10 15:39 GMT+02:00 chandan singh <cks07...@gmail.com>:
> Hi
>
> I hope I am using the right mailing list. Please advise if I am wrong.
>
> I have a few observations about the KafkaSpout and feel that some of these
> lead to inefficiencies. It would be of great help if someone could throw
> some light on the rationale behind the implementation.
>
> 1) Kafka polling and committing offsets are done in the spout thread, which
> is somewhat against the spout best practices. Is simplicity the reason
> behind this design? Am I missing something?
>
> 2) The poll-iterate-commit-seek loop seems inefficient in recurrent failure
> scenarios. Say the first message fails. We will keep polling the same
> set of messages at least as many times as that message is retried, and
> probably more if we are using exponential back-off. Did I misunderstand
> the implementation?
>
> Regards
> Chandan
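
To make point 1 concrete, here is a minimal Java sketch of what a "threaded" spout would still have to do. This is purely hypothetical, not code from storm-kafka-client; the ConsumerPoller class and its field names are invented for illustration. Because KafkaConsumer is not thread safe, a dedicated poller thread must own the consumer and hand records to the spout thread, for example through a queue:

    import java.time.Duration;
    import java.util.concurrent.BlockingQueue;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    class ConsumerPoller implements Runnable {
        // The KafkaConsumer is confined to this thread; it is not thread safe.
        private final KafkaConsumer<String, String> consumer;
        // Hand-off point: the spout thread would drain this in nextTuple().
        private final BlockingQueue<ConsumerRecords<String, String>> handoff;

        ConsumerPoller(KafkaConsumer<String, String> consumer,
                       BlockingQueue<ConsumerRecords<String, String>> handoff) {
            this.consumer = consumer;
            this.handoff = handoff;
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(200));
                    if (!records.isEmpty()) {
                        handoff.put(records); // blocks if the spout thread falls behind
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                consumer.close();
            }
        }
    }

Note that commits and seeks would also have to be funneled back to this single poller thread, which is part of why the current single-threaded design is simpler.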
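For point 2, here is a rough sketch of the seek-back retry behavior described above. Again this is hypothetical and heavily simplified, not the actual KafkaSpout code: the retryFailedOffset and emitTuple names and the acked-offset set are invented for illustration, and I'm assuming the spout skips offsets it has already acked while it catches back up. The point is that seeking back to the failed offset makes the consumer re-fetch every later message on the partition as well:

    import java.time.Duration;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    class RetrySketch {
        static void retryFailedOffset(KafkaConsumer<String, String> consumer,
                                      TopicPartition tp, long failedOffset,
                                      Set<Long> ackedOffsets) {
            // Rewind the consumer to the failed message. The broker re-sends
            // it and every later message on the partition, which is the
            // re-polling cost discussed above.
            consumer.seek(tp, failedOffset);
            for (ConsumerRecord<String, String> rec :
                     consumer.poll(Duration.ofMillis(200)).records(tp)) {
                // Re-emit the failed tuple and any tuple not yet acked;
                // offsets acked before the seek are skipped.
                if (rec.offset() == failedOffset
                        || !ackedOffsets.contains(rec.offset())) {
                    emitTuple(rec);
                }
            }
        }

        static void emitTuple(ConsumerRecord<String, String> rec) {
            // stand-in for collector.emit(...) in a real spout
            System.out.println("emit offset " + rec.offset() + ": " + rec.value());
        }
    }

The optimization mentioned above would amount to seeking to the latest emitted offset rather than failedOffset, avoiding the redundant fetches at the cost of tracking more state.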