For #2 it is a question of what you optimize for. Storm typically assumes that failures should be rare, so we optimize for that case. We keep around the minimal information needed to replay the message, but not much more. If you are getting lots of failures, you should be more concerned about why you are getting failures than about how the failures affect the performance of the spout. Or to put it another way: the best performance optimization is to stop doing the thing that is slow. If failures are the slowest part, the best thing for performance is to stop failing.
- Bobby

On Monday, July 10, 2017, 9:10:38 AM CDT, Stig Rohde Døssing <stigdoess...@gmail.com> wrote:

Hi Chandan,

I'm going to assume we're talking about the storm-kafka-client spout, and not the storm-kafka spout.

1) Yes, we do polling and committing in the spout thread. I'm not aware of why that would be against spout best practices? Simplicity is definitely a reason, but I don't think anyone has actually checked how the throughput of a threaded implementation would compare to the current one. Keep in mind that the KafkaConsumer is not thread safe, so a threaded implementation would still need to do all consumer interaction via a single thread (see the first sketch at the end of this message). Also, as far as I know the KafkaConsumer prefetches messages (https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records#KIP-41:KafkaConsumerMaxRecords-Prefetching), so most calls to poll should not take very long. What kind of threading did you have in mind?

2) No, that seems correct. When a message fails, the spout remembers the offset of the failed message but does not keep the message itself in memory. When the message becomes ready for retry, the spout seeks the consumer back to the failed message's offset and fetches it again. The consumer will fetch that message plus any later messages before it catches back up to where it was before the message failed (see the second sketch at the end of this message). We chose to poll for the failed message again because keeping it in memory could be a problem if there are many failed messages, since they'd potentially need to hang around for a while depending on the configured backoff. There may be a potential optimization here: we could try to seek directly to the latest emitted offset instead of fetching everything from the failed message forward, but it's not something we've looked at implementing. I haven't checked whether this would work, or what kind of edge cases there might be. Do you have suggestions for improving/replacing this loop?

2017-07-10 15:39 GMT+02:00 chandan singh <cks07...@gmail.com>:
> Hi
>
> I hope I am using the right mailing list. Please advise if I am wrong.
>
> I have a few observations about the KafkaSpout and feel that some of these
> lead to inefficiencies. It would be of great help if someone could throw
> some light on the rationale behind the implementation.
>
> 1) Kafka polling and committing offsets are done in the spout thread, which
> is somewhat against the spout best practices. Is simplicity the reason
> behind this design? Am I missing something?
>
> 2) The poll-iterate-commit-seek loop seems inefficient in recurrent failure
> scenarios. Say the first message fails. We will keep polling the same
> set of messages at least as many times as that message is retried, and
> probably more if we are using exponential back-off. Did I misunderstand
> the implementation?
>
> Regards
> Chandan
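
To make point 1 concrete, here is a minimal Java sketch of what a "threaded" spout would still have to do. This is purely hypothetical, not code from storm-kafka-client; the ConsumerPoller class and its field names are invented for illustration. Because KafkaConsumer is not thread safe, a dedicated poller thread must own the consumer and hand records to the spout thread, for example through a queue:

    import java.time.Duration;
    import java.util.concurrent.BlockingQueue;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    class ConsumerPoller implements Runnable {
        // The KafkaConsumer is confined to this thread; it is not thread safe.
        private final KafkaConsumer<String, String> consumer;
        // Hand-off point: the spout thread would drain this in nextTuple().
        private final BlockingQueue<ConsumerRecords<String, String>> handoff;

        ConsumerPoller(KafkaConsumer<String, String> consumer,
                       BlockingQueue<ConsumerRecords<String, String>> handoff) {
            this.consumer = consumer;
            this.handoff = handoff;
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(200));
                    if (!records.isEmpty()) {
                        handoff.put(records); // blocks if the spout thread falls behind
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                consumer.close();
            }
        }
    }

Note that commits and seeks would also have to be funneled back to this single poller thread, which is part of why the current single-threaded design is simpler.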
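For point 2, here is a rough sketch of the seek-back retry behavior described above. Again this is hypothetical and heavily simplified, not the actual KafkaSpout code: the retryFailedOffset and emitTuple names and the acked-offset set are invented for illustration, and I'm assuming the spout skips offsets it has already acked while it catches back up. The point is that seeking back to the failed offset makes the consumer re-fetch every later message on the partition as well:

    import java.time.Duration;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    class RetrySketch {
        static void retryFailedOffset(KafkaConsumer<String, String> consumer,
                                      TopicPartition tp, long failedOffset,
                                      Set<Long> ackedOffsets) {
            // Rewind the consumer to the failed message. The broker re-sends
            // it and every later message on the partition, which is the
            // re-polling cost discussed above.
            consumer.seek(tp, failedOffset);
            for (ConsumerRecord<String, String> rec :
                     consumer.poll(Duration.ofMillis(200)).records(tp)) {
                // Re-emit the failed tuple and any tuple not yet acked;
                // offsets acked before the seek are skipped.
                if (rec.offset() == failedOffset
                        || !ackedOffsets.contains(rec.offset())) {
                    emitTuple(rec);
                }
            }
        }

        static void emitTuple(ConsumerRecord<String, String> rec) {
            // stand-in for collector.emit(...) in a real spout
            System.out.println("emit offset " + rec.offset() + ": " + rec.value());
        }
    }

The optimization mentioned above would amount to seeking to the latest emitted offset rather than failedOffset, avoiding the redundant fetches at the cost of tracking more state.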