Re: KafkaConsumer#poll not returning records for all partitions of topic in single call

Helleren, Erik Fri, 11 Mar 2016 06:33:19 -0800

The partitions enable scalability. Consumers and producers know which
partition records belong in based on their key (or manual assignment),
which makes it very easy to scale up your Kafka cluster or a consuming
cluster.

The ordering problem is one that I have faced, and I have a workaround.
Just keep in mind that it does limit your ability to scale after a certain
point.

So first, let's assume that ordering only matters per producer.  This
is kind of a standard assumption in most distributed, scalable, and
performant systems.  Given that assumption, you can either set the key
of each message to a producer identifier or use manual partition
assignment to ensure per-producer ordering.  The choice here depends on
message volume and the nature of your producers.  The number of producers
must be >= the number of partitions, since each producer then writes to
exactly one partition and any extra partitions would sit empty.
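
To make the keying idea concrete, here is a minimal sketch with no Kafka
dependency.  The class name, the method names, and the hash-modulo mapping
are all assumptions for illustration; this is a stand-in and not Kafka's
actual partitioner.  It only shows that a fixed producer identifier always
maps to the same partition, and that N producers can occupy at most N
partitions.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Kafka client API.
final class ProducerKeyPartitioning {

    // Stand-in for a key-based partitioner: hash the producer id and take
    // it modulo the partition count. floorMod keeps the result non-negative.
    static int partitionFor(String producerId, int numPartitions) {
        return Math.floorMod(producerId.hashCode(), numPartitions);
    }

    // Collect the distinct partitions that a set of producers writes to.
    static Set<Integer> partitionsUsed(List<String> producerIds, int numPartitions) {
        Set<Integer> used = new TreeSet<>();
        for (String id : producerIds) {
            used.add(partitionFor(id, numPartitions));
        }
        return used;
    }

    public static void main(String[] args) {
        // Two producers, four partitions: at most two partitions get data.
        List<String> producers = List.of("producer-a", "producer-b");
        System.out.println("partitions used: " + partitionsUsed(producers, 4));
    }
}
```

With a real producer you would get the same effect by passing the producer
identifier as the record key, or by naming the partition explicitly.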

On the consumption side, since a consumer always reads a whole
partition, it's guaranteed to read messages in the same order that any
specific producer produced them.  But there is an implicit
coupling between scaling the consumers and scaling the producer count.  So if
a consumer is slower than the producers feeding its partition you might get
in trouble, because you can't spread one producer's messages over more
consumers without losing that ordering.
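
As a rough check of that guarantee, the sketch below models two partitions
as plain lists and reads them in either order.  Everything here (the Message
class, the sample data, the method names) is made up for illustration and
does not use the Kafka client.  The point is only that per-producer order
survives whichever partition happens to be returned first, even though the
overall interleaving changes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; partitions are modelled as in-memory lists.
final class PerProducerOrderCheck {

    static final class Message {
        final String producerId;
        final int seq; // per-producer sequence number, increasing

        Message(String producerId, int seq) {
            this.producerId = producerId;
            this.seq = seq;
        }
    }

    // Two partitions, each holding the messages of exactly one producer.
    static List<List<Message>> samplePartitions() {
        return List.of(
            List.of(new Message("a", 1), new Message("a", 2)),
            List.of(new Message("b", 1), new Message("b", 2)));
    }

    // Read whole partitions in the given order, as if each one arrived
    // in a separate fetch response.
    static List<Message> consume(List<List<Message>> partitions, int[] readOrder) {
        List<Message> out = new ArrayList<>();
        for (int p : readOrder) {
            out.addAll(partitions.get(p));
        }
        return out;
    }

    // True if every producer's sequence numbers are strictly increasing.
    static boolean perProducerOrderHolds(List<Message> consumed) {
        Map<String, Integer> lastSeen = new HashMap<>();
        for (Message m : consumed) {
            Integer prev = lastSeen.put(m.producerId, m.seq);
            if (prev != null && prev >= m.seq) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Message>> parts = samplePartitions();
        System.out.println(perProducerOrderHolds(consume(parts, new int[] {0, 1})));
        System.out.println(perProducerOrderHolds(consume(parts, new int[] {1, 0})));
    }
}
```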

On 3/11/16, 1:38 AM, "Gerard Klijs" <gerard.kl...@dizzit.com> wrote:

>I noticed a similar effect with a test tool, which checked whether the order
>the records were produced in was the same as the order in which they were
>consumed. Using only one partition it works fine, but using multiple
>partitions the order gets messed up. If I'm right this is by design, but I
>would like to hear some feedback about this. Because messages with the
>same key end up in the same partition, if you have multiple partitions, only
>the order within a partition is the same as the order they were produced
>in. But when consuming from multiple partitions the order could be
>different.
>
>If this is true, it would be interesting to know what you should do when you
>have a topic where the order needs to be kept the same, and which needs to be
>consumed by more than one consumer at a time?
>
>On Fri, Mar 11, 2016 at 5:50 AM Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
>> You definitely *might* see data from multiple partitions, and that won't
>> be uncommon once you start processing data. However, there is no guarantee.
>>
>> In practice, it may be unlikely to see data for both partitions on the
>> first call to poll() for a simple reason: poll() will return as soon as
>> any data for any partition is available. Unless things are timed just right,
>> you're probably making requests to different brokers for data in the
>> different partitions. These requests won't be perfectly aligned -- one
>> of them will get a response first and the poll() will be able to return
>> with some data. Since only the one response will have been received, only
>> one partition will get data.
>>
>> After the first poll, you probably spend some time processing that data
>> before you call poll again. However, another request has been sent out
>> to the broker that returned data faster and the other request also gets
>> returned. So on the next poll, you might be more likely to see data from
>> both partitions.
>>
>> So you're right: there's no hard guarantee, and you shouldn't write your
>> consumer code to assume that data will be returned for all partitions.
>> (And you can't assume that anyway; what if no new data had been published
>> to one of the partitions?). However, many times you will see data from
>> multiple partitions.
>>
>> -Ewen
>>
>> On Thu, Mar 10, 2016 at 11:21 AM, Shrijeet Paliwal <shrijeet.pali...@gmail.com> wrote:
>>
>> > Version: 0.9.0.1
>> >
>> > I have a test which creates two partitions in a topic and writes data to
>> > both partitions. Then a single consumer subscribes to the topic, verifies
>> > that it has got the assignment of both partitions in that topic, and
>> > finally issues a poll. The first poll always comes back with records of
>> > only one partition. I need to poll one more time to get records for the
>> > second partition. The poll timeout has no effect on this.
>> >
>> > Unless I've misunderstood the contract - the first poll *could* have
>> > returned records for both partitions. After all, poll
>> > returns ConsumerRecords<K,V>, which is a map of topic_partitions -->
>> > records.
>> >
>> > I acknowledge that the API does not make any hard guarantees that align
>> > with my expectation, but it looks like the API was crafted to support
>> > multiple partitions & topics in a single call. Is there an implementation
>> > detail which restricts this? Is there a configuration which controls what
>> > gets fetched?
>> >
>> > --
>> > Shrijeet
>> >
>>
>>
>>
>> --
>> Thanks,
>> Ewen
>>
