Re: Maintaining message ordering using KafkaSpout/Bolt

Matthias J. Sax Sun, 05 Jun 2016 02:38:28 -0700

> Does the parallelism_hint set when a KafkaSpout is added to a topology,
> need to match the number of partitions in a topic?


No.

On 06/05/2016 11:26 AM, Matthias J. Sax wrote:
> Hi Kanagha,
> 
> For reading, KafkaSpout's internally used KafkaConsumer ensures that
> data is received in order per partition. Because the spout might read
> multiple partitions but emits only a single (logical) output stream,
> data from multiple partitions interleaves within this output stream (the
> relative order within each partition is preserved, though). How the
> partitions are distributed depends on the connection pattern between
> your spout and the downstream bolt. (If you use shuffleGrouping, the
> data of a single partition is distributed over all downstream bolt
> instances -- order is still preserved within a partition, but each bolt
> instance gets only some of the data per partition. After the first bolt,
> Storm no longer guarantees the order, because the data of a single
> partition is spread out over multiple parallel bolt instances in this
> case.)
> 
> If you want each partition to be processed by a single bolt instance,
> you need to extract the partitionId (i.e., add it to the Storm tuple) in
> the spout and use fieldsGrouping on partitionId for downstream bolts. I
> guess KafkaSpout does not support this out of the box -- you can either
> patch KafkaSpout itself or inherit from it to build your own
> "PartitionKafkaSpout" that adds the partitionId to the output tuples.
> 
> (Or maybe ask at u...@storm.apache.org ;))
> 
> For writing, you are correct. KafkaBolt uses key-based partitioning on
> write, so if you use fieldsGrouping on the key, it should work as intended.
> 
> 
> -Matthias
> 
> On 06/05/2016 07:51 AM, Kanagha wrote:
>> Hi,
>>
>> I'm looking at the documentation for using KafkaSpout/KafkaBolt.
>>
>> https://github.com/apache/storm/tree/master/external/storm-kafka
>>
>> How is ordering guaranteed while reading messages from Kafka using
>> KafkaSpout?
>> Does the parallelism_hint set when a KafkaSpout is added to a topology,
>> need to match the number of partitions in a topic?
>>
>> Similarly while writing back to Kafka, I believe fieldsGrouping can be used
>> so that tuples that have same field value will go to the same task and can
>> be written to the same partition.
>> Would like to get suggestions on this. Thanks!
>>
>> Thanks
>> Kanagha
>>
>
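
The fieldsGrouping-on-partitionId idea from the quoted message can be
illustrated with a small plain-Java simulation. This is only a sketch
under stated assumptions: it uses no Storm or Kafka classes, the names
PartitionGroupingSketch, Msg, taskFor and route are made up for
illustration, and the hash-modulo routing merely stands in for whatever
Storm does internally for a fields grouping. It shows that when tuples
are routed by partitionId, all tuples of one partition end up on the same
task in their original relative order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of routing tuples by partitionId (no Storm/Kafka classes).
class PartitionGroupingSketch {

    // Hypothetical tuple that carries the Kafka partition it was read from.
    static final class Msg {
        final int partitionId;
        final long offset;

        Msg(int partitionId, long offset) {
            this.partitionId = partitionId;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return "p" + partitionId + "@" + offset;
        }
    }

    // Assumed stand-in for a fields grouping: equal field values map to the same task.
    static int taskFor(int partitionId, int numTasks) {
        return Math.floorMod(Integer.hashCode(partitionId), numTasks);
    }

    // Delivers the interleaved stream to the tasks, keeping arrival order per task.
    static Map<Integer, List<Msg>> route(List<Msg> stream, int numTasks) {
        Map<Integer, List<Msg>> perTask = new TreeMap<>();
        for (Msg m : stream) {
            int task = taskFor(m.partitionId, numTasks);
            perTask.computeIfAbsent(task, t -> new ArrayList<>()).add(m);
        }
        return perTask;
    }

    public static void main(String[] args) {
        // Interleaved output of a single spout that reads partitions 0, 1 and 2.
        List<Msg> stream = Arrays.asList(
            new Msg(0, 0), new Msg(1, 0), new Msg(2, 0),
            new Msg(1, 1), new Msg(0, 1), new Msg(2, 1));

        route(stream, 2).forEach((task, msgs) ->
            System.out.println("task " + task + " -> " + msgs));
    }
}
```

Running main prints the tuples each simulated task receives. A task may
see several partitions interleaved, but a given partition never shows up
on more than one task, which is the property shuffleGrouping does not
give you.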
