If maintaining the order of the messages is a requirement, fields grouping
seems to be the only strategy that ensures that all tuples of the same
partition (that is, with the same value in the grouping fields) are sent to
the same task.
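
To make that guarantee concrete, here is a minimal Python sketch of key-based
routing. It is not Storm's implementation: `task_for_key`, `NUM_TASKS`, and the
CRC32 hash are assumptions chosen for illustration. It only shows that hashing
the grouping field modulo the number of tasks pins every tuple with the same
key to one task, so that task receives them in emit order.

```python
import zlib
from collections import defaultdict

NUM_TASKS = 4  # hypothetical parallelism of the downstream bolt


def task_for_key(key: str, num_tasks: int = NUM_TASKS) -> int:
    """Map a grouping key to a task index with a stable hash (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def route(tuples):
    """Deliver (key, seq) tuples to per-task queues, chosen by the grouping key."""
    queues = defaultdict(list)
    for key, seq in tuples:
        queues[task_for_key(key)].append((key, seq))
    return queues


stream = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-a", 3), ("user-b", 2)]
queues = route(stream)

# Every tuple for a given key lands in exactly one queue, in emit order.
for key in ("user-a", "user-b"):
    owners = [task for task, q in queues.items() if any(k == key for k, _ in q)]
    seqs = [s for k, s in queues[owners[0]] if k == key]
    print(key, "-> task", owners[0], "sequence", seqs)
```

The flip side is that a key carrying a very large share of the tuples
concentrates all of that work on a single task.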


https://storm.apache.org/releases/current/Concepts.html

Stream groupings
> Part of defining a topology is specifying for each bolt which streams it
> should receive as input. A stream grouping defines how that stream should
> be partitioned among the bolt's tasks.
> There are eight built-in stream groupings in Storm, and you can implement
> a custom stream grouping by implementing the CustomStreamGrouping
> <https://storm.apache.org/releases/current/javadocs/org/apache/storm/grouping/CustomStreamGrouping.html>
> interface:
>
>    1. Shuffle grouping: Tuples are randomly distributed across the bolt's
>    tasks in a way such that each bolt is guaranteed to get an equal number of
>    tuples.
>
>
>    1. Fields grouping: The stream is partitioned by the fields specified
>    in the grouping. For example, if the stream is grouped by the "user-id"
>    field, tuples with the same "user-id" will always go to the same task, but
>    tuples with different "user-id"'s may go to different tasks.
>
>
>    1. Partial Key grouping: The stream is partitioned by the fields
>    specified in the grouping, like the Fields grouping, but are load balanced
>    between two downstream bolts, which provides better utilization of
>    resources when the incoming data is skewed. This paper
>    <https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf>
>    provides a good explanation of how it works and the advantages it provides.
>
>
>    1. All grouping: The stream is replicated across all the bolt's tasks.
>    Use this grouping with care.
>
>
>    1. Global grouping: The entire stream goes to a single one of the
>    bolt's tasks. Specifically, it goes to the task with the lowest id.
>
>
>    1. None grouping: This grouping specifies that you don't care how the
>    stream is grouped. Currently, none groupings are equivalent to shuffle
>    groupings. Eventually though, Storm will push down bolts with none
>    groupings to execute in the same thread as the bolt or spout they subscribe
>    from (when possible).
>
>
>    1. Direct grouping: This is a special kind of grouping. A stream
>    grouped this way means that the producer of the tuple decides which
>    task of the consumer will receive this tuple. Direct groupings can only be
>    declared on streams that have been declared as direct streams. Tuples
>    emitted to a direct stream must be emitted using one of the emitDirect
>    <https://storm.apache.org/releases/current/javadocs/org/apache/storm/task/OutputCollector.html#emitDirect-int-java.util.Collection-java.util.List->
>    methods. A bolt can get the task ids of its consumers by either using the
>    provided TopologyContext
>    <https://storm.apache.org/releases/current/javadocs/org/apache/storm/task/TopologyContext.html>
>    or by keeping track of the output of the emit method in OutputCollector
>    <https://storm.apache.org/releases/current/javadocs/org/apache/storm/task/OutputCollector.html>
>    (which returns the task ids that the tuple was sent to).
>
>
>    1. Local or shuffle grouping: If the target bolt has one or more tasks
>    in the same worker process, tuples will be shuffled to just those
>    in-process tasks. Otherwise, this acts like a normal shuffle grouping.
>
>
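
Of the groupings quoted above, partial key grouping is the one aimed at skewed
keys, but it balances load by letting a key go to either of two tasks, so
ordering for that key across the two tasks is no longer guaranteed. A rough
Python sketch of the two-choices idea follows; the hash salts, `NUM_TASKS`, and
the in-memory load counter are assumptions for illustration, not Storm's code.

```python
import zlib

NUM_TASKS = 4  # hypothetical number of downstream tasks


def candidates(key: str, num_tasks: int = NUM_TASKS):
    """Two candidate tasks for a key, from two differently salted hashes."""
    first = zlib.crc32(b"salt-1:" + key.encode("utf-8")) % num_tasks
    second = zlib.crc32(b"salt-2:" + key.encode("utf-8")) % num_tasks
    return first, second


def route_partial_key(keys, num_tasks: int = NUM_TASKS):
    """Send each tuple to the less loaded of its key's two candidate tasks."""
    load = [0] * num_tasks
    assignments = []
    for key in keys:
        first, second = candidates(key, num_tasks)
        target = first if load[first] <= load[second] else second
        load[target] += 1
        assignments.append((key, target))
    return assignments, load


assignments, load = route_partial_key(["hot-key"] * 10)
tasks_used = {task for _, task in assignments}
print("tasks used by hot-key:", sorted(tasks_used), "load:", load)
```

When the two candidates differ, tuples for one key alternate between two
tasks, which is why this grouping helps with skew but does not by itself
preserve processing order for that key.
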
On Thu, 13 Aug 2020 at 12:05, Jayant Sharma <sharmajayan...@gmail.com>
wrote:

> Hi,
>
> Is there a standard way of avoiding the bottleneck that arises from fields
> grouping between one bolt and another? I have a use case where half a
> million tuples have the same field value and go to the same bolt task
> because of fields grouping. I cannot use shuffle grouping here because it is
> important that these tuples are processed in sequence. Has anyone faced such
> an issue in Storm, and how were you able to resolve it?
>
> Thank you,
> Jayant
>
