If maintaining the order of the messages is a requirement, fields grouping seems to be the only strategy that ensures that all tuples of the same partition will be sent to the same task ID.
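
To make the reasoning concrete, here is a minimal sketch of why fields grouping keeps one key on one task. It is not Storm's code: it assumes the target task is picked by hashing the grouping field values modulo the number of tasks, and the names FieldsGroupingSketch and chooseTask are hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldsGroupingSketch {

    // Hypothetical helper (an assumption, not a Storm API): choose a task index
    // from the grouping field values by hash modulo the number of tasks.
    public static int chooseTask(List<Object> groupingValues, int numTasks) {
        // floorMod keeps the index non-negative even when the hash is negative.
        return Math.floorMod(Arrays.deepHashCode(groupingValues.toArray()), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        String[] userIds = {"alice", "bob", "alice", "carol", "alice", "bob"};

        // Record, per task index, the tuples it would receive in arrival order.
        Map<Integer, List<String>> received = new TreeMap<>();
        for (int seq = 0; seq < userIds.length; seq++) {
            int task = chooseTask(Arrays.<Object>asList(userIds[seq]), numTasks);
            received.computeIfAbsent(task, t -> new ArrayList<>())
                    .add(userIds[seq] + "#" + seq);
        }

        // Every tuple of a given user-id shows up under exactly one task index.
        received.forEach((task, tuples) ->
                System.out.println("task " + task + " <- " + tuples));
    }
}
```

The same property is the source of the bottleneck in the question below: if half a million tuples share one key, a scheme like this routes all of them to a single task no matter how many tasks the bolt has.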

https://storm.apache.org/releases/current/Concepts.html

Stream groupings

> Part of defining a topology is specifying for each bolt which streams it
> should receive as input. A stream grouping defines how that stream should
> be partitioned among the bolt's tasks.
>
> There are eight built-in stream groupings in Storm, and you can implement
> a custom stream grouping by implementing the CustomStreamGrouping
> <https://storm.apache.org/releases/current/javadocs/org/apache/storm/grouping/CustomStreamGrouping.html>
> interface:
>
> 1. Shuffle grouping: Tuples are randomly distributed across the bolt's
>    tasks in a way such that each bolt is guaranteed to get an equal number
>    of tuples.
>
> 2. Fields grouping: The stream is partitioned by the fields specified
>    in the grouping. For example, if the stream is grouped by the "user-id"
>    field, tuples with the same "user-id" will always go to the same task,
>    but tuples with different "user-id"'s may go to different tasks.
>
> 3. Partial Key grouping: The stream is partitioned by the fields
>    specified in the grouping, like the Fields grouping, but are load
>    balanced between two downstream bolts, which provides better
>    utilization of resources when the incoming data is skewed. This paper
>    <https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf>
>    provides a good explanation of how it works and the advantages it
>    provides.
>
> 4. All grouping: The stream is replicated across all the bolt's tasks.
>    Use this grouping with care.
>
> 5. Global grouping: The entire stream goes to a single one of the
>    bolt's tasks. Specifically, it goes to the task with the lowest id.
>
> 6. None grouping: This grouping specifies that you don't care how the
>    stream is grouped. Currently, none groupings are equivalent to shuffle
>    groupings. Eventually though, Storm will push down bolts with none
>    groupings to execute in the same thread as the bolt or spout they
>    subscribe from (when possible).
>
> 7. Direct grouping: This is a special kind of grouping. A stream
>    grouped this way means that the producer of the tuple decides which
>    task of the consumer will receive this tuple. Direct groupings can only
>    be declared on streams that have been declared as direct streams.
>    Tuples emitted to a direct stream must be emitted using one of the
>    emitDirect
>    <https://storm.apache.org/releases/current/javadocs/org/apache/storm/task/OutputCollector.html#emitDirect-int-java.util.Collection-java.util.List->
>    methods. A bolt can get the task ids of its consumers by either using
>    the provided TopologyContext
>    <https://storm.apache.org/releases/current/javadocs/org/apache/storm/task/TopologyContext.html>
>    or by keeping track of the output of the emit method in OutputCollector
>    <https://storm.apache.org/releases/current/javadocs/org/apache/storm/task/OutputCollector.html>
>    (which returns the task ids that the tuple was sent to).
>
> 8. Local or shuffle grouping: If the target bolt has one or more tasks
>    in the same worker process, tuples will be shuffled to just those
>    in-process tasks. Otherwise, this acts like a normal shuffle grouping.

On Thu, 13 Aug 2020 at 12:05, Jayant Sharma <sharmajayan...@gmail.com> wrote:

> Hi,
>
> Is there a standard way of avoiding bottleneck which arises due to fields
> grouping from one bolt to another. I have a use case where half a million
> tuples have the same field and go to the same bolt task because of field
> grouping. I cannot use shuffle grouping here because it is important that
> these get processed in a sequence. Has anyone faced such an issue in Storm,
> how were you able to resolve it?
>
> Thank you,
> Jayant