[ 
https://issues.apache.org/jira/browse/STORM-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siwoon Son updated STORM-2291:
------------------------------
    Description: 
I‘d like to discuss the hash collision issue that occurs when applying the 
_Grouping_ method [http://storm.apache.org/releases/current/Concepts.html] to 
_Windowing_ method [http://storm.apache.org/releases/current/Windowing.html] in 
Storm.

I first assume the following situation. Spout constantly emits tuples to Bolt. 
At this time, Bolt tries to perform operations while moving tuples of a certain 
interval. e.g. moving average, etc. To solve this situation, Storm provides 
_Windowing_ method.

However, consider the following complex situation. Two problems are added in 
the above situation.
# As a first problem, the tuple emitted by Spout is multidimensional with 
multiple pieces of information. For example, Alice, Bob, and Clark are mapped 
to random real numbers. That is, the tuples emitted from Spout are {[Alice, 
0.18322], [Clark, 0.57833], [Bob, 0.27902], [Clark, 0.24553], [Alice, 0.50164], 
[Alice, 0.06463], ...}. While those tuples are transmitted to the next windowed 
bolt, they must be necessarily separated by keys such as Alice, Bob, and Clark. 
In other words, each tuples in which Alice, Bob, and Clark are mapped must 
belong to different windows.
# The second problem is Storm's parallelism. Spout and Bolt can be operated as 
multiple objects on multiple servers. This problem is that the tuples with the 
same key must be emit in the same window even if they are created by a 
different Spout object.

Storm can specify the Bolt objects which the tuples is to be input as a 
_Grouping_ method. Storm provides various _Grouping_ methods, but a fields 
grouping is best suited as a way to solve the above problems. The fields 
grouping is a way of partitioning an input stream by a specified field. With 
the fields grouping, tuples of the same field can only be passed to the same 
Bolt object. However, the fields grouping has been implemented as a hash 
method. ([http://storm.apache.org/releases/current/Tutorial.html]) Therefore, 
it can be cause *a hash collision* problem that can include the tuples in the 
same window although they have different fields. So I am interested in solving 
the hash collision.

The source code of 
[https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/as_is]
 is a situation where the hash collision occurs. First, Spout randomly emits 
the tuples mapping Alice, Bob, and Clark on random real numbers. Next, Bolt 
which extends BaseWindowedBolt prints the TupleWindow objects received from 
Spout. At this time, Bolt uses the fields grouping. Spout emits three fields: 
Alice, Bob, and Clark. If the parallelism of Bolt is set less than 3, it surely 
cause the hash collision. Conversely, the greater the parallelism of Bolt than 
3, the lower the probability of the hash collision. But, it can not be 
guaranteed that the hash collision does not occur.

As an alternative to this hash collision, I used two-step IRichBolt instead of 
BaseWindowedBolt. The source code for this is 
[https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/to_be].
 The first Bolt is important. This Bolt takes tuples from Spout and manages 
them through a hash map of list according to the field. If the list is as 
filled as a predefined window size, the oldest tuple is removed and a new tuple 
is added. And then, Bolt emits this list to the next Bolt.

Using this method, the above problems can be solved. That is, the tuples of the 
same fields is always managed in the same window regardless of the number of 
fields and the number of parallelism. 

I'm concerned that this problem will frequently happen to many Storm users who 
use the Windowing method. If there is not a better way than the one I 
presented, I think there should be a new grouping method for Windowing method.

  was:
I‘d like to discuss the hash collision issue that occurs when applying the 
_Grouping_ method [http://storm.apache.org/releases/current/Concepts.html] to 
_Windowing_ method [http://storm.apache.org/releases/current/Windowing.html] in 
Storm.

I first assume the following situation. Spout constantly emits tuples to Bolt. 
At this time, Bolt tries to perform operations while moving tuples of a certain 
interval. e.g. moving average, etc. To solve this situation, Storm provides 
_Windowing_ method.

However, consider the following complex situation. Two problems are added in 
the above situation.
# As a first problem, the tuple emitted by Spout is multidimensional with 
multiple pieces of information. For example, Alice, Bob, and Clark are mapped 
to random real numbers. That is, the tuples emitted from Spout are {[Alice, 
0.18322], [Clark, 0.57833], [Bob, 0.27902], [Clark, 0.24553], [Alice, 0.50164], 
[Alice, 0.06463], ...}. While those tuples are transmitted to the next windowed 
bolt, they must be necessarily separated by keys such as Alice, Bob, and Clark. 
In other words, each tuples in which Alice, Bob, and Clark are mapped must 
belong to different windows.
# The second problem is Storm's parallelism. Spout and Bolt can be operated as 
multiple objects on multiple servers. This problem is that the tuples with the 
same key must be emit in the same window even if they are created by a 
different Spout object.

Storm can specify the Bolt objects which the tuples is to be input as a 
_Grouping_ method. Storm provides various _Grouping_ methods, but a fields 
grouping is best suited as a way to solve the above problems. The fields 
grouping is a way of partitioning an input stream by a specified field. With 
the fields grouping, tuples of the same field can only be passed to the same 
Bolt object. However, the fields grouping has been implemented as a hash 
method. ([http://storm.apache.org/releases/current/Tutorial.html]) Therefore, 
it can be cause *a hash collision* problem that can include the tuples in the 
same window although they have different fields. So I am interested in solving 
the hash collision.

The source code of 
[https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/as_is]
 is a situation where the hash collision occurs. First, Spout randomly emits 
the tuples mapping Alice, Bob, and Clark on random real numbers. Next, Bolt 
which extends BaseWindowedBolt prints the TupleWindow objects received from 
Spout. At this time, Bolt uses the fields grouping. Spout emits three fields: 
Alice, Bob, and Clark. If the parallelism of Bolt is set less than 3, it surely 
cause the hash collision. Conversely, the greater the parallelism of Bolt than 
3, the lower the probability of the hash collision. But, it can not be 
guaranteed that the hash collision does not occur.

As an alternative to this hash collision, I used two-step IRichBolt instead of 
BaseWindowedBolt. The source code for this is 
[https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/to_be].
 The first Bolt is important. This Bolt takes tuples from Spout and manages 
them through a hash map of list according to the field. If the list is as 
filled as a predefined window size, the oldest tuple is removed and a new tuple 
is added. And then, Bolt emits this list to the next Bolt.

Using this method, the above problems can be solved. That is, the tuples of the 
same fields is always managed in the same window regardless of the number of 
fields and the number of parallelism. 

I'm concerned that this problem will frequently happen to many Storm users who 
use the _Windowing_ method. If there is not a better way than the one I 
presented, I think there should be a new grouping method for _Windowing_ method.


> A Hash Collision Problem of Fields Grouping in Windowing Method
> ---------------------------------------------------------------
>
>                 Key: STORM-2291
>                 URL: https://issues.apache.org/jira/browse/STORM-2291
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>    Affects Versions: 1.x
>            Reporter: Siwoon Son
>             Fix For: 1.x
>
>
> I‘d like to discuss the hash collision issue that occurs when applying the 
> _Grouping_ method [http://storm.apache.org/releases/current/Concepts.html] to 
> _Windowing_ method [http://storm.apache.org/releases/current/Windowing.html] 
> in Storm.
> I first assume the following situation. Spout constantly emits tuples to 
> Bolt. At this time, Bolt tries to perform operations while moving tuples of a 
> certain interval. e.g. moving average, etc. To solve this situation, Storm 
> provides _Windowing_ method.
> However, consider the following complex situation. Two problems are added in 
> the above situation.
> # As a first problem, the tuple emitted by Spout is multidimensional with 
> multiple pieces of information. For example, Alice, Bob, and Clark are mapped 
> to random real numbers. That is, the tuples emitted from Spout are {[Alice, 
> 0.18322], [Clark, 0.57833], [Bob, 0.27902], [Clark, 0.24553], [Alice, 
> 0.50164], [Alice, 0.06463], ...}. While those tuples are transmitted to the 
> next windowed bolt, they must be necessarily separated by keys such as Alice, 
> Bob, and Clark. In other words, each tuples in which Alice, Bob, and Clark 
> are mapped must belong to different windows.
> # The second problem is Storm's parallelism. Spout and Bolt can be operated 
> as multiple objects on multiple servers. This problem is that the tuples with 
> the same key must be emit in the same window even if they are created by a 
> different Spout object.
> Storm can specify the Bolt objects which the tuples is to be input as a 
> _Grouping_ method. Storm provides various _Grouping_ methods, but a fields 
> grouping is best suited as a way to solve the above problems. The fields 
> grouping is a way of partitioning an input stream by a specified field. With 
> the fields grouping, tuples of the same field can only be passed to the same 
> Bolt object. However, the fields grouping has been implemented as a hash 
> method. ([http://storm.apache.org/releases/current/Tutorial.html]) Therefore, 
> it can be cause *a hash collision* problem that can include the tuples in the 
> same window although they have different fields. So I am interested in 
> solving the hash collision.
> The source code of 
> [https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/as_is]
>  is a situation where the hash collision occurs. First, Spout randomly emits 
> the tuples mapping Alice, Bob, and Clark on random real numbers. Next, Bolt 
> which extends BaseWindowedBolt prints the TupleWindow objects received from 
> Spout. At this time, Bolt uses the fields grouping. Spout emits three fields: 
> Alice, Bob, and Clark. If the parallelism of Bolt is set less than 3, it 
> surely cause the hash collision. Conversely, the greater the parallelism of 
> Bolt than 3, the lower the probability of the hash collision. But, it can not 
> be guaranteed that the hash collision does not occur.
> As an alternative to this hash collision, I used two-step IRichBolt instead 
> of BaseWindowedBolt. The source code for this is 
> [https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/to_be].
>  The first Bolt is important. This Bolt takes tuples from Spout and manages 
> them through a hash map of list according to the field. If the list is as 
> filled as a predefined window size, the oldest tuple is removed and a new 
> tuple is added. And then, Bolt emits this list to the next Bolt.
> Using this method, the above problems can be solved. That is, the tuples of 
> the same fields is always managed in the same window regardless of the number 
> of fields and the number of parallelism. 
> I'm concerned that this problem will frequently happen to many Storm users 
> who use the Windowing method. If there is not a better way than the one I 
> presented, I think there should be a new grouping method for Windowing method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to