[ 
https://issues.apache.org/jira/browse/STORM-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064194#comment-16064194
 ] 

Arun Mahadevan commented on STORM-2258:
---------------------------------------

We already have a groupByKey implementation. What we are looking for is a 
coGroupByKey which is more like a join but instead of returning the cross 
product of the matching keys, groups together values for the same key from the 
joined streams. For example,

Say stream1 has values - (k1, v1), (k2, v2), (k2, v3)  
and stream2 has values - (k1, x1), (k1, x2), (k3, x3) 

The the co-grouped stream would contain - 

{noformat}
(k1, ([v1], [x1, x2]), (k2, ([v2, v3], [])), (k3, ([], [x3]))
{noformat}

Since you are co-grouping two streams containing key-value pairs, you would 
define the operation on the PairStream class, with the signature of the method 
to be something like,

{noformat}
public <V1> PairStream<K, Pair<Iterable<V>, Iterable<V1>>> 
coGroupByKey(PairStream<K, V1> otherStream) {
        // ...
}
{noformat}

To get some hints on how to go about implementing this, you can take a look at 
the implementation of the join operation. (see JoinProcessor.java)



> Streams api - support CoGroupByKey
> ----------------------------------
>
>                 Key: STORM-2258
>                 URL: https://issues.apache.org/jira/browse/STORM-2258
>             Project: Apache Storm
>          Issue Type: Sub-task
>            Reporter: Arun Mahadevan
>
> Group together values with same key from both streams. Similar constructs are 
> supported in beam, spark and flink.
> When called on a Stream of (K, V) and (K, W) pairs, return a new Stream of 
> (K, Seq[V], Seq[W]) tuples
> See also - https://cloud.google.com/dataflow/model/group-by-key



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to