[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060774#comment-14060774
 ] 

Sean Owen commented on SPARK-2278:
----------------------------------

The more direct parallel certainly also exists, if you want to write it that 
way. Given an RDD of V, you can first groupBy a derived key of some type K, to 
get an RDD of (K,Iterable[V]). From there, you can mapValues and apply a reduce 
function yourself. (Something analogous works for groupByKey.)
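
A minimal sketch of that two-step shape, for illustration only. It uses plain 
Java streams as a stand-in for an RDD so it runs without Spark; the class and 
method names are hypothetical and none of this is Spark API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in: a List plays the role of the RDD; this is not Spark API.
final class GroupThenReduceSketch {

    // Step 1: group values under a derived key, materializing each group in full.
    static <K, V> Map<K, List<V>> groupBy(List<V> values, Function<V, K> keyOf) {
        return values.stream().collect(Collectors.groupingBy(keyOf));
    }

    // Step 2: apply the reduce function to each group's values.
    static <K, V> Map<K, V> reduceGroups(Map<K, List<V>> groups, BinaryOperator<V> func) {
        return groups.entrySet().stream().collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> e.getValue().stream().reduce(func).get())); // groups are never empty
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);
        // Derived key: parity of the value (0 for even, 1 for odd).
        Map<Integer, List<Integer>> byParity = groupBy(values, v -> v % 2);
        Map<Integer, Integer> sums = reduceGroups(byParity, Integer::sum);
        System.out.println("even sum = " + sums.get(0) + ", odd sum = " + sums.get(1));
    }
}
```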

The big "but" to this approach is that you materialize all the values for a 
key at once, and then manually apply a reduce function. This is what 
reduceBy is doing under the hood for you, probably more optimally. Still, you 
could break it down if you needed more control.
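
For contrast, a sketch of reducing as you go instead of building each group 
first. Again this is plain Java streams rather than Spark, and the helper name 
reduceBy is hypothetical; treating it as a model of what a reduce-by-key 
operation does is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in: a List plays the role of the RDD; this is not Spark API.
final class ReduceWhileGroupingSketch {

    // Keep one running result per derived key and merge each new value into it
    // with func, so no per-key collection of values is ever built.
    static <K, V> Map<K, V> reduceBy(List<V> values, Function<V, K> keyOf, BinaryOperator<V> func) {
        return values.stream().collect(Collectors.toMap(keyOf, Function.identity(), func));
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);
        // Same derived key as a groupBy on parity, but values are summed on arrival.
        Map<Integer, Integer> sums = reduceBy(values, v -> v % 2, Integer::sum);
        System.out.println("even sum = " + sums.get(0) + ", odd sum = " + sums.get(1));
    }
}
```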

The key part is that you get to define the value K that determines grouping. 
That's what you need, and it's why you don't necessarily need a Comparator 
anywhere to get your job done.
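
As a hedged illustration of that point, here is case-insensitive grouping done 
with a derived key rather than a custom equality function. It is a plain-Java 
sketch with hypothetical names, not Spark code; the idea is only that values 
whose derived keys are equal land in the same group.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example: normalize each value into a key instead of comparing values.
final class DerivedKeySketch {

    // Two words fall into the same group exactly when their lower-cased forms match,
    // so ordinary key equality gives the "custom" grouping.
    static Map<String, List<String>> groupIgnoringCase(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Spark", "SPARK", "hadoop", "spark");
        Map<String, List<String>> groups = groupIgnoringCase(words);
        System.out.println("spark group: " + groups.get("spark"));
        System.out.println("hadoop group: " + groups.get("hadoop"));
    }
}
```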

Yes, understanding the 'func' is key, and it's more obvious coming from Scala. 
It answers the requirements you have as far as I understand them (with the 
caveat above about ordering and sortBy). I suggest resolving this by 
proposing a doc update somewhere.

> groupBy & groupByKey should support custom comparator
> -----------------------------------------------------
>
>                 Key: SPARK-2278
>                 URL: https://issues.apache.org/jira/browse/SPARK-2278
>             Project: Spark
>          Issue Type: New Feature
>          Components: Java API
>    Affects Versions: 1.0.0
>            Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
