[ https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060774#comment-14060774 ]
Sean Owen commented on SPARK-2278:
----------------------------------

The more direct parallel certainly also exists, if you want to write it that way. Given an RDD of V, you can first groupBy some derived value type K, to get an RDD of (K,Iterable[V]). From there, you can mapValues and apply a reduce function yourself. (And something analogous for groupByKey.)

The big "but" to this approach is that you materialize the values all together at once for a key, and then manually apply a reduce function. This is what reduceBy is doing under the hood for you, probably more optimally. Still, you could break it down if you needed more control.

The part where you get to define the value K that determines grouping -- that's what you need, and why you don't necessarily need a Comparator anywhere to get your job done.

Yes, understanding the 'func' is key, and it's more obvious coming from Scala. It answers the requirements you have as far as I understand them (with the caveat above about ordering and sortBy). I suggest you resolve this by proposing a doc update somewhere.

> groupBy & groupByKey should support custom comparator
> -----------------------------------------------------
>
>                 Key: SPARK-2278
>                 URL: https://issues.apache.org/jira/browse/SPARK-2278
>             Project: Spark
>          Issue Type: New Feature
>          Components: Java API
>    Affects Versions: 1.0.0
>            Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key
> equality function in groupBy/groupByKey similar to sortByKey.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
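
The two approaches described in the comment above (group by a derived key K and then reduce each group's values by hand, versus reducing per key without ever building the groups) can be sketched roughly as follows. This is only an illustration in plain Python with the standard library, not Spark code: the lists stand in for an RDD, and the helper names (group_by, group_then_reduce, reduce_by_derived_key) and the sample data are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def group_by(values, key_func):
    """Group values by a derived key K, giving K -> list of V."""
    groups = defaultdict(list)
    for v in values:
        # Every value for a key is held in memory at the same time.
        groups[key_func(v)].append(v)
    return dict(groups)


def group_then_reduce(values, key_func, reduce_func):
    """Group first, then apply the reduce function to each group by hand."""
    grouped = group_by(values, key_func)
    return {k: reduce(reduce_func, vs) for k, vs in grouped.items()}


def reduce_by_derived_key(values, key_func, reduce_func):
    """Keep one running result per key; the groups are never materialized."""
    acc = {}
    for v in values:
        k = key_func(v)
        acc[k] = reduce_func(acc[k], v) if k in acc else v
    return acc


# Hypothetical data: case-insensitive grouping with no comparator involved.
# The derived key (the lower-cased word) decides which values belong together.
words = ["Apple", "apple", "BANANA", "banana", "Cherry"]

via_groups = group_then_reduce(words, str.lower, min)
via_running = reduce_by_derived_key(words, str.lower, min)

print(via_groups)
print(via_running)
```

Both functions give the same result here because the reduce function (min) is associative and commutative. The difference is only in how much is held in memory per key, which is the "but" raised in the comment.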