[ https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060668#comment-14060668 ]
Hans Uhlig edited comment on SPARK-2278 at 7/14/14 2:14 PM:
------------------------------------------------------------

So I can see two places where this becomes painful quickly. First, maps, while cheap, are not free; they also functionally describe a data mutation rather than a transformation modifier. This might seem like a small syntactic nuance, but it can make processing large datasets painful.

Secondly, handling composite keys. Something Spark seems to ignore almost everywhere is that keys contain data too; there is no need to replicate your key into your value field for a couple hundred billion records. I don't want to have six separate copies of a Key class representing who/what/where/when just because I need to sort or group records in different orders. I often order things by natural order: who, what, where, and when. I then group by a lesser order: who, what, where. I shouldn't need to create an entirely new Key class just to change the ordering. I can see something like this:

JavaRDD<T> JavaRDD.sortBy(Comparator<T> comp, Partitioner partitioner, int numPartitions)
JavaPairRDD<K,V> JavaPairRDD.sortByKey(Comparator<K> comp, Partitioner partitioner, int numPartitions)
JavaPairRDD<K,Iterable<T>> JavaRDD.groupBy(Function<T,K> func, Comparator<K> comp, Partitioner partitioner, int numPartitions)
JavaPairRDD<K,Iterable<V>> JavaPairRDD.groupByKey(Comparator<K> comp, Partitioner partitioner, int numPartitions)

Also, what is the rationale for none of the reduction functions (reduceBy, groupBy, etc.) receiving the key of the data they are reducing?

> groupBy & groupByKey should support custom comparator
> -----------------------------------------------------
>
>                 Key: SPARK-2278
>                 URL: https://issues.apache.org/jira/browse/SPARK-2278
>             Project: Spark
>          Issue Type: New Feature
>          Components: Java API
>    Affects Versions: 1.0.0
>            Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key
> equality function in groupBy/groupByKey similar to sortByKey.
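The composite-key scenario in the comment can be sketched in plain Java, with no Spark involved. The `EventKey` class and its field names are hypothetical, invented only for illustration: one key class plus two `Comparator` chains covers both the full natural order (who, what, where, when) and the coarser grouping order (who, what, where), which is the reuse that a comparator-aware `sortByKey`/`groupByKey` would allow without defining a second key class.

```java
import java.util.Comparator;

public class CompositeKeyDemo {
    // Hypothetical composite key: who, what, where, when.
    static final class EventKey {
        final String who, what, where;
        final long when;
        EventKey(String who, String what, String where, long when) {
            this.who = who; this.what = what; this.where = where; this.when = when;
        }
    }

    // Full natural order: who, what, where, then timestamp.
    static final Comparator<EventKey> NATURAL =
        Comparator.comparing((EventKey k) -> k.who)
                  .thenComparing(k -> k.what)
                  .thenComparing(k -> k.where)
                  .thenComparingLong(k -> k.when);

    // Coarser grouping order: who, what, where (ignores the timestamp).
    static final Comparator<EventKey> GROUPING =
        Comparator.comparing((EventKey k) -> k.who)
                  .thenComparing(k -> k.what)
                  .thenComparing(k -> k.where);

    public static void main(String[] args) {
        EventKey a = new EventKey("alice", "click", "home", 1L);
        EventKey b = new EventKey("alice", "click", "home", 2L);
        // NATURAL still distinguishes the two events by timestamp,
        // while GROUPING treats them as members of the same group.
        System.out.println(NATURAL.compare(a, b) < 0);
        System.out.println(GROUPING.compare(a, b) == 0);
    }
}
```

The same `EventKey` serves both orderings; only the comparator passed at the call site changes, which is the point the comment is making about composite keys.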