[ 
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060668#comment-14060668
 ] 

Hans Uhlig edited comment on SPARK-2278 at 7/14/14 2:14 PM:
------------------------------------------------------------

So I can see two places where this becomes painful quickly.

Maps, while cheap, are not free; they also functionally describe a data mutation 
rather than a transformation modifier. This might seem like a small syntactic 
nuance, but it can make processing large datasets painful.
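
For illustration, a minimal sketch of the workaround this forces today. 
EventKey is a hypothetical composite key (sketched further below); only the 
mapToPair call is real Spark 1.0 API. To group on fewer fields you have to 
rewrite every key, a full pass over the data that exists only to change 
grouping semantics:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// counts is keyed by a composite (who, what, where, when) key.
JavaPairRDD<EventKey, Long> counts = /* ... */;

// A map over every record just to blank out one key field: a data mutation
// standing in for what is really only a change in comparison behavior.
JavaPairRDD<EventKey, Long> rekeyed = counts.mapToPair(
    new PairFunction<Tuple2<EventKey, Long>, EventKey, Long>() {
      public Tuple2<EventKey, Long> call(Tuple2<EventKey, Long> t) {
        EventKey k = t._1();
        // zero out 'when' so equality now means (who, what, where) only
        return new Tuple2<EventKey, Long>(
            new EventKey(k.who, k.what, k.where, 0L), t._2());
      }
    });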

Secondly, handling composite keys. Spark seems to ignore, almost everywhere, 
that keys contain data too; there is no need to replicate your key into your 
value field for a couple hundred billion records. I don't want six separate 
copies of a Key class representing who, what, where, and when just because I 
need to sort or group them in different orders. I often order things by their 
natural order: who, what, where, and when. I then group by a lesser order: 
who, what, where. I shouldn't need to create an entire new Key class just to 
change the ordering.
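
To make that concrete, a sketch of one key class carrying two orderings 
(entirely hypothetical; for use in Spark the Comparators would also need to 
be Serializable):

import java.io.Serializable;
import java.util.Comparator;

// One composite key class; the orderings live in Comparators, so no
// duplicate Key classes are needed to sort or group differently.
public class EventKey implements Serializable {
  public final String who, what, where;
  public final long when;

  public EventKey(String who, String what, String where, long when) {
    this.who = who; this.what = what; this.where = where; this.when = when;
  }

  // Natural order: who, what, where, when.
  public static final Comparator<EventKey> FULL_ORDER = new Comparator<EventKey>() {
    public int compare(EventKey a, EventKey b) {
      int c = a.who.compareTo(b.who);
      if (c == 0) c = a.what.compareTo(b.what);
      if (c == 0) c = a.where.compareTo(b.where);
      if (c == 0) c = Long.compare(a.when, b.when);
      return c;
    }
  };

  // Lesser order for grouping: who, what, where only.
  public static final Comparator<EventKey> GROUP_ORDER = new Comparator<EventKey>() {
    public int compare(EventKey a, EventKey b) {
      int c = a.who.compareTo(b.who);
      if (c == 0) c = a.what.compareTo(b.what);
      if (c == 0) c = a.where.compareTo(b.where);
      return c;
    }
  };
}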

I can see something like this:

JavaRDD<T> JavaRDD.sortBy(Comparator<T> comp, Partitioner partitioner, int 
numPartitions)
JavaPairRDD<K, V> JavaPairRDD.sortByKey(Comparator<K> comp, Partitioner 
partitioner, int numPartitions)
JavaPairRDD<K, Iterable<T>> JavaRDD.groupBy(Function<T, K> func, Comparator<K> 
comp, Partitioner partitioner, int numPartitions)
JavaPairRDD<K, Iterable<V>> JavaPairRDD.groupByKey(Comparator<K> comp, 
Partitioner partitioner, int numPartitions)
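
If overloads like these existed, switching the grouping order would be a 
one-line change rather than a new key class. A hypothetical call against the 
proposed signatures above:

// Group on (who, what, where) by passing the lesser-order comparator;
// this groupByKey overload is the proposal, not the current Spark API.
JavaPairRDD<EventKey, Iterable<Long>> grouped = counts.groupByKey(
    EventKey.GROUP_ORDER, new HashPartitioner(64), 64);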

Also, what is the rationale for none of the reduction functions (reduceByKey, 
groupByKey, etc.) receiving the key of the data they are reducing?
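
For reference, the current reduce signature only ever sees values (the 
Function2 shape is real Spark 1.0 API; the key/value types are the 
hypothetical ones from above):

import org.apache.spark.api.java.function.Function2;

// reduceByKey's Function2<V, V, V> is never handed the key it is
// combining under, so key-dependent reduction logic is impossible here.
JavaPairRDD<EventKey, Long> summed = counts.reduceByKey(
    new Function2<Long, Long, Long>() {
      public Long call(Long a, Long b) { // no EventKey parameter available
        return a + b;
      }
    });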



> groupBy & groupByKey should support custom comparator
> -----------------------------------------------------
>
>                 Key: SPARK-2278
>                 URL: https://issues.apache.org/jira/browse/SPARK-2278
>             Project: Spark
>          Issue Type: New Feature
>          Components: Java API
>    Affects Versions: 1.0.0
>            Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key 
> equality function in groupBy/groupByKey similar to sortByKey. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
