Thank you, Iulian! That's precisely what I discovered today. Best, Marius
On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș <iulian.dra...@typesafe.com> wrote:

> On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu <marius.dan...@gmail.com> wrote:
>
>> Hello anyone,
>>
>> I have a question regarding the sort shuffle. Roughly, I'm doing something
>> like:
>>
>> rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
>>
>> The problem is that in f2 I don't see the keys being sorted. The keys are
>> Java Comparable, not scala.math.Ordered or scala.math.Ordering (it would be
>> weird for each key to implement Ordering, as mentioned in the JIRA item
>> https://issues.apache.org/jira/browse/SPARK-2045).
>>
>> Questions:
>> 1. Do I need to explicitly sortByKey? (If I do this, I can see the keys
>> correctly sorted in f2.) But I'm worried about the extra cost, since
>> Spark 1.3.0 is supposed to use the SORT shuffle manager by default, right?
>
> AFAIK the sort shuffle does not sort *inside* each partition, unless the
> shuffle comes from a sort. Otherwise, the shuffle file contains keys in
> sorted order of their partition IDs. More details can be found in the
> design document attached to SPARK-2045
> <https://issues.apache.org/jira/browse/SPARK-2045>. This was enough to
> improve memory consumption.
>
>> 2. Do I need each key to be a scala.math.Ordered? Is Java
>> Comparable used at all?
>
> There are implicit conversions from Comparable to Ordered, but that only
> works for Scala code. Since you're using the Java API, I'm not sure what
> you mean here. You can call `JavaPairRDD.sortByKey(comp)` with your own
> comparator.
>
> cheers,
> iulian
>
>> ... btw I'm using Spark from Java ... don't ask me why :)
>>
>> Best,
>> Marius
>
> --
> Iulian Dragos
>
> ------
> Reactive Apps on the JVM
> www.typesafe.com
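To illustrate the advice above in plain Java, here is a minimal sketch (deliberately without a Spark dependency, so it runs standalone): grouping by key, like `groupByKey`, guarantees nothing about key order, so Comparable keys still need an explicit sort with a comparator, which is what passing a comparator to `JavaPairRDD.sortByKey(comp)` does on an RDD. The class name `SortAfterGroup` and helper `sortedKeys` are hypothetical names for this sketch only.

```java
import java.util.*;
import java.util.stream.*;

public class SortAfterGroup {

    // Explicitly sort Comparable keys with a comparator: the plain-Java
    // analogue of calling JavaPairRDD.sortByKey(comparator) after groupByKey.
    static <K extends Comparable<K>, V> List<K> sortedKeys(Map<K, V> grouped) {
        List<K> keys = new ArrayList<>(grouped.keySet());
        keys.sort(Comparator.naturalOrder()); // any java.util.Comparator works here
        return keys;
    }

    public static void main(String[] args) {
        // Grouping values by key (like groupByKey) gives no ordering guarantee.
        Map<String, List<Integer>> grouped = Stream.of(
                Map.entry("b", 2), Map.entry("a", 1), Map.entry("c", 3))
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                    Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        System.out.println(sortedKeys(grouped)); // prints [a, b, c]
    }
}
```

In a real Spark job the same comparator would go straight into `sortByKey`, which performs the sort as part of the shuffle rather than as a separate in-memory step.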