Thank you, Iulian! That's precisely what I discovered today. Best, Marius
On Wed, Apr 22, 2015 at 3:31 PM Iulian Dragoș <iulian.dra...@typesafe.com> wrote:

> On Tue, Apr 21, 2015 at 2:38 PM, Marius Danciu <marius.dan...@gmail.com> wrote:
>
>> Hello anyone,
>>
>> I have a question regarding the sort shuffle. Roughly, I'm doing something
>> like:
>>
>> rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
>>
>> The problem is that in f2 I don't see the keys being sorted. The keys are
>> Java Comparable, not scala.math.Ordered or scala.math.Ordering (it would be
>> weird for each key to implement Ordering, as mentioned in the JIRA item
>> https://issues.apache.org/jira/browse/SPARK-2045).
>>
>> Questions:
>> 1. Do I need to explicitly sortByKey? (If I do this, I can see the keys
>> correctly sorted in f2.) But I'm worried about the extra cost, since
>> Spark 1.3.0 is supposed to use the SORT shuffle manager by default, right?
>
> AFAIK the sort shuffle does not sort *inside* each partition, unless the
> shuffle comes from a sort. Otherwise, the shuffle file contains keys in
> sorted order of their partition IDs. More details can be found in the
> design document attached to SPARK-2045
> <https://issues.apache.org/jira/browse/SPARK-2045>. This was enough to
> improve memory consumption.
>
>> 2. Do I need each key to be a scala.math.Ordered? Is Java
>> Comparable used at all?
>
> There are implicit conversions from Comparable to Ordered, but that only
> works for Scala code. Since you're using the Java API, I'm not sure what
> you mean here. You can call `JavaPairRDD.sortByKey(comp)` with your own
> comparator.
>
> cheers,
> iulian
>
>> ... btw I'm using Spark from Java ... don't ask me why :)
>>
>> Best,
>> Marius
>
> --
> Iulian Dragos
>
> ------
> Reactive Apps on the JVM
> www.typesafe.com
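To illustrate the advice above in plain Java, here is a minimal sketch (deliberately without a Spark dependency, so it runs standalone): grouping by key, like `groupByKey`, guarantees nothing about key order, so Comparable keys still need an explicit sort with a comparator, which is what passing a comparator to `JavaPairRDD.sortByKey(comp)` does on an RDD. The class name `SortAfterGroup` and helper `sortedKeys` are hypothetical names for this sketch only.

```java
import java.util.*;
import java.util.stream.*;

public class SortAfterGroup {

    // Explicitly sort Comparable keys with a comparator: the plain-Java
    // analogue of calling JavaPairRDD.sortByKey(comparator) after groupByKey.
    static <K extends Comparable<K>, V> List<K> sortedKeys(Map<K, V> grouped) {
        List<K> keys = new ArrayList<>(grouped.keySet());
        keys.sort(Comparator.naturalOrder()); // any java.util.Comparator works here
        return keys;
    }

    public static void main(String[] args) {
        // Grouping values by key (like groupByKey) gives no ordering guarantee.
        Map<String, List<Integer>> grouped = Stream.of(
                Map.entry("b", 2), Map.entry("a", 1), Map.entry("c", 3))
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                    Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        System.out.println(sortedKeys(grouped)); // prints [a, b, c]
    }
}
```

In a real Spark job the same comparator would go straight into `sortByKey`, which performs the sort as part of the shuffle rather than as a separate in-memory step.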