Hi,

I believe that partitionBy will use the same (default) partitioner on both RDDs.
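For example (a minimal Scala sketch with placeholder names, in the style of Imran's example below; HashPartitioner compares equal by number of partitions):

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(100)           // any fixed partition count
val rddA = someData.partitionBy(part)         // someData, someOtherData are placeholders
val rddB = someOtherData.partitionBy(part)
assert(rddA.partitioner == rddB.partitioner)  // both sides report Some(part)

If the two partitioners compare equal, the join can plan narrow dependencies on both parents.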

On 2015-02-12 17:12, Sean Owen wrote:
Doesn't this require that both RDDs have the same partitioner?

On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid <iras...@cloudera.com> wrote:
Hi Karlson,

I think your assumptions are correct -- a join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by the lazy
evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized and cached somewhere? E.g., if you just did this:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(N)      // partitionBy takes a Partitioner, not an Int
val rddA = someData.partitionBy(part)
val rddB = someOtherData.partitionBy(part)
val joinedRdd = rddA.join(rddB)
joinedRdd.count() // or any other action

then the partitioning isn't actually run until you perform the action. So although the join itself can be planned without a shuffle, joinedRdd.count() will trigger the evaluation of rddA and rddB, and their partitionBy steps will still require shuffles.
Note that even if some intervening action has already shuffled rddA and rddB,
unless you persist the result, you will need to reshuffle
them for the join.
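For instance (a sketch, reusing the hypothetical names above), persisting right after the partitionBy pays each shuffle exactly once:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(N)
val rddA = someData.partitionBy(part).cache()
val rddB = someOtherData.partitionBy(part).cache()
rddA.count()                    // materializes (and shuffles) each side once
rddB.count()
val joinedRdd = rddA.join(rddB) // co-partitioned and cached: narrow dependency
joinedRdd.count()               // the join itself needs no further shuffle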

If this doesn't help explain things, then for debugging you could try:

joinedRdd.partitions.foreach(println)

This is getting into the weeds, but at least it should tell us whether or not you are getting narrow dependencies, which would avoid the shuffle.
(Does anyone know of a simpler way to check this?)
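One option (a sketch) is to print the lineage: Spark indents each stage in the output, and a shuffle boundary shows up as a ShuffledRDD at a new indentation level, so a narrow join keeps the cogroup and both parents in the same stage:

println(joinedRdd.toDebugString)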

hope this helps,
Imran




On Thu, Feb 12, 2015 at 9:25 AM, Karlson <ksonsp...@siberie.de> wrote:

Hi All,

Using PySpark, I create two RDDs (one with about 2M records (~200MB), the
other with about 8M records (~2GB)) of the format (key, value).

I've called partitionBy(num_partitions) on both RDDs and verified that both RDDs have the same number of partitions and that equal keys reside in
the same partition (via mapPartitionsWithIndex).
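Roughly, the check looks like this (a sketch, shown in Scala here for consistency with the example above, though my code is PySpark; rddA and rddB are placeholders):

val tagA = rddA.mapPartitionsWithIndex { (i, iter) =>
  iter.map { case (k, _) => (k, i) }   // tag each key with its partition index
}
val tagB = rddB.mapPartitionsWithIndex { (i, iter) =>
  iter.map { case (k, _) => (k, i) }
}
// Count keys that landed in different partitions of the two RDDs.
val mismatches = tagA.join(tagB).filter { case (_, (ia, ib)) => ia != ib }.count()

Here mismatches comes out as 0.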

Now I'd expect that no shuffling is necessary for a join on the two RDDs. Looking at the Web UI under http://driver:4040, however, reveals that this
assumption is false.

In fact I am seeing shuffle writes of about 200MB and reads of about 50MB.

What's the explanation for that behaviour? Where does my assumption go
wrong?

Thanks in advance,

Karlson
