Also, does groupByKey keep all values of the pair RDD for a given key in an
in-memory iterable on the reducer? That would lead to an OutOfMemoryError if
the values for a key exceed the memory of that node.
1. Is there a way to spill those values to disk?
2. If not, is it feasible to partition the pair RDD with a custom
partitioner so that all values of the same key land on the same node, with the
number of partitions equal to the number of distinct keys?
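A minimal sketch of that second approach, written in plain Java so it stands alone. The class name and the driver-side list of distinct keys are assumptions; in real Spark code this class would extend org.apache.spark.Partitioner and override numPartitions() and getPartition(), and the distinct keys would come from something like pairRdd.keys().distinct().collect() on the driver:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the Spark API itself): the partition-assignment
// logic a custom Spark Partitioner could implement -- one partition per
// distinct key, so all values for a key end up on the same node.
public class DistinctKeyPartitioner {
    private final Map<Object, Integer> keyToPartition = new HashMap<>();

    // distinctKeys would be collected on the driver before building the
    // partitioner; duplicates in the list are ignored.
    public DistinctKeyPartitioner(List<?> distinctKeys) {
        for (Object key : distinctKeys) {
            keyToPartition.putIfAbsent(key, keyToPartition.size());
        }
    }

    public int numPartitions() {
        return keyToPartition.size();
    }

    public int getPartition(Object key) {
        // Unknown keys fall back to partition 0 here; a real
        // implementation might hash them instead.
        return keyToPartition.getOrDefault(key, 0);
    }
}
```

Note that one partition per distinct key only scales to a modest number of keys, and collecting the distinct keys to the driver is itself an extra job over the data.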

On Sat, Nov 21, 2015 at 11:21 PM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Hi
>
> I have few doubts
>
> 1. Does
> rdd.saveAsNewAPIHadoopFile(outputdir, keyClass, valueClass, outputFormatClass)
> shuffle data, or will it always create the same number of files in the output
> dir as the number of partitions in the RDD?
>
> 2. How do I use multiple outputs with saveAsNewAPIHadoopFile so the file name
> is generated from the key, for output formats other than TextOutputFormat?
>
> 3. I have a JavaPairRDD<K, V> that I want to partition into a number of
> partitions equal to the number of distinct keys in the pair RDD.
>
>            1. Will pairRdd.groupByKey() create a new RDD with partitions
> equal to the number of distinct keys in the parent pair RDD?
>
>            2. Or will I have to calculate the distinct keys in the pair RDD
> (using pairRdd.keys().distinct().count()), then call a custom partitioner on
> the pair RDD with the number of partitions equal to the calculated distinct
> key count, partitioning by key?
>
> Thanks
>