spark shuffle

Shushant Arora Sat, 21 Nov 2015 09:51:38 -0800

Hi

I have a few questions:


1. Does rdd.saveAsNewAPIHadoopFile(outputdir, keyclass, valueclass,
outputformatclass) shuffle data, or will it always create the same number
of files in the output dir as there are partitions in the RDD?

2. How can I use multiple outputs with saveAsNewAPIHadoopFile so that the
file name is generated from the key, for output formats other than
TextOutputFormat?

3. I have a JavaPairRDD<K,V> and want to partition it into a number of
partitions equal to the number of distinct keys in the pair RDD.

   1. Will pairrdd.groupByKey() create a new RDD with as many partitions
      as there are distinct keys in the parent pair RDD?

   2. Or will I have to count the distinct keys myself (using
      pairrdd.keys().distinct().count()) and then apply a custom
      partitioner to the pair RDD, with the number of partitions set to
      that count, so that it is partitioned by key?
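
To make option 2 concrete, here is a minimal sketch of the key-to-partition
mapping I have in mind. It is plain Java with no Spark dependency, and the
class name KeyIndexPartitioner is hypothetical, not something Spark
provides. It uses an explicit map rather than a hash because, as far as I
understand, a hash-based partitioner with that many partitions can still
put two different keys into the same partition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark: gives every distinct key
// its own partition index, in order of first appearance.
final class KeyIndexPartitioner<K> {
    private final Map<K, Integer> indexByKey = new LinkedHashMap<>();

    KeyIndexPartitioner(Iterable<K> keys) {
        for (K key : keys) {
            // A repeated key keeps the index it was given the first time.
            indexByKey.putIfAbsent(key, indexByKey.size());
        }
    }

    // One partition per distinct key.
    int numPartitions() {
        return indexByKey.size();
    }

    // Returns the partition index for a key seen at construction time.
    int getPartition(K key) {
        Integer index = indexByKey.get(key);
        if (index == null) {
            throw new IllegalArgumentException("Unknown key: " + key);
        }
        return index;
    }
}
```

In Spark I assume the same logic would sit in a subclass of
org.apache.spark.Partitioner (overriding numPartitions() and
getPartition(Object key)), be built from
pairrdd.keys().distinct().collect(), and be passed to
pairrdd.partitionBy(...). That only seems practical when the set of
distinct keys is small enough to collect on the driver.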

Thanks
