Thank you Mayur, I think that will help me a lot.
Best,
Tao

2014-02-26 8:56 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:

> The types of shuffling are best explained by Matei in his Spark Internals
> talk: http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
> Why don't you look at that and then ask follow-up questions here? It would
> also be worth watching the whole talk, as it covers Spark job flows in a
> lot more detail.
>
> Scala:
>
> import org.apache.spark.RangePartitioner
>
> val file = sc.textFile("<my local path>")
> val partitionedFile = file.map(x => (x, 1))
> val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
> data.glom().collect()(0).length
> data.glom().collect()(1).length
> data.glom().collect()(2).length
>
> This will sample the RDD partitionedFile and then try to split it into
> partitions of almost equal size.
> Do not call collect() if your data size is huge, as this may OOM the
> driver; write it to disk in that case.
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
> On Tue, Feb 25, 2014 at 1:19 AM, Tao Xiao <xiaotao.cs....@gmail.com> wrote:
>
>> I am a newbie to Spark and I need to know how RDD partitioning can be
>> controlled in the process of shuffling. I have googled for examples but
>> haven't found many concrete ones, in contrast with the fact that there
>> are many good tutorials about Hadoop's shuffling and partitioner.
>>
>> Can anybody show me good tutorials explaining the process of shuffling in
>> Spark, as well as examples of how to use a customized partitioner?
>>
>> Best,
>> Tao
>>
>
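Since Tao's original question also asked for an example of a customized partitioner: a minimal sketch of what one looks like. Spark's `org.apache.spark.Partitioner` contract requires a `numPartitions` value and a `getPartition(key: Any): Int` method. The class name, the routing rule, and the demo harness below are illustrative assumptions, written as a plain Scala class (not extending the Spark trait) so the routing logic itself runs without the Spark dependency.

```scala
// Sketch of a custom partitioner following Spark's Partitioner contract
// (numPartitions and getPartition(key: Any): Int). In real Spark code this
// class would extend org.apache.spark.Partitioner; here it is a plain class
// so the routing logic can be run and tested without Spark on the classpath.
class FirstLetterPartitioner(val numPartitions: Int) {
  require(numPartitions > 0, "need at least one partition")

  // Illustrative rule: route string keys by their first character's code
  // point modulo the partition count; everything else goes to partition 0.
  def getPartition(key: Any): Int = key match {
    case s: String if s.nonEmpty => math.abs(s.head.toInt) % numPartitions
    case _                       => 0
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val p = new FirstLetterPartitioner(3)
    // 'a' is code point 97, so "apple" lands in partition 97 % 3 = 1.
    println(p.getPartition("apple"))  // 1
    println(p.getPartition("banana")) // 'b' is 98, so partition 2
    println(p.getPartition(42))       // non-string key, partition 0
  }
}
```

With Spark on the classpath you would have the class extend `org.apache.spark.Partitioner` and hand it to `partitionBy` on a pair RDD, e.g. `partitionedFile.partitionBy(new FirstLetterPartitioner(3))`, in the same way the RangePartitioner is used above.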