Hi, I was going through Matei's Advanced Spark presentation at https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions. The slides for this video are at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
The PageRank example introduces partitioning like this:

    val ranks = // RDD of (url, rank) pairs
    val links = sc.textFile(...).map(...).partitionBy(new HashPartitioner(8))

However, later on it is said that:

1) Any shuffle operation on two RDDs will take on the partitioner of one of them, if one is set.

Question 1: Could we have applied partitionBy to the ranks RDD instead and gotten the same result/performance?

2) Otherwise, a HashPartitioner is used by default.

Question 2: If partitionBy applies a HashPartitioner in this example, could we simply have omitted the partitioner and relied on the default HashPartitioner to achieve the same result/performance?

I have another question unrelated to this presentation.

Question 3: If my processing looks like this:

    val rdd3 = rdd1.join(rdd2)
    val rdd4 = rdd3.map { case (k, (v1, v2)) => (v1, k) }
    val rdd6 = rdd4.join(rdd5)
    rdd6.saveAsTextFile("out.txt")

would I benefit from partitioning? Unlike the PageRank example, I do not have to join/shuffle the same RDD or key more than once.

Regards,
Sanjay
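To make my reading of point 2) concrete, here is a minimal plain-Scala sketch of how I understand hash partitioning to assign keys to partitions (an assumption based on the documented behavior of HashPartitioner, not Spark's actual source):

```scala
// Sketch of hash partitioning (assumption: mirrors the documented behavior
// of Spark's HashPartitioner; this is not Spark's actual implementation).
def getPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw // keep the index non-negative
}

// Equal keys always land in the same partition, which is why two RDDs
// partitioned by the same HashPartitioner can be joined without
// re-shuffling either of them.
Seq("a.com", "b.com", "a.com").foreach { url =>
  println(s"$url -> partition ${getPartition(url, 8)}")
}
```

If this mental model is right, my Question 2 amounts to asking whether the explicit partitionBy only matters because links is reused across iterations, rather than because of the choice of partitioner itself.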