Hi,

I was going through Matei's Advanced Spark presentation at
https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions.
The slides for this talk are at
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf

The PageRank example introduces partitioning as follows:
val ranks = // RDD of (url, rank) pairs
val links = sc.textFile(...).map(...).partitionBy(new HashPartitioner(8))

However, later on it is said that:
1) Any shuffle operation on two RDDs will take on the Partitioner of one of
them, if one is set.
Question 1: Could we have applied partitionBy to the ranks RDD instead and
achieved the same result/performance?
2) Otherwise, by default use HashPartitioner
Question 2: If partitionBy applies a HashPartitioner in this example, could we
simply have omitted the explicit partitioner and relied on the default
HashPartitioner to achieve the same result/performance?
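For reference, the two quoted rules can be seen together in a short sketch along the lines of the slides' PageRank code (the file path, line format, and partition count below are illustrative, not from the slides):

```scala
import org.apache.spark.HashPartitioner

// Explicit partitionBy: links is hash-partitioned once and cached, so
// every subsequent join against it reuses the same partitioner instead
// of re-shuffling links on each iteration.
val links = sc.textFile("links.txt")
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .partitionBy(new HashPartitioner(8))
  .persist()

// mapValues preserves the parent's partitioner, so ranks is
// co-partitioned with links from the start.
var ranks = links.mapValues(_ => 1.0)

// Rule 1: join adopts links' existing partitioner, so neither side is
// shuffled here. Without the explicit partitionBy, rule 2 applies: join
// falls back to a default HashPartitioner and shuffles both inputs.
val contribs = links.join(ranks)
```

The result is the same either way; what changes is how many times links crosses the network in an iterative job.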

I had another question, unrelated to this presentation.
Question 3: If my processing is something like this:
val rdd3 = rdd1.join(rdd2)
val rdd4 = rdd3.map { case (k, (v1, v2)) => (v1, k) }
val rdd6 = rdd4.join(rdd5)
rdd6.saveAsTextFile("out.txt")

would I benefit from partitioning? Unlike the PageRank example, I do not
join/shuffle the same RDD or key more than once.
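As a point of reference on the mechanics, here is a hedged sketch of the pipeline above (rdd1/rdd2/rdd5 are placeholders for the RDDs in the question, and the partition count is illustrative):

```scala
import org.apache.spark.HashPartitioner

// Pre-partition one side of the first join so the join can adopt it.
val part = new HashPartitioner(8)
val rdd1p = rdd1.partitionBy(part).persist()

// Adopts rdd1p's partitioner; only rdd2 is shuffled for this join.
val rdd3 = rdd1p.join(rdd2)

// map() may change the keys, so Spark drops the partitioner here:
// rdd4.partitioner == None, and the next join must shuffle regardless.
// (mapValues would preserve it, but this map swaps the key.)
val rdd4 = rdd3.map { case (k, (v1, v2)) => (v1, k) }

// Full shuffle of rdd4 (and of rdd5, unless it is already partitioned).
val rdd6 = rdd4.join(rdd5)
rdd6.saveAsTextFile("out.txt")
```

Since each RDD here is joined only once, pre-partitioning mainly shifts where a shuffle happens rather than eliminating one.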

Regards,
Sanjay
