I want several RDDs (each derived from operations on existing RDDs in my
program) to match the partitioning of an existing RDD, since they will all
be joined together in the end. Do I understand correctly that I would
benefit from applying the same custom partitioner to all of these RDDs, so
that the join can avoid a shuffle?
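For context, here is roughly what I have in mind (just a minimal sketch;
the sample data and partition count are made up):

    from pyspark import SparkContext
    from pyspark.rdd import portable_hash

    sc = SparkContext(appName="copartition-sketch")
    num_partitions = 8

    # Partition the base (key, value) RDD once up front.
    base = sc.parallelize([(i, i * i) for i in range(100)]) \
             .partitionBy(num_partitions, portable_hash)

    # mapValues is supposed to preserve the partitioning, so these
    # derived RDDs stay co-partitioned with the base and each other.
    a = base.mapValues(lambda v: v + 1)
    b = base.mapValues(lambda v: v * 2)

    # I'm hoping this join then avoids re-shuffling either side.
    joined = a.join(b)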

Secondly, how do I accomplish this in PySpark? The docs barely mention it,
and the only thing I could find was:

    partitionBy(self, numPartitions, partitionFunc=portable_hash)

What is this "partitionFunc", and how do I use it to create something like
"HashPartitioner" that I can re-use for multiple RDDs?

Thanks!


