Hi, can someone share some working code for a custom partitioner in Python?
I am trying to understand it better. Here is the documentation:

    partitionBy(numPartitions, partitionFunc=<function portable_hash>)
    https://spark.apache.org/docs/1.3.1/api/python/pyspark.html#pyspark.RDD.partitionBy
    Return a copy of the RDD partitioned using the specified partitioner.

What I am trying to do:
1. Create a dataframe
2. Partition it using one specific column
3. Create another dataframe
4. Partition it on the same column
5. Join the two (to enforce a map-side join)

My questions:
a) Am I on the right path?
b) How can I do the partitionBy? Specifically, when I call DF.rdd.partitionBy, what gets passed to the custom function -- a tuple, or a Row? And how do I access, say, the 3rd column of that tuple inside the partitioner function?

--
Best Regards,
Ayan Guha
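For question (b), here is a minimal sketch of how this could look. It assumes a Spark 1.3-era PySpark API (`SQLContext`, `SparkContext`); the sample data, the "region" column, and the `region_partitioner` helper are my own illustration, not from the docs. The key point: `partitionBy` works on an RDD of (key, value) pairs and passes only the *key* to `partitionFunc`, so the partitioner never sees the whole Row -- you pick the 3rd column out with `keyBy` first.

```python
# Sketch only -- assumes Spark 1.3-era PySpark; the sample data, the
# "region" column, and region_partitioner are illustrative assumptions.

NUM_PARTITIONS = 4

def region_partitioner(key):
    # partitionBy passes ONLY the key of each (key, value) pair here,
    # never the whole tuple/Row. It must return an int in
    # [0, NUM_PARTITIONS). Using a deterministic toy hash here, since
    # Python's built-in hash() of strings can vary between processes
    # and would then route the same key to different partitions.
    return sum(ord(c) for c in str(key)) % NUM_PARTITIONS

if __name__ == "__main__":
    # pyspark imports kept inside the guard so the partitioner above
    # can be read and tested on its own.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="copartition-demo")
    sqlContext = SQLContext(sc)

    df1 = sqlContext.createDataFrame(
        [(1, "a", "east"), (2, "b", "west")], ["id", "val", "region"])
    df2 = sqlContext.createDataFrame(
        [(1, "x", "east"), (2, "y", "west")], ["id", "other", "region"])

    # DF.rdd is an RDD of Row objects, not (key, value) pairs, so key
    # each Row by its 3rd column (row[2]) before partitioning; the
    # partitioner then only ever receives that column's value.
    rdd1 = df1.rdd.keyBy(lambda row: row[2]) \
                  .partitionBy(NUM_PARTITIONS, region_partitioner)
    rdd2 = df2.rdd.keyBy(lambda row: row[2]) \
                  .partitionBy(NUM_PARTITIONS, region_partitioner)

    # Both sides now share the same partitioner and partition count,
    # so join() can co-locate matching keys without another shuffle.
    joined = rdd1.join(rdd2)
    print(joined.collect())
```

Whether this fully "enforces" a map-side join I'm less sure about, but co-partitioning both RDDs with the same function and partition count is the standard way to let `join()` avoid a shuffle at join time.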