Hi

Can someone share some working code for a custom partitioner in Python?

I am trying to understand it better.

Here is the documentation:

partitionBy(numPartitions, partitionFunc=portable_hash)
<https://spark.apache.org/docs/1.3.1/api/python/pyspark.html#pyspark.RDD.partitionBy>

Return a copy of the RDD partitioned using the specified partitioner.


What I am trying to do:

1. Create a dataframe

2. Partition it using one specific column

3. Create another dataframe

4. Partition it on the same column

5. Join (to enforce a map-side join)

My questions:

a) Am I on the right path?

b) How can I do partitionBy? Specifically, when I call DF.rdd.partitionBy,
what gets passed to the custom function? A tuple? A Row? How do I access,
say, the 3rd column of a tuple inside the partitioner function?

-- 
Best Regards,
Ayan Guha
