I have implemented a map-side join with broadcast variables; the code (in
Scala) is on the mailing list.
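
The idea, sketched in plain Python for clarity (names and sample data here are hypothetical; in Spark the small side would be wrapped with sc.broadcast() and the lookup done inside a map() over the large RDD):

```python
# Minimal sketch of the map-side join pattern, in plain Python.
# In Spark, `small` would be sc.broadcast(small) and the lookup would
# happen via small_bc.value inside a map() over the large RDD, so no
# shuffle of the large side is needed.

# Hypothetical small (broadcastable) side, keyed by id.
small = {1: "alice", 2: "bob"}

# Hypothetical large side: (id, amount) records.
large = [(1, 100), (2, 200), (1, 50), (3, 75)]

def map_side_join(records, lookup):
    """Join each record against the broadcast lookup table.
    This is an inner join: keys missing from the lookup are dropped."""
    return [(k, v, lookup[k]) for k, v in records if k in lookup]

result = map_side_join(large, small)
# result: [(1, 100, 'alice'), (2, 200, 'bob'), (1, 50, 'alice')]
```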


On Mon, May 4, 2015 at 8:38 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> Can someone share some working code for custom partitioner in python?
>
> I am trying to understand it better.
>
> Here is documentation
>
> partitionBy(*numPartitions*, *partitionFunc=portable_hash*)
> <https://spark.apache.org/docs/1.3.1/api/python/pyspark.html#pyspark.RDD.partitionBy>
>
> Return a copy of the RDD partitioned using the specified partitioner.
>
>
> What I am trying to do:
>
> 1. Create a dataframe
>
> 2. Partition it using one specific column
>
> 3. create another dataframe
>
> 4. partition it on the same column
>
> 5. join (to enforce map-side join)
>
> My question:
>
> a) Am I on the right path?
>
> b) How can I do partitionBy? Specifically, when I call DF.rdd.partitionBy,
> what gets passed to the custom function? A tuple? A Row? How do I access,
> say, the 3rd column of a tuple inside the partitioner function?
>
> --
> Best Regards,
> Ayan Guha
>
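
On (b): partitionBy works on an RDD of (key, value) pairs, and partitionFunc receives only the key, not the whole row. So to partition on a specific column, you would first key each Row by that column (e.g. something like df.rdd.map(lambda row: (row[2], row)) or keyBy) and then call partitionBy. A pure-Python simulation of those semantics (no Spark needed; the sample rows and names are hypothetical):

```python
# Pure-Python simulation of what PySpark's partitionBy does with
# partitionFunc: the function sees ONLY the key of each (key, value)
# pair and returns an int, which is taken modulo numPartitions to
# pick the target partition.

num_partitions = 2

# Hypothetical rows: pretend each tuple is a Row with 3 columns.
rows = [("a", 1, "us"), ("b", 2, "eu"), ("c", 3, "us"), ("d", 4, "eu")]

# Step 1: key each row by the column to partition on (3rd column here),
# the equivalent of df.rdd.map(lambda row: (row[2], row)).
pairs = [(row[2], row) for row in rows]

def partition_func(key):
    """Custom partitioner: sees only the key, must return an int."""
    return hash(key)

# Step 2: what partitionBy does with the function's result.
partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    partitions[partition_func(key) % num_partitions].append((key, value))
```

The point for the join plan above: because both DataFrames would be keyed on the same column and partitioned with the same function and partition count, rows with equal keys land in the same partition on both sides.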



-- 
Deepak
