I have implemented a map-side join with broadcast variables; the code is on the mailing list (Scala).
On Mon, May 4, 2015 at 8:38 PM, ayan guha <guha.a...@gmail.com> wrote:
> Hi
>
> Can someone share some working code for a custom partitioner in Python?
>
> I am trying to understand it better. Here is the documentation:
>
> partitionBy(numPartitions, partitionFunc=<function portable_hash at 0x2c45140>)
> <https://spark.apache.org/docs/1.3.1/api/python/pyspark.html#pyspark.RDD.partitionBy>
>
> Return a copy of the RDD partitioned using the specified partitioner.
>
> What I am trying to do:
>
> 1. Create a dataframe
> 2. Partition it using one specific column
> 3. Create another dataframe
> 4. Partition it on the same column
> 5. Join (to enforce a map-side join)
>
> My questions:
>
> a) Am I on the right path?
>
> b) How can I do partitionBy? Specifically, when I call DF.rdd.partitionBy,
> what gets passed to the custom function? A tuple? A row? How do I access
> (say) the 3rd column of a tuple inside the partitioner function?
>
> --
> Best Regards,
> Ayan Guha

--
Deepak