Re: Partitioning of Dataframes

2015-05-22 Thread Ted Yu
Looking at python/pyspark/sql/dataframe.py:

    @since(1.4)
    def coalesce(self, numPartitions):

    @since(1.3)
    def repartition(self, numPartitions):

Would the above methods serve the purpose? Cheers

On Fri, May 22, 2015 at 6:57 AM, Karlson wrote:
> Alright, that doesn't seem to have made it into the Python API yet.
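A minimal sketch of how the two methods Ted points at would be used, assuming Spark 1.4, a plain SQLContext, and a hypothetical input file:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitioning-example")
    sqlContext = SQLContext(sc)

    # Hypothetical input; any DataFrame works here.
    df = sqlContext.read.json("events.json")

    # repartition(numPartitions), since 1.3: full shuffle into the
    # requested number of partitions.
    df_wide = df.repartition(200)

    # coalesce(numPartitions), since 1.4: merges partitions without a
    # full shuffle, e.g. to shrink the partition count before writing.
    df_narrow = df_wide.coalesce(10)

    print(df_wide.rdd.getNumPartitions())   # 200
    print(df_narrow.rdd.getNumPartitions()) # 10

Note that both methods only control the number of partitions, not which key the data is hashed on.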

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Alright, that doesn't seem to have made it into the Python API yet.

On 2015-05-22 15:12, Silvio Fiorito wrote:
> This is added to 1.4.0 https://github.com/apache/spark/pull/5762
>
> On 5/22/15, 8:48 AM, "Karlson" wrote:
>> Hi,
>>
>> wouldn't df.rdd.partitionBy() return a new RDD that I would then need
>> to make into a Dataframe again?

Re: Partitioning of Dataframes

2015-05-22 Thread Silvio Fiorito
This is added to 1.4.0 https://github.com/apache/spark/pull/5762

On 5/22/15, 8:48 AM, "Karlson" wrote:
> Hi,
>
> wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
> make into a Dataframe again? Maybe like this:
> df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to
> me, though, and I'm not sure if the DF will be aware of its partitioning.

Re: Partitioning of Dataframes

2015-05-22 Thread Karlson
Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a Dataframe again? Maybe like this: df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit weird to me, though, and I'm not sure if the DF will be aware of its partitioning.

On 2015-05-22 12:55, ayan guha wrote: ...
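For reference, a rough sketch of the round trip being discussed, assuming a hypothetical column user_id to partition on; partitionBy only works on key/value RDDs, so the rows have to be keyed first and unwrapped afterwards:

    # Key each Row, hash-partition, drop the keys, rebuild the DataFrame.
    keyed = df.rdd.map(lambda row: (row.user_id, row))
    partitioned = keyed.partitionBy(100).values()
    df2 = partitioned.toDF(schema=df.schema)

As Karlson suspects, the partitioner set on the RDD is not carried into the new DataFrame's query plan, so Catalyst will typically still plan a shuffle for a subsequent join.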

Re: Partitioning of Dataframes

2015-05-22 Thread ayan guha
DataFrame is an abstraction over an RDD, so you should be able to do df.rdd.partitionBy. However, as far as I know, equi-joins already optimize partitioning. You may want to look at the explain plans more carefully and materialise interim joins.

On 22 May 2015 19:03, "Karlson" wrote:
> Hi,
>
> is there any way to control how Dataframes are partitioned?
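A small sketch of the inspection ayan suggests, assuming two hypothetical DataFrames a and b that each have an id column:

    # Equi-join, materialise the interim result, and inspect the plans.
    joined = a.join(b, a["id"] == b["id"])
    joined.cache()           # materialise the interim join before reusing it
    joined.explain(True)     # extended output: logical and physical plans
    joined.count()           # action that actually populates the cache

Shuffles show up as Exchange operators in the physical plan, which is where the large shuffle reads and writes in the UI come from; the same check applies to each interim join in a longer chain.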

Partitioning of Dataframes

2015-05-22 Thread Karlson
Hi, is there any way to control how Dataframes are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI. With PairRDDs you can control how the data is partitioned across nodes with partitionBy. There is no such method on Dataframes, however. ...
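For comparison, this is the PairRDD-level control the question refers to, with hypothetical toy data and an existing SparkContext sc:

    # Both pair RDDs are explicitly hashed by key into the same 8 partitions
    # before the join -- the kind of control being asked about for DataFrames.
    left = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(8)
    right = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(8)
    joined = left.join(right)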