Looking at python/pyspark/sql/dataframe.py:
@since(1.4)
def coalesce(self, numPartitions):
@since(1.3)
def repartition(self, numPartitions):
Would the above methods serve the purpose?
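For instance, a quick sketch of how they would be used (the input path and partition counts below are placeholders, not from the thread):

# Illustrative only: repartition(n) shuffles the data into n partitions,
# while coalesce(n) merges existing partitions without a full shuffle.
df = sqlContext.read.json("events.json")  # placeholder input

wide = df.repartition(200)   # full shuffle into 200 partitions
narrow = df.coalesce(10)     # shrink to 10 partitions, avoiding a shuffle

print(wide.rdd.getNumPartitions())    # 200
print(narrow.rdd.getNumPartitions())  # 10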
Cheers
On Fri, May 22, 2015 at 6:57 AM, Karlson wrote:
Alright, that doesn't seem to have made it into the Python API yet.
On 2015-05-22 15:12, Silvio Fiorito wrote:
This was added in 1.4.0:
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, "Karlson" wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
turn back into a DataFrame? Maybe like this:
df.rdd.partitionBy(...).toDF(schema=df.schema). That looks a bit odd to
me, though, and I'm not sure whether the resulting DataFrame will be
aware of its partitioning.
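Spelled out, the round trip would look roughly like this (the key column name "id" and the partition count are made up for illustration; partitionBy needs a pair RDD, hence the keyBy):

# Key the rows, partition, then drop the keys again.
pairs = df.rdd.keyBy(lambda row: row["id"])
partitioned = pairs.partitionBy(8).values()

# Rebuild a DataFrame with the original schema.
df2 = sqlContext.createDataFrame(partitioned, schema=df.schema)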
On 2015-05-22 12:55,
DataFrame is an abstraction over an RDD, so you should be able to do
df.rdd.partitionBy. However, as far as I know, equijoins already optimise
partitioning. You may want to look at the explain plans more carefully and
materialise interim joins.
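For example (a sketch with made-up table and column names):

joined = orders.join(customers, orders["cust_id"] == customers["id"])
joined.explain()  # shuffles show up as Exchange operators in the plan
joined.cache()    # mark the interim join for reuse ...
joined.count()    # ... and run an action so it is actually materialised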
On 22 May 2015 19:03, "Karlson" wrote:
Hi,
is there any way to control how DataFrames are partitioned? I'm doing
lots of joins and am seeing very large shuffle reads and writes in the
Spark UI. With PairRDDs you can control how the data is partitioned
across nodes with partitionBy. There is no such method on DataFrames,
however. Ca
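For reference, the PairRDD technique the question alludes to looks roughly like this (toy data and an arbitrary partition count, assuming a SparkContext sc):

# Co-partition both sides so the join does not need a full shuffle.
left = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(4)
right = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(4)

# Both RDDs now share the same hash partitioner, so join() is narrow.
print(left.join(right).collect())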