A DataFrame is an abstraction over an RDD, so you should be able to do df.rdd.partitionBy. However, as far as I know, equi-joins already optimize partitioning. You may want to look at the explain plans more carefully and materialise interim joins. On 22 May 2015 19:03, "Karlson" <[email protected]> wrote:
> Hi,
>
> is there any way to control how Dataframes are partitioned? I'm doing
> lots of joins and am seeing very large shuffle reads and writes in the
> Spark UI. With PairRDDs you can control how the data is partitioned
> across nodes with partitionBy. There is no such method on Dataframes
> however. Can I somehow partition the underlying RDD manually? I am
> currently using the Python API.
>
> Thanks!
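For what it's worth, the idea behind rdd.partitionBy can be sketched in plain Python without a cluster. The hash function below is a simplified stand-in for Spark's portable_hash, and the key name used in the commented PySpark line is hypothetical, but the invariant is the same: equal keys always land in the same partition index, so two datasets partitioned the same way can be joined without a shuffle.

```python
# Sketch of hash partitioning, the scheme rdd.partitionBy uses.
# In PySpark the equivalent would be roughly (key column name is an
# assumption): df.rdd.map(lambda row: (row.key, row)).partitionBy(4)

def assign_partition(key, num_partitions):
    """Assign a key to a partition; same key -> same partition, always."""
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    """Group (key, value) pairs into partitions, like rdd.partitionBy."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[assign_partition(key, num_partitions)].append((key, value))
    return partitions

left = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
right = [(1, "x"), (3, "y")]

# Both sides partitioned with the same function and partition count:
# records sharing a key sit in the same partition index on both sides,
# so the join for that key can be done locally, with no shuffle.
left_parts = partition_records(left, 4)
right_parts = partition_records(right, 4)
```

Note this is why the partitioner and the number of partitions must match on both sides; if they differ, Spark has to shuffle one side anyway.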
