Hi,

is there any way to control how DataFrames are partitioned? I'm doing lots of joins and am seeing very large shuffle reads and writes in the Spark UI. With PairRDDs you can control how the data is partitioned across nodes with partitionBy, but there is no such method on DataFrames. Can I somehow partition the underlying RDD manually? I am currently using the Python API.

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
