What are you trying to accomplish?  Internally, Spark SQL will add Exchange
operators to make sure that data is partitioned correctly for joins and
aggregations.  If you are going to do other RDD operations on the result of
DataFrame operations and you need to manually control the partitioning,
call df.rdd and partition it as you normally would.
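The df.rdd route works because HashPartitioner places every record with the
same key in the same partition index, so two datasets partitioned on the same
key with the same partition count end up co-partitioned. Here is a minimal
Python sketch of that placement rule (this is not Spark code; the key column,
partition count, and sample rows are hypothetical, and Spark itself uses the
Scala hashCode rather than Python's hash):

```python
def hash_partition(key, num_partitions):
    """Mimic HashPartitioner: assign a key to a partition by hash modulo
    the partition count (Python's % already yields a non-negative result)."""
    return hash(key) % num_partitions

num_partitions = 4
# Rows keyed by the shared column value (column 0 here, hypothetically).
rows = [("doc1", 10), ("doc2", 20), ("doc1", 30), ("doc3", 40)]

# Bucket each row into its target partition on the key column.
partitions = {i: [] for i in range(num_partitions)}
for row in rows:
    partitions[hash_partition(row[0], num_partitions)].append(row)

# Every row sharing a key lands in the same bucket, which is what lets a
# later join on that key avoid a shuffle when both sides are partitioned
# the same way.
```

In Spark itself the equivalent would be something along the lines of
keying the RDD by the shared column and calling partitionBy(new
HashPartitioner(n)) before converting back with toDF(), though as noted
above Spark SQL will insert its own Exchange when it needs to.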

On Fri, May 8, 2015 at 2:47 PM, Daniel, Ronald (ELS-SDG) <
r.dan...@elsevier.com> wrote:

> Hi,
>
> How can I ensure that a batch of DataFrames I make are all partitioned
> based on the value of one column common to them all?
> For RDDs I would partitionBy a HashPartitioner, but I don't see that in
> the DataFrame API.
> If I partition the RDDs that way, then do a toDF(), will the partitioning
> be preserved?
>
> Thanks,
> Ron
>
>
