distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Deenar Toraskar

Hi

I have data in HDFS partitioned by a logical key and would like to preserve
the partitioning when creating a dataframe for the same. Is it possible to
create a dataframe that preserves partitioning from HDFS or the underlying
RDD?

Regards
Deenar

Re: distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Simon Elliston Ball

If you load data using ORC or Parquet, the RDD will have roughly one partition
per file, so in fact your DataFrame will not directly match the partitioning of
the table.

If you want to process the data partition by partition and guarantee that the
partitioning is preserved, then mapPartitions etc. will be useful.
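
As a rough illustration of per-partition processing, here is a minimal sketch
that counts the rows in each partition without triggering a shuffle. It assumes
Spark 2.x or later (SparkSession did not exist in the 1.6 release current at the
time of this thread), and it builds a small DataFrame with spark.range in place
of a real ORC or Parquet table. The object name and the local[4] master are
placeholders for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object PerPartitionSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .appName("per-partition-sketch")
      .master("local[4]")
      .getOrCreate()

    // Stand-in for a table loaded from HDFS: 1000 rows created in 4 partitions.
    val df = spark.range(0, 1000, 1, 4).toDF("id")

    // mapPartitionsWithIndex is a narrow transformation: each input partition
    // is handed to the function as an iterator, and no data moves between
    // partitions. Here every partition emits one (index, rowCount) pair.
    val counts = df.rdd
      .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
      .collect()

    counts.foreach { case (idx, n) => println(s"partition $idx: $n rows") }

    spark.stop()
  }
}
```

The same pattern applies to a DataFrame read from files; the partition layout
you see is whatever the reader produced at load time.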

Note that if you perform any DataFrame operations which shuffle, you will end
up implicitly re-partitioning into spark.sql.shuffle.partitions partitions
(default 200).
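
To see the effect of a shuffle on the partition count, something like the
following sketch may help. It again assumes Spark 2.x or later and synthetic
data; the column name "key" and the value 8 are arbitrary choices for the
example. Note that on Spark 3.x adaptive query execution may coalesce shuffle
partitions, so the count printed after the groupBy can be lower than the
configured value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-sketch")
      .master("local[4]")
      // Override the default of 200 shuffle partitions for this session.
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()

    // Synthetic data: 1000 rows in 4 partitions, with a derived grouping key.
    val df = spark.range(0, 1000, 1, 4).withColumn("key", col("id") % 10)
    println(s"as created:        ${df.rdd.getNumPartitions}")

    // groupBy forces a shuffle, so the result is laid out according to
    // spark.sql.shuffle.partitions rather than the input partitioning.
    val grouped = df.groupBy("key").count()
    println(s"after groupBy:     ${grouped.rdd.getNumPartitions}")

    // Explicit hash repartitioning by a column, the DataFrame counterpart of
    // DISTRIBUTE BY in SQL. This is itself a shuffle.
    val byKey = df.repartition(8, col("key"))
    println(s"after repartition: ${byKey.rdd.getNumPartitions}")

    spark.stop()
  }
}
```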

Simon

> On 13 Jan 2016, at 10:09, Deenar Toraskar wrote:
> 
> Hi
> 
> I have data in HDFS partitioned by a logical key and would like to preserve 
> the partitioning when creating a dataframe for the same. Is it possible to 
> create a dataframe that preserves partitioning from HDFS or the underlying 
> RDD?
> 
> Regards
> Deenar
