Appreciate the follow-up.

I am not entirely sure how or why my question is related to bucketization 
capabilities. It indeed sounds like a powerful feature for avoiding shuffles, but 
in my case I am referring to the straightforward process of reading data and 
writing it to parquet.
If bucketed tables allow setting up the buckets ahead of reading and specifying 
the parallelism when writing directly, then you hit the nail on the head.

My problem is that reading from the source (usually hundreds of text files) turns 
into DataFrames with 10k+ partitions, based on the block size and the number of 
input splits. Writing these back is a huge overhead for parquet and requires 
repartitioning to reduce heap memory usage, especially on wide tables.
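Roughly what I do today, as a sketch (paths and partition counts below are just 
placeholders); coalesce at least avoids a full shuffle before the write:

  // Sketch of the current pattern (placeholder paths/values):
  // the raw read ends up with 10k+ partitions, so I shrink the DataFrame
  // before writing parquet; coalesce keeps it a narrow dependency instead
  // of a full shuffle.
  val raw = sqlContext.read.text("hdfs:///path/to/source/*.txt")
  val reduced = raw.coalesce(200)               // placeholder target partition count
  reduced.write.parquet("hdfs:///path/to/target")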

Let's see how it goes.
Saif


From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, June 03, 2016 2:55 PM
To: Ellafi, Saif A.
Cc: user; Reynold Xin; mich...@databricks.com
Subject: Re: Strategies for propery load-balanced partitioning

I suppose you are running 1.6.
I guess you need a solution based on the [1], [2] features that are coming in 
2.0.

[1] https://issues.apache.org/jira/browse/SPARK-12538 / 
https://issues.apache.org/jira/browse/SPARK-12394
[2] https://issues.apache.org/jira/browse/SPARK-12849

However, I did not check for examples; I would like to add to your question and 
ask the community to link to some examples of the recent improvements/changes.
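As a rough, untested sketch of the bucketed write from [2] (the 2.0 
DataFrameWriter API; bucket count, column and table names below are made up), it 
should look something like:

  // Bucketing fixes the layout at write time, so later reads/joins on the
  // bucket column can avoid a shuffle. Bucketed output has to go through
  // saveAsTable rather than a plain path.
  df.write
    .bucketBy(200, "customer_id")     // placeholder bucket count and column
    .sortBy("customer_id")
    .saveAsTable("events_bucketed")   // placeholder table name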

It could help, however, to give a concrete example of your specific problem, as 
you may also be hitting stragglers, probably caused by data skew.

Best,
Ovidiu


On 03 Jun 2016, at 17:31, saif.a.ell...@wellsfargo.com wrote:

Hello everyone!

I was noticing that, when reading parquet files or actually any kind of source 
DataFrame data (spark-csv, etc.), the default partitioning is not fair.
Action tasks usually run very fast on some partitions and very slow on others, 
and frequently they are even fast on all but the last partition (which looks like 
it reads more than 50% of the input data size).

I notice that each task loads some portion of the data, say 1024 MB chunks, but 
some tasks load 20+ GB of data.
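To make it concrete, this is roughly how I look at the imbalance (just a sketch; 
df is whatever I just read):

  // Count the records held by each partition and print the heaviest ones,
  // so the skewed partitions stand out.
  val counts = df.rdd
    .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
    .collect()
  counts.sortBy { case (_, n) => -n }.take(10).foreach { case (idx, n) =>
    println(s"partition $idx -> $n records")
  }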

Applying repartitioning strategies solves this issue properly and general 
performance increases considerably, but for very large DataFrames 
repartitioning is a costly process.

In short, what are the available strategies or configurations that help read 
from disk or HDFS with a proper distribution of data across executors?

If this needs to be more specific, I am strictly focused on PARQUET files from 
HDFS. I know there are some MIN
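For illustration, these are the kind of MIN/MAX split-size knobs I have in mind 
(names and values are only examples, and I am not sure they actually take effect 
on the Parquet read path, which is part of the question):

  // Bound the Hadoop input split size before the scan, hoping for fewer /
  // more even read tasks instead of one huge trailing partition.
  sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize",
    (256L * 1024 * 1024).toString)            // 256 MB, example value
  sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
    (512L * 1024 * 1024).toString)            // 512 MB, example value
  val df = sqlContext.read.parquet("hdfs:///path/to/parquet_table")  // placeholder path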

Really appreciate it,
Saif
