Appreciate the follow-up. I am not entirely sure how or why my question is related to bucketing capabilities. It indeed sounds like a powerful feature for avoiding shuffles, but in my case I am referring to the straightforward process of reading data and writing it to Parquet. If bucketed tables allow buckets to be set up ahead of read time, and parallelism to be specified when writing directly, then you hit the nail on the head.
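For reference, the bucketed-write API that landed with SPARK-12538 looks roughly like the sketch below (Spark 2.0 DataFrameWriter; the table and column names are placeholders I made up, not anything from this thread):

```scala
// Sketch only: requires Spark 2.0+. "accounts_bucketed", "account_id" and
// "event_date" are hypothetical names for illustration.
df.write
  .bucketBy(200, "account_id")       // pre-hash rows into 200 buckets at write time
  .sortBy("event_date")              // optionally sort within each bucket
  .format("parquet")
  .saveAsTable("accounts_bucketed")  // bucketing metadata is kept in the metastore
```

Because the bucket layout is recorded in the metastore, later joins or aggregations on the bucketing column can avoid a shuffle; it does not by itself change how an initial read of plain text files is split.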
My problem is that reading from source (usually hundreds of text files) turns into 10k+ partition DataFrames, depending on the partitions' block size and number of input splits. Writing these back is a huge overhead for Parquet and requires repartitioning to reduce heap memory usage, especially on wide tables. Let's see how it goes.

Saif

From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
Sent: Friday, June 03, 2016 2:55 PM
To: Ellafi, Saif A.
Cc: user; Reynold Xin; mich...@databricks.com
Subject: Re: Strategies for properly load-balanced partitioning

I suppose you are running on 1.6. I guess you need a solution based on the [1], [2] features which are coming in 2.0.

[1] https://issues.apache.org/jira/browse/SPARK-12538 / https://issues.apache.org/jira/browse/SPARK-12394
[2] https://issues.apache.org/jira/browse/SPARK-12849

However, I did not check for examples; I would like to add to your question and ask the community to link to some examples of the recent improvements/changes. It could help, however, to give a concrete example of your specific problem, as you may also be hitting stragglers, probably caused by data skew.

Best,
Ovidiu

On 03 Jun 2016, at 17:31, saif.a.ell...@wellsfargo.com wrote:

Hello everyone!

I was noticing that, when reading Parquet files, or actually any kind of source DataFrame data (spark-csv, etc.), the default partitioning is not fair. Action tasks usually run very fast on some partitions and very slowly on others, and frequently run fast on all but the last partition (which looks like it reads 50%+ of the input data size). I notice that some tasks load a small portion of the data, say 1024 MB chunks, while other tasks load 20+ GB. Applying repartitioning strategies solves this issue properly and general performance increases considerably, but for very large DataFrames, repartitioning is a costly process.
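The repartition-before-write workaround described above can be sketched as follows (a minimal 1.6-era sketch; the paths and the target partition count of 200 are placeholders, not values from the thread):

```scala
// Sketch: collapse a 10k+-partition DataFrame before writing Parquet.
// coalesce() merges partitions without a full shuffle; repartition()
// rebalances skewed data at the cost of a shuffle.
val df = sqlContext.read.text("hdfs:///path/to/source/*")  // hypothetical path

// Cheap, but keeps whatever skew the input had:
df.coalesce(200).write.parquet("hdfs:///path/to/output")

// More expensive, but evens out partition sizes when input is skewed:
df.repartition(200).write.parquet("hdfs:///path/to/output")
```

The trade-off is exactly the cost mentioned above: repartition() fixes skew but adds a full shuffle of a very large DataFrame, while coalesce() is cheap but cannot split an oversized partition.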
In short, what are the available strategies or configurations that help with reading from disk or HDFS with proper executor data distribution? If this needs to be more specific, I am strictly focused on PARQUET files from HDFS. I know there are some MIN

Really appreciate it,
Saif
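The MIN settings alluded to above are, I believe, the Hadoop input-split size bounds, which cap how much data each read task receives on 1.6; Spark 2.0's file sources additionally honor spark.sql.files.maxPartitionBytes. A hedged sketch (the byte values are illustrative, not recommendations):

```scala
// Hadoop split-size knobs (in bytes) consulted by the HadoopRDD-based
// readers in Spark 1.6; smaller max split => more, smaller read tasks.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

// Spark 2.0 file-source equivalent (per-partition byte cap):
// spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
```

These bound the size of each input split at read time, so they shape the initial partitioning without the cost of a post-read repartition.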