Hello everyone!

I have noticed that when reading Parquet files, or really any kind of DataFrame 
source (spark-csv, etc.), the default partitioning is not balanced.
Tasks for an action usually run very fast on some partitions and very slowly on 
others; frequently they are fast on every partition except the last one, which 
appears to read more than 50% of the total input size.

Looking at the task metrics, most tasks load a modest portion of the data, say 
1024 MB chunks, while some tasks load 20+ GB of data.
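
To illustrate what I mean, this is roughly how I have been checking the record 
count per partition (just a sketch from the spark-shell, assuming sqlContext is 
in scope; the HDFS path is a placeholder):

  // count records per input partition to see the skew (path is hypothetical)
  val df = sqlContext.read.parquet("hdfs:///some/path/table.parquet")
  df.rdd
    .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
    .collect()
    .foreach { case (idx, count) => println(s"partition $idx -> $count rows") }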

Applying repartition strategies solves this issue and overall performance 
improves considerably, but for very large dataframes repartitioning is a costly 
process.
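
For reference, my current workaround is simply something like the following 
(again a sketch; the partition count of 400 is only an example value):

  // force an even distribution with an explicit shuffle (expensive on large data)
  val balanced = sqlContext.read
    .parquet("hdfs:///some/path/table.parquet")
    .repartition(400)
  balanced.count()  // subsequent actions now see evenly sized tasks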

In short, what strategies or configurations are available to get a proper 
distribution of data across executors when reading from disk or HDFS?

If this needs to be more specific, I am strictly focused on PARQUET files from 
HDFS. I know there are some MIN

Really appreciate it,
Saif
