Hello everyone! I've noticed that when reading Parquet files, or really any kind of source DataFrame data (spark-csv, etc.), the default partitioning is not fair. Action tasks usually run very fast on some partitions and very slowly on others; frequently they are even fast on all but the last partition (which appears to read more than 50% of the input data).
I've noticed that some tasks load only a small portion of the data, say 1024 MB chunks, while other tasks load 20+ GB. Applying repartition strategies solves this issue properly and improves general performance considerably, but for very large DataFrames repartitioning is itself a costly process. In short: what strategies or configurations are available to help read from disk or HDFS with a proper distribution of data across executors? To be more specific, I am strictly focused on PARQUET files from HDFS. I know there are some MIN…

Really appreciate it,
Saif
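For context, a minimal sketch of what I mean (assuming Spark 2.x's `SparkSession`; the HDFS path and partition count below are just placeholders, and `spark.sql.files.maxPartitionBytes` is the standard Spark SQL setting that caps how many bytes go into one input partition when reading files):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Cap the bytes packed into a single input partition at read time
    # (default is 128 MB); smaller values yield more, smaller tasks.
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    .getOrCreate()
)

# Placeholder path for the real HDFS location.
df = spark.read.parquet("hdfs:///data/some_table/")

# The costly workaround I described: explicitly rebalancing into
# evenly sized partitions after the read, which triggers a full shuffle.
df_even = df.repartition(400)
```

What I'm hoping for is a way to get the even distribution at read time, without paying for the shuffle that `repartition` incurs.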