Rong Ma created SPARK-54335:
-------------------------------

             Summary: Reducing skew in the number of file splits per partition
                 Key: SPARK-54335
                 URL: https://issues.apache.org/jira/browse/SPARK-54335
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.0
            Reporter: Rong Ma


Currently, when Spark partitions the input files in a table scan, it first 
sorts the input splits, then adjacent splits are coalesced into a single 
partition. If the input split size distribution is uneven, some partitions will 
have only a few splits while others will have many.

We observed that this file partitioning strategy can slow down the reading 
process if some tasks are reading more files than others, especially in 
Gluten’s native Parquet reader. To address this performance issue, we are 
proposing a new partitioning strategy that takes both partition size and file 
count into account and distributes the small files across different partitions 
to avoid skew.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to