Rong Ma created SPARK-54335:
-------------------------------
Summary: Reducing skew in the number of file splits per partition
Key: SPARK-54335
URL: https://issues.apache.org/jira/browse/SPARK-54335
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.0
Reporter: Rong Ma
Currently, when Spark partitions the input files in a table scan, it first
sorts the input splits, then adjacent splits are coalesced into a single
partition. If the input split size distribution is uneven, some partitions will
have only a few splits while others will have many.
We observed that this file partitioning strategy can slow down the reading
process if some tasks are reading more files than others, especially in
Gluten’s native Parquet reader. To address this performance issue, we are
proposing a new partitioning strategy that takes both partition size and file
count into account and distributes the small files across different partitions
to avoid skew.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]