Prajwal H G created SPARK-53876:
-----------------------------------
Summary: Addition of column-level Parquet compression preference
in Spark
Key: SPARK-53876
URL: https://issues.apache.org/jira/browse/SPARK-53876
Project: Spark
Issue Type: Improvement
Components: PySpark, Spark Submit, SQL
Affects Versions: 4.1.0
Environment: Spark Version: 4.1.0 (open-source)
Deployment: Spark on Kubernetes (GKE)
Language: PySpark + Scala
Delta Lake: 3.0.0
OS: Ubuntu 22.04
Java: OpenJDK 17 (Zulu)
Cluster: GKE N2D (AMD EPYC), 8 vCPU / 32 GB per executor
Reporter: Prajwal H G
Fix For: 4.1.0
h4. *Problem*
Apache Spark currently allows only *global compression configuration* for
Parquet files using:
{{spark.sql.parquet.compression.codec = snappy | gzip | zstd | uncompressed}}
However, many production datasets contain heterogeneous columns — for example:
* text or categorical columns that compress better with {*}ZSTD{*},
* numeric columns that perform better with {*}SNAPPY{*}.
Today, Spark applies a single codec to the entire file, preventing users from
optimizing storage and I/O performance per column.
h4. *Proposed Improvement*
Introduce a new configuration key to define *per-column compression codecs* in
a map format:
{{spark.sql.parquet.column.compression.map = colA:zstd,colB:snappy,colC:gzip}}
*Behavior:*
* The global codec ({{spark.sql.parquet.compression.codec}}) remains the
default for all columns.
* Any column listed in {{spark.sql.parquet.column.compression.map}} will use
its specified codec.
* Unspecified columns continue to use the global codec.
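The fallback semantics above can be sketched as follows (a minimal illustration; the conf key {{spark.sql.parquet.column.compression.map}} and the helper name are hypothetical, since the conf does not yet exist in Spark):

```python
# Sketch of the proposed fallback behavior: per-column overrides win,
# every other column keeps the global codec.
def resolve_codecs(columns, global_codec, column_map_conf):
    """Return a {column -> codec} mapping from the proposed conf string."""
    overrides = {}
    for entry in filter(None, column_map_conf.split(",")):
        col, _, codec = entry.partition(":")
        overrides[col.strip()] = codec.strip().lower()
    return {c: overrides.get(c, global_codec) for c in columns}

codecs = resolve_codecs(
    ["country", "price", "comment", "qty"],
    global_codec="snappy",
    column_map_conf="country:zstd,price:snappy,comment:gzip",
)
# "qty" is not listed in the map, so it falls back to the global codec.
```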
*Example:*
{{--conf spark.sql.parquet.compression.codec=snappy \}}
{{--conf spark.sql.parquet.column.compression.map="country:zstd,price:snappy,comment:gzip"}}
*Effect:*
||Column||Codec||
|country|zstd|
|price|snappy|
|comment|gzip|
|all others|snappy (global default)|
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]