Prajwal H G created SPARK-53876:
-----------------------------------

             Summary: Addition of column-level Parquet compression preference 
in Spark
                 Key: SPARK-53876
                 URL: https://issues.apache.org/jira/browse/SPARK-53876
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, Spark Submit, SQL
    Affects Versions: 4.1.0
         Environment: Spark Version: 4.1.0 (open-source)
Deployment: Spark on Kubernetes (GKE)
Language: PySpark + Scala
Delta Lake: 3.0.0
OS: Ubuntu 22.04
Java: OpenJDK 17 (Zulu)
Cluster: GKE N2D (AMD EPYC), 8 vCPU / 32 GB per executor
            Reporter: Prajwal H G
             Fix For: 4.1.0


h4. Problem

Apache Spark currently allows only *global compression configuration* for 
Parquet files using:

{{spark.sql.parquet.compression.codec = snappy | gzip | zstd | uncompressed}}

However, many production datasets contain heterogeneous columns — for example:
 * text or categorical columns that compress better with {*}ZSTD{*},

 * numeric columns that perform better with {*}SNAPPY{*}.

Today, Spark applies a single codec to the entire file, preventing users from 
optimizing storage and I/O performance per column.


h4. Proposed Improvement

Introduce a new configuration key to define *per-column compression codecs* in 
a map format:

{{spark.sql.parquet.column.compression.map = colA:zstd,colB:snappy,colC:gzip}}

*Behavior:*
 * The global codec ({{spark.sql.parquet.compression.codec}}) remains the 
default for all columns.

 * Any column listed in {{spark.sql.parquet.column.compression.map}} will use 
its specified codec.

 * Unspecified columns continue to use the global codec.
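
The precedence rule above can be sketched in a few lines. This is a hypothetical illustration of the proposed behavior, not existing Spark code; the function names ({{parse_codec_map}}, {{codec_for}}) and the parsing of the comma/colon map format are assumptions based on the example value shown in this ticket:

```python
# Hypothetical resolution logic for the proposed per-column codec map.
# The config format ("colA:zstd,colB:snappy") follows this proposal;
# nothing here is an existing Spark API.

def parse_codec_map(conf_value: str) -> dict:
    """Parse 'colA:zstd,colB:snappy' into {'colA': 'zstd', 'colB': 'snappy'}."""
    result = {}
    for entry in conf_value.split(","):
        if not entry.strip():
            continue
        col, codec = entry.split(":")
        result[col.strip()] = codec.strip().lower()
    return result

def codec_for(column: str, codec_map: dict, global_codec: str) -> str:
    """Columns listed in the map use their codec; all others fall back to the global one."""
    return codec_map.get(column, global_codec)

codec_map = parse_codec_map("country:zstd,price:snappy,comment:gzip")
print(codec_for("country", codec_map, "snappy"))  # zstd
print(codec_for("user_id", codec_map, "snappy"))  # snappy (global default)
```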

*Example:*

{{--conf spark.sql.parquet.compression.codec=snappy \}}
{{--conf spark.sql.parquet.column.compression.map="country:zstd,price:snappy,comment:gzip"}}

Effect:
||Column||Codec||
|country|zstd|
|price|snappy|
|comment|gzip|
|all others|snappy (global default)|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
