(arrow-go) branch main updated: parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)

zeroshade Fri, 24 Apr 2026 09:58:12 -0700

This is an automated email from the ASF dual-hosted git repository.

zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-go.git



The following commit(s) were added to refs/heads/main by this push:
     new 7b3d772a parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)
7b3d772a is described below

commit 7b3d772a9a75f218d743a2eb671fce5cd4af5f39
Author: Varun Venkatesh <[email protected]>
AuthorDate: Fri Apr 24 09:56:44 2026 -0700

    parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)
    
    ## Rationale
    
    The `klauspost/compress/zstd` encoder currently disables
    `AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to
    zstd levels 1–4). Klauspost's encoder short-circuits to storing literals
    uncompressed when no LZ matches are found, skipping the entropy-coding
    stage. This is a good tradeoff for genuinely incompressible data (random
    bytes), but it leaves significant compression on the table for
    real-world columnar data where LZ match density is low but byte
    distributions are highly skewed — e.g. parquet INT32 decimal columns
    whose values cluster in a small range (so the high bytes are mostly
    zero).
    
    Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on
    literals even without LZ matches, matching the behavior of the C
    reference implementation (`facebook/zstd`) at the same nominal levels.
    
    ## Impact
    
    Measured on a real-world parquet workload — TPC-DS `store_sales`, 7
    Trino-written files, ~9.5M rows, 23 columns including high-cardinality
    `Decimal(7,2)` columns — going through Apache Iceberg's compaction path
    at ZSTD level 3:
    
    | Config | Output vs input |
    |---|---|
    | klauspost (current default) | +6.11% inflation |
    | **klauspost + WithAllLitEntropyCompression(true)** | **-1.84% reduction** |
    | DataDog/zstd (CGo wrapper around C zstd) level 3 | -2.23% reduction |
    | Trino (JNI, C zstd level 3) — reference | -3.99% reduction |
    
    Per-blob benchmark (161 page blobs compressed directly by both
    implementations at level 3):
    - klauspost current default: 346,287 KB (66.60% of raw)
    - klauspost + this fix: 329,249 KB (63.32% of raw)
    - DataDog/zstd: 329,648 KB (63.40% of raw)
    
    With this change (one option added at each of the two encoder
    construction sites), klauspost matches the C reference implementation
    on this workload and slightly beats DataDog/zstd in the per-blob
    benchmark.
    
    After discussing this with @klauspost, we concluded that enabling
    `AllLitEntropyCompression` is the intended way to close this gap. This
    PR applies that setting to arrow-go's zstd codec.
    
    ## Trade-off
    
    Slightly slower compression on genuinely incompressible data (the case
    `AllLitEntropyCompression` was disabled for). For parquet workloads,
    this is typically a non-issue since columns with no structure are rare.
    
    Co-authored-by: Varun Venkatesh <[email protected]>
---
 parquet/compress/zstd.go | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/parquet/compress/zstd.go b/parquet/compress/zstd.go
index 674d84b8..a046a48a 100644
--- a/parquet/compress/zstd.go
+++ b/parquet/compress/zstd.go
@@ -64,7 +64,7 @@ func (p *zstdEncoderPool) getEncoderFromPool(level zstd.EncoderLevel) *zstd.Enco
                if !ok {
                        pool = &sync.Pool{
                                New: func() interface{} {
-                                       enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1))
+                                       enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1), zstd.WithAllLitEntropyCompression(true))
                                        return enc
                                },
                        }
@@ -92,7 +92,7 @@ func (p *zstdEncoderPool) putEncoderToPool(enc *zstd.Encoder, level zstd.Encoder
 
 func getencoder() *zstd.Encoder {
        initEncoder.Do(func() {
-               enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true))
+               enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithAllLitEntropyCompression(true))
        })
        return enc
 }
