This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-go.git
The following commit(s) were added to refs/heads/main by this push:
new 7b3d772a parquet/compress: enable WithAllLitEntropyCompression(true)
for zstd (#779)
7b3d772a is described below
commit 7b3d772a9a75f218d743a2eb671fce5cd4af5f39
Author: Varun Venkatesh <[email protected]>
AuthorDate: Fri Apr 24 09:56:44 2026 -0700
parquet/compress: enable WithAllLitEntropyCompression(true) for zstd (#779)
## Rationale
The `klauspost/compress/zstd` encoder currently disables
`AllLitEntropyCompression` at `SpeedDefault` (the preset that maps to
zstd levels 1–4). Klauspost's encoder short-circuits to storing literals
uncompressed when no LZ matches are found, skipping the entropy-coding
stage. This is a good tradeoff for genuinely incompressible data (random
bytes), but it leaves significant compression on the table for
real-world columnar data where LZ match density is low but byte
distributions are highly skewed — e.g. parquet INT32 decimal columns
whose values cluster in a small range (so the high bytes are mostly
zero).
Enabling `WithAllLitEntropyCompression(true)` forces entropy coding on
literals even without LZ matches, matching the behavior of the C
reference implementation (`facebook/zstd`) at the same nominal levels.
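
A rough way to see why entropy coding alone can pay off on such columns
is to measure the order-0 byte entropy of a synthetic one. The sketch
below uses only the Go standard library. `skewedInt32Column`,
`byteEntropyBits`, and the chosen value range are hypothetical
illustrations, not part of arrow-go or `klauspost/compress`, and order-0
entropy only approximates what zstd's literal entropy coder can achieve.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
)

// skewedInt32Column returns n little-endian int32 values drawn uniformly
// from [0, maxValue). maxValue must be positive. With a small maxValue the
// high bytes of every value are mostly zero, while the low bytes look random.
// This is a hypothetical stand-in for a clustered decimal column.
func skewedInt32Column(n int, maxValue int32, seed int64) []byte {
	rng := rand.New(rand.NewSource(seed))
	buf := make([]byte, 4*n)
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint32(buf[4*i:], uint32(rng.Int31n(maxValue)))
	}
	return buf
}

// byteEntropyBits returns the order-0 Shannon entropy of data in bits per
// byte: 0 for a constant buffer, 8 for a perfectly uniform byte histogram.
func byteEntropyBits(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	total := float64(len(data))
	entropy := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / total
		entropy -= p * math.Log2(p)
	}
	return entropy
}

func main() {
	// Assumed range: values below 100,000, so the top byte is always zero.
	col := skewedInt32Column(100000, 100000, 1)
	bits := byteEntropyBits(col)
	fmt.Printf("order-0 entropy: %.2f bits/byte\n", bits)
	fmt.Printf("approximate entropy-coded size: %.1f%% of raw\n", 100*bits/8)
}
```

Anything noticeably below 8 bits/byte is headroom that an encoder gives
up when it stores match-free literals uncompressed.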
## Impact
Measured on a real-world parquet workload — TPC-DS `store_sales`, 7
Trino-written files, ~9.5M rows, 23 columns including high-cardinality
`Decimal(7,2)` columns — going through Apache Iceberg's compaction path
at ZSTD level 3:
| Config | Output vs input |
|---|---|
| klauspost (current default) | +6.11% inflation |
| **klauspost + WithAllLitEntropyCompression(true)** | **-1.84% reduction** |
| DataDog/zstd (CGo wrapper around C zstd) level 3 | -2.23% reduction |
| Trino (JNI, C zstd level 3) — reference | -3.99% reduction |
Per-blob benchmark (161 page blobs compressed directly by both
implementations at level 3):
- klauspost current default: 346,287 KB (66.60% of raw)
- klauspost + this fix: 329,249 KB (63.32% of raw)
- DataDog/zstd: 329,648 KB (63.40% of raw)
With this one-option change, klauspost matches (and slightly beats)
DataDog/zstd's C-backed output in the per-blob benchmark, and lands
within 0.4 percentage points of it in the end-to-end compaction test.
After discussing this with @klauspost, we concluded that enabling
`AllLitEntropyCompression` is the intended way to close this gap. This
PR applies that setting to arrow-go's zstd codec.
## Trade-off
Compression is slightly slower on genuinely incompressible data (the
case `AllLitEntropyCompression` was disabled for). For parquet
workloads this is typically a non-issue, since columns with no structure
are rare.
Co-authored-by: Varun Venkatesh <[email protected]>
---
parquet/compress/zstd.go | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/parquet/compress/zstd.go b/parquet/compress/zstd.go
index 674d84b8..a046a48a 100644
--- a/parquet/compress/zstd.go
+++ b/parquet/compress/zstd.go
@@ -64,7 +64,7 @@ func (p *zstdEncoderPool) getEncoderFromPool(level zstd.EncoderLevel) *zstd.Enco
if !ok {
pool = &sync.Pool{
New: func() interface{} {
- enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1))
+ enc, _ := zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithEncoderLevel(level), zstd.WithEncoderConcurrency(1), zstd.WithAllLitEntropyCompression(true))
return enc
},
}
@@ -92,7 +92,7 @@ func (p *zstdEncoderPool) putEncoderToPool(enc *zstd.Encoder, level zstd.Encoder
func getencoder() *zstd.Encoder {
initEncoder.Do(func() {
- enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true))
+ enc, _ = zstd.NewWriter(nil, zstd.WithZeroFrames(true), zstd.WithAllLitEntropyCompression(true))
})
return enc
}