[PR] perf(parquet): eliminate per-value allocation in delta bit-pack decoder [arrow-go]

via GitHub Fri, 27 Mar 2026 08:28:22 -0700


zeroshade opened a new pull request, #730:
URL: https://github.com/apache/arrow-go/pull/730


   ### Rationale for this change
   
   The delta bit-pack decoder's `unpackNextMini()` method calls 
`BitReader.GetValue()` once per value in each miniblock. `GetValue()` allocates 
a fresh `[]uint64` slice on every call.
   
   For the default block size of 128 with 4 miniblocks of 32 values each, this 
causes ~128 heap allocations per block, or ~1024 allocations per 1024-value 
page (8 blocks of 128 values). This allocation pressure dominates the decoder's 
runtime and generates significant GC load.
   
   ### What changes are included in this PR?
   
   1. **Reused buffer**: Added a `deltaBuf []uint64` field to the decoder 
struct that is allocated once and reused across calls, eliminating the 
per-value allocations.
   2. **Width=0 fast-path**: When `deltaBitWidth == 0` (all deltas are 
identical, common for sequential or constant data), the decoder skips the bit 
reading entirely and directly accumulates `minDelta`.
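   
   A minimal sketch of the two changes above, not the code in this PR. The 
type `miniblockDecoder`, its method `decodeMini`, and the helper `unpackBits` 
are hypothetical simplifications; only the `deltaBuf` and `minDelta` names 
come from the description. The bit unpacking here is a naive LSB-first loop 
standing in for the real `BitReader`.
   
   ```go
   package main

   // miniblockDecoder is a hypothetical, simplified stand-in for the real
   // delta bit-pack decoder. Only the pieces relevant to the two changes
   // are modelled.
   type miniblockDecoder struct {
   	lastVal  int64    // last decoded value (running sum of deltas)
   	minDelta int64    // minimum delta of the current block
   	deltaBuf []uint64 // scratch buffer, allocated once and reused
   }

   // unpackBits reads len(dst) values of `width` bits each from data,
   // LSB-first, into dst. It is a naive illustration, not the BitReader API.
   func unpackBits(data []byte, width uint, dst []uint64) {
   	var bitPos uint
   	for i := range dst {
   		var v uint64
   		for b := uint(0); b < width; b++ {
   			bit := (data[bitPos>>3] >> (bitPos & 7)) & 1
   			v |= uint64(bit) << b
   			bitPos++
   		}
   		dst[i] = v
   	}
   }

   // decodeMini decodes one miniblock of len(out) values.
   func (d *miniblockDecoder) decodeMini(data []byte, width uint, out []int64) {
   	// Width=0 fast path: every delta equals minDelta, so no bits are read.
   	if width == 0 {
   		for i := range out {
   			d.lastVal += d.minDelta
   			out[i] = d.lastVal
   		}
   		return
   	}

   	// Reused buffer: grow only if the existing capacity is too small.
   	if cap(d.deltaBuf) < len(out) {
   		d.deltaBuf = make([]uint64, len(out))
   	}
   	buf := d.deltaBuf[:len(out)]

   	unpackBits(data, width, buf)
   	for i, delta := range buf {
   		d.lastVal += d.minDelta + int64(delta)
   		out[i] = d.lastVal
   	}
   }
   ```
   
   In this sketch, repeated calls to `decodeMini` with the same miniblock 
size touch the allocator at most once, on the first call that needs bit 
unpacking.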
   
   ### Are these changes tested?
   Yes, the existing test suite passes, along with all encoding and property 
tests.
   
   ### Are there any user-facing changes?
   No user-facing API changes; this is purely an internal optimization.
   
   *Benchmark Results (darwin/arm64, Apple M4, Go 1.25)*
   
   **Baseline (main):**
   ```
   goos: darwin
   goarch: arm64
   pkg: github.com/apache/arrow-go/v18/parquet/internal/encoding
   cpu: Apple M4
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     101904   11393 ns/op   359.52 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     107488   11192 ns/op   365.97 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     107224   11338 ns/op   361.26 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     105836   11338 ns/op   361.26 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     105051   11167 ns/op   366.78 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      1846  653185 ns/op   401.33 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      1914  631663 ns/op   415.01 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      1912  630612 ns/op   415.70 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      1916  633029 ns/op   414.11 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      1939  640181 ns/op   409.48 MB/s   4912 B/op   2 allocs/op
   ```
   
   **Optimized (this PR):**
   ```
   goos: darwin
   goarch: arm64
   pkg: github.com/apache/arrow-go/v18/parquet/internal/encoding
   cpu: Apple M4
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     435112    2768 ns/op  1479.84 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     434569    2768 ns/op  1479.95 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     434536    2772 ns/op  1477.51 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     411572    2859 ns/op  1432.55 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_1024-10     419126    2862 ns/op  1431.03 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      8756  136708 ns/op  1917.55 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      8850  136880 ns/op  1915.14 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      8913  136911 ns/op  1914.70 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      8158  145797 ns/op  1798.01 MB/s   4912 B/op   2 allocs/op
   BenchmarkDeltaBinaryPackedDecodingInt32/len_65536-10      8281  142564 ns/op  1838.79 MB/s   4912 B/op   2 allocs/op
   ```
   
   **Summary:**
   
   | Size | Baseline (ns/op) | Optimized (ns/op) | Speedup | Baseline (MB/s) | Optimized (MB/s) |
   |------|------------------|-------------------|---------|-----------------|------------------|
   | 1024 | 11,286 | 2,806 | **4.0x** | 363 | 1,460 |
   | 65536 | 637,734 | 139,772 | **4.6x** | 411 | 1,877 |
   
   * ~4-4.6x faster decoding
   * Zero additional allocations (2 allocs/op unchanged — those are from test 
setup)
   * Encoding performance is unchanged (encoder path not modified)
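   
   The summary figures are consistent with arithmetic means over the five 
runs of each benchmark. The sketch below shows that calculation; the helper 
names `mean`, `speedup`, and `roundTo` are illustrative and not part of the 
repository.
   
   ```go
   package main

   import "math"

   // mean returns the arithmetic mean of xs, or 0 for an empty slice.
   func mean(xs []float64) float64 {
   	if len(xs) == 0 {
   		return 0
   	}
   	var sum float64
   	for _, x := range xs {
   		sum += x
   	}
   	return sum / float64(len(xs))
   }

   // speedup returns mean(baseline) / mean(optimized) for ns/op samples.
   func speedup(baseline, optimized []float64) float64 {
   	return mean(baseline) / mean(optimized)
   }

   // roundTo rounds x to the given number of decimal places.
   func roundTo(x float64, decimals int) float64 {
   	p := math.Pow(10, float64(decimals))
   	return math.Round(x*p) / p
   }
   ```
   
   For example, passing the five `len_1024` ns/op samples from each run to 
`speedup` and rounding to one decimal place reproduces the 4.0x entry in the 
table.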


