iemejia opened a new pull request, #3555:
URL: https://github.com/apache/parquet-java/pull/3555
## Summary
Bypass the Hadoop `Compressor`/`Decompressor`/`CodecPool` abstraction layer
in `CodecFactory` and `DirectCodecFactory` and call the native compression
libraries directly. This eliminates per-page stream creation, intermediate
buffer copies, and codec pool synchronization for the four codecs covered
here: Snappy, LZ4_RAW, ZSTD, and GZIP.
### What changes
- **Snappy**: Replace `CodecPool` + `SnappyCompressor` (which copies
heap→direct→heap) with a single `Snappy.compress(byte[], byte[])` /
`Snappy.uncompress(byte[], byte[])` JNI call and a reusable output buffer.
- **LZ4_RAW**: Replace `NonBlockedCompressor` (which allocates direct
ByteBuffers and copies heap↔direct twice per call) with heap buffers from
`ByteBuffer.wrap()` passed straight to airlift's LZ4 compress/decompress,
leaving zero intermediate copies.
- **ZSTD**: Replace `ZstdCompressorStream` with
`ZstdOutputStreamNoFinalizer` (avoids finalizer registration) and cache the
ZSTD level / buffer pool configuration reads per compressor instance instead of
re-reading `Configuration` on each page.
- **GZIP**: Replace Hadoop's `GzipCodec` (which wraps Java's
`Deflater`/`Inflater` in stream abstractions) with direct `Deflater`/`Inflater`
usage, reusing instances via `reset()` and managing GZIP headers/trailers
manually.
- **Benchmark**: Update `CompressionBenchmark` page sizes from `{8KB, 64KB,
256KB}` to `{64KB, 128KB, 256KB, 1MB}` to reflect real-world Parquet page sizes
(most pages are 64-256KB due to the 20K row-count limit from PARQUET-1414; only
wide string/binary columns hit the 1MB size limit).
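
To make the GZIP bullet concrete, here is a minimal sketch of direct `Deflater`/`Inflater` use with hand-written GZIP framing. It is not the PR's `GzipBytesCompressor`: the class name `DirectGzipSketch`, the fixed 10-byte header, the 64 KB starting buffer, and the `jdkGunzip` helper are all assumptions made for illustration, and it uses only `java.util.zip`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;

/** Hypothetical sketch of the idea; the PR's actual classes may differ. */
final class DirectGzipSketch {
    // Fixed RFC 1952 header: magic, CM=8 (deflate), no flags, zero mtime, XFL=0, OS=255 (unknown).
    private static final byte[] HEADER = {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff};
    private static final int TRAILER_SIZE = 8; // CRC32 + ISIZE, both little-endian

    // nowrap=true yields raw deflate data, so the GZIP framing is written by hand.
    private final Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
    private final Inflater inflater = new Inflater(true);
    private final CRC32 crc = new CRC32();
    private byte[] out = new byte[64 * 1024]; // reused across calls, grown on demand

    byte[] compress(byte[] input) {
        deflater.reset(); // reuse the native zlib state instead of allocating a new one
        crc.reset();
        crc.update(input);
        System.arraycopy(HEADER, 0, out, 0, HEADER.length);
        int pos = HEADER.length;
        deflater.setInput(input);
        deflater.finish();
        while (!deflater.finished()) {
            if (pos == out.length) out = Arrays.copyOf(out, out.length * 2);
            pos += deflater.deflate(out, pos, out.length - pos);
        }
        if (out.length - pos < TRAILER_SIZE) out = Arrays.copyOf(out, pos + TRAILER_SIZE);
        pos = writeIntLE(out, pos, (int) crc.getValue());
        pos = writeIntLE(out, pos, input.length);
        return Arrays.copyOf(out, pos); // a real implementation could expose (buffer, length) instead
    }

    /** The caller supplies the uncompressed size, as a Parquet page header would. */
    byte[] decompress(byte[] gz, int uncompressedSize) {
        if (gz.length < HEADER.length + TRAILER_SIZE
                || gz[0] != 0x1f || gz[1] != (byte) 0x8b || gz[2] != 8 || gz[3] != 0) {
            throw new IllegalArgumentException("not a plain single-member GZIP stream");
        }
        byte[] result = new byte[uncompressedSize];
        int n = 0;
        try {
            inflater.reset();
            // The trailer bytes are passed as well; the inflater stops at the end of the deflate data.
            inflater.setInput(gz, HEADER.length, gz.length - HEADER.length);
            while (n < result.length) {
                int r = inflater.inflate(result, n, result.length - n);
                if (r == 0) break; // stream ended or input ran out early
                n += r;
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt deflate data", e);
        }
        crc.reset();
        crc.update(result, 0, n);
        int trailer = gz.length - TRAILER_SIZE;
        if (n != uncompressedSize || readIntLE(gz, trailer) != (int) crc.getValue()
                || readIntLE(gz, trailer + 4) != uncompressedSize) {
            throw new IllegalArgumentException("GZIP trailer mismatch");
        }
        return result;
    }

    /** Releases the native zlib state held by the reused instances. */
    void close() {
        deflater.end();
        inflater.end();
    }

    /** Decodes with the JDK's GZIPInputStream; used only to check interoperability. */
    static byte[] jdkGunzip(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static int writeIntLE(byte[] b, int pos, int v) {
        for (int i = 0; i < 4; i++) b[pos + i] = (byte) (v >>> (8 * i));
        return pos + 4;
    }

    private static int readIntLE(byte[] b, int pos) {
        int v = 0;
        for (int i = 0; i < 4; i++) v |= (b[pos + i] & 0xff) << (8 * i);
        return v;
    }
}
```

One instance can serve many pages in sequence because `reset()` clears the zlib state between calls. The sketch is not thread-safe, and `close()` should be called once the instance is no longer needed.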
### Benchmark results (ops/s, higher is better)
#### Compression
| Codec | Page Size | Master | Branch | Delta |
|-------|-----------|-------:|-------:|------:|
| SNAPPY | 64 KB | 53,979 | 60,799 | **+12.6%** |
| SNAPPY | 128 KB | 27,764 | 30,524 | **+9.9%** |
| SNAPPY | 256 KB | 13,549 | 14,648 | **+8.1%** |
| SNAPPY | 1 MB | 2,445 | 2,675 | **+9.4%** |
| LZ4_RAW | 1 MB | 1,961 | 2,191 | **+11.7%** |
| LZ4_RAW | 64-256 KB | — | — | within noise (-1 to -4%) |
| ZSTD | all sizes | — | — | within noise |
| GZIP | all sizes | — | — | within noise |
#### Decompression
| Codec | Page Size | Master | Branch | Delta |
|-------|-----------|-------:|-------:|------:|
| LZ4_RAW | 64 KB | 80,415 | 118,358 | **+47.2%** |
| LZ4_RAW | 128 KB | 40,615 | 59,620 | **+46.8%** |
| LZ4_RAW | 256 KB | 19,888 | 29,914 | **+50.4%** |
| LZ4_RAW | 1 MB | 4,628 | 7,517 | **+62.4%** |
| SNAPPY | 64 KB | 60,928 | 67,224 | **+10.3%** |
| SNAPPY | 128 KB | 29,919 | 33,457 | **+11.8%** |
| SNAPPY | 256 KB | 14,431 | 15,912 | **+10.3%** |
| SNAPPY | 1 MB | 3,140 | 3,540 | **+12.7%** |
| ZSTD | 64 KB | 32,042 | 35,750 | **+11.6%** |
| ZSTD | 128 KB | 19,447 | 21,800 | **+12.1%** |
| ZSTD | 256 KB | 9,495 | 10,759 | **+13.3%** |
| ZSTD | 1 MB | 2,155 | 2,409 | **+11.8%** |
| GZIP | 128 KB | 4,101 | 4,536 | **+10.6%** |
| GZIP | 256 KB | 1,736 | 1,891 | **+8.9%** |
| GZIP | 1 MB | 406 | 442 | **+9.1%** |
JMH config: JDK 25.0.3 Temurin, 1 fork, 2 warmup iterations × 1 s, 3 measurement iterations × 2 s.
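
For readers who want to re-derive the Delta column, the sketch below assumes it is the relative throughput change, `(branch / master - 1) × 100`, which matches the rows above. It also converts ops/s into mean time per operation. The class and method names are hypothetical.

```java
/** Hypothetical helper for reading the benchmark tables; not part of the PR. */
final class BenchmarkDelta {
    /** Relative throughput change of the branch over master, in percent. */
    static double deltaPercent(double masterOps, double branchOps) {
        return (branchOps / masterOps - 1.0) * 100.0;
    }

    /** Mean time per operation in microseconds for a throughput given in ops/s. */
    static double microsPerOp(double opsPerSecond) {
        return 1_000_000.0 / opsPerSecond;
    }

    public static void main(String[] args) {
        // LZ4_RAW decompression, 64 KB row from the table above.
        System.out.printf("delta: %+.1f%%%n", deltaPercent(80_415, 118_358));
        System.out.printf("branch: %.1f us/op%n", microsPerOp(118_358));
    }
}
```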
### Why LZ4_RAW decompression gains are largest
`NonBlockedDecompressor` performs two full data copies per operation — heap
byte[] → direct ByteBuffer on input, direct ByteBuffer → heap byte[] on output
— plus direct buffer allocation and synchronized access. The bypass eliminates
both copies by using `ByteBuffer.wrap()` on heap arrays, letting airlift's LZ4
decompress directly between heap buffers.
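
The two copy paths can be contrasted in a small sketch. Here `fakeCodec` is a stand-in that simply copies bytes, used only so the example runs on the JDK alone. It is not airlift's LZ4 API, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical illustration of staged direct buffers versus wrapped heap arrays. */
final class HeapWrapSketch {
    /** Stand-in for a codec call that reads src and writes dst (identity copy here). */
    static void fakeCodec(ByteBuffer src, ByteBuffer dst) {
        dst.put(src);
    }

    /** Staged path: heap to direct, codec call, direct to heap. */
    static byte[] viaDirectBuffers(byte[] input) {
        ByteBuffer in = ByteBuffer.allocateDirect(input.length);
        in.put(input);                 // extra copy 1: heap byte[] into a direct buffer
        in.flip();
        ByteBuffer out = ByteBuffer.allocateDirect(input.length);
        fakeCodec(in, out);
        out.flip();
        byte[] result = new byte[out.remaining()];
        out.get(result);               // extra copy 2: direct buffer back to a heap byte[]
        return result;
    }

    /** Bypass path: wrap() creates views over the existing arrays, so nothing is staged. */
    static byte[] viaHeapWrap(byte[] input) {
        byte[] result = new byte[input.length];
        fakeCodec(ByteBuffer.wrap(input), ByteBuffer.wrap(result));
        return result;
    }

    public static void main(String[] args) {
        byte[] page = {1, 2, 3, 4, 5};
        System.out.println(Arrays.equals(viaDirectBuffers(page), viaHeapWrap(page)));
        System.out.println(ByteBuffer.wrap(page).array() == page); // same backing array, no copy
    }
}
```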
### Why ZSTD compression gains are minimal
`ZstandardCodec` already returns `null` from
`createCompressor()`/`createDecompressor()` and delegates directly to
`zstd-jni` streams. The Hadoop abstraction overhead was already bypassed at the
codec level. The branch adds finalizer avoidance (`NoFinalizer` variants) and
caches configuration reads, which helps decompression but leaves compression
within noise.
### Alternative considered: modify codecs instead of CodecFactory
We evaluated modifying `SnappyCodec` and `Lz4RawCodec` to follow the
`ZstandardCodec` pattern (return `null` from `createCompressor()`, use custom
stream wrappers). This approach was **25-50% slower** than the `CodecFactory`
bypass for Snappy/LZ4 and even **20-47% slower than master**. The per-call
stream creation, `ByteArrayOutputStream` buffering, and lack of buffer reuse
dominate for memory-bandwidth-bound codecs where the actual compression takes
only 8-65 microseconds.
### Files changed
- `CodecFactory.java`: Bypass compressor/decompressor with codec-specific
inner classes (`SnappyBytesCompressor`, `Lz4RawBytesCompressor`,
`ZstdBytesCompressor`, `GzipBytesCompressor` + matching decompressors)
- `DirectCodecFactory.java`: Bypass for direct `ByteBuffer` path (Snappy,
LZ4_RAW, ZSTD)
- `BytesInput.java`: Add `ByteBufferBackedOutputStream` to avoid
`toByteArray()` copies
- `CompressionBenchmark.java`: Realistic page sizes + JMH annotation
processor fix for Java 17+
- `TestDirectCodecFactory.java`: Updated tests for bypass path
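
As a rough illustration of the `BytesInput.java` entry, the sketch below shows an `OutputStream` that writes straight into a caller-supplied `ByteBuffer`, so no intermediate `byte[]` has to be produced through `toByteArray()`. The PR's `ByteBufferBackedOutputStream` may look quite different. The class name `ByteBufferSinkSketch` and the `writeAll` helper are assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical sketch; not the PR's ByteBufferBackedOutputStream. */
final class ByteBufferSinkSketch extends OutputStream {
    private final ByteBuffer target;

    ByteBufferSinkSketch(ByteBuffer target) {
        this.target = target;
    }

    @Override
    public void write(int b) throws IOException {
        if (!target.hasRemaining()) throw new IOException("target buffer is full");
        target.put((byte) b);
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        if (len > target.remaining()) throw new IOException("target buffer is full");
        target.put(src, off, len); // one bulk copy into the destination buffer
    }

    /** Writes the payload into the target and returns the number of bytes written. */
    static int writeAll(byte[] payload, ByteBuffer target) {
        int start = target.position();
        try (OutputStream sink = new ByteBufferSinkSketch(target)) {
            sink.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.position() - start;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        System.out.println(writeAll(new byte[] {10, 20, 30}, buffer));
    }
}
```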
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]