[PR] shuffle: TypeAwareCompress(tac) for column-wise data compression like (U)INT64 [gluten]

via GitHub Wed, 08 Apr 2026 19:26:11 -0700


guowangy opened a new pull request, #11894:
URL: https://github.com/apache/gluten/pull/11894


   ## What changes are proposed in this pull request?
   
   Introduces **TypeAwareCompress (TAC)**, a column-wise compression layer for shuffle that selects an algorithm based on each buffer's data type. It is applied per buffer alongside the existing LZ4/ZSTD codec path.
   
   For `INT64`/`UINT64` columns the values are often clustered in a small range, making Frame-of-Reference + Bit-Packing (FFOR) significantly more effective than generic byte-level compression. TAC exploits this by encoding 8-byte integer buffers with a 4-lane FFOR codec before the standard codec sees them.
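
To illustrate why clustered 64-bit values pack well, here is a minimal single-lane sketch of frame-of-reference plus bit-packing. It is not the 4-lane codec in `ffor.hpp`; the names `ForBlock`, `forEncode`, and `forDecode` are hypothetical and exist only for this example (C++17 assumed).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one encoded block; not the PR's wire format.
struct ForBlock {
  uint64_t reference = 0;       // minimum value in the block
  uint8_t bitWidth = 0;         // bits needed to store (max - min)
  size_t count = 0;             // number of encoded values
  std::vector<uint64_t> words;  // bit-packed deltas from the reference
};

// Number of bits needed to represent v (0 when v == 0).
inline uint8_t bitsNeeded(uint64_t v) {
  uint8_t bits = 0;
  while (v != 0) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

inline ForBlock forEncode(const std::vector<uint64_t>& values) {
  ForBlock block;
  block.count = values.size();
  if (values.empty()) return block;
  auto [minIt, maxIt] = std::minmax_element(values.begin(), values.end());
  block.reference = *minIt;
  block.bitWidth = bitsNeeded(*maxIt - *minIt);
  if (block.bitWidth == 0) return block;  // all values equal: no payload
  block.words.assign((block.count * block.bitWidth + 63) / 64, 0);
  size_t bitPos = 0;
  for (uint64_t v : values) {
    const uint64_t delta = v - block.reference;
    const size_t word = bitPos / 64;
    const size_t offset = bitPos % 64;
    block.words[word] |= delta << offset;
    if (offset + block.bitWidth > 64) {
      // The delta straddles two words: spill its high bits into the next one.
      block.words[word + 1] |= delta >> (64 - offset);
    }
    bitPos += block.bitWidth;
  }
  return block;
}

inline std::vector<uint64_t> forDecode(const ForBlock& block) {
  std::vector<uint64_t> out(block.count, block.reference);
  if (block.bitWidth == 0) return out;
  const uint64_t mask =
      block.bitWidth == 64 ? ~0ULL : ((1ULL << block.bitWidth) - 1);
  size_t bitPos = 0;
  for (size_t i = 0; i < block.count; ++i) {
    const size_t word = bitPos / 64;
    const size_t offset = bitPos % 64;
    uint64_t delta = block.words[word] >> offset;
    if (offset + block.bitWidth > 64) {
      delta |= block.words[word + 1] << (64 - offset);
    }
    out[i] += delta & mask;
    bitPos += block.bitWidth;
  }
  return out;
}
```

In this sketch, four values between 1000000 and 1000007 need only 3 bits each after subtracting the reference, so 32 bytes of input fit in a single 64-bit word plus the small block metadata.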
   
   Performance change on TPCH/TPCDS with TAC enabled (negative values are reductions):
   |        |Total Latency|Shuffle Write Size|
   |--------|-------------|------------------|
   |TPCH-6T |-15%         |-32%              |
   |TPCDS-6T|-6%          |-14%              |
   
   ### New files
   
   | Path | Purpose |
   |------|---------|
   | `cpp/core/utils/tac/ffor.hpp` | Header-only 4-lane FFOR codec for `uint64_t` |
   | `cpp/core/utils/tac/FForCodec.{h,cc}` | Arrow-Result wrapper around `ffor.hpp` |
   | `cpp/core/utils/tac/TypeAwareCompressCodec.{h,cc}` | Type dispatch; self-describing wire format (codec ID + element width embedded in header, so decompression needs no type hint) |
   | `cpp/velox/shuffle/VeloxTypeAwareCompress.h` | Maps Velox `TypeKind` → `TacDataType` (`BIGINT` → `kUInt64`) |
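
The "self-describing" property mentioned in the table can be pictured with a small sketch. The two-byte layout, the codec ID value, and the names `TacHeader`, `frame`, and `readHeader` below are assumptions made for illustration; the actual header in `TypeAwareCompressCodec` may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical two-byte header; the PR's real layout may differ.
struct TacHeader {
  uint8_t codecId;       // which codec produced the payload
  uint8_t elementWidth;  // bytes per element, e.g. 8 for uint64_t
};

constexpr uint8_t kSketchFforCodecId = 1;  // illustrative value only
constexpr size_t kSketchHeaderSize = 2;

// Prepend the header so a reader can pick the decoder without a type hint.
inline std::vector<uint8_t> frame(TacHeader header,
                                  const std::vector<uint8_t>& payload) {
  std::vector<uint8_t> out;
  out.reserve(kSketchHeaderSize + payload.size());
  out.push_back(header.codecId);
  out.push_back(header.elementWidth);
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}

// Recover the header from a framed buffer; empty if the buffer is too short.
inline std::optional<TacHeader> readHeader(const std::vector<uint8_t>& framed) {
  if (framed.size() < kSketchHeaderSize) return std::nullopt;
  return TacHeader{framed[0], framed[1]};
}
```

Because the reader learns the codec and element width from the buffer itself, the decompression side does not need the schema that the writer used.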
   
   ### Shuffle integration
   
   - `Payload.cc/h`: `BlockPayload::fromBuffers` accepts an optional `bufferTypes` vector. For each buffer, if `TypeAwareCompressCodec::support(type)` is true, TAC is used; otherwise the buffer falls back to LZ4/ZSTD. A new wire marker `kTypeAwareBuffer = -3` is added; decompression in `readCompressedBuffer` is self-describing. If the TAC-compressed size is ≥ the original size, the buffer falls back to `kUncompressedBuffer`.
   - `Options.h`: adds `enableTypeAwareCompress` (default `false`) to `LocalPartitionWriterOptions`.
   - `VeloxHashShuffleWriter`: populates `bufferTypes` from the schema when TAC is enabled.
   - `GlutenConfig.scala`: new config `spark.gluten.sql.columnar.shuffle.typeAwareCompress.enabled` (default `false`).
   - `ColumnarShuffleWriter` / `LocalPartitionWriterJniWrapper`: forward the new option to native.
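
The per-buffer decision described in the `Payload.cc/h` item above could look roughly like the following sketch. All names here (`BufferKind`, `ColumnType`, `encodeBuffer`, the `Compressor` callbacks) are hypothetical stand-ins, not the signatures used in the PR, and the codecs are passed in as callbacks so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical tags; the PR writes markers such as kTypeAwareBuffer = -3.
enum class BufferKind { kUncompressed, kGenericCodec, kTypeAware };
// Hypothetical stand-in for TacDataType.
enum class ColumnType { kUInt64, kOther };

using Bytes = std::vector<uint8_t>;
using Compressor = std::function<Bytes(const Bytes&)>;

struct EncodedBuffer {
  BufferKind kind;
  Bytes data;
};

// Assumed support check: only 8-byte integers are handled in this sketch.
inline bool tacSupports(ColumnType type) {
  return type == ColumnType::kUInt64;
}

// Choose an encoding for one buffer. `type` is empty when no type
// information was supplied for the buffer.
inline EncodedBuffer encodeBuffer(const Bytes& raw,
                                  std::optional<ColumnType> type,
                                  const Compressor& tac,
                                  const Compressor& generic) {
  if (type.has_value() && tacSupports(*type)) {
    Bytes packed = tac(raw);
    if (packed.size() < raw.size()) {
      return {BufferKind::kTypeAware, std::move(packed)};
    }
    // TAC did not shrink the buffer: store the original bytes instead.
    return {BufferKind::kUncompressed, raw};
  }
  // Unsupported or unknown type: use the generic (LZ4/ZSTD-style) codec.
  return {BufferKind::kGenericCodec, generic(raw)};
}
```

Keeping the size check next to the TAC call means a buffer that does not benefit from type-aware encoding is never stored larger than its original form.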
   
   Disabled by default — no behaviour changes for existing deployments.
   
   ## How was this patch tested?
   
   `cpp/core/tests/FForCodecTest.cc` covers:
   - Round-trip correctness for random, all-zero, monotonic, and near-max value patterns
   - `maxCompressedLength` boundary checks
   - Invalid input size rejection
   
   `cpp/velox/tests/VeloxShuffleWriterTest.cc`: extended to exercise the TAC path end-to-end through `VeloxHashShuffleWriter`.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Co-authored-by: Claude Sonnet 4.6
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

