Re: [PR] [VL]:shuffle: extend TypeAwareCompress to INT128 (DECIMAL HUGEINT) [gluten]

via GitHub Mon, 15 Jun 2026 01:59:59 -0700


Copilot commented on code in PR #12299:
URL: https://github.com/apache/gluten/pull/12299#discussion_r3412233275



##########
cpp/core/utils/tac/ffor.hpp:
##########
@@ -59,6 +60,13 @@
 namespace gluten {
 namespace ffor {
 
+// Byte order (applies to the 128-bit codec below): this codec round-trips
+// data as native uint64 reads/writes against the lo (offset 0) and hi
+// (offset 8) halves of each 128-bit value (DECIMAL128's in-memory layout in
+// Velox).  Producer and consumer share byte order by virtue of running in
+// the same Spark cluster -- LZ4 and any other shuffle codec carry the same
+// implicit assumption -- so no explicit endian guard is needed here.

Review Comment:
   The 128-bit codec assumes a specific in-memory half ordering (lo at offset 
0, hi at offset 8) and effectively relies on native endianness, but the code 
currently has no compile-time/runtime guard to prevent incorrect behavior on 
big-endian targets or mixed-architecture clusters. **Fix (recommended):** add 
an explicit build-time guard (e.g., `static_assert` on native endianness) or 
implement a defined byte order for the on-wire format (with conversions), to 
avoid silent cross-platform corruption.



##########
cpp/core/utils/tac/TypeAwareCompressCodec.cc:
##########
@@ -65,18 +80,38 @@ TypeAwareCompressCodec::decompress(const uint8_t* input, 
int64_t inputLen, uint8
 
   auto* in = input;
   auto codecId = static_cast<CodecId>(*in++);
-  [[maybe_unused]] auto tacType = *in++;
+  auto tacType = (*in++);

Review Comment:
   `tacType` is encoded from an `int8_t` but decoded into an `auto` from a 
`uint8_t` buffer, which can make unsupported negative values appear as large 
positive numbers (e.g., `-1` becomes `255`) and can confuse 
diagnostics/handling. **Fix (recommended):** decode into an `int8_t` (or 
`tac::TacDataType`) explicitly, e.g., `const int8_t tacType = 
static_cast<int8_t>(*in++);`, and keep error reporting consistent with the 
signed TAC type domain.



##########
cpp/core/utils/tac/ffor.hpp:
##########
@@ -335,15 +343,55 @@ inline constexpr size_t compress64Bound(size_t num) {
   return (nBlocks + 1) * kHeaderSize + num * sizeof(uint64_t);
 }
 
+// Encode one block: write header + bit-packed payload into `out`.
+// Returns the number of bytes written.  `out` must be 8-byte aligned;
+// callers whose output buffer is unaligned stage through a local scratch.
+// Shared by the 64-bit and 128-bit codecs.
+inline size_t encodeBlock(const uint64_t* src, size_t blockVals, uint64_t* 
out) {
+  uint64_t base;
+  unsigned bw;
+  analyze(src, blockVals, base, bw);
+  writeHeader(reinterpret_cast<uint8_t*>(out), static_cast<uint8_t>(bw),
+              static_cast<uint8_t>(blockVals / kLanes), base);

Review Comment:
   `writeHeader` stores `count` in a `uint8_t`, but `blockVals / kLanes` can 
overflow if a block can contain more than `255 * kLanes` values (e.g., 256 
groups). This would corrupt the header and make decompression fail or 
misbehave. **Fix (required):** enforce `kMaxValuesPerBlock / kLanes <= 255` via 
a compile-time/static check, or widen the header field (and update the wire 
format + readers) to store a larger `count` type.



##########
cpp/core/utils/tac/FForCodec.cc:
##########
@@ -57,4 +57,44 @@ FForCodec::decompress(const uint8_t* input, int64_t 
inputSize, uint8_t* output,
   return static_cast<int64_t>(nDecoded);
 }
 
+int64_t FForCodec::maxCompressedLength128(int64_t inputSize) {
+  if (inputSize % sizeof(__int128_t) != 0) {
+    return 0;
+  }
+  size_t numValues = inputSize / sizeof(__int128_t);
+  return static_cast<int64_t>(ffor::compress128Bound(numValues));
+}
+
+arrow::Result<int64_t>
+FForCodec::compress128(const uint8_t* input, int64_t inputSize, uint8_t* 
output, int64_t outputSize) {
+  if (inputSize == 0) {
+    return 0;
+  }
+  if (inputSize % sizeof(__int128_t) != 0) {
+    return arrow::Status::Invalid("FForCodec: input size ", inputSize, " is 
not a multiple of ", sizeof(__int128_t), ".");
+  }
+
+  size_t numValues = inputSize / sizeof(__int128_t);
+  auto maxLen = static_cast<int64_t>(ffor::compress128Bound(numValues));
+  if (outputSize < maxLen) {
+    return arrow::Status::Invalid("FForCodec: output buffer too small for 
128-bit compression.");
+  }

Review Comment:
   This error message doesn’t include the provided `outputSize` and required 
`maxLen`, which makes diagnosing sizing issues harder (especially since callers 
compute sizes via `maxCompressedLen`). **Fix (optional):** include both sizes 
in the message (e.g., “need X bytes, got Y bytes”) for faster debugging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [VL]:shuffle: extend TypeAwareCompress to INT128 (DECIMAL HUGEINT) [gluten]

Reply via email to