[PR] [VL] Add KeyGroupedPartitioning support to columnar shuffle [gluten]

via GitHub Wed, 13 May 2026 00:41:03 -0700


minni31 opened a new pull request, #12084:
URL: https://github.com/apache/gluten/pull/12084


   ## CONTEXT
   
   `KeyGroupedPartitioning` is a Spark partitioning scheme used by V2 data 
source connectors (e.g., Iceberg, Paimon) where data is partitioned by specific 
key expressions with known unique partition values. Currently, Gluten's 
columnar shuffle exchange does not handle this partitioning type, causing a 
fallback to vanilla Spark for any query involving V2 sources with key-grouped 
partitioning.
   
   ## WHAT
   
   Adds `KeyGroupedPartitioning` support to the columnar shuffle exchange path 
in the Velox backend. The implementation reuses the existing JVM-side partition 
ID computation pattern (same mechanism as `RangePartitioning`):
   
   - Adds `KeyGroupedPartitioning` to the validation whitelist in 
`ColumnarShuffleExchangeExecBase`, allowing the columnar shuffle to accept this 
partitioning type.
   - Constructs a `KeyGroupedPartitioner` from the partitioning's 
`uniquePartitionValues`, mapping each partition key to its index.
   - Computes partition IDs on the JVM side by evaluating partition key 
expressions against each row (via `BindReferences`) and looking up the result 
in the `KeyGroupedPartitioner`. The pid column is prepended to each batch so 
the native shuffle writer can read it directly.
   - Reuses `RangePartitioningShortName` for the native partitioning descriptor 
since both Range and KeyGrouped use the same JVM-side pid prepend pattern: the 
native shuffle writer reads the prepended column rather than computing 
partition IDs natively.
   - Each key extraction allocates a fresh array per row and converts it to an 
immutable `ArraySeq`, so no two keys alias the same reused mutable array.
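
For reviewers, the flow described in the list above can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code in this PR: `SketchKeyGroupedPartitioner`, `KeyGroupedSketch`, the `Row` alias, and the hash-modulo fallback for unknown keys are all hypothetical stand-ins, and the sketch assumes Scala 2.13 for the immutable `ArraySeq`.

```scala
import scala.collection.immutable.ArraySeq

// Hypothetical stand-in for a key-grouped partitioner (not Spark's or Gluten's class).
final class SketchKeyGroupedPartitioner(uniquePartitionValues: Seq[Seq[Any]]) {
  require(uniquePartitionValues.nonEmpty, "need at least one partition value")

  // Map each unique partition key to its index in the list.
  private val indexByKey: Map[Seq[Any], Int] = uniquePartitionValues.zipWithIndex.toMap

  val numPartitions: Int = uniquePartitionValues.size

  // Assumption for this sketch: unknown keys fall back to a non-negative hash modulo.
  def getPartition(key: Seq[Any]): Int =
    indexByKey.getOrElse(key, Math.floorMod(key.hashCode, numPartitions))
}

object KeyGroupedSketch {
  // Hypothetical row representation: column name -> value.
  type Row = Map[String, Any]

  // Evaluate the key columns against one row. A fresh array is allocated per row
  // and wrapped as an immutable ArraySeq, so keys never share a mutable buffer.
  def extractKey(row: Row, keyColumns: Seq[String]): Seq[Any] = {
    val values = new Array[Any](keyColumns.length)
    var i = 0
    while (i < keyColumns.length) {
      values(i) = row(keyColumns(i))
      i += 1
    }
    ArraySeq.unsafeWrapArray(values)
  }

  // Pair each row with its partition ID, mimicking the prepended pid column.
  def prependPids(
      batch: Seq[Row],
      keyColumns: Seq[String],
      partitioner: SketchKeyGroupedPartitioner): Seq[(Int, Row)] =
    batch.map(row => (partitioner.getPartition(extractKey(row, keyColumns)), row))
}
```

In this sketch, a partitioner built from the values `("us", 1)` and `("eu", 2)` assigns a row with `region = "eu"` and `bucket = 2` to partition 1. Because each key is backed by its own array, a key that is retained (for example as a map key) is not affected by later rows.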
   
   ### Tests
   
   | Suite | Tests | Status |
   |-------|-------|--------|
   | `VeloxShufflePartitioningSuite` | 22 tests: SxS tests for hash (A1-A4), range (B1-B3), round-robin (C1-C2), single (D1-D2), null semantics (E1-E2), data types (F1-F2), boundary cases (G1-G2), KeyGrouped unit tests (H1-H6) | Local pass |
   
   Note: End-to-end `KeyGroupedPartitioning` tests require V2 data source 
connectors (Iceberg/Paimon), which are not available in this test module. The 
H-series unit tests validate key extraction, `KeyGroupedPartitioner` 
construction, and the full key-extraction-to-partition-lookup flow.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

