felipepessoto opened a new issue, #12377:
URL: https://github.com/apache/gluten/issues/12377

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   **Expected:** A Delta `MERGE INTO` that writes deletion vectors (DVs) 
completes successfully, exactly as it does on vanilla Spark + Delta.
   
   **Actual:** Under the Gluten Velox bundle the MERGE intermittently aborts 
with a native `VeloxRuntimeError` (`INVALID_STATE`) raised by Gluten's Delta DV 
bitmap aggregator:
   
       Delta RoaringBitmapArray row index 9223372036854775807 exceeds max representable value 9223372030412324864
   
   `9223372036854775807` is exactly `Long.MAX_VALUE` (`2^63 - 1`). The target 
table in the failing test is tiny (a handful of rows), so this is **not** a 
real row index -- it is a sentinel / placeholder value that is leaking into the 
DV-write aggregation.
   
   The aggregation that builds the per-file DV (`PartialAggregation`, function 
`addSafe`) packs each matched target row's index into a `RoaringBitmapArray`. 
`RoaringBitmapArray::addSafe` enforces `value <= kMaxRepresentableValue` (= 
`0x7ffffffe80000000` = `9223372030412324864`, which the code comments say 
mirrors Delta JVM's `RoaringBitmapArray.MAX_REPRESENTABLE_VALUE`). 
`Long.MAX_VALUE` (`0x7fffffffffffffff`) exceeds that ceiling by `6442450943`: 
its high 32-bit key, `0x7fffffff`, is one past `kMaxHighKey`. The check 
therefore fails and the whole stage aborts.
   
   **This is flaky / non-deterministic.** The exact same, byte-for-byte 
identical bundle passed this test in one CI run and failed it in the next (see 
Logs). So whether the sentinel reaches the aggregator depends on runtime plan / 
scan / scheduling (split boundaries, batch composition, task distribution), not 
on a source change. It reproduces in the suite:
   
       
       org.apache.spark.sql.delta.generatedsuites.MergeIntoExtendedSyntaxSQLPathBasedDVsPredPushOnSuite
         test: extended syntax - update + conditional insert - isPartitioned: true
   
   (`...DVsPredPushOn...` = deletion vectors on, predicate pushdown on.)
   
   ### Root cause analysis
   
   - The aggregator only skips SQL NULLs; it does not special-case the sentinel:
     - `cpp/velox/operators/functions/delta/DeltaBitmapAggregator.cc:63-69`
       (`addInput` returns early only when `!value.has_value()`),
     - `cpp/velox/operators/functions/delta/DeltaBitmapAggregator.cc:43-46`
       (`addRowIndex` checks only `value >= 0`, then calls `bitmap.addSafe`).
   - The ceiling and check:
     - `cpp/velox/compute/delta/RoaringBitmapArray.cpp:91-98` (`addSafe`,
       `VELOX_CHECK_LE(value, kMaxRepresentableValue, ...)`),
     - `cpp/velox/compute/delta/RoaringBitmapArray.h:51-56`
       (`kMaxHighKey = 0x7ffffffe`, `kMaxLowKeyForMaxHighKey = 0x80000000`,
       `kMaxRepresentableValue = (kMaxHighKey << 32) | kMaxLowKeyForMaxHighKey`;
       comment: "Matches Delta JVM RoaringBitmapArray.MAX_REPRESENTABLE_VALUE").
   
   Open question for a maintainer with Velox + Delta DV-write context: Delta's 
own JVM `RoaringBitmapArray` uses the **same** `MAX_REPRESENTABLE_VALUE`, so 
vanilla Delta would reject `Long.MAX_VALUE` too. Since vanilla Delta passes 
this MERGE, it must either never produce the sentinel on the DV-write branch or 
filter it out before the bitmap is built. That suggests the real defect is 
**upstream of the aggregator** -- Gluten's native row-index materialization / 
DV-write plan is emitting (and not filtering) a `Long.MAX_VALUE` placeholder 
that vanilla Delta would have excluded. The `addSafe` check is just where it 
surfaces. Two possible fix directions:
   1. Stop the sentinel at the source (mirror Delta's filter so placeholder 
rows never reach the DV aggregation), or
   2. Make the aggregator skip the sentinel the same way it skips NULLs -- but 
only if that matches Delta's documented semantics (silently dropping a 
genuinely out-of-range index would corrupt the DV, so option 1 is preferred 
unless the sentinel is a contract).
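
   To make the difference between the two directions concrete, here is a small Python model of the aggregator's input handling. It is a sketch under assumptions, not Gluten's C++ code: `collect_row_indexes` and its `skip_sentinel` flag are hypothetical, a plain `set` stands in for the bitmap, and rejecting negative values with an error is assumed because the report only says `addRowIndex` checks `value >= 0`.

```python
MAX_REPRESENTABLE_VALUE = 0x7FFFFFFE80000000  # ceiling quoted in this report
LONG_MAX_VALUE = 2**63 - 1                    # placeholder value seen in the failure


def collect_row_indexes(values, skip_sentinel=False):
    """Model of a DV aggregation over row indexes (None models SQL NULL).

    skip_sentinel=False models the behaviour described in this report.
    skip_sentinel=True models option 2: drop exactly Long.MAX_VALUE,
    but still fail on any other out-of-range index.
    """
    rows = set()
    for value in values:
        if value is None:
            continue  # NULLs are skipped, as the report describes
        if skip_sentinel and value == LONG_MAX_VALUE:
            continue  # hypothetical: treat the placeholder like a NULL
        if not 0 <= value <= MAX_REPRESENTABLE_VALUE:
            raise ValueError(
                f"row index {value} exceeds max representable value "
                f"{MAX_REPRESENTABLE_VALUE} or is negative"
            )
        rows.add(value)
    return rows


# Option 1 corresponds to the caller never passing the placeholder at all:
print(sorted(collect_row_indexes([0, None, 3])))  # [0, 3]
```

   Even with `skip_sentinel=True` the model keeps the range check for every other value, which reflects the concern in option 2: only a value that is contractually a placeholder should be dropped silently.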
   
   This was written with the assistance of AI tooling.
   
   ## Gluten version
   main branch
   
   ## Spark version
   spark-4.0.x (actually Spark 4.1.0 -- Delta 4.2.0's default; the form has no 
4.1 option)
   
   ## Spark configurations
   
   From the Delta-on-Gluten test harness (patched `DeltaSQLCommandTest`):
   
       spark.plugins                = org.apache.gluten.GlutenPlugin
       spark.shuffle.manager        = org.apache.spark.shuffle.sort.ColumnarShuffleManager
       spark.memory.offHeap.enabled = true
       spark.memory.offHeap.size    = 2g
       Delta 4.2.0, Scala 2.13, JDK 17
       (Delta defaults: deletion vectors enabled; predicate pushdown enabled.)
   
   ## System information
   CI runner: ubuntu-22.04 host, ~16 GB RAM, container 
apache/gluten:centos-9-jdk17. Not run via dev/info.sh (observed in CI).
   
   ## Relevant logs
   
   Delta Spark UT (Gluten) pipeline, apache/gluten run 28198677737, shard 1 
(job 83536282846). The prior, byte-for-byte identical run 28148323203 passed 
the same test (shard 1: 230 expected failures, 0 regressions) -- demonstrating 
the intermittency.
   
       extended syntax - update + conditional insert - isPartitioned: true *** FAILED ***
       org.apache.spark.SparkException: Job aborted due to stage failure:
         Task 0 in stage 1028.0 failed 1 times, most recent failure:
         Lost task 0.0 in stage 1028.0 (TID 843):
         org.apache.gluten.exception.GlutenException: ... Exception: VeloxRuntimeError
         Error Source: RUNTIME
         Error Code: INVALID_STATE
         Reason: (9223372036854775807 vs. 9223372030412324864)
                 Delta RoaringBitmapArray row index 9223372036854775807
                 exceeds max representable value 9223372030412324864
         Retriable: False
         Expression: value <= kMaxRepresentableValue
         Context: Operator: PartialAggregation[9] 9
         Function: addSafe
         File: /work/cpp/velox/compute/delta/RoaringBitmapArray.cpp
         Line: 92
         ...
         at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
         at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:135)
         at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:316)
         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)
   
   ## Reproduction
   1. Build the Gluten Velox bundle (Spark 4.1 + Scala 2.13 + JDK 17, Delta 
profile).
   2. Run delta-io/delta v4.2.0 with the Gluten plugin enabled 
(`spark.plugins=org.apache.gluten.GlutenPlugin`), suite 
`MergeIntoExtendedSyntaxSQLPathBasedDVsPredPushOnSuite`, test "extended syntax 
- update + conditional insert - isPartitioned: true".
      - Because it is intermittent, it may take several runs (or concurrent 
test forks / CPU contention) to surface. Equivalent minimal repro: a `MERGE 
INTO` with an UPDATE action plus a conditional INSERT into a partitioned Delta 
table that has deletion vectors enabled, with predicate pushdown on.
   CI Link: 
https://github.com/apache/gluten/actions/runs/28198677737/job/83536282846?pr=12371#step:9:2828
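
   For the "equivalent minimal repro" described above, the statements could look like the following. This is a sketch only: the table name `dv_target`, the view name `dv_source`, the columns, and the row values are all made up for illustration and are not taken from the Delta suite (which is path-based rather than name-based). Since the failure is intermittent, running these statements is not guaranteed to trigger it.

```python
def merge_repro_statements(target="dv_target", source="dv_source"):
    """Return Spark SQL statements for an UPDATE + conditional INSERT merge
    into a partitioned Delta table with deletion vectors enabled.
    All object names and data are hypothetical."""
    return [
        f"CREATE TABLE {target} (key INT, value INT, part INT) USING delta "
        f"PARTITIONED BY (part) "
        f"TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')",
        f"INSERT INTO {target} VALUES (1, 10, 0), (2, 20, 1), (3, 30, 1)",
        f"CREATE OR REPLACE TEMP VIEW {source} AS SELECT * FROM "
        f"VALUES (2, 21, 1), (4, 40, 0), (5, -1, 1) AS s(key, value, part)",
        f"MERGE INTO {target} t USING {source} s ON t.key = s.key "
        f"WHEN MATCHED THEN UPDATE SET t.value = s.value "
        f"WHEN NOT MATCHED AND s.value > 0 THEN "
        f"INSERT (key, value, part) VALUES (s.key, s.value, s.part)",
    ]


for statement in merge_repro_statements():
    print(statement + ";")
```

   Each statement would be passed to `spark.sql(...)` in a session started with the Spark configurations listed in this report.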
   
   ## Impact / workaround
   - Intermittently fails any MERGE-with-DV workload, and makes the 
Delta-on-Gluten CI gate flaky (apache/gluten PR #12278): the test is not in the 
known-failures baseline (it usually passes), so a run that hits the sentinel is 
reported as a regression and turns the gate red.
   - No good baseline workaround: because the failure is flaky, adding it to 
`known-failures.txt` would instead make the gate red on every run where it 
passes (the pipeline runs with `DELTA_FAIL_ON_FIXED=true`). A proper fix (or a 
dedicated flaky-quarantine list in the gate) is needed.
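
   The gate behaviour described above, and what a flaky-quarantine list would change, can be modelled in a few lines. The real gate script is not shown in this report, so `gate_problems`, its parameters, and the message strings are hypothetical; only the `known-failures.txt` baseline and the `DELTA_FAIL_ON_FIXED=true` semantics are taken from the description.

```python
def gate_problems(results, known_failures, flaky=frozenset(), fail_on_fixed=True):
    """Return the reasons a run would turn the gate red.

    results:        dict of test name -> True if the test passed
    known_failures: tests expected to fail (the known-failures baseline)
    flaky:          hypothetical quarantine list, ignored either way
    fail_on_fixed:  models DELTA_FAIL_ON_FIXED=true
    """
    problems = []
    for name, passed in sorted(results.items()):
        if name in flaky:
            continue  # quarantined: neither a regression nor a fix
        if not passed and name not in known_failures:
            problems.append(f"regression: {name}")
        elif passed and name in known_failures and fail_on_fixed:
            problems.append(f"fixed but still listed: {name}")
    return problems


TEST = "extended syntax - update + conditional insert - isPartitioned: true"

print(gate_problems({TEST: False}, known_failures=set()))          # red when it fails
print(gate_problems({TEST: True}, known_failures={TEST}))          # red when it passes
print(gate_problems({TEST: False}, known_failures=set(), flaky={TEST}))  # []
```

   With only the baseline available, the test turns the gate red in one of the two outcomes regardless of where it is listed; a separate quarantine list is the only configuration in this model that stays green for both.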
   
   
   ### Gluten version
   
   main branch
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   Spark 4.1.0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

