malinjawi opened a new pull request, #12030:
URL: https://github.com/apache/gluten/pull/12030

   Description
   Currently, Delta writes with table invariants are routed through 
`DeltaInvariantCheckerExec`. This checker is row-based, so even simple 
top-level `NOT NULL` constraints can introduce a Columnar-to-Row transition in 
an otherwise native Delta write path.
   
   This patch adds a native invariant checker for the safe common case: 
top-level `NOT NULL` constraints. The checker validates Velox columnar batches 
directly before they are written, avoiding `DeltaInvariantCheckerExec` and the 
associated C2R transition for supported NOT NULL-only writes.
   
   Unsupported constraints remain conservative:
   
   - `CHECK` constraints still use Delta's existing row invariant checker
   - nested NOT NULL constraints still use Delta's existing row invariant 
checker
   - planned-write / optimized-write fallback paths do not force the native 
checker when the existing writer path is required
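
The routing rule above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the code in `GlutenOptimisticTransaction`: the `InvariantRouting`, `Constraint`, `Kind`, and `Route` names and the `nativeWriterUsable` flag are all hypothetical.

```java
// Illustrative model only: these types are hypothetical, not Delta or Gluten classes.
final class InvariantRouting {
  enum Kind { NOT_NULL, CHECK }

  enum Route { NO_CHECKER, NATIVE_NOT_NULL, ROW_CHECKER }

  /** One table constraint; columnPath has length 1 for a top-level column. */
  static final class Constraint {
    final Kind kind;
    final String[] columnPath;

    Constraint(Kind kind, String... columnPath) {
      this.kind = kind;
      this.columnPath = columnPath;
    }
  }

  /**
   * Chooses a checker for a write. nativeWriterUsable is false when a
   * planned-write or optimized-write fallback requires the existing writer path.
   */
  static Route route(Constraint[] constraints, boolean nativeWriterUsable) {
    if (constraints.length == 0) {
      return Route.NO_CHECKER; // no invariants to enforce
    }
    if (!nativeWriterUsable) {
      return Route.ROW_CHECKER; // do not force the native checker
    }
    for (Constraint c : constraints) {
      boolean topLevelNotNull = c.kind == Kind.NOT_NULL && c.columnPath.length == 1;
      if (!topLevelNotNull) {
        return Route.ROW_CHECKER; // CHECK or nested NOT NULL: stay conservative
      }
    }
    return Route.NATIVE_NOT_NULL;
  }
}
```

In this sketch a single unsupported constraint sends the whole write to the row checker, which mirrors the conservative behavior described in the list.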
   
   Main changes:
   
   - add `GlutenDeltaInvariantChecker` for Delta 3.3 and Delta 4.0 sources
   - route supported top-level `NOT NULL` constraints through the native 
checker in `GlutenOptimisticTransaction`
   - pass the native checker into `GlutenDeltaFileFormatWriter` and validate 
columnar batches at the writer boundary
   - add Velox JNI support to detect nulls in selected batch columns, with a 
cached `nullCount` fast path
   - add Spark 3.5 and Spark 4.0 tests for native NOT NULL checks, unsupported 
CHECK fallback, and violation reporting
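
To illustrate the null-detection step, here is a minimal sketch of checking selected columns of a batch with a cached null count as the fast path. It models the idea in plain Java and is not the Velox or JNI code from this patch: the `NullScan` and `Column` types, the `-1` sentinel for an unknown count, and the `validity` array layout are assumptions made for the example.

```java
// Illustrative model only: Column is a hypothetical stand-in for a columnar vector.
final class NullScan {
  static final class Column {
    final String name;
    final boolean[] validity;  // validity[i] == false means row i is null; null array means no null mask
    final int cachedNullCount; // -1 when the count has not been computed

    Column(String name, boolean[] validity, int cachedNullCount) {
      this.name = name;
      this.validity = validity;
      this.cachedNullCount = cachedNullCount;
    }
  }

  /** Returns true if the column contains at least one null row. */
  static boolean hasNulls(Column column) {
    if (column.validity == null) {
      return false; // no null mask: every row is valid
    }
    if (column.cachedNullCount >= 0) {
      return column.cachedNullCount > 0; // fast path: no per-row scan
    }
    for (boolean valid : column.validity) {
      if (!valid) {
        return true; // stop at the first null
      }
    }
    return false;
  }

  /** Returns the name of the first selected column that has a null, or null if none does. */
  static String firstViolatingColumn(Column[] batch, int[] selectedColumns) {
    for (int index : selectedColumns) {
      if (hasNulls(batch[index])) {
        return batch[index].name;
      }
    }
    return null;
  }
}
```

Only the columns carrying a NOT NULL constraint need to be passed in `selectedColumns`, so unconstrained columns are never scanned.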
   
   This is independent from #12024. It belongs to the native Delta write 
hardening track and is a follow-up to the Delta write C2R reduction work in 
#11419 and #12016.
   
   Related issue: #10215
   
   Tracked by #12025
   
   Performance
   The patch was benchmarked locally with append workloads comparing the native 
top-level NOT NULL path against an equivalent unsupported CHECK-constraint 
fallback path.
   
   Wide append benchmark:
   
   - 2M rows
   - 14 columns
   - 3 append iterations
   
   | Path | Average per append |
   | --- | ---: |
   | Native top-level NOT NULL checker | 1713 ms |
   | Row CHECK fallback path | 1664 ms |
   
   Excluding the first warm-up append:
   
   | Path | Average per append |
   | --- | ---: |
   | Native top-level NOT NULL checker | 1605 ms |
   | Row CHECK fallback path | 1647 ms |
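
For scale, the relative differences implied by the two tables can be worked out directly from the reported averages (native path compared against the fallback path):

```math
\frac{1713 - 1664}{1664} \approx 2.9\%, \qquad \frac{1647 - 1605}{1647} \approx 2.6\%
```

That is, the native path is about 2.9% slower across all three appends and about 2.6% faster once the warm-up append is excluded. The sign flips between the two views and both gaps are small.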
   
   This PR is not intended to deliver a headline throughput gain on its own. 
The local benchmark is effectively neutral because Delta write setup, Parquet 
output, and commit/log work dominate this microbenchmark. The value lies in 
removing a row invariant operator and a C2R transition from a common 
constrained Delta write path while preserving Delta's existing fallback 
behavior for unsupported constraints.
   
   How was this patch tested?
   Spark 3.5:
   
   - `DeltaNativeWriteInvariantSuite`
     - native Delta write checks top-level NOT NULL without 
`DeltaInvariantCheckerExec`
     - CHECK constraints keep `DeltaInvariantCheckerExec`
     - NOT NULL violations report `InvariantViolationException`
   
   Spark 4.0:
   
   - `DeltaNativeWriteSuite`
     - native Delta write checks top-level NOT NULL without 
`DeltaInvariantCheckerExec`
     - native NOT NULL path avoids `ColumnarToRow`
     - CHECK constraints keep `DeltaInvariantCheckerExec`
     - NOT NULL violations report `InvariantViolationException`
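
The plan-shape checks in both suites amount to asserting that a given operator does or does not appear in the executed plan. A minimal sketch of that kind of assertion follows, assuming a hypothetical `PlanNode` tree rather than Spark's real `SparkPlan` API; the `WriteFiles` and `Scan` operator names are placeholders.

```java
// Illustrative only: PlanNode is a hypothetical stand-in for a physical plan node.
final class PlanAssertions {
  static final class PlanNode {
    final String operator;
    final PlanNode[] children;

    PlanNode(String operator, PlanNode... children) {
      this.operator = operator;
      this.children = children;
    }
  }

  /** Depth-first search for an operator name anywhere in the plan. */
  static boolean contains(PlanNode plan, String operator) {
    if (plan.operator.equals(operator)) {
      return true;
    }
    for (PlanNode child : plan.children) {
      if (contains(child, operator)) {
        return true;
      }
    }
    return false;
  }
}
```

A NOT NULL-only write would then be expected to satisfy `!contains(plan, "DeltaInvariantCheckerExec")` and `!contains(plan, "ColumnarToRow")`, while a write with a CHECK constraint would still contain `DeltaInvariantCheckerExec`.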
   
   Additional validation:
   
   - Spark 3.5 Delta invariant suite passed
   - Spark 4.0 Delta native write suite passed
   - Spark 4.0 focused NOT NULL tests passed after hardening
   - C++ Velox build passed
   - JNI symbol verified in `libvelox.dylib`
   - Spark 3.5 and Spark 4.0 spotless/checkstyle/scalastyle passed
   - `git diff --check` passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

