malinjawi opened a new pull request, #12030:
URL: https://github.com/apache/gluten/pull/12030
Description
Currently, Delta writes with table invariants are routed through
`DeltaInvariantCheckerExec`. This checker is row-based, so even simple
top-level `NOT NULL` constraints can introduce a Columnar-to-Row (C2R)
transition in an otherwise native Delta write path.
This patch adds a native invariant checker for the safe common case:
top-level `NOT NULL` constraints. The checker validates Velox columnar batches
directly before they are written, avoiding `DeltaInvariantCheckerExec` and the
associated C2R transition for supported NOT NULL-only writes.
Unsupported cases stay on the existing, conservative path:
- `CHECK` constraints still use Delta's existing row invariant checker
- nested NOT NULL constraints still use Delta's existing row invariant
checker
- planned-write / optimized-write fallback paths do not force the native
checker when the existing writer path is required
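The routing rules above can be sketched as a small decision function. This is an illustrative Python sketch, not the Scala code in `GlutenOptimisticTransaction`: the `NotNull` and `Check` classes, `choose_checker`, and the `requires_existing_writer` flag are hypothetical names, and returning `"none"` for a table with no constraints is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class NotNull:
    # Path to the constrained column; one element means a top-level column.
    column_path: Tuple[str, ...]


@dataclass(frozen=True)
class Check:
    name: str
    expression: str


Constraint = Union[NotNull, Check]


def choose_checker(constraints: List[Constraint],
                   requires_existing_writer: bool = False) -> str:
    """Return "native", "row", or "none" for a set of table constraints."""
    if not constraints:
        return "none"  # assumption: nothing to validate
    if requires_existing_writer:
        return "row"   # planned-write / optimized-write fallback
    for c in constraints:
        is_top_level_not_null = isinstance(c, NotNull) and len(c.column_path) == 1
        if not is_top_level_not_null:
            return "row"  # CHECK or nested NOT NULL: keep the row checker
    return "native"
```

The sketch is all-or-nothing: a single unsupported constraint sends the whole write back to the row-based checker, which matches the "NOT NULL-only writes" wording above.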
Main changes:
- add `GlutenDeltaInvariantChecker` for Delta 3.3 and Delta 4.0 sources
- route supported top-level `NOT NULL` constraints through the native
checker in `GlutenOptimisticTransaction`
- pass the native checker into `GlutenDeltaFileFormatWriter` and validate
rows at the writer boundary
- add Velox JNI support to detect nulls in selected batch columns, with a
cached `nullCount` fast path
- add Spark 3.5 and Spark 4.0 tests for native NOT NULL checks, unsupported
CHECK fallback, and violation reporting
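The null detection with a cached `nullCount` fast path can be illustrated roughly as follows. This Python sketch only models the idea; it is not the Velox JNI code, and `Column`, `Batch`, `column_has_nulls`, and `first_violating_column` are hypothetical names. It assumes a cached null count, when present, can be trusted without scanning the values.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Column:
    name: str
    values: List[object]              # None marks a null entry
    null_count: Optional[int] = None  # cached count, if one was recorded


@dataclass
class Batch:
    columns: List[Column]


def column_has_nulls(col: Column) -> bool:
    # Fast path: a cached null count answers the question without a scan.
    if col.null_count is not None:
        return col.null_count > 0
    # Slow path: scan the values and stop at the first null.
    return any(v is None for v in col.values)


def first_violating_column(batch: Batch,
                           not_null_indices: Sequence[int]) -> Optional[str]:
    """Return the name of the first NOT NULL column containing a null, if any."""
    for idx in not_null_indices:
        col = batch.columns[idx]
        if column_has_nulls(col):
            return col.name
    return None
```

Only the selected column indices are inspected, so nullable columns in the same batch cost nothing. A caller would turn a non-`None` result into the violation error for that column.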
This is independent from #12024. It belongs to the native Delta write
hardening track and is a follow-up to the Delta write C2R reduction work in
#11419 and #12016.
Related issue: #10215
Tracked by #12025
Performance
The patch was benchmarked locally with append workloads comparing the native
top-level NOT NULL path against an equivalent unsupported CHECK-constraint
fallback path.
Wide append benchmark:
- 2M rows
- 14 columns
- 3 append iterations
| Path | Average time per append |
| --- | ---: |
| Native top-level NOT NULL checker | 1713 ms |
| Row CHECK fallback path | 1664 ms |
Excluding the first warm-up append:
| Path | Average time per append |
| --- | ---: |
| Native top-level NOT NULL checker | 1605 ms |
| Row CHECK fallback path | 1647 ms |
This PR is not a headline throughput improvement on its own. The local
benchmark is effectively neutral because Delta write setup, Parquet output, and
commit/log work dominate this microbenchmark. The value lies in removing a row
invariant operator and a C2R transition from a common constrained Delta write
path while preserving Delta's existing fallback behavior for unsupported
constraints.
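To put "effectively neutral" in numbers, the relative difference between the two paths can be computed from the averages reported in the tables. The helper name `relative_diff_pct` is illustrative; the inputs are the millisecond values from the tables above.

```python
def relative_diff_pct(native_ms: float, fallback_ms: float) -> float:
    """Percent difference of the native path relative to the fallback path.

    Positive means the native path was slower, negative means it was faster.
    """
    return (native_ms - fallback_ms) / fallback_ms * 100.0


# Averages over all three appends.
all_appends = relative_diff_pct(1713, 1664)

# Averages excluding the first warm-up append.
steady_state = relative_diff_pct(1605, 1647)

print(f"all appends:  {all_appends:+.2f}%")
print(f"steady state: {steady_state:+.2f}%")
```

Both differences are within about 3% and have opposite signs, which is consistent with treating the result as noise-level on a three-iteration local run.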
How was this patch tested?
Spark 3.5:
- `DeltaNativeWriteInvariantSuite`
- native Delta write checks top-level NOT NULL without
`DeltaInvariantCheckerExec`
- CHECK constraints keep `DeltaInvariantCheckerExec`
- NOT NULL violations report `InvariantViolationException`
Spark 4.0:
- `DeltaNativeWriteSuite`
- native Delta write checks top-level NOT NULL without
`DeltaInvariantCheckerExec`
- native NOT NULL path avoids `ColumnarToRow`
- CHECK constraints keep `DeltaInvariantCheckerExec`
- NOT NULL violations report `InvariantViolationException`
Additional validation:
- Spark 3.5 Delta invariant suite passed
- Spark 4.0 Delta native write suite passed
- Spark 4.0 focused NOT NULL tests passed after hardening
- C++ Velox build passed
- JNI symbol verified in `libvelox.dylib`
- Spark 3.5 and Spark 4.0 spotless/checkstyle/scalastyle passed
- `git diff --check` passed