felipepessoto opened a new pull request, #12388: URL: https://github.com/apache/gluten/pull/12388
Fix https://github.com/apache/gluten/issues/9296.

## What changes are proposed in this pull request?

Adds a CI pipeline that runs delta-io/delta's `spark` ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces **many expected failures** (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a **committed baseline of known failures** and gates each run against it:

- **regression** -- a test fails that is *not* in the baseline -> the shard fails.
- **expected** -- a failing test that *is* in the baseline -> ignored.
- **now-passing** -- a baseline test that starts passing -> fails the shard (keeps the baseline honest), unless `fail_on_fixed=false`.

### How it works

1. Runs as a **reusable workflow** (`on: workflow_call`) invoked from `velox_backend_x86.yml`, so it **reuses the Velox native libs + Arrow jars that workflow already builds** instead of duplicating the expensive native C++ build. It then assembles the `gluten-velox-bundle` fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile). A `workflow_dispatch` trigger is kept for standalone manual runs (which build the native lib themselves).
2. Clones delta-io/delta at a release tag (currently `v4.2.0`), drops the bundle onto the `spark` project's test classpath, patches `DeltaSQLCommandTest` to register `GlutenPlugin`, and cherry-picks two merged upstream Delta test-only fixes (delta-io/delta#7104 + #7105) that widen `FileSourceScanExec` checks to `FileSourceScanLike` so Gluten's transformed plan is recognized.
3. Runs `sbt spark/test` **sharded by suite** across **4 shards (4 forked test JVMs each, ~16-way parallelism)**, with ScalaTest's JUnit XML reporter enabled, then gates each shard with `compare-test-results.py` against `known-failures.txt`.
   A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.

### Files

| File | Purpose |
|---|---|
| `.github/workflows/velox_backend_x86.yml` | Caller: builds the native lib once, uploads the native + Arrow artifacts, and invokes the reusable Delta workflow (reusing that build instead of duplicating it). |
| `.github/workflows/delta_spark_ut.yml` | The reusable Delta workflow (build bundle -> shard tests -> gate). |
| `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Clones Delta, injects the Gluten bundle, patches `DeltaSQLCommandTest`, cherry-picks the upstream test fixes. |
| `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
| `.github/workflows/util/delta-spark-ut/known-failures.txt` | Committed baseline of currently-expected failures (`#` comments per line). |
| `.github/workflows/util/delta-spark-ut/README.md` | Documents the gate, bootstrapping, and baseline refresh. |

### Operational hardening

- **JDK 17 + Arrow/Netty**: forked test JVMs get the `--add-opens` set plus `-Dio.netty.tryReflectionSetAccessible=true` (otherwise Arrow's allocator fails to initialize).
- **Heap tuning**: forked-test heap and the sbt launcher's idle G1 behavior are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
- **Hang watchdog**: a per-shard watchdog dumps threads and kills a forked test JVM that has gone silent too long, so a wedged suite can't stall the whole job.
- **DeletionVectorsSuite 2B-row tests**: two tests build/read/delete a 2-billion-row table and balloon the fork to ~13 GB of native memory (Velox row-index materialization), OOM-killing it and hanging the shard. They are force-failed (with a clear message) rather than silently ignored, so the gap stays visible until the native memory blow-up is fixed.
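The baseline gate described earlier (regression / expected / now-passing) can be sketched roughly as below. This is an illustrative sketch, not the contents of `compare-test-results.py`: the function names, the `classname.name` test-id format, and the assumption that `known-failures.txt` holds one test id per line with `#`-prefixed comment lines are all hypothetical.

```python
import xml.etree.ElementTree as ET


def failed_and_passed(junit_xml):
    """Return (failed, passed) sets of test ids from one JUnit XML document.

    The 'classname.name' id format is an assumption made for this sketch.
    """
    failed, passed = set(), set()
    for case in ET.fromstring(junit_xml).iter("testcase"):
        test_id = f"{case.get('classname')}.{case.get('name')}"
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(test_id)
        elif case.find("skipped") is None:
            # Skipped tests count as neither passed nor failed.
            passed.add(test_id)
    return failed, passed


def load_baseline(text):
    """Parse a baseline file: one test id per line, '#' lines are comments.

    The file layout is an assumption; the real known-failures.txt may differ.
    """
    entries = set()
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            entries.add(stripped)
    return entries


def gate(failed, passed, baseline, fail_on_fixed=True):
    """Classify one shard's results against the baseline."""
    regressions = sorted(failed - baseline)   # failing, not in baseline
    expected = sorted(failed & baseline)      # failing, already known
    now_passing = sorted(passed & baseline)   # known failure that passed
    ok = not regressions and not (fail_on_fixed and now_passing)
    return {
        "regressions": regressions,
        "expected": expected,
        "now_passing": now_passing,
        "ok": ok,
    }
```

Only tests that actually ran and passed in a shard are counted as now-passing, so baseline entries that belong to other shards do not trip the gate.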
### Scope / known limitations

- Velox backend, x86 only; Delta `v4.2.0` / Spark 4.1 / Scala 2.13 / JDK 17.
- The baseline reflects the *current* set of known Delta-on-Gluten failures; refresh it via a `workflow_dispatch` run with `update_baseline=true`.
- **Future work -- Delta 4.3.0**: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (`IdentityColumn.logTableWrite` first param `Snapshot` -> `SnapshotDescriptor`), which throws `NoSuchMethodError` on every write. Supporting 4.3.0 needs the bundle built against 4.3.0; tracked as follow-up.

## How was this patch tested?

This change *is* CI. The Delta suite runs as part of `velox_backend_x86.yml` -- on every PR/trigger that touches Velox/core/cpp or the Delta CI files -- and via manual `workflow_dispatch`. In the latest runs all shards pass against the committed baseline (failures limited to known-failures entries; no regressions). 19,073 Delta tests run (18,297 passed / 776 failed).

### Main failures (776 baseline):

- 226 tests - Increment Metric: known issue https://github.com/apache/gluten/issues/9003. [Test with increment metric offload disabled](https://github.com/apache/gluten/actions/runs/28226442887/job/83623837492?pr=12380)
- 99 tests - VariantType - java.lang.UnsupportedOperationException: Unsupported data type: variant - Arrow throws (SparkArrowUtil.scala:60)
- ~47 tests - ClassCast ProjectExec -> WholeStageTransformer (Delta stats) - This will be addressed by https://github.com/apache/gluten/issues/11622#issuecomment-4421317668 `timestamp -> timestamp_ntz`

**Fixed since the first draft (#12371):** the 187 `MatchError List()` DataSkipping-empty-stats failures (caused by a `FileSourceScanExec` match) were fixed by cherry-picking the merged Delta PRs 7104 + 7105 (`FileSourceScanExec` -> `FileSourceScanLike`) during test setup. That dropped the baseline from 963 to **776** known failures (187 now-passing removed, 0 regressions).
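The suite-based sharding used by the test run (each suite is assigned to exactly one shard by hashing its name) can be illustrated with a small sketch. It is not the workflow's actual code: the function names are hypothetical, the suite names other than `DeletionVectorsSuite` are placeholders, and `zlib.crc32` stands in for MurmurHash3 because the Python standard library has no MurmurHash3. Any stable hash gives the same partitioning property, though the concrete shard assignments will differ from the real pipeline.

```python
import zlib


def shard_of(suite_name, num_shards):
    """Map a suite name to a shard index in [0, num_shards).

    zlib.crc32 is a stand-in for MurmurHash3 (an assumption of this sketch);
    it is deterministic across processes, unlike Python's built-in hash().
    """
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def suites_for_shard(suites, shard_index, num_shards):
    """Select the suites that a given shard should run."""
    return [s for s in suites if shard_of(s, num_shards) == shard_index]


if __name__ == "__main__":
    # Placeholder suite names, except DeletionVectorsSuite.
    suites = ["DeletionVectorsSuite", "ExampleSuiteA", "ExampleSuiteB", "ExampleSuiteC"]
    num_shards = 4
    for index in range(num_shards):
        print(index, suites_for_shard(suites, index, num_shards))
```

Because assignment depends only on the suite name and the shard count, every shard can compute its own suite list independently, and the total test work stays the same however many shards are used.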
## Delta Spark UT (Gluten) -- shard count vs test parallelism

Sharding is by **suite** (`MurmurHash3(suiteName) % NUM_SHARDS`), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB. The committed config is **4 shards x 4 forks**.

| Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
|---|---|---|---|---|---|---|
| 16 shards x 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | green |
| **4 shards x 4 forks** | **4** | **4** | **158 min** | **178 min** | **~10.5** | **green** |
| 4 shards x 1 fork | 4 | 1 | 360 min (hit cap) | -- | -- | cancelled |

## Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI
