Re: [PR] [VL][Delta] Add Delta Spark UT pipeline gated against a known-failures baseline [gluten]

via GitHub Fri, 03 Jul 2026 02:37:44 -0700


Copilot commented on code in PR #12388:
URL: https://github.com/apache/gluten/pull/12388#discussion_r3518962952



##########
.github/workflows/util/delta-spark-ut/compare-test-results.py:
##########
@@ -0,0 +1,509 @@
+#!/usr/bin/env python3
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Gate / seed / aggregate the Delta-on-Gluten unit test results.
+
+Running delta-io/delta's ScalaTest suite against the Gluten Velox bundle
+produces many *expected* failures (Gluten does not yet support every Delta
+code path). To keep the red/green signal meaningful while we fix those
+failures incrementally, we maintain a committed baseline of known failing
+tests (``known-failures.txt``) and compare each CI run against it.
+
+This script has three modes:
+
+``enforce`` (default, per shard)
+    Parse the JUnit XML produced by ``sbt spark/test`` (ScalaTest ``-u``
+    reporter) and compare against the baseline:
+
+      * regression -- a test that FAILED but is NOT in the baseline. These
+        fail the build: a previously-passing test just started failing.
+      * expected   -- a test that failed and IS in the baseline. Ignored.
+      * fixed      -- a baseline test that now PASSES. By default these also
+        fail the build (``--fail-on-fixed true``) so the baseline stays honest
+        and contributors remove entries as they fix them.
+
+    If the baseline file exists but is empty (not yet bootstrapped) the mode
+    automatically degrades to ``seed`` so the first run is never spuriously red.
+    A *missing* ``--known-failures`` file is treated as a configuration error
+    (the gate fails) so a mis-referenced path can't silently pass.
+
+``seed`` (bootstrap / ``update_baseline``)
+    Never fails. Just writes the current shard's failing tests so the baseline
+    can be (re)generated from a real run.
+
+``aggregate`` (final job)
+    Merge every shard's ``--failures-out`` / ``--ran-out`` file into a single,
+    sorted, ready-to-commit ``known-failures.txt`` and report stale baseline
+    entries (tests no longer present in any shard).
+
+Baseline file format (``known-failures.txt``)::
+
+    # comment lines start with '#'
+    <fully.qualified.SuiteName>#<test display name>
+
+The suite is always a JVM class name (dot-separated, never starts with '#'),
+so a line whose first non-space character is '#' is unambiguously a comment,
+and the FIRST '#' after the suite separates suite from the (possibly
+'#'-containing) test name.
+
+Only the Python standard library is used so the script runs in the bare
+centos image used by the Delta UT pipeline with no ``pip install``.
+"""
+
+import argparse
+import glob
+import os
+import sys
+import xml.etree.ElementTree as ET
+
+# Synthetic "test name" recorded when a whole suite aborts (e.g. beforeAll
+# throws) so that the JUnit XML reports a suite-level error with no per-test
+# <testcase>. Without this, a suite that used to pass but now aborts entirely
+# would record zero failing testcases and the regression would be missed.
+SUITE_ABORTED = "<suite aborted>"
+
+
+class NoReportsError(RuntimeError):
+    """Raised when no JUnit <testsuite> elements are found under reports_dir."""
+
+
+SEP = "#"
+
+
+def eprint(*args, **kwargs):
+    print(*args, file=sys.stderr, **kwargs)
+
+
+# --------------------------------------------------------------------------- #
+# Baseline (known-failures.txt) parsing / formatting
+# --------------------------------------------------------------------------- #
+def format_entry(suite, test):
+    return "{}{}{}".format(suite, SEP, test)
+
+
+def parse_entry(line):
+    """Parse a 'suite#test' line into (suite, test) or return None for blanks/comments."""
+    stripped = line.strip()
+    if not stripped or stripped.startswith("#"):
+        return None
+    idx = stripped.find(SEP)
+    if idx < 0:
+        # No separator: treat the whole line as a suite-level entry.
+        return (stripped, SUITE_ABORTED)
+    return (stripped[:idx], stripped[idx + len(SEP) :])
+
+
+def load_entries(path):
+    """Load a set of (suite, test) tuples from a baseline/shard-list file."""
+    entries = set()
+    if not path or not os.path.exists(path):
+        return entries
+    with open(path, "r", encoding="utf-8") as fh:
+        for line in fh:
+            parsed = parse_entry(line)
+            if parsed is not None:
+                entries.add(parsed)
+    return entries
+
+
+def write_entries(path, entries, header=None):
+    """Write a sorted set of (suite, test) tuples to a file."""
+    os.makedirs(os.path.dirname(os.path.abspath(path)) or ".", exist_ok=True)
+    with open(path, "w", encoding="utf-8") as fh:
+        if header:
+            for hl in header.splitlines():
+                fh.write(hl.rstrip() + "\n")
+        for suite, test in sorted(entries):
+            # Defensive: collapse any stray newlines so each entry stays on one line.
+            safe_test = test.replace("\r", " ").replace("\n", " ")
+            fh.write(format_entry(suite, safe_test) + "\n")
+
+
+# --------------------------------------------------------------------------- #
+# JUnit XML parsing
+# --------------------------------------------------------------------------- #
+def _iter_testsuites(root):
+    """Yield every <testsuite> element regardless of whether the file root is
+    <testsuites> (wrapper) or a single <testsuite>."""
+    tag = root.tag.split("}")[-1]  # strip any namespace
+    if tag == "testsuites":
+        for child in root:
+            if child.tag.split("}")[-1] == "testsuite":
+                yield child
+    elif tag == "testsuite":
+        yield root
+
+
+def _child_local_tags(elem):
+    return {c.tag.split("}")[-1] for c in elem}
+
+
+def parse_reports(reports_dir):
+    """Walk reports_dir for JUnit XML and classify every test.
+
+    Returns (passed, failed, skipped) sets of (suite, test) tuples. A test is
+    'failed' if its <testcase> has a <failure> or <error> child, 'skipped' if
+    it has a <skipped> child, otherwise 'passed'. Suite-level aborts (a
+    <testsuite> reporting errors/failures with no failing <testcase>) are
+    recorded as a synthetic (suite, SUITE_ABORTED) failure.
+    """
+    passed, failed, skipped = set(), set(), set()
+
+    xml_files = []
+    # ScalaTest's -u reporter and Maven surefire both write `TEST-<suite>.xml`
+    # under a `target/.../*-reports/` dir. Restrict the secondary glob to
+    # `target/` so we never parse Delta's own XML *test resources* (which live
+    # under src/test/resources and are not reports). The <testsuite>-root guard
+    # below is a final safety net.
+    for pattern in ("**/TEST-*.xml", "**/target/**/*.xml"):
+        xml_files.extend(glob.glob(os.path.join(reports_dir, pattern), recursive=True))
+    xml_files = sorted(set(xml_files))
+
+    parsed_any = False
+    for xml_file in xml_files:
+        try:
+            tree = ET.parse(xml_file)
+        except ET.ParseError as exc:
+            eprint("WARNING: could not parse {}: {}".format(xml_file, exc))
+            continue

Review Comment:
   `parse_reports()` logs a warning and skips any XML file that fails to parse.
   If a JUnit report is truncated or corrupted (e.g., the forked JVM was killed
   mid-write), skipping it can drop failures from the evaluated result set and
   let regressions slip through. For this gate, it is safer to fail fast, at
   least for expected report files (`TEST-*.xml` / surefire-style reports), so
   CI can't go green on partial data.
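
   One possible shape for the fail-fast behaviour, as a minimal sketch only.
   The names `CorruptReportError`, `is_expected_report`, and
   `parse_report_strict` are hypothetical and not part of this PR; the sketch
   assumes expected reports can be recognised by the `TEST-*.xml` file name and
   that other XML files picked up by the broader glob may still be skipped.

```python
import fnmatch
import os
import xml.etree.ElementTree as ET


class CorruptReportError(RuntimeError):
    """Hypothetical error: an expected JUnit report could not be parsed."""


def is_expected_report(xml_file):
    """Return True for files named like ScalaTest/surefire reports."""
    return fnmatch.fnmatch(os.path.basename(xml_file), "TEST-*.xml")


def parse_report_strict(xml_file):
    """Parse one XML file and return its root element.

    Raises CorruptReportError if a TEST-*.xml report is unparseable, so the
    caller can fail the gate instead of evaluating partial data. Returns None
    for any other unparseable XML file, which the caller may skip.
    """
    try:
        return ET.parse(xml_file).getroot()
    except ET.ParseError as exc:
        if is_expected_report(xml_file):
            raise CorruptReportError(
                "could not parse report {}: {}".format(xml_file, exc)
            ) from exc
        return None
```

   In `enforce` mode the caller would presumably let `CorruptReportError`
   propagate to a non-zero exit; whether `seed` mode should be equally strict
   is a separate design choice.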



##########
.github/workflows/delta_spark_ut.yml:
##########
@@ -0,0 +1,683 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Runs Delta Lake's `spark` sbt module unit tests against a Gluten Velox bundle
+# that is built from the source in this repository. The pipeline:
+#
+#   1. Builds the Velox/Gluten native libraries (centos-7 + vcpkg, x86_64).
+#   2. Builds the Gluten Java/Scala jars and assembles the
+#      `gluten-velox-bundle-spark<spark>_<scala>-linux_amd64-<version>.jar`
+#      fat jar for Spark 4.1 + Scala 2.13 + Java 17 with the Delta profile.
+#   3. Clones delta-io/delta at the requested release tag (default `v4.2.0`),
+#      drops the bundle jar into `spark-unified/lib/` only (NOT `spark/lib/`
+#      -- see setup-delta.sh for the unmanagedJars scoping rationale),
+#      patches Delta's `DeltaSQLCommandTest` to register the Gluten plugin,
+#      and runs `sbt spark/test` sharded across the matrix.
+#
+# Limited to Velox + x86 to keep the matrix simple, per the pipeline's purpose
+# of validating Gluten changes against the latest Delta release.
+
+name: Delta Spark UT (Gluten)
+
+on:
+  # Reusable workflow. velox_backend_x86.yml calls this (gated on Delta-relevant
+  # changes) and passes the native-lib + arrow-jars artifacts it already built,
+  # so the expensive native C++ build is NOT duplicated. Those artifacts live in
+  # the CALLER's run (a called workflow runs as part of the caller run), so the
+  # jobs below download them by name. See velox_backend_x86.yml `delta-spark-ut`.
+  #
+  # NOTE: the `pull_request` trigger was removed so this no longer runs as its own
+  # workflow on PRs (which would double-run the Delta suite). velox_backend_x86.yml
+  # is now the single PR entry point; `workflow_dispatch` keeps manual standalone
+  # runs working (those build the native lib themselves -- see build-native-lib).
+  workflow_call:
+    inputs:
+      native_lib_artifact:
+        description: 'Name of the cpp/build artifact uploaded by the caller'
+        type: string
+        required: true
+      arrow_jars_artifact:
+        description: 'Name of the org.apache.arrow jars artifact uploaded by the caller'
+        type: string
+        required: true
+      delta_ref:
+        type: string
+        required: false
+        default: 'v4.2.0'
+      spark_version:
+        type: string
+        required: false
+        default: '4.1'
+      test_parallelism:
+        type: string
+        required: false
+        default: '4'
+      update_baseline:
+        type: boolean
+        required: false
+        default: false
+      fail_on_fixed:
+        type: boolean
+        required: false
+        default: true
+  workflow_dispatch:
+    inputs:
+      delta_ref:
+        description: 'delta-io/delta git ref (tag/branch/SHA) to test against'
+        required: true
+        default: 'v4.2.0'
+      spark_version:
+        description: 'Delta `-DsparkVersion` value (must match the Gluten -P profile below)'
+        required: true
+        default: '4.1'
+      test_parallelism:
+        description: 'Forked test JVMs per shard (TEST_PARALLELISM_COUNT)'
+        required: true
+        default: '4'
+      update_baseline:
+        description: 'Seed/refresh the known-failures baseline instead of enforcing it'
+        type: boolean
+        required: false
+        default: false
+      fail_on_fixed:
+        description: 'Fail when a baseline test now passes (keeps the baseline honest)'
+        type: boolean
+        required: false
+        default: true
+
+env:
+  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
+  MVN_CMD: 'build/mvn -ntp'
+  CCACHE_DIR: "${{ github.workspace }}/.ccache"
+  # Gluten profile / bundle naming for the build-gluten-bundle and
+  # delta-spark-test jobs. Spark 4.1 + Scala 2.13 + JDK 17 matches Delta v4.2.0's
+  # default Spark version (4.1.0) from project/CrossSparkVersions.scala.
+  GLUTEN_SPARK_PROFILE: 'spark-4.1'
+  GLUTEN_SCALA_PROFILE: 'scala-2.13'
+  GLUTEN_JAVA_PROFILE: 'java-17'
+  GLUTEN_BUNDLE_SPARK_VERSION: '4.1'
+  GLUTEN_BUNDLE_SCALA_VERSION: '2.13'
+  DELTA_SCALA_VERSION: '2.13.16'
+  # Number of shards in the delta-spark-test matrix. Must equal the length of
+  # the `shard` matrix below.
+  #
+  # 4 shards x TEST_PARALLELISM_COUNT=4 gives ~16-way parallelism packed into 4
+  # runner jobs (4 forks each) rather than 16 single-fork jobs -- fewer concurrent
+  # runners for the same throughput. Sharding is by SUITE; total work
+  # (~1250 shard-minutes) is fixed. Each forked test JVM uses ~4G (2G heap + 2G
+  # off-heap), so 4 forks plus the sbt launcher sit close to the ~16G runner limit;
+  # this fits because the worst memory hog (DeletionVectorsSuite 2B-row) is
+  # force-failed in setup-delta.sh.
+  DELTA_NUM_SHARDS: '4'
+
+# No `concurrency:` here on purpose. As a reusable workflow this runs inside the
+# caller's run, where `github.workflow` resolves to the CALLER's name -- a group
+# keyed on it would collide with the caller's own group and, with
+# cancel-in-progress, could cancel the parent run. The caller's concurrency
+# already governs cancellation. (A standalone workflow_dispatch run just won't
+# auto-cancel, which is fine for infrequent manual runs.)
+
+jobs:
+  build-native-lib-centos-7:
+    # Standalone (workflow_dispatch) only. When called by velox_backend_x86.yml
+    # the caller already built the native lib + arrow jars and passes them as
+    # inputs, so this job is skipped and the duplicate native build is avoided.
+    if: github.event_name == 'workflow_dispatch'
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v4
+      - name: Get Ccache
+        uses: actions/cache/restore@v4
+        with:
+          path: '${{ env.CCACHE_DIR }}'
+          key: ccache-delta-spark-ut-centos7-release-default-${{github.sha}}
+          restore-keys: |
+            ccache-delta-spark-ut-centos7-release-default
+            ccache-centos7-release-default
+      - name: Build Gluten native libraries
+        run: |
+          docker run -v $GITHUB_WORKSPACE:/work -w /work apache/gluten:vcpkg-centos-7-gcc13 bash -c "
+            set -e
+            yum install tzdata -y
+            df -a
+            cd /work
+            export CCACHE_DIR=/work/.ccache
+            export CCACHE_MAXSIZE=1G
+            mkdir -p /work/.ccache
+            ccache -sz
+            bash dev/ci-velox-buildstatic-centos-7.sh
+            ccache -s
+            mkdir -p /work/.m2/repository/org/apache/arrow/
+            cp -r /root/.m2/repository/org/apache/arrow/* /work/.m2/repository/org/apache/arrow/
+          "
+      - name: Save Ccache
+        if: always()
+        uses: actions/cache/save@v4
+        with:
+          path: '${{ env.CCACHE_DIR }}'
+          key: ccache-delta-spark-ut-centos7-release-default-${{github.sha}}
+      - uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-native-lib-centos-7-${{github.sha}}
+          path: ./cpp/build/
+          if-no-files-found: error
+      - uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-arrow-jars-centos-7-${{github.sha}}
+          path: .m2/repository/org/apache/arrow/
+          if-no-files-found: error
+
+  build-gluten-bundle:
+    needs: build-native-lib-centos-7
+    # Run whether the native lib was built here (dispatch -> success) or provided
+    # by the caller (workflow_call -> build-native-lib-centos-7 skipped).
+    if: ${{ always() && needs.build-native-lib-centos-7.result != 'failure' && needs.build-native-lib-centos-7.result != 'cancelled' }}
+    runs-on: ubuntu-22.04
+    container: apache/gluten:centos-9-jdk17
+    steps:
+      - uses: actions/checkout@v4
+      - name: Download native artifacts
+        uses: actions/download-artifact@v4
+        with:
+          name: ${{ inputs.native_lib_artifact || format('delta-spark-ut-native-lib-centos-7-{0}', github.sha) }}
+          path: ./cpp/build/
+      - name: Cache Maven repository
+        uses: actions/cache@v4
+        with:
+          path: /root/.m2/repository
+          key: m2-delta-spark-ut-bundle-${{ env.GLUTEN_SPARK_PROFILE }}-${{ env.GLUTEN_SCALA_PROFILE }}-${{ hashFiles('pom.xml', '**/pom.xml') }}
+          restore-keys: |
+            m2-delta-spark-ut-bundle-${{ env.GLUTEN_SPARK_PROFILE }}-${{ env.GLUTEN_SCALA_PROFILE }}-
+            m2-delta-spark-ut-bundle-
+      - name: Download Arrow jars
+        uses: actions/download-artifact@v4
+        with:
+          name: ${{ inputs.arrow_jars_artifact || format('delta-spark-ut-arrow-jars-centos-7-{0}', github.sha) }}
+          path: /root/.m2/repository/org/apache/arrow/
+      - name: Build Gluten Velox + Delta bundle
+        run: |
+          set -euo pipefail
+          yum install -y java-17-openjdk-devel
+          export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+          export PATH=$JAVA_HOME/bin:$PATH
+          java -version
+          cd "$GITHUB_WORKSPACE"
+          # `install` (not `package`) so the gluten-delta artifact is in the local
+          # m2 repo before the `package/` shaded jar is built. `Dmaven.compiler.release=17`
+          # overrides any user settings.xml that may pin release=1.8 for Java 17 builds.
+          $MVN_CMD clean install \
+            -P${{ env.GLUTEN_SPARK_PROFILE }} \
+            -P${{ env.GLUTEN_SCALA_PROFILE }} \
+            -P${{ env.GLUTEN_JAVA_PROFILE }} \
+            -Pbackends-velox -Pdelta \
+            -DskipTests -Dmaven.compiler.release=17
+      - name: Stage bundle jar
+        run: |
+          set -euo pipefail
+          mkdir -p bundle-out
+          # Match the renamed fat jar produced by package/pom.xml's copy-fat-jar
+          # exec. The version part may bump (e.g. 1.7.0-SNAPSHOT -> 1.8.0-SNAPSHOT),
+          # so glob the version suffix. `2>/dev/null ... || true` keeps a no-match
+          # `ls` from aborting the step under `set -o pipefail`, so the explicit
+          # check below runs instead of dying with a generic "cannot access".
+          jar=$(ls package/target/gluten-velox-bundle-spark${{ env.GLUTEN_BUNDLE_SPARK_VERSION }}_${{ env.GLUTEN_BUNDLE_SCALA_VERSION }}-linux_amd64-*.jar 2>/dev/null | head -n 1 || true)
+          if [ -z "$jar" ] || [ ! -f "$jar" ]; then
+            echo "ERROR: Could not find Gluten bundle jar under package/target/" >&2
+            ls -la package/target/ || true
+            exit 1
+          fi
+          cp "$jar" bundle-out/
+          ls -lh bundle-out/
+      - uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-gluten-bundle-${{github.sha}}
+          path: bundle-out/gluten-velox-bundle-spark*_*-linux_amd64-*.jar
+          if-no-files-found: error
+
+  delta-spark-test:
+    needs: build-gluten-bundle
+    # build-gluten-bundle runs via `if: always()` (its build-native-lib-centos-7 need
+    # is skipped on workflow_call), so this job needs an explicit condition too --
+    # otherwise GitHub's transitive skip propagation, seeing the skipped
+    # build-native-lib-centos-7 ancestor, would skip the whole shard matrix.
+    if: ${{ !cancelled() && needs.build-gluten-bundle.result == 'success' }}
+    runs-on: ubuntu-22.04
+    container: apache/gluten:centos-9-jdk17
+    # 350-min safety cap. With 4 forks per shard the per-shard suites run
+    # 4-at-a-time, so a shard finishes well under this.
+    timeout-minutes: 350
+    strategy:
+      fail-fast: false
+      matrix:
+        # Length of this list MUST equal env.DELTA_NUM_SHARDS.
+        shard: [0, 1, 2, 3]
+    env:
+      # Mirror Delta's spark_test.yaml env vars used by run-tests.py /
+      # TestParallelization.scala.
+      SHARD_ID: ${{ matrix.shard }}
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Resolve workflow inputs
+        id: resolve
+        # Every input has a default (workflow_call + workflow_dispatch), so they
+        # are always set; just surface them as step outputs for the steps below.
+        env:
+          DELTA_REF: ${{ inputs.delta_ref }}
+          SPARK_VERSION: ${{ inputs.spark_version }}
+          TEST_PARALLELISM: ${{ inputs.test_parallelism }}
+          UPDATE_BASELINE: ${{ inputs.update_baseline }}
+          FAIL_ON_FIXED: ${{ inputs.fail_on_fixed }}
+        run: |
+          set -euo pipefail
+          {
+            echo "delta_ref=${DELTA_REF}"
+            echo "spark_version=${SPARK_VERSION}"
+            echo "test_parallelism=${TEST_PARALLELISM}"
+            echo "update_baseline=${UPDATE_BASELINE}"
+            echo "fail_on_fixed=${FAIL_ON_FIXED}"
+          } | tee -a "$GITHUB_OUTPUT"
+
+      - name: Download Gluten bundle jar
+        uses: actions/download-artifact@v4
+        with:
+          name: delta-spark-ut-gluten-bundle-${{github.sha}}
+          path: gluten-bundle
+
+      - name: Install minimal tools
+        run: |
+          set -euo pipefail
+          # apache/gluten:centos-9-jdk17 already has java-17, git, tar, a POSIX
+          # shell, and curl-minimal (which provides the `curl` command sbt's
+          # launcher needs). Install the rest of what Delta's build/sbt and the
+          # tests may need. We deliberately do NOT install the full `curl`
+          # package -- it conflicts with the pre-installed curl-minimal.
+          yum install -y java-17-openjdk-devel which findutils gzip python3
+          export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+          export PATH=$JAVA_HOME/bin:$PATH
+          java -version
+          git --version
+          curl --version | head -n 1
+
+      - name: Cache sbt / Ivy / Coursier
+        uses: actions/cache@v4
+        with:
+          path: |
+            /root/.sbt
+            /root/.ivy2
+            /root/.cache/coursier
+          # Intentionally NOT keyed by ${{ matrix.shard }} -- all shards share
+          # the same dependency tree, so a single shared cache (with parallel
+          # save races resolved by GH on a first-write-wins basis) gives the
+          # best storage / hit-rate tradeoff.
+          key: delta-spark-ut-sbt-${{ steps.resolve.outputs.delta_ref }}-${{ steps.resolve.outputs.spark_version }}-${{ env.DELTA_SCALA_VERSION }}
+          restore-keys: |
+            delta-spark-ut-sbt-${{ steps.resolve.outputs.delta_ref }}-${{ steps.resolve.outputs.spark_version }}-
+            delta-spark-ut-sbt-${{ steps.resolve.outputs.delta_ref }}-
+
+      - name: Clone and patch Delta
+        run: |
+          set -euo pipefail
+          # `2>/dev/null ... || true` keeps a no-match `ls` from aborting the step
+          # under `set -o pipefail`, so the explicit check below emits a clear
+          # error instead of a generic "cannot access".
+          GLUTEN_JAR=$(ls "$GITHUB_WORKSPACE"/gluten-bundle/gluten-velox-bundle-spark*_*-linux_amd64-*.jar 2>/dev/null | head -n 1 || true)
+          if [ -z "$GLUTEN_JAR" ] || [ ! -f "$GLUTEN_JAR" ]; then
+            echo "ERROR: No Gluten bundle jar found under $GITHUB_WORKSPACE/gluten-bundle/" >&2
+            ls -la "$GITHUB_WORKSPACE/gluten-bundle/" || true
+            exit 1
+          fi
+          echo "Using Gluten bundle: $GLUTEN_JAR"
+          bash "$GITHUB_WORKSPACE/.github/workflows/util/delta-spark-ut/setup-delta.sh" \
+            "${{ steps.resolve.outputs.delta_ref }}" \
+            "$GITHUB_WORKSPACE/delta" \
+            "$GLUTEN_JAR" \
+            "$GITHUB_WORKSPACE"
+
+      - name: Run Delta spark module tests (shard ${{ matrix.shard }} / ${{ env.DELTA_NUM_SHARDS }})
+        env:
+          NUM_SHARDS: ${{ env.DELTA_NUM_SHARDS }}
+          TEST_PARALLELISM_COUNT: ${{ steps.resolve.outputs.test_parallelism }}
+          # Required by Delta to enable testing-only code paths
+          # (see delta build.sbt: "Test / envVars += DELTA_TESTING -> 1").
+          DELTA_TESTING: '1'
+          # JDK 17 + Gluten/Arrow/Netty requires extra --add-opens and the
+          # `io.netty.tryReflectionSetAccessible` system property; otherwise
+          # the forked test JVM fails with
+          #   java.lang.UnsupportedOperationException: sun.misc.Unsafe or
+          #   java.nio.DirectByteBuffer.<init>(long, int) not available
+          # as soon as Gluten's bundled Arrow allocator initializes Netty
+          # direct buffers. Delta's own `Test / javaOptions` (see
+          # project/CrossSparkVersions.scala `java17TestSettings`) sets the
+          # base add-opens but NOT the Netty property -- Delta's own tests
+          # don't load Arrow/Netty buffers in a way that triggers it.
+          #
+          # Use JAVA_TOOL_OPTIONS so the flags propagate to BOTH the sbt
+          # launcher JVM and the forked test JVM (sbt forks tests and the
+          # child inherits the parent's env). The set below mirrors
+          # `extraJavaTestArgs` from Gluten's own root pom.xml (the
+          # canonical Gluten test JVM flag set).
+          #
+          # NOTE: we deliberately do NOT put `-Xmx` here. JAVA_TOOL_OPTIONS
+          # is processed BEFORE the JVM command line, so Delta's explicit
+          # `-Xmx1024m` (set in build.sbt `Test / javaOptions`) would still
+          # win (last `-Xmx` wins). The forked-test-JVM heap is bumped via
+          # an sbt `set spark / Test / javaOptions ++= ...` command below,
+          # which APPENDS to Delta's own seq -- so our `-Xmx` lands AFTER
+          # `-Xmx1024m` and wins.
+          JAVA_TOOL_OPTIONS: >-
+            -XX:+IgnoreUnrecognizedVMOptions
+            --add-opens=java.base/java.lang=ALL-UNNAMED
+            --add-opens=java.base/java.lang.invoke=ALL-UNNAMED
+            --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
+            --add-opens=java.base/java.io=ALL-UNNAMED
+            --add-opens=java.base/java.net=ALL-UNNAMED
+            --add-opens=java.base/java.nio=ALL-UNNAMED
+            --add-opens=java.base/java.util=ALL-UNNAMED
+            --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
+            --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
+            --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
+            --add-opens=java.base/sun.nio.cs=ALL-UNNAMED
+            --add-opens=java.base/sun.security.action=ALL-UNNAMED
+            --add-opens=java.base/sun.util.calendar=ALL-UNNAMED
+            -Djdk.reflect.useDirectMethodHandle=false
+            -Dio.netty.tryReflectionSetAccessible=true
+            -Dfile.encoding=UTF-8
+        run: |
+          set -euo pipefail
+          export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+          export PATH=$JAVA_HOME/bin:$PATH
+          cd "$GITHUB_WORKSPACE/delta"
+          chmod +x build/sbt
+          # Only run the unified `spark` sbt project, NOT `sparkGroup/test` --
+          # `sparkGroup` aggregates many other projects (sparkV2, contribs,
+          # sharing, connect*, ...) that are out of scope for this pipeline.
+          #
+          # JVM heap layout -- two memory consumers on the ~16G runner:
+          # * sbt launcher JVM: -J-Xmx4G for the test compile, then forced to
+          #   return idle memory during the (long) test phase via G1 periodic GC
+          #   (G1PeriodicGCInterval=10s; G1PeriodicGCSystemLoadThreshold=0 so the
+          #   busy fork doesn't suppress it; -XX:-G1PeriodicGCInvokesConcurrent
+          #   forces each periodic GC to a full STW collection) that uncommits to a
+          #   tight free ratio (Min/MaxHeapFreeRatio 5/15, JEP 346) above a low
+          #   -Xms512m floor.
+          #   Without this the idle launcher holds ~5.3G for the whole run; with
+          #   it, it drops back to ~1-2G. These flags touch no Gluten/Spark runtime
+          #   config, so they cannot affect the measured pass/fail signal.
+          # * Forked test JVM: -Xmx2G via the `set ... Test / javaOptions` command
+          #   below. Delta caps its fork at -Xmx1024m in build.sbt; `++=` appends
+          #   so our -Xmx2G comes last and wins. Gluten offloads data to Velox
+          #   off-heap (capped at 2g via spark.memory.offHeap.size in the patched
+          #   DeltaSQLCommandTest), so the fork's heap need is modest. A larger
+          #   fork heap pushed the cgroup peak past the ~16G OOM threshold and the
+          #   kernel OOM-killed the fork mid-shard (no hs_err), wedging sbt -- 2G
+          #   keeps headroom. Keep heap-dump-on-OOM so a real >2G heap OOM is
+          #   analyzable.
+          # `-u target/test-reports` enables ScalaTest's JUnit XML reporter so
+          # every suite writes per-test results. Delta itself only configures
+          # the console reporter (-oDF), so without this we'd have no machine-
+          # readable results to gate on. The path is relative to the forked
+          # test JVM's working dir (Test / baseDirectory = spark/), i.e.
+          # delta/spark/target/test-reports/TEST-*.xml.
+          #
+          # We deliberately do NOT let an sbt non-zero exit (which fires on the
+          # MANY expected Delta-on-Gluten failures) fail this step directly.
+          # Instead the known-failures gate below decides pass/fail: the build
+          # is green when the only failures are ones already recorded in the
+          # baseline, and red on a genuine regression.
+          set +e
+          # --- hang watchdog ---------------------------------------------------
+          # Shard 2 (and occasionally others) hangs indefinitely after a suite's
+          # last test with no further output. ScalaTest's failAfter only wraps
+          # individual test BODIES, so a wedge in suite teardown/afterAll -- or in
+          # a non-interruptible native Velox/JNI call that ignores
+          # Thread.interrupt() -- has no timeout and stalls until the 350-min job
+          # limit with zero diagnostics. This watchdog dumps the forked test JVM's
+          # threads (to the job log, and to a file for the artifact) once the test
+          # output has been silent for too long, so the deadlock is diagnosable.
+          SBT_LOG="/tmp/sbt-spark-test-shard-${{ matrix.shard }}.log"
+          : > "$SBT_LOG"
+          rm -f /tmp/sbt-done
+          (
+            # CRITICAL: the step shell runs with `bash -eo pipefail`, which the
+            # subshell inherits. Without `set +e` here, ANY non-zero command --
+            # e.g. fork detection finding no match, or `kill`/`jps` returning
+            # non-zero -- silently kills this watchdog. That errexit kill (plus a
+            # /proc detection miss) once made the watchdog capture ZERO dumps. A
+            # diagnostic must never abort on a failed probe.
+            set +e +o pipefail
+            JSTACK="${JAVA_HOME}/bin/jstack"
+            JPS="${JAVA_HOME}/bin/jps"
+            silent_limit=900   # 15 min with no new test output => treat as hung
+            dumps=0
+            fork_pids() {
+              # The sbt test fork's main class is sbt.ForkMain. Prefer jps (reads
+              # the main class from hsperfdata, robust to sbt's @argfile launch);
+              # fall back to scanning /proc cmdline + @argfile.
+              "$JPS" -l 2>/dev/null | awk '/sbt\.ForkMain/ {print $1}'
+              local p cl arg
+              for p in /proc/[0-9]*; do
+                [ "$(cat "$p/comm" 2>/dev/null)" = "java" ] || continue
+                cl="$(tr '\0' ' ' < "$p/cmdline" 2>/dev/null)"
+                case "$cl" in *sbt.ForkMain*) echo "${p##*/}"; continue ;; esac
+                arg="$(printf '%s' "$cl" | tr ' ' '\n' | sed -n 's/^@//p' | head -1)"
+                [ -n "$arg" ] && [ -f "$arg" ] && grep -qa 'sbt\.ForkMain' "$arg" 2>/dev/null \
+                  && echo "${p##*/}"
+              done
+            }
+            all_java_pids() {
+              "$JPS" -q 2>/dev/null
+              local p
+              for p in /proc/[0-9]*; do
+                [ "$(cat "$p/comm" 2>/dev/null)" = "java" ] && echo "${p##*/}"
+              done
+            }
+            echo "HANG WATCHDOG armed: dumps the test JVM after ${silent_limit}s of output silence"
+            hb=0
+            while [ ! -f /tmp/sbt-done ]; do
+              sleep 60
+              [ -f "$SBT_LOG" ] || continue
+              now=$(date +%s)
+              mtime=$(stat -c %Y "$SBT_LOG" 2>/dev/null || echo "$now")
+              silent=$(( now - mtime ))
+              # Per-minute memory profile: heap tuning proved the ~16G OOM peak is
+              # NATIVE-driven, so log which JVM (sbt launcher vs fork) actually grows
+              # toward it -- the last lines before a hang reveal the real hog to cut.
+              # Read /proc directly (no `ps` dependency in the minimal container).
+              memnow=$(awk '{printf "%.2fG",$1/1073741824}' /sys/fs/cgroup/memory.current 2>/dev/null)
+              jvmrss=""
+              for mp in $(all_java_pids 2>/dev/null | sort -un); do
+                r=$(awk '/^VmRSS:/{print $2}' "/proc/$mp/status" 2>/dev/null)
+                [ -n "$r" ] && jvmrss="$jvmrss $(( r / 1024 ))M(p$mp)"
+              done
+              echo "MEM cgroup=${memnow} JVMs=[${jvmrss# }]"
+              hb=$(( hb + 1 ))
+              # Heartbeat every ~5 min so we can SEE the watchdog is alive (and how
+              # long the test has been silent) without waiting for a hang.
+              [ $(( hb % 5 )) -eq 0 ] && echo "HANG WATCHDOG: alive; last test output ${silent}s ago"
+              if [ "$silent" -ge "$silent_limit" ] && [ "$dumps" -lt 3 ]; then
+                dumps=$(( dumps + 1 ))
+                pids="$(fork_pids | sort -un)"
+                # Safety net: if the fork JVM cannot be pinpointed, dump EVERY JVM.
+                [ -n "$pids" ] || pids="$(all_java_pids | sort -un)"
+                echo "::group::HANG WATCHDOG: test output silent ${silent}s -- thread dump #${dumps} (pids:$(printf ' %s' $pids))"
+                [ -n "$pids" ] || echo "HANG WATCHDOG: no java process found to dump"
+                for pid in $pids; do
+                  # SIGQUIT makes the JVM print a full thread dump to its OWN stderr,
+                  # which sbt relays into the test log via the SAME stream as test
+                  # output -- so it lands in the job log even when a separately
+                  # spawned jstack child's output would be buffered/lost. Also write
+                  # jstack to a file for the per-shard artifact.
+                  echo "----- SIGQUIT + jstack pid ${pid} -----"
+                  kill -QUIT "$pid" 2>/dev/null || echo "HANG WATCHDOG: kill -QUIT failed for pid ${pid}"
+                  timeout 120 "$JSTACK" -l "$pid" > "/tmp/threaddump-shard-${{ matrix.shard }}-${dumps}-${pid}.txt" 2>&1 \
+                    || echo "HANG WATCHDOG: jstack failed/timed out for pid ${pid}"
+                done
+                echo "::endgroup::"
+                # The dump is now captured (job log via SIGQUIT + artifact via
+                # jstack file). A hung JVM otherwise stalls the whole shard until
+                # the 350-min job timeout AND keeps the job log frozen so the dump
+                # never becomes reachable. So KILL the wedged JVM(s): the suite
+                # fails fast (acceptable -- errors are expected; only an
+                # unrecoverable hang blocks CI), the job proceeds/ends, and the log
+                # + artifacts flush. Give SIGQUIT a moment to print first.
+                sleep 20
+                echo "HANG WATCHDOG: killing wedged JVM(s) to unblock the shard: $(printf '%s ' $pids)"
+                for pid in $pids; do kill -KILL "$pid" 2>/dev/null; done
+              fi
+            done
+          ) &
+          WATCHDOG_PID=$!
+
+          ./build/sbt \
+            -DsparkVersion=${{ steps.resolve.outputs.spark_version }} \
+            -v \
+            -J-XX:+UseG1GC -J-Xms512m -J-Xmx4G \
+            -J-XX:G1PeriodicGCInterval=10000 \
+            -J-XX:G1PeriodicGCSystemLoadThreshold=0 \
+            -J-XX:-G1PeriodicGCInvokesConcurrent \
+            -J-XX:MinHeapFreeRatio=5 -J-XX:MaxHeapFreeRatio=15 \
+            "++ ${DELTA_SCALA_VERSION}" \
+            'set spark / Test / javaOptions ++= Seq("-Xmx2G", "-XX:+HeapDumpOnOutOfMemoryError", "-XX:HeapDumpPath=/tmp/")' \
+            'set spark / Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-u", "target/test-reports")' \
+            "spark/test" 2>&1 | tee "$SBT_LOG"
+          SBT_EXIT=${PIPESTATUS[0]}
+          touch /tmp/sbt-done
+          kill "$WATCHDOG_PID" 2>/dev/null || true
+          set -e
+          echo "sbt spark/test exited with ${SBT_EXIT}"
+
+          # Memory forensics: a sudden forked-JVM death with no hs_err and no heap
+          # dump is almost always a kernel/cgroup OOM-kill (Velox off-heap + JVM
+          # heap exceeding the ~16G runner). Surface the cgroup peak + oom_kill
+          # count so we can confirm/measure it (cgroup v2 paths; best-effort).
+          ( echo "=== cgroup memory forensics (exit ${SBT_EXIT}) ==="
+            for f in /sys/fs/cgroup/memory.peak /sys/fs/cgroup/memory.max \
+                     /sys/fs/cgroup/memory.current /sys/fs/cgroup/memory.events; do
+              [ -r "$f" ] && { echo "--- $f ---"; cat "$f"; }
+            done ) || true
+
+          # A compile/launch failure leaves no reports at all. In that case the
+          # gate would see zero failures and pass spuriously, so fail loudly.
+          REPORT_COUNT=$(find . -path '*/target/test-reports/*.xml' 2>/dev/null | wc -l || true)
+          echo "Found ${REPORT_COUNT} JUnit XML report file(s)."
+          if [ "${REPORT_COUNT}" -eq 0 ]; then
+            echo "::error::sbt produced no test reports (exit ${SBT_EXIT}) -- likely a compile or launch failure, not test failures."
+            exit 1
+          fi
+
+          # update_baseline=true -> SEED mode (record failures, never fail) so the
+          # baseline can be (re)generated. Otherwise ENFORCE against the baseline.
+          GATE_MODE=enforce
+          if [ "${{ steps.resolve.outputs.update_baseline }}" = "true" ]; then
+            GATE_MODE=seed
+          fi
+          mkdir -p "$GITHUB_WORKSPACE/gate-out"
+          python3 "$GITHUB_WORKSPACE/.github/workflows/util/delta-spark-ut/compare-test-results.py" \
+            --mode "${GATE_MODE}" \
+            --reports-dir "$GITHUB_WORKSPACE/delta" \
+            --known-failures "$GITHUB_WORKSPACE/.github/workflows/util/delta-spark-ut/known-failures.txt" \
+            --failures-out "$GITHUB_WORKSPACE/gate-out/failures-shard-${{ matrix.shard }}.txt" \
+            --ran-out "$GITHUB_WORKSPACE/gate-out/ran-shard-${{ matrix.shard }}.txt" \
+            --fail-on-fixed "${{ steps.resolve.outputs.fail_on_fixed }}"
+
+      - name: Compress heap dumps (if any)
+        if: ${{ failure() }}
+        run: |
+          set -euo pipefail
+          if compgen -G "/tmp/*.hprof" > /dev/null; then
+            echo "Found heap dump(s); compressing..."
+            ls -lh /tmp/*.hprof
+            # gzip is single-threaded and slow on multi-GB heaps but is
+            # always present in the centos image. Heap dumps compress ~10x.
+            gzip -1 /tmp/*.hprof
+            ls -lh /tmp/*.hprof.gz
+          else
+            echo "No heap dumps found in /tmp/."
+          fi
+
+      - name: Upload per-shard gate lists
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-gate-lists-shard-${{ matrix.shard }}
+          path: gate-out/*.txt
+          if-no-files-found: warn
+
+      - name: Upload test reports
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-reports-shard-${{ matrix.shard }}
+          path: |
+            delta/**/target/test-reports/**/*.xml
+            delta/**/target/surefire-reports/**/*.xml
+          if-no-files-found: warn
+
+      - name: Upload hang watchdog thread dumps
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-threaddumps-shard-${{ matrix.shard }}
+          path: /tmp/threaddump-shard-${{ matrix.shard }}-*.txt
+          if-no-files-found: ignore
+
+      - name: Upload JVM crash logs and other failure artifacts
+        if: ${{ failure() }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-failure-logs-shard-${{ matrix.shard }}
+          path: |
+            delta/**/target/*.log
+            delta/**/hs_err_pid*.log
+            delta/**/core.*
+            /tmp/*.hprof
+            /tmp/*.hprof.gz
+          if-no-files-found: ignore
+
+  # Merges every shard's failure/ran lists into a single, sorted, ready-to-commit
+  # known-failures.txt and reports global regressions / now-passing / stale
+  # entries. Runs even when some shards went red (if: always()) so the refreshed
+  # baseline artifact is always available -- this is what you download and commit
+  # to bootstrap or refresh the baseline (see util/delta-spark-ut/README.md).
+  delta-spark-aggregate:
+    needs: delta-spark-test
+    if: always()
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v4
+      - name: Download per-shard gate lists
+        uses: actions/download-artifact@v4
+        continue-on-error: true
+        with:
+          pattern: delta-spark-ut-gate-lists-shard-*
+          path: gate-lists
+          merge-multiple: true

Review Comment:
   The aggregate job downloads gate-list artifacts with `continue-on-error: 
true`, and the aggregator script only hard-fails when *no* inputs are found. If 
some shard artifacts are missing (or only partially downloaded), the job can 
still produce an "aggregated/known-failures.txt" that looks ready-to-commit but 
is incomplete, which can accidentally shrink the baseline and create noisy 
follow-up CI failures. Consider validating that gate lists were collected for 
every shard (e.g., pass an expected shard count to `compare-test-results.py 
--mode aggregate` and fail when fewer/mismatched shard inputs are present).
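
   A minimal sketch of the shard-coverage check suggested above, not the PR's actual aggregator code. It assumes the downloaded lists keep the `failures-shard-N.txt` / `ran-shard-N.txt` names written by the per-shard step; the function name `missing_shard_inputs` and the idea of passing in the expected shard ids are hypothetical.

```python
import re
from pathlib import Path

# Matches the per-shard list names written by the gate step
# (failures-shard-<N>.txt and ran-shard-<N>.txt).
SHARD_FILE = re.compile(r"^(failures|ran)-shard-(\d+)\.txt$")


def missing_shard_inputs(gate_dir, expected_shards):
    """Return {shard_id: [missing list kinds]} for every expected shard
    that lacks its failures list, its ran list, or both.

    Hypothetical helper: an empty dict means every shard contributed
    both lists, so the merged baseline is not silently incomplete.
    """
    seen = {}
    for path in Path(gate_dir).glob("*.txt"):
        match = SHARD_FILE.match(path.name)
        if match:
            seen.setdefault(int(match.group(2)), set()).add(match.group(1))

    missing = {}
    for shard in expected_shards:
        absent = sorted({"failures", "ran"} - seen.get(shard, set()))
        if absent:
            missing[shard] = absent
    return missing
```

   In aggregate mode the script could call this on the download directory and exit non-zero when the result is non-empty, instead of emitting a baseline that only looks complete.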



##########
.github/workflows/velox_backend_x86.yml:
##########
@@ -19,6 +19,11 @@ on:
   pull_request:
     paths:
       - '.github/workflows/velox_backend_x86.yml'
+      # Delta Spark UT runs here too (reusable delta_spark_ut.yml). These extra
+      # paths make Delta-CI-only changes trigger this workflow; Delta also 
runs on
+      # the velox paths below since core/velox changes can affect Delta 
offload.
+      - '.github/workflows/delta_spark_ut.yml'
+      - '.github/workflows/util/delta-spark-ut/**'

Review Comment:
   Adding Delta-CI-only paths under `on.pull_request.paths` makes *the entire* 
`Velox Backend (x86)` workflow run when only the Delta UT workflow/baseline 
changes. Since this workflow contains additional expensive jobs (e.g., the 
Spark/TPC matrices), a baseline-only edit to `known-failures.txt` will now 
trigger the full suite rather than just Delta UT. If the intent is to validate 
only the Delta pipeline on Delta-CI file changes, consider adding a 
path-filtering pre-job and gating non-Delta jobs with an `if:` condition, or 
introducing a dedicated lightweight caller workflow that only invokes 
`delta_spark_ut.yml`.
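
   As a rough illustration of the path-filtering idea, the decision a pre-job would make can be sketched as below. The two path patterns come from the diff above; the function names, and the assumption that the pre-job receives a plain list of changed file paths, are hypothetical.

```python
# Paths added to on.pull_request.paths for Delta-CI-only changes.
DELTA_CI_FILE = ".github/workflows/delta_spark_ut.yml"
DELTA_CI_DIR = ".github/workflows/util/delta-spark-ut/"


def is_delta_ci_path(path):
    """True if the path is the reusable Delta workflow or lives under its util dir."""
    return path == DELTA_CI_FILE or path.startswith(DELTA_CI_DIR)


def delta_only_change(changed_files):
    """True when at least one file changed and every changed file is Delta-CI-only.

    Hypothetical gate: non-Delta jobs would be skipped when this returns True.
    An empty change list returns False so nothing is skipped by accident.
    """
    files = [f for f in changed_files if f]
    return bool(files) and all(is_delta_ci_path(f) for f in files)
```

   A pre-job could expose this result as an output and the expensive jobs could test it in their `if:` condition, so a baseline-only edit to `known-failures.txt` runs just the Delta UT job.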



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

