Copilot commented on code in PR #12388:
URL: https://github.com/apache/gluten/pull/12388#discussion_r3522306262

##########
.github/workflows/delta_spark_ut.yml:
##########
@@ -0,0 +1,685 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Runs Delta Lake's `spark` sbt module unit tests against a Gluten Velox bundle
+# that is built from the source in this repository. The pipeline:
+#
+# 1. Builds the Velox/Gluten native libraries (centos-7 + vcpkg, x86_64).
+# 2. Builds the Gluten Java/Scala jars and assembles the
+# `gluten-velox-bundle-spark<spark>_<scala>-linux_amd64-<version>.jar`
+# fat jar for Spark 4.1 + Scala 2.13 + Java 17 with the Delta profile.
+# 3. Clones delta-io/delta at the requested release tag (default `v4.2.0`),
+# drops the bundle jar into `spark-unified/lib/` only (NOT `spark/lib/`
+# -- see setup-delta.sh for the unmanagedJars scoping rationale),
+# patches Delta's `DeltaSQLCommandTest` to register the Gluten plugin,
+# and runs `sbt spark/test` sharded across the matrix.
+#
+# Limited to Velox + x86 to keep the matrix simple, per the pipeline's purpose
+# of validating Gluten changes against the latest Delta release.
+
+name: Delta Spark UT (Gluten)
+
+on:
+  # Reusable workflow. velox_backend_x86.yml calls this (gated on Delta-relevant
+  # changes) and passes the native-lib + arrow-jars artifacts it already built,
+  # so the expensive native C++ build is NOT duplicated. Those artifacts live in
+  # the CALLER's run (a called workflow runs as part of the caller run), so the
+  # jobs below download them by name. See velox_backend_x86.yml `delta-spark-ut`.
+  #
+  # NOTE: the `pull_request` trigger was removed so this no longer runs as its own
+  # workflow on PRs (which would double-run the Delta suite). velox_backend_x86.yml
+  # is now the single PR entry point; `workflow_dispatch` keeps manual standalone
+  # runs working (those build the native lib themselves -- see build-native-lib).
+  workflow_call:
+    inputs:
+      native_lib_artifact:
+        description: 'Name of the cpp/build artifact uploaded by the caller'
+        type: string
+        required: true
+      arrow_jars_artifact:
+        description: 'Name of the org.apache.arrow jars artifact uploaded by the caller'
+        type: string
+        required: true
+      delta_ref:
+        type: string
+        required: false
+        default: 'v4.2.0'
+      spark_version:
+        type: string
+        required: false
+        default: '4.1'
+      test_parallelism:
+        type: string
+        required: false
+        default: '4'
+      update_baseline:
+        type: boolean
+        required: false
+        default: false
+      fail_on_fixed:
+        type: boolean
+        required: false
+        default: true
+  workflow_dispatch:
+    inputs:
+      delta_ref:
+        description: 'delta-io/delta git ref (tag/branch/SHA) to test against'
+        required: true
+        default: 'v4.2.0'
+      spark_version:
+        description: 'Delta `-DsparkVersion` value (must match the Gluten -P profile below)'
+        required: true
+        default: '4.1'
+      test_parallelism:
+        description: 'Forked test JVMs per shard (TEST_PARALLELISM_COUNT)'
+        required: true
+        default: '4'
+      update_baseline:
+        description: 'Seed/refresh the known-failures baseline instead of enforcing it'
+        type: boolean
+        required: false
+        default: false
+      fail_on_fixed:
+        description: 'Fail when a baseline test now passes (keeps the baseline honest)'
+        type: boolean
+        required: false
+        default: true
+
+env:
+  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
+  MVN_CMD: 'build/mvn -ntp'
+  CCACHE_DIR: "${{ github.workspace }}/.ccache"
+  # Gluten profile / bundle naming for the build-gluten-bundle and
+  # delta-spark-test jobs. Spark 4.1 + Scala 2.13 + JDK 17 matches Delta v4.2.0's
+  # default Spark version (4.1.0) from project/CrossSparkVersions.scala.
+  GLUTEN_SPARK_PROFILE: 'spark-4.1'
+  GLUTEN_SCALA_PROFILE: 'scala-2.13'
+  GLUTEN_JAVA_PROFILE: 'java-17'
+  GLUTEN_BUNDLE_SPARK_VERSION: '4.1'
+  GLUTEN_BUNDLE_SCALA_VERSION: '2.13'
+  DELTA_SCALA_VERSION: '2.13.16'
+  # Number of shards in the delta-spark-test matrix. Must equal the length of
+  # the `shard` matrix below.
+  #
+  # 4 shards x TEST_PARALLELISM_COUNT=4 gives ~16-way parallelism packed into 4
+  # runner jobs (4 forks each) rather than 16 single-fork jobs -- fewer concurrent
+  # runners for the same throughput. Sharding is by SUITE; total work
+  # (~1250 shard-minutes) is fixed. Each forked test JVM uses ~4G (2G heap + 2G
+  # off-heap), so 4 forks plus the sbt launcher sit close to the ~16G runner limit;
+  # this fits because the worst memory hog (DeletionVectorsSuite 2B-row) is
+  # force-failed in setup-delta.sh.
+  DELTA_NUM_SHARDS: '4'
+
+# No `concurrency:` here on purpose. As a reusable workflow this runs inside the
+# caller's run, where `github.workflow` resolves to the CALLER's name -- a group
+# keyed on it would collide with the caller's own group and, with
+# cancel-in-progress, could cancel the parent run. The caller's concurrency
+# already governs cancellation. (A standalone workflow_dispatch run just won't
+# auto-cancel, which is fine for infrequent manual runs.)
+
+jobs:
+  build-native-lib-centos-7:
+    # Standalone (workflow_dispatch) only. When called by velox_backend_x86.yml
+    # the caller already built the native lib + arrow jars and passes them as
+    # inputs, so this job is skipped and the duplicate native build is avoided.
+    if: github.event_name == 'workflow_dispatch'
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v4
+      - name: Get Ccache
+        uses: actions/cache/restore@v4
+        with:
+          path: '${{ env.CCACHE_DIR }}'
+          key: ccache-delta-spark-ut-centos7-release-default-${{github.sha}}
+          restore-keys: |
+            ccache-delta-spark-ut-centos7-release-default
+            ccache-centos7-release-default
+      - name: Build Gluten native libraries
+        run: |
+          docker run -v $GITHUB_WORKSPACE:/work -w /work apache/gluten:vcpkg-centos-7-gcc13 bash -c "
+            set -e
+            yum install tzdata -y
+            df -a
+            cd /work
+            export CCACHE_DIR=/work/.ccache
+            export CCACHE_MAXSIZE=1G
+            mkdir -p /work/.ccache
+            ccache -sz
+            bash dev/ci-velox-buildstatic-centos-7.sh
+            ccache -s
+            mkdir -p /work/.m2/repository/org/apache/arrow/
+            cp -r /root/.m2/repository/org/apache/arrow/* /work/.m2/repository/org/apache/arrow/
+          "
+      - name: Save Ccache
+        if: always()
+        uses: actions/cache/save@v4
+        with:
+          path: '${{ env.CCACHE_DIR }}'
+          key: ccache-delta-spark-ut-centos7-release-default-${{github.sha}}
+      - uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-native-lib-centos-7-${{github.sha}}
+          path: ./cpp/build/
+          if-no-files-found: error
+      - uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-arrow-jars-centos-7-${{github.sha}}
+          path: .m2/repository/org/apache/arrow/
+          if-no-files-found: error
+
+  build-gluten-bundle:
+    needs: build-native-lib-centos-7
+    # Run whether the native lib was built here (dispatch -> success) or provided
+    # by the caller (workflow_call -> build-native-lib-centos-7 skipped).
+    if: ${{ always() && needs.build-native-lib-centos-7.result != 'failure' && needs.build-native-lib-centos-7.result != 'cancelled' }}
+    runs-on: ubuntu-22.04
+    container: apache/gluten:centos-9-jdk17
+    steps:
+      - uses: actions/checkout@v4
+      - name: Download native artifacts
+        uses: actions/download-artifact@v4
+        with:
+          name: ${{ inputs.native_lib_artifact || format('delta-spark-ut-native-lib-centos-7-{0}', github.sha) }}
+          path: ./cpp/build/
+      - name: Cache Maven repository
+        uses: actions/cache@v4
+        with:
+          path: /root/.m2/repository
+          key: m2-delta-spark-ut-bundle-${{ env.GLUTEN_SPARK_PROFILE }}-${{ env.GLUTEN_SCALA_PROFILE }}-${{ hashFiles('pom.xml', '**/pom.xml') }}
+          restore-keys: |
+            m2-delta-spark-ut-bundle-${{ env.GLUTEN_SPARK_PROFILE }}-${{ env.GLUTEN_SCALA_PROFILE }}-
+            m2-delta-spark-ut-bundle-
+      - name: Download Arrow jars
+        uses: actions/download-artifact@v4
+        with:
+          name: ${{ inputs.arrow_jars_artifact || format('delta-spark-ut-arrow-jars-centos-7-{0}', github.sha) }}
+          path: /root/.m2/repository/org/apache/arrow/
+      - name: Build Gluten Velox + Delta bundle
+        run: |
+          set -euo pipefail
+          yum install -y java-17-openjdk-devel
+          export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
+          export PATH=$JAVA_HOME/bin:$PATH
+          java -version
+          cd "$GITHUB_WORKSPACE"
+          # `install` (not `package`) so the gluten-delta artifact is in the local
+          # m2 repo before the `package/` shaded jar is built. `Dmaven.compiler.release=17`
+          # overrides any user settings.xml that may pin release=1.8 for Java 17 builds.
+          $MVN_CMD clean install \
+            -P${{ env.GLUTEN_SPARK_PROFILE }} \
+            -P${{ env.GLUTEN_SCALA_PROFILE }} \
+            -P${{ env.GLUTEN_JAVA_PROFILE }} \
+            -Pbackends-velox -Pdelta \
+            -DskipTests -Dmaven.compiler.release=17
+      - name: Stage bundle jar
+        run: |
+          set -euo pipefail
+          mkdir -p bundle-out
+          # Match the renamed fat jar produced by package/pom.xml's copy-fat-jar
+          # exec. The version part may bump (e.g. 1.7.0-SNAPSHOT -> 1.8.0-SNAPSHOT),
+          # so glob the version suffix. `2>/dev/null ... || true` keeps a no-match
+          # `ls` from aborting the step under `set -o pipefail`, so the explicit
+          # check below runs instead of dying with a generic "cannot access".
+          jar=$(ls package/target/gluten-velox-bundle-spark${{ env.GLUTEN_BUNDLE_SPARK_VERSION }}_${{ env.GLUTEN_BUNDLE_SCALA_VERSION }}-linux_amd64-*.jar 2>/dev/null | head -n 1 || true)
+          if [ -z "$jar" ] || [ ! -f "$jar" ]; then
+            echo "ERROR: Could not find Gluten bundle jar under package/target/" >&2
+            ls -la package/target/ || true
+            exit 1
+          fi
+          cp "$jar" bundle-out/
+          ls -lh bundle-out/
+      - uses: actions/upload-artifact@v4
+        with:
+          name: delta-spark-ut-gluten-bundle-${{github.sha}}
+          path: bundle-out/gluten-velox-bundle-spark*_*-linux_amd64-*.jar
+          if-no-files-found: error
+
+  delta-spark-test:
+    needs: build-gluten-bundle
+    # build-gluten-bundle runs via `if: always()` (its build-native-lib-centos-7 need
+    # is skipped on workflow_call), so this job needs an explicit condition too --
+    # otherwise GitHub's transitive skip propagation, seeing the skipped
+    # build-native-lib-centos-7 ancestor, would skip the whole shard matrix.
+    if: ${{ !cancelled() && needs.build-gluten-bundle.result == 'success' }}
+    runs-on: ubuntu-22.04
+    container: apache/gluten:centos-9-jdk17
+    # 350-min safety cap. With 4 forks per shard the per-shard suites run
+    # 4-at-a-time, so a shard finishes well under this.
+    timeout-minutes: 350
+    strategy:
+      fail-fast: false
+      matrix:
+        # Length of this list MUST equal env.DELTA_NUM_SHARDS.
+        shard: [0, 1, 2, 3]
+    env:
+      # Mirror Delta's spark_test.yaml env vars used by run-tests.py /
+      # TestParallelization.scala.
+      SHARD_ID: ${{ matrix.shard }}
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Resolve workflow inputs
+        id: resolve
+        # Every input has a default (workflow_call + workflow_dispatch), so they
+        # are always set; just surface them as step outputs for the steps below.
+        env:
+          DELTA_REF: ${{ inputs.delta_ref }}
+          SPARK_VERSION: ${{ inputs.spark_version }}
+          TEST_PARALLELISM: ${{ inputs.test_parallelism }}
+          UPDATE_BASELINE: ${{ inputs.update_baseline }}
+          FAIL_ON_FIXED: ${{ inputs.fail_on_fixed }}
+        run: |
+          set -euo pipefail
+          {
+            echo "delta_ref=${DELTA_REF}"
+            echo "spark_version=${SPARK_VERSION}"
+            echo "test_parallelism=${TEST_PARALLELISM}"
+            echo "update_baseline=${UPDATE_BASELINE}"
+            echo "fail_on_fixed=${FAIL_ON_FIXED}"
+          } | tee -a "$GITHUB_OUTPUT"

Review Comment:
`spark_version` is user-configurable (workflow_dispatch / workflow_call), but the workflow always builds the Gluten bundle with the hard-coded Spark 4.1 profile (`GLUTEN_SPARK_PROFILE`/`GLUTEN_BUNDLE_SPARK_VERSION`). A manual run with `spark_version != 4.1` will run Delta tests for that Spark version against a Spark-4.1-compiled bundle, which is very likely to fail in confusing ways. Add an explicit validation here (or remove the input and hard-code Spark 4.1) so misconfigured runs fail fast with a clear message.
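
One way the validation suggested in the comment could look is sketched below. This is not code from the PR. `check_spark_version` is a hypothetical helper, and the sketch assumes the step exposes `SPARK_VERSION` (as in the `env:` block above) and can read the workflow-level `GLUTEN_BUNDLE_SPARK_VERSION`. The `:-4.1` fallbacks are there only so the snippet also runs outside a workflow.

```shell
#!/usr/bin/env bash
# Sketch only: fail fast when the requested Spark version does not match the
# Spark version the Gluten bundle was compiled for.
set -euo pipefail

# Hypothetical helper (not part of the PR).
#   $1: requested Delta Spark version (inputs.spark_version)
#   $2: Spark version baked into the bundle (GLUTEN_BUNDLE_SPARK_VERSION)
check_spark_version() {
  local requested="$1"
  local bundled="$2"
  if [ "$requested" != "$bundled" ]; then
    # `::error::` is the GitHub Actions workflow command for an error annotation.
    echo "::error::spark_version=${requested} does not match the Gluten bundle's Spark ${bundled}"
    return 1
  fi
}

# In the workflow these would come from the step's env; the defaults are
# assumptions so the sketch runs standalone.
check_spark_version "${SPARK_VERSION:-4.1}" "${GLUTEN_BUNDLE_SPARK_VERSION:-4.1}"
echo "spark_version OK"
```

Placed at the top of the `Resolve workflow inputs` script, a check like this would stop a mismatched manual run before any shard starts, instead of letting it fail later inside the Delta test suites.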
