zhengruifeng opened a new pull request, #55761: URL: https://github.com/apache/spark/pull/55761
### What changes were proposed in this pull request?

Follow-up to [SPARK-56768](https://issues.apache.org/jira/browse/SPARK-56768) (apache/spark#55726), which introduced a shared `precompile` CI job that runs Spark's SBT build once and publishes the resulting `target/` trees as a GitHub Actions artifact for the pyspark matrix entries to consume. This PR extends that same artifact to the `sparkr` build.

Concretely:

- The `precompile` job's `if:` gate now also fires when `sparkr == 'true'` is set in the precondition output, so the artifact is built even when only sparkr changes.
- The `sparkr` job adds `precompile` to `needs:`, downloads and extracts the artifact (with the same graceful fallback as the pyspark matrix), and exports `SKIP_SCALA_BUILD=true` for `dev/run-tests.py` only when the artifact was successfully extracted.
- No `dev/run-tests.py` change is needed; the `SKIP_SCALA_BUILD` gate landed with SPARK-56768.

### Optional: graceful fallback if precompile fails

Same pattern as the pyspark matrix:

- The "Download precompiled artifact" step is gated on `needs.precompile.result == 'success'` and has `continue-on-error: true`.
- The "Extract precompiled artifact" step is gated on the download succeeding and also has `continue-on-error: true`.
- Inside the "Run tests" bash block, `SKIP_SCALA_BUILD=true` is exported only when `steps.extract-precompiled.outcome == 'success'`. Otherwise it stays unset and `dev/run-tests.py` falls back to the original local SBT build.

So a precompile/download/extract failure degrades sparkr to the pre-PR behavior, not a workflow failure.

### Why are the changes needed?

The sparkr job today runs the same ~13m of redundant SBT compile that the pyspark matrix used to run. Reusing the existing precompile artifact removes that redundant work. The `precompile` job is already running in any workflow run where pyspark changes are present; adding sparkr as another consumer is essentially free (just another download of the same artifact).
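
For reviewers who want the gating described above in one place, here is a rough sketch of how the two jobs could be wired. It is not the actual diff. The `fromJson(needs.precondition.outputs.required)` expression paths, the job-level `!cancelled()` guard, the `download-precompiled` step id, the artifact name, and the archive name are all assumptions made for illustration. Only the step names, the `extract-precompiled` id, the `continue-on-error` settings, and the `SKIP_SCALA_BUILD` export come from this description.

```yaml
# Illustrative sketch only; see the assumptions listed in the paragraph above.
jobs:
  precompile:
    needs: precondition
    # Assumed shape of the gate: build the artifact when pyspark OR sparkr is selected.
    if: >-
      fromJson(needs.precondition.outputs.required).pyspark == 'true' ||
      fromJson(needs.precondition.outputs.required).sparkr == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "SBT build and artifact upload happen here"

  sparkr:
    needs: [precondition, precompile]
    # Assumed guard: without a status-check function the job would be skipped
    # when precompile fails, and the fallback below could never run.
    if: (!cancelled()) && fromJson(needs.precondition.outputs.required).sparkr == 'true'
    runs-on: ubuntu-latest
    steps:
      - name: Download precompiled artifact
        id: download-precompiled            # hypothetical step id
        if: needs.precompile.result == 'success'
        continue-on-error: true
        uses: actions/download-artifact@v4
        with:
          name: precompiled-target          # hypothetical artifact name

      - name: Extract precompiled artifact
        id: extract-precompiled
        if: steps.download-precompiled.outcome == 'success'
        continue-on-error: true
        run: tar -xzf precompiled-target.tar.gz   # hypothetical archive name

      - name: Run tests
        run: |
          # Skip the local SBT build only when extraction actually succeeded.
          if [[ "${{ steps.extract-precompiled.outcome }}" == "success" ]]; then
            echo "Reusing precompiled artifact, skipping local SBT build."
            export SKIP_SCALA_BUILD=true
          fi
          # dev/run-tests.py is invoked here; its arguments are omitted from this sketch.
```

If the download step is skipped or fails, its `outcome` is not `success`, so the extract step is skipped in turn and `SKIP_SCALA_BUILD` stays unset. That is the degradation path described in the fallback section.
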
When sparkr is the only changed module, the `precompile` job is now scheduled to run anyway (via the new `sparkr == 'true'` clause in its `if:` gate), so this case picks up the same saving.

### Estimated savings

| | Per sparkr run |
|---|---:|
| Redundant SBT compile in sparkr today | ~13m |
| Add back: download + extract overhead | ~1m |
| **Net CI compute saved per sparkr run** | **~12m** |

This is on top of the ~96m / ~14% already saved by SPARK-56768. The actual wall clock for the sparkr job will drop by roughly the same amount (sparkr is not on the critical path; the pyspark matrix still drives the workflow's wall-clock).

### Does this PR introduce _any_ user-facing change?

No. CI infrastructure change only.

### How was this patch tested?

The change is exercised by the CI run of this PR itself, when the sparkr job runs. The expected log signature inside "Run tests" is `Reusing precompiled artifact, skipping local SBT build.`, mirroring what the pyspark matrix already prints.

If the precompile artifact is not available (precompile job failed, or this is some future caller that doesn't enable it), sparkr falls back to the local SBT build path, which is identical to today's behavior.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)
