This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 2d5a77bbea4a [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs

2d5a77bbea4a is described below

commit 2d5a77bbea4a96916525299d277f368790ccc602
Author: Dongjoon Hyun <dh...@apple.com>
AuthorDate: Wed May 8 13:48:12 2024 -0700

    [SPARK-48116][INFRA][3.4] Run `pyspark-pandas*` only in PR builder and Daily Python CIs

    ### What changes were proposed in this pull request?

    This PR aims to run `pyspark-pandas*` of `branch-3.4` only in the PR builder and the Daily Python CIs. In other words, only the commit builder will skip it by default. Please note that PR builders do not consume ASF resources, and they provide a lot of test coverage every day.

    The `branch-3.4` Python Daily CI runs all Python tests, including `pyspark-pandas`, as shown here:

    https://github.com/apache/spark/blob/21548a8cc5c527d4416a276a852f967b4410bd4b/.github/workflows/build_branch34_python.yml#L43-L44

    ### Why are the changes needed?

    To reduce GitHub Actions usage in order to meet the ASF INFRA policy.
    - https://infra.apache.org/github-actions-policy.html

    > All workflows MUST have a job concurrency level less than or equal to 20. This means a workflow cannot have more than 20 jobs running at the same time across all matrices.

    Although `pandas` is an **optional** package in PySpark, it is essential for PySpark users, and we have **6 test pipelines** which require a lot of resources. We need to optimize the job concurrency level to `less than or equal to 20` while keeping as much test coverage as possible.

    https://github.com/apache/spark/blob/da0c7cc81bb3d69d381dd0683e910eae4c80e9ae/dev/requirements.txt#L4-L7

    - pyspark-pandas
    - pyspark-pandas-slow

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual review.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #46483 from dongjoon-hyun/SPARK-48116-3.4.
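The patch below relies on a GitHub Actions matrix-`exclude` idiom: the expression `<flag> != 'true' && 'pyspark-pandas'` evaluates to the string `pyspark-pandas` when the flag is off (so that matrix row is excluded) and to `false` otherwise, which matches no real row, so nothing is excluded. A minimal Python sketch of that filtering behavior follows; the helper names are illustrative only, not part of the patch or of GitHub Actions itself:

```python
# Sketch of GitHub Actions-style matrix exclusion (illustrative, not Spark code).
# When the pandas flag is enabled, the short-circuit expression yields False,
# producing exclude entries that match no module, so every pipeline runs.

def build_excludes(pandas_enabled: bool) -> list[dict]:
    """Mimic `${{ flag != 'true' && 'pyspark-pandas' }}` for both pandas rows."""
    cond = not pandas_enabled
    return [
        {"modules": "pyspark-pandas" if cond else False},
        {"modules": "pyspark-pandas-slow" if cond else False},
    ]

def apply_matrix(modules: list[str], excludes: list[dict]) -> list[str]:
    """Drop matrix rows whose `modules` value appears in an exclude entry."""
    excluded = {e["modules"] for e in excludes}
    return [m for m in modules if m not in excluded]

modules = ["pyspark-sql", "pyspark-pandas", "pyspark-pandas-slow", "pyspark-connect"]
# Commit builder on apache/spark: pandas disabled, both pandas rows dropped.
print(apply_matrix(modules, build_excludes(pandas_enabled=False)))
# PR builder / Daily Python CI: pandas enabled, all rows kept.
print(apply_matrix(modules, build_excludes(pandas_enabled=True)))
```

With the flag disabled, only `pyspark-sql` and `pyspark-connect` survive the filter; with it enabled, the `False` exclude entries are inert and all four pipelines run.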
Authored-by: Dongjoon Hyun <dh...@apple.com>
Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 .github/workflows/build_and_test.yml | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 2d2e8da80d46..825ad064d078 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -88,12 +88,18 @@ jobs:
             tpcds=`./dev/is-changed.py -m sql`
             docker=`./dev/is-changed.py -m docker-integration-tests`
           fi
+          if [ "${{ github.repository != 'apache/spark' }}" ]; then
+            pandas=$pyspark
+          else
+            pandas=false
+          fi
           # 'build', 'scala-213', and 'java-11-17' are always true for now.
           # It does not save significant time and most of PRs trigger the build.
           precondition="
             {
               \"build\": \"true\",
               \"pyspark\": \"$pyspark\",
+              \"pyspark-pandas\": \"$pandas\",
               \"sparkr\": \"$sparkr\",
               \"tpcds-1g\": \"$tpcds\",
               \"docker-integration-tests\": \"$docker\",
@@ -349,6 +355,12 @@ jobs:
             pyspark-pandas-slow
           - >-
             pyspark-connect
+        exclude:
+          # Always run if pyspark-pandas == 'true', even infra-image is skip (such as non-master job)
+          # In practice, the build will run in individual PR, but not against the individual commit
+          # in Apache Spark repository.
+          - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas' }}
+          - modules: ${{ fromJson(needs.precondition.outputs.required).pyspark-pandas != 'true' && 'pyspark-pandas-slow' }}
     env:
       MODULES_TO_TEST: ${{ matrix.modules }}
       HADOOP_PROFILE: ${{ inputs.hadoop }}

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org