Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this unification work. Let me know how I can help.

Wenchen

On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Re: unification
>
> We also have a long-standing problem with how we manage Python dependencies, something I’ve tried (unsuccessfully <https://github.com/apache/spark/pull/27928>) to fix in the past.
>
> Consider, for example, how many separate places this numpy dependency is installed:
>
> 1. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
> 2. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
> 3. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
> 4. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
> 5. https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
> 6. https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
> 7. https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
> 8. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
> 9. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
> 10. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
> 11. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
> 12. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>
> None of those installations reference a unified version requirement, so naturally they are inconsistent across all these different lines. Some say `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several cases there is no version requirement specified at all.
>
> I’m interested in trying again to fix this problem, but it needs to be in collaboration with a committer since I cannot fully test the release scripts. (This testing gap is what doomed my last attempt at fixing this problem.)
>
> Nick
>
>
> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this topic now.
>
> In fact, the main job of the release process, building packages and documents, is tested in GitHub Actions jobs. However, the way we test them is different from what we do in the release scripts.
>
> 1. The execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, GitHub Actions jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify them. The Docker image for the release process needs to set up more things, so it may not be viable to use a single Dockerfile for both.
>
> 2. The execution code is different. Take building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The GitHub Actions job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
>
> It's better if we can run the release scripts as GitHub Actions jobs, but I think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr> wrote:
>
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for GitHub Actions? Is there a specific owner for that today or a different volunteer each time?
>>
>> The Apache organization owns GitHub Actions, and committers (contributors with write permissions) can retrigger/cancel a GitHub Actions workflow, but GitHub Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what is the process to change those (if possible at all, but I presume not all Apache projects have the same limits)?
>>
>> For limits, I don't think there is any significant limit, especially since the Apache organization has 900 donated runners used by its projects, and there is an initiative from the Infra team to add self-hosted runners running on Kubernetes (document <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>).
>>
>> > Where should the artifacts be stored?
>>
>> Usually, we use Maven for jars, DockerHub for Docker images, and GitHub cache for workflow cache. But we can use GitHub artifacts to store any kind of package (even Docker images in the ghcr), which is fully accepted by Apache policies. Also, if the project has a cloud account (AWS, GCP, Azure, ...), a bucket can be used to store some of the packages.
>>
>> > Who should be permitted to sign a version - and what is the process for that?
>>
>> The Apache documentation is clear about this: by default, only PMC members can be release managers, but we can contact the infra team to add one of the committers as a release manager (document <https://infra.apache.org/release-publishing.html#releasemanager>). The process of creating a new version is described in this document <https://www.apache.org/legal/release-policy.html#policy>.
>>
>>
>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Following the conversation started with the Spark 4.0.0 release, this is a thread to discuss improvements to our release processes.
>>>
>>> I'll start by raising some questions that probably should have answers to start the discussion:
>>>
>>> 1. What is currently running in GitHub Actions?
>>> 2. Who currently has permissions for GitHub Actions? Is there a specific owner for that today or a different volunteer each time?
>>> 3. What are the current limits of GitHub Actions, who set them - and what is the process to change those (if possible at all, but I presume not all Apache projects have the same limits)?
>>> 4. What versions should we support as an output for the build?
>>> 5. Where should the artifacts be stored?
>>> 6. What should be the output? Only a tar, or also a Docker image published somewhere?
>>> 7. Do we want to have a release on fixed dates or a manual release upon request?
>>> 8. Who should be permitted to sign a version - and what is the process for that?
>>>
>>> Thanks!
>>> Nimrod
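
As a rough illustration of the numpy inconsistency Nicholas describes in this thread, a small script could scan the install sites and report every distinct version specifier it finds. This is only a sketch under stated assumptions: the file names and contents in `sample` are hypothetical stand-ins (only the specifiers `>=1.21`, `>=1.20.0`, `==1.20.3`, and the unpinned case come from the thread), and `find_pins` and `summarize` are illustrative helpers, not part of any Spark tooling.

```python
import re
from collections import Counter


def find_pins(text: str, package: str = "numpy") -> list[str]:
    """Return the version specifier after each mention of `package`.

    A bare mention with no specifier is reported as an empty string.
    Names that merely contain the package name (e.g. "numpy-stubs")
    are skipped.
    """
    pattern = re.compile(
        rf"(?<![\w.-]){re.escape(package)}(?![\w.-])"
        r"\s*((?:==|>=|<=|~=|!=|<|>)\s*[\w.*]+)?"
    )
    return [(m.group(1) or "").replace(" ", "") for m in pattern.finditer(text)]


def summarize(files: dict[str, str], package: str = "numpy") -> Counter:
    """Count how often each distinct specifier appears across all files."""
    counts: Counter = Counter()
    for content in files.values():
        counts.update(find_pins(content, package))
    return counts


# Hypothetical file contents, made up for illustration only.
sample = {
    "ci_workflow.yml": "python3 -m pip install 'numpy>=1.20.0' pyarrow pandas",
    "requirements.txt": "numpy>=1.21\npandas",
    "Dockerfile": "RUN pip install numpy==1.20.3 scipy",
    "pip_test_script": "pip install numpy",
}

if __name__ == "__main__":
    for spec, count in summarize(sample).items():
        print(f"{spec or '(unpinned)'}: {count}")
```

More than one distinct key in the result means the install sites disagree on the version requirement.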