Re: unification

We also have a long-standing problem with how we manage Python dependencies, 
something I’ve tried (unsuccessfully 
<https://github.com/apache/spark/pull/27928>) to fix in the past.

Consider, for example, how many separate places this numpy dependency is 
installed:

1. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of these installations references a shared version requirement, so 
naturally they are inconsistent with one another. Some say `>=1.21`, others 
say `>=1.20.0`, and still others say `==1.20.3`. In several cases there is no 
version requirement specified at all.
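
As a rough illustration of the kind of check that could catch this drift, here 
is a minimal Python sketch. The sample lines and file names below are made up 
for illustration (they are not copied from the repo), and the helper names 
`collect_specifiers` and `is_consistent` are hypothetical. It only handles 
simple single-clause specifiers like the ones quoted above.

```python
import re
from collections import defaultdict

# Hypothetical install lines, standing in for the workflow files,
# requirements file, scripts, and Dockerfiles linked above.
SAMPLE_LINES = {
    "workflow-a.yml": "python3.9 -m pip install 'numpy>=1.21' pyarrow pandas",
    "requirements.txt": "numpy>=1.20.0",
    "Dockerfile": "RUN pip install numpy==1.20.3 scipy",
    "run-tests.sh": "conda install -y numpy pandas",
}


def collect_specifiers(lines, package="numpy"):
    """Map each version specifier found for `package` to the sources using it."""
    # Match the package name as a whole token (so "numpydoc" is ignored),
    # followed by an optional single-clause specifier such as ">=1.21".
    pattern = re.compile(
        rf"(?<![\w-]){re.escape(package)}(?![\w-])"
        r"\s*((?:==|>=|<=|~=|!=|<|>)\s*[\w.*]+)?"
    )
    found = defaultdict(list)
    for source, text in lines.items():
        for match in pattern.finditer(text):
            spec = (match.group(1) or "").replace(" ", "")
            found[spec or "<unpinned>"].append(source)
    return dict(found)


def is_consistent(found):
    """True when every occurrence uses the same specifier (or none exist)."""
    return len(found) <= 1


if __name__ == "__main__":
    found = collect_specifiers(SAMPLE_LINES)
    for spec, sources in found.items():
        print(f"{spec}: {', '.join(sources)}")
    print("consistent" if is_consistent(found) else "inconsistent")
```

A check like this only reports the inconsistency. It does not fix the 
underlying problem, which is that there is no single place for these lines to 
reference.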

I’m interested in trying again to fix this problem, but it needs to be in 
collaboration with a committer since I cannot fully test the release scripts. 
(This testing gap is what doomed my last attempt at fixing this problem.)

Nick


> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> 
> After finishing the 4.0.0-preview1 RC1, I have more experience with this 
> topic now.
> 
> In fact, the main job of the release process, building packages and 
> documents, is tested in GitHub Actions jobs. However, the way we test them 
> there is different from what we do in the release scripts.
> 
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release 
> process needs to set up more things so it may not be viable to use a single 
> Dockerfile for both.
> 
> 2. the execution code is different. Use building documents as an example:
> The release scripts: 
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job: 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
> 
> It's better if we can run the release scripts as Github Action jobs, but I 
> think it's more important to do the unification now.
> 
> Thanks,
> Wenchen
> 
> 
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr 
> <mailto:huss...@awala.fr>> wrote:
>> Hello,
>> 
>> I can answer some of your questions, which are common to other Apache 
>> projects.
>> 
>> > Who currently has permissions for Github actions? Is there a specific 
>> > owner for that today or a different volunteer each time?
>> 
>> The Apache organization owns Github Actions, and committers (contributors 
>> with write permissions) can retrigger/cancel a Github Actions workflow, but 
>> Github Actions runners are managed by the Apache infra team.
>> 
>> > What are the current limits of GitHub Actions, who set them - and what is 
>> > the process to change those (if possible at all, but I presume not all 
>> > Apache projects have the same limits)?
>> 
>> For limits, I don't think there is any significant limit, especially since 
>> the Apache organization has 900 donated runners used by its projects, and 
>> there is an initiative from the Infra team to add self-hosted runners 
>> running on Kubernetes (document 
>> <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>).
>> 
>> > Where should the artifacts be stored?
>> 
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github 
>> cache for workflow cache. But we can use Github artifacts to store any kind 
>> of package (even Docker images in the ghcr), which is fully accepted by 
>> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure, 
>> ...), a bucket can be used to store some of the packages.
>> 
>> 
>> > Who should be permitted to sign a version - and what is the process for 
>> > that?
>> 
>> The Apache documentation is clear about this, by default only PMC members 
>> can be release managers, but we can contact the infra team to add one of the 
>> committers as a release manager (document 
>> <https://infra.apache.org/release-publishing.html#releasemanager>). The 
>> process of creating a new version is described in this document 
>> <https://www.apache.org/legal/release-policy.html#policy>.
>> 
>> 
>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com 
>> <mailto:ofek.nim...@gmail.com>> wrote:
>>> Following the conversation started with Spark 4.0.0 release, this is a 
>>> thread to discuss improvements to our release processes.
>>> 
>>> I'll start by raising some questions that we should probably answer to 
>>> get the discussion going:
>>> 
>>> What is currently running in GitHub Actions?
>>> Who currently has permissions for Github actions? Is there a specific owner 
>>> for that today or a different volunteer each time?
>>> What are the current limits of GitHub Actions, who set them - and what is 
>>> the process to change those (if possible at all, but I presume not all 
>>> Apache projects have the same limits)?
>>> What versions should we support as an output for the build?
>>> Where should the artifacts be stored?
>>> What should be the output? only tar or also a docker image published 
>>> somewhere?
>>> Do we want to have a release on fixed dates or a manual release upon 
>>> request?
>>> Who should be permitted to sign a version - and what is the process for 
>>> that?
>>> 
>>> Thanks!
>>> Nimrod
