Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this
unification work. Let me know how I can help.

Wenchen

On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Re: unification
>
> We also have a long-standing problem with how we manage Python
> dependencies, something I’ve tried (unsuccessfully
> <https://github.com/apache/spark/pull/27928>) to fix in the past.
>
> Consider, for example, how many separate places this numpy dependency is
> installed:
>
> 1.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
> 2.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
> 3.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
> 4.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
> 5.
> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
> 6.
> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
> 7.
> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
> 8.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
> 9.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
> 10.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
> 11.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
> 12.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>
> None of those installations reference a unified version requirement, so
> naturally they are inconsistent across all these different lines. Some say
> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In
> several cases there is no version requirement specified at all.
>
> I’m interested in trying again to fix this problem, but it needs to be in
> collaboration with a committer since I cannot fully test the release
> scripts. (This testing gap is what doomed my last attempt at fixing this
> problem.)
>
> Nick
>
>
> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this
> topic now.
>
> In fact, the main job of the release process: building packages and
> documents, is tested in Github Action jobs. However, the way we test them
> is different from what we do in the release scripts.
>
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release
> process needs to set up more things so it may not be viable to use a single
> Dockerfile for both.
>
> 2. the execution code is different. Use building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify
> them.
>
> It's better if we can run the release scripts as Github Action jobs, but I
> think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr> wrote:
>
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for Github actions? Is there a specific
>> owner for that today or a different volunteer each time?
>>
>> The Apache organization owns Github Actions, and committers (contributors
>> with write permissions) can retrigger/cancel a Github Actions workflow, but
>> Github Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what
>> is the process to change those (if possible at all, but I presume not all
>> Apache projects have the same limits)?
>>
>> For limits, I don't think there is any significant limit, especially
>> since the Apache organization has 900 donated runners used by its projects,
>> and there is an initiative from the Infra team to add self-hosted runners
>> running on Kubernetes (document
>> <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>
>> ).
>>
>> > Where should the artifacts be stored?
>>
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github
>> cache for workflow cache. But we can use Github artifacts to store any kind
>> of package (even Docker images in the ghcr), which is fully accepted by
>> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
>> ...), a bucket can be used to store some of the packages.
>>
>>
>>  > Who should be permitted to sign a version - and what is the process
>> for that?
>>
>> The Apache documentation is clear about this, by default only PMC members
>> can be release managers, but we can contact the infra team to add one of
>> the committers as a release manager (document
>> <https://infra.apache.org/release-publishing.html#releasemanager>). The
>> process of creating a new version is described in this document
>> <https://www.apache.org/legal/release-policy.html#policy>.
>>
>>
>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com>
>> wrote:
>>
>>> Following the conversation started with Spark 4.0.0 release, this is a
>>> thread to discuss improvements to our release processes.
>>>
>>> I'll Start by raising some questions that probably should have answers
>>> to start the discussion:
>>>
>>>
>>>    1. What is currently running in GitHub Actions?
>>>    2. Who currently has permissions for Github actions? Is there a
>>>    specific owner for that today or a different volunteer each time?
>>>    3. What are the current limits of GitHub Actions, who set them - and
>>>    what is the process to change those (if possible at all, but I presume 
>>> not
>>>    all Apache projects have the same limits)?
>>>    4. What versions should we support as an output for the build?
>>>    5. Where should the artifacts be stored?
>>>    6. What should be the output? only tar or also a docker image
>>>    published somewhere?
>>>    7. Do we want to have a release on fixed dates or a manual release
>>>    upon request?
>>>    8. Who should be permitted to sign a version - and what is the
>>>    process for that?
>>>
>>>
>>> Thanks!
>>> Nimrod
>>>
>>
>

Reply via email to