Re: [DISCUSS] How can we afford CI for the new release cadence?

Akira Ajisaka Mon, 11 May 2026 03:14:17 -0700

> I'm working on fixing branch-3.5 CI: 
> https://github.com/apache/spark/pull/55764. Hopefully I'll complete it this 
> week.


Closed the above PR as a duplicate of
https://github.com/apache/spark/pull/55432. Sorry for the confusion.

On Mon, May 11, 2026 at 3:22 PM Akira Ajisaka <[email protected]> wrote:
>
> > Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, 
> > the timing is complicated by the Ubuntu PPA DDoS outages.
>
> I'm working on fixing branch-3.5 CI:
> https://github.com/apache/spark/pull/55764. Hopefully I'll complete it
> this week. The Ubuntu outage seems unrelated.
>
> Anyway, I'm +1 to reduce the frequency on non-active branches.
>
> Thanks,
> Akira
>
> On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]> wrote:
> >
> > Yeah I'm not surprised that 3.5 is not in its best shape at this point
> > because we ran almost no tests on it. When we reduce the coverage for
> > a branch, we will have issues when we try to release. That's why reducing
> > coverage should not be our only effort. We should explore all the different
> > ways to make CI better.
> >
> > On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]> wrote:
> >>
> >> Smarter test selection is probably the magic but it’s going to be effort. 
> >> Also on the 3.5 side the CI is super broken so I’m trying to fix it up 
> >> now, the timing is complicated by the Ubuntu PPA DDoS outages.
> >>
> >>
> >> Twitter: https://twitter.com/holdenkarau
> >> Fight Health Insurance: https://www.fighthealthinsurance.com/
> >> Books (Learning Spark, High Performance Spark, etc.): 
> >> https://amzn.to/2MaRAG9
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >> Pronouns: she/her
> >>
> >> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]> 
> >> wrote:
> >>>
> >>> I definitely agree that we can save a lot of time by optimizing the CI.
> >>> But currently, Java tests take more time than Python tests. They are
> >>> comparable, but Java tests are still observably more expensive. We should
> >>> not focus only on the Python ones.
> >>>
> >>> In the meantime, I'll take a look at the low-hanging fruit in CI to make it
> >>> smarter.
> >>>
> >>> Tian
> >>>
> >>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
> >>>>
> >>>> I also did some data analysis, and I think we should also revisit the
> >>>> CI:
> >>>> 1. Deduplicate the compile. For example, the pyspark matrix executes 8
> >>>>    byte-identical SBT compiles in parallel today, costing ~108m of
> >>>>    redundant work per run.
> >>>>    (I am working on a POC: https://github.com/apache/spark/pull/55726)
> >>>> 2. Smarter test selection. 11% of the most recent 10,000 commits are
> >>>>    test-only changes. Today these trigger the full pyspark matrix because
> >>>>    the dependency graph in dev/sparktestsupport/modules.py cascades
> >>>>    through dependent_modules regardless of whether the change is in source
> >>>>    or tests. The cascade is correct for source changes (downstream modules
> >>>>    import the source) but unnecessary for tests (no production code
> >>>>    imports test code).
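
The test-selection idea in point 2 can be sketched roughly as follows. This is a minimal illustration, not the code in dev/sparktestsupport/modules.py: the `Module` class, its fields, the path prefixes, and the module names `base`, `mid`, and `leaf` are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical, simplified module entry (not the real Spark class)."""
    name: str
    source_prefixes: tuple
    test_prefixes: tuple
    dependents: list = field(default_factory=list)


def owns(module, path):
    # A module owns a path if it falls under its source or test prefixes.
    return path.startswith(module.source_prefixes + module.test_prefixes)


def is_test_path(module, path):
    return path.startswith(module.test_prefixes)


def modules_to_test(modules, changed_files):
    """Select modules to test, cascading only for source changes."""
    selected = set()
    cascade_roots = []
    for module in modules:
        touched = [p for p in changed_files if owns(module, p)]
        if not touched:
            continue
        selected.add(module.name)
        # Only a source change can affect downstream modules.
        if any(not is_test_path(module, p) for p in touched):
            cascade_roots.append(module)

    # Walk dependents transitively from every module with a source change.
    visited = set()
    stack = list(cascade_roots)
    while stack:
        module = stack.pop()
        if module.name in visited:
            continue
        visited.add(module.name)
        selected.add(module.name)
        stack.extend(module.dependents)
    return selected


# Hypothetical three-module chain: base <- mid <- leaf.
leaf = Module("leaf", ("leaf/src/",), ("leaf/tests/",))
mid = Module("mid", ("mid/src/",), ("mid/tests/",), [leaf])
base = Module("base", ("base/src/",), ("base/tests/",), [mid])
ALL_MODULES = [base, mid, leaf]

test_only = modules_to_test(ALL_MODULES, ["base/tests/test_x.py"])
source_change = modules_to_test(ALL_MODULES, ["base/src/x.py"])
```

In this sketch a test-only change to `base` selects only `base`, while a source change to `base` still selects all three modules. Test utilities that other modules' tests import would break the "nothing imports test code" assumption and would need separate handling.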
> >>>>
> >>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]> wrote:
> >>>>>
> >>>>> For now, I created a PR to reduce the frequency by half: 
> >>>>> https://github.com/apache/spark/pull/55729
> >>>>>
> >>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]> 
> >>>>> wrote:
> >>>>>>
> >>>>>> I think we need to 1) cut CI pressure and 2) look for more resources
> >>>>>> to run CI at the same time.
> >>>>>>
> >>>>>> Cut CI:
> >>>>>>
> >>>>>> I think the biggest cut would be on the scheduled jobs first. For
> >>>>>> instance, change the 3.5 and 4.0 scheduled jobs from daily to once every
> >>>>>> three days, or even once per week.
> >>>>>> Then for branch 4.x or more active release branches, we can do daily
> >>>>>> post-merge CI instead of running after each commit?
> >>>>>> Meanwhile, we can explore ways to run selected tests on the actual
> >>>>>> affected code path to avoid full runs.
> >>>>>> And optimize the tests themselves so they run faster.
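
A rough estimate of what a lower schedule frequency would save, as a sketch only. It assumes the cost of a scheduled job scales linearly with how often it runs, and it reuses the ~200k runner minutes per month quoted further down the thread for the daily 4.0 scheduled run. The function and variable names are made up for illustration.

```python
def scaled_cost(daily_cost_minutes, interval_days):
    """Monthly cost if a daily job runs once every `interval_days` instead.

    Assumes cost is proportional to the number of runs.
    """
    return daily_cost_minutes / interval_days


BRANCH_4_0_DAILY = 200_000  # approximate monthly figure quoted in this thread

every_three_days = scaled_cost(BRANCH_4_0_DAILY, 3)
weekly = scaled_cost(BRANCH_4_0_DAILY, 7)

print(f"every three days: ~{every_three_days:,.0f} minutes/month")
print(f"weekly:           ~{weekly:,.0f} minutes/month")
```

Under that linear assumption, the 4.0 scheduled run would drop from about 200k to roughly 67k minutes per month at once every three days, or roughly 29k at once per week.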
> >>>>>>
> >>>>>> Expand resources:
> >>>>>>
> >>>>>> We can probably move some of the scheduled jobs out to another repo,
> >>>>>> like what Apache Arrow did.
> >>>>>> I wonder if self-hosted runners are acceptable to the community? This
> >>>>>> sounds like a longer-term solution if we were to introduce more checks
> >>>>>> in the future.
> >>>>>>
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Yicong Huang
> >>>>>>
> >>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]> 
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> We should probably reduce the scheduled build for the time being.
> >>>>>>>
> >>>>>>> As a reference, I worked on Apache Arrow, and they use an extra CI
> >>>>>>> provided by a third party, e.g., see
> >>>>>>> - PR: https://github.com/apache/arrow/pull/48915
> >>>>>>> - You comment like 
> >>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
> >>>>>>> - It posts the CI link like 
> >>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
> >>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
> >>>>>>>
> >>>>>>> I feel like this can be an alternative if any vendor is willing to 
> >>>>>>> support it.
> >>>>>>>
> >>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]> 
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> I did some quick calculations, and we can't afford the CI with our 
> >>>>>>>> existing infra.
> >>>>>>>>
> >>>>>>>> Per ASF policy
> >>>>>>>> (https://infra.apache.org/github-actions-policy.html), the maximum
> >>>>>>>> weekly runner minutes we have is 250k. That's roughly 1M per month,
> >>>>>>>> and last month we hit almost exactly that number: 1,082,721 minutes.
> >>>>>>>>
> >>>>>>>> Our current CI consists of a few components (all numbers are
> >>>>>>>> runner minutes per month):
> >>>>>>>> * each commit on the master branch - ~280k
> >>>>>>>> * 4.1 scheduled run - ~200k
> >>>>>>>> * 4.0 scheduled run - ~200k
> >>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
> >>>>>>>> * master scheduled run - ~300k
> >>>>>>>>
> >>>>>>>> With the new release cadence, even if we only do scheduled runs on
> >>>>>>>> 4.x (which we shouldn't because it's an active dev branch, but that's
> >>>>>>>> another story), we need an extra 200k. With a 6-month maintenance
> >>>>>>>> window, we will always have at least 3 actively maintained versions
> >>>>>>>> (including LTS) that require CI.
> >>>>>>>>
> >>>>>>>> If it's just 200k extra, maybe it's manageable. But I really believe
> >>>>>>>> we need tests for the 4.x branch - we should treat that branch more
> >>>>>>>> like master than, say, 4.2. Even if we don't do pre-merge checks on
> >>>>>>>> it, we should do a post-merge check for every commit. A daily check on
> >>>>>>>> an active dev branch sounds a bit too risky to me. That would be
> >>>>>>>> another 300k.
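
The arithmetic above can be checked with a short script. This is only a sketch: the per-component figures are the approximate numbers quoted in this message, the 52/12 weeks-per-month conversion is an assumption, and the dictionary keys are labels made up for illustration.

```python
WEEKLY_CAP = 250_000        # runner minutes per week, as quoted from ASF policy
WEEKS_PER_MONTH = 52 / 12   # assumption: average month length

monthly_cap = WEEKLY_CAP * WEEKS_PER_MONTH

# Approximate monthly runner minutes per component, as listed in the message.
current = {
    "master per-commit": 280_000,
    "4.1 scheduled": 200_000,
    "4.0 scheduled": 200_000,
    "3.5 scheduled": 0,  # described as negligible
    "master scheduled": 300_000,
}

# Additional load described for the new release cadence.
proposed_extra = {
    "4.x scheduled": 200_000,
    "4.x post-merge per-commit": 300_000,
}

current_total = sum(current.values())
projected_total = current_total + sum(proposed_extra.values())
overshoot = projected_total - monthly_cap

print(f"monthly cap:     ~{monthly_cap:,.0f}")
print(f"itemized today:  ~{current_total:,}")
print(f"projected total: ~{projected_total:,}")
print(f"over the cap by: ~{overshoot:,.0f}")
```

With these inputs the itemized components sum to about 980k, somewhat below the 1,082,721 minutes actually reported, so the list is an approximation. Adding both 4.x items brings the projection to about 1.48M per month, roughly 400k over a cap of about 1.08M.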
> >>>>>>>>
> >>>>>>>> This does not include the discussion about any pre-merge check for 
> >>>>>>>> 4.x, which we should actually think about in the future.
> >>>>>>>>
> >>>>>>>> So the question is - how do we deal with that? The solutions I can
> >>>>>>>> think of are:
> >>>>>>>> * Get some self-hosted runners to increase our CI capacity, which is
> >>>>>>>> limited by ASF policy
> >>>>>>>> * Optimize our CI and tests so they take less time to run
> >>>>>>>> * Reduce the coverage of our tests so we can at least test all
> >>>>>>>> branches
> >>>>>>>>
> >>>>>>>> Any idea is welcome.
> >>>>>>>>
> >>>>>>>> Tian
