> I'm working on fixing branch-3.5 CI:
> https://github.com/apache/spark/pull/55764. Hopefully I'll complete it this
> week.

Closed the above PR as a duplicate of https://github.com/apache/spark/pull/55432. Sorry for the confusion.

On Mon, May 11, 2026 at 3:22 PM Akira Ajisaka <[email protected]> wrote:
>
> > Also on the 3.5 side the CI is super broken so I’m trying to fix it up now,
> > the timing is complicated by the Ubuntu PPA DDoS outages.
>
> I'm working on fixing branch-3.5 CI:
> https://github.com/apache/spark/pull/55764. Hopefully I'll complete it
> this week. The Ubuntu outage seems unrelated.
>
> Anyway, I'm +1 to reduce the frequency on non-active branches.
>
> Thanks,
> Akira
>
> On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]> wrote:
> >
> > Yeah I'm not surprised that 3.5 is not in its best shape at this point
> > because we almost did not run tests on it. When we reduce the coverage for
> > a branch, we will have issues when we try to release. That's why we should
> > not only make efforts on that side. We should explore all different ways to
> > make CI better.
> >
> > On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]> wrote:
> >>
> >> Smarter test selection is probably the magic but it’s going to be effort.
> >> Also on the 3.5 side the CI is super broken so I’m trying to fix it up
> >> now, the timing is complicated by the Ubuntu PPA DDoS outages.
> >>
> >>
> >> Twitter: https://twitter.com/holdenkarau
> >> Fight Health Insurance: https://www.fighthealthinsurance.com/
> >> Books (Learning Spark, High Performance Spark, etc.):
> >> https://amzn.to/2MaRAG9
> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >> Pronouns: she/her
> >>
> >> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]>
> >> wrote:
> >>>
> >>> I definitely agree that we can save a lot of time by optimizing the CI.
> >>> But currently, java tests take more time than python tests. They are
> >>> comparable but java tests are still observably more expensive. We should
> >>> not only focus on python ones.
> >>>
> >>> In the meantime, I'll take a look at low-hanging fruit in CI to make it
> >>> smarter.
> >>>
> >>> Tian
> >>>
> >>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
> >>>>
> >>>> I also did some data analysis, and think we should also revisit the
> >>>> CI:
> >>>> 1, Deduplicate the compile. For example, the pyspark matrix executes 8
> >>>> byte-identical SBT compiles in parallel today, costing ~108m of
> >>>> redundant work per run.
> >>>> (I am working on a POC: https://github.com/apache/spark/pull/55726)
> >>>> 2, Smarter test selection. 11% of the most recent 10,000 commits are
> >>>> test-only changes. Today these trigger the full pyspark matrix because
> >>>> the dependency graph in dev/sparktestsupport/modules.py cascades through
> >>>> dependent_modules regardless of whether the change is in source or
> >>>> tests. The cascade is correct for source changes (downstream modules
> >>>> import the source) but unnecessary for tests (no production code
> >>>> imports test code).
> >>>>
> >>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]> wrote:
> >>>>>
> >>>>> For now, I created a PR to reduce the frequency by half:
> >>>>> https://github.com/apache/spark/pull/55729
> >>>>>
> >>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>> I think we need to 1) cut CI pressure and 2) look for more resources
> >>>>>> to run CI at the same time.
> >>>>>>
> >>>>>> Cut CI:
> >>>>>>
> >>>>>> I think the biggest cut would be on the scheduled jobs first. For
> >>>>>> instance change 3.5 and 4.0 scheduled jobs from daily to once every
> >>>>>> three days, or even once per week.
> >>>>>> Then for branch 4.x or more active release branches we can do daily
> >>>>>> post-merge CI, instead of after each commit?
> >>>>>> Meanwhile we can explore ways to run selected tests on the actual
> >>>>>> affected code path to avoid full runs.
> >>>>>> And optimize tests themselves so they run faster.
> >>>>>>
> >>>>>> Expand resources:
> >>>>>>
> >>>>>> We can probably move some of the scheduled jobs out to another repo
> >>>>>> like what Apache Arrow did.
> >>>>>> I wonder if self-hosted runners are acceptable to the community? This
> >>>>>> sounds like a longer-term solution if we were to introduce more checks
> >>>>>> in the future.
> >>>>>>
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Yicong Huang
> >>>>>>
> >>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> We should probably reduce the scheduled build for the time being.
> >>>>>>>
> >>>>>>> As a reference, I worked in Apache Arrow, and they use an extra CI
> >>>>>>> run by a third party, e.g., see
> >>>>>>> - PR: https://github.com/apache/arrow/pull/48915
> >>>>>>> - You comment like
> >>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
> >>>>>>> - It posts the CI link like
> >>>>>>> https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
> >>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
> >>>>>>>
> >>>>>>> I feel like this can be an alternative if any vendor is willing to
> >>>>>>> support it.
> >>>>>>>
> >>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> I did some quick calculations, and we can't afford the CI with our
> >>>>>>>> existing infra.
> >>>>>>>>
> >>>>>>>> Per ASF policy
> >>>>>>>> (https://infra.apache.org/github-actions-policy.html), the maximum
> >>>>>>>> weekly runner minutes we have is 250k. That's roughly 1 million per
> >>>>>>>> month, and last month, we hit almost the exact number - 1,082,721
> >>>>>>>> minutes.
> >>>>>>>>
> >>>>>>>> Our current CI consists of a few components (all numbers are
> >>>>>>>> runner minutes per month):
> >>>>>>>> * each commit on master branch - ~280k
> >>>>>>>> * 4.1 scheduled run - ~200k
> >>>>>>>> * 4.0 scheduled run - ~200k
> >>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
> >>>>>>>> * master scheduled run - ~300k
> >>>>>>>>
> >>>>>>>> With the new release cadence, even if we only do scheduled run on
> >>>>>>>> 4.x (which we shouldn't because it's an active dev branch but that's
> >>>>>>>> another story), we need an extra 200k. With a 6-month maintenance
> >>>>>>>> window, we will always have at least 3 active maintained versions
> >>>>>>>> (including LTS) that require CI.
> >>>>>>>>
> >>>>>>>> If it's just 200k extra, maybe it's manageable. But I really believe
> >>>>>>>> we need tests for the 4.x branch - we should treat that branch more
> >>>>>>>> like master, than say 4.2. Even if we don't do pre-merge check on
> >>>>>>>> it, we should do post-merge check for every commit. Daily check on
> >>>>>>>> an active dev branch sounds a bit too risky to me. That would be
> >>>>>>>> another 300k.
> >>>>>>>>
> >>>>>>>> This does not include the discussion about any pre-merge check for
> >>>>>>>> 4.x, which we should actually think about in the future.
> >>>>>>>>
> >>>>>>>> So the question is - how do we deal with that? The solutions I can
> >>>>>>>> think of are
> >>>>>>>> * Get some self-hosted runners and increase our CI capacity, which
> >>>>>>>> is limited by ASF policy
> >>>>>>>> * Optimize our CIs and tests so they take less time to run
> >>>>>>>> * Reduce the coverage of our tests so we can at least test all
> >>>>>>>> branches
> >>>>>>>>
> >>>>>>>> Any idea is welcome.
> >>>>>>>>
> >>>>>>>> Tian
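
For reference, the budget arithmetic in Tian's first message can be reproduced with a rough sketch like the one below. The per-component figures are the approximate ones quoted in the thread. The weeks-per-month conversion, the choice to count the 3.5 scheduled run as zero, and the reading of "another 300k" as coming on top of the extra 200k are assumptions made here for illustration, not numbers confirmed by the thread.

```python
# Rough CI budget arithmetic using the approximate figures quoted in the thread.
WEEKLY_LIMIT = 250_000  # runner minutes per week (ASF policy figure cited in the thread)

# Assumption: an average month has 52 / 12 weeks.
WEEKS_PER_MONTH = 52 / 12
monthly_limit = WEEKLY_LIMIT * WEEKS_PER_MONTH

# Approximate monthly usage per component, as listed in the thread.
current_usage = {
    "master, every commit": 280_000,
    "4.1 scheduled": 200_000,
    "4.0 scheduled": 200_000,
    "3.5 scheduled": 0,  # "negligible" in the thread; treated as 0 here (assumption)
    "master scheduled": 300_000,
}

# Extra load discussed for the new release cadence.
# Assumption: the post-merge 300k is added on top of the scheduled 200k.
extra_usage = {
    "4.x scheduled": 200_000,
    "4.x post-merge, every commit": 300_000,
}

current_total = sum(current_usage.values())
projected_total = current_total + sum(extra_usage.values())
shortfall = projected_total - monthly_limit

print(f"monthly limit:   {monthly_limit:>12,.0f}")
print(f"current total:   {current_total:>12,}")
print(f"projected total: {projected_total:>12,}")
print(f"shortfall:       {shortfall:>12,.0f}")
```

Under these assumptions the itemized components already sit close to the monthly limit, and the projected total exceeds it, which matches the conclusion in the thread that the existing allocation cannot absorb the extra 4.x runs.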

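
Ruifeng's point about smarter test selection can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not the actual code or API of dev/sparktestsupport/modules.py: the `Module` class, its fields, the `select_modules` helper, and the example paths are all invented here for illustration. The idea is that a source change selects the module plus everything downstream of it, while a test-only change selects just the module that owns the test.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a test module definition."""
    name: str
    source_prefixes: tuple
    test_prefixes: tuple
    dependents: list = field(default_factory=list)  # modules that import this module's source


def with_dependents(module):
    """Return the names of the module and all modules downstream of it."""
    names, stack = set(), [module]
    while stack:
        current = stack.pop()
        if current.name not in names:
            names.add(current.name)
            stack.extend(current.dependents)
    return names


def select_modules(changed_files, modules):
    """Pick the modules whose tests should run for the given changed files."""
    selected = set()
    for path in changed_files:
        for module in modules:
            # Check test prefixes first: test directories may live under a source prefix.
            if path.startswith(module.test_prefixes):
                selected.add(module.name)  # test-only change: no cascade
            elif path.startswith(module.source_prefixes):
                selected |= with_dependents(module)  # source change: cascade downstream
    return selected


# Illustrative module graph (hypothetical paths): core <- sql <- ml
ml = Module("ml", ("python/pyspark/ml/",), ("python/pyspark/ml/tests/",))
sql = Module("sql", ("python/pyspark/sql/",), ("python/pyspark/sql/tests/",), [ml])
core = Module("core", ("python/pyspark/core/",), ("python/pyspark/core/tests/",), [sql])
modules = [core, sql, ml]

print(select_modules(["python/pyspark/sql/tests/test_foo.py"], modules))
print(select_modules(["python/pyspark/sql/functions.py"], modules))
```

The traversal uses its own visited set rather than the shared `selected` set, so a module that was already picked because of a test-only change still cascades to its dependents if one of its source files also changed.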