I think we can disable the per-commit post-merge jobs on branch-4.x, which is the integration branch for the 4.x line, not a release branch. RCs are cut from the numbered branches (branch-4.0/4.1/4.2/4.3/...), and those keep their own per-merge CI.
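As a rough sanity check on the figures Tian quotes below (about 350k min/week used, a 250k min/week limit, 180k+ min/week for post-merge across the 2 active dev branches), here is a small Python sketch of what dropping per-commit post-merge on one branch would leave us with. The thread does not say how the 180k splits between the two branches, so the share values are purely my assumption, and the helper name is made up for illustration.

```python
# Weekly runner-minute figures quoted in the thread.
ASF_LIMIT = 250_000         # ASF weekly limit
CURRENT_USAGE = 350_000     # "about 350k min/week"
POST_MERGE_TOTAL = 180_000  # "180k+" across the 2 active dev branches


def usage_after_disabling(share_of_post_merge: float) -> int:
    """Weekly usage if the branch holding this share of post-merge
    minutes stops running per-commit jobs (hypothetical helper)."""
    saved = round(POST_MERGE_TOTAL * share_of_post_merge)
    return CURRENT_USAGE - saved


# ASSUMPTION: the branch-4.x share of post-merge minutes is not given
# in the thread, so a few plausible values are tried here.
for share in (0.4, 0.5, 0.6):
    remaining = usage_after_disabling(share)
    print(f"share={share:.0%} remaining={remaining:,} "
          f"vs_limit={remaining - ASF_LIMIT:+,}")
```

If the split is roughly even, this change alone would land close to the limit rather than clearly under it, so it probably needs to be combined with the other reductions discussed below.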
On Sat, May 23, 2026 at 1:58 PM Yicong Huang <[email protected]> wrote:
> Is it possible to run 4.x branch post merge as a scheduled job, possibly
> daily, instead of after every commit? I think this can quickly cut the CI
> usage.
>
> Best,
> Yicong Huang
>
> On Fri, May 22, 2026 at 11:48 AM Tian Gao via dev <[email protected]>
> wrote:
>
>> Like I mentioned a few weeks ago, we can't afford this. We received the
>> warning from ASF today and took a quick look at our CI usage.
>>
>> We are using about 350k min/week now, and the limit is 250k min/week. The
>> post merge itself took 180k+ min/week because now we have 2 active dev
>> branches.
>>
>> I think we should put some effort into this. There are a few ways to make
>> the situation better:
>>
>> 1. Run fewer tests - We disabled pandas on spark tests for post merge a
>> while ago to comply with the ASF limit.
>> 2. Make tests run faster - I occasionally optimize python tests, not sure
>> if Java tests are being taken care of. Java tests took significantly
>> more time in our CI now.
>> 3. Run tests less frequently - helpful for scheduled CI which we already
>> did, but won't help post merge.
>> 4. Smart testing - this is a bit tricky for post-merge because ideally we
>> want a full coverage for each commit. We can probably do some safe
>> heuristics, but it takes time and we could potentially lose coverage.
>> 5. Move scheduled tests to another repo - arrow seems to be doing this.
>> This allows us to use all the ASF budget to run post-merge tests. However,
>> we need some sponsor to achieve this.
>>
>> I think we have 2 weeks to at least temporarily reduce our CI usage under
>> the limit, so we need something fast, then something good.
>>
>> Tian
>>
>> On Mon, May 11, 2026 at 3:14 AM Akira Ajisaka <[email protected]>
>> wrote:
>>
>>> > I'm working on fixing branch-3.5 CI: https://github.com/apache/spark/pull/55764. Hopefully I'll complete it this week.
>>>
>>> Closed the above PR as a duplicate of https://github.com/apache/spark/pull/55432. Sorry for the confusion.
>>>
>>> On Mon, May 11, 2026 at 3:22 PM Akira Ajisaka <[email protected]> wrote:
>>> >
>>> > > Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, the timing is complicated by the Ubuntu PPA DDoS outages.
>>> >
>>> > I'm working on fixing branch-3.5 CI:
>>> > https://github.com/apache/spark/pull/55764. Hopefully I'll complete it
>>> > this week. The Ubuntu outage seems unrelated.
>>> >
>>> > Anyway, I'm +1 to reduce the frequency on non-active branches.
>>> >
>>> > Thanks,
>>> > Akira
>>> >
>>> > On Fri, May 8, 2026 at 5:30 AM Tian Gao via dev <[email protected]> wrote:
>>> > >
>>> > > Yeah I'm not surprised that 3.5 is not in its best shape at this point because we almost did not run tests on it. When we reduce the coverage for a branch, we will have issues when we try to release. That's why we should not only make efforts on that side. We should explore all different ways to make CI better.
>>> > >
>>> > > On Thu, May 7, 2026 at 12:02 PM Holden Karau <[email protected]> wrote:
>>> > >>
>>> > >> Smarter test selection is probably the magic but it’s going to be effort. Also on the 3.5 side the CI is super broken so I’m trying to fix it up now, the timing is complicated by the Ubuntu PPA DDoS outages.
>>> > >>
>>> > >>
>>> > >> Twitter: https://twitter.com/holdenkarau
>>> > >> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>> > >> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>> > >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> > >> Pronouns: she/her
>>> > >>
>>> > >> On Thu, May 7, 2026 at 11:28 AM Tian Gao via dev <[email protected]> wrote:
>>> > >>>
>>> > >>> I definitely agree that we can save a lot of time by optimizing the CI. But currently, java tests take more time than python tests. They are comparable but java tests are still observably more expensive. We should not only focus on python ones.
>>> > >>>
>>> > >>> In the meantime, I'll take a look at low-hanging fruit on CI to make it smarter.
>>> > >>>
>>> > >>> Tian
>>> > >>>
>>> > >>> On Thu, May 7, 2026 at 6:40 AM Ruifeng Zheng <[email protected]> wrote:
>>> > >>>>
>>> > >>>> I also did some data analysis, and think we should also revisit the CI:
>>> > >>>> 1, Deduplicate the compile. For example, the pyspark matrix executes 8 byte-identical SBT compiles in parallel today, costing ~108m of redundant work per run.
>>> > >>>> (I am working on a POC: https://github.com/apache/spark/pull/55726)
>>> > >>>> 2, Smarter test selection. 11% of recent 10000 commits are test-only changes.
>>> > >>>> Today these trigger the full pyspark matrix because the dependency graph in dev/sparktestsupport/modules.py cascades through dependent_modules regardless of whether the change is in source or tests. The cascade is correct for source changes (downstream modules import the source) but unnecessary for tests (no production code imports test code).
>>> > >>>>
>>> > >>>> On Thu, May 7, 2026 at 5:23 PM Hyukjin Kwon <[email protected]> wrote:
>>> > >>>>>
>>> > >>>>> For now, I created a PR to reduce the frequency by half: https://github.com/apache/spark/pull/55729
>>> > >>>>>
>>> > >>>>> On Thu, 7 May 2026 at 07:56, Yicong Huang <[email protected]> wrote:
>>> > >>>>>>
>>> > >>>>>> I think we need to 1) cut CIs pressure and 2) look for more resources to run CIs at the same time.
>>> > >>>>>>
>>> > >>>>>> Cut CIs:
>>> > >>>>>>
>>> > >>>>>> I think the biggest cut would be on the scheduled jobs first. For instance change 3.5 and 4.0 scheduled jobs from daily to once in three days, or even once per week.
>>> > >>>>>> Then for branch 4.x or more active release branches we can do daily post merge CI, instead of after each commit?
>>> > >>>>>> Meanwhile we can explore ways to run selected tests on the actual affected code path to avoid full runs.
>>> > >>>>>> And optimize tests themselves so they run faster.
>>> > >>>>>>
>>> > >>>>>> Expand resources:
>>> > >>>>>>
>>> > >>>>>> We can probably move some of the scheduled jobs out to another repo like what Apache Arrow did.
>>> > >>>>>> I wonder if self hosted runners are acceptable to the community? This sounds like a longer term solution if we were to introduce more checks in the future.
>>> > >>>>>>
>>> > >>>>>>
>>> > >>>>>> Best regards,
>>> > >>>>>> Yicong Huang
>>> > >>>>>>
>>> > >>>>>> On Wed, May 6, 2026 at 3:04 PM Hyukjin Kwon <[email protected]> wrote:
>>> > >>>>>>>
>>> > >>>>>>> We should probably reduce the scheduled build for the time being.
>>> > >>>>>>>
>>> > >>>>>>> As a reference, I worked in Apache Arrow, and they use an extra CI by thirdparty, e.g., see
>>> > >>>>>>> - PR: https://github.com/apache/arrow/pull/48915
>>> > >>>>>>> - You comment like https://github.com/apache/arrow/pull/48915#issuecomment-3852062184
>>> > >>>>>>> - It posts the CI link like https://github.com/apache/arrow/pull/48915#issuecomment-3852079993
>>> > >>>>>>> - The CI is defined at https://github.com/ursacomputing/crossbow
>>> > >>>>>>>
>>> > >>>>>>> I feel like this can be an alternative if any vendor is willing to support it.
>>> > >>>>>>>
>>> > >>>>>>> On Thu, 7 May 2026 at 04:09, Tian Gao via dev <[email protected]> wrote:
>>> > >>>>>>>>
>>> > >>>>>>>> I did some quick calculations, and we can't afford the CI with our existing infra.
>>> > >>>>>>>>
>>> > >>>>>>>> Per ASF policy (https://infra.apache.org/github-actions-policy.html), the maximum weekly runner minutes we have is 250k.
>>> > >>>>>>>> That's 1M per month, and last month, we hit almost the exact number - 1,082,721 minutes.
>>> > >>>>>>>>
>>> > >>>>>>>> Our current CI consists of a few components (all numbers are per month):
>>> > >>>>>>>> * each commit on master branch - ~280k
>>> > >>>>>>>> * 4.1 scheduled run - ~200k
>>> > >>>>>>>> * 4.0 scheduled run - ~200k
>>> > >>>>>>>> * 3.5 scheduled run - negligible because we don't run many tests
>>> > >>>>>>>> * master scheduled run - ~300k
>>> > >>>>>>>>
>>> > >>>>>>>> With the new release cadence, even if we only do scheduled run on 4.x (which we shouldn't because it's an active dev branch but that's another story), we need an extra 200k. With a 6-month maintenance window, we will always have at least 3 active maintained versions (including LTS) that require CI.
>>> > >>>>>>>>
>>> > >>>>>>>> If it's just 200k extra, maybe it's manageable. But I really believe we need tests for the 4.x branch - we should treat that branch more like master, than say 4.2. Even if we don't do pre-merge check on it, we should do post-merge check for every commit. Daily check on an active dev branch sounds a bit too risky to me. That would be another 300k.
>>> > >>>>>>>>
>>> > >>>>>>>> This does not include the discussion about any pre-merge check for 4.x, which we should actually think about in the future.
>>> > >>>>>>>>
>>> > >>>>>>>> So the question is - how do we deal with that? The solutions I can think of are
>>> > >>>>>>>> * Get some self-hosted runners and increase our CI capability limited by ASF policy
>>> > >>>>>>>> * Optimize our CIs and tests so it takes less time to run
>>> > >>>>>>>> * Reduce the coverage of our tests so we can at least test all branches
>>> > >>>>>>>>
>>> > >>>>>>>> Any idea is welcome.
>>> > >>>>>>>>
>>> > >>>>>>>> Tian
>>>
>>
