BTW, Vikram: the kind of accidental full test build you triggered occurred, for various reasons, in 3.3% of all CI runs, with a 3.6% impact on compute time. So if you or others were afraid we were burning money, using power, and wasting water, that fear was unfounded. This was - as I explained - an anecdotal experience, not something that happened consistently. Even though the impact is small, I added protection for similar cases in #68802.
J.

On Mon, Jun 22, 2026 at 1:22 AM Jarek Potiuk <[email protected]> wrote:

> Hi all,
>
> This afternoon, I worked with my agents to conduct a detailed analysis of our CI compute and elapsed times over the past two weeks. I want to share the findings, a handful of PRs, and - more importantly - set realistic (and data-backed) expectations about what will and won't move the needle.
>
> This is a long one, but since there were just **feelings** that things are different than they are, I wanted to clarify it with precise data gathered over the last 2 weeks.
>
> I think discussions on the dev call, where there is no way to look at the details and the data, where we mostly base statements on wishful thinking, and where anecdotal experiences might not match statistical reality, are unproductive.
>
> I think we should base more of our discussions on data and not "because we think things are different".
>
> Full write-up (numbers, tables, methodology):
> https://gist.github.com/potiuk/0ddc404dc76353db9849b801e7d43e26
>
>
> *== Selective checks already save a LOT ==*
> The headline is that our selective-checks machinery is already doing an enormous amount of work. Measured over two weeks:
>
> - vs running full tests for every PR: ~60% fewer compute-hours
> - vs the complete all-versions matrix (canary): ~81% fewer
>
> So PRs already run at roughly 19-40% of "run everything." The system works, and it works well. I want to be clear about that before anyone reads the optimizations as "CI is broken" - it isn't.
>
>
> *== The optimizations are small and incremental - with one exception ==*
> The one genuinely large item is not really an optimization, it's fixing a mistake merged 2 weeks ago (!): generated/provider_dependencies.json was accidentally re-added to git tracking together with the *ClickHouse provider* (even though provider_dependencies.json and its shasum were gitignored, they were force-added).
>
> Because it was tracked again, every provider-dependency regeneration looked like a dependency change and forced the full all-Python-versions matrix: ~2,700 compute-hours per 2 weeks for nothing. Removing it (#68801) should also fix some of the issues that a few of us had when the generated and committed file was stale - that was one of the reasons I removed it in April 2025 when we switched our infra to uv. That's almost 12% less compute.
>
> Everything else is genuinely small and incremental, yielding only a few percent each:
>
> - #68533 (merged) - standard venv-operator tests made DB-free so they parallelize
> - #68802 - stop non-test workflows + prek-only changes forcing the full matrix
> - #68814 - rebalance the provider test groups (split the serial monolith)
> - #68821 - reduce canary frequency (AMD 4->2/day, ARM 8->2/day). This one is a real TRADE-OFF, not a free win: it saves ~6% of AMD compute, but it makes our investigations potentially harder and our reaction time to main regressions slower - a regression can sit undetected up to ~12h instead of ~6h. We should only take it if we consciously accept that.
>
> Combined, these are worth ~20% of AMD compute - meaningful, but incremental, and the canary one trades coverage/latency for it.
>
>
> *== The real bottleneck is the shared ASF public runner queue, not our tests ==*
> Our jobs spent ~3,960 job-hours over two weeks simply *waiting for a runner*, and jobs in big runs wait ~2.5x longer than in small ones because each big run floods the shared pool. No amount of our own test-trimming changes this fundamentally - it's a capacity/contention problem, not a "we test too much" problem.
>
> And the important reality check, based on history: the only thing that changes the contributor experience *dramatically* - historically a ~4x speedup in elapsed time - is more powerful hardware OUTSIDE the shared ASF public runner queue.
> Our incremental optimizations help at the margins; dedicated hardware is the step change.
>
>
> *== Why "complaining" to Infra is the wrong move ==*
> We have been here before - more than 5 years ago, when Infra had far fewer runners, we went through a very similar exercise. The structural issues haven't changed:
>
> - The public runner pool is SHARED across all Apache projects. Other projects may or may not optimize their tests the way we do - we can't control that.
> - Everyone's traffic is up sharply because of AI-generated PRs, so the contention is systemic, not Airflow-specific.
> - Infra has no real mechanism to fix this for us other than what they have always recommended: a PMC arranging its own dedicated, self-hosted runners. If you look at the discussion we had in October 2020 (https://lists.apache.org/thread/8htrdgf2h8qz1hv7mbb96v8l8x8d1dyl), this was the only solution that we could apply back then, and it worked.
>
>
> *Vikram* - if you want to follow up on writing to them, that's fine, but the message has to recognize the reality we're in. Last time, the thing that actually worked was GitHub giving us ~3x more runners. That is unlikely to repeat: GitHub themselves now have to scale their own infrastructure 20-30x for the same AI-driven reasons, so a "please give us 3x again" ask is not realistic. If we write, it has to be grounded in that.
>
> *Shahar* - I think your exploration of Kubernetes self-hosted runners on cheaper spot instances is the most promising path by far. That is the lever that gives us the ~4x, on our own terms, without depending on the shared pool or on Infra's capacity. It also lets us easily control which runs should be faster (canary runs + committer runs seems like a good idea), and it is the only option *100%* resilient to the external AI-generated flood. I'd strongly suggest continuing it - it's the fastest way to actually improve our contributors' experience.
>
> == Next steps ==
>
> Everyone - please review/approve those PRs and let's merge them. They will help "a little," but won't solve the underlying runner contention:
>
> * https://github.com/apache/airflow/pull/68801
> * https://github.com/apache/airflow/pull/68802
> * https://github.com/apache/airflow/pull/68814
> * https://github.com/apache/airflow/pull/68821
>
> Happy to discuss any of the above (ideally based on data and analysis).
>
> Also, if others have concrete ideas on **what** we can improve - such as which tests we might skip under certain circumstances - all concrete ideas are welcome. The Gist above has a lot of data.
>
> But we should all realise that pretty much everything here is a trade-off - usually we trade elapsed time and compute time for greater certainty that a merge will not break all other PRs.
>
> Best,
> Jarek
>
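
For anyone who wants to sanity-check the arithmetic in the quoted message, here is a minimal sketch. It is not taken from the Gist or from any Airflow tooling: the function names are made up for illustration, and the canary part assumes the runs are evenly spaced over 24 hours, which the message does not state.

```python
def remaining_share(percent_saved: float) -> float:
    """Percent of the baseline compute still used after a given saving."""
    return 100.0 - percent_saved


def worst_case_detection_hours(canary_runs_per_day: int) -> float:
    """Longest gap between canary runs, ASSUMING they are evenly spaced."""
    return 24 / canary_runs_per_day


# Selective checks: ~60% and ~81% fewer compute-hours (figures from the message).
print(remaining_share(60), remaining_share(81))

# Canary frequency for AMD: 4 runs/day today vs. the proposed 2 runs/day.
print(worst_case_detection_hours(4), worst_case_detection_hours(2))
```

Under that even-spacing assumption, the savings figures give the "19-40% of run everything" range, and the AMD canary change gives the "~12h instead of ~6h" window mentioned for #68821.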
