Re: CI compute & elapsed-time analysis — what helps, what doesn't, and the runner-queue reality

Jarek Potiuk Sun, 21 Jun 2026 22:33:41 -0700

BTW, Vikram: accidental triggering of a full test build, like the one you
hit, occurred for various reasons in 3.3% of all CI runs, with a 3.6%
impact on compute time. So if you or others were afraid we were burning
money, using power and wasting water, that fear was unfounded. This was,
as I explained, an anecdotal experience, not something that happened
consistently. Even though the impact is small, I added additional
protection for similar cases in #68802.


J.


On Mon, Jun 22, 2026 at 1:22 AM Jarek Potiuk <[email protected]> wrote:

> Hi all,
>
> This afternoon, I worked with my agents to conduct a detailed analysis of
> our CI compute and elapsed times over the past two weeks. I want to share
> the findings, a handful of PRs, and — more importantly — set realistic (and
> data-backed) expectations about what will and won't move the needle.
>
> This is a long one, but since there were just **feelings** that things are
> different from what they actually are, I wanted to clarify with precise
> data gathered over the last 2 weeks.
>
> I think discussions on the dev call, where there is no way to look at
> details and dig into the data, where we mostly base statements on wishful
> thinking, and where anecdotal experiences might not match statistical
> reality, are unproductive.
>
> I think we should base more of our discussions on data and not "because we
> think things are different".
>
> Full write-up (numbers, tables, methodology):
> https://gist.github.com/potiuk/0ddc404dc76353db9849b801e7d43e26
>
>
>
> *== Selective checks already save a LOT ==*
> The headline is that our selective-checks machinery is already doing an
> enormous amount of work. Measured over two weeks:
>
> - vs running full tests for every PR: ~60% fewer compute-hours
> - vs the complete all-versions matrix (canary): ~81% fewer
>
> So PRs already run at roughly 19-40% of "run everything." The system
> works, and it works well. I want to be clear about that before anyone reads
> the optimizations as "CI is broken" — it isn't.
>
>
> *== The optimizations are small and incremental — with one exception ==*
> The one genuinely large item is not really an optimization, it's fixing a
> mistake merged ~2 weeks ago (!): generated/provider_dependencies.json was
> accidentally re-added to git tracking together with the *ClickHouse
> provider* (even though provider_dependencies.json and its shasum were
> gitignored, they were force-added).
>
> Because it was tracked again, every provider-dependency regeneration
> looked like a dependency change and forced the full all-Python-versions
> matrix: ~2,700 compute-hours per 2 weeks for nothing. Removing it (#68801)
> means almost 12% less compute, and it should also fix some of the issues
> that a few of us had when the generated and committed file was stale,
> which was one of the reasons I removed it in April 2025 when we switched
> our infra to uv.
>
> Everything else is genuinely small and incremental, yielding only a few
> percent each:
>
> - #68533 (merged) — standard venv-operator tests made DB-free so they
> parallelize
> - #68802 — stop non-test workflows + prek-only changes forcing the full
> matrix
> - #68814 — rebalance the provider test groups (split the serial monolith)
>
> - #68821 — reduce canary frequency (AMD 4->2/day, ARM 8->2/day). This one
> is a real TRADE-OFF, not a free win: it saves ~6% of AMD compute, but it
> makes our investigations potentially harder and our reaction time to main
> regressions slower — a regression can sit undetected up to ~12h instead of
> ~6h. We should only take it if we consciously accept that.
>
> Combined, these are worth ~20% of AMD compute: meaningful, but
> incremental, and the canary one trades coverage/latency for it.
>
>
> *== The real bottleneck is the shared ASF public runner queue, not our
> tests ==*
> Our jobs spent ~3,960 job-hours over two weeks simply *waiting for a
> runner*, and jobs in big runs wait ~2.5x longer than in small ones because
> each big run floods the shared pool. No amount of our own test-trimming
> changes this fundamentally — it's a capacity/contention problem, not a "we
> test too much" problem.
>
> And the important reality check, based on history: the only thing that
> changes the contributor experience *dramatically* — historically a ~4x
> speedup in elapsed time — is more powerful hardware OUTSIDE the shared ASF
> public runner queue. Our incremental optimizations help at the margins;
> dedicated hardware is the step change.
>
>
> *== Why "complaining" to Infra is the wrong move ==*
> We have been here before: more than 5 years ago, when Infra had far fewer
> runners, we went through a very similar exercise. The structural issues
> haven't changed:
>
>    - The public runner pool is SHARED across all Apache projects. Other
>    projects may or may not optimize their tests the way we do — we can't
>    control that.
>    - Everyone's traffic is up sharply because of AI-generated PRs, so the
>    contention is systemic, not Airflow-specific.
>    - Infra has no real mechanism to fix this for us other than what they
>    have always recommended: a PMC arranging its own dedicated, self-hosted
>    runners. If you look at the discussion we had in October 2020
>    https://lists.apache.org/thread/8htrdgf2h8qz1hv7mbb96v8l8x8d1dyl -
>    this was the only solution that we could apply back then, and it worked.
>
>
> *Vikram* — if you want to follow up on writing to them, that's fine, but
> the message has to recognize the reality we're in. Last time, the thing
> that actually worked was GitHub giving us ~3x more runners. That is
> unlikely to repeat: GitHub themselves now have to scale their own
> infrastructure 20-30x for the same AI-driven reasons, so a "please give us
> 3x again" ask is not realistic. If we write, it has to be grounded in that.
>
> *Shahar* - I think your exploration of Kubernetes self-hosted runners on
> cheaper spot instances is the most promising path by far. That is the lever
> that gives us the ~4x, on our own terms, without depending on the shared
> pool or on Infra's capacity. It also lets us easily control which runs
> should be faster (canary runs + committer runs seem like a good idea), and
> it is the only option that is *100%* resilient to an external AI-generated
> flood. I'd strongly suggest continuing it - it's the fastest way to
> actually improve our contributors' experience.
>
> *== Next steps ==*
>
> Everyone - please review/approve those PRs and let's merge them. They will
> help "a little," but won't solve the underlying runner contention:
>
> * https://github.com/apache/airflow/pull/68801
> * https://github.com/apache/airflow/pull/68802
> * https://github.com/apache/airflow/pull/68814
> * https://github.com/apache/airflow/pull/68821
>
> Happy to discuss any of the above (based on data and analysis - ideally).
>
> Also, if others have concrete ideas on **what** we can improve, such as
> which tests we might skip under certain circumstances, they are all
> welcome. The Gist above has a lot of data.
>
> But we should all realise that pretty much everything here is a trade-off
> - usually we trade elapsed time and compute time for greater certainty that
> a merge will not break all other PRs.
>
> Best,
> Jarek
>
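
For anyone who wants to sanity-check the arithmetic behind two of the
figures quoted above (the "19-40% of run everything" range and the ~6h vs
~12h canary detection window), here is a minimal Python sketch. The function
names are purely illustrative and are not part of any Airflow or breeze
tooling, and the detection-window calculation assumes canary runs are evenly
spaced through the day, which the message does not state explicitly.

```python
# Back-of-the-envelope check of two figures quoted in the message above.
# Function names are hypothetical helpers, not Airflow/breeze APIs.


def remaining_share(percent_saved: float) -> float:
    """Percent of the 'run everything' baseline that is still consumed."""
    return 100.0 - percent_saved


def worst_case_gap_hours(runs_per_day: int) -> float:
    """Longest gap between scheduled runs, ASSUMING they are evenly spaced."""
    return 24.0 / runs_per_day


# ~60% and ~81% fewer compute-hours leave 40% and 19% of the baseline.
print(remaining_share(60), remaining_share(81))

# AMD canary going from 4 to 2 runs per day: gap between runs in hours.
print(worst_case_gap_hours(4), worst_case_gap_hours(2))
```

Under those assumptions the two savings figures correspond to the 19-40%
range, and halving the AMD canary frequency doubles the worst-case gap
between scheduled runs.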
