Re: Iceberg Consumption of ASF Shared GitHub-hosted Runners

Abnob Doss Fri, 12 Jun 2026 17:34:30 -0700

Hi,

A potential small win from the subproject side: the iceberg-rust Python 
bindings CI had ended up building the Rust bindings twice per run, due to an 
accidental interaction between a few changes over time. One-line fix:
https://github.com/apache/iceberg-rust/pull/2636


Measured over the past 7 days, the duplicate build took a median of 8.4 min on 
Linux, 12.1 min on macOS, and 15.3 min on Windows, totaling about 2,400 
runner-minutes across 207 job executions. After the fix the same step takes a 
few seconds.
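
As a rough sanity check on those figures, here is a small Python sketch. It uses only the numbers quoted above (about 2,400 runner-minutes, 207 job executions, and the three per-OS medians); the variable names are mine, and the total is approximate, so treat the result as a ballpark.

```python
# Figures taken from the measurements above; the total is approximate.
total_runner_minutes = 2400
job_executions = 207
median_minutes = {"linux": 8.4, "macos": 12.1, "windows": 15.3}

# Mean cost of the duplicate build per job execution.
mean_minutes = total_runner_minutes / job_executions

# Plausibility check: does the mean fall between the fastest and
# slowest per-OS medians?
lowest = min(median_minutes.values())
highest = max(median_minutes.values())
print(f"mean duplicate-build time: {mean_minutes:.1f} min per execution")
print(f"within per-OS median range: {lowest <= mean_minutes <= highest}")
```

The mean comes out at roughly 11.6 minutes per execution, which sits inside the range of the per-OS medians.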

Thanks,
Abanoub

On Wednesday, June 3rd, 2026 at 9:49 AM, Bob Thomson <[email protected]> 
wrote:

> I don't think we have data at that level of granularity; it's a case of 
> looking at the Actions and their run time and frequency of execution in each 
> of your repos, and focussing on the longest running and most frequent ones. 
> That is, an Action run might only run for 5 minutes each time, but if it is 
> running 400 times a day then that occupies more than one job slot of the total 
> of 900 ASF has, for the duration of that day.
> Experience so far suggests those actions that build Java are often the most 
> time consuming.
> 
> Thanks.
> 
> Kind regards,
> -Bob Thomson.
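
The slot arithmetic in Bob's example can be sketched in a few lines of Python. The helper name is hypothetical, and the sketch assumes the runs are spread evenly across the day, which real CI traffic rarely is, so the result is an average rather than a peak.

```python
MINUTES_PER_DAY = 24 * 60
ASF_TOTAL_SLOTS = 900  # total job slots quoted in the message


def average_slots(minutes_per_run: float, runs_per_day: int) -> float:
    """Average number of job slots kept busy over one day."""
    return minutes_per_run * runs_per_day / MINUTES_PER_DAY


# The example from the message: 5 minutes per run, 400 runs per day.
slots = average_slots(5, 400)
share = slots / ASF_TOTAL_SLOTS
print(f"{slots:.2f} slots on average, {share:.2%} of the pool")
```

Five minutes times 400 runs is 2,000 runner-minutes against 1,440 minutes in a day, hence a little more than one slot.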
> 
> On 2026/06/01 18:39:38 Yufei Gu wrote:
> > Hi Bob,
> >
> > Thanks for the heads-up and for giving the Iceberg community time to work
> > on this.
> >
> > One question: Is the concern based on the overall GitHub Actions
> > consumption of the Iceberg projects (e.g., main repo, python repo, go repo,
> > etc), or only for the main Iceberg repository? Iceberg has multiple
> > repositories, including the main repository as well as Python, Go, Rust,
> > and C++ subprojects. Most of the discussion and optimization work in this
> > thread focuses on the main repository, where the majority of CI usage
> > occurs. If the overall project usage is within acceptable limits, would it
> > be possible to allow a higher quota for a single repo (the Iceberg main
> > repository), given its broader compatibility and integration testing
> > requirements?
> >
> > Yufei
> >
> >
> > On Mon, Jun 1, 2026 at 11:00 AM Steve Loughran <[email protected]> wrote:
> >
> > > This is really good for draft builds.
> > >
> > > If I'm committing and pushing work up to a WiP PR, it is often because I
> > > want *a* machine to do the testing; I don't care who it runs as.
> > >
> > > Forcing PRs to run as the submitter also hardens the OSS repo against
> > > vulnerabilities in the Github Actions and other parts of the build 
> > > process.
> > >
> > > On Mon, 1 Jun 2026 at 17:11, Prashant Singh <[email protected]>
> > > wrote:
> > >
> > >>   Hi all,
> > >>
> > >>   Great progress on the matrix reduction, incremental builds, and draft PR
> > >>   skipping ideas. I'd like to propose a complementary approach that can work
> > >>   alongside all of those: running PR CI on contributor fork compute instead
> > >>   of the ASF shared pool.
> > >>
> > >>   How it works:
> > >>
> > >>   Workflows switch from pull_request to push triggers on non-main
> > >>   branches. Each workflow:
> > >>
> > >>   1. Checks out apache/iceberg main (security boundary — untrusted code
> > >>   can't modify the workflow itself)
> > >>   2. Squash-merges the contributor's fork branch on top
> > >>   3. Runs tests on that merged tree
> > >>
> > >>   Because the push event fires on the fork, GitHub bills the CI minutes
> > >>   to the fork owner's account - not the ASF shared pool. This takes
> > >>   Iceberg's PR CI usage from the ASF runners to effectively zero,
> > >>   regardless of matrix size.
> > >>
> > >>   Why this is complementary:
> > >>
> > >>   The optimizations discussed so far all reduce how much CI runs.
> > >>   Fork-compute changes where it runs. They compose - a leaner matrix
> > >>   running on fork compute is strictly better than either approach alone.
> > >>
> > >>   Inline PR status:
> > >>
> > >>   A lightweight notify_test_workflow.yml (using pull_request_target +
> > >>   Checks API) is included to post fork CI results directly onto the
> > >>   upstream PR's checks tab - so reviewers see green/red status inline as
> > >>   they do today.
> > >>
> > >>   *Prior art*:
> > >>
> > >>   Apache Spark adopted this pattern in 2024 (SPARK-47041) and has been
> > >>   running it in production since. Their full Spark CI matrix runs entirely
> > >>   on contributor forks.
> > >>
> > >>   PR https://github.com/apache/iceberg/pull/15397 covers all 10
> > >>   workflow files. I've verified all workflows pass on fork compute.
> > >>
> > >>   This could be merged independently of the matrix/incremental
> > >>   optimizations and would immediately eliminate PR CI pressure on the
> > >>   ASF pool - well within the June 8 deadline.
> > >>
> > >>   Thoughts?
> > >>
> > >> Prashant Singh
> > >>
> > >> On Fri, May 29, 2026 at 8:47 PM Renjie Liu <[email protected]>
> > >> wrote:
> > >>
> > >>> I like the idea of cutting supported JVM runs in each CI. The JVM has great
> > >>> backward compatibility, so we could run on one JVM (maybe JVM 17) and
> > >>> trigger a nightly run for JVM 21.
> > >>>
> > >>> On Wed, May 27, 2026 at 3:17 AM Steve Loughran <[email protected]>
> > >>> wrote:
> > >>>
> > >>>>
> > >>>> Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
> > >>>> maven repo, with spark the other half.
> > >>>>
> > >>>> It seems to me that only PRs which go near gradle/libs.versions.toml
> > >>>> are going to change dependencies, and so introduce new CVEs.
> > >>>>
> > >>>> There's the separate issue "CVEs are eternal" and all existing
> > >>>> dependencies are collections of undiscovered/unreported cves. That's
> > >>>> dependabot's homework, generally.
> > >>>>
> > >>>>
> > >>>> On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:
> > >>>>
> > >>>>> Thanks everyone for the great ideas.
> > >>>>>
> > >>>>> Here's where we stand today with respect to ASF runner usage (taken
> > >>>>> from the link [2] above):
> > >>>>> GitHub Actions Build Time Used
> > >>>>> - past 7 days total usage: 218,321 minutes
> > >>>>> - past 5 days total usage: 120,241 minutes
> > >>>>>
> > >>>>> *This puts us below the hard ceiling for resource usage* as described
> > >>>>> by https://infra.apache.org/github-actions-policy.html
> > >>>>>
> > >>>>> > The average number of minutes a project uses *per calendar week
> > >>>>> > MUST NOT exceed the equivalent of 25 full-time runners (250,000
> > >>>>> > minutes, or 4,200 hours)*.
> > >>>>> > The average number of minutes a project uses *in any consecutive
> > >>>>> > five-day period MUST NOT exceed the equivalent of 30 full-time runners
> > >>>>> > (216,000 minutes, or 3,600 hours)*.
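
The comparison against the two ceilings can be worked through in Python. This is a minimal sketch using only the usage figures and policy limits quoted above; the function name is hypothetical, and it treats the reported totals as directly comparable to the policy's per-window minute figures.

```python
# Policy ceilings and reported usage, in runner-minutes, as quoted above.
LIMITS = {"7-day": 250_000, "5-day": 216_000}
USAGE = {"7-day": 218_321, "5-day": 120_241}


def headroom(used: int, limit: int) -> tuple[int, float]:
    """Return (minutes remaining, percent of the limit already used)."""
    return limit - used, 100 * used / limit


for window, limit in LIMITS.items():
    remaining, pct_used = headroom(USAGE[window], limit)
    print(f"{window}: {pct_used:.1f}% used, {remaining:,} minutes of headroom")
```

On these numbers the 7-day window is the tighter one, at roughly 87% of its ceiling, while the 5-day window is at roughly 56%.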
> > >>>>>
> > >>>>> We should still make improvements wherever possible.
> > >>>>>
> > >>>>> I have a few PRs to reduce CI usage further.
> > >>>>> - CI: Limit CVE scan runs to relevant changes #16513
> > >>>>> - Build: Simplify CI workflow path filters to avoid per-workflow
> > >>>>> maintenance #16302
> > >>>>>
> > >>>>> There are a couple of heuristics we can use
> > >>>>> 1. Don't run CI if not needed. For example, `site/` dir changes
> > >>>>> shouldn't trigger Spark/Flink/Java CI. This might be optimized already,
> > >>>>> but we should double check just in case.
> > >>>>> 2. If we must run CI, fail fast. For example, if there is a formatter
> > >>>>> issue, fail all inflight CI tasks.
> > >>>>> 3. Within a specific CI workflow, reduce the matrix wherever possible.
> > >>>>> Do we really need to run all "Java versions" x "Scala versions" x
> > >>>>> "Spark versions"?
> > >>>>> 4. Improve individual CI tasks. Spark CI accounts for 57% of all resource
> > >>>>> usage. I have a tracking issue where I benchmarked where all that time
> > >>>>> is spent. See https://github.com/apache/iceberg/issues/16397
> > >>>>>
> > >>>>> Top CI tasks as % of resource use:
> > >>>>> - Spark CI: 57.68%
> > >>>>> - Flink CI: 13.60%
> > >>>>> - Java CI: 7.02%
> > >>>>> - CVE Scan: 3.13%
> > >>>>>
> > >>>>> Best,
> > >>>>> Kevin Liu
> > >>>>>
> > >>>>> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> How about implementing the incremental PR builder? (similar to
> > >>>>>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
> > >>>>>> )
> > >>>>>>
> > >>>>>> I think one of the main causes of GitHub runner pressure in Iceberg
> > >>>>>> is the breadth of our CI matrix. We support multiple languages (java,
> > >>>>>> python, go, rust, cpp) and integrations, and for Java we test across
> > >>>>>> multiple JVM versions, Spark versions, Flink versions, Kafka, Hive/MR,
> > >>>>>> REST/OpenAPI, runtime bundles, and more. That coverage is valuable, but
> > >>>>>> running most of it for every PR is expensive and increases both runner
> > >>>>>> usage and CI wall time.
> > >>>>>>
> > >>>>>> I think the biggest win can be achieved by having an incremental PR
> > >>>>>> build.
> > >>>>>> We already have useful building blocks for it: Gradle build cache,
> > >>>>>> path filters, and version-selective build properties like
> > >>>>>> -DsparkVersions and -DflinkVersions.
> > >>>>>>
> > >>>>>> The idea is to keep full coverage on main, release branches, tags,
> > >>>>>> and global build changes, but make PR CI depend on the files changed:
> > >>>>>>
> > >>>>>>    - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
> > >>>>>>    - spark/v4.1/** changes run only Spark 4.1, not every Spark
> > >>>>>>    version.
> > >>>>>>    - flink/v2.0/** changes run only Flink 2.0, not every Flink
> > >>>>>>    version.
> > >>>>>>    - API/Core/Data/File format changes run the owning Java checks
> > >>>>>>    plus selected downstream canaries, such as latest Spark and latest
> > >>>>>>    Flink, instead of the full engine matrix.
> > >>>>>>    - Runtime/bundle CVE checks run only for affected runtime
> > >>>>>>    artifacts.
> > >>>>>>    - A full-ci label or global Gradle/workflow changes can still
> > >>>>>>    force the full matrix.
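
The selection rules in the list above could be sketched roughly as follows in Python. Everything named here is an assumption for illustration: the job names, the version list in FULL_MATRIX, the choice of canaries, and the directory prefixes are placeholders, not the actual Iceberg CI configuration and not necessarily what the POC PR does.

```python
import re

# Hypothetical job names and versions, for illustration only.
FULL_MATRIX = {"java", "spark-4.0", "spark-4.1", "flink-1.20", "flink-2.0"}
CANARIES = {"spark-4.1", "flink-2.0"}  # assumed "latest" engine versions
CORE_PREFIXES = ("api/", "core/", "data/", "parquet/", "orc/")
DOCS_PREFIXES = ("site/", "docs/")
ENGINE_PATH = re.compile(r"^(spark|flink)/v(\d+\.\d+)/")


def select_jobs(changed_paths, labels=()):
    """Pick the CI jobs to run for a PR from its changed file paths."""
    if "full-ci" in labels:
        return set(FULL_MATRIX)  # manual override
    jobs = set()
    for path in changed_paths:
        engine = ENGINE_PATH.match(path)
        if engine:
            # e.g. spark/v4.1/** runs only the Spark 4.1 job
            jobs.add(f"{engine.group(1)}-{engine.group(2)}")
        elif path.startswith(CORE_PREFIXES):
            # owning Java checks plus downstream canaries
            jobs |= {"java"} | CANARIES
        elif path.startswith(DOCS_PREFIXES):
            continue  # docs-only changes need no engine CI
        else:
            # Unrecognized path (Gradle files, workflows, ...): be conservative.
            return set(FULL_MATRIX)
    return jobs


print(sorted(select_jobs(["spark/v4.1/spark/src/Foo.java"])))
print(sorted(select_jobs(["core/src/main/java/X.java"])))
```

Falling back to the full matrix for any unrecognized path is a deliberately conservative choice in this sketch, so that global build or workflow changes never skip coverage by accident.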
> > >>>>>>
> > >>>>>>
> > >>>>>> Another possible optimization is JVM coverage. Today many PR jobs run
> > >>>>>> across both Java 17 and Java 21. We could consider running one primary
> > >>>>>> JVM for PRs, and reserve the full JVM matrix for main, release branches,
> > >>>>>> nightly/scheduled builds, or PRs labeled full-ci. That would further
> > >>>>>> reduce runner usage and PR wall time, while still preserving broad
> > >>>>>> compatibility coverage before changes become part of the main branch.
> > >>>>>>
> > >>>>>> A practical approach could be:
> > >>>>>>
> > >>>>>> PRs: incremental module/version selection, mostly one JVM, plus
> > >>>>>> targeted canaries.
> > >>>>>> main: full matrix across JVMs, Spark versions, Flink versions, and
> > >>>>>> runtime checks.
> > >>>>>> Manual override: full-ci label for risky or cross-cutting PRs.
> > >>>>>>
> > >>>>>> This should reduce queue time, lower GitHub runner consumption, and
> > >>>>>> give contributors faster feedback without giving up full coverage where
> > >>>>>> it matters most.
> > >>>>>>
> > >>>>>> I am working on a POC https://github.com/apache/iceberg/pull/16566
> > >>>>>> Suggestions are welcome.
> > >>>>>>
> > >>>>>> - Ajantha
> > >>>>>>
> > >>>>>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hi Manu,
> > >>>>>>>
> > >>>>>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>> >
> > >>>>>>> > Hi Junwang,
> > >>>>>>> >
> > >>>>>>> > Not sure about others but I usually only change status to "Ready
> > >>>>>>> > for review" when CI has passed.
> > >>>>>>>
> > >>>>>>> Yeah, I agree there are trade-offs to disabling gh actions for draft
> > >>>>>>> PRs.
> > >>>>>>>
> > >>>>>>> Reasons to Disable:
> > >>>>>>>
> > >>>>>>> - Cost savings: large teams and monorepos can burn through GitHub
> > >>>>>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
> > >>>>>>> resources on code that may not even compile yet.
> > >>>>>>> - Reduced noise: draft PRs are often used for experimentation or
> > >>>>>>> work-in-progress changes. Disabling CI avoids cluttering the PR
> > >>>>>>> timeline with transient failures while the author is still iterating.
> > >>>>>>> - Better resource utilization: orgs with limited self-hosted runners
> > >>>>>>> may prefer to prioritize "Ready for Review" PRs so production-relevant
> > >>>>>>> changes get feedback and merge capacity sooner.
> > >>>>>>>
> > >>>>>>> Reasons to Keep:
> > >>>>>>>
> > >>>>>>> - Early error detection: developers can use draft PRs as a sandbox to
> > >>>>>>> validate builds and tests before requesting review.
> > >>>>>>> - Self-correction: failed checks on a draft PR allow authors to fix
> > >>>>>>> lint or test issues before involving reviewers.
> > >>>>>>> - Higher review confidence: by the time a PR is marked "Ready for
> > >>>>>>> Review", CI has often already passed at least once, leading to a
> > >>>>>>> smoother review process.
> > >>>>>>>
> > >>>>>>> For myself, when I create a draft PR, I'm usually sharing early
> > >>>>>>> work-in-progress code with other developers and may not have tested it
> > >>>>>>> thoroughly locally yet, so I sometimes prefer to disable CI. That's
> > >>>>>>> just my personal preference though.
> > >>>>>>>
> > >>>>>>> >
> > >>>>>>> > Regards,
> > >>>>>>> > Manu
> > >>>>>>> >
> > >>>>>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>> >>
> > >>>>>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]>
> > >>>>>>> wrote:
> > >>>>>>> >> >
> > >>>>>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>> >> > >
> > >>>>>>> >> > > Kevin's PR of removing Spark 3.4 was merged a few days ago.
> > >>>>>>> >> > > It should reduce the Spark CI cost by ~25%.
> > >>>>>>> >> > >
> > >>>>>>> >> > > Some heavy-hitter test classes in Spark tests (core and
> > >>>>>>> >> > > extension) cause high load due to parameter combinations. I asked
> > >>>>>>> >> > > AI to analyze the build log and recommend changes offering the
> > >>>>>>> >> > > best ROI. Details are in this doc.
> > >>>>>>> >> > >
> > >>>>>>> >> > > I can look into dropping some combinations without
> > >>>>>>> >> > > sacrificing essential coverage. E.g., we can probably drop the
> > >>>>>>> >> > > Hadoop catalog usage in test, as it wasn't recommended for
> > >>>>>>> >> > > production use anyway.
> > >>>>>>> >> >
> > >>>>>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
> > >>>>>>> >> > usage a little bit. Perhaps we should apply the same approach across
> > >>>>>>> >> > all iceberg subprojects?
> > >>>>>>> >> >
> > >>>>>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
> > >>>>>>> >>
> > >>>>>>> >> I've created a PR to show that, see [1]. Since it's a draft, the CI
> > >>>>>>> >> won't run. If I click the `Ready for review` button, the actions will
> > >>>>>>> >> be triggered. Let me know what you think about it.
> > >>>>>>> >>
> > >>>>>>> >> [1] https://github.com/apache/iceberg/pull/16561
> > >>>>>>> >>
> > >>>>>>> >> >
> > >>>>>>> >> > >
> > >>>>>>> >> > >
> > >>>>>>> >> > >
> > >>>>>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>> >> > >>
> > >>>>>>> >> > >> Apache DataFusion similarly received this notice. For
> > >>>>>>> >> > >> visibility to the Iceberg community, we have tracking issues to try
> > >>>>>>> >> > >> to discuss solutions:
> > >>>>>>> >> > >>
> > >>>>>>> >> > >> https://github.com/apache/datafusion/issues/22455
> > >>>>>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
> > >>>>>>> >> > >>
> > >>>>>>> >> > >> DataFusion Comet is consuming the vast majority of
> > >>>>>>> >> > >> DataFusion resources, and like the Iceberg project it's due to
> > >>>>>>> >> > >> Spark tests (and Iceberg's Spark tests). We are doing some analysis
> > >>>>>>> >> > >> on what subsets might be appropriate for our workflows, features,
> > >>>>>>> >> > >> and goals, and will share anything that we think might translate
> > >>>>>>> >> > >> back to the Iceberg CI workflows.
> > >>>>>>> >> > >>
> > >>>>>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <
> > >>>>>>> [email protected]> wrote:
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> Hello, Iceberg PMC.
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
> > >>>>>>> >> > >>> across the foundation[1]. The ASF GitHub shared pool of
> > >>>>>>> >> > >>> GitHub-hosted runners has been at, or very close to, the limit of
> > >>>>>>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
> > >>>>>>> >> > >>> case again today.
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> Your project has been identified as being among the top 5
> > >>>>>>> >> > >>> consumers of build time over the past 7 days and we request that
> > >>>>>>> >> > >>> you bring your usage down by streamlining long-running builds.
> > >>>>>>> >> > >>> Contact Infra for a consultation if you are unable to streamline
> > >>>>>>> >> > >>> your builds further.
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA usage
> > >>>>>>> >> > >>> as you work on streamlining, as well as locate any bottlenecks in
> > >>>>>>> >> > >>> the workflows.
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> Infra will allow you two weeks' time (till the 8th of June, 2026)
> > >>>>>>> >> > >>> to progress this, but should you still be above the limits by then,
> > >>>>>>> >> > >>> without a viable path forward, we will be limiting your GHA usage.
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> Kind regards,
> > >>>>>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>>
> > >>>>>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
> > >>>>>>> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
> > >>>>>>> >> > >>>
> > >>>>>>> >> >
> > >>>>>>> >> >
> > >>>>>>> >> > --
> > >>>>>>> >> > Regards
> > >>>>>>> >> > Junwang Zhao
> > >>>>>>> >>
> > >>>>>>> >>
> > >>>>>>> >>
> > >>>>>>> >> --
> > >>>>>>> >> Regards
> > >>>>>>> >> Junwang Zhao
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Regards
> > >>>>>>> Junwang Zhao
> > >>>>>>>
> > >>>>>>
> >
>
