Re: Iceberg Consumption of ASF Shared GitHub-hosted Runners

Ajantha Bhat Tue, 26 May 2026 05:36:01 -0700

Hi all,

How about implementing an incremental PR builder? (Similar to
https://github.com/gitflow-incremental-builder/gitflow-incremental-builder)


I think one of the main causes of GitHub runner pressure in Iceberg is the
breadth of our CI matrix. We support multiple languages (Java, Python, Go,
Rust, C++) and integrations, and for Java we test across multiple JVM
versions, Spark versions, Flink versions, Kafka, Hive/MR, REST/OpenAPI,
runtime bundles, and more. That coverage is valuable, but running most of
it for every PR is expensive and increases both runner usage and CI
wall-clock time.

I think the biggest win would come from an incremental PR build. We already
have useful building blocks for it: the Gradle build cache, path filters,
and version-selective build properties like -DsparkVersions and
-DflinkVersions.

The idea is to keep full coverage on main, release branches, tags, and
global build changes, but make PR CI depend on the files changed:

   - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
   - spark/v4.1/** changes run only Spark 4.1, not every Spark version.
   - flink/v2.0/** changes run only Flink 2.0, not every Flink version.
   - API/Core/Data/File format changes run the owning Java checks plus
   selected downstream canaries, such as latest Spark and latest Flink,
   instead of the full engine matrix.
   - Runtime/bundle CVE checks run only for affected runtime artifacts.
   - A full-ci label or global Gradle/workflow changes can still force the
   full matrix.
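
The selection rules above could be sketched roughly as follows. This is
only an illustration, not what the POC does: the job names, the list of
"global" build files, and the list of core module directories are all
hypothetical, and unrecognized paths fall back to the full matrix as a
conservative assumption.

```python
import re

# All names below are hypothetical and for illustration only.
FULL = {"full-matrix"}
GLOBAL_PREFIXES = (".github/workflows/", "gradle/")
GLOBAL_FILES = {"build.gradle", "settings.gradle", "gradle.properties"}
CORE_PREFIXES = ("api/", "core/", "data/", "parquet/", "orc/")
# Matches version-specific engine paths such as spark/v4.1/... or flink/v2.0/...
ENGINE_VERSION = re.compile(r"^(spark|flink)/v(\d+\.\d+)/")


def select_jobs(changed_files, labels=()):
    """Return the set of CI job names to run for a PR."""
    # A full-ci label always forces the full matrix.
    if "full-ci" in labels:
        return set(FULL)
    jobs = set()
    for path in changed_files:
        # Global Gradle/workflow changes force the full matrix.
        if path in GLOBAL_FILES or path.startswith(GLOBAL_PREFIXES):
            return set(FULL)
        match = ENGINE_VERSION.match(path)
        if match:
            # Version-specific engine change: run only that engine version.
            jobs.add(f"{match.group(1)}-{match.group(2)}")
        elif path.startswith(CORE_PREFIXES):
            # Core change: owning Java checks plus downstream canaries.
            jobs.update({"java-core", "spark-latest", "flink-latest"})
        else:
            # Unrecognized path: stay conservative and run everything.
            return set(FULL)
    return jobs
```

For example, select_jobs(["spark/v4.1/src/Foo.java"]) would select only the
hypothetical "spark-4.1" job, while adding the full-ci label to the same PR
would select the full matrix.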


Another possible optimization is JVM coverage. Today many PR jobs run
across both Java 17 and Java 21. We could consider running one primary JVM
for PRs, and reserve the full JVM matrix for main, release branches,
nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
runner usage and PR wall time, while still preserving broad compatibility
coverage before changes become part of the main branch.

A practical approach could be:

   - PRs: incremental module/version selection, mostly one JVM, plus
   targeted canaries.
   - main: full matrix across JVMs, Spark versions, Flink versions, and
   runtime checks.
   - Manual override: full-ci label for risky or cross-cutting PRs.
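
As a rough sketch of the JVM part of this split: the function below picks
the JVM versions for a run from the triggering GitHub event and the PR
labels. Treating Java 17 as the "primary" JVM is an assumption made only for
illustration, and the function name is hypothetical.

```python
FULL_JVMS = [17, 21]  # the two JVM versions PR jobs run on today
PRIMARY_JVM = 17      # assumption: which JVM is "primary" is still open


def jvm_matrix(event_name, labels=()):
    """Return the JVM versions to test for a given CI trigger."""
    # Pushes to main/release branches, tags, and scheduled builds are not
    # pull_request events, so they keep the full JVM matrix.
    if event_name != "pull_request" or "full-ci" in labels:
        return list(FULL_JVMS)
    # Ordinary PRs run on the single primary JVM.
    return [PRIMARY_JVM]
```

With this, an unlabeled PR runs on one JVM, and adding the full-ci label
restores both.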

This should reduce queue time, lower GitHub runner consumption, and give
contributors faster feedback without giving up full coverage where it
matters most.

I am working on a POC: https://github.com/apache/iceberg/pull/16566
Suggestions are welcome.

- Ajantha

On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]> wrote:

> Hi Manu,
>
> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]>
> wrote:
> >
> > Hi Junwang,
> >
> > Not sure about others but I usually only change status to "Ready for
> > review" when CI has passed.
>
> Yeah, I agree there are trade-offs to disabling GitHub Actions for draft PRs.
>
> Reasons to Disable:
>
> - Cost savings: large teams and monorepos can burn through GitHub
> Actions minutes quickly. Skipping CI for draft PRs avoids spending
> resources on code that may not even compile yet.
> - Reduced noise: draft PRs are often used for experimentation or
> work-in-progress changes. Disabling CI avoids cluttering the PR
> timeline with transient failures while the author is still iterating.
> - Better resource utilization: orgs with limited self-hosted runners
> may prefer to prioritize "Ready for Review" PRs so production-relevant
> changes get feedback and merge capacity sooner.
>
> Reasons to Keep:
>
> - Early error detection: developers can use draft PRs as a sandbox to
> validate builds and tests before requesting review.
> - Self-correction: failed checks on a draft PR allow authors to fix
> lint or test issues before involving reviewers.
> - Higher review confidence: by the time a PR is marked "Ready for
> Review", CI has often already passed at least once, leading to a
> smoother review process.
>
> For myself, when I create a draft PR, I'm usually sharing early
> work-in-progress code with other developers and may not have tested it
> thoroughly locally yet, so I sometimes prefer to disable CI. That's
> just my personal preference though.
>
> >
> > Regards,
> > Manu
> >
> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]> wrote:
> >>
> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]> wrote:
> >> >
> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]> wrote:
> >> > >
> >> > > Kevin's PR removing Spark 3.4 was merged a few days ago. It
> >> > > should reduce the Spark CI cost by ~25%.
> >> > >
> >> > > Some heavy-hitter test classes in the Spark tests (core and
> >> > > extension) cause high load due to parameter combinations. I asked
> >> > > AI to analyze the build log and recommend changes offering the
> >> > > best ROI. Details are in this doc.
> >> > >
> >> > > I can look into dropping some combinations without sacrificing
> >> > > essential coverage. E.g., we can probably drop the Hadoop catalog
> >> > > usage in tests, as it wasn't recommended for production use anyway.
> >> >
> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
> >> > usage a little bit. Perhaps we should apply the same approach across
> >> > all iceberg subprojects?
> >> >
> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
> >>
> >> I've created a PR to demonstrate this, see [1]. Since it's a draft,
> >> CI won't run. If I click the `Ready for review` button, the actions
> >> will be triggered. Let me know what you think about it.
> >>
> >> [1] https://github.com/apache/iceberg/pull/16561
> >>
> >> >
> >> > >
> >> > >
> >> > >
> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <[email protected]> wrote:
> >> > >>
> >> > >> Apache DataFusion similarly received this notice. For visibility
> >> > >> to the Iceberg community, we have tracking issues to try to
> >> > >> discuss solutions:
> >> > >>
> >> > >> https://github.com/apache/datafusion/issues/22455
> >> > >> https://github.com/apache/datafusion-comet/issues/4406
> >> > >>
> >> > >> DataFusion Comet is consuming the vast majority of DataFusion
> >> > >> resources, and like the Iceberg project it's due to Spark tests
> >> > >> (and Iceberg's Spark tests). We are doing some analysis on what
> >> > >> subsets might be appropriate for our workflows, features, and
> >> > >> goals, and will share anything that we think might translate back
> >> > >> to the Iceberg CI workflows.
> >> > >>
> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <[email protected]> wrote:
> >> > >>>
> >> > >>> Hello, Iceberg PMC.
> >> > >>>
> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
> >> > >>> across the foundation[1]. The ASF GitHub shared pool of
> >> > >>> GitHub-hosted runners has been at, or very close to, the limit of
> >> > >>> 900 jobs most of the time in the past few weeks and this is the
> >> > >>> case again today.
> >> > >>>
> >> > >>> Your project has been identified as being among the top 5
> >> > >>> consumers of build time over the past 7 days and we request that
> >> > >>> you bring your usage down by streamlining long-running builds.
> >> > >>> Contact Infra for a consultation if you are unable to streamline
> >> > >>> your builds further.
> >> > >>>
> >> > >>> You can use the infra reporting tool[2] to monitor your GHA usage
> >> > >>> as you work on streamlining, as well as locate any bottlenecks in
> >> > >>> the workflows.
> >> > >>>
> >> > >>> Infra will allow you two weeks' time (till the 8th of June, 2026)
> >> > >>> to progress this, but should you still be above the limits by
> >> > >>> then, without a viable path forward, we will be limiting your GHA
> >> > >>> usage.
> >> > >>>
> >> > >>> Kind regards,
> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
> >> > >>>
> >> > >>>
> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
> >> > >>>
> >> >
> >> >
> >> > --
> >> > Regards
> >> > Junwang Zhao
> >>
> >>
> >>
> >> --
> >> Regards
> >> Junwang Zhao
>
>
>
> --
> Regards
> Junwang Zhao
>
