Re: Iceberg Consumption of ASF Shared GitHub-hosted Runners

Steve Loughran Tue, 26 May 2026 12:17:46 -0700

Doing a scan of the aws-sdk bundle.jar is halfway to an audit of the
maven repo, with spark the other half.


It seems to me that only PRs which go near gradle/libs.versions.toml are
going to change dependencies, and so introduce new CVEs.
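
As a rough sketch of that check (not what any of the PRs actually do), a
CVE scan could be gated on whether the changed files touch dependency
declarations. Only gradle/libs.versions.toml comes from this thread; the
`*.gradle` pattern and the function name are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Only gradle/libs.versions.toml is named in the thread.
# The "*.gradle" pattern is an assumption about where else
# dependencies might be declared.
DEPENDENCY_PATTERNS = (
    "gradle/libs.versions.toml",
    "*.gradle",
)


def touches_dependencies(changed_files, patterns=DEPENDENCY_PATTERNS):
    """Return True if any changed path matches a dependency pattern.

    fnmatchcase's "*" also matches "/" so "*.gradle" covers nested
    build files such as spark/v4.1/build.gradle.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

A PR touching only `site/` would skip the scan, while one editing the
version catalog would trigger it.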

There's the separate issue that "CVEs are eternal": all existing dependencies
are collections of undiscovered/unreported CVEs. That's dependabot's
homework, generally.


On Tue, 26 May 2026 at 19:49, Kevin Liu <[email protected]> wrote:

> Thanks everyone for the great ideas.
>
> Here's where we stand today with respect to ASF runner usage (taken from
> the link [2] above):
> GitHub Actions Build Time Used
> - past 7 days total usage: 218,321 minutes
> - past 5 days total usage: 120,241 minutes
>
> *This puts us below the hard ceiling for resource usage* as described by
> https://infra.apache.org/github-actions-policy.html
>
> > The average number of minutes a project uses *per calendar week MUST
> > NOT exceed the equivalent of 25 full-time runners (250,000 minutes, or
> > 4,200 hours)*.
> > The average number of minutes a project uses *in any consecutive
> > five-day period MUST NOT exceed the equivalent of 30 full-time runners
> > (216,000 minutes, or 3,600 hours)*.
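
The arithmetic behind those limits, and how the reported usage compares,
can be checked with a short sketch. The limits and usage figures are the
ones quoted above; the helper name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def runner_minutes(runners, days):
    """Minutes used by `runners` machines busy around the clock for `days` days."""
    return runners * days * MINUTES_PER_DAY


# 25 runners for 7 days is exactly 252,000 minutes (4,200 hours);
# the policy text rounds this to 250,000.
weekly_exact = runner_minutes(25, 7)
# 30 runners for 5 days is exactly 216,000 minutes (3,600 hours).
five_day_exact = runner_minutes(30, 5)

WEEKLY_LIMIT = 250_000    # as stated in the policy
FIVE_DAY_LIMIT = 216_000  # as stated in the policy

usage_7d = 218_321  # reported 7-day usage
usage_5d = 120_241  # reported 5-day usage

weekly_share = usage_7d / WEEKLY_LIMIT      # about 87% of the weekly limit
five_day_share = usage_5d / FIVE_DAY_LIMIT  # about 56% of the 5-day limit
```

Both figures are under their limits, but the weekly one leaves little
headroom.
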
>
> We should still make improvements wherever possible.
>
> I have a few PRs to reduce CI usage further.
> - CI: Limit CVE scan runs to relevant changes #16513
> - Build: Simplify CI workflow path filters to avoid per-workflow
> maintenance #16302
>
> There are a few heuristics we can use:
> 1. Don't run CI if not needed. For example, `site/` dir changes shouldn't
> trigger Spark/Flink/Java CI. This might be optimized already, but we should
> double check just in case.
> 2. If we must run CI, fail fast. For example, if there is a formatter
> issue, fail all inflight CI tasks.
> 3. Within a specific CI workflow, reduce the matrix wherever possible. Do
> we really need to run all "Java versions" x "Scala versions" x "Spark
> versions"?
> 4. Improve individual CI tasks. Spark CI accounts for nearly 58% of all
> resource usage. I have a tracking issue where I benchmarked where all that
> time is spent. See https://github.com/apache/iceberg/issues/16397
>
> Top CI tasks as % of resource use:
> - Spark CI: 57.68%
> - Flink CI: 13.60%
> - Java CI: 7.02%
> - CVE Scan: 3.13%
>
> Best,
> Kevin Liu
>
> On Tue, May 26, 2026 at 5:35 AM Ajantha Bhat <[email protected]>
> wrote:
>
>> Hi all,
>>
>> How about implementing an incremental PR builder? (similar to
>> https://github.com/gitflow-incremental-builder/gitflow-incremental-builder
>> )
>>
>> I think one of the main causes of GitHub runner pressure in Iceberg is
>> the breadth of our CI matrix. We support multiple languages (Java, Python,
>> Go, Rust, C++) and integrations, and for Java we test across multiple JVM
>> versions, Spark versions, Flink versions, Kafka, Hive/MR, REST/OpenAPI,
>> runtime bundles, and more. That coverage is valuable, but running most of
>> it for every PR is expensive and increases both runner usage and CI wall
>> time.
>>
>> I think the biggest win can be achieved by having an incremental PR build.
>> We already have useful building blocks for it: Gradle build cache, path
>> filters, and version-selective build properties like -DsparkVersions and
>> -DflinkVersions.
>>
>> The idea is to keep full coverage on main, release branches, tags, and
>> global build changes, but make PR CI depend on the files changed:
>>
>>    - Spark-only changes run Spark CI, not Flink/Hive/Kafka.
>>    - spark/v4.1/** changes run only Spark 4.1, not every Spark version.
>>    - flink/v2.0/** changes run only Flink 2.0, not every Flink version.
>>    - API/Core/Data/File format changes run the owning Java checks plus
>>    selected downstream canaries, such as latest Spark and latest Flink,
>>    instead of the full engine matrix.
>>    - Runtime/bundle CVE checks run only for affected runtime artifacts.
>>    - A full-ci label or global Gradle/workflow changes can still force
>>    the full matrix.
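
To make the selection rules above concrete, here is a minimal sketch of
mapping changed paths to CI jobs. It is not the POC's implementation:
the job names, the list of "core" module directories, and the set of
"global" build files are all assumptions. Only the spark/vX.Y and
flink/vX.Y layout and the full-ci label come from the proposal.

```python
import re

FULL_CI_LABEL = "full-ci"

# Assumed locations of global build and workflow files.
GLOBAL_PREFIXES = ("gradle/", ".github/workflows/")
GLOBAL_FILES = ("build.gradle", "settings.gradle")
# Assumed directories for the API/Core/Data/file-format modules.
CORE_PREFIXES = ("api/", "core/", "data/", "parquet/", "orc/")
# Matches versioned engine directories such as spark/v4.1/ or flink/v2.0/.
ENGINE_PATH = re.compile(r"^(spark|flink)/v(\d+\.\d+)/")


def select_jobs(changed_files, labels=()):
    """Return the set of (hypothetical) CI job names for a PR."""
    if FULL_CI_LABEL in labels:
        return {"full-matrix"}
    jobs = set()
    for path in changed_files:
        # Global build or workflow changes force the full matrix.
        if path.startswith(GLOBAL_PREFIXES) or path in GLOBAL_FILES:
            return {"full-matrix"}
        match = ENGINE_PATH.match(path)
        if match:
            # Versioned engine change: run only that engine version.
            jobs.add(f"{match.group(1)}-{match.group(2)}")
        elif path.startswith(CORE_PREFIXES):
            # Core change: owning Java checks plus downstream canaries.
            jobs.update({"java-ci", "spark-latest", "flink-latest"})
    return jobs
```

Paths that match no rule (for example `site/`) select no jobs in this
sketch. A real implementation would need to decide whether unknown paths
should default to the full matrix instead.
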
>>
>>
>> Another possible optimization is JVM coverage. Today many PR jobs run
>> across both Java 17 and Java 21. We could consider running one primary JVM
>> for PRs, and reserve the full JVM matrix for main, release branches,
>> nightly/scheduled builds, or PRs labeled full-ci. That would further reduce
>> runner usage and PR wall time, while still preserving broad compatibility
>> coverage before changes become part of the main branch.
>>
>> A practical approach could be:
>>
>> PRs: incremental module/version selection, mostly one JVM, plus targeted
>> canaries.
>> main: full matrix across JVMs, Spark versions, Flink versions, and
>> runtime checks.
>> Manual override: full-ci label for risky or cross-cutting PRs.
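
The JVM part of that split could look roughly like the sketch below.
Java 17 and 21 are the versions mentioned above; treating 17 as the
primary JVM is an assumption, and the event names follow GitHub Actions
conventions (`pull_request`, `push`, `schedule`).

```python
ALL_JVMS = (17, 21)  # versions mentioned in the proposal
PRIMARY_JVM = 17     # assumption: which JVM is "primary" is not decided


def jvm_matrix(event, labels=()):
    """Pick JVM versions to test for a given trigger.

    PRs get a single JVM unless labeled full-ci; everything else
    (pushes to main, release branches, scheduled runs) gets the full set.
    """
    if event == "pull_request" and "full-ci" not in labels:
        return (PRIMARY_JVM,)
    return ALL_JVMS
```

With two JVMs in the matrix, this roughly halves the JVM-dependent PR
jobs while keeping both versions covered on main.
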
>>
>> This should reduce queue time, lower GitHub runner consumption, and give
>> contributors faster feedback without giving up full coverage where it
>> matters most.
>>
>> I am working on a POC https://github.com/apache/iceberg/pull/16566
>> Suggestions are welcome.
>>
>> - Ajantha
>>
>> On Mon, May 25, 2026 at 7:35 PM Junwang Zhao <[email protected]> wrote:
>>
>>> Hi Manu,
>>>
>>> On Mon, May 25, 2026 at 9:33 PM Manu Zhang <[email protected]> wrote:
>>> >
>>> > Hi Junwang,
>>> >
>>> > Not sure about others, but I usually only change the status to
>>> > "Ready for review" when CI has passed.
>>>
>>> Yeah, I agree there are trade-offs to disabling gh actions for draft PRs.
>>>
>>> Reasons to Disable:
>>>
>>> - Cost savings: large teams and monorepos can burn through GitHub
>>> Actions minutes quickly. Skipping CI for draft PRs avoids spending
>>> resources on code that may not even compile yet.
>>> - Reduced noise: draft PRs are often used for experimentation or
>>> work-in-progress changes. Disabling CI avoids cluttering the PR
>>> timeline with transient failures while the author is still iterating.
>>> - Better resource utilization: orgs with limited self-hosted runners
>>> may prefer to prioritize "Ready for Review" PRs so production-relevant
>>> changes get feedback and merge capacity sooner.
>>>
>>> Reasons to Keep:
>>>
>>> - Early error detection: developers can use draft PRs as a sandbox to
>>> validate builds and tests before requesting review.
>>> - Self-correction: failed checks on a draft PR allow authors to fix
>>> lint or test issues before involving reviewers.
>>> - Higher review confidence: by the time a PR is marked "Ready for
>>> Review", CI has often already passed at least once, leading to a
>>> smoother review process.
>>>
>>> For myself, when I create a draft PR, I'm usually sharing early
>>> work-in-progress code with other developers and may not have tested it
>>> thoroughly locally yet, so I sometimes prefer to disable CI. That's
>>> just my personal preference though.
>>>
>>> >
>>> > Regards,
>>> > Manu
>>> >
>>> > On Mon, May 25, 2026 at 3:21 PM Junwang Zhao <[email protected]> wrote:
>>> >>
>>> >> On Mon, May 25, 2026 at 11:20 AM Junwang Zhao <[email protected]> wrote:
>>> >> >
>>> >> > On Sun, May 24, 2026 at 12:13 PM Steven Wu <[email protected]> wrote:
>>> >> > >
>>> >> > > Kevin's PR removing Spark 3.4 was merged a few days ago. It
>>> >> > > should reduce the Spark CI cost by ~25%.
>>> >> > >
>>> >> > > Some heavy-hitter test classes in the Spark tests (core and
>>> >> > > extension) cause high load due to parameter combinations. I asked
>>> >> > > AI to analyze the build log and recommend changes offering the best
>>> >> > > ROI. Details are in this doc.
>>> >> > >
>>> >> > > I can look into dropping some combinations without sacrificing
>>> >> > > essential coverage. E.g., we can probably drop the Hadoop catalog
>>> >> > > usage in tests, as it wasn't recommended for production use anyway.
>>> >> >
>>> >> > iceberg-cpp skips Actions for draft PRs [1] to reduce CI resource
>>> >> > usage a little bit. Perhaps we should apply the same approach across
>>> >> > all iceberg subprojects?
>>> >> >
>>> >> > [1] https://github.com/apache/iceberg-cpp/pull/680
>>> >>
>>> >> I've created a PR to demonstrate this, see [1]. Since it's a draft, the
>>> >> CI won't run. If I click the `Ready for review` button, the actions
>>> >> will be triggered. Let me know what you think about it.
>>> >>
>>> >> [1] https://github.com/apache/iceberg/pull/16561
>>> >>
>>> >> >
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > > On Fri, May 22, 2026 at 8:22 AM Matt Butrovich <[email protected]> wrote:
>>> >> > >>
>>> >> > >> Apache DataFusion similarly received this notice. For visibility
>>> >> > >> to the Iceberg community, we have tracking issues to try to discuss
>>> >> > >> solutions:
>>> >> > >>
>>> >> > >> https://github.com/apache/datafusion/issues/22455
>>> >> > >> https://github.com/apache/datafusion-comet/issues/4406
>>> >> > >>
>>> >> > >> DataFusion Comet is consuming the vast majority of DataFusion
>>> >> > >> resources, and like the Iceberg project it's due to Spark tests (and
>>> >> > >> Iceberg's Spark tests). We are doing some analysis on what subsets
>>> >> > >> might be appropriate for our workflows, features, and goals, and
>>> >> > >> will share anything that we think might translate back to the
>>> >> > >> Iceberg CI workflows.
>>> >> > >>
>>> >> > >> On Fri, May 22, 2026 at 7:43 AM Robert Thomson <[email protected]> wrote:
>>> >> > >>>
>>> >> > >>> Hello, Iceberg PMC.
>>> >> > >>>
>>> >> > >>> In 2024, the ASF introduced the policy for GitHub Actions usage
>>> >> > >>> across the foundation[1]. The ASF GitHub shared pool of
>>> >> > >>> GitHub-hosted runners has been at, or very close to, the limit of
>>> >> > >>> 900 jobs most of the time in the past few weeks and this is the
>>> >> > >>> case again today.
>>> >> > >>>
>>> >> > >>> Your project has been identified as being among the top 5
>>> >> > >>> consumers of build time over the past 7 days and we request that
>>> >> > >>> you bring your usage down by streamlining long-running builds.
>>> >> > >>> Contact Infra for a consultation if you are unable to streamline
>>> >> > >>> your builds further.
>>> >> > >>>
>>> >> > >>> You can use the infra reporting tool[2] to monitor your GHA usage
>>> >> > >>> as you work on streamlining, as well as locate any bottlenecks in
>>> >> > >>> the workflows.
>>> >> > >>>
>>> >> > >>> Infra will allow you two weeks' time (till the 8th of June, 2026)
>>> >> > >>> to progress this, but should you still be above the limits by then,
>>> >> > >>> without a viable path forward, we will be limiting your GHA usage.
>>> >> > >>>
>>> >> > >>> Kind regards,
>>> >> > >>> Bob Thomson, on behalf of ASF Infrastructure.
>>> >> > >>>
>>> >> > >>>
>>> >> > >>> [1] https://infra.apache.org/github-actions-policy.html
>>> >> > >>> [2] https://infra-reports.apache.org/#ghactions&project=iceberg&hours=24&limit=15&group=name
>>> >> > >>>
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Regards
>>> >> > Junwang Zhao
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Regards
>>> >> Junwang Zhao
>>>
>>>
>>>
>>> --
>>> Regards
>>> Junwang Zhao
>>>
>>
