> selecting which jobs to process

Do you have a patch to implement this? IIRC it requires interacting with
an outside service, or at least adding an ok-to-test label.

Besides, it increases the workload of committers and PMC members - be aware
of it, or most contributions will stall.

Best,
tison.


On Thu, Sep 8, 2022 at 00:47, Lari Hotari <lhot...@apache.org> wrote:

> The problem with CI is becoming worse. The build queue is 235 jobs now and
> the queue time is over 7 hours.
>
> We will need to start shedding load in the build queue and get some fixes
> in.
> https://issues.apache.org/jira/browse/INFRA-23633 continues to contain
> details about some activities. I have created 2 GitHub Support tickets, but
> usually it takes up to a week to get a response.
>
> I have some assumptions about the issue, but they are just assumptions.
> One oddity is that when the "re-run failed jobs" option is used in a large
> workflow, the execution times of previously successful jobs are counted
> again as if those jobs had been re-run.
> Here's an example:
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> The reported usage is about 3x the actual usage.
> The assumption that I have is that the "fairness algorithm" that GitHub
> uses to provide all Apache projects about the same amount of GitHub Actions
> resources would take this flawed usage as the basis of its decisions.
> The reason why we are getting hit by this now is that there is a high
> number of flaky test failures that cause almost every build to fail and we
> are re-running a lot of builds.
>
> Another problem there is that the GitHub Actions search doesn't always
> show all workflow runs that are running. This has happened before when the
> GitHub Actions workflow search index was corrupted. GitHub Support resolved
> that by rebuilding the search index with some manual admin operation behind
> the scenes.
>
> I'm proposing that we start shedding load from CI by cancelling build jobs
> and selecting which jobs to process so that we get the CI issue resolved.
> We might also have to disable required checks so that we have some way to
> get changes merged while CI doesn't work properly.
>
> I'm expecting lazy consensus on fixing CI unless someone proposes a better
> plan. Let's keep everyone informed in this mailing list thread.
>
> -Lari
>
>
> On 2022/09/06 14:41:07 Dave Fisher wrote:
> > We are going to need to take actions to fix our problems. See
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> >
> > Jarek has done a large amount of GitHub Action work with Apache Airflow
> and his suggestions might be helpful. One of his suggestions was Apache
> Yetus. I think he means using the Maven plugins -
> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> >
> >
> > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
> > >
> > > The Apache Infra ticket is
> https://issues.apache.org/jira/browse/INFRA-23633 .
> > >
> > > -Lari
> > >
> > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > >> I asked for an update on the Apache org GitHub Actions usage stats
> from Gavin McDonald on the-asf slack in this thread:
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> .
> > >>
> > >> I hope we get this issue resolved since it delays PR processing a lot.
> > >>
> > >> -Lari
> > >>
> > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > >>> Pulsar CI continues to be congested, and the build queue [1] is very
> long at the moment. There are 147 build jobs in the queue and 16 jobs in
> progress at the moment.
> > >>>
> > >>> I would strongly advise everyone to use "personal CI" to mitigate
> the issue of the long delay of CI feedback. You can simply open a PR to
> your own personal fork of apache/pulsar to run the builds in your "personal
> CI". There are more details in the previous emails in this thread.
> > >>>
> > >>> -Lari
> > >>>
> > >>> [1] - build queue:
> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > >>>
> > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > >>>>
> > >>>> I would strongly advise everyone to use "personal CI" to mitigate
> the issue of the long delay of CI feedback. You can simply open a PR to
> your own personal fork of apache/pulsar to run the builds in your "personal
> CI". There are more details in the previous email in this thread.
> > >>>>
> > >>>> Some updates:
> > >>>>
> > >>>> There has been a discussion with Gavin McDonald from ASF infra on
> the-asf slack about getting usage reports from GitHub to support the
> investigation. Slack thread is the same one mentioned in the previous
> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> Gavin already requested the usage report in GitHub UI, but it produced
> invalid results.
> > >>>>
> > >>>> I made a change to mitigate a source of additional GitHub Actions
> overhead.
> > >>>> In the past, each cherry-picked commit to a maintenance branch of
> Pulsar has triggered a lot of workflow runs.
> > >>>>
> > >>>> The solution for cancelling duplicate builds automatically is to
> add this definition to the workflow definition:
> > >>>> concurrency:
> > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > >>>>  cancel-in-progress: true
> > >>>>
> > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > >>>>
> > >>>> branch-2.10 change:
> > >>>>
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > >>>> branch-2.9 change:
> > >>>>
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > >>>> branch-2.8 change:
> > >>>>
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > >>>> branch-2.7:
> > >>>>
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > >>>>
> > >>>> branch-2.11 already contains the necessary config for cancelling
> duplicate builds.
> > >>>>
> > >>>> The benefit of the above change is that when multiple commits are
> cherry-picked to a branch at once, only the build of the last commit will
> get run eventually. The builds for the intermediate commits will get
> cancelled. Obviously there's a tradeoff here that we don't get the
> information if one of the earlier commits breaks the build. It's the cost
> that we need to pay. Nevertheless our build is so flaky that it's hard to
> determine whether a failed build result is only caused by a flaky test or
> whether it's an actual failure. Because of this we don't lose anything by
> cancelling builds. It's more important to save build resources. In the
> maintenance branches for 2.10 and older, the average total build time
> consumed is around 20 hours which is a lot.
> > >>>>
> > >>>> At this time, the overhead of maintenance branch builds doesn't
> seem to be the source of the problems. There must be some other issue which
> is possibly related to exceeding a usage quota. Hopefully we get the CI
> slowness issue solved asap.
> > >>>>
> > >>>> BR,
> > >>>>
> > >>>> Lari
> > >>>>
> > >>>>
> > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> GitHub Actions builds have been piling up in the build queue in
> the last few days.
> > >>>>> I posted on bui...@apache.org
> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> about this issue.
> > >>>>> There's also a thread on the-asf slack,
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > >>>>>
> > >>>>> It seems that our build queue is finally getting picked up, but it
> would be great to see if we hit quota and whether that is the cause of
> pauses.
> > >>>>>
> > >>>>> Another issue is that the master branch broke after merging 2
> conflicting PRs.
> > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > >>>>>
> > >>>>> Merging PRs will be slow until we have these 2 problems solved and
> existing PRs rebased over the changes. Let's prioritize merging #17300
> before pushing more changes.
> > >>>>>
> > >>>>> I'd like to point out that a good way to get build feedback before
> sending a PR is to run builds on your personal GitHub Actions CI. The
> benefit of this is that it doesn't consume the shared quota and builds
> usually start instantly.
> > >>>>> There are instructions in the contributors guide about this.
> > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> builds on your personal GitHub Actions CI.
> > >>>>>
> > >>>>> BR,
> > >>>>>
> > >>>>> Lari
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>
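
For reference, the concurrency setting quoted earlier in this thread sits at the top level of a workflow file. A minimal sketch follows, assuming a hypothetical workflow named CI that builds branch-2.10 on push. The file name, workflow name, job, and Maven command are placeholders for illustration and are not taken from the Pulsar repository; only the three concurrency lines come from the thread.

```yaml
# Hypothetical file: .github/workflows/ci.yaml (name chosen for illustration)
name: CI

on:
  push:
    branches:
      - branch-2.10

# From the thread: one concurrency group per workflow and ref, so pushes
# to the same branch land in the same group.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  # A newly triggered run cancels a run of the same group that is still
  # in progress.
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Placeholder build command (assumption), not the actual Pulsar build.
      - run: mvn -B verify
```

With a definition like this, when several commits are cherry-picked to the same branch in quick succession, the runs for the earlier commits are cancelled once a newer run in the same group starts, so only the latest commit is built to completion.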
