Fantastic, thank you Lari and Nicolò!

- Michael

On Thu, Sep 8, 2022 at 9:03 PM Haiting Jiang <jianghait...@gmail.com> wrote:
>
> Great work. Thank you, Lari and Nicolò.
>
> BR,
> Haiting
>
> On Fri, Sep 9, 2022 at 9:36 AM tison <wander4...@gmail.com> wrote:
> >
> > Thank you, Lari and Nicolò!
> > Best,
> > tison.
> >
> >
> > Nicolò Boschi <boschi1...@gmail.com> 于2022年9月9日周五 02:41写道:
> >
> > > Dear community,
> > >
> > > The plan has been executed.
> > > The summary of our actions is:
> > > 1. We cancelled all pending jobs (queue and in-progress)
> > > 2. We removed the required checks to be able to merge improvements on the
> > > CI workflow
> > > 3. We merged a couple of improvements:
> > >    1. worked around a possible bug triggered by job retries. Broker
> > > flaky tests now run in a dedicated workflow
> > >    2. moved known flaky tests to the flaky suite
> > >    3. optimized the runner consumption for docs-only and cpp-only pulls
> > > 4. We reactivated the required checks.
> > >
> > >
> > > Now we can get back to normal.
> > > 1. You must rebase your branch onto the latest master (there's a button
> > > for this in the UI), or alternatively close/reopen the pull to trigger
> > > the checks
> > > 2. You can merge a pull request if you want
> > > 3. You will find a new job in the Checks section called "Pulsar CI /
> > > Pulsar CI checks completed", which indicates that Pulsar CI passed
> > > successfully
> > >
> > > There's a slight chance that the CI will get stuck again in the next few
> > > days, but we will keep it monitored.
> > >
> > > Thanks Lari for the nice work!
> > >
> > > Regards,
> > > Nicolò Boschi
> > >
> > >
> > > Il giorno gio 8 set 2022 alle ore 10:55 Lari Hotari <lhot...@apache.org>
> > > ha
> > > scritto:
> > >
> > > > Thank you, Nicolò.
> > > > There's lazy consensus, let's go forward with the action plan.
> > > >
> > > > -Lari
> > > >
> > > > On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > > > > This is the pull for step 2.
> > > https://github.com/apache/pulsar/pull/17539
> > > > >
> > > > > This is the script I'm going to use to cancel pending workflows.
> > > > >
> > > >
> > > https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> > > > >
> > > > > I'm going to run the script in a few minutes.
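The approach of such a cancellation script can be sketched against the GitHub REST API (a minimal illustration only, not Nicolò's actual script; the hard-coded repo name, token handling, and lack of pagination are simplifications):

```python
import json
import urllib.request

API = "https://api.github.com"

def runs_to_cancel(runs):
    """Pick the workflow runs that are still pending (queued or in progress)."""
    return [r for r in runs if r.get("status") in ("queued", "in_progress")]

def cancel_pending(repo, token):
    """List recent workflow runs and request cancellation of each pending one."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    req = urllib.request.Request(
        f"{API}/repos/{repo}/actions/runs?per_page=100", headers=headers
    )
    with urllib.request.urlopen(req) as resp:
        runs = json.load(resp)["workflow_runs"]
    for run in runs_to_cancel(runs):
        cancel = urllib.request.Request(
            f"{API}/repos/{repo}/actions/runs/{run['id']}/cancel",
            headers=headers,
            method="POST",
        )
        urllib.request.urlopen(cancel)

# Usage (hypothetical): cancel_pending("apache/pulsar", token_from_environment)
```

The GET list endpoint and the POST cancel endpoint are the standard GitHub Actions REST routes; a real script would also follow pagination and handle rate limits.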
> > > > >
> > > > > I announced on Slack what is happening:
> > > > >
> > > >
> > > https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> > > > >
> > > > > >we’re going to execute the plan described in the ML. So any queued
> > > > actions
> > > > > will be cancelled. In order to validate your pull it is suggested to
> > > run
> > > > > the actions in your own Pulsar fork. Please don’t re-run failed jobs 
> > > > > or
> > > > > push any other commits to avoid triggering new actions
> > > > >
> > > > >
> > > > > Nicolò Boschi
> > > > >
> > > > >
> > > > > Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <
> > > > boschi1...@gmail.com>
> > > > > ha scritto:
> > > > >
> > > > > > Thanks Lari for the detailed explanation. This is kind of an
> > > emergency
> > > > > > situation and I believe your plan is the way to go now.
> > > > > >
> > > > > > I already prepared a pull for moving the flaky suite out of the
> > > Pulsar
> > > > CI
> > > > > > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > > > > I can take care of the execution of the plan.
> > > > > >
> > > > > > > 1. Cancel all existing builds in_progress or queued
> > > > > >
> > > > > > I have a script locally that uses the GitHub Actions API to check
> > > > > > and cancel the pending runs. We can extend it to all the queued
> > > > > > builds (I will share it soon).
> > > > > >
> > > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > > merging
> > > > > > PRs.
> > > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > >
> > > > > > After the pull is out, we'll need to cancel all other workflows that
> > > > > > contributors may inadvertently have triggered.
> > > > > >
> > > > > > > 4. Disable all workflows
> > > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > > >    - Make GHA workflow improvements such as
> > > > > > https://github.com/apache/pulsar/pull/17491 and
> > > > > > https://github.com/apache/pulsar/pull/17490
> > > > > > >    - Quarantine all very flaky tests so that everyone doesn't 
> > > > > > > waste
> > > > time
> > > > > > with those. It should be possible to merge a PR even when a
> > > quarantined
> > > > > > test fails.
> > > > > >
> > > > > > in this step we will merge this
> > > > > > https://github.com/nicoloboschi/pulsar/pull/8
> > > > > >
> > > > > > I want to add to the list this improvement to reduce runners usage 
> > > > > > in
> > > > case
> > > > > > of doc or cpp changes.
> > > > > > https://github.com/nicoloboschi/pulsar/pull/7
> > > > > >
> > > > > >
> > > > > > > 6. Rebase PRs (or close and re-open) that would be processed next
> > > so
> > > > > > that changes are picked up
> > > > > >
> > > > > > It's better to leave this task to the author of the pull, in order
> > > > > > not to create too much load at the same time
> > > > > >
> > > > > > > 7. Enable workflows
> > > > > > > 8. Start processing PRs with checks to see if things are handled
> > > in a
> > > > > > better way.
> > > > > > > 9. When things are stable, enable required checks again in
> > > > .asf.yaml, in
> > > > > > the meantime be careful about merging PRs
> > > > > > > 10. Fix quarantined flaky tests
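Step 2 of the plan (dropping the "required checks" requirement) would amount to editing the branch protection section of `.asf.yaml`. A rough sketch of the relevant fragment, with an illustrative check name (the exact contexts configured for apache/pulsar may differ):

```yaml
# .asf.yaml fragment (sketch) -- branch protection managed by ASF infra.
# Removing entries from "contexts" (or the whole required_status_checks
# block) lifts the required-checks gate so PRs can be merged while CI
# is being repaired.
github:
  protected_branches:
    master:
      required_status_checks:
        # Illustrative context name, not necessarily Pulsar's actual one.
        contexts:
          - Pulsar CI checks completed
```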
> > > > > >
> > > > > >
> > > > > > Nicolò Boschi
> > > > > >
> > > > > >
> > > > > > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <
> > > > lhot...@apache.org>
> > > > > > ha scritto:
> > > > > >
> > > > > >> If my assumption about the GitHub usage metrics bug in the GitHub
> > > > > >> Actions build job queue fairness algorithm is correct, what would
> > > > > >> help is running the flaky unit test group outside of the Pulsar CI
> > > > > >> workflow. In that case, the impact of the flawed usage metrics
> > > > > >> would be limited.
> > > > > >>
> > > > > >> The example of
> > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > shows
> > > > > >> this flaw as explained in the previous email. The total reported
> > > > execution
> > > > > >> time in that report is 1d 1h 40m 21s of usage and the actual usage
> > > is
> > > > about
> > > > > >> 1/3 of this.
> > > > > >>
> > > > > >> When we move the most commonly failing job out of Pulsar CI
> > > workflow,
> > > > the
> > > > > >> impact of the possible usage metrics bug would be much less. I hope
> > > > GitHub
> > > > > >> support responds to my issue and queries about this bug. It might
> > > > > >> take up to 7 days to get a reply, and more time for technical
> > > > > >> questions. In the meantime we need a solution to get past this CI
> > > > > >> slowness issue.
> > > > > >>
> > > > > >> -Lari
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > > > >> > My current assumption about the CI slowness problem is that the
> > > > > >> usage metrics for Apache Pulsar builds are calculated incorrectly
> > > > > >> on GitHub's side, and that this results in apache/pulsar builds
> > > > > >> getting throttled. This assumption might be wrong, but it's the
> > > > > >> best guess at the moment.
> > > > > >> >
> > > > > >> > The fact that supports this assumption is that when re-running
> > > > > >> failed jobs in a workflow, the execution times for previously
> > > > > >> successful jobs get counted as if they had all run:
> > > > > >> > Here's an example:
> > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > >> > The reported total usage is about 3x the actual usage.
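As a toy model of the suspected bug (an assumption, not confirmed GitHub behavior): if each re-run attempt re-counts every job of the workflow, the reported usage scales with the number of attempts, while the actual extra time is only the jobs that really ran again. Job names and minutes below are made up:

```python
# Toy model of the suspected usage-metrics bug; all figures are illustrative.
JOB_MINUTES = {"unit-tests": 120, "integration": 240, "flaky-suite": 150}

def actual_minutes(rerun_jobs_per_attempt):
    """Every job runs once, plus only the jobs actually re-run afterwards."""
    rerun = sum(JOB_MINUTES[j] for jobs in rerun_jobs_per_attempt for j in jobs)
    return sum(JOB_MINUTES.values()) + rerun

def reported_minutes(attempts):
    """If each attempt re-counts every job, the report grows with attempts."""
    return attempts * sum(JOB_MINUTES.values())

# Two re-runs of just the flaky suite (three attempts total): the report
# counts the whole workflow three times, even though only the flaky suite
# actually executed again.
```

Under these made-up numbers the reported figure is roughly double the real one; with more attempts or cheaper re-run jobs the inflation approaches the attempt count, consistent with a ~3x discrepancy.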
> > > > > >> >
> > > > > >> > My assumption is that the "fairness algorithm" that GitHub uses
> > > > > >> to provide all Apache projects about the same amount of GitHub
> > > > > >> Actions resources takes this flawed usage as the basis of its
> > > > > >> decisions and decides to throttle apache/pulsar builds.
> > > > > >> >
> > > > > >> > The reason why we are getting hit by this now is that there is a
> > > > high
> > > > > >> number of flaky test failures that cause almost every build to fail
> > > > and we
> > > > > >> have been re-running a lot of builds.
> > > > > >> >
> > > > > >> > Another fact supporting the theory of flawed usage metrics in
> > > > > >> the fairness algorithm is that other Apache projects aren't
> > > > > >> reporting issues about GitHub Actions slowness. This is mentioned
> > > > > >> in Jarek
> > > > Potiuk's
> > > > > >> comments on INFRA-23633 [1]:
> > > > > >> > > Unlike the case 2 years ago, the problem is not affecting all
> > > > > >> projects. In Apache Airflow we do > not see any particular 
> > > > > >> slow-down
> > > > with
> > > > > >> Public Runners at this moment (just checked - >
> > > > > >> > > everything is "as usual").. So I'd say it is something specific
> > > to
> > > > > >> Pulsar not to "ASF" as a whole.
> > > > > >> >
> > > > > >> > There are also other comments from Jarek about the GitHub
> > > "fairness
> > > > > >> algorithm" (comment [2], other comment [3])
> > > > > >> > > But I believe the current problem is different - it might be
> > > > (looking
> > > > > >> at your jobs) simply a bug
> > > > > >> > > in GA that you hit or indeed your demands are simply too high.
> > > > > >> >
> > > > > >> > I have opened two tickets (2 days ago and yesterday) with
> > > > > >> support.github.com and there hasn't been any response to them yet.
> > > > It
> > > > > >> might take up to 7 days to get a response. We cannot rely on GitHub
> > > > Support
> > > > > >> resolving this issue.
> > > > > >> >
> > > > > >> > I propose that we go ahead with the previously suggested action
> > > plan
> > > > > >> > > One possible way forward:
> > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement 
> > > > > >> > > for
> > > > > >> merging PRs.
> > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > >> > > 4. Disable all workflows
> > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > >> > >    - Make GHA workflow improvements such as
> > > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > > >> https://github.com/apache/pulsar/pull/17490
> > > > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > > > waste
> > > > > >> time with those. It should be possible to merge a PR even when a
> > > > > >> quarantined test fails.
> > > > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> > > next
> > > > so
> > > > > >> that changes are picked up
> > > > > >> > > 7. Enable workflows
> > > > > >> > > 8. Start processing PRs with checks to see if things are 
> > > > > >> > > handled
> > > > in a
> > > > > >> better way.
> > > > > >> > > 9. When things are stable, enable required checks again in
> > > > .asf.yaml,
> > > > > >> in the meantime be careful about merging PRs
> > > > > >> > > 10. Fix quarantined flaky tests
> > > > > >> >
> > > > > >> > To clarify, steps 1-6 would be done optimally in 1 day and we
> > > would
> > > > > >> stop processing ordinary PRs during this time. We would only handle
> > > > PRs
> > > > > >> that fix the CI situation during this exceptional period.
> > > > > >> >
> > > > > >> > -Lari
> > > > > >> >
> > > > > >> > Links to Jarek's comments:
> > > > > >> > [1]
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > > > >> > [2]
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > >> > [3]
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > >> >
> > > > > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > > > >> > > One possible way forward:
> > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement 
> > > > > >> > > for
> > > > > >> merging PRs.
> > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > >> > > 4. Disable all workflows
> > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > >> > >    - Make GHA workflow improvements such as
> > > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > > >> https://github.com/apache/pulsar/pull/17490
> > > > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > > > waste
> > > > > >> time with those. It should be possible to merge a PR even when a
> > > > > >> quarantined test fails.
> > > > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> > > next
> > > > so
> > > > > >> that changes are picked up
> > > > > >> > > 7. Enable workflows
> > > > > >> > > 8. Start processing PRs with checks to see if things are 
> > > > > >> > > handled
> > > > in a
> > > > > >> better way.
> > > > > >> > > 9. When things are stable, enable required checks again in
> > > > .asf.yaml,
> > > > > >> in the meantime be careful about merging PRs
> > > > > >> > > 10. Fix quarantined flaky tests
> > > > > >> > >
> > > > > >> > > -Lari
> > > > > >> > >
> > > > > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > > >> > > > The problem with CI is becoming worse. The build queue is 235
> > > > jobs
> > > > > >> now and the queue time is over 7 hours.
> > > > > >> > > >
> > > > > >> > > > We will need to start shedding load in the build queue and 
> > > > > >> > > > get
> > > > some
> > > > > >> fixes in.
> > > > > >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues
> > > to
> > > > > >> contain details about some activities. I have created 2 GitHub
> > > Support
> > > > > >> tickets, but usually it takes up to a week to get a response.
> > > > > >> > > >
> > > > > >> > > > I have some assumptions about the issue, but they are just
> > > > > >> assumptions.
> > > > > >> > > > One oddity is that when re-running failed jobs in a
> > > > large
> > > > > >> workflow, the execution times for previously successful jobs get
> > > > counted as
> > > > > >> if they have run.
> > > > > >> > > > Here's an example:
> > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > >> > > > The reported usage is about 3x the actual usage.
> > > > > >> > > > The assumption that I have is that the "fairness algorithm"
> > > that
> > > > > >> GitHub uses to provide all Apache projects about the same amount of
> > > > GitHub
> > > > > >> Actions resources would take this flawed usage as the basis of its
> > > > > >> decisions.
> > > > > >> > > > The reason why we are getting hit by this now is that there
> > > is a
> > > > > >> high number of flaky test failures that cause almost every build to
> > > > fail
> > > > > >> and we are re-running a lot of builds.
> > > > > >> > > >
> > > > > >> > > > Another problem there is that the GitHub Actions search
> > > doesn't
> > > > > >> always show all workflow runs that are running. This has happened
> > > > before
> > > > > >> when the GitHub Actions workflow search index was corrupted. GitHub
> > > > Support
> > > > > >> resolved that by rebuilding the search index with some manual admin
> > > > > >> operation behind the scenes.
> > > > > >> > > >
> > > > > >> > > > I'm proposing that we start shedding load from CI by
> > > cancelling
> > > > > >> build jobs and selecting which jobs to process so that we get the 
> > > > > >> CI
> > > > issue
> > > > > >> resolved. We might also have to disable required checks so that we
> > > > have
> > > > > >> some way to get changes merged while CI doesn't work properly.
> > > > > >> > > >
> > > > > >> > > > I'm expecting lazy consensus on fixing CI unless someone
> > > > proposes a
> > > > > >> better plan. Let's keep everyone informed in this mailing list
> > > thread.
> > > > > >> > > >
> > > > > >> > > > -Lari
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > >> > > > > We are going to need to take actions to fix our problems.
> > > See
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > > >> > > > >
> > > > > >> > > > > Jarek has done a large amount of GitHub Action work with
> > > > Apache
> > > > > >> Airflow and his suggestions might be helpful. One of his 
> > > > > >> suggestions
> > > > was
> > > > > >> Apache Yetus. I think he means using the Maven plugins -
> > > > > >> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <
> > > lhot...@apache.org
> > > > >
> > > > > >> wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > The Apache Infra ticket is
> > > > > >> https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > >> > > > > >
> > > > > >> > > > > > -Lari
> > > > > >> > > > > >
> > > > > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > >> > > > > >> I asked for an update on the Apache org GitHub Actions
> > > > usage
> > > > > >> stats from Gavin McDonald on the-asf slack in this thread:
> > > > > >>
> > > >
> > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > > >> .
> > > > > >> > > > > >>
> > > > > >> > > > > >> I hope we get this issue resolved since it delays PR
> > > > > >> processing a lot.
> > > > > >> > > > > >>
> > > > > >> > > > > >> -Lari
> > > > > >> > > > > >>
> > > > > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > >> > > > > >>> Pulsar CI continues to be congested, and the build 
> > > > > >> > > > > >>> queue
> > > > [1]
> > > > > >> is very long at the moment. There are 147 build jobs in the queue
> > > and
> > > > 16
> > > > > >> jobs in progress at the moment.
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> I would strongly advise everyone to use "personal CI"
> > > > > >> > > > > >>> to
> > > > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > > > open a
> > > > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > > > your
> > > > > >> "personal CI". There are more details in the previous emails in this
> > > > thread.
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> -Lari
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> [1] - build queue:
> > > > > >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > >> > > > > >>>> Pulsar CI continues to be congested, and the build
> > > queue
> > > > is
> > > > > >> long.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> I would strongly advise everyone to use "personal CI"
> > > to
> > > > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > > > open a
> > > > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > > > your
> > > > > >> "personal CI". There are more details in the previous email in this
> > > > thread.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> Some updates:
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> There has been a discussion with Gavin McDonald from
> > > ASF
> > > > > >> infra on the-asf slack about getting usage reports from GitHub to
> > > > support
> > > > > >> the investigation. Slack thread is the same one mentioned in the
> > > > previous
> > > > > >> email,
> > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > > > .
> > > > > >> Gavin already requested the usage report in GitHub UI, but it
> > > produced
> > > > > >> invalid results.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> I made a change to mitigate a source of additional
> > > GitHub
> > > > > >> Actions overhead.
> > > > > >> > > > > >>>> In the past, each cherry-picked commit to a 
> > > > > >> > > > > >>>> maintenance
> > > > > >> branch of Pulsar has triggered a lot of workflow runs.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> The solution for cancelling duplicate builds
> > > > automatically
> > > > > >> is to add this definition to the workflow definition:
> > > > > >> > > > > >>>> concurrency:
> > > > > >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > > >> > > > > >>>>  cancel-in-progress: true
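In context, that block sits at the top level of a workflow file; a minimal sketch (the workflow name, trigger branch, and job are illustrative, not Pulsar's actual workflow):

```yaml
# Illustrative workflow fragment; name, triggers, and job are made up.
name: Example CI
on:
  push:
    branches:
      - branch-2.10

# Runs of the same workflow on the same ref share one concurrency group,
# so a newer push cancels the still-running build of the older commit.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
```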
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> > > > > >> workflows:
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> branch-2.10 change:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > >> > > > > >>>> branch-2.9 change:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > >> > > > > >>>> branch-2.8 change:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > >> > > > > >>>> branch-2.7:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> branch-2.11 already contains the necessary config for
> > > > > >> cancelling duplicate builds.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> The benefit of the above change is that when multiple
> > > > > >> commits are cherry-picked to a branch at once, only the build of 
> > > > > >> the
> > > > last
> > > > > >> commit will get run eventually. The builds for the intermediate
> > > > commits
> > > > > >> will get cancelled. Obviously there's a tradeoff: we don't get
> > > > > >> the information if one of the earlier commits breaks the build.
> > > > > >> That's a cost we need to pay. Nevertheless, our build is so flaky
> > > > > >> that it's hard to determine whether a failed build result is
> > > > > >> caused only by a bad flaky test or is an actual failure. Because
> > > > > >> of this we don't lose
> > > > anything by
> > > > > >> cancelling builds. It's more important to save build resources. In
> > > the
> > > > > >> maintenance branches for 2.10 and older, the average total build
> > > time
> > > > > >> consumed is around 20 hours which is a lot.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> At this time, the overhead of maintenance branch 
> > > > > >> > > > > >>>> builds
> > > > > >> doesn't seem to be the source of the problems. There must be some
> > > > other
> > > > > >> issue which is possibly related to exceeding a usage quota.
> > > Hopefully
> > > > we
> > > > > >> get the CI slowness issue solved asap.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> BR,
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> Lari
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > >> > > > > >>>>> Hi,
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> GitHub Actions builds have been piling up in the 
> > > > > >> > > > > >>>>> build
> > > > > >> queue in the last few days.
> > > > > >> > > > > >>>>> I posted on bui...@apache.org
> > > > > >> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s
> > > and
> > > > > >> created INFRA ticket
> > > > https://issues.apache.org/jira/browse/INFRA-23633
> > > > > >> about this issue.
> > > > > >> > > > > >>>>> There's also a thread on the-asf slack,
> > > > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> It seems that our build queue is finally getting
> > > picked
> > > > up,
> > > > > >> but it would be great to see if we hit quota and whether that is 
> > > > > >> the
> > > > cause
> > > > > >> of pauses.
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> Another issue is that the master branch broke after
> > > > merging
> > > > > >> 2 conflicting PRs.
> > > > > >> > > > > >>>>> The fix is in
> > > > https://github.com/apache/pulsar/pull/17300
> > > > > >> .
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> Merging PRs will be slow until we have these 2
> > > problems
> > > > > >> solved and existing PRs rebased over the changes. Let's prioritize
> > > > merging
> > > > > >> #17300 before pushing more changes.
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> I'd like to point out that a good way to get build
> > > > feedback
> > > > > >> before sending a PR is to run builds on your personal GitHub
> > > Actions
> > > > CI.
> > > > > >> The benefit of this is that it doesn't consume the shared quota and
> > > > builds
> > > > > >> usually start instantly.
> > > > > >> > > > > >>>>> There are instructions in the contributors guide 
> > > > > >> > > > > >>>>> about
> > > > > >> this.
> > > > > >> > > > > >>>>>
> > > > > >> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar
> > > to
> > > > > >> run builds on your personal GitHub Actions CI.
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> BR,
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> Lari
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>
> > > > > >> > > > > >>
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
