On 2022/09/07 17:27:45 tison wrote:
> Today the Pulsar repo runs almost only one workflow at a time. It's
> a new situation I didn't notice before.
> 
> > drop the "required checks"
> 
> This can be dangerous to the repo status. I think the essential problem we
> face here is prioritizing specific PRs, instead of releasing the guard
> for all PRs.

Life is dangerous. :)
I suggested dropping the required checks only until the specific PRs that
address the CI issues have been merged. If we don't drop the "required checks"
temporarily, there will be additional delays. It's pointless to have required
checks enabled while we are fixing the issue.
In emergency situations, it's also possible to push changes directly to the
master branch, bypassing PRs. I'm not saying that we should do that; it's just
that we also have that option available.
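To be concrete about what "dropping required checks temporarily" would involve,
here's a rough sketch against the GitHub branch protection API (just an
illustration, not something we'd run as-is; it assumes an admin-scoped token in
the GITHUB_TOKEN environment variable, and we'd restore the saved list as soon
as the CI-fix PRs are merged):

import os
import requests

# Branch protection endpoint for the required status checks on master.
API = ("https://api.github.com/repos/apache/pulsar"
       "/branches/master/protection/required_status_checks")
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

# Save the currently required check contexts so they can be restored later.
current = requests.get(API, headers=HEADERS)
current.raise_for_status()
saved_contexts = current.json().get("contexts", [])
print("Currently required checks:", saved_contexts)

# Temporarily require no check contexts at all.
requests.patch(API, headers=HEADERS, json={"contexts": []}).raise_for_status()

# After the CI-fix PRs are merged, restore the saved list:
# requests.patch(API, headers=HEADERS, json={"contexts": saved_contexts}).raise_for_status()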

> 
> > Fix quarantined flaky tests
> 
> But yes, to overcome the workload brought by unnecessary reruns, it can be
> a solution that we treat all tests as "unstable" and un-require them while
> adding back in a timing manner.

Dropping "required checks" can solve that. Moving all tests to flaky or 
quarantine is our current solution doesn't make sense. It's better to disable 
required checks and disable specific workflows temporarily. I'm just clarifying 
that most of the action plan I proposed should be implemented in 1-2 days. We 
have already solutions to move flaky tests to groups that don't block merging 
and we should do that for the most flaky tests. There's no meaning in rerunning 
known flaky tests. 
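For the "disable specific workflows temporarily" part, a rough sketch with the
Actions REST API could look like this (the workflow name below is only a
placeholder, and again an admin-scoped token is assumed in GITHUB_TOKEN):

import os
import requests

REPO_API = "https://api.github.com/repos/apache/pulsar"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

# Placeholder name; the real workflow to disable would be picked case by case.
TARGET_WORKFLOW_NAME = "Pulsar CI Flaky"

resp = requests.get(f"{REPO_API}/actions/workflows",
                    params={"per_page": 100}, headers=HEADERS)
resp.raise_for_status()

for wf in resp.json()["workflows"]:
    if wf["name"] == TARGET_WORKFLOW_NAME:
        print(f"Disabling workflow '{wf['name']}' ({wf['path']})")
        requests.put(f"{REPO_API}/actions/workflows/{wf['id']}/disable",
                     headers=HEADERS).raise_for_status()
        # Re-enabling later is the matching call:
        # requests.put(f"{REPO_API}/actions/workflows/{wf['id']}/enable", headers=HEADERS)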

-Lari


> 
> Best,
> tison.
> 
> 
> Lari Hotari <lhot...@apache.org> wrote on Thu, Sep 8, 2022 at 01:15:
> 
> > On 2022/09/07 16:59:33 tison wrote:
> > > > selecting which jobs to process
> > >
> > > Do you have a patch to implement this? IIRC it requires interacting with
> > > an outside service, or at least we may add an ok-to-test label.
> >
> > Very good idea, I didn't think that far ahead. It seems that Apache Spark
> > has some solution, since it was mentioned in the discussion on the the-asf
> > Slack channel that Spark requires contributors to run validation in their
> > own personal GHA quota. I don't know how that is achieved.
> >
> > As you proposed, one possible solution would be to have a workflow that
> > only proceeds when there's an "ok-to-test" label on the PR.
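To make that concrete, a minimal sketch of such a gate could be a small guard
step at the start of the workflow. The Python below only illustrates the idea:
it reads the event payload that GitHub Actions exposes via GITHUB_EVENT_PATH,
the label name is just an example, and the same check could also be expressed
as an "if:" condition on the job.

import json
import os
import sys

# GITHUB_EVENT_PATH points at the webhook payload of the triggering event.
with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)

labels = {label["name"] for label in event.get("pull_request", {}).get("labels", [])}

if "ok-to-test" not in labels:
    print("No 'ok-to-test' label on this PR; stopping before the expensive jobs.")
    sys.exit(1)

print("'ok-to-test' label found; continuing.")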
> >
> > For the immediate selection of jobs to process, I have ways to clear the
> > GHA build queue for apache/pulsar using the GHA API.
> > I clarified the proposed action plan in a follow-up message to the thread [1].
> > We would primarily process PRs which help us get out of the current situation.
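"Clearing the GHA build queue" would look roughly like the sketch below: list
the queued workflow runs and cancel them through the Actions REST API. This is
only a sketch; the logic for selecting which PRs to keep is intentionally left
out, and a token is assumed in GITHUB_TOKEN.

import os
import requests

REPO_API = "https://api.github.com/repos/apache/pulsar"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

# Collect the ids of all currently queued workflow runs.
queued_run_ids = []
page = 1
while True:
    resp = requests.get(f"{REPO_API}/actions/runs",
                        params={"status": "queued", "per_page": 100, "page": page},
                        headers=HEADERS)
    resp.raise_for_status()
    runs = resp.json()["workflow_runs"]
    if not runs:
        break
    queued_run_ids.extend(run["id"] for run in runs)
    page += 1

# Cancel them; any PR that still needs a build can simply be re-triggered later.
for run_id in queued_run_ids:
    print("Cancelling queued run", run_id)
    requests.post(f"{REPO_API}/actions/runs/{run_id}/cancel",
                  headers=HEADERS).raise_for_status()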
> >
> > It would also be helpful if there were a way to escalate to
> > ASF INFRA support and GitHub Support. However, the discussion in the ticket
> > https://issues.apache.org/jira/browse/INFRA-23633 doesn't give much hope
> > of that possibility.
> >
> >
> > -Lari
> >
> > [1] https://lists.apache.org/thread/rpq12tzm4hx8kozpkphd2jyqr8cj0yj5
> >
> > On 2022/09/07 16:59:33 tison wrote:
> > > > selecting which jobs to process
> > >
> > > Do you have a patch to implement this? IIRC it requires interacting with
> > > an outside service, or at least we may add an ok-to-test label.
> > >
> > > Besides, it increases committers/PMC members' workload - be aware of it,
> > > or most contributions will stall.
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Lari Hotari <lhot...@apache.org> wrote on Thu, Sep 8, 2022 at 00:47:
> > >
> > > > The problem with CI is becoming worse. The build queue is 235 jobs now
> > and
> > > > the queue time is over 7 hours.
> > > >
> > > > We will need to start shedding load in the build queue and get some
> > fixes
> > > > in.
> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain
> > > > details about some activities. I have created 2 GitHub Support
> > tickets, but
> > > > usually it takes up to a week to get a response.
> > > >
> > > > I have some assumptions about the issue, but they are just assumptions.
> > > > One oddity is that when re-running failed jobs is used in a large
> > > > workflow, the execution times for previously successful jobs get counted
> > > > as if they had run.
> > > > Here's an example:
> > > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > The reported usage is about 3x the actual usage.
> > > > The assumption that I have is that the "fairness algorithm" that GitHub
> > > > uses to provide all Apache projects about the same amount of GitHub Actions
> > > > resources would take this flawed usage as the basis of its decisions.
> > > > The reason why we are getting hit by this now is that there is a high
> > > > number of flaky test failures that cause almost every build to fail, and
> > > > we are re-running a lot of builds.
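As a side note, the numbers behind that ".../usage" page can also be pulled via
the workflow run timing endpoint, which should show roughly the same totals and
makes it easier to compare them with the actual job durations. A rough sketch,
assuming a token in GITHUB_TOKEN:

import os
import requests

RUN_ID = 3003787409  # the run from the example URL above
URL = f"https://api.github.com/repos/apache/pulsar/actions/runs/{RUN_ID}/timing"
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

resp = requests.get(URL, headers=HEADERS)
resp.raise_for_status()
timing = resp.json()

# Print the overall duration and the reported total per runner type.
print("run_duration_ms:", timing.get("run_duration_ms"))
for runner, usage in timing.get("billable", {}).items():
    print(runner, "total_ms:", usage.get("total_ms"))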
> > > >
> > > > Another problem is that the GitHub Actions search doesn't always
> > > > show all workflow runs that are running. This has happened before when
> > the
> > > > GitHub Actions workflow search index was corrupted. GitHub Support
> > resolved
> > > > that by rebuilding the search index with some manual admin operation
> > behind
> > > > the scenes.
> > > >
> > > > I'm proposing that we start shedding load from CI by cancelling build
> > jobs
> > > > and selecting which jobs to process so that we get the CI issue
> > resolved.
> > > > We might also have to disable required checks so that we have some way
> > to
> > > > get changes merged while CI doesn't work properly.
> > > >
> > > > I'm expecting lazy consensus on fixing CI unless someone proposes a
> > better
> > > > plan. Let's keep everyone informed in this mailing list thread.
> > > >
> > > > -Lari
> > > >
> > > >
> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > We are going to need to take actions to fix our problems. See
> > > >
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > >
> > > > > Jarek has done a large amount of GitHub Action work with Apache
> > Airflow
> > > > and his suggestions might be helpful. One of his suggestions was Apache
> > > > Yetus. I think he means using the Maven plugins -
> > > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > >
> > > > >
> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org>
> > wrote:
> > > > > >
> > > > > > The Apache Infra ticket is
> > > > https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > >
> > > > > > -Lari
> > > > > >
> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > >> I asked for an update on the Apache org GitHub Actions usage stats
> > > > from Gavin McDonald on the-asf slack in this thread:
> > > >
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > .
> > > > > >>
> > > > > >> I hope we get this issue resolved since it delays PR processing a
> > lot.
> > > > > >>
> > > > > >> -Lari
> > > > > >>
> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > >>> Pulsar CI continues to be congested, and the build queue [1] is very
> > > > > >>> long at the moment. There are 147 build jobs in the queue and 16 jobs
> > > > > >>> in progress.
> > > > > >>>
> > > > > >>> I would strongly advise everyone to use "personal CI" to mitigate
> > > > > >>> the issue of the long delay in CI feedback. You can simply open a PR to
> > > > > >>> your own personal fork of apache/pulsar to run the builds in your
> > > > > >>> "personal CI". There are more details in the previous emails in this thread.
> > > > > >>>
> > > > > >>> -Lari
> > > > > >>>
> > > > > >>> [1] - build queue:
> > > > https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > >>>
> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > > > > >>>>
> > > > > >>>> I would strongly advise everyone to use "personal CI" to mitigate
> > > > > >>>> the issue of the long delay in CI feedback. You can simply open a PR to
> > > > > >>>> your own personal fork of apache/pulsar to run the builds in your
> > > > > >>>> "personal CI". There are more details in the previous email in this thread.
> > > > > >>>>
> > > > > >>>> Some updates:
> > > > > >>>>
> > > > > >>>> There has been a discussion with Gavin McDonald from ASF infra
> > on
> > > > the-asf slack about getting usage reports from GitHub to support the
> > > > investigation. Slack thread is the same one mentioned in the previous
> > > > email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > .
> > > > Gavin already requested the usage report in GitHub UI, but it produced
> > > > invalid results.
> > > > > >>>>
> > > > > >>>> I made a change to mitigate a source of additional GitHub
> > Actions
> > > > overhead.
> > > > > >>>> In the past, each cherry-picked commit to a maintenance branch
> > of
> > > > Pulsar has triggered a lot of workflow runs.
> > > > > >>>>
> > > > > >>>> The solution for cancelling duplicate builds automatically is to
> > > > add this definition to the workflow definition:
> > > > > >>>> concurrency:
> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > > >>>>  cancel-in-progress: true
> > > > > >>>>
> > > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > > > >>>>
> > > > > >>>> branch-2.10 change:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > >>>> branch-2.9 change:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > >>>> branch-2.8 change:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > >>>> branch-2.7:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > >>>>
> > > > > >>>> branch-2.11 already contains the necessary config for cancelling
> > > > duplicate builds.
> > > > > >>>>
> > > > > >>>> The benefit of the above change is that when multiple commits are
> > > > > >>>> cherry-picked to a branch at once, only the build of the last commit
> > > > > >>>> will eventually run. The builds for the intermediate commits will get
> > > > > >>>> cancelled. Obviously there's a tradeoff here: we don't get the
> > > > > >>>> information if one of the earlier commits breaks the build. It's the
> > > > > >>>> cost that we need to pay. Nevertheless, our build is so flaky that it's
> > > > > >>>> hard to determine whether a failed build result is only caused by a
> > > > > >>>> flaky test or whether it's an actual failure. Because of this we don't
> > > > > >>>> lose anything by cancelling builds. It's more important to save build
> > > > > >>>> resources. In the maintenance branches for 2.10 and older, the average
> > > > > >>>> total build time consumed is around 20 hours, which is a lot.
> > > > > >>>>
> > > > > >>>> At this time, the overhead of maintenance branch builds doesn't
> > > > seem to be the source of the problems. There must be some other issue
> > which
> > > > is possibly related to exceeding a usage quota. Hopefully we get the CI
> > > > slowness issue solved asap.
> > > > > >>>>
> > > > > >>>> BR,
> > > > > >>>>
> > > > > >>>> Lari
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> GitHub Actions builds have been piling up in the build queue in
> > > > the last few days.
> > > > > >>>>> I posted on bui...@apache.org
> > > > https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> > > > created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> > > > about this issue.
> > > > > >>>>> There's also a thread on the-asf slack,
> > > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > >>>>>
> > > > > >>>>> It seems that our build queue is finally getting picked up,
> > but it
> > > > would be great to see if we hit quota and whether that is the cause of
> > > > pauses.
> > > > > >>>>>
> > > > > >>>>> Another issue is that the master branch broke after merging 2
> > > > conflicting PRs.
> > > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > > > >>>>>
> > > > > >>>>> Merging PRs will be slow until we have these 2 problems solved
> > and
> > > > existing PRs rebased over the changes. Let's prioritize merging #17300
> > > > before pushing more changes.
> > > > > >>>>>
> > > > > >>>>> I'd like to point out that a good way to get build feedback before
> > > > > >>>>> sending a PR is to run builds on your personal GitHub Actions CI. The
> > > > > >>>>> benefit of this is that it doesn't consume the shared quota, and builds
> > > > > >>>>> usually start instantly.
> > > > > >>>>> There are instructions in the contributors guide about this.
> > > > > >>>>>
> > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> > > > builds on your personal GitHub Actions CI.
> > > > > >>>>>
> > > > > >>>>> BR,
> > > > > >>>>>
> > > > > >>>>> Lari
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
> 
