If my assumption about a GitHub usage metrics bug in the GitHub Actions build 
job queue fairness algorithm is correct, it would help to run the flaky unit 
test group outside of the Pulsar CI workflow. That would limit the impact of 
the inflated usage metrics.

The example https://github.com/apache/pulsar/actions/runs/3003787409/usage 
shows this flaw, as explained in the previous email. The total reported 
execution time in that report is 1d 1h 40m 21s, while the actual usage is 
only about a third of that.
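
For anyone who wants to double-check such numbers, per-run duration data is 
also exposed by the GitHub REST API's workflow run timing endpoint. A minimal 
sketch wrapped in a manually triggered workflow (the wrapper workflow is just 
illustration, and it's my assumption that this endpoint reports the same 
figures the usage page is based on):

name: usage-check
on: workflow_dispatch
jobs:
  usage:
    runs-on: ubuntu-latest
    steps:
      - name: Print reported usage for run 3003787409
        env:
          GH_TOKEN: ${{ github.token }}
        # gh is preinstalled on GitHub-hosted runners; the /timing endpoint
        # returns the run's billable time per OS and the run duration
        run: gh api repos/apache/pulsar/actions/runs/3003787409/timing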

Moving the most commonly failing job out of the Pulsar CI workflow would make 
the impact of the possible usage metrics bug much smaller. I hope GitHub 
Support responds to my tickets and queries about this bug. It might take up 
to 7 days to get a reply, and likely longer for technical questions. In the 
meantime we need a solution for getting past this CI slowness.
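
As a rough sketch, a separate workflow for the flaky group could look 
something like the following (the workflow name, trigger, Java setup, and the 
"flaky" Maven test group are my assumptions for illustration, not existing 
configuration):

name: Pulsar CI - Flaky tests
on:
  pull_request:
    branches:
      - master
# cancel superseded runs so re-pushed PRs don't pile up in the queue
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  flaky-unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v3
      - name: Set up JDK
        uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '17'
      # run only the tests tagged with the hypothetical "flaky" group
      - name: Run flaky unit test group
        run: mvn -B -ntp test -Dgroups=flaky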

-Lari



On 2022/09/08 06:34:42 Lari Hotari wrote:
> My current assumption about the CI slowness problem is that the usage 
> metrics for Apache Pulsar builds are calculated incorrectly on GitHub's 
> side, and that this results in apache/pulsar builds getting throttled. This 
> assumption might be wrong, but it's the best guess at the moment.
> 
> A fact that supports this assumption is that when re-running failed jobs in 
> a workflow, the execution times for previously successful jobs get counted 
> as if they had all run.
> Here's an example: 
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> The reported total usage is about 3x the actual usage.
> 
> My assumption is that the "fairness algorithm" that GitHub uses to give all 
> Apache projects about the same amount of GitHub Actions resources takes this 
> flawed usage data as the basis of its decisions and therefore decides to 
> throttle apache/pulsar builds.
> 
> We are getting hit by this now because a high number of flaky test failures 
> causes almost every build to fail, and we have been re-running a lot of 
> builds.
> 
> Another fact supporting the theory of flawed usage metrics feeding the 
> fairness algorithm is that other Apache projects aren't reporting issues 
> about GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments 
> on INFRA-23633 [1]:
> > Unlike the case 2 years ago, the problem is not affecting all projects. In 
> > Apache Airflow we do not see any particular slow-down with Public Runners 
> > at this moment (just checked - everything is "as usual"). So I'd say it is 
> > something specific to Pulsar, not to "ASF" as a whole.
> 
> There are also other comments from Jarek about the GitHub "fairness 
> algorithm" (comment [2], another comment [3]):
> > But I believe the current problem is different - it might be (looking at 
> > your jobs) simply a bug 
> > in GA that you hit or indeed your demands are simply too high. 
> 
> I have opened two tickets with support.github.com (one two days ago, one 
> yesterday) and there hasn't been any response to either of them. It might 
> take up to 7 days to get a response. We cannot rely on GitHub Support to 
> resolve this issue.
> 
> I propose that we go ahead with the previously suggested action plan:
> > One possible way forward:
> > 1. Cancel all existing builds in_progress or queued
> > 2. Edit .asf.yaml and drop the "required checks" requirement for merging 
> > PRs.
> > 3. Wait for build to run for .asf.yaml change, merge it
> > 4. Disable all workflows
> > 5. Process specific PRs manually to improve the situation.
> >    - Make GHA workflow improvements such as 
> > https://github.com/apache/pulsar/pull/17491 and 
> > https://github.com/apache/pulsar/pull/17490
> >    - Quarantine all very flaky tests so that everyone doesn't waste time 
> > with those. It should be possible to merge a PR even when a quarantined 
> > test fails.
> > 6. Rebase PRs (or close and re-open) that would be processed next so that 
> > changes are picked up
> > 7. Enable workflows
> > 8. Start processing PRs with checks to see if things are handled in a 
> > better way.
> > 9. When things are stable, enable required checks again in .asf.yaml, in 
> > the meantime be careful about merging PRs
> > 10. Fix quarantined flaky tests
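> 
> For step 2 (and for re-enabling the checks in step 9), the relevant part of 
> .asf.yaml is the protected-branches configuration. A rough sketch of the 
> section involved (the check name below is a placeholder, not the actual 
> apache/pulsar list):
> 
> github:
>   protected_branches:
>     master:
>       required_status_checks:
>         # removing this block drops the "required checks" requirement
>         contexts:
>           - Pulsar CI checks completed
> 
> For the quarantine part of step 5, one option (again only a sketch, assuming 
> quarantined tests get tagged with a TestNG group named "quarantine") is to 
> exclude that group from the regular CI test step:
> 
>       - name: Run unit tests, excluding quarantined flaky tests
>         run: mvn -B -ntp test -DexcludedGroups=quarantine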
> 
> To clarify, steps 1-6 would optimally be done in 1 day, and we would stop 
> processing ordinary PRs during this time. We would only handle PRs that fix 
> the CI situation during this exceptional period.
> 
> -Lari
> 
> Links to Jarek's comments:
> [1] 
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> [2] 
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> [3] 
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> 
> On 2022/09/07 17:01:43 Lari Hotari wrote:
> > One possible way forward:
> > 1. Cancel all existing builds in_progress or queued
> > 2. Edit .asf.yaml and drop the "required checks" requirement for merging 
> > PRs.
> > 3. Wait for build to run for .asf.yaml change, merge it
> > 4. Disable all workflows
> > 5. Process specific PRs manually to improve the situation.
> >    - Make GHA workflow improvements such as 
> > https://github.com/apache/pulsar/pull/17491 and 
> > https://github.com/apache/pulsar/pull/17490
> >    - Quarantine all very flaky tests so that everyone doesn't waste time 
> > with those. It should be possible to merge a PR even when a quarantined 
> > test fails.
> > 6. Rebase PRs (or close and re-open) that would be processed next so that 
> > changes are picked up
> > 7. Enable workflows
> > 8. Start processing PRs with checks to see if things are handled in a 
> > better way.
> > 9. When things are stable, enable required checks again in .asf.yaml, in 
> > the meantime be careful about merging PRs
> > 10. Fix quarantined flaky tests
> > 
> > -Lari
> > 
> > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > The problem with CI is becoming worse. The build queue is 235 jobs now 
> > > and the queue time is over 7 hours.
> > > 
> > > We will need to start shedding load in the build queue and get some fixes 
> > > in.
> > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain 
> > > details about some activities. I have created 2 GitHub Support tickets, 
> > > but usually it takes up to a week to get a response.
> > > 
> > > I have some assumptions about the issue, but they are just assumptions.
> > > One oddity is that when "re-run failed jobs" is used in a large 
> > > workflow, the execution times for previously successful jobs get counted 
> > > as if they had run. 
> > > Here's an example: 
> > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > The reported usage is about 3x the actual usage.
> > > My assumption is that the "fairness algorithm" that GitHub uses to 
> > > give all Apache projects about the same amount of GitHub Actions 
> > > resources takes this flawed usage as the basis of its decisions.
> > > We are getting hit by this now because a high number of flaky test 
> > > failures causes almost every build to fail, and we are re-running a 
> > > lot of builds.
> > > 
> > > Another problem is that the GitHub Actions search doesn't always 
> > > show all workflow runs that are running. This has happened before when 
> > > the GitHub Actions workflow search index was corrupted. GitHub Support 
> > > resolved that by rebuilding the search index with some manual admin 
> > > operation behind the scenes.
> > > 
> > > I'm proposing that we start shedding load from CI by cancelling build 
> > > jobs and selecting which jobs to process so that we get the CI issue 
> > > resolved. We might also have to disable required checks so that we have 
> > > some way to get changes merged while CI doesn't work properly.
> > > 
> > > I'm expecting lazy consensus on fixing CI unless someone proposes a 
> > > better plan. Let's keep everyone informed in this mailing list thread.
> > > 
> > > -Lari
> > > 
> > > 
> > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > We are going to need to take action to fix our problems. See 
> > > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > 
> > > > Jarek has done a large amount of GitHub Action work with Apache Airflow 
> > > > and his suggestions might be helpful. One of his suggestions was Apache 
> > > > Yetus. I think he means using the Maven plugins - 
> > > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > 
> > > > 
> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
> > > > > 
> > > > > The Apache Infra ticket is 
> > > > > https://issues.apache.org/jira/browse/INFRA-23633 . 
> > > > > 
> > > > > -Lari
> > > > > 
> > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > >> I asked for an update on the Apache org GitHub Actions usage stats 
> > > > >> from Gavin McDonald on the-asf slack in this thread: 
> > > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > >>  .
> > > > >> 
> > > > >> I hope we get this issue resolved since it delays PR processing a 
> > > > >> lot.
> > > > >> 
> > > > >> -Lari
> > > > >> 
> > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > >>> Pulsar CI continues to be congested, and the build queue [1] is 
> > > > >>> very long at the moment. There are 147 build jobs in the queue and 
> > > > >>> 16 jobs in progress at the moment.
> > > > >>> 
> > > > >>> I would strongly advise everyone to use "personal CI" to mitigate 
> > > > >>> the long delay in CI feedback. You can simply open a PR to your 
> > > > >>> own personal fork of apache/pulsar to run the builds in your 
> > > > >>> "personal CI". There are more details in the previous emails in 
> > > > >>> this thread.
> > > > >>> 
> > > > >>> -Lari
> > > > >>> 
> > > > >>> [1] - build queue: 
> > > > >>> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > >>> 
> > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > > > >>>> 
> > > > >>>> I would strongly advise everyone to use "personal CI" to mitigate 
> > > > >>>> the long delay in CI feedback. You can simply open a PR to your 
> > > > >>>> own personal fork of apache/pulsar to run the builds in your 
> > > > >>>> "personal CI". There are more details in the previous email in 
> > > > >>>> this thread.
> > > > >>>> 
> > > > >>>> Some updates:
> > > > >>>> 
> > > > >>>> There has been a discussion with Gavin McDonald from ASF infra on 
> > > > >>>> the-asf slack about getting usage reports from GitHub to support 
> > > > >>>> the investigation. Slack thread is the same one mentioned in the 
> > > > >>>> previous email, 
> > > > >>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > > > >>>> Gavin already requested the usage report in the GitHub UI, but it 
> > > > >>>> produced invalid results.
> > > > >>>> 
> > > > >>>> I made a change to mitigate one source of additional GitHub 
> > > > >>>> Actions overhead. 
> > > > >>>> Previously, each cherry-picked commit to a maintenance branch of 
> > > > >>>> Pulsar triggered a full set of workflow runs. 
> > > > >>>> 
> > > > >>>> The way to cancel duplicate builds automatically is to add this 
> > > > >>>> concurrency definition to each workflow:
> > > > >>>> concurrency:
> > > > >>>>   group: ${{ github.workflow }}-${{ github.ref }}
> > > > >>>>   cancel-in-progress: true
> > > > >>>> 
> > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > > >>>> 
> > > > >>>> branch-2.10 change:
> > > > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > >>>> branch-2.9 change:
> > > > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > >>>> branch-2.8 change:
> > > > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > >>>> branch-2.7:
> > > > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > >>>> 
> > > > >>>> branch-2.11 already contains the necessary config for cancelling 
> > > > >>>> duplicate builds.
> > > > >>>> 
> > > > >>>> The benefit of the above change is that when multiple commits are 
> > > > >>>> cherry-picked to a branch at once, only the build of the last 
> > > > >>>> commit will eventually run; the builds for the intermediate 
> > > > >>>> commits get cancelled. There's an obvious tradeoff: we won't find 
> > > > >>>> out if one of the earlier commits breaks the build. That's a cost 
> > > > >>>> we have to pay. However, our build is currently so flaky that 
> > > > >>>> it's hard to tell whether a failed build is caused by a flaky 
> > > > >>>> test or by an actual failure, so we lose very little by 
> > > > >>>> cancelling builds. It's more important to save build resources. 
> > > > >>>> In the maintenance branches for 2.10 and older, the average total 
> > > > >>>> build time consumed is around 20 hours, which is a lot.
> > > > >>>> 
> > > > >>>> At this time, the overhead of maintenance branch builds doesn't 
> > > > >>>> seem to be the source of the problems. There must be some other 
> > > > >>>> issue which is possibly related to exceeding a usage quota. 
> > > > >>>> Hopefully we get the CI slowness issue solved asap.
> > > > >>>> 
> > > > >>>> BR,
> > > > >>>> 
> > > > >>>> Lari
> > > > >>>> 
> > > > >>>> 
> > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > >>>>> Hi,
> > > > >>>>> 
> > > > >>>>> GitHub Actions builds have been piling up in the build queue in 
> > > > >>>>> the last few days.
> > > > >>>>> I posted on bui...@apache.org 
> > > > >>>>> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s 
> > > > >>>>> and created INFRA ticket 
> > > > >>>>> https://issues.apache.org/jira/browse/INFRA-23633 about this 
> > > > >>>>> issue.
> > > > >>>>> There's also a thread on the-asf slack, 
> > > > >>>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > > > >>>>> 
> > > > >>>>> It seems that our build queue is finally getting picked up, but 
> > > > >>>>> it would be good to find out whether we hit a quota and whether 
> > > > >>>>> that is the cause of the pauses. 
> > > > >>>>> 
> > > > >>>>> Another issue is that the master branch broke after merging 2 
> > > > >>>>> conflicting PRs. 
> > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > > > >>>>> 
> > > > >>>>> Merging PRs will be slow until we have these 2 problems solved 
> > > > >>>>> and existing PRs rebased over the changes. Let's prioritize 
> > > > >>>>> merging #17300 before pushing more changes.
> > > > >>>>> 
> > > > >>>>> I'd like to point out that a good way to get build feedback 
> > > > >>>>> before sending a PR is to run builds on your personal GitHub 
> > > > >>>>> Actions CI. The benefit of this is that it doesn't consume the 
> > > > >>>>> shared quota and builds usually start instantly.
> > > > >>>>> There are instructions in the contributors guide about this. 
> > > > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run 
> > > > >>>>> builds on your personal GitHub Actions CI.
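> > > > >>>>> This works because the workflows are triggered on pull_request 
> > > > >>>>> events, so a PR opened inside your own fork runs the same 
> > > > >>>>> workflows there, against your fork's own runner allocation 
> > > > >>>>> instead of the shared apache org quota. Schematically (a 
> > > > >>>>> simplified sketch, not the actual Pulsar workflow header):
> > > > >>>>> 
> > > > >>>>> name: CI - Unit tests
> > > > >>>>> on:
> > > > >>>>>   pull_request:
> > > > >>>>>     branches:
> > > > >>>>>       - master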
> > > > >>>>> 
> > > > >>>>> BR,
> > > > >>>>> 
> > > > >>>>> Lari
> > > > >>>>> 
> > > > >>>> 
> > > > >>> 
> > > > >> 
> > > > 
> > > > 
> > > 
> > 
> 
