One possible way forward:
1. Cancel all existing builds that are in progress or queued
2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
3. Wait for the build of the .asf.yaml change to run, then merge it
4. Disable all workflows
5. Process specific PRs manually to improve the situation.
   - Make GHA workflow improvements such as
     https://github.com/apache/pulsar/pull/17491 and
     https://github.com/apache/pulsar/pull/17490
   - Quarantine all very flaky tests so that nobody wastes time on them.
     It should be possible to merge a PR even when a quarantined test fails.
6. Rebase (or close and re-open) the PRs that would be processed next so that
   the changes are picked up
7. Enable workflows
8. Start processing PRs with checks to see if things are handled in a better
   way.
9. When things are stable, enable required checks again in .asf.yaml. In the
   meantime, be careful about merging PRs.
10. Fix quarantined flaky tests
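
Steps 2 and 9 could look roughly like this in .asf.yaml. This is only a
sketch: the branch name, the review settings and the check name under
"contexts" are placeholders, not the values in Pulsar's actual file. Whether
the block has to be removed entirely or left with an empty "contexts" list is
an assumption that should be verified against the ASF Infra documentation for
.asf.yaml.

```yaml
# .asf.yaml (sketch; branch, review settings and check names are placeholders)
github:
  protected_branches:
    master:
      # Step 2: temporarily drop the required checks by commenting out
      # this block. Step 9: restore it once CI is stable again.
      # required_status_checks:
      #   strict: false
      #   contexts:
      #     - example-required-check
      required_pull_request_reviews:
        required_approving_review_count: 1
```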

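For step 5, one way to allow merging a PR even when a quarantined test fails
is to run the quarantined tests in a step marked "continue-on-error". A
minimal sketch, assuming the flaky tests are tagged with a test group named
"quarantine" (the job name, group name and Maven invocation are illustrative
and not taken from the Pulsar build):

```yaml
# Workflow fragment (sketch; job name, group name and Maven command are assumptions)
jobs:
  quarantined-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run quarantined tests
        # A failure in this step is reported but does not fail the job.
        continue-on-error: true
        run: mvn -B test -Dgroups=quarantine
```
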
-Lari

On 2022/09/07 16:47:09 Lari Hotari wrote:
> The problem with CI is becoming worse. The build queue is 235 jobs now and 
> the queue time is over 7 hours.
> 
> We will need to start shedding load in the build queue and get some fixes in.
> https://issues.apache.org/jira/browse/INFRA-23633 continues to contain 
> details about some activities. I have created 2 GitHub Support tickets, but 
> usually it takes up to a week to get a response.
> 
> I have some assumptions about the issue, but they are just assumptions.
> One oddity is that when "re-run failed jobs" is used in a large workflow, 
> the execution times of the previously successful jobs get counted again, as 
> if those jobs had been re-run. 
> Here's an example: 
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> The reported usage is about 3x the actual usage.
> The assumption that I have is that the "fairness algorithm" that GitHub uses 
> to provide all Apache projects about the same amount of GitHub Actions 
> resources takes this flawed usage as the basis of its decisions.
> The reason why we are getting hit by this now is that there is a high number 
> of flaky test failures that cause almost every build to fail and we are 
> re-running a lot of builds.
> 
> Another problem is that the GitHub Actions search doesn't always show 
> all workflow runs that are running. This has happened before when the GitHub 
> Actions workflow search index was corrupted. GitHub Support resolved that by 
> rebuilding the search index with some manual admin operation behind the 
> scenes.
> 
> I'm proposing that we start shedding load from CI by cancelling build jobs 
> and selecting which jobs to process so that we get the CI issue resolved. We 
> might also have to disable required checks so that we have some way to get 
> changes merged while CI doesn't work properly.
> 
> I'm expecting lazy consensus on fixing CI unless someone proposes a better 
> plan. Let's keep everyone informed in this mailing list thread.
> 
> -Lari
> 
> 
> On 2022/09/06 14:41:07 Dave Fisher wrote:
> > We are going to need to take actions to fix our problems. See 
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > 
> > Jarek has done a large amount of GitHub Action work with Apache Airflow and 
> > his suggestions might be helpful. One of his suggestions was Apache Yetus. 
> > I think he means using the Maven plugins - 
> > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > 
> > 
> > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
> > > 
> > > The Apache Infra ticket is 
> > > https://issues.apache.org/jira/browse/INFRA-23633 . 
> > > 
> > > -Lari
> > > 
> > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > >> I asked for an update on the Apache org GitHub Actions usage stats from 
> > >> Gavin McDonald on the-asf slack in this thread: 
> > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > >>  .
> > >> 
> > >> I hope we get this issue resolved since it delays PR processing a lot.
> > >> 
> > >> -Lari
> > >> 
> > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > >>> Pulsar CI continues to be congested, and the build queue [1] is very 
> > >>> long at the moment. There are 147 build jobs in the queue and 16 jobs 
> > >>> in progress at the moment.
> > >>> 
> > >>> I would strongly advise everyone to use "personal CI" to mitigate the 
> > >>> issue of the long delay of CI feedback. You can simply open a PR to 
> > >>> your own personal fork of apache/pulsar to run the builds in your 
> > >>> "personal CI". There are more details in the previous emails in this 
> > >>> thread.
> > >>> 
> > >>> -Lari
> > >>> 
> > >>> [1] - build queue: 
> > >>> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > >>> 
> > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > >>>> 
> > >>>> I would strongly advise everyone to use "personal CI" to mitigate the 
> > >>>> issue of the long delay of CI feedback. You can simply open a PR to 
> > >>>> your own personal fork of apache/pulsar to run the builds in your 
> > >>>> "personal CI". There are more details in the previous email in this 
> > >>>> thread.
> > >>>> 
> > >>>> Some updates:
> > >>>> 
> > >>>> There has been a discussion with Gavin McDonald from ASF infra on 
> > >>>> the-asf slack about getting usage reports from GitHub to support the 
> > >>>> investigation. Slack thread is the same one mentioned in the previous 
> > >>>> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 
> > >>>> . Gavin already requested the usage report in GitHub UI, but it 
> > >>>> produced invalid results.
> > >>>> 
> > >>>> I made a change to mitigate a source of additional GitHub Actions 
> > >>>> overhead. 
> > >>>> In the past, each cherry-picked commit to a maintenance branch of 
> > >>>> Pulsar has triggered a lot of workflow runs. 
> > >>>> 
> > >>>> The solution for cancelling duplicate builds automatically is to add 
> > >>>> this definition to the workflow definition:
> > >>>> concurrency:
> > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > >>>>  cancel-in-progress: true
> > >>>> 
> > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > >>>> 
> > >>>> branch-2.10 change:
> > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > >>>> branch-2.9 change:
> > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > >>>> branch-2.8 change:
> > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > >>>> branch-2.7:
> > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > >>>> 
> > >>>> branch-2.11 already contains the necessary config for cancelling 
> > >>>> duplicate builds.
> > >>>> 
> > >>>> The benefit of the above change is that when multiple commits are 
> > >>>> cherry-picked to a branch at once, only the build of the last commit 
> > >>>> will get run eventually. The builds for the intermediate commits will 
> > >>>> get cancelled. Obviously there's a tradeoff here that we don't get the 
> > >>>> information if one of the earlier commits breaks the build. It's the 
> > >>>> cost that we need to pay. Nevertheless, our build is so flaky that it's 
> > >>>> hard to determine whether a failed build result is caused only by a 
> > >>>> flaky test or whether it's an actual failure. Because of this, we don't 
> > >>>> lose anything by cancelling builds. It's more important to save build 
> > >>>> resources. In the maintenance branches for 2.10 and older, the average 
> > >>>> total build time consumed is around 20 hours, which is a lot.
> > >>>> 
> > >>>> At this time, the overhead of maintenance branch builds doesn't seem 
> > >>>> to be the source of the problems. There must be some other issue which 
> > >>>> is possibly related to exceeding a usage quota. Hopefully we get the 
> > >>>> CI slowness issue solved asap.
> > >>>> 
> > >>>> BR,
> > >>>> 
> > >>>> Lari
> > >>>> 
> > >>>> 
> > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > >>>>> Hi,
> > >>>>> 
> > >>>>> GitHub Actions builds have been piling up in the build queue in the 
> > >>>>> last few days.
> > >>>>> I posted on bui...@apache.org 
> > >>>>> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and 
> > >>>>> created INFRA ticket 
> > >>>>> https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > >>>>> There's also a thread on the-asf slack, 
> > >>>>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > >>>>> 
> > >>>>> It seems that our build queue is finally getting picked up, but it 
> > >>>>> would be great to see if we hit quota and whether that is the cause 
> > >>>>> of pauses. 
> > >>>>> 
> > >>>>> Another issue is that the master branch broke after merging 2 
> > >>>>> conflicting PRs. 
> > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > >>>>> 
> > >>>>> Merging PRs will be slow until we have these 2 problems solved and 
> > >>>>> existing PRs rebased over the changes. Let's prioritize merging 
> > >>>>> #17300 before pushing more changes.
> > >>>>> 
> > >>>>> I'd like to point out that a good way to get build feedback before 
> > >>>>> sending a PR is to run builds on your personal GitHub Actions CI. 
> > >>>>> The benefit of this is that it doesn't consume the shared quota and 
> > >>>>> builds usually start instantly.
> > >>>>> There are instructions in the contributors guide about this. 
> > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > >>>>> You simply open PRs to your own fork of apache/pulsar to run builds 
> > >>>>> on your personal GitHub Actions CI.
> > >>>>> 
> > >>>>> BR,
> > >>>>> 
> > >>>>> Lari
