Here is another patch that can reduce unnecessary workload: https://github.com/apache/pulsar/pull/17529
We don't create flaky-test issues/PRs frequently; it's on the order of tens per month. The project owner should be able to handle them manually in a few minutes per month (since candidates are already labeled), compared with now, where we run for every issue/PR that is opened or labeled.

Best,
tison.

tison <wander4...@gmail.com> wrote on Thu, Sep 8, 2022 at 01:27:

> Today the Pulsar repo runs at most about one workflow run at a time. It's a new situation I didn't notice before.
>
> > drop the "required checks"
>
> This can be dangerous for the health of the repo. I think the essential problem we face here is prioritizing specific PRs, rather than lifting the guard for all PRs.
>
> > Fix quarantined flaky tests
>
> But yes, to overcome the workload brought by unnecessary reruns, one solution could be to treat all tests as "unstable", un-require them, and add them back in a timed manner.
>
> Best,
> tison.
>
>
> Lari Hotari <lhot...@apache.org> wrote on Thu, Sep 8, 2022 at 01:15:
>
>> On 2022/09/07 16:59:33 tison wrote:
>> > > selecting which jobs to process
>> >
>> > Do you have a patch to implement this? IIRC it requires interacting with an outside service, or at least that we add an ok-to-test label.
>>
>> Very good idea, I didn't think that far ahead. It seems that Apache Spark has some solution for this: in the the-asf slack channel discussion it was mentioned that Spark requires contributors to run validation in their own personal GHA quota. I don't know how that is achieved.
>>
>> As you proposed, one possible solution would be to have a workflow that only proceeds when there's an "ok-to-test" label on the PR.
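>>
>> A minimal sketch of such a label gate, as a hypothetical workflow file (the "ok-to-test" label name and the build command are assumptions; the job-level "if" condition uses standard GitHub Actions expression syntax):
>>
>>   name: Pulsar CI (label-gated)
>>   on:
>>     pull_request:
>>       types: [opened, synchronize, reopened, labeled]
>>   jobs:
>>     build:
>>       # Proceed only when a committer has applied the "ok-to-test" label.
>>       if: contains(github.event.pull_request.labels.*.name, 'ok-to-test')
>>       runs-on: ubuntu-latest
>>       steps:
>>         - uses: actions/checkout@v2
>>         # Placeholder for the actual Pulsar build and test steps.
>>         - run: mvn -B -ntp verify
>>
>> With this gate, removing and re-adding the label would re-trigger a run through the "labeled" event, which gives committers a manual retry knob.
>>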
>> For the immediate selection of jobs to process, I have ways to clear the GHA build queue for apache/pulsar using the GHA API.
>> I clarified the proposed action plan in a follow-up message to the thread [1]. We would primarily process PRs that help us get out of the situation we are in.
>>
>> It would also be helpful if there were a way to escalate to ASF INFRA support and GitHub Support. However, the discussion in the ticket https://issues.apache.org/jira/browse/INFRA-23633 doesn't give much hope of that.
>>
>>
>> -Lari
>>
>> [1] https://lists.apache.org/thread/rpq12tzm4hx8kozpkphd2jyqr8cj0yj5
>>
>> On 2022/09/07 16:59:33 tison wrote:
>> > > selecting which jobs to process
>> >
>> > Do you have a patch to implement this? IIRC it requires interacting with an outside service, or at least that we add an ok-to-test label.
>> >
>> > Besides, it increases committers'/PMC members' workload - be aware of it, or most contributions will stall.
>> >
>> > Best,
>> > tison.
>> >
>> >
>> > Lari Hotari <lhot...@apache.org> wrote on Thu, Sep 8, 2022 at 00:47:
>> >
>> > > The problem with CI is becoming worse. The build queue is now at 235 jobs and the queue time is over 7 hours.
>> > >
>> > > We will need to start shedding load in the build queue and get some fixes in.
>> > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain details about some of the activities. I have created 2 GitHub Support tickets, but it usually takes up to a week to get a response.
>> > >
>> > > I have some assumptions about the issue, but they are just assumptions.
>> > > One oddity is that when "re-run failed jobs" is used in a large workflow, the execution times of previously successful jobs get counted as if they had run again.
>> > > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
>> > > The reported usage is about 3x the actual usage.
>> > > My assumption is that the "fairness algorithm" GitHub uses to give all Apache projects roughly the same amount of GitHub Actions resources takes this flawed usage as the basis of its decisions.
>> > > The reason we are getting hit by this now is that a high number of flaky test failures causes almost every build to fail, so we are re-running a lot of builds.
>> > >
>> > > Another problem is that the GitHub Actions search doesn't always show all workflow runs that are running. This has happened before, when the GitHub Actions workflow search index was corrupted. GitHub Support resolved that by rebuilding the search index with some manual admin operation behind the scenes.
>> > >
>> > > I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process, so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI isn't working properly.
>> > >
>> > > I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.
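>> > >
>> > > A sketch of one way the cancelling could be automated through the GitHub Actions REST API, as a hypothetical manually-triggered maintenance workflow (the workflow name and the blanket "cancel everything queued" policy are illustrative; the "list workflow runs" and "cancel a workflow run" endpoints are standard GitHub REST API calls, and the gh CLI is preinstalled on GitHub-hosted runners):
>> > >
>> > >   name: Shed CI load
>> > >   on: workflow_dispatch
>> > >   jobs:
>> > >     cancel-queued-runs:
>> > >       runs-on: ubuntu-latest
>> > >       # The default GITHUB_TOKEN needs "actions: write" to cancel runs.
>> > >       permissions:
>> > >         actions: write
>> > >       steps:
>> > >         - env:
>> > >             GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
>> > >           run: |
>> > >             # List all queued run ids, then cancel them one by one.
>> > >             gh api 'repos/apache/pulsar/actions/runs?status=queued' \
>> > >               --paginate --jq '.workflow_runs[].id' |
>> > >             while read -r run_id; do
>> > >               gh api -X POST "repos/apache/pulsar/actions/runs/${run_id}/cancel"
>> > >             done
>> > >
>> > > Narrowing the jq filter (for example by workflow name or by PR) would allow shedding load more selectively than cancelling everything queued.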
>> > >
>> > > -Lari
>> > >
>> > >
>> > > On 2022/09/06 14:41:07 Dave Fisher wrote:
>> > > > We are going to need to take action to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
>> > > >
>> > > > Jarek has done a large amount of GitHub Actions work with Apache Airflow, and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugin - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
>> > > >
>> > > >
>> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhot...@apache.org> wrote:
>> > > > >
>> > > > > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 .
>> > > > >
>> > > > > -Lari
>> > > > >
>> > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
>> > > > >> I asked Gavin McDonald for an update on the Apache org GitHub Actions usage stats on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
>> > > > >>
>> > > > >> I hope we get this issue resolved, since it delays PR processing a lot.
>> > > > >>
>> > > > >> -Lari
>> > > > >>
>> > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
>> > > > >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are currently 147 build jobs in the queue and 16 jobs in progress.
>> > > > >>>
>> > > > >>> I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous emails in this thread.
>> > > > >>>
>> > > > >>> -Lari
>> > > > >>>
>> > > > >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
>> > > > >>>
>> > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
>> > > > >>>> Pulsar CI continues to be congested, and the build queue is long.
>> > > > >>>>
>> > > > >>>> I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous email in this thread.
>> > > > >>>>
>> > > > >>>> Some updates:
>> > > > >>>>
>> > > > >>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. The Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in the GitHub UI, but it produced invalid results.
>> > > > >>>>
>> > > > >>>> I made a change to mitigate a source of additional GitHub Actions overhead. In the past, each commit cherry-picked to a Pulsar maintenance branch has triggered a lot of workflow runs.
>> > > > >>>>
>> > > > >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
>> > > > >>>>
>> > > > >>>>   concurrency:
>> > > > >>>>     group: ${{ github.workflow }}-${{ github.ref }}
>> > > > >>>>     cancel-in-progress: true
>> > > > >>>>
>> > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
>> > > > >>>>
>> > > > >>>> branch-2.10 change: https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
>> > > > >>>> branch-2.9 change: https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
>> > > > >>>> branch-2.8 change: https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
>> > > > >>>> branch-2.7 change: https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
>> > > > >>>>
>> > > > >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
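>> > > > >>>>
>> > > > >>>> For clarity, a minimal sketch of where the block sits in a workflow file (hypothetical workflow; only the concurrency stanza is the actual change, everything else is a placeholder):
>> > > > >>>>
>> > > > >>>>   name: Pulsar CI
>> > > > >>>>   on:
>> > > > >>>>     push:
>> > > > >>>>       branches: [ branch-2.10 ]
>> > > > >>>>   # Top-level block: a newer run in the same group cancels the older in-progress one.
>> > > > >>>>   concurrency:
>> > > > >>>>     group: ${{ github.workflow }}-${{ github.ref }}
>> > > > >>>>     cancel-in-progress: true
>> > > > >>>>   jobs:
>> > > > >>>>     build:
>> > > > >>>>       runs-on: ubuntu-latest
>> > > > >>>>       steps:
>> > > > >>>>         - uses: actions/checkout@v2
>> > > > >>>>         - run: mvn -B -ntp install
>> > > > >>>>
>> > > > >>>> Since the group key includes github.ref, runs for different branches never cancel each other; only stacked runs on the same branch collapse into the newest one.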
>> > > > >>>>
>> > > > >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build for the last commit will eventually run. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff: we don't find out if one of the earlier commits breaks the build. That's a cost we have to pay. Nevertheless, our build is so flaky that it's hard to determine whether a failed build is caused by a flaky test or by an actual failure. Because of this, we don't lose anything by cancelling builds; it's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours, which is a lot.
>> > > > >>>>
>> > > > >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue, possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved ASAP.
>> > > > >>>>
>> > > > >>>> BR,
>> > > > >>>>
>> > > > >>>> Lari
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
>> > > > >>>>> Hi,
>> > > > >>>>>
>> > > > >>>>> GitHub Actions builds have been piling up in the build queue over the last few days.
>> > > > >>>>> I posted on bui...@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created the INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
>> > > > >>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
>> > > > >>>>>
>> > > > >>>>> It seems that our build queue is finally getting picked up, but it would be great to see whether we hit a quota and whether that is the cause of the pauses.
>> > > > >>>>>
>> > > > >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs.
>> > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
>> > > > >>>>>
>> > > > >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased on the changes. Let's prioritize merging #17300 before pushing more changes.
>> > > > >>>>>
>> > > > >>>>> I'd like to point out that a good way to get build feedback before sending a PR is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota, and builds usually start instantly.
>> > > > >>>>> There are instructions about this in the contributors guide: https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
>> > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
>> > > > >>>>>
>> > > > >>>>> BR,
>> > > > >>>>>
>> > > > >>>>> Lari