To give some concrete data, here are the stats for the latest builds of two
community PRs I am currently reviewing:

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14648/16/tests
- 18 unrelated test failures
- 13 unique failing tests
- only 1 out of 4 JDK builds was green with all tests passing

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14735/4/tests
- 44(!) unrelated test failures
- not even going to count up the unique tests because there were too many
- 0 out of 4 JDK builds were green
- this particular build seemed to have a number of infra/timeout issues,
which may have exacerbated the flakiness beyond the "usual", although as
others have noted, unstable infra is not uncommon

My point is, we clearly have a long way to go if we want to start enforcing
this policy while still having any hope of merging PRs and not driving away
our community of contributors.

This isn't meant to discourage anyone; quite the opposite: if we want to
start gating PRs on passing builds, we need to start tackling flaky tests
now!

On Tue, Nov 21, 2023 at 1:19 PM Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

> In the interest of moving things forward, here is what we would need (in
> my opinion) to start enforcing this:
>
>    1. Get the overall build failure percentage under a certain threshold
>       1. What is an acceptable number here?
>       2. How do we achieve this: wait until all of the flaky tests are
>       fixed, disable everything that's flaky right away, etc.?
>    2. Come up with concrete policy rules so there's no confusion. I think
>    we need to agree on answers to at least these questions:
>       1. What happens if a new PR introduces a new test that is revealed
>       to be flaky?
>       2. What happens if a new PR makes an old test become flaky?
>       3. How long do we have to take action in the above cases?
>       4. How do we track failures?
>       5. How often does a test need to fail to be considered flaky enough
>       to take action?
>
> Here's my take on these questions, but I'd love to hear from others:
>
> 1.1) Imo the failure rate has to be under 50% at the very highest. At a 50%
> failure rate we still have half the builds failing, but 75% would pass after
> just one retry, and 87.5% after two retries. Is that acceptable? Maybe we
> should aim to kick things off with a higher success rate to give ourselves
> some wiggle room over time, and make 50% the absolute maximum failure rate --
> if it ever gets beyond that, we trigger an emergency response (whether that
> be blocking all feature work until the test failures are addressed, ending
> this policy, etc.)
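> (A quick sanity check on that arithmetic, assuming build failures are
> independent with failure rate f: the chance that at least one of k attempts
> is green is 1 - f^k, so with f = 0.5 that gives 50% for a single run, 75%
> with one retry, and 87.5% with two retries.)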
> 1.2) I think we'd probably all agree that there's no way we'll be able to
> triage all of the currently flaky tests in a reasonable time frame, but I'm
> also wary of just disabling everything and forgetting them. Should we
> consider making all of the currently-disabled tests a 3.7 release blocker?
> Should we wait until after 3.7 to disable them and implement this policy to
> give us a longer time to address them? Will we just procrastinate until the
> next release deadline anyway? Many open questions here.
>
> 2.1) We can either revert the entire PR or just disable the specific
> test(s) that are failing. Personally I'm in favor of giving a short window
> (see 2.3 below) for the author to attempt a fix or disable the specific
> failing test; otherwise we revert the entire commit. This should guarantee
> that tests aren't just introduced, disabled, and never looked at again. My
> main concern here is that sometimes reverting a commit can mean resolving
> merge conflicts that not everyone will have the context to do properly.
> This is just another reason to keep the window short.
> 2.2) Again I think we should allow a (short) window of time for someone
> with context to triage and take action, but after that, we need to either
> revert the PR or disable the test. Neither of those feels like a great
> option, since in my experience new changes can often induce timeout-related
> failures that are solely a problem with the test and not with the change.
> Maybe if the test is failing very frequently/most of the time, then
> we should revert the change, but if it becomes only occasionally flaky,
> then we should disable the test, file a ticket, and mark it a blocker until
> it can be triaged (the triaging can of course include demoting it from a
> blocker if someone with context determines the test itself is at fault and
> offers little value).
> 2.3) I think any longer than a day (or over the weekend) starts to become
> an issue. Of course, we can/will be more lenient with tests that fail less
> often, especially since it might just take a while to notice them.
> 2.4) Presumably the best way is to file tickets for every failure and
> comment on that ticket for each subsequent failure. I used to do this quite
> religiously, but I have to admit, it was actually a non-trivial timesuck.
> If we can get to a low enough level of failure, however, this should be
> more reasonable. We just need to trust everyone to pay attention and follow
> up on failures they see. And we will probably need to come up with a system
> to avoid double-counting when different reviewers report the same PR/build.
> 2.5) This one is actually the most tricky in my opinion. I think everyone
> would agree that if a test is failing on all or most of the JDK runs within
> a single build, it is problematic. And probably most of us would agree that
> if it fails even once per build on almost all PRs, we need to take action.
> On the other hand, a single test that fails once every 50 builds might feel
> acceptable, but if we have 100 such tests, suddenly it's extremely
> difficult to get a clean build. So what's the threshold for action?
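> (To put a rough number on that last point, again assuming independent
> failures: with 100 tests that each fail once every 50 runs, the chance of a
> fully green build is (1 - 1/50)^100 = 0.98^100, or roughly 13%.)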
>
> So...any thoughts on how to implement this policy? And when to actually
> start enforcing it?
>
> On Tue, Nov 21, 2023 at 1:18 PM Sophie Blee-Goldman <sop...@responsive.dev>
> wrote:
>
>> So...where did we land on this?
>>
>> To take a few steps back and make sure we're all talking about the same
>> thing, I want to mention that I myself was responsible for merging a PR
>> that broke the build a few weeks ago. There was a warning that only
>> appeared in some of the JDK versions, but when checking on the results I
>> immediately navigated to the "Tests" page before letting the main build
>> page load. I just assumed that if there were test failures, the build
>> itself must be fine, and with this tunnel vision I missed that a subset
>> of the JDK builds had failed.
>>
>> I'm mentioning this now not to make excuses, but because I really hope
>> that it goes without saying that this was an honest (though sloppy)
>> mistake, and that no one genuinely believes it is acceptable to merge any
>> PR that causes any of the builds to fail to compile. Yet multiple people in
>> this thread so far have voiced support for "gating merges on the
>> successful completion of all parts of the build except tests". Just to be
>> totally clear, I really don't think that was ever in question -- though it
>> certainly doesn't hurt to remind everyone.
>>
>> So, this thread is not about whether or not to merge with failing *builds*,
>> but about whether it should be acceptable to merge with failing *tests*. It
>> seems like we're in agreement for the most part that the current system
>> isn't great, but I don't think we can (or rather, should) start enforcing
>> the new rule to only merge fully-green builds until we can actually get to
>> a reasonable percentage of green builds. Do we have any stats on how often
>> builds are failing right now due to flaky tests? Just speaking from
>> personal experience, it's been a while since I've seen a truly green build
>> with all tests passing.
>>
>> Lastly, just to play devil's advocate for a moment: before we commit to
>> this, I think we need to consider how such a policy would impact the
>> contributor experience. It's incredibly disheartening to go through all the
>> work of submitting a PR and getting it reviewed, only to be blocked from
>> merging at the last minute due to test failures completely outside that
>> contributor's control. We already struggle just getting some folks, new
>> contributors in particular, all the way through the review process without
>> abandoning their PRs. I do think we've been doing a better job lately, but
>> it's a cycle that ebbs and flows, and most community PRs still take an
>> admirable degree of patience and persistence just to get enough reviews to
>> reach the point of merging. So I think we need to be careful not to make
>> this situation even worse by also making contributors wait for a green
>> build. I'm worried about our ability to stay on top of disabling tests,
>> especially if we need to wait for someone with enough context to make a
>> judgement call. Can we really rely on everyone to drop everything at any
>> time to check on a failing test? At the same time I would be hesitant to be
>> overly aggressive about reverting/disabling tests without having someone
>> who understands the context take a look.
>>
>> That said, I do think it's worth trying this out as an experiment, as
>> long as we are clear with frustrated contributors that, if it isn't going
>> well, this won't necessarily remain the policy from here on out.
>>
>> On Thu, Nov 16, 2023 at 3:32 AM Igor Soarez <i...@soarez.me> wrote:
>>
>>> Hi all,
>>>
>>> I think at least one of those is my fault, apologies.
>>> I'll try to make sure all my tests are passing from now on.
>>>
>>> It doesn't help that GitHub always shows that the tests have failed,
>>> even when they have not. I suspect this is because Jenkins always
>>> marks the builds as unstable, even when all tests pass: the
>>> "Archive JUnit-formatted test results" step seems to persistently
>>> fail with "[Checks API] No suitable checks publisher found.".
>>> e.g.
>>> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14770/1/pipeline/
>>>
>>> Can we get rid of that persistent failure and actually mark successful
>>> test runs as green?
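>>>
>>> (Not an authoritative fix, but a sketch of one option: assuming the
>>> pipeline publishes results via the JUnit plugin's junit step, that
>>> Checks API message can be suppressed by skipping the checks-publishing
>>> attempt in the Jenkinsfile, e.g.
>>>
>>>   // archive the JUnit XML without trying to publish a GitHub check
>>>   junit(testResults: '**/build/test-results/**/TEST-*.xml',
>>>         skipPublishingChecks: true)
>>>
>>> The path pattern is only illustrative. Whether this alone turns
>>> successful runs green would still need verifying, since the junit step
>>> also marks a run UNSTABLE whenever any test fails.)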
>>>
>>> --
>>> Igor
>>>
>>
