Thanks Sophie. Overall I agree with you.

I think 50% is too high as a general rule, and something like 30% might be more appropriate (going any lower might be too aggressive, given the infrastructure at hand).

The difficult part about a policy like this is that we don't really have statistics about it, so it will be up to reviewers to keep monitoring and to raise an "alert" (on the dev mailing list?) if we see too many failing builds.

How to achieve this? Personally, I think it would be best to agree on a "bug bash sprint". Not sure if folks are willing to sign up for this? The main question might be "when"? We recently discussed test stability in our team; we don't see capacity right now, but want to invest more in Q1. (Btw: this issue goes beyond the build and also affects system tests.)

How long do we have to take action in the above cases?

You propose one day, but that seems tricky. In the end, if a test is flaky, we won't know right away, will we?

How do we track failures?

I also filed tickets in the past and added comments to them whenever a test re-failed. While labor intensive, it seems to have worked best so far. Justine mentioned Gradle Enterprise (I have not looked into it yet, but maybe it can help reduce the manual labor)?
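
To illustrate what reducing that manual labor could look like, here is a rough sketch that tallies failures per test across the JUnit XML reports Gradle writes under build/test-results (just an illustration; the script name and report layout are assumptions and would need adjusting for our CI setup):

    # flaky_tally.py: rough sketch for ranking tests by failure count
    import sys
    import xml.etree.ElementTree as ET
    from collections import Counter
    from pathlib import Path

    def tally_failures(report_dir):
        # Count failures per test across all JUnit XML reports under report_dir.
        counts = Counter()
        for report in Path(report_dir).rglob("TEST-*.xml"):
            for case in ET.parse(report).getroot().iter("testcase"):
                # A <failure> or <error> child means the test case did not pass.
                if case.find("failure") is not None or case.find("error") is not None:
                    counts[f"{case.get('classname')}.{case.get('name')}"] += 1
        return counts

    if __name__ == "__main__":
        # Usage: python flaky_tally.py <directory of collected test results>
        for test, count in tally_failures(sys.argv[1]).most_common(20):
            print(f"{count:4d}  {test}")

Running something like this over the test results collected from a batch of PR or nightly builds would at least give us a ranked list of the worst offenders to file tickets against.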

How often does a test need to fail to be considered flaky enough to take action?

Guess it's a judgement call, and I don't have a strong opinion. But I agree that we might want to write down a rule. Then again, a rule only makes sense if we have data. If reviewers don't pay attention and don't comment on tickets so we can count how often a test fails, any number we put down won't be helpful.


In the end, to me it boils down to the willingness of all of us to tackle it, and to _make time_ to address flaky tests. On our side (ie, the KS teams at Confluent), we want to set aside more time for this in quarterly planning, but without reliable data it's hard to know which tests to spend time on for the biggest bang for the buck. Thus, to me the second cornerstone is to put in the manual labor of tracking the frequency of flaky tests, which is a group effort.


-Matthias


On 11/21/23 1:33 PM, Sophie Blee-Goldman wrote:
For some concrete data, here are the stats for the latest build on two
community PRs I am currently reviewing:

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14648/16/tests
- 18 unrelated test failures
- 13 unique tests
- only 1 out of 4 JDK builds were green with all tests passing

https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14735/4/tests
- 44(!) unrelated test failures
- not even going to count up the unique tests because there were too many
- 0 out of 4 JDK builds were green
- this particular build seemed to have a number of infra/timeout issues, so
it may have exacerbated the flakiness beyond the "usual", although as
others have noted, unstable infra is not uncommon

My point is, we clearly have a long way to go if we want to start enforcing
this policy and have any hope of merging any PRs and not driving away our
community of contributors.

This isn't meant to discourage anyone, actually the opposite: if we want to
start gating PRs on passing builds, we need to start tackling flaky tests
now!

On Tue, Nov 21, 2023 at 1:19 PM Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

In the interest of moving things forward, here is what we would need (in
my opinion) to start enforcing this:

    1. Get overall build failure percentage under a certain threshold
       1. What is an acceptable number here?
       2. How do we achieve this: wait until all of them are fixed,
       disable everything that's flaky right away, etc.
    2. Come up with concrete policy rules so there's no confusion. I think
    we need to agree on answers for these questions at least:
       1. What happens if a new PR introduces a new test that is revealed
       to be flaky?
       2. What happens if a new PR makes an old test become flaky?
       3. How long do we have to take action in the above cases?
       4. How do we track failures?
       5. How often does a test need to fail to be considered flaky enough
       to take action?

Here's my take on these questions, but would love to hear from others:

1.1) Imo the failure rate has to be under 50% at the very highest. At 50%
we still have half the builds failing, but 75% would pass after just one
retry, and 87.5% after two retries. Is that acceptable? Maybe we should aim
to kick things off with a higher success rate to give ourselves some wiggle
room over time, and have 50% be the absolute maximum failure rate -- if it
ever gets beyond that we trigger an emergency response (whether that be
blocking all feature work until the test failures are addressed, ending
this policy, etc)
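
To spell out that retry arithmetic (a quick sketch, assuming builds are
independent and fail at a fixed rate; the function name is just for
illustration):

    # Chance that at least one of (1 + retries) independent builds passes.
    # With a 50% failure rate: 0 retries -> 0.50, 1 retry -> 0.75,
    # 2 retries -> 0.875.
    def pass_after_retries(failure_rate, retries):
        return 1 - failure_rate ** (retries + 1)
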
1.2) I think we'd probably all agree that there's no way we'll be able to
triage all of the currently flaky tests in a reasonable time frame, but I'm
also wary of just disabling everything and forgetting them. Should we
consider making all of the currently-disabled tests a 3.7 release blocker?
Should we wait until after 3.7 to disable them and implement this policy to
give us a longer time to address them? Will we just procrastinate until the
next release deadline anyways? Many open questions here

2.1) We can either revert the entire PR or just disable the specific
test(s) that are failing. Personally I'm in favor of giving a short window
(see point #3) for the author to attempt a fix or else disable the specific
failing test, otherwise we revert the entire commit. This should guarantee
that tests aren't just introduced, disabled, and never looked at again. My
main concern here is that sometimes reverting a commit can mean resolving
merge conflicts that not everyone will have the context to do properly.
This is just another reason to keep the window short.
2.2) Again I think we should allow a (short) window of time for someone
with context to triage and take action, but after that, we need to either
revert the PR or disable the test. Neither of those feels like a good
option, since new changes can often induce timeout-related failures in my
experience, which are solely a problem with the test and not with the
change. Maybe if the test is failing very frequently/most of the time, then
we should revert the change, but if it becomes only occasionally flaky,
then we should disable the test, file a ticket, and mark it a blocker until
it can be triaged (the triaging can of course include demoting it from a
blocker if someone with context determines the test itself is at fault and
offers little value).
2.3) I think any longer than a day (or over the weekend) starts to become
an issue. Of course, we can/will be more lenient with tests that fail less
often, especially since it might just take a while to notice
2.4) Presumably the best way is to file tickets for every failure and
comment on that ticket for each subsequent failure. I used to do this quite
religiously, but I have to admit, it was actually a non-trivial timesuck.
If we can get to a low enough level of failure, however, this should be
more reasonable. We just need to trust everyone to pay attention and follow
up on failures they see. And will probably need to come up with a system to
avoid overcounting by different reviewers of the same PR/build
2.5) This one is actually the most tricky in my opinion. I think everyone
would agree that if a test is failing on all or most of the JDK runs within
a single build, it is problematic. And probably most of us would agree that
if it fails even once per build on almost all PRs, we need to take action.
On the other hand, a single test that fails once every 50 builds might feel
acceptable, but if we have 100 such tests, suddenly it's extremely
difficult to get a clean build. So what's the threshold for action?
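
To put a rough number on that last example (back-of-the-envelope only,
assuming the tests fail independently; the function name is just for
illustration):

    # Probability that a build with n independent tests comes back fully
    # green, if each test fails with probability p per build.
    # With p = 1/50 and n = 100: (1 - 0.02) ** 100 is roughly 0.13, i.e.
    # only about 13% of builds would be clean.
    def clean_build_probability(p, n):
        return (1 - p) ** n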

So...any thoughts on how to implement this policy? And when to actually
start enforcing it?

On Tue, Nov 21, 2023 at 1:18 PM Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

So...where did we land on this?

To take a few steps back and make sure we're all talking about the same
thing, I want to mention that I myself was responsible for merging a PR
that broke the build a few weeks ago. There was a warning that only
appeared in some of the versions, but when checking on the results I
immediately navigated to the "Tests" page before letting the main build
page load. I just assumed that if there were test failures that must mean
the build itself was fine, and with this tunnel-vision I missed that a
subset of JDK builds had failed.

I'm mentioning this now not to make excuses, but because I really hope
that it goes without saying that this was an honest (though sloppy)
mistake, and that no one genuinely believes it is acceptable to merge any
PR that causes any of the builds to fail to compile. Yet multiple people in
this thread so far have voiced support for "gating merges on the
successful completion of all parts of the build except tests". Just to be
totally clear, I really don't think that was ever in question -- though it
certainly doesn't hurt to remind everyone.

So, this thread is not about whether or not to merge with failing *builds*,
but whether it should be acceptable to merge with failing *tests*. It
seems like we're in agreement for the most part that the current system
isn't great, but I don't think we can (or rather, should) start enforcing
the new rule to only merge fully-green builds until we can actually get to
a reasonable percentage of green builds. Do we have any stats on how often
builds are failing right now due to flaky tests? Just speaking from
personal experience, it's been a while since I've seen a truly green build
with all tests passing.

Lastly, just to play devil's advocate for a moment, before we commit to
this I think we need to consider how such a policy would impact the
contributor experience. It's incredibly disheartening to go through all the
work of submitting a PR and getting it reviewed, only to be blocked from merging
at the last minute due to test failures completely outside that
contributor's control. We already struggle just getting some folks, new
contributors in particular, all the way through the review process without
abandoning their PRs. I do think we've been doing a better job lately, but
it's a cycle that ebbs and flows, and most community PRs still take an
admirable degree of patience and persistence just to get enough reviews to
reach the point of merging. So I just think we need to be careful not to
make this situation even worse by having to wait for a green build. I'm
just worried about our ability to stay on top of disabling tests,
especially if we need to wait for someone with enough context to make a
judgement call. Can we really rely on everyone to drop everything at any
time to check on a failing test? At the same time I would be hesitant to be
overly aggressive about reverting/disabling tests without having someone
who understands the context take a look.

That said, I do think it's worth trying this out as an experiment, as
long as we can be clear to frustrated contributors that this isn't
necessarily the new policy from here on out if it isn't going well.

On Thu, Nov 16, 2023 at 3:32 AM Igor Soarez <i...@soarez.me> wrote:

Hi all,

I think at least one of those is my fault, apologies.
I'll try to make sure all my tests are passing from now on.

It doesn't help that GitHub always shows that the tests have failed,
even when they have not. I suspect this is because Jenkins always
marks the builds as unstable, even when all tests pass, because
the "Archive JUnit-formatted test results" step seems to persistently
fail with "[Checks API] No suitable checks publisher found.".
e.g.
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14770/1/pipeline/

Can we get rid of that persistent failure and actually mark successful
test runs as green?

--
Igor


