Thanks for starting this discussion, David! I totally agree with "no"!

I think there is no excuse whatsoever for merging PRs with compilation errors (except for the honest mistake of conflicting PRs that got merged interleaved). -- Every committer must(!) check the Jenkins status before merging to avoid such an issue.

The same goes for actual, permanently broken tests: if there is no green build and the same test failed across multiple Jenkins runs, a committer should detect this and must not merge the PR.

Given the current state of the CI pipeline, it seems possible to get green runs, and thus I support the policy (that we actually always had) to only merge if there is at least one green build. If committers got sloppy about this, we need to call it out and put a stop to that practice.

(The only exception to the above policy would be a very unstable state in which getting a green build is not possible at all due to too many flaky tests -- in this case, I would accept merging even if there is no green build, but committers need to manually ensure that every test passed in at least one test run. -- We had this in the past, but I don't think we are in such a bad situation right now.)

About disabling tests: I was never a fan of this, because in my experience those tests are not fixed any time soon, especially because we do not consider such tickets as release blockers. To me, seeing tests fail on PR builds is actually a good forcing function for people to feel the pain, and thus get motivated to make time to fix the tests.

I have to admit that I was a little bit sloppy about paying attention to flaky tests recently, and I highly appreciate this effort. Also thanks to everyone who actually filed a ticket! IMHO, we should file a ticket for every flaky test, and also keep adding comments each time we see a test fail, so we can track the frequency at which a test fails and fix the most pressing ones first.

To me, the best forcing function to get tests stabilized is to file tickets and consider them release blockers. Disabling tests does not really help much IMHO to tackle the problem (we can of course still disable them to get noise out of the system, but that only introduces testing gaps for the time being and also does not help us figure out how often a test fails, so it's not a solution to the problem IMHO).
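(To make concrete what disabling means in practice: it's usually just an annotation on the test, roughly like the sketch below -- assuming a JUnit 5 test; the class, method, and ticket number are made up. The annotation keeps PR builds quiet, but as said above, it tells us nothing about how often the test would still fail.)

  // Hypothetical example -- test class, method, and ticket number are made up.
  import org.junit.jupiter.api.Disabled;
  import org.junit.jupiter.api.Test;

  class SomeFlakyIntegrationTest {

      // Once disabled, the test no longer runs on PR builds, so nobody
      // "feels the pain" anymore, and the coverage gap stays until the
      // tracking ticket is actually picked up and fixed.
      @Disabled("KAFKA-XXXXX: flaky, see ticket for failure history")
      @Test
      void shouldCommitOffsetsUnderBrokerRestart() {
          // ...
      }
  }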


-Matthias

On 11/13/23 11:40 PM, Sagar wrote:
Hi Divij,

I think this proposal overall makes sense. My only nit, more of a suggestion, is that we also consider a label called newbie++ [1] for flaky tests if we are considering adding newbie as a label. I think some of the flaky tests need familiarity with the codebase or the test setup, so as a first-time contributor it might be difficult to pick them up. newbie++ IMO covers that aspect.

[1]
https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22

Let me know what you think.

Thanks!
Sagar.

On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya <divijvaidy...@gmail.com>
wrote:

  Please, do it.
We can use specific labels to effectively filter those tickets.

We already have a label and a way to discover flaky tests. They are tagged
with the label "flaky-test" [1]. There is also a label "newbie" [2] meant
for folks who are new to the Apache Kafka code base.
My suggestion is to send a broader email to the community (since many will
miss details in this thread) with a call to action for committers to
volunteer as "shepherds" for these tickets. I can send one out once we have
some consensus wrt next steps in this thread.


[1]

https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC


[2] https://kafka.apache.org/contributing -> Finding a project to work on


Divij Vaidya



On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков <nizhi...@apache.org>
wrote:


To kickstart this effort, we can publish a list of such tickets in the
community and assign one or more committers the role of a "shepherd" for
each ticket.

Please, do it.
We can use specific labels to effectively filter those tickets.

On Nov 13, 2023, at 15:16, Divij Vaidya <divijvaidy...@gmail.com>
wrote:

Thanks for bringing this up David.

My primary concern revolves around the possibility that the currently
disabled tests may remain inactive indefinitely. We currently have
unresolved JIRA tickets for flaky tests that have been pending for an
extended period. I am inclined to support the idea of disabling these tests
temporarily and merging changes only when the build is successful, provided
there is a clear plan for re-enabling them in the future.

To address this issue, I propose the following measures:

1\ Foster a supportive environment for new contributors within the
community, encouraging them to take on tickets associated with flaky tests.
This initiative would require individuals familiar with the relevant code
to offer guidance to those undertaking these tasks. Committers should
prioritize reviewing and addressing these tickets within their available
bandwidth. To kickstart this effort, we can publish a list of such tickets
in the community and assign one or more committers the role of a "shepherd"
for each ticket.

2\ Implement a policy to block minor version releases until the Release
Manager (RM) is satisfied that the disabled tests do not result in gaps in
our testing coverage. The RM may rely on Subject Matter Experts (SMEs) in
the specific code areas to provide assurance before giving the green light
for a release.

3\ Set a community-wide goal for 2024 to achieve a stable Continuous
Integration (CI) system. This goal should encompass projects such as
refining our test suite to eliminate flakiness and addressing
infrastructure issues if necessary. By publishing this goal, we create a
shared vision for the community in 2024, fostering alignment on our
objectives. This alignment will aid in prioritizing tasks for community
members and guide reviewers in allocating their bandwidth effectively.

--
Divij Vaidya



On Sun, Nov 12, 2023 at 2:58 AM Justine Olshan
<jols...@confluent.io.invalid>
wrote:

I will say that I have also seen tests whose flakiness seems to come and
go: they may be ok for some time, and then suddenly the CI is overloaded
and we see issues.
I have also seen the CI struggling with running out of space recently, so
I wonder if we can also try to improve things on that front.

FWIW, I noticed, filed, or commented on several flaky test JIRAs last week.
I'm happy to try to get to green builds, but everyone needs to be on board.

https://issues.apache.org/jira/browse/KAFKA-15529
https://issues.apache.org/jira/browse/KAFKA-14806
https://issues.apache.org/jira/browse/KAFKA-14249
https://issues.apache.org/jira/browse/KAFKA-15798
https://issues.apache.org/jira/browse/KAFKA-15797
https://issues.apache.org/jira/browse/KAFKA-15690
https://issues.apache.org/jira/browse/KAFKA-15699
https://issues.apache.org/jira/browse/KAFKA-15772
https://issues.apache.org/jira/browse/KAFKA-15759
https://issues.apache.org/jira/browse/KAFKA-15760
https://issues.apache.org/jira/browse/KAFKA-15700

I've also seen that the KRaft transactions tests often flakily fail because
the producer id is not allocated and the test times out.
I can file a JIRA for that too.

Hopefully this is a place we can start from.

Justine

On Sat, Nov 11, 2023 at 11:35 AM Ismael Juma <m...@ismaeljuma.com>
wrote:

On Sat, Nov 11, 2023 at 10:32 AM John Roesler <vvcep...@apache.org>
wrote:

In other words, I’m biased to think that new flakiness indicates
non-deterministic bugs more often than it indicates a bad test.


My experience is exactly the opposite. As someone who has tracked many of
the flaky fixes, the vast majority of the time they are an issue with the
test.

Ismael





