Thanks for that link, Ismael -- I'd never seen it before. That's super useful.

In fact, it helped me realize another area where we could really improve
things. I took a look at the worst offenders on that chart and was
debugging the 2nd most flaky Streams test when I expanded the timeframe and
saw that it actually went from failing 0% of the time to being, well, the
2nd worst in all of Streams. The specific date it started failing made it
incredibly straightforward to find the culprit, so again, thanks for the
resource Ismael!

It's lucky this discussion picked up when it did because if I'd waited
another week to inspect this test, the 90-day retention would have eaten
the initial spike, and it would have been much, much harder to debug (the
culprit was not at all obvious without this hint).

But I think it's really concerning that we went almost three months without
noticing that a new commit had caused a formerly-stable test to start failing so
often. Obviously it's not practical to expect everyone to remember which
tests are flaky and which ones are not (yet), so it would be great if we
could set up some kind of alerting on sudden increases in test failure
rates. Have we ever looked into this?
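For illustration only (not a concrete proposal): even something as simple as comparing each test's recent failure rate against its own trailing baseline would have flagged this case within days. Here's a rough sketch, assuming we could export per-test results from Gradle Enterprise or Jenkins as (test, date, failed) records -- the record format and the thresholds are made up:

```python
from collections import defaultdict
from datetime import date, timedelta

def detect_spikes(records, today, baseline_days=60, recent_days=7, min_ratio=0.2):
    """records: iterable of (test_name, run_date, failed) tuples, failed in {0, 1}.

    Flags tests whose failure rate over the last `recent_days` exceeds their
    rate over the preceding baseline window by at least `min_ratio` (absolute
    difference in failure fraction). Window sizes and threshold are arbitrary.
    """
    recent_cutoff = today - timedelta(days=recent_days)
    baseline_cutoff = today - timedelta(days=baseline_days)
    # per test: [baseline failures, baseline runs, recent failures, recent runs]
    stats = defaultdict(lambda: [0, 0, 0, 0])
    for name, run_date, failed in records:
        if run_date < baseline_cutoff:
            continue  # older than the retention window we care about
        s = stats[name]
        if run_date >= recent_cutoff:
            s[2] += failed
            s[3] += 1
        else:
            s[0] += failed
            s[1] += 1
    alerts = []
    for name, (bf, bt, rf, rt) in stats.items():
        base_rate = bf / bt if bt else 0.0
        recent_rate = rf / rt if rt else 0.0
        if rt and recent_rate - base_rate >= min_ratio:
            alerts.append((name, base_rate, recent_rate))
    return alerts
```

Running something like this daily over the exported history and posting anything it returns to the dev list would surface a regression like the Streams one within a week instead of three months.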

On Tue, Nov 28, 2023 at 8:26 AM Chris Egerton <fearthecel...@gmail.com>
wrote:

> Hi David,
>
> Thanks for continuing to drive this.
>
> Yes, ignoreFailures does come with some risk and it's only suitable as an
> intermediate step as part of a long-term plan that eventually gates all
> merges on builds with passing tests. I brought it up because it would allow
> us to immediately prevent any non-green builds from being merged. It also
> wouldn't practically change much in the review process; right now we have
> to manually check test results for every PR because the majority of builds
> are yellow due to flakiness, and that manual step would still be necessary
> if we started using ignoreFailures. If there's a way to allow yellow and
> green builds to be merged then there's no need to use ignoreFailures as
> that option would be strictly inferior.
>
> Cheers,
>
> Chris
>
> On Tue, Nov 28, 2023 at 2:53 AM David Jacot <dja...@confluent.io.invalid>
> wrote:
>
> > Hi all,
> >
> > I am still experimenting with reducing the noise of flaky tests in build
> > results. I should have results to share early next week.
> >
> > Chris, I am also for a programmatic gate. Regarding using ignoreFailures,
> > it seems risky because the build may be green but with failed tests, no?
> >
> > I would also like to make it clear that the current rule applies until we
> > agree on a way forward here. At minimum, I think that a build should be
> > yellow for all the combinations and the failed tests should have been
> > triaged to ensure that they are not related to the changes. We should not
> > merge when a build is red or has not completed.
> >
> > Best,
> > David
> >
> > On Sat, Nov 25, 2023 at 5:25 AM Chris Egerton <fearthecel...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > There's a lot to catch up on here but I wanted to clarify something.
> > > Regarding this comment from Sophie:
> > >
> > >
> > > > Yet multiple people in this thread so
> > > far have voiced support for "gating merges on the successful completion
> > of
> > > all parts of the build except tests". Just to be totally clear, I
> really
> > > don't think that was ever in question -- though it certainly doesn't
> hurt
> > > to remind everyone.
> > >
> > > > So, this thread is not about whether or not to merge with failing
> > > *builds, *but it's
> > > whether it should be acceptable to merge with failing *tests.*
> > >
> > >
> > > I think there's a misunderstanding here. I was suggesting
> > > programmatic gating, not manual. If we could disable these types of
> > changes
> > > from being merged, instead of relying on committers to check and
> > interpret
> > > Jenkins results, that'd be a quick win IMO. And, because of the
> > > already-discussed issues with flaky tests, it seemed infeasible to block
> > > PRs with failing tests from being merged--just ones failing other parts
> > > of the build.
> > >
> > > However, I think the retry logic brought up by David could be
> sufficient
> > to
> > > skip that kind of intermediate step and allow us to just start
> > > programmatically disabling PR merges if the build (including tests)
> > > fails.
> > > But if anyone's interested, we can still prevent failing tests from
> > failing
> > > the build with the ignoreFailures property [1].
> > >
> > > [1] -
> > >
> > >
> >
> https://docs.gradle.org/current/dsl/org.gradle.api.tasks.testing.Test.html#org.gradle.api.tasks.testing.Test:ignoreFailures
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Wed, Nov 22, 2023 at 3:00 AM Ismael Juma <m...@ismaeljuma.com> wrote:
> > >
> > > > I think it breaks the Jenkins output otherwise. Feel free to test it
> > via
> > > a
> > > > PR.
> > > >
> > > > Ismael
> > > >
> > > > On Wed, Nov 22, 2023, 12:42 AM David Jacot
> <dja...@confluent.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Ismael,
> > > > >
> > > > > No, I was not aware of KAFKA-12216. My understanding is that we
> could
> > > > still
> > > > > do it without the JUnitFlakyTestDataPublisher plugin and we could
> use
> > > > > gradle enterprise for this. Or do you think that reporting the
> flaky
> > > > tests
> > > > > in the build results is required?
> > > > >
> > > > > David
> > > > >
> > > > > On Wed, Nov 22, 2023 at 9:35 AM Ismael Juma <m...@ismaeljuma.com>
> > wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > Did you take a look at
> > > > https://issues.apache.org/jira/browse/KAFKA-12216
> > > > > ?
> > > > > > I
> > > > > > looked into this option already (yes, there isn't much that we
> > > haven't
> > > > > > considered in this space).
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Wed, Nov 22, 2023 at 12:24 AM David Jacot
> > > > <dja...@confluent.io.invalid
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Thanks for the good discussion and all the comments. Overall,
> it
> > > > seems
> > > > > > that
> > > > > > > we all agree on the bad state of our CI. That's a good first
> > step!
> > > > > > >
> > > > > > > I have talked to a few folks this week about it and it seems
> that
> > > > many
> > > > > > > folks (including me) are not comfortable with merging PRs at
> the
> > > > moment
> > > > > > > because the results of our builds are so bad. I had 40+ failed
> > > tests
> > > > in
> > > > > > one
> > > > > > > of my PRs, all unrelated to my changes. It is really hard to be
> > > > > > productive
> > > > > > > with this.
> > > > > > >
> > > > > > > Personally, I really want to move towards requiring a green
> build
> > > to
> > > > > > merge
> > > > > > > to trunk because this is a clear and binary signal. I agree
> that
> > we
> > > > > need
> > > > > > to
> > > > > > > stabilize the builds before we could even require this so here
> is
> > > my
> > > > > > > proposal.
> > > > > > >
> > > > > > > 1) We could leverage the `reports.junitXml.mergeReruns` option
> in
> > > > > gradle.
> > > > > > > From the doc [1]:
> > > > > > >
> > > > > > > > When mergeReruns is enabled, if a test fails but is then
> > retried
> > > > and
> > > > > > > succeeds, its failures will be recorded as <flakyFailure>
> instead
> > > of
> > > > > > > <failure>, within one <testcase>. This is effectively the
> > reporting
> > > > > > > produced by the surefire plugin of Apache Maven™ when enabling
> > > > reruns.
> > > > > If
> > > > > > > your CI server understands this format, it will indicate that
> the
> > > > test
> > > > > > was
> > > > > > > flaky. If it does not, it will indicate that the test
> succeeded
> > > as
> > > > it
> > > > > > > will ignore the <flakyFailure> information. If the test does
> not
> > > > > succeed
> > > > > > > (i.e. it fails for every retry), it will be indicated as having
> > > > failed
> > > > > > > whether your tool understands this format or not.
> > > > > > > > When mergeReruns is disabled (the default), each execution
> of a
> > > > test
> > > > > > will
> > > > > > > be listed as a separate test case.
> > > > > > >
> > > > > > > It would not resolve all the flaky tests for sure but it would
> at
> > > > least
> > > > > > > reduce the noise. I see this as a means to get to green builds
> > > > faster.
> > > > > I
> > > > > > > played a bit with this setting and I discovered [2]. I was
> hoping
> > > > that
> > > > > > [3]
> > > > > > > could help to resolve it but I need to confirm.
> > > > > > >
> > > > > > > 2) I suppose that we would still have flaky tests preventing us
> > > from
> > > > > > > getting a green build even with the setting in place. For
> those,
> > I
> > > > > think
> > > > > > > that we need to review them one by one and decide whether we
> want
> > > to
> > > > > fix
> > > > > > or
> > > > > > > disable them. This is a short term effort to help us get to
> green
> > > > > builds.
> > > > > > >
> > > > > > > 3) When we get to a point where we can get green builds
> > > consistently,
> > > > > we
> > > > > > > could enforce it.
> > > > > > >
> > > > > > > 4) Flaky tests won't disappear with this. They are just hidden.
> > > > > > Therefore,
> > > > > > > we also need a process to review the flaky tests and address
> > them.
> > > > > Here,
> > > > > > I
> > > > > > > think that we could leverage the dashboard shared by Ismael.
> One
> > > > > > > possibility would be to review it regularly and decide for each
> > > test
> > > > > > > whether it should be fixed, disabled or even removed.
> > > > > > >
> > > > > > > Please let me know what you think.
> > > > > > >
> > > > > > > Another angle that we could consider is improving the CI
> > > > infrastructure
> > > > > > as
> > > > > > > well. I think that many of those flaky tests are due to
> > overloaded
> > > > > > Jenkins
> > > > > > > workers. We should perhaps discuss with the infra team to see
> > > whether
> > > > > we
> > > > > > > could do something there.
> > > > > > >
> > > > > > > Best,
> > > > > > > David
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > >
> > >
> https://docs.gradle.org/current/userguide/java_testing.html#mergereruns
> > > > > > > [2] https://github.com/gradle/gradle/issues/23324
> > > > > > > [3] https://github.com/apache/kafka/pull/14687
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 22, 2023 at 4:10 AM Ismael Juma <m...@ismaeljuma.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > We have a dashboard already:
> > > > > > > >
> > > > > > > > [image: image.png]
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.sortField=FLAKY
> > > > > > > >
> > > > > > > > On Tue, Nov 14, 2023 at 10:41 PM Николай Ижиков <
> > > > nizhi...@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hello guys.
> > > > > > > >>
> > > > > > > >> I want to tell you about one more approach to dealing with
> > > > > > > >> flaky tests. We adopted this approach in the Apache Ignite
> > > > > > > >> community, so maybe it can be helpful for Kafka as well.
> > > > > > > >>
> > > > > > > >> TL;DR: the Apache Ignite community has a tool that provides test
> > > > > > > >> statistics and can tell whether a PR introduces new failures.
> > > > > > > >>
> > > > > > > >> Apache Ignite has many tests.
> > > > > > > >> The latest «Run All» contains around 75k.
> > > > > > > >> Most of them are integration-style, so the number of flaky ones
> > > > > > > >> is significant.
> > > > > > > >>
> > > > > > > >> We built a tool - Team City Bot [1]
> > > > > > > >> that provides combined statistics on flaky tests [2]
> > > > > > > >>
> > > > > > > >> This tool can compare the Run All results for a PR and master.
> > > > > > > >> If all is OK, the bot can comment on the Jira ticket with a
> > > > > > > >> visa [3]
> > > > > > > >>
> > > > > > > >> A visa is a proof of PR quality for Ignite committers.
> > > > > > > >> And we can sort out the most flaky tests and prioritize fixes
> > > > > > > >> using the bot's statistics [2]
> > > > > > > >>
> > > > > > > >> The TC Bot is integrated with TeamCity only, for now.
> > > > > > > >> But if the Kafka community is interested, we can try to
> > > > > > > >> integrate it with Jenkins.
> > > > > > > >>
> > > > > > > >> [1] https://github.com/apache/ignite-teamcity-bot
> > > > > > > >> [2]
> > > > > > >
> > > https://tcbot2.sbt-ignite-dev.ru/current.html?branch=master&count=10
> > > > > > > >> [3]
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/IGNITE-19950?focusedCommentId=17767394&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17767394
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> > On Nov 15, 2023, at 09:18, Ismael Juma <m...@ismaeljuma.com>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > To use the pain analogy, people seem to have really good
> > > > > painkillers
> > > > > > > and
> > > > > > > >> > hence they somehow don't feel the pain already. ;)
> > > > > > > >> >
> > > > > > > >> > The reality is that important and high quality tests will
> > get
> > > > > fixed.
> > > > > > > >> Poor
> > > > > > > >> > quality tests (low signal to noise ratio) might not get
> > fixed
> > > > and
> > > > > > > >> that's ok.
> > > > > > > >> >
> > > > > > > >> > I'm not opposed to marking the tests as release blockers
> as
> > a
> > > > > > starting
> > > > > > > >> > point, but I'm saying it's fine if people triage them and
> > > decide
> > > > > > they
> > > > > > > >> are
> > > > > > > >> > not blockers. In fact, that has already happened in the
> > past.
> > > > > > > >> >
> > > > > > > >> > Ismael
> > > > > > > >> >
> > > > > > > >> > On Tue, Nov 14, 2023 at 10:02 PM Matthias J. Sax <
> > > > > mj...@apache.org>
> > > > > > > >> wrote:
> > > > > > > >> >
> > > > > > > >> >> I agree on the test gap argument. However, my worry is,
> if
> > we
> > > > > don't
> > > > > > > >> >> "force the pain", it won't get fixed at all. -- I also
> > know,
> > > > that
> > > > > > we
> > > > > > > >> try
> > > > > > > >> >> to find a working approach for many years...
> > > > > > > >> >>
> > > > > > > >> >> My take is that if we disable a test and file a non-blocking
> > > > > > > >> >> Jira, it's basically the same as just deleting the test
> > > > > > > >> >> altogether and never talking about it again. -- I believe
> > > > > > > >> >> this is not what we aim for; we aim for good test coverage
> > > > > > > >> >> and a way to get these tests fixed.
> > > > > > > >> >>
> > > > > > > >> >> Thus IMHO we need some forcing function: either keep the
> > > > > > > >> >> tests and feel the pain on every PR, or disable the test and
> > > > > > > >> >> file a blocker JIRA so the pain surfaces on a release,
> > > > > > > >> >> forcing us to do something about it.
> > > > > > > >> >>
> > > > > > > >> >> If there is no forcing function, it basically means we
> are
> > > > > willing
> > > > > > to
> > > > > > > >> >> accept test gaps forever.
> > > > > > > >> >>
> > > > > > > >> >>
> > > > > > > >> >> -Matthias
> > > > > > > >> >>
> > > > > > > >> >> On 11/14/23 9:09 PM, Ismael Juma wrote:
> > > > > > > >> >>> Matthias,
> > > > > > > >> >>>
> > > > > > > >> >>> Flaky tests are worse than useless. I know engineers
> find
> > it
> > > > > hard
> > > > > > to
> > > > > > > >> >>> disable them because of the supposed test gap (I find it
> > > hard
> > > > > > too),
> > > > > > > >> but
> > > > > > > >> >> the
> > > > > > > >> >>> truth is that the test gap is already there! No-one
> blocks
> > > > > merging
> > > > > > > >> PRs on
> > > > > > > >> >>> flaky tests, but they do get used to ignoring build
> > > failures.
> > > > > > > >> >>>
> > > > > > > >> >>> The current approach has been attempted for nearly a
> > decade
> > > > and
> > > > > it
> > > > > > > has
> > > > > > > >> >>> never worked. I think we should try something different.
> > > > > > > >> >>>
> > > > > > > >> >>> When it comes to marking flaky tests as release
> blockers,
> > I
> > > > > don't
> > > > > > > >> think
> > > > > > > >> >>> this should be done as a general rule. We should instead
> > > > assess
> > > > > > on a
> > > > > > > >> case
> > > > > > > >> >>> by case basis, same way we do it for bugs.
> > > > > > > >> >>>
> > > > > > > >> >>> Ismael
> > > > > > > >> >>>
> > > > > > > >> >>> On Tue, Nov 14, 2023 at 5:02 PM Matthias J. Sax <
> > > > > mj...@apache.org
> > > > > > >
> > > > > > > >> >> wrote:
> > > > > > > >> >>>
> > > > > > > >> >>>> Thanks for starting this discussion David! I totally
> > agree
> > > to
> > > > > > "no"!
> > > > > > > >> >>>>
> > > > > > > >> >>>> I think there is no excuse whatsoever for merging PRs
> > with
> > > > > > > >> compilation
> > > > > > > >> >>>> errors (except an honest mistake for conflicting PRs
> that
> > > got
> > > > > > > merged
> > > > > > > >> >>>> interleaved). -- Every committer must(!) check the
> > Jenkins
> > > > > status
> > > > > > > >> before
> > > > > > > >> >>>> merging to avoid such an issue.
> > > > > > > >> >>>>
> > > > > > > >> >>>> Similar for actual permanently broken tests. If there
> is
> > no
> > > > > green
> > > > > > > >> build,
> > > > > > > >> >>>> and the same test failed across multiple Jenkins runs,
> a
> > > > > > committer
> > > > > > > >> >>>> should detect this and cannot merge a PR.
> > > > > > > >> >>>>
> > > > > > > >> >>>> Given the current state of the CI pipeline, it seems
> > > possible
> > > > > to
> > > > > > > get
> > > > > > > >> >>>> green runs, and thus I support the policy (that we
> > actually
> > > > > > always
> > > > > > > >> had)
> > > > > > > >> >>>> to only merge if there is at least one green build. If
> > > > > committers
> > > > > > > got
> > > > > > > >> >>>> sloppy about this, we need to call it out and put a
> hold
> > on
> > > > > this
> > > > > > > >> >> practice.
> > > > > > > >> >>>>
> > > > > > > >> >>>> (The only exception from the above policy would be a
> very
> > > > > > unstable
> > > > > > > >> >>>> status for which getting a green build is not possible
> at
> > > > all,
> > > > > > due
> > > > > > > to
> > > > > > > >> >>>> too many flaky tests -- for this case, I would accept
> to
> > > > merge
> > > > > > even
> > > > > > > >> >>>> if there is no green build, but committers need to manually
> > > > ensure
> > > > > > that
> > > > > > > >> >>>> every test did pass in at least one test run. -- We had
> > > this
> > > > in
> > > > > > the
> > > > > > > >> >>>> past, but I don't think we are in such a bad situation
> > > > > > > >> >>>> right now).
> > > > > > > >> >>>>
> > > > > > > >> >>>> About disabling tests: I was never a fan of this,
> because
> > > in
> > > > my
> > > > > > > >> >>>> experience those tests are not fixed any time soon.
> > > > Especially,
> > > > > > > >> because
> > > > > > > >> >>>> we do not consider such tickets as release blockers. To
> > me,
> > > > > > seeing
> > > > > > > >> tests
> > > > > > > >> >>>> fails on PR build is actually a good forcing function
> for
> > > > > people
> > > > > > to
> > > > > > > >> feel
> > > > > > > >> >>>> the pain, and thus get motivated to make time to fix
> the
> > > > tests.
> > > > > > > >> >>>>
> > > > > > > >> >>>> I have to admit that I was a little bit sloppy paying
> > > > attention
> > > > > > to
> > > > > > > >> flaky
> > > > > > > >> >>>> tests recently, and I highly appreciate this effort.
> Also
> > > > > thanks
> > > > > > to
> > > > > > > >> >>>> everyone who actually filed a ticket! IMHO, we should
> > file
> > > a
> > > > > > ticket
> > > > > > > >> for
> > > > > > > >> >>>> every flaky test, and also keep adding comments each
> time
> > > we
> > > > > see
> > > > > > a
> > > > > > > >> test
> > > > > > > >> >>>> fail to be able to track the frequency at which a tests
> > > > fails,
> > > > > so
> > > > > > > we
> > > > > > > >> can
> > > > > > > >> >>>> fix the most pressing ones first.
> > > > > > > >> >>>>
> > > > > > > >> >>>> To me, the best forcing function to get tests stabilized
> > is
> > > to
> > > > > > file
> > > > > > > >> >>>> tickets and consider them release blockers. Disabling
> > tests
> > > > > does
> > > > > > > not
> > > > > > > >> >>>> really help much IMHO to tackle the problem (we can of
> > > course
> > > > > > still
> > > > > > > >> >>>> disable them to get noise out of the system, but it
> would
> > > > only
> > > > > > > >> introduce
> > > > > > > >> >>>> testing gaps for the time being and also does not help
> to
> > > > > figure
> > > > > > > out
> > > > > > > >> how
> > > > > > > >> >>>> often a test fails, so it's not a solution to the
> problem
> > > > IMHO)
> > > > > > > >> >>>>
> > > > > > > >> >>>>
> > > > > > > >> >>>> -Matthias
> > > > > > > >> >>>>
> > > > > > > >> >>>> On 11/13/23 11:40 PM, Sagar wrote:
> > > > > > > >> >>>>> Hi Divij,
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> I think this proposal overall makes sense. My only nit
> > > sort
> > > > > of a
> > > > > > > >> >>>> suggestion
> > > > > > > >> >>>>> is that let's also consider a label called newbie++[1]
> > for
> > > > > flaky
> > > > > > > >> tests
> > > > > > > >> >> if
> > > > > > > >> >>>>> we are considering adding newbie as a label. I think
> > some
> > > of
> > > > > the
> > > > > > > >> flaky
> > > > > > > >> >>>>> tests need familiarity with the codebase or the test
> > setup
> > > > so
> > > > > > as a
> > > > > > > >> >> first
> > > > > > > >> >>>>> time contributor, it might be difficult. newbie++ IMO
> > > covers
> > > > > > that
> > > > > > > >> >> aspect.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> [1]
> > > > > > > >> >>>>>
> > > > > > > >> >>>>
> > > > > > > >> >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/KAFKA-15406?jql=project%20%3D%20KAFKA%20AND%20labels%20%3D%20%22newbie%2B%2B%22
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> Let me know what you think.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> Thanks!
> > > > > > > >> >>>>> Sagar.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> On Mon, Nov 13, 2023 at 9:11 PM Divij Vaidya <
> > > > > > > >> divijvaidy...@gmail.com>
> > > > > > > >> >>>>> wrote:
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>>>   Please, do it.
> > > > > > > >> >>>>>> We can use specific labels to effectively filter
> those
> > > > > tickets.
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> We already have a label and a way to discover flaky
> > > tests.
> > > > > They
> > > > > > > are
> > > > > > > >> >>>> tagged
> > > > > > > >> >>>>>> with the label "flaky-test" [1]. There is also a
> label
> > > > > "newbie"
> > > > > > > [2]
> > > > > > > >> >>>> meant
> > > > > > > >> >>>>>> for folks who are new to Apache Kafka code base.
> > > > > > > >> >>>>>> My suggestion is to send a broader email to the
> > community
> > > > > > (since
> > > > > > > >> many
> > > > > > > >> >>>> will
> > > > > > > >> >>>>>> miss details in this thread) and call for action for
> > > > > committers
> > > > > > > to
> > > > > > > >> >>>>>> volunteer as "shepherds" for these tickets. I can
> send
> > > one
> > > > > out
> > > > > > > >> once we
> > > > > > > >> >>>> have
> > > > > > > >> >>>>>> some consensus wrt next steps in this thread.
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> [1]
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>
> > > > > > > >> >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/KAFKA-13421?jql=project%20%3D%20KAFKA%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20%22Patch%20Available%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flaky-test%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> [2] https://kafka.apache.org/contributing ->
> Finding a
> > > > > project
> > > > > > > to
> > > > > > > >> >> work
> > > > > > > >> >>>> on
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> Divij Vaidya
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> On Mon, Nov 13, 2023 at 4:24 PM Николай Ижиков <
> > > > > > > >> nizhi...@apache.org>
> > > > > > > >> >>>>>> wrote:
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>>> To kickstart this effort, we can publish a list of
> > such
> > > > > > tickets
> > > > > > > >> in
> > > > > > > >> >> the
> > > > > > > >> >>>>>>> community and assign one or more committers the role
> > of
> > > a
> > > > > > > >> «shepherd"
> > > > > > > >> >>>> for
> > > > > > > >> >>>>>>> each ticket.
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> Please, do it.
> > > > > > > >> >>>>>>> We can use specific label to effectively filter
> those
> > > > > tickets.
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>>> On Nov 13, 2023, at 15:16, Divij Vaidya
> > > > > > > >> >>>>>>>> <divijvaidy...@gmail.com> wrote:
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> Thanks for bringing this up David.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> My primary concern revolves around the possibility
> > that
> > > > the
> > > > > > > >> >> currently
> > > > > > > >> >>>>>>>> disabled tests may remain inactive indefinitely. We
> > > > > currently
> > > > > > > >> have
> > > > > > > >> >>>>>>>> unresolved JIRA tickets for flaky tests that have
> > been
> > > > > > pending
> > > > > > > >> for
> > > > > > > >> >> an
> > > > > > > >> >>>>>>>> extended period. I am inclined to support the idea
> of
> > > > > > disabling
> > > > > > > >> >> these
> > > > > > > >> >>>>>>> tests
> > > > > > > >> >>>>>>>> temporarily and merging changes only when the build
> > is
> > > > > > > >> successful,
> > > > > > > >> >>>>>>> provided
> > > > > > > >> >>>>>>>> there is a clear plan for re-enabling them in the
> > > future.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> To address this issue, I propose the following
> > > measures:
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> 1\ Foster a supportive environment for new
> > contributors
> > > > > > within
> > > > > > > >> the
> > > > > > > >> >>>>>>>> community, encouraging them to take on tickets
> > > associated
> > > > > > with
> > > > > > > >> flaky
> > > > > > > >> >>>>>>> tests.
> > > > > > > >> >>>>>>>> This initiative would require individuals familiar
> > with
> > > > the
> > > > > > > >> relevant
> > > > > > > >> >>>>>> code
> > > > > > > >> >>>>>>>> to offer guidance to those undertaking these tasks.
> > > > > > Committers
> > > > > > > >> >> should
> > > > > > > >> >>>>>>>> prioritize reviewing and addressing these tickets
> > > within
> > > > > > their
> > > > > > > >> >>>>>> available
> > > > > > > >> >>>>>>>> bandwidth. To kickstart this effort, we can
> publish a
> > > > list
> > > > > of
> > > > > > > >> such
> > > > > > > >> >>>>>>> tickets
> > > > > > > >> >>>>>>>> in the community and assign one or more committers
> > the
> > > > role
> > > > > > of
> > > > > > > a
> > > > > > > >> >>>>>>> "shepherd"
> > > > > > > >> >>>>>>>> for each ticket.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> 2\ Implement a policy to block minor version
> releases
> > > > until
> > > > > > the
> > > > > > > >> >>>> Release
> > > > > > > >> >>>>>>>> Manager (RM) is satisfied that the disabled tests
> do
> > > not
> > > > > > result
> > > > > > > >> in
> > > > > > > >> >>>> gaps
> > > > > > > >> >>>>>>> in
> > > > > > > >> >>>>>>>> our testing coverage. The RM may rely on Subject
> > Matter
> > > > > > Experts
> > > > > > > >> >> (SMEs)
> > > > > > > >> >>>>>> in
> > > > > > > >> >>>>>>>> the specific code areas to provide assurance before
> > > > giving
> > > > > > the
> > > > > > > >> green
> > > > > > > >> >>>>>>> light
> > > > > > > >> >>>>>>>> for a release.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> 3\ Set a community-wide goal for 2024 to achieve a
> > > stable
> > > > > > > >> Continuous
> > > > > > > >> >>>>>>>> Integration (CI) system. This goal should encompass
> > > > > projects
> > > > > > > >> such as
> > > > > > > >> >>>>>>>> refining our test suite to eliminate flakiness and
> > > > > addressing
> > > > > > > >> >>>>>>>> infrastructure issues if necessary. By publishing
> > this
> > > > > goal,
> > > > > > we
> > > > > > > >> >> create
> > > > > > > >> >>>>>> a
> > > > > > > >> >>>>>>>> shared vision for the community in 2024, fostering
> > > > > alignment
> > > > > > on
> > > > > > > >> our
> > > > > > > >> >>>>>>>> objectives. This alignment will aid in prioritizing
> > > tasks
> > > > > for
> > > > > > > >> >>>> community
> > > > > > > >> >>>>>>>> members and guide reviewers in allocating their
> > > bandwidth
> > > > > > > >> >> effectively.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> --
> > > > > > > >> >>>>>>>> Divij Vaidya
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> On Sun, Nov 12, 2023 at 2:58 AM Justine Olshan
> > > > > > > >> >>>>>>> <jols...@confluent.io.invalid>
> > > > > > > >> >>>>>>>> wrote:
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>> I will say that I have also seen tests that seem
> to
> > be
> > > > > more
> > > > > > > >> flaky
> > > > > > > >> >>>>>>>>> intermittently. It may be ok for some time and
> > > suddenly
> > > > > the
> > > > > > CI
> > > > > > > >> is
> > > > > > > >> >>>>>>>>> overloaded and we see issues.
> > > > > > > >> >>>>>>>>> I have also seen the CI struggling with running
> out
> > of
> > > > > space
> > > > > > > >> >>>> recently,
> > > > > > > >> >>>>>>> so I
> > > > > > > >> >>>>>>>>> wonder if we can also try to improve things on
> that
> > > > front.
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> FWIW, I noticed, filed, or commented on several
> > flaky
> > > > test
> > > > > > > JIRAs
> > > > > > > >> >> last
> > > > > > > >> >>>>>>> week.
> > > > > > > >> >>>>>>>>> I'm happy to try to get to green builds, but
> > everyone
> > > > > needs
> > > > > > to
> > > > > > > >> be
> > > > > > > >> >> on
> > > > > > > >> >>>>>>> board.
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15529
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-14806
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-14249
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15798
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15797
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15690
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15699
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15772
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15759
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15760
> > > > > > > >> >>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-15700
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> I've also seen that kraft transactions tests often
> > > > flakily
> > > > > > see
> > > > > > > >> that
> > > > > > > >> >>>>>> the
> > > > > > > >> >>>>>>>>> producer id is not allocated and times out.
> > > > > > > >> >>>>>>>>> I can file a JIRA for that too.
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> Hopefully this is a place we can start from.
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> Justine
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> On Sat, Nov 11, 2023 at 11:35 AM Ismael Juma <
> > > > > > > m...@ismaeljuma.com
> > > > > > > >> >
> > > > > > > >> >>>>>> wrote:
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>>> On Sat, Nov 11, 2023 at 10:32 AM John Roesler <
> > > > > > > >> >> vvcep...@apache.org>
> > > > > > > >> >>>>>>>>> wrote:
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>>> In other words, I’m biased to think that new
> > > flakiness
> > > > > > > >> indicates
> > > > > > > >> >>>>>>>>>>> non-deterministic bugs more often than it
> > indicates
> > > a
> > > > > bad
> > > > > > > >> test.
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> My experience is exactly the opposite. As someone
> > who
> > > > has
> > > > > > > >> tracked
> > > > > > > >> >>>>>> many
> > > > > > > >> >>>>>>> of
> > > > > > > >> >>>>>>>>>> the flaky fixes, the vast majority of the time
> they
> > > are
> > > > > an
> > > > > > > >> issue
> > > > > > > >> >>>> with
> > > > > > > >> >>>>>>> the
> > > > > > > >> >>>>>>>>>> test.
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> Ismael
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>
> > > > > > > >> >>>
> > > > > > > >> >>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
