This would substantially increase the burden on codeowners, because now in 
addition to looking at the code itself, they would have to wait for any running 
PR checks to complete, as well as remember to come back and look at the PR 
after any subsequent commits to make sure the checks are still passing.

On 6/9/21, 10:56 AM, "Mark Hanson" <hans...@vmware.com> wrote:

    I think that process is a bit much. PRs are already a challenge. What do 
people think about code owners being the gate. We can accept by custom that 
there should be no stress-new-test failures. If there is a failure, a code 
owner can request a change or decide to let it go. I think that is sufficient 
all things considered.

    Thanks,
    Mark

    On 6/9/21, 10:43 AM, "Owen Nichols" <onich...@vmware.com> wrote:

        I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be 
sufficient and proper to grant approval for an override.  Any PR already needs 
approval from 1 codeowner per area to merge, so codeowners already have a 
little more say because they hold veto power over the PR.

        In terms of "practicalities of how this would actually work":
        Step 1: start a [DISCUSS] thread explaining the problem and why you 
think an override is justified
        Step 2: if there is concensus, [VOTE]
        Step 3: Myself (or whoever performs the override) must cite a link to 
the vote thread


        On 6/9/21, 10:16 AM, "Dale Emery" <dem...@vmware.com> wrote:

            I too like #1 best for now… assuming it’s possible to give code 
owners this ability.

            Coincidentally, about option #3, II was reading the git release 
notes just now, and noticed there’s a new “trailers” feature. It gives git the 
ability to parse “key: value” pairs at the end of a commit message. We could 
potentially (with a sufficiently current version of git) use that to exclude a 
test from a PR stress test run.

            And, yeah, option #2 brings back the @FlakyTest annotation that we 
worked so hard to eliminate.

            As Mark said, none of this fixes the underlying problem, which I’d 
frame as: We have too many tests whose results we don’t trust.

            Dale

            From: Kirk Lund <kl...@apache.org>
            Date: Wednesday, June 9, 2021 at 9:59 AM
            To: dev@geode.apache.org <dev@geode.apache.org>
            Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement 
from PRs
            I do like the suggestions offered up by Dale and would encourage 
(or even
            plead) with my fellow contributors to consider these:

              *   Allow code owners to override the block, if they can be 
convinced the
            > override is justified.
            >   *   Exclude troublesome tests from stress test runs, either via
            > annotations or via an `assumeThat(…)` that can detect that it’s 
running as
            > a stress test. Whatever the mechanism for excluding, it would be 
in the
            > code, and therefore subject to code owner review. (This, too, 
feels overly
            > broad to me, as it would exclude the test from all stress test 
runs.)
            >   *   A way to exclude a specific test method from running in the 
stress
            > tests for a specific PR or commit. I don’t have any ideas for how 
to
            > declare such an exclusion, but if it could be done via a file it 
would be
            > subject to code owner review.


            1) Allow code owners to override the block, if they can be 
convinced the
            override is justified.

            After all, if we don't trust our code owners...

            2-3) Use a custom annotation to exclude the test method or test 
class only
            from stress-new-test.

            At first I really liked this idea, but then we end up with growing a
            collection of flaky tests that are excluded in some way from
            stress-new-test that still occasionally fail in distributedTest.

            #1 really sounds like the best option to me. I believe that leaving 
our
            stress-new-test process as-is will only discourage everyone from 
fixing one
            or two flaky tests in a large dunit test. However, I also believe 
that if
            we give code owners the authority to override stress-new-test, then 
we
            need to encourage them not to override this too often.

            On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <dem...@vmware.com> 
wrote:

            > Maybe we can find a way to relax the requirement, or to allow 
addressing
            > specific situations like the tangle you find yourself in.
            >
            > Removing the requirement altogether feels overly broad. I fear it 
would
            > allow us to quietly disregard all intermittent test failures, and 
I think
            > we already quietly (or even actively) disregard way too many 
kinds of
            > failures.
            >
            > I would prefer some way to explicitly disregard only the specific 
test
            > failures that prevent us from merging, and only with some amount 
of
            > explicit justification.
            >
            > I’m not sure what that would look like. Some half-baked 
possibilities:
            >
            >   *   Allow code owners to override the block, if they can be 
convinced
            > the override is justified.
            >   *   Exclude troublesome tests from stress test runs, either via
            > annotations or via an `assumeThat(…)` that can detect that it’s 
running as
            > a stress test. Whatever the mechanism for excluding, it would be 
in the
            > code, and therefore subject to code owner review. (This, too, 
feels overly
            > broad to me, as it would exclude the test from all stress test 
runs.)
            >   *   A way to exclude a specific test method from running in the 
stress
            > tests for a specific PR or commit. I don’t have any ideas for how 
to
            > declare such an exclusion, but if it could be done via a file it 
would be
            > subject to code owner review.
            >
            > Dale
            >
            > From: Kirk Lund <kl...@apache.org>
            > Date: Tuesday, June 8, 2021 at 9:33 AM
            > To: dev@geode.apache.org <dev@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement 
from PRs
            > Our requirement for stress-new-test-openjdk11 to pass before 
allowing merge
            > doesn't really work as intended for fixing distributed tests that 
contain
            > multiple flaky test methods. In fact, I think it causes 
contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions 
FAILED
            > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6HhCuHXEJiWvGcjKr3UKyarforIrRHnPjhfcU4CCUIo%3D&amp;reserved=0>
 and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for 
other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix 
GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix 
GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test 
in it is
            > fixed by my PR. I think this is undesirable because it would be 
better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's 
more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss 
removing
            > or relaxing our requirement that stress-new-test-openjdk11 must 
pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dDGAwAuQQiTT9gwARFsccHEPhbW2nlX40O0XqWOpPmw%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nvNvfHOeFuURlTPHeoKK9zv5YsCtH71wyrUU%2BMabexQ%3D&amp;reserved=0>
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions 
FAILED
            > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6HhCuHXEJiWvGcjKr3UKyarforIrRHnPjhfcU4CCUIo%3D&amp;reserved=0>
            > * GEODE-8528: 
PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189629337%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=7TQOe44Mb%2BwM2f6JC94znuQoo53OplqDF4k4C6jkRBE%3D&amp;reserved=0>
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > 
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189629337%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=qmFx1X78eoGaomBiukube64kInwYjiY%2BBa3hoyt%2B2BM%3D&amp;reserved=0>
            >
            > Thanks,
            > Kirk
            >



Reply via email to