I did think about splitting up dunit tests, but I believe
testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I
move it to a new dunit test. No matter how you dice it up, we end up with a
PR that cannot be merged to develop unless you get lucky after running
stress-new-test many times.

One could try being tricky by marking the flaky test with @Ignore or deleting
it in one PR, and then re-adding it in a 2nd PR. But even then, that 2nd PR
is very unlikely to pass stress-new-test unless someone fixes the cause of
that test's flakiness.

As it stands, stress-new-test just ends up being a dead end or an endless
time sink for fixing multiple flaky tests in one dunit.

On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> wrote:

> Would it be possible to just split that test up into multiple classes? It
> sounds like the issue is that there are so many flaky tests in that class
> that you can't fix them all in one PR, which might indicate it's too big.
>
> If we can't get StressNewTest to pass - that means our builds are failing
> more than 2% of the time due to this one test failure. Yikes!
>
> -Dan
> ________________________________
> From: Kirk Lund <kl...@apache.org>
> Sent: Tuesday, June 8, 2021 9:33 AM
> To: dev@geode.apache.org <dev@geode.apache.org>
> Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
>
> Our requirement for stress-new-test-openjdk11 to pass before allowing merge
> doesn't really work as intended for fixing distributed tests that contain
> multiple flaky test methods. In fact, I think it causes contributors to
> avoid tackling flaky tests.
>
> I've been working on GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103>
> and was able to fix it.
>
> However, stress-new-test-openjdk11 then continued to fail for other flaky
> tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
> GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
> which remains flaky.
>
> Unfortunately, I cannot merge any of my fixes for
> PutAllClientServerDistributedTest unless every single flaky test in it is
> fixed by my PR. I think this is undesirable because it would be better to
> merge the fix for 3 flaky test methods than none.
>
> UPDATE: After running my precheckin more times, I did get
> stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
> loophole than anything because I didn't manage to fix GEODE-9242.
>
> Despite having PR #6542 eventually pass, I would like to discuss removing
> or relaxing our requirement that stress-new-test-openjdk11 must pass in
> order to merge a PR...
>
> PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
> PutAllClientServerDistributedTest
> <https://github.com/apache/geode/pull/6542>
>
> Fixed in PR #6542:
> * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
> testPartialKeyInPRSingleHopWithRedundancy
> <https://issues.apache.org/jira/browse/GEODE-9296>
> * GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103>
> * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
> fails due to missing disk store after server restart
> <https://issues.apache.org/jira/browse/GEODE-8528>
>
> Still flaky:
> * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
> testEventIdOutOfOrderInPartitionRegionSingleHop
> <https://issues.apache.org/jira/browse/GEODE-9242>
>
> Thanks,
> Kirk
>
