I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem.
Thanks, Mark On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote: I did think about splitting up dunit tests, but I believe testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I move it to a new dunit test. No matter how you dice it up, we end up with a PR that cannot be merged to develop unless you get lucky after running stress-new-test many times. One could try being tricky by marking it with @ignore or deleting the flaky test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR is very unlikely to pass stress-new-test unless someone fixes the cause of that test's flakiness. As it stands, stress-new-test just ends up being a dead-end or an endless time-sync for fixing multiple flaky tests in one dunit. On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> wrote: > Would it be possible to just split that test up into multiple classes? It > sounds like the issue is that there is so many flaky tests in that class > that you can't fix them all in one PR, which might indicate it's too big. > > If we can't get StressNewTest to pass - that means our builds are failing > more than 2% of the time due to this one test failure. Yikes! > > -Dan > ________________________________ > From: Kirk Lund <kl...@apache.org> > Sent: Tuesday, June 8, 2021 9:33 AM > To: dev@geode.apache.org <dev@geode.apache.org> > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs > > Our requirement for stress-new-test-openjdk11 to pass before allowing merge > doesn't really work as intended for fixing distributed tests that contain > multiple flaky test methods. In fact, I think it causes contributors to > avoid tackling flaky tests. > > I've been working on GEODE-9103: CI Failure: > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=XNyeNQmLHIIyRgpnXRwbHn%2F2lph6S%2BhiO%2BdcOOgeoSc%3D&reserved=0> > and was able to fix it. > > However, stress-new-test-openjdk11 then continued to fail for other flaky > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242 > which remains flaky. > > Unfortunately, I cannot merge any of my fixes for > PutAllClientServerDistributedTest unless every single flaky test in it is > fixed by my PR. I think this is undesirable because it would be better to > merge the fix for 3 flaky test methods than none. > > UPDATE: After running my precheckin more times, I did get > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a > loophole than anything because I didn't manage to fix GEODE-9242. > > Despite having PR #6542 eventually pass, I would like to discuss removing > or relaxing our requirement that stress-new-test-openjdk11 must pass in > order to merge a PR... > > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in > PutAllClientServerDistributedTest > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=gRAwmAsrHsH4p9jkKTuC63nJPRGBW6TGsYG41xpLF4w%3D&reserved=0 > > > > Fixed in PR #6542: > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest > > testPartialKeyInPRSingleHopWithRedundancy > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Q3ogdEG4omEyoc7Wrg2yMZEC0HWDTKhVqKMCJ8ZqTzc%3D&reserved=0 > > > * GEODE-9103: CI Failure: > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=XNyeNQmLHIIyRgpnXRwbHn%2F2lph6S%2BhiO%2BdcOOgeoSc%3D&reserved=0 > > > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop > fails due to missing disk store after server restart > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=2zkPkNsCFNbXHre2ekltWWrckA74YKD2SqUjp5Tehbg%3D&reserved=0 > > > > Still flaky: > * GEODE-9242: CI failure in PutAllClientServerDistributedTest > > testEventIdOutOfOrderInPartitionRegionSingleHop > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=fINGTY1FLXuJkr14LZNasMed6%2B1aJw9nIldCzzSf6rM%3D&reserved=0 > > > > Thanks, > Kirk >