Maybe we can find a way to relax the requirement, or to allow addressing 
specific situations like the tangle you find yourself in.

Removing the requirement altogether feels overly broad. I fear it would allow 
us to quietly disregard all intermittent test failures, and I think we already 
quietly (or even actively) disregard way too many kinds of failures.

I would prefer some way to explicitly disregard only the specific test failures 
that prevent us from merging, and only with some amount of explicit 
justification.

I’m not sure what that would look like. Some half-baked possibilities:

  *   Allow code owners to override the block, if they can be convinced the 
override is justified.
  *   Exclude troublesome tests from stress test runs, either via annotations 
or via an `assumeThat(…)` that can detect that it’s running as a stress test. 
Whatever the mechanism for excluding, it would be in the code, and therefore 
subject to code owner review. (This, too, feels overly broad to me, as it would 
exclude the test from all stress test runs.)
  *   A way to exclude a specific test method from running in the stress tests 
for a specific PR or commit. I don’t have any ideas for how to declare such an 
exclusion, but if it could be done via a file it would be subject to code owner 
review.

Dale

From: Kirk Lund <kl...@apache.org>
Date: Tuesday, June 8, 2021 at 9:33 AM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
Our requirement for stress-new-test-openjdk11 to pass before allowing merge
doesn't really work as intended for fixing distributed tests that contain
multiple flaky test methods. In fact, I think it causes contributors to
avoid tackling flaky tests.

I've been working on GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.

However, stress-new-test-openjdk11 then continued to fail for other flaky
tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
which remains flaky.

Unfortunately, I cannot merge any of my fixes for
PutAllClientServerDistributedTest unless every single flaky test in it is
fixed by my PR. I think this is undesirable because it would be better to
merge the fix for 3 flaky test methods than none.

UPDATE: After running my precheckin more times, I did get
stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
loophole than anything because I didn't manage to fix GEODE-9242.

Despite having PR #6542 eventually pass, I would like to discuss removing
or relaxing our requirement that stress-new-test-openjdk11 must pass in
order to merge a PR...

PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
PutAllClientServerDistributedTest
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdemery%40vmware.com%7Ca1f2e89aed9a4cb0940c08d92a9b2ac6%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668190431746%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=CEeIvbWkOY%2B00EEVMUIqcsil%2BYUXLcIiyhSiVUxlD%2Fs%3D&amp;reserved=0>

Fixed in PR #6542:
* GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
testPartialKeyInPRSingleHopWithRedundancy
<https://issues.apache.org/jira/browse/GEODE-9296>
* GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://issues.apache.org/jira/browse/GEODE-9103>
* GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
fails due to missing disk store after server restart
<https://issues.apache.org/jira/browse/GEODE-8528>

Still flaky:
* GEODE-9242: CI failure in PutAllClientServerDistributedTest >
testEventIdOutOfOrderInPartitionRegionSingleHop
<https://issues.apache.org/jira/browse/GEODE-9242>

Thanks,
Kirk

Reply via email to