Chi-Hsuan Huang created HDDS-15605:
--------------------------------------

             Summary: Intermittent failure in 
TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException
                 Key: HDDS-15605
                 URL: https://issues.apache.org/jira/browse/HDDS-15605
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: test
            Reporter: Chi-Hsuan Huang


h2. Symptom

{{TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException}}
fails intermittently (observed in about 1 of 40 runs on CI, not reproducible 
locally) with an assertion failure, not a timeout:

{code}
TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException:398
java.lang.AssertionError:
Expecting empty but was:
[5efc24c5-0b87-4bf7-80b0-751fafcf3248(null/null)]
{code}

Line 398 asserts that {{keyOutputStream.getExcludeList().getDatanodes()}} is 
empty: only the closed container should be excluded, not any datanode.

h2. Root cause analysis

In {{KeyOutputStream.handleException}} (around lines 386-400), excluding a 
datanode and excluding the container are two independent decisions that can 
both fire:

{code}
Collection failedServers = streamEntry.getFailedServers();
if (!failedServers.isEmpty()) {
  excludeList.addDatanodes(failedServers);   // populates getDatanodes()
}
if (containerExclusionException) {
  excludeList.addConatinerId(...);           // container (expected by the test)
} else {
  excludeList.addPipeline(pipelineId);
}
{code}

The test assumes the second write fails only with {{ClosedContainerException}}, 
so that {{failedServers}} stays empty. However, the excluded datanode is printed 
as {{(null/null)}}, which is what {{XceiverClientRatis.addDatanodetoReply}} 
produces (it builds {{DatanodeDetails}} from the Ratis peer UUID only, with no 
IP or hostname). This points to a Ratis peer write/watch failure rather than a 
clean {{ClosedContainerException}}.
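
To illustrate why a UUID-only entry renders as {{(null/null)}}, here is a minimal, self-contained sketch. The {{Node}} class and {{fromPeerId}} helper are hypothetical stand-ins, not the real {{DatanodeDetails}} or {{addDatanodetoReply}}, and the {{uuid(host/ip)}} string layout is assumed from the output seen in the failure.

```java
import java.util.UUID;

class PeerOnlyDatanodeSketch {

  // Hypothetical stand-in for DatanodeDetails: only the fields needed here.
  static final class Node {
    final UUID uuid;
    final String hostName;   // never set when built from a peer id
    final String ipAddress;  // never set when built from a peer id

    Node(UUID uuid, String hostName, String ipAddress) {
      this.uuid = uuid;
      this.hostName = hostName;
      this.ipAddress = ipAddress;
    }

    // Assumed layout, inferred from the assertion message: uuid(host/ip).
    @Override
    public String toString() {
      return uuid + "(" + hostName + "/" + ipAddress + ")";
    }
  }

  // Builds a node from a peer UUID string only, leaving host and IP null.
  static Node fromPeerId(String peerId) {
    return new Node(UUID.fromString(peerId), null, null);
  }

  public static void main(String[] args) {
    // Prints: 5efc24c5-0b87-4bf7-80b0-751fafcf3248(null/null)
    System.out.println(fromPeerId("5efc24c5-0b87-4bf7-80b0-751fafcf3248"));
  }
}
```

An entry that came from SCM-provided pipeline information would normally carry a hostname and IP, so the {{null/null}} form suggests the entry originated on the Ratis reply path.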

Sequence:

# {{TestHelper.waitForContainerClose}} closes the container, which also tears 
down the Ratis pipeline on the datanodes.
# The subsequent write (or its watch-for-commit) to a Ratis peer can fail or 
time out while the pipeline is closing, so that peer is recorded in 
{{failedServers}}.
# {{handleException}} then adds that datanode to the exclude list in addition 
to the container, so {{getDatanodes()}} is non-empty and the assertion on 
line 398 fails.

This is a timing race between container close and the in-flight Ratis 
write/watch, which is why it only shows up under load on CI.
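
The two outcomes of this race can be modelled with a small, self-contained sketch. It mirrors only the branch structure of the snippet quoted above; {{ExcludeModel}}, the string datanode ids, and the {{handleException}} signature are simplified, hypothetical stand-ins rather than the real Ozone classes.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ExcludeRaceSketch {

  // Hypothetical, simplified stand-in for the client-side exclude list.
  static final class ExcludeModel {
    final Set<String> datanodes = new LinkedHashSet<>();
    final Set<Long> containerIds = new LinkedHashSet<>();
    final Set<String> pipelineIds = new LinkedHashSet<>();
  }

  // Same shape as the quoted logic: the datanode decision and the
  // container/pipeline decision are made independently of each other.
  static ExcludeModel handleException(Collection<String> failedServers,
      boolean containerExclusionException, long containerId, String pipelineId) {
    ExcludeModel excludeList = new ExcludeModel();
    if (!failedServers.isEmpty()) {
      excludeList.datanodes.addAll(failedServers);
    }
    if (containerExclusionException) {
      excludeList.containerIds.add(containerId);
    } else {
      excludeList.pipelineIds.add(pipelineId);
    }
    return excludeList;
  }

  public static void main(String[] args) {
    // Clean close: no peer failed, so only the container is excluded.
    ExcludeModel clean = handleException(List.of(), true, 1L, "pipeline-1");
    System.out.println("clean: datanodes=" + clean.datanodes
        + " containers=" + clean.containerIds);

    // Raced close: one peer failed as well, so both exclusions fire.
    ExcludeModel raced = handleException(List.of("dn-uuid-1"), true, 1L, "pipeline-1");
    System.out.println("raced: datanodes=" + raced.datanodes
        + " containers=" + raced.containerIds);
  }
}
```

In the raced case the container is still excluded as the test expects, but the datanode set is no longer empty, which matches the observed assertion message.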

h2. Notes

* Distinct from HDDS-7878 (resolved), which tracked an intermittent 
_timeout_ in the same method. This is an _assertion_ failure with a different 
cause.
* Observed in CI: 
[https://github.com/chihsuan/ozone/actions/runs/27691671664]
 (job: integration (client)).
* The assertion on line 398 (and likely 399 for pipelines) may be too strict 
given that container close can legitimately surface a transient 
datanode/pipeline failure.
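
As an illustration of the last point only, and not a proposed patch, the difference between the current strict expectation and a more tolerant one could look like the sketch below. Both helper methods and the plain {{Set}} arguments are hypothetical; the real test works against {{ExcludeList}}.

```java
import java.util.Set;

class RelaxedExclusionCheckSketch {

  // Current expectation: the closed container is excluded and no datanode is.
  static boolean strictCheck(Set<Long> excludedContainers,
      Set<String> excludedDatanodes, long closedContainerId) {
    return excludedContainers.contains(closedContainerId)
        && excludedDatanodes.isEmpty();
  }

  // Tolerant variant (an assumption, not the project's decision): require only
  // that the closed container is excluded, and accept extra datanode entries.
  static boolean relaxedCheck(Set<Long> excludedContainers, long closedContainerId) {
    return excludedContainers.contains(closedContainerId);
  }

  public static void main(String[] args) {
    Set<Long> containers = Set.of(1L);
    Set<String> datanodes = Set.of("dn-uuid-1"); // transient peer failure

    System.out.println("strict:  " + strictCheck(containers, datanodes, 1L));
    System.out.println("relaxed: " + relaxedCheck(containers, 1L));
  }
}
```

Whether tolerating the extra datanode is acceptable depends on what the test is meant to guarantee; the sketch only shows that the raced outcome fails the strict form while still satisfying the container-exclusion part.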




