[
https://issues.apache.org/jira/browse/HDDS-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chi-Hsuan Huang reassigned HDDS-15605:
--------------------------------------
Assignee: Chi-Hsuan Huang
> Intermittent failure in
> TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException
> ------------------------------------------------------------------------------------------------------
>
> Key: HDDS-15605
> URL: https://issues.apache.org/jira/browse/HDDS-15605
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: test
> Reporter: Chi-Hsuan Huang
> Assignee: Chi-Hsuan Huang
> Priority: Minor
>
> h2. Symptom
> {{TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException}}
> fails intermittently (observed ~1/40 on CI, not reproducible locally)
> with an assertion failure, not a timeout:
> {code}
> TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException:398
> java.lang.AssertionError:
> Expecting empty but was:
> [5efc24c5-0b87-4bf7-80b0-751fafcf3248(null/null)]
> {code}
> Line 398 asserts that {{keyOutputStream.getExcludeList().getDatanodes()}} is
> empty: only the closed container should be excluded, not any datanode.
> h2. Root cause analysis
> In {{KeyOutputStream.handleException}} (around lines 386-400), excluding a
> datanode and excluding the container are two independent decisions that can
> both fire:
> {code}
> Collection<DatanodeDetails> failedServers = streamEntry.getFailedServers();
> if (!failedServers.isEmpty()) {
>   // populates getDatanodes()
>   excludeList.addDatanodes(failedServers);
> }
> if (containerExclusionException) {
>   // container (expected by the test)
>   excludeList.addConatinerId(...);
> } else {
>   excludeList.addPipeline(pipelineId);
> }
> {code}
> The test assumes the second write fails only with
> {{ClosedContainerException}}, so that {{failedServers}} is empty. However, the
> excluded datanode is printed as {{(null/null)}}, which is what
> {{XceiverClientRatis.addDatanodetoReply}} produces (it builds
> {{DatanodeDetails}} from the Ratis peer UUID only, with no IP or hostname).
> This points to a Ratis peer write/watch failure rather than a clean
> {{ClosedContainerException}}.
> Sequence:
> # {{TestHelper.waitForContainerClose}} closes the container, which
> also tears down the Ratis pipeline on the datanodes.
> # The subsequent write (or its watch-for-commit) to a Ratis peer can
> fail or time out while the pipeline is closing, so that peer is recorded in
> {{failedServers}}.
> # {{handleException}} then adds that datanode to the exclude list in
> addition to the container, so {{getDatanodes()}} is non-empty and the
> assertion on line 398 fails.
>
> This is a timing race between container close and the in-flight Ratis
> write/watch, which is why it only shows up under load on CI.
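The interaction of the two decisions can be shown in isolation. The following is a minimal Java sketch, not the Ozone code: {{ExcludeListSketch}}, {{HandleExceptionSketch}} and the {{handle}} signature are hypothetical stand-ins, and datanodes are modelled as plain strings. It only mirrors the branch structure quoted above to show that a single failed peer is enough to populate the datanode set even when the container exclusion path is taken.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the exclude list (not the Ozone class).
class ExcludeListSketch {
  final Set<String> datanodes = new LinkedHashSet<>();
  final Set<Long> containers = new LinkedHashSet<>();
  final Set<String> pipelines = new LinkedHashSet<>();
}

class HandleExceptionSketch {
  // Mirrors the two independent decisions from the quoted snippet.
  static ExcludeListSketch handle(Collection<String> failedServers,
                                  boolean containerExclusionException,
                                  long containerId,
                                  String pipelineId) {
    ExcludeListSketch excludeList = new ExcludeListSketch();
    // Decision 1: any failed server is excluded, whatever the exception type.
    if (!failedServers.isEmpty()) {
      excludeList.datanodes.addAll(failedServers);
    }
    // Decision 2: exclude the container or, otherwise, the pipeline.
    if (containerExclusionException) {
      excludeList.containers.add(containerId);
    } else {
      excludeList.pipelines.add(pipelineId);
    }
    return excludeList;
  }

  public static void main(String[] args) {
    // Clean close: no failed peer, only the container is excluded.
    ExcludeListSketch clean = handle(List.of(), true, 1L, "pipeline-1");
    System.out.println("clean: datanodes=" + clean.datanodes
        + " containers=" + clean.containers);

    // Race: a peer failed while the pipeline was closing, so both sets fill.
    ExcludeListSketch raced = handle(
        List.of("5efc24c5-0b87-4bf7-80b0-751fafcf3248"), true, 1L, "pipeline-1");
    System.out.println("raced: datanodes=" + raced.datanodes
        + " containers=" + raced.containers);
  }
}
```

In the second call the container is excluded exactly as the test expects, yet the datanode set is also non-empty, which is the state the assertion on line 398 rejects.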
> h2. Notes
> * Distinct from HDDS-7878 (resolved), which tracked an intermittent
> _timeout_ in the same method. This is an _assertion_ failure with a different
> cause.
> * Observed in CI:
> [https://github.com/chihsuan/ozone/actions/runs/27691671664|https://github.com/chihsuan/ozone/actions/runs/27691671664]
> (job: integration (client)).
> * The assertion on line 398 (and likely 399 for pipelines) may be too
> strict given that container close can legitimately surface a transient
> datanode/pipeline failure.
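One possible shape for a less strict check, offered only as a sketch under assumptions and not as the agreed fix: require that the closed container is in the exclude list, and deliberately make no claim about the datanode set. {{ExclusionCheckSketch}} and {{checkClosedContainerExcluded}} are hypothetical names, and plain Java collections stand in for the Ozone types.

```java
import java.util.Set;

// Hypothetical helper; plain collections stand in for the Ozone exclude list.
class ExclusionCheckSketch {
  // Passes when the closed container is excluded. The datanode set is
  // accepted as-is, because a transient peer failure during container
  // close may legitimately add an entry to it.
  static void checkClosedContainerExcluded(Set<Long> excludedContainers,
                                           Set<String> excludedDatanodes,
                                           long closedContainerId) {
    if (!excludedContainers.contains(closedContainerId)) {
      throw new AssertionError("closed container " + closedContainerId
          + " is not excluded; containers=" + excludedContainers
          + " datanodes=" + excludedDatanodes);
    }
    // Intentionally no emptiness assertion on excludedDatanodes.
  }
}
```

Whether tolerating an excluded datanode is acceptable, or whether the test should instead wait for the pipeline to settle before the second write, is a design decision for the fix itself.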