[ 
https://issues.apache.org/jira/browse/HDDS-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chi-Hsuan Huang reassigned HDDS-15605:
--------------------------------------

    Assignee: Chi-Hsuan Huang

> Intermittent failure in 
> TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-15605
>                 URL: https://issues.apache.org/jira/browse/HDDS-15605
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Chi-Hsuan Huang
>            Assignee: Chi-Hsuan Huang
>            Priority: Minor
>
> h2. Symptom
> {{TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException}}
> fails intermittently (observed ~1/40 on CI, not reproducible locally) 
> with an assertion failure, not a timeout:
> {code}
> TestFailureHandlingByClient.testContainerExclusionWithClosedContainerException:398
> java.lang.AssertionError:
> Expecting empty but was: 
> [5efc24c5-0b87-4bf7-80b0-751fafcf3248(null/null)]
> {code}
> Line 398 asserts {{keyOutputStream.getExcludeList().getDatanodes()}} is 
> empty: only the closed container should be excluded, no datanode.
> h2. Root cause analysis
> In {{KeyOutputStream.handleException}} (around lines 386-400), excluding a 
> datanode and excluding the container are two independent decisions that can 
> both fire:
> {code}
> Collection failedServers = streamEntry.getFailedServers();
> if (!failedServers.isEmpty()) {
>   excludeList.addDatanodes(failedServers);          // populates getDatanodes()
> }
> if (containerExclusionException) {
>   excludeList.addConatinerId(...);                  // container (expected by the test)
> } else {
>   excludeList.addPipeline(pipelineId);
> }
> {code}
> The test assumes the second write fails only with 
> {{ClosedContainerException}}, so that {{failedServers}} is empty. But the 
> excluded datanode is printed as {{(null/null)}}, which is what 
> {{XceiverClientRatis.addDatanodetoReply}} produces (it builds 
> {{DatanodeDetails}} from the Ratis peer UUID only, with no IP or hostname). 
> This points to a Ratis peer write/watch failure rather than a clean 
> {{ClosedContainerException}}.
> Sequence:
> # {{TestHelper.waitForContainerClose}} closes the container, which 
> also tears down the Ratis pipeline on the datanodes.
> # The subsequent write (or its watch-for-commit) to a Ratis peer can 
> fail or time out while the pipeline is closing, so that peer is recorded in 
> {{failedServers}}.
> # {{handleException}} then adds that datanode to the exclude list in 
> addition to the container, so {{getDatanodes()}} is non-empty and the 
> assertion on line 398 fails.
> This is a timing race between container close and the in-flight Ratis 
> write/watch, which is why it only shows up under load on CI.
> h2. Notes
> * Distinct from HDDS-7878 (resolved), which tracked an intermittent 
> _timeout_ in the same method. This is an _assertion_ failure with a different 
> cause.
> * Observed in CI: 
> https://github.com/chihsuan/ozone/actions/runs/27691671664
> (job: integration (client)).
> * The assertion on line 398 (and likely 399 for pipelines) may be too 
> strict given that container close can legitimately surface a transient 
> datanode/pipeline failure.
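
The interaction described under "Root cause analysis" can be modelled in isolation. The following is a minimal sketch, not Ozone code: {{ExcludeListSketch}}, its nested {{ExcludeList}}, and the string datanode IDs are hypothetical stand-ins, and only the branch structure is taken from the {{handleException}} snippet quoted above. It shows that a single failed peer is enough to make the datanode set non-empty even when the container exclusion also fires.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class ExcludeListSketch {

  /** Simplified stand-in for an exclude list; NOT the real Ozone class. */
  public static final class ExcludeList {
    public final Set<String> datanodes = new LinkedHashSet<>();
    public final Set<Long> containers = new LinkedHashSet<>();
    public final Set<String> pipelines = new LinkedHashSet<>();
  }

  /**
   * Mirrors the two independent decisions from the quoted snippet:
   * datanode exclusion depends only on failedServers, while the
   * container-vs-pipeline choice depends only on the exception type.
   */
  public static ExcludeList handleException(Collection<String> failedServers,
                                            boolean containerExclusionException,
                                            long containerId,
                                            String pipelineId) {
    ExcludeList excludeList = new ExcludeList();
    if (!failedServers.isEmpty()) {
      excludeList.datanodes.addAll(failedServers);
    }
    if (containerExclusionException) {
      excludeList.containers.add(containerId);
    } else {
      excludeList.pipelines.add(pipelineId);
    }
    return excludeList;
  }

  public static void main(String[] args) {
    // Path the test expects: closed container, no failed peer.
    ExcludeList clean =
        handleException(Collections.<String>emptyList(), true, 1L, "pipeline-1");
    System.out.println("clean datanodes empty: " + clean.datanodes.isEmpty());

    // Racy path: a Ratis peer was also recorded as failed during close.
    ExcludeList raced =
        handleException(Arrays.asList("peer-uuid"), true, 1L, "pipeline-1");
    System.out.println("raced datanodes empty: " + raced.datanodes.isEmpty());
  }
}
```

In this model the container is excluded on both paths; the only difference is whether {{failedServers}} was empty when the exception was handled, which matches the timing dependence described in the issue.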



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
