chungen0126 opened a new pull request, #10397:
URL: https://github.com/apache/ozone/pull/10397

   ## What changes were proposed in this pull request?
   
   ### Summary
   Fix intermittent failures in 
`TestContainerStateMachine#testApplyTransactionFailure`, 
`TestContainerStateMachine#testContainerStateMachineRestartWithDNChangePipeline`,
 `testWriteStateMachineDataIdempotencyWithClosedContainer`, and 
`testApplyTransactionIdempotencyWithClosedContainer`.
   
   ### Changes
   #### For `testWriteStateMachineDataIdempotencyWithClosedContainer`: 
   
   The failure stemmed from a race between a retried write operation and a 
close container request. The test expects idempotency for identical data, but 
intermittent failures occurred because the initial write and the retry write 
contained different data.
   
   - Case A (Success): If close container executes first, no error occurs.
   - Case B (Failure): If the retry write executes before the close container, 
a mismatch occurs between the written data "hello" and the committed metadata. 
Although the container closes successfully, the scanner later marks it as 
"unhealthy" because of the checksum mismatch.
   
   Fix: Updated the test to ensure data consistency during retries or adjusted 
the timing expectations to handle the race condition correctly.
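
The two cases above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ChunkSketch`, its methods, the CRC32 checksum, and the initial payload `"data"` are hypothetical stand-ins and are not Ozone classes or the actual test data. It assumes a retry write is ignored once the container is closed and that a retry before close overwrites the bytes without touching the committed checksum.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical stand-in for a container chunk; not Ozone's real classes. */
class ChunkSketch {
  private byte[] data;        // bytes currently stored
  private long committedCrc;  // checksum recorded in metadata at commit time
  private boolean closed;

  static long crc(byte[] bytes) {
    CRC32 c = new CRC32();
    c.update(bytes, 0, bytes.length);
    return c.getValue();
  }

  /** Initial write: store the bytes and commit their checksum. */
  void writeAndCommit(byte[] bytes) {
    data = bytes.clone();
    committedCrc = crc(bytes);
  }

  /** Retry write: overwrites the bytes only; assumed to be a no-op once closed. */
  void retryWrite(byte[] bytes) {
    if (closed) {
      return;
    }
    data = bytes.clone();
  }

  void close() {
    closed = true;
  }

  /** What a scanner-style check would verify: stored bytes match the metadata. */
  boolean isHealthy() {
    return crc(data) == committedCrc;
  }

  /** Runs one interleaving and reports whether the chunk ends up healthy. */
  static boolean run(String first, String retry, boolean closeFirst) {
    ChunkSketch chunk = new ChunkSketch();
    chunk.writeAndCommit(first.getBytes(StandardCharsets.UTF_8));
    if (closeFirst) {
      chunk.close();
      chunk.retryWrite(retry.getBytes(StandardCharsets.UTF_8));
    } else {
      chunk.retryWrite(retry.getBytes(StandardCharsets.UTF_8));
      chunk.close();
    }
    return chunk.isHealthy();
  }
}
```

In this model, a retry that carries the same bytes as the initial write leaves the chunk healthy in either order, which is the property an idempotency test relies on. Only the combination of different retry data and retry-before-close produces the mismatch.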
   
   #### For `testContainerStateMachineRestartWithDNChangePipeline` and `testApplyTransactionFailure`
   
   These tests failed because of `testContainerStateMachineFailures`, which 
triggers a Ratis storage reset that breaks existing pipelines. Because these 
pipelines are closed passively via client-side retries instead of by the 
ScrubbingService, they remain in the PipelineManager, so subsequent tests that 
inadvertently select them fail.
   
   Fix: Make `testContainerStateMachineFailures` run at the end of the class.
   
   #### For `testApplyTransactionIdempotencyWithClosedContainer`
   
   When the `close container` command finishes sending, there is no guarantee 
that the last applied index has already been updated. If `take snapshot` is 
triggered immediately afterward, the resulting snapshot may not reflect the 
latest state.
   
   Fix: Added a waiting step after the `close container` command is sent to 
ensure that the `last applied index` has been fully updated before proceeding 
to `take snapshot`. This guarantees that the generated snapshot is always 
up-to-date with the latest index.
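
The waiting step can be sketched as a simple polling loop. This is a minimal illustration, not the code in this PR: `WaitSketch`, `waitFor`, `closeThenSnapshot`, and the simulated applier thread are hypothetical names, and the `AtomicLong` stands in for the state machine's last applied index.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; names are illustrative only. */
class WaitSketch {
  /** Polls the check every intervalMs until it is true; throws once timeoutMs elapses. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline > 0) {
        throw new IllegalStateException("condition not met within " + timeoutMs + " ms");
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
  }

  /**
   * Simulates the sequence: the close command has been sent, the index is
   * applied a little later on another thread, and the "snapshot" reads the
   * index only after the wait completes.
   */
  static long closeThenSnapshot(long closeIndex) {
    AtomicLong lastAppliedIndex = new AtomicLong(closeIndex - 1);
    Thread applier = new Thread(() -> {
      try {
        Thread.sleep(50); // simulated delay before the entry is applied
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      lastAppliedIndex.set(closeIndex);
    });
    applier.start();

    // Without this wait, the read below could still observe closeIndex - 1.
    waitFor(() -> lastAppliedIndex.get() >= closeIndex, 10, 5_000);
    return lastAppliedIndex.get(); // index a snapshot taken now would capture
  }
}
```

Polling with a timeout keeps the test deterministic when the index is applied late, while still failing loudly instead of hanging if it is never applied.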
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-13482
   https://issues.apache.org/jira/browse/HDDS-12215
   https://issues.apache.org/jira/browse/HDDS-14962
   https://issues.apache.org/jira/browse/HDDS-6115
   
   ## How was this patch tested?
   
   Before changes: TestContainerStateMachine failed 22 times in 20 * 10 
iterations. https://github.com/chungen0126/ozone/actions/runs/26375145366
   
   After changes: TestContainerStateMachine passed all 20 * 10 iterations. 
https://github.com/chungen0126/ozone/actions/runs/26706385476
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

