[ 
https://issues.apache.org/jira/browse/HDDS-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059966#comment-18059966
 ] 

Ilya commented on HDDS-5525:
----------------------------

I would like to ask a question about this issue, check my understanding, and 
get feedback in case I am making a mistake somewhere.
 
1) As far as I understand, the statement "In HDDS-5513, a race condition 
occurred because snapshot installation was occurring before the main 
DatanodeStateMachine loop" is incorrect. ContainerStateMachine#loadSnapshot is 
only a side effect of an incorrect initialization chain in ratis 2.1.0: the 
creation of OzoneContainer in the DatanodeStateMachine constructor led to 
XceiverServerRatis#notifyGroupAdd() and, as a result, to 
ContainerStateMachine#initialize (which is where loadSnapshot is performed). In 
other words, the exception from HDDS-5513 was caused by ratis itself, and the 
relevant code was later fixed in RATIS-1465 (commit 53a3eaa). 
 
The exception from HDDS-5513 is easy to reproduce as follows:
1) make sure at least one LayoutFeature has an action, so that 
BasicUpgradeFinalizer#runFirstUpgradeAction is entered
2) pause the thread inside runFirstUpgradeAction 
3) add a small delay to ContainerStateMachine#initialize before 
XceiverServerRatis#notifyGroupAdd() so that the DatanodeStateMachine thread has 
time to reach the pause point
(This is no longer possible after upgrading to the ratis version with the fix 
linked above.)
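
The interleaving that these steps force can be modeled in isolation. The 
sketch below is only an illustration under my own assumptions: it does not use 
any Ozone or Ratis classes, and every name in it (PreFinalizeRaceSketch, the 
latches, the event strings) is hypothetical. Latches stand in for the debugger 
pause and the added delay, so that the stand-in for loadSnapshot always runs 
while the stand-in for the first upgrade action is still in progress.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model only: none of these names are Ozone or Ratis APIs.
class PreFinalizeRaceSketch {

  /** Runs the scenario once and returns the observed order of events. */
  static List<String> run() {
    List<String> events = new CopyOnWriteArrayList<>();
    CountDownLatch actionPaused = new CountDownLatch(1);
    CountDownLatch snapshotLoaded = new CountDownLatch(1);

    // Stand-in for the initialization chain that ends in loadSnapshot.
    Thread init = new Thread(() -> {
      try {
        // Models step 3: wait until the upgrade action is at its pause point.
        actionPaused.await();
        events.add("loadSnapshot");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        snapshotLoaded.countDown();
      }
    }, "state-machine-init");
    init.start();

    try {
      // Stand-in for the first upgrade action (step 1).
      events.add("firstUpgradeAction:start");
      // Models step 2: the action is "paused" here while the other thread runs.
      actionPaused.countDown();
      if (!snapshotLoaded.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("init thread did not finish");
      }
      events.add("firstUpgradeAction:end");
      init.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return events;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

In this model the loadSnapshot event is always recorded between the start and 
the end of the upgrade action, which is the overlap the steps above provoke.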
 
2) On the other hand, if you remove the HDDS-5513 changes (commit d405ebf), 
the exception can still be reproduced in the test environment, even with the 
fixed ratis. In that case the cause is not the consistency of the state 
machine but the public DatanodeStateMachine#triggerHeartbeat method. This 
means that the HDDS-5513 fix (commit d405ebf) is still needed.
 
3) Regarding "but this will pose a problem if we need to run pre-finalize 
actions involving container data in the future": do I understand correctly 
that interaction with container data here means the same interaction via 
triggerHeartbeat()?
If so, the problem is clear but the solution is not, because pre-finalize 
actions run before the main DatanodeStateMachine loop and before 
context.execute() is called (as mentioned earlier).
If we are talking about loadSnapshot instead, then, given that it has been 
shown not to be involved in the data race (again, the cause was in ratis), the 
problem becomes even harder to imagine.
I would appreciate an example, because at the moment it is not entirely clear 
what exactly is meant, given that loadSnapshot is unrelated to the data race, 
as I explained in point (1).
 
If I'm wrong about something, please correct me.

> Datanode snapshot can be installed while pre-finalize actions are running
> -------------------------------------------------------------------------
>
>                 Key: HDDS-5525
>                 URL: https://issues.apache.org/jira/browse/HDDS-5525
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Ethan Rose
>            Priority: Major
>
> In HDDS-5513, a race condition occurred because snapshot installation was 
> occurring before the main DatanodeStateMachine loop, and therefore occurring 
> while pre-finalize actions could be running. In that Jira a workaround was 
> implemented to unblock upgrades, but this will pose a problem if we need to 
> run pre-finalize actions involving container data in the future.


