[jira] [Commented] (HDDS-9258) LegacyReplicationManager: Pending deletes on unhealthy replicas can cause calculation errors

Stephen O'Donnell (Jira) Tue, 12 Dec 2023 13:13:07 -0800


    [ 
https://issues.apache.org/jira/browse/HDDS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795939#comment-17795939
 ]


Stephen O'Donnell commented on HDDS-9258:
-----------------------------------------

I added a small test to the LegacyReplicationManager tests, and it proves this 
issue does exist. That is, if we have 3 closed and 1 or more unhealthy 
replicas, then on the first pass it will send a delete, and on the second pass 
the container will be marked as under replicated. This is the test that proves 
it:

{code}
+    @Test
+    public void testUnhealthyWithDelete() throws Exception {
+      final ContainerInfo container = createContainer(LifeCycleState.CLOSED, 0,
+          0);
+      addReplica(container, new NodeStatus(IN_SERVICE, HEALTHY), CLOSED);
+      addReplica(container, new NodeStatus(IN_SERVICE, HEALTHY), CLOSED);
+      addReplica(container, new NodeStatus(IN_SERVICE, HEALTHY), CLOSED);
+      addReplica(container, new NodeStatus(IN_SERVICE, HEALTHY), UNHEALTHY);
+
+      assertDeleteScheduled(1);
+      assertOverReplicatedCount(1);
+      assertUnderReplicatedCount(0);
+
+      assertDeleteScheduled(0);
+      assertUnderReplicatedCount(0);
+    }
{code}

It will fail on the last assertion.
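
To make the failure concrete, here is a minimal sketch of the arithmetic from the issue description. It is not the real RatisContainerReplicaCount code: the class, the method names and the replication factor of 3 are assumptions made for illustration only.

```java
// Hypothetical sketch of the counting described in HDDS-9258.
// Not the actual RatisContainerReplicaCount implementation.
final class ReplicaCountSketch {

  // Count as described in the issue: every pending delete is subtracted
  // from the CLOSED replicas, even one that targets an UNHEALTHY replica.
  static int countAsDescribed(int closed, int pendingDeletesOnClosed,
      int pendingDeletesOnUnhealthy) {
    return closed - (pendingDeletesOnClosed + pendingDeletesOnUnhealthy);
  }

  // Adjustment proposed in the issue: UNHEALTHY replicas are ignored, so
  // pending deletes that target them are ignored as well.
  static int countIgnoringUnhealthyDeletes(int closed,
      int pendingDeletesOnClosed, int pendingDeletesOnUnhealthy) {
    return closed - pendingDeletesOnClosed;
  }

  static boolean isUnderReplicated(int count, int replicationFactor) {
    return count < replicationFactor;
  }

  public static void main(String[] args) {
    int factor = 3; // assumed replication factor
    // Second pass: 3 CLOSED replicas, 1 delete still pending on the
    // UNHEALTHY replica, no deletes pending on CLOSED replicas.
    int described = countAsDescribed(3, 0, 1);
    int adjusted = countIgnoringUnhealthyDeletes(3, 0, 1);
    System.out.println("as described: count=" + described
        + " underReplicated=" + isUnderReplicated(described, factor));
    System.out.println("adjusted:     count=" + adjusted
        + " underReplicated=" + isUnderReplicated(adjusted, factor));
  }
}
```

With the count as described, the container looks under replicated while the delete is pending, which matches the failing assertion in the test.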

At this point, I don't believe it is worth fixing this, for a few reasons:

1. The problem is only in the legacy RM, which is being phased out. A fix for 
1.4 or on the master branch would likely never be used.
2. The code is somewhat tricky in the legacy manager, so there is a danger of 
causing further issues with a fix.
3. The problem itself doesn't do any real harm. In the normal case, the delete 
on the first pass will have been completed before the next pass and then the 
problem does not occur. Even if it does not complete in time, the worst case is 
an extra replica gets scheduled that will eventually get removed as over 
replicated, so there is no risk of data loss or corruption.
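
The worst case in point 3 can be walked through with a toy simulation. This is only a sketch under simplifying assumptions (replication factor 3, one action per pass, commands that complete all at once). The class, its fields and the event names are hypothetical and do not mirror the LegacyReplicationManager code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the worst case: the delete sent on pass 1 is still pending
// on pass 2. Hypothetical names; not the LegacyReplicationManager logic.
final class WorstCaseTimelineSketch {
  static final int FACTOR = 3; // assumed replication factor

  int closed = 3;                    // CLOSED replicas
  int unhealthy = 1;                 // UNHEALTHY replicas
  int pendingDeletesOnUnhealthy = 0; // deletes sent, not yet completed
  int pendingDeletesOnClosed = 0;
  int pendingAdds = 0;               // replications sent, not yet completed
  final List<String> log = new ArrayList<>();

  // One simplified pass. All pending deletes are subtracted from the
  // CLOSED count, which is the miscount described in the issue.
  void pass() {
    int counted = closed - pendingDeletesOnUnhealthy - pendingDeletesOnClosed;
    if (counted + pendingAdds < FACTOR) {
      pendingAdds++;
      log.add("REPLICATE");
    } else if (unhealthy > 0 && pendingDeletesOnUnhealthy == 0) {
      pendingDeletesOnUnhealthy++;
      log.add("DELETE_UNHEALTHY");
    } else if (counted > FACTOR) {
      pendingDeletesOnClosed++;
      log.add("DELETE_EXCESS");
    } else {
      log.add("OK");
    }
  }

  // All outstanding commands finish on the datanodes.
  void completePending() {
    unhealthy -= pendingDeletesOnUnhealthy;
    closed += pendingAdds - pendingDeletesOnClosed;
    pendingDeletesOnUnhealthy = 0;
    pendingDeletesOnClosed = 0;
    pendingAdds = 0;
  }

  public static void main(String[] args) {
    WorstCaseTimelineSketch s = new WorstCaseTimelineSketch();
    s.pass();            // delete sent for the UNHEALTHY replica
    s.pass();            // delete still pending: extra replica scheduled
    s.completePending(); // delete and copy finish: 4 CLOSED replicas
    s.pass();            // excess CLOSED replica scheduled for delete
    s.completePending();
    s.pass();            // back to 3 CLOSED replicas
    System.out.println(s.log + " closed=" + s.closed);
  }
}
```

In this model the container never drops below 3 CLOSED replicas at any point, which is why the extra copy is wasteful but not dangerous.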

With the above in mind, I am going to close this as "won't fix" for now.

> LegacyReplicationManager: Pending deletes on unhealthy replicas can cause 
> calculation errors
> --------------------------------------------------------------------------------------------
>
>                 Key: HDDS-9258
>                 URL: https://issues.apache.org/jira/browse/HDDS-9258
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM
>            Reporter: Siddhant Sangwan
>            Priority: Major
>
> In RatisContainerReplicaCount, should we discount any pending deletes for 
> replicas that LRM sees as unhealthy? Since we ignore UNHEALTHY containers, it 
> makes sense to not count their pending deletes.
> Suppose there's a CLOSED container with replicas:
> CLOSED, CLOSED, CLOSED, UNHEALTHY (not counted, seen as excess that can be 
> deleted).
> In the current iteration, RM sends a delete command for the unhealthy, so now 
> there's a pending delete. In the next iteration, if the delete is still 
> pending, then RM will see 3 CLOSED replicas - 1 pending delete + 1 UNHEALTHY 
> replica. But since UNHEALTHY replicas are ignored, that's effectively 3 CLOSED 
> replicas - 1 pending delete (even though the delete is for the UNHEALTHY). 
> This means the effective count becomes 2, which is seen as under replicated. 
> Of course, this container is not actually under replicated. We need to verify 
> if it's actually a bug - I have not written any tests to reproduce this yet.


