siddhantsangwan commented on PR #6967: URL: https://github.com/apache/ozone/pull/6967#issuecomment-2249695064
My intention in including quasi-closed replicas when deciding on the container's state is to try to prevent complete data loss, even if some data loss has already occurred. This is the direction we decided on when refactoring the replication manager last year.

> In order for a container to have moved to the DELETING state in the first place, it must have first been in the CLOSED state.

Agreed.

> If all CLOSED replicas were all lost, then that is a data loss scenario caused by too many faults in the system.

There's data loss, right, but shouldn't we try to salvage any data that remains in the quasi-closed replicas?

> and we should not try to save them from remaining partial/lagging replicas in states OPEN, CLOSING, QUASI-CLOSED.

I'm not sure I understand why we shouldn't save lagging quasi-closed replicas here. Are you expecting the data race in [HDDS-8129](https://issues.apache.org/jira/browse/HDDS-8129) to have completely corrupted these replicas?

> Even though this patch is only trying to handle QUASI-CLOSED replicas right now, I'm grouping QUASI-CLOSED, OPEN, and CLOSING replicas together because:
>
> - If the SCM state is DELETING then we know these replicas' states are lagging behind CLOSED replicas that may or may not still exist.
> - [They do not process block deletes](https://github.com/apache/ozone/blob/91c409d501c842d29d5355287af16d32db23d310/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/BlockDeletingService.java#L204). Therefore the block deleting service will leave their isEmpty flag set to false.
> - This flag may be set to true on [restart](https://github.com/apache/ozone/blob/930d3f0552b86d6605625cabe5e8961652d14fa2/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java#L351) regardless of container state, but I think that is a separate issue we should fix, such that only CLOSED containers can have isEmpty = true.
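To make the "only CLOSED containers should get isEmpty = true" suggestion above concrete, here is a minimal sketch of such a restart-time guard. The names (`ContainerState`, `markEmptyOnRestart`) are illustrative only, not the actual Ozone API:

```java
// Hypothetical sketch: on datanode restart, only mark a container empty
// when it is CLOSED. Non-CLOSED replicas never process block deletes, so
// a zero block count on them may simply mean they are lagging.
enum ContainerState { OPEN, CLOSING, QUASI_CLOSED, CLOSED, DELETED }

final class ContainerScanSketch {
  static boolean markEmptyOnRestart(ContainerState state, long blockCount) {
    // A QUASI_CLOSED replica with blockCount == 0 may still hold data
    // that SCM has not accounted for, so it is never eligible.
    return state == ContainerState.CLOSED && blockCount == 0;
  }
}
```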
> There is extra complexity in trying to save these stale replicas. For example, currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted. Whether or not the patch does this depends on the order of the container reports received and how those interleave with replication manager runs.

I'm not very clear on whether block deletion can directly cause a container state change. As far as I know, a container should transition to DELETING (correctly) only when it has a positive number of replicas and _all_ its replicas known to SCM are closed and empty:

```java
private boolean isContainerEmptyAndClosed(final ContainerInfo container,
    final Set<ContainerReplica> replicas) {
  return container.getState() == HddsProtos.LifeCycleState.CLOSED
      && !replicas.isEmpty()
      && replicas.stream().allMatch(
          r -> r.getState() == ContainerReplicaProto.State.CLOSED
              && r.isEmpty());
}
```

That means if any non-closed replicas exist at all, the container is incorrectly in the DELETING state, unless all the closed replicas were empty. Now consider our situation, where a container moved to the DELETING state incorrectly in a previous version because its keyCount was 0 even though the container wasn't actually empty. Suppose it has a quasi-closed replica that was offline, and all the closed replicas have keyCount = 0 (but are not actually empty). After upgrading to the latest version with our fix, say the CLOSED replicas are somehow lost and the quasi-closed replica comes back. At this point the container is incorrectly in the DELETING state, because its closed replicas, now lost, weren't actually empty. So we should transition it back to CLOSED when the quasi-closed replica is reported. If we don't, we risk losing that replica as well.
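The recovery rule argued for above can be sketched as a small predicate. This is a simplified model under the assumptions in this thread, not the actual replication manager code; the names (`RecoverySketch`, `shouldRevertToClosed`, the mini enums and `Replica` record) are hypothetical:

```java
// Hypothetical sketch: if a container is in DELETING but any surviving
// replica is reported non-empty, the "all replicas closed and empty"
// precondition for DELETING was violated, so move the container back to
// CLOSED rather than deleting the remaining data.
import java.util.Set;

enum LifeCycleState { CLOSED, DELETING, DELETED }
enum ReplicaState { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

record Replica(ReplicaState state, boolean empty) {}

final class RecoverySketch {
  static boolean shouldRevertToClosed(LifeCycleState containerState,
      Set<Replica> replicas) {
    // A non-empty replica (e.g. a returning QUASI_CLOSED one) contradicts
    // the emptiness check that justified DELETING, so salvage it.
    return containerState == LifeCycleState.DELETING
        && replicas.stream().anyMatch(r -> !r.empty());
  }
}
```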
> For example currently the patch does not correctly handle the case where we see CLOSED replicas that are empty and non-CLOSED replicas which are not empty. The empty CLOSED replica confirms the non-empty non-closed one is stale and the container can be deleted.

If the container has a non-closed, non-empty replica, it shouldn't have moved to the DELETING state in the first place, according to the code I quoted above. If the non-closed replica was somehow offline, which is why that check passed, and we now transition the container back to CLOSED from DELETING on the basis of that non-closed replica, then we end up with an orphan container. I think that's an acceptable tradeoff.
