Re: [PR] HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas [ozone]

via GitHub Wed, 13 May 2026 03:51:55 -0700


Gargi-jais11 commented on code in PR #10109:
URL: https://github.com/apache/ozone/pull/10109#discussion_r3233543415



##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java:
##########
@@ -279,6 +285,49 @@ public Container<?> getContainer(long containerId) {
     return containerMap.get(containerId);
   }
 
+  /**
+   * Returns the max retry for a container map swap while acquiring container 
lock.
+   * @return max retry count
+   */
+  public static int maxContainerMapSwapRetries() {
+    return MAX_CONTAINER_MAP_SWAP_RETRIES;
+  }
+
+  /**
+   * Locks the container mapped to {@code containerId} for write, and verifies 
that the instance locked is still
+   * the one stored in this set. If the mapping is swapped or the container no 
longer exists, unlocks and retries up to
+   * {@link #maxContainerMapSwapRetries()} times, then returns {@code null}.
+   *
+   * @return the locked container, or {@code null} if none mapped, or mapping 
could not be stabilized
+   */
+  @Nullable
+  public Container<?> acquireContainerLock(long containerId) {
+    for (int retry = 0; retry < MAX_CONTAINER_MAP_SWAP_RETRIES; retry++) {
+      Container<?> candidate = getContainer(containerId);
+      if (candidate == null) {
+        LOG.info("Container {} no longer present in ContainerSet, skipping.", 
containerId);
+        return null;
+      }
+      candidate.writeLock();
+      Container<?> current = getContainer(containerId);
+      if (current == null) {
+        candidate.writeUnlock();
+        LOG.info("Container {} no longer exists in ContainerSet while 
acquiring lock.", containerId);
+        return null;
+      }
+      if (current != candidate) {
+        candidate.writeUnlock();
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Container {} mapping changed during lock acquisition 
(attempt {}); retrying.",

Review Comment:
   If you thinking for future unrelated changes similar to diskbalancer then 
yes what u are saying is definitely correct and keeping retry logic is correct 
way to avoid this fix fail. 
   @ChenSammi what do you say?



##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerSet.java:
##########
@@ -279,6 +285,49 @@ public Container<?> getContainer(long containerId) {
     return containerMap.get(containerId);
   }
 
+  /**
+   * Returns the max retry for a container map swap while acquiring container 
lock.
+   * @return max retry count
+   */
+  public static int maxContainerMapSwapRetries() {
+    return MAX_CONTAINER_MAP_SWAP_RETRIES;
+  }
+
+  /**
+   * Locks the container mapped to {@code containerId} for write, and verifies 
that the instance locked is still
+   * the one stored in this set. If the mapping is swapped or the container no 
longer exists, unlocks and retries up to
+   * {@link #maxContainerMapSwapRetries()} times, then returns {@code null}.
+   *
+   * @return the locked container, or {@code null} if none mapped, or mapping 
could not be stabilized
+   */
+  @Nullable
+  public Container<?> acquireContainerLock(long containerId) {
+    for (int retry = 0; retry < MAX_CONTAINER_MAP_SWAP_RETRIES; retry++) {
+      Container<?> candidate = getContainer(containerId);
+      if (candidate == null) {
+        LOG.info("Container {} no longer present in ContainerSet, skipping.", 
containerId);
+        return null;
+      }
+      candidate.writeLock();
+      Container<?> current = getContainer(containerId);
+      if (current == null) {
+        candidate.writeUnlock();
+        LOG.info("Container {} no longer exists in ContainerSet while 
acquiring lock.", containerId);
+        return null;
+      }
+      if (current != candidate) {
+        candidate.writeUnlock();
+        if (LOG.isDebugEnabled()) {
+          LOG.debug("Container {} mapping changed during lock acquisition 
(attempt {}); retrying.",

Review Comment:
   If you are thinking for future unrelated changes similar to diskbalancer 
then yes what u are saying is definitely correct and keeping retry logic is 
correct way to avoid this fix fail. 
   @ChenSammi what do you say?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas [ozone]

Reply via email to