Re: [PR] PHOENIX-7871 :- Fail-fast batched mutation block via DNRIOE inheritance [phoenix]

via GitHub Fri, 29 May 2026 15:09:09 -0700


tkhurana commented on code in PR #2491:
URL: https://github.com/apache/phoenix/pull/2491#discussion_r3327134255



##########
phoenix-core/src/it/java/org/apache/phoenix/end2end/index/IndexRegionObserverMutationBlockingIT.java:
##########
@@ -277,12 +277,276 @@ public void 
testSystemHAGroupTableMutationsAllowedDuringActiveToStandby() throws
     }
   }
 
+  /**
+   * Walks the cause chain of a {@link CommitException} looking for a
+   * {@link MutationBlockedIOException}. Handles two paths:
+   * <ul>
+   *   <li>Direct chain — {@code MutationBlockedIOException} now extends
+   *       {@link org.apache.hadoop.hbase.DoNotRetryIOException}, so HBase's 
RPC retry layers
+   *       fail-fast and propagate the exception without wrapping it in
+   *       {@link RetriesExhaustedWithDetailsException}.</li>
+   *   <li>REWDE-wrapped — the legacy retry path; preserved here for any 
caller that still
+   *       reaches it (defense-in-depth).</li>
+   * </ul>
+   */
   private boolean containsMutationBlockedException(CommitException e) {
-    Throwable cause = e.getCause();
-    if (cause instanceof RetriesExhaustedWithDetailsException) {
-      RetriesExhaustedWithDetailsException re = 
(RetriesExhaustedWithDetailsException) cause;
-      return re.getCause(0) instanceof MutationBlockedIOException;
+    Throwable t = e.getCause();
+    while (t != null) {
+      if (t instanceof MutationBlockedIOException) {
+        return true;
+      }
+      if (t instanceof RetriesExhaustedWithDetailsException) {
+        RetriesExhaustedWithDetailsException re = 
(RetriesExhaustedWithDetailsException) t;
+        for (int i = 0; i < re.getNumExceptions(); i++) {
+          if (re.getCause(i) instanceof MutationBlockedIOException) {
+            return true;
+          }
+        }
+      }
+      t = t.getCause();
     }
     return false;
   }
+
+  /**
+   * Regression test for the fail-fast fix on the batched-mutation path. The 
fix is the inheritance
+   * change on {@link MutationBlockedIOException} — it now extends
+   * {@code DoNotRetryIOException}, signaling intent to fail fast.
+   *
+   * <p>Empirical verification: server-side rehydration (via
+   * {@code ProtobufUtil.toException}) delivers a real {@code 
MutationBlockedIOException} instance
+   * to HBase's batched-RPC retry layer ({@code 
AsyncRequestFutureImpl.manageError}); the
+   * {@code instanceof DoNotRetryIOException} check at line 749 returns true 
post-inheritance, so
+   * no retries fire and the failure surfaces immediately. The inheritance 
change alone is
+   * sufficient for the batched path.
+   *
+   * <p>Test asserts:
+   * <ol>
+   *   <li>An {@link MutationBlockedIOException} appears somewhere on the 
{@link CommitException}
+   *       cause chain (proves the server gate fired).</li>
+   *   <li>Wall-clock duration of the failed commit is well below the pre-fix 
HBase retry-budget
+   *       tail (which was on the order of tens of seconds for default 16 
retries × per-retry
+   *       timeout). A bound of {@code MAX_FAIL_FAST_DURATION_MS} captures 
this with generous
+   *       headroom for slow CI; on a healthy mini-cluster the actual duration 
is
+   *       sub-second.</li>
+   * </ol>
+   */
+  @Test
+  public void testMutationBlockedFailsFastWithDNRIOE() throws Exception {
+    String dataTableName = generateUniqueName();
+    String indexName = generateUniqueName();
+
+    try (FailoverPhoenixConnection conn = (FailoverPhoenixConnection) 
DriverManager
+      .getConnection(CLUSTERS.getJdbcHAUrl(), clientProps)) {
+      conn.createStatement().execute(
+        "CREATE TABLE " + dataTableName + " (id VARCHAR PRIMARY KEY, name 
VARCHAR, age INTEGER)");
+      conn.createStatement()
+        .execute("CREATE INDEX " + indexName + " ON " + dataTableName + 
"(name)");
+
+      // Set up HAGroupStoreRecord that will block mutations 
(ACTIVE_TO_STANDBY state)
+      HAGroupStoreRecord haGroupStoreRecord =
+        new HAGroupStoreRecord(HAGroupStoreRecord.DEFAULT_PROTOCOL_VERSION, 
haGroupName,
+          HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY, 0L,
+          HighAvailabilityPolicy.FAILOVER.toString(), this.peerZkUrl, 
CLUSTERS.getMasterAddress1(),
+          CLUSTERS.getMasterAddress2(), CLUSTERS.getHdfsUrl1(), 
CLUSTERS.getHdfsUrl2(), 0L);
+      haAdmin.updateHAGroupStoreRecordInZooKeeper(haGroupName, 
haGroupStoreRecord, -1);
+      awaitZkPropagation(haGroupName,
+        HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY);
+
+      long startMs = System.currentTimeMillis();
+      try {
+        conn.createStatement()
+          .execute("UPSERT INTO " + dataTableName + " VALUES ('1', 'Alice', 
28)");
+        conn.commit();
+        fail("Expected MutationBlockedIOException to be thrown");
+      } catch (CommitException e) {
+        long durationMs = System.currentTimeMillis() - startMs;
+        assertTrue(
+          "Expected MutationBlockedIOException somewhere on the cause chain 
(helper walks both "
+            + "direct and REWDE-wrapped layouts).",
+          containsMutationBlockedException(e));
+        assertTrue("Expected fail-fast: failed commit took " + durationMs
+          + "ms which exceeds the bound of " + MAX_FAIL_FAST_DURATION_MS
+          + "ms. Pre-fix this took on the order of tens of seconds while HBase 
exhausted its "
+          + "retry budget against the now-blocking server. If this assertion 
fails, HBase has "
+          + "started retrying despite DoNotRetryIOException — investigate 
whether the inheritance"
+          + " change on MutationBlockedIOException is still in place.",
+          durationMs < MAX_FAIL_FAST_DURATION_MS);
+      }
+    }
+  }
+
+  /**
+   * Wall-clock bound for fail-fast on the batched mutation path. Pre-fix, 
HBase's default
+   * {@code hbase.client.retries.number=16} multiplied by per-retry RPC 
timeouts produced tails
+   * on the order of tens of seconds. Post-fix (server-side DNRIOE 
inheritance), the first hit
+   * propagates immediately. The 10-second bound gives generous headroom for 
slow CI hardware
+   * while staying well below any plausible retry tail.
+   */
+  private static final long MAX_FAIL_FAST_DURATION_MS = 10_000L;
+
+  /**
+   * Crisp empirical proof that the DNRIOE inheritance change alone — with no 
Phoenix-side
+   * client-code changes — is sufficient to deliver fail-fast on the batched 
mutation path.
+   *
+   * <p>Server-side rehydration via {@code ProtobufUtil.toException} delivers 
a real
+   * {@link MutationBlockedIOException} instance to HBase's batched-RPC retry 
layer
+   * ({@code AsyncRequestFutureImpl.manageError}); the {@code instanceof 
DoNotRetryIOException}
+   * check at line 749 returns true post-inheritance, so no retries fire and 
the failure
+   * surfaces immediately. This test exercises that path against a 
configured-elevated retry
+   * budget — without inheritance, a multi-second retry tail would be observed.
+   *
+   * <p>Distinguishes from {@link #testMutationBlockedFailsFastWithDNRIOE} by 
being a tighter,
+   * shorter assertion focused exclusively on the inheritance-driven fail-fast 
claim under a
+   * realistic retry config.
+   */
+  @Test
+  public void testMutationBlockedFailsFastViaInheritanceAlone() throws 
Exception {
+    String dataTableName = generateUniqueName();
+
+    Properties retryProps = new Properties();
+    retryProps.putAll(clientProps);
+    retryProps.setProperty("hbase.client.retries.number", "16");
+    retryProps.setProperty("hbase.client.pause", "100");
+
+    try (FailoverPhoenixConnection conn = (FailoverPhoenixConnection) 
DriverManager
+      .getConnection(CLUSTERS.getJdbcHAUrl(), retryProps)) {
+      conn.createStatement().execute(
+        "CREATE TABLE " + dataTableName + " (id VARCHAR PRIMARY KEY, name 
VARCHAR)");
+
+      HAGroupStoreRecord haGroupStoreRecord =
+        new HAGroupStoreRecord(HAGroupStoreRecord.DEFAULT_PROTOCOL_VERSION, 
haGroupName,
+          HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY, 0L,
+          HighAvailabilityPolicy.FAILOVER.toString(), this.peerZkUrl, 
CLUSTERS.getMasterAddress1(),
+          CLUSTERS.getMasterAddress2(), CLUSTERS.getHdfsUrl1(), 
CLUSTERS.getHdfsUrl2(), 0L);
+      haAdmin.updateHAGroupStoreRecordInZooKeeper(haGroupName, 
haGroupStoreRecord, -1);
+      awaitZkPropagation(haGroupName,
+        HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY);
+
+      long startMs = System.currentTimeMillis();
+      try {
+        conn.createStatement()
+          .execute("UPSERT INTO " + dataTableName + " VALUES ('1', 'Alice')");
+        conn.commit();
+        fail("Expected MutationBlockedIOException to be thrown");
+      } catch (CommitException ce) {
+        long elapsedMs = System.currentTimeMillis() - startMs;
+        assertTrue("MutationBlockedIOException expected on the cause chain",
+          containsMutationBlockedException(ce));
+        assertTrue("Expected fail-fast (<" + MAX_FAIL_FAST_DURATION_MS + "ms); 
got " + elapsedMs
+          + "ms. Without DNRIOE inheritance, this would scale with retries × 
pause and exceed "
+          + "the bound — proving the inheritance change is the load-bearing 
piece of the fix.",
+          elapsedMs < MAX_FAIL_FAST_DURATION_MS);
+      }
+    }
+  }
+
+  /**
+   * Synthetic forced-retry variant that drives the multi-attempt 
batched-mutation path
+   * explicitly. The default mini-cluster client config surfaces only one 
batch attempt before
+   * the failure propagates ({@code attempt=1/5}), so the wall-clock assertion 
in the unforced
+   * test is technically a no-op on this harness. This test raises
+   * {@code hbase.client.retries.number} and {@code hbase.client.pause} to 
values that would,
+   * absent the DNRIOE inheritance, force HBase's batched-RPC retry layer to 
consume many
+   * seconds before surfacing the failure.
+   *
+   * <p>With the inheritance change in place, wall-clock stays under
+   * {@link #MAX_FAIL_FAST_DURATION_MS} even under these elevated retry 
settings — that is the
+   * structural signal the fix works as intended on the batched path. Without 
the inheritance,
+   * wall-clock would scale with retries × pause.
+   *
+   * <p>Note: even with elevated retry settings, the mini-cluster batched path 
may still surface
+   * only a single attempt before propagation due to harness internals not 
faithfully modeling
+   * the production-perf retry tail observed on real clusters. In that case, 
this test functions
+   * as additional regression coverage rather than a true behavioral 
discriminator; real-cluster
+   * perf testing remains the authoritative validation of the wall-clock 
benefit.
+   */
+  @Test
+  public void testMutationBlockedFailsFastUnderElevatedRetries() throws 
Exception {
+    String dataTableName = generateUniqueName();
+    String indexName = generateUniqueName();
+
+    Properties retryProps = new Properties();
+    retryProps.putAll(clientProps);
+    // Elevate HBase client retry budget so absent-inheritance failure tail 
would scale into the
+    // tens of seconds: 16 retries × ~100ms base pause (with exponential 
backoff in HBase) yields
+    // multi-second tails before the per-attempt pause caps out.
+    retryProps.setProperty("hbase.client.retries.number", "16");
+    retryProps.setProperty("hbase.client.pause", "100");
+
+    try (FailoverPhoenixConnection conn = (FailoverPhoenixConnection) 
DriverManager
+      .getConnection(CLUSTERS.getJdbcHAUrl(), retryProps)) {
+      conn.createStatement().execute(
+        "CREATE TABLE " + dataTableName + " (id VARCHAR PRIMARY KEY, name 
VARCHAR, age INTEGER)");
+      conn.createStatement()
+        .execute("CREATE INDEX " + indexName + " ON " + dataTableName + 
"(name)");
+
+      HAGroupStoreRecord haGroupStoreRecord =
+        new HAGroupStoreRecord(HAGroupStoreRecord.DEFAULT_PROTOCOL_VERSION, 
haGroupName,
+          HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY, 0L,
+          HighAvailabilityPolicy.FAILOVER.toString(), this.peerZkUrl, 
CLUSTERS.getMasterAddress1(),
+          CLUSTERS.getMasterAddress2(), CLUSTERS.getHdfsUrl1(), 
CLUSTERS.getHdfsUrl2(), 0L);
+      haAdmin.updateHAGroupStoreRecordInZooKeeper(haGroupName, 
haGroupStoreRecord, -1);
+      awaitZkPropagation(haGroupName,
+        HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY);
+
+      long startMs = System.currentTimeMillis();
+      try {
+        conn.createStatement()
+          .execute("UPSERT INTO " + dataTableName + " VALUES ('1', 'Alice', 
28)");
+        conn.commit();
+        fail("Expected MutationBlockedIOException to be thrown");
+      } catch (CommitException e) {
+        long durationMs = System.currentTimeMillis() - startMs;
+        assertTrue(
+          "Expected MutationBlockedIOException somewhere on the cause chain.",
+          containsMutationBlockedException(e));
+        assertTrue("Expected fail-fast under elevated retry settings: failed 
commit took "
+          + durationMs + "ms which exceeds the bound of " + 
MAX_FAIL_FAST_DURATION_MS
+          + "ms. Under hbase.client.retries.number=16 and 
hbase.client.pause=100, the absence "
+          + "of the DNRIOE inheritance change would surface a multi-second 
retry tail. "
+          + "If this assertion fails, the inheritance change is not in place 
or HBase has "
+          + "started retrying despite DoNotRetryIOException — investigate the 
cause chain.",
+          durationMs < MAX_FAIL_FAST_DURATION_MS);
+      }
+    }
+  }

Review Comment:
   testMutationBlockedFailsFastViaInheritanceAlone and 
testMutationBlockedFailsFastUnderElevatedRetries look very similar. Either have 
one test or differentiate them by what they actually exercise.



##########
phoenix-core/src/it/java/org/apache/phoenix/end2end/index/IndexRegionObserverMutationBlockingIT.java:
##########
@@ -277,12 +278,276 @@ public void 
testSystemHAGroupTableMutationsAllowedDuringActiveToStandby() throws
     }
   }
 
+  /**
+   * Walks the cause chain of a {@link CommitException} looking for a
+   * {@link MutationBlockedIOException}. Handles two paths:
+   * <ul>
+   *   <li>Direct chain — {@code MutationBlockedIOException} now extends
+   *       {@link org.apache.hadoop.hbase.DoNotRetryIOException}, so HBase's 
RPC retry layers
+   *       fail-fast and propagate the exception without wrapping it in
+   *       {@link RetriesExhaustedWithDetailsException}.</li>
+   *   <li>REWDE-wrapped — the legacy retry path; preserved here for any 
caller that still
+   *       reaches it (defense-in-depth).</li>
+   * </ul>
+   */
   private boolean containsMutationBlockedException(CommitException e) {
-    Throwable cause = e.getCause();
-    if (cause instanceof RetriesExhaustedWithDetailsException) {
-      RetriesExhaustedWithDetailsException re = 
(RetriesExhaustedWithDetailsException) cause;
-      return re.getCause(0) instanceof MutationBlockedIOException;
+    Throwable t = e.getCause();
+    while (t != null) {
+      if (t instanceof MutationBlockedIOException) {
+        return true;
+      }
+      if (t instanceof RetriesExhaustedWithDetailsException) {
+        RetriesExhaustedWithDetailsException re = 
(RetriesExhaustedWithDetailsException) t;
+        for (int i = 0; i < re.getNumExceptions(); i++) {
+          if (re.getCause(i) instanceof MutationBlockedIOException) {
+            return true;
+          }
+        }
+      }
+      t = t.getCause();
     }
     return false;
   }
+
+  /**
+   * Regression test for the dual-path fail-fast fix on the batched-mutation 
path:
+   * <ul>
+   *   <li><b>Server-side change:</b> {@link MutationBlockedIOException} 
extends
+   *       {@code DoNotRetryIOException}, signaling intent to fail fast.</li>
+   *   <li><b>Phoenix client-side intercept:</b> {@code MutationState.send}'s 
catch block
+   *       inspects the cause chain via {@code findDoNotRetryCauseInChain} and 
short-circuits

Review Comment:
   The actual method is findHaFailoverCauseInChain



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] PHOENIX-7871 :- Fail-fast batched mutation block via DNRIOE inheritance [phoenix]

Reply via email to