lokiore commented on code in PR #2491:
URL: https://github.com/apache/phoenix/pull/2491#discussion_r3327334918
##########
phoenix-core/src/it/java/org/apache/phoenix/end2end/index/IndexRegionObserverMutationBlockingIT.java:
##########
@@ -277,12 +277,276 @@ public void
testSystemHAGroupTableMutationsAllowedDuringActiveToStandby() throws
}
}
+ /**
+ * Walks the cause chain of a {@link CommitException} looking for a
+ * {@link MutationBlockedIOException}. Handles two paths:
+ * <ul>
+ * <li>Direct chain — {@code MutationBlockedIOException} now extends
+ * {@link org.apache.hadoop.hbase.DoNotRetryIOException}, so HBase's
RPC retry layers
+ * fail-fast and propagate the exception without wrapping it in
+ * {@link RetriesExhaustedWithDetailsException}.</li>
+ * <li>REWDE-wrapped — the legacy retry path; preserved here for any
caller that still
+ * reaches it (defense-in-depth).</li>
+ * </ul>
+ */
private boolean containsMutationBlockedException(CommitException e) {
- Throwable cause = e.getCause();
- if (cause instanceof RetriesExhaustedWithDetailsException) {
- RetriesExhaustedWithDetailsException re =
(RetriesExhaustedWithDetailsException) cause;
- return re.getCause(0) instanceof MutationBlockedIOException;
+ Throwable t = e.getCause();
+ while (t != null) {
+ if (t instanceof MutationBlockedIOException) {
+ return true;
+ }
+ if (t instanceof RetriesExhaustedWithDetailsException) {
+ RetriesExhaustedWithDetailsException re =
(RetriesExhaustedWithDetailsException) t;
+ for (int i = 0; i < re.getNumExceptions(); i++) {
+ if (re.getCause(i) instanceof MutationBlockedIOException) {
+ return true;
+ }
+ }
+ }
+ t = t.getCause();
}
return false;
}
+
+ /**
+ * Regression test for the fail-fast fix on the batched-mutation path. The
fix is the inheritance
+ * change on {@link MutationBlockedIOException} — it now extends
+ * {@code DoNotRetryIOException}, signaling intent to fail fast.
+ *
+ * <p>Empirical verification: server-side rehydration (via
+ * {@code ProtobufUtil.toException}) delivers a real {@code
MutationBlockedIOException} instance
+ * to HBase's batched-RPC retry layer ({@code
AsyncRequestFutureImpl.manageError}); the
+ * {@code instanceof DoNotRetryIOException} check at line 749 returns true
post-inheritance, so
+ * no retries fire and the failure surfaces immediately. The inheritance
change alone is
+ * sufficient for the batched path.
+ *
+ * <p>Test asserts:
+ * <ol>
+ * <li>An {@link MutationBlockedIOException} appears somewhere on the
{@link CommitException}
+ * cause chain (proves the server gate fired).</li>
+ * <li>Wall-clock duration of the failed commit is well below the pre-fix
HBase retry-budget
+ * tail (which was on the order of tens of seconds for default 16
retries × per-retry
+ * timeout). A bound of {@code MAX_FAIL_FAST_DURATION_MS} captures
this with generous
+ * headroom for slow CI; on a healthy mini-cluster the actual duration
is
+ * sub-second.</li>
+ * </ol>
+ */
+ @Test
+ public void testMutationBlockedFailsFastWithDNRIOE() throws Exception {
+ String dataTableName = generateUniqueName();
+ String indexName = generateUniqueName();
+
+ try (FailoverPhoenixConnection conn = (FailoverPhoenixConnection)
DriverManager
+ .getConnection(CLUSTERS.getJdbcHAUrl(), clientProps)) {
+ conn.createStatement().execute(
+ "CREATE TABLE " + dataTableName + " (id VARCHAR PRIMARY KEY, name
VARCHAR, age INTEGER)");
+ conn.createStatement()
+ .execute("CREATE INDEX " + indexName + " ON " + dataTableName +
"(name)");
+
+ // Set up HAGroupStoreRecord that will block mutations
(ACTIVE_TO_STANDBY state)
+ HAGroupStoreRecord haGroupStoreRecord =
+ new HAGroupStoreRecord(HAGroupStoreRecord.DEFAULT_PROTOCOL_VERSION,
haGroupName,
+ HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY, 0L,
+ HighAvailabilityPolicy.FAILOVER.toString(), this.peerZkUrl,
CLUSTERS.getMasterAddress1(),
+ CLUSTERS.getMasterAddress2(), CLUSTERS.getHdfsUrl1(),
CLUSTERS.getHdfsUrl2(), 0L);
+ haAdmin.updateHAGroupStoreRecordInZooKeeper(haGroupName,
haGroupStoreRecord, -1);
+ awaitZkPropagation(haGroupName,
+ HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY);
+
+ long startMs = System.currentTimeMillis();
+ try {
+ conn.createStatement()
+ .execute("UPSERT INTO " + dataTableName + " VALUES ('1', 'Alice',
28)");
+ conn.commit();
+ fail("Expected MutationBlockedIOException to be thrown");
+ } catch (CommitException e) {
+ long durationMs = System.currentTimeMillis() - startMs;
+ assertTrue(
+ "Expected MutationBlockedIOException somewhere on the cause chain
(helper walks both "
+ + "direct and REWDE-wrapped layouts).",
+ containsMutationBlockedException(e));
+ assertTrue("Expected fail-fast: failed commit took " + durationMs
+ + "ms which exceeds the bound of " + MAX_FAIL_FAST_DURATION_MS
+ + "ms. Pre-fix this took on the order of tens of seconds while HBase
exhausted its "
+ + "retry budget against the now-blocking server. If this assertion
fails, HBase has "
+ + "started retrying despite DoNotRetryIOException — investigate
whether the inheritance"
+ + " change on MutationBlockedIOException is still in place.",
+ durationMs < MAX_FAIL_FAST_DURATION_MS);
+ }
+ }
+ }
+
+ /**
+ * Wall-clock bound for fail-fast on the batched mutation path. Pre-fix,
HBase's default
+ * {@code hbase.client.retries.number=16} multiplied by per-retry RPC
timeouts produced tails
+ * on the order of tens of seconds. Post-fix (server-side DNRIOE
inheritance), the first hit
+ * propagates immediately. The 10-second bound gives generous headroom for
slow CI hardware
+ * while staying well below any plausible retry tail.
+ */
+ private static final long MAX_FAIL_FAST_DURATION_MS = 10_000L;
+
+ /**
+ * Crisp empirical proof that the DNRIOE inheritance change alone — with no
Phoenix-side
+ * client-code changes — is sufficient to deliver fail-fast on the batched
mutation path.
+ *
+ * <p>Server-side rehydration via {@code ProtobufUtil.toException} delivers
a real
+ * {@link MutationBlockedIOException} instance to HBase's batched-RPC retry
layer
+ * ({@code AsyncRequestFutureImpl.manageError}); the {@code instanceof
DoNotRetryIOException}
+ * check at line 749 returns true post-inheritance, so no retries fire and
the failure
+ * surfaces immediately. This test exercises that path against a
configured-elevated retry
+ * budget — without inheritance, a multi-second retry tail would be observed.
+ *
+ * <p>Distinguishes from {@link #testMutationBlockedFailsFastWithDNRIOE} by
being a tighter,
+ * shorter assertion focused exclusively on the inheritance-driven fail-fast
claim under a
+ * realistic retry config.
+ */
+ @Test
+ public void testMutationBlockedFailsFastViaInheritanceAlone() throws
Exception {
+ String dataTableName = generateUniqueName();
+
+ Properties retryProps = new Properties();
+ retryProps.putAll(clientProps);
+ retryProps.setProperty("hbase.client.retries.number", "16");
+ retryProps.setProperty("hbase.client.pause", "100");
+
+ try (FailoverPhoenixConnection conn = (FailoverPhoenixConnection)
DriverManager
+ .getConnection(CLUSTERS.getJdbcHAUrl(), retryProps)) {
+ conn.createStatement().execute(
+ "CREATE TABLE " + dataTableName + " (id VARCHAR PRIMARY KEY, name
VARCHAR)");
+
+ HAGroupStoreRecord haGroupStoreRecord =
+ new HAGroupStoreRecord(HAGroupStoreRecord.DEFAULT_PROTOCOL_VERSION,
haGroupName,
+ HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY, 0L,
+ HighAvailabilityPolicy.FAILOVER.toString(), this.peerZkUrl,
CLUSTERS.getMasterAddress1(),
+ CLUSTERS.getMasterAddress2(), CLUSTERS.getHdfsUrl1(),
CLUSTERS.getHdfsUrl2(), 0L);
+ haAdmin.updateHAGroupStoreRecordInZooKeeper(haGroupName,
haGroupStoreRecord, -1);
+ awaitZkPropagation(haGroupName,
+ HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY);
+
+ long startMs = System.currentTimeMillis();
+ try {
+ conn.createStatement()
+ .execute("UPSERT INTO " + dataTableName + " VALUES ('1', 'Alice')");
+ conn.commit();
+ fail("Expected MutationBlockedIOException to be thrown");
+ } catch (CommitException ce) {
+ long elapsedMs = System.currentTimeMillis() - startMs;
+ assertTrue("MutationBlockedIOException expected on the cause chain",
+ containsMutationBlockedException(ce));
+ assertTrue("Expected fail-fast (<" + MAX_FAIL_FAST_DURATION_MS + "ms);
got " + elapsedMs
+ + "ms. Without DNRIOE inheritance, this would scale with retries ×
pause and exceed "
+ + "the bound — proving the inheritance change is the load-bearing
piece of the fix.",
+ elapsedMs < MAX_FAIL_FAST_DURATION_MS);
+ }
+ }
+ }
+
+ /**
+ * Synthetic forced-retry variant that drives the multi-attempt
batched-mutation path
+ * explicitly. The default mini-cluster client config surfaces only one
batch attempt before
+ * the failure propagates ({@code attempt=1/5}), so the wall-clock assertion
in the unforced
+ * test is technically a no-op on this harness. This test raises
+ * {@code hbase.client.retries.number} and {@code hbase.client.pause} to
values that would,
+ * absent the DNRIOE inheritance, force HBase's batched-RPC retry layer to
consume many
+ * seconds before surfacing the failure.
+ *
+ * <p>With the inheritance change in place, wall-clock stays under
+ * {@link #MAX_FAIL_FAST_DURATION_MS} even under these elevated retry
settings — that is the
+ * structural signal the fix works as intended on the batched path. Without
the inheritance,
+ * wall-clock would scale with retries × pause.
+ *
+ * <p>Note: even with elevated retry settings, the mini-cluster batched path
may still surface
+ * only a single attempt before propagation due to harness internals not
faithfully modeling
+ * the production-perf retry tail observed on real clusters. In that case,
this test functions
+ * as additional regression coverage rather than a true behavioral
discriminator; real-cluster
+ * perf testing remains the authoritative validation of the wall-clock
benefit.
+ */
+ @Test
+ public void testMutationBlockedFailsFastUnderElevatedRetries() throws
Exception {
+ String dataTableName = generateUniqueName();
+ String indexName = generateUniqueName();
+
+ Properties retryProps = new Properties();
+ retryProps.putAll(clientProps);
+ // Elevate HBase client retry budget so absent-inheritance failure tail
would scale into the
+ // tens of seconds: 16 retries × ~100ms base pause (with exponential
backoff in HBase) yields
+ // multi-second tails before the per-attempt pause caps out.
+ retryProps.setProperty("hbase.client.retries.number", "16");
+ retryProps.setProperty("hbase.client.pause", "100");
+
+ try (FailoverPhoenixConnection conn = (FailoverPhoenixConnection)
DriverManager
+ .getConnection(CLUSTERS.getJdbcHAUrl(), retryProps)) {
+ conn.createStatement().execute(
+ "CREATE TABLE " + dataTableName + " (id VARCHAR PRIMARY KEY, name
VARCHAR, age INTEGER)");
+ conn.createStatement()
+ .execute("CREATE INDEX " + indexName + " ON " + dataTableName +
"(name)");
+
+ HAGroupStoreRecord haGroupStoreRecord =
+ new HAGroupStoreRecord(HAGroupStoreRecord.DEFAULT_PROTOCOL_VERSION,
haGroupName,
+ HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY, 0L,
+ HighAvailabilityPolicy.FAILOVER.toString(), this.peerZkUrl,
CLUSTERS.getMasterAddress1(),
+ CLUSTERS.getMasterAddress2(), CLUSTERS.getHdfsUrl1(),
CLUSTERS.getHdfsUrl2(), 0L);
+ haAdmin.updateHAGroupStoreRecordInZooKeeper(haGroupName,
haGroupStoreRecord, -1);
+ awaitZkPropagation(haGroupName,
+ HAGroupStoreRecord.HAGroupState.ACTIVE_IN_SYNC_TO_STANDBY);
+
+ long startMs = System.currentTimeMillis();
+ try {
+ conn.createStatement()
+ .execute("UPSERT INTO " + dataTableName + " VALUES ('1', 'Alice',
28)");
+ conn.commit();
+ fail("Expected MutationBlockedIOException to be thrown");
+ } catch (CommitException e) {
+ long durationMs = System.currentTimeMillis() - startMs;
+ assertTrue(
+ "Expected MutationBlockedIOException somewhere on the cause chain.",
+ containsMutationBlockedException(e));
+ assertTrue("Expected fail-fast under elevated retry settings: failed
commit took "
+ + durationMs + "ms which exceeds the bound of " +
MAX_FAIL_FAST_DURATION_MS
+ + "ms. Under hbase.client.retries.number=16 and
hbase.client.pause=100, the absence "
+ + "of the DNRIOE inheritance change would surface a multi-second
retry tail. "
+ + "If this assertion fails, the inheritance change is not in place
or HBase has "
+ + "started retrying despite DoNotRetryIOException — investigate the
cause chain.",
+ durationMs < MAX_FAIL_FAST_DURATION_MS);
+ }
+ }
+ }
Review Comment:
Fair point — dropped `testMutationBlockedFailsFastViaInheritanceAlone`. The
test was redundant with `testMutationBlockedFailsFastUnderElevatedRetries`,
which already exercises the same elevated-retry inheritance-alone path. Updated
the Javadoc on the kept test to make the empirical-proof framing explicit.
Amended commit pushed.
_(Posted by Loki's Claude companion)_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]