Duo Zhang created HBASE-30096:
---------------------------------

             Summary: Fix flaky TestHRegion and TestHRegionWithInMemoryFlush
                 Key: HBASE-30096
                 URL: https://issues.apache.org/jira/browse/HBASE-30096
             Project: HBase
          Issue Type: Sub-task
          Components: test
            Reporter: Duo Zhang


Sonnet 4.5(4.6?) summary

TestHRegion and TestHRegionWithInMemoryFlush
Fix 1
Root Cause Analysis

The TestHRegionWithInMemoryFlush test is flapping due to zombie background 
threads that continue running after tests timeout, causing directory race 
conditions with subsequent tests.

What's Happening

testParallelIncrementWithMemStoreFlush and testParallelAppendWithMemStoreFlush 
create background flush threads
These threads run in a busy loop: while (!incrementDone.get())
When the test times out (after ~4 minutes in CI), an InterruptedException is 
thrown during thread.join()
The code never reaches the lines that set the done flag and stop the flush 
thread
The zombie flush thread keeps running, busy-looping trying to flush
Subsequent tests' tearDown() methods clean up directories
The zombie thread tries to access deleted directories → race condition errors
Evidence

Bad run shows only 5 tests completed (vs 136 in good run)
Thread-4940 logs "Skipping flush because closing" repeatedly for over a minute
6 flaky errors about missing directories in unrelated tests like testGet_Basic 
and testScanner_JoinedScanners
Error messages: "Parent directory doesn't exist", "chmod: cannot access"
The Fix

I've fixed both problematic tests in 
/Users/hdara/src/apache/hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java:

Changes made:

Added interrupt check to flush thread loop: while (!done.get() && 
!Thread.currentThread().isInterrupted())
Wrapped test logic in try-finally to ensure cleanup even on timeout/failure
In finally block:
Set done flag to true
Interrupt the flush thread
Wait up to 5 seconds for thread to stop
Log warning if thread doesn't stop
Files modified:

TestHRegion.java:5322-5368 - testParallelIncrementWithMemStoreFlush
TestHRegion.java:5407-5459 - testParallelAppendWithMemStoreFlush
Why This Only Happened in CI

CI environments are typically slower and more heavily loaded, causing tests to 
hit the 13-minute timeout threshold. Locally, the tests complete faster, so the 
threads finish normally before cleanup.

The fix ensures that background threads are always properly stopped, preventing 
zombie threads from interfering with subsequent tests.

Fix 2
Fix for another flapper condition diagnosed from TestHRegion failure

Summry of the fix by AI:

Fix directory race condition in parallel test execution

When tests run in parallel, HBaseTestingUtil.createSubDirAndSystemProperty()
would reuse system property values (like hadoop.log.dir and hadoop.tmp.dir)
set by other parallel test instances. This caused a race condition:

Test A calls setupDataTestDir(), creates unique directory UUID-A
Test A sets System.setProperty("hadoop.log.dir", UUID-A/hadoop-log-dir)
Test B calls setupDataTestDir(), creates unique directory UUID-B
Test B calls createSubDirAndSystemProperty() for hadoop.log.dir
System property already set, so Test B reuses UUID-A/hadoop-log-dir
Test A finishes, calls cleanupTestDir(), deletes UUID-A directory
Test B fails with "chmod: cannot access '.../WALs': No such file"
The original logic was added in HBASE-5064 (2012) to support parallel tests,
with the comment "we do nothing but hope that there will be no conflicts"
suggesting the authors knew this was imperfect but necessary for mini cluster
compatibility.

Fixed by checking if the directory referenced by the system property still
exists before reusing it:

If exists: reuse it (preserves original behavior for mini clusters)
If deleted: create new unique directory for this test instance
This prevents tests from using deleted directories while maintaining
compatibility with components that may require shared system properties.

Fix 3
Thread Cleanup Fixes Applied

testMutateRowInParallel (line 7487)
Issue: Writer thread had while (true) with no interrupt check
Fix: Changed to while (!Thread.currentThread().isInterrupted()), added 
try-finally block with thread cleanup
Impact: This was failing in both TestHRegion and TestHRegionWithInMemoryFlush
testGetWhileRegionClose (line 1323)
Issue: Thread cleanup code was OUTSIDE the finally block, so threads would keep 
running if test threw exception
Fix: Moved done.set(true) and thread joins INTO the finally block
Impact: Prevents zombie threads if test times out before normal completion
GetTillDoneOrException thread class (line 1377)
Issue: Loop only checked !done flag, no interrupt handling
Fix: Changed to while (!done.get() && !Thread.currentThread().isInterrupted())
Impact: Threads can now be interrupted by JUnit timeout
FlushThread class (line 4575)
Issue: Loop only checked !done flag
Fix: Changed to while (!done && !Thread.currentThread().isInterrupted())
Impact: More robust termination on test timeout
PutThread class (line 4791)
Issue: Loop only checked !done flag
Fix: Changed to while (!done && !Thread.currentThread().isInterrupted())
Impact: More robust termination on test timeout
Fix 4
Root Cause
The test testPrefetchRunTriggersEvictions failed because:

Random file size: 10,000 KVs created ~668KB of data (334 blocks × 2KB)
Insufficient to fill cache: The 1MB cache wasn't exceeded, so no evictions 
occurred
Flawed assertion: Assumed failedInserts == 0 meant evictions MUST have happened
Result: evictedCount = 0 AND failedInserts = 0 → test failed
Fix Applied
File: TestPrefetchWithBucketCache.java:322-354

Increased KVs 10,000 → 15,000 (50% increase)

Now reliably creates ~1MB+ of data, exceeding cache capacity
More resource-efficient than doubling to 20,000
Fixed assertion logic:

// OLD (incorrect):
if (bc.getStats().getFailedInserts() == 0) {
  assertNotEquals(0, bc.getStats().getEvictedCount());  // ❌ Could fail
}

// NEW (correct):
long evictions = bc.getStats().getEvictedCount();
long failedInserts = bc.getStats().getFailedInserts();
assertTrue(
  "Expected either evictions or failed inserts...",
  evictions > 0 || failedInserts > 0);  // ✅ Accepts either outcome
Fixed filename: "testPrefetchInterruptOnCapacity" → 
"testPrefetchRunTriggersEvictions"

The test now correctly validates capacity management without requiring a 
specific mechanism (evictions vs failed inserts), while using reasonable 
resources (1.5x KVs instead of 2x).

Fix 5 (also applies to TestAsyncTableScan and TestConnection
Summary of Fixes
All three flapping tests had the identical root cause: directory sharing race 
conditions in parallel test execution. Each test was trying to reuse 
directories from previous test runs that were already deleted, causing tests to 
fail or not be discovered properly.

Files Modified:
AbstractTestAsyncTableScan.java (fixes TestAsyncTableScan)

Added a helper method createConfiguration() that sets 
hbase.test.disable-directory-sharing=true
Passed this configuration to MiniClusterRule.newBuilder().setConfiguration()
Location: 
/Users/hdara/src/apache/hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/client/AbstractTestAsyncTableScan.java:77-88
TestReplicationBase.java (fixes TestEditsBehindDroppedTableTiming and all 
replication tests)

Added conf.setBoolean("hbase.test.disable-directory-sharing", true) in the 
setupConfig() method
This affects both UTIL1 and UTIL2, ensuring both replication clusters use 
isolated directories
Location: 
/Users/hdara/src/apache/hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationBase.java:195-198
TestConnection.java

Added 
TEST_UTIL.getConfiguration().setBoolean("hbase.test.disable-directory-sharing", 
true) in setUpBeforeClass() before startMiniCluster()
Location: 
/Users/hdara/src/apache/hbase/hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestConnection.java:90-93
How It Works:
The fix leverages the infrastructure you already added to HBaseTestingUtil in 
commit 30816234db, which checks for the hbase.test.disable-directory-sharing 
configuration. When set to true, each test gets its own unique test data 
directory instead of trying to reuse system properties from previous parallel 
test runs.

This prevents the race condition where:

Test A starts and creates directories
Test B starts in parallel and reuses Test A's directories (via system 
properties)
Test A finishes and cleans up its directories
Test B tries to use the now-deleted directories → FAILURE



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to