ConfX created YARN-11905:
----------------------------

             Summary: BaseAMRMProxyE2ETest.createApp() has infinite busy-wait loop causing test timeouts
                 Key: YARN-11905
                 URL: https://issues.apache.org/jira/browse/YARN-11905
             Project: Hadoop YARN
          Issue Type: Bug
          Components: test, yarn-client
    Affects Versions: 3.3.5
         Environment: - OS: macOS Darwin 25.0.0 (reproducible on other platforms)
- Java: OpenJDK 1.8
- Maven: 3.6+
- Tests: org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy (and subclasses)
            Reporter: ConfX


h2. DESCRIPTION:

Multiple AMRMProxy E2E tests consistently time out after 120 seconds due to an
unbounded busy-wait loop in the BaseAMRMProxyE2ETest.createApp() method. The loop
waits for an application attempt to reach the LAUNCHED state but has no sleep
interval or timeout mechanism, so it spins indefinitely if that state is never
reached.

This is a critical test quality issue that:
1. Causes 100% failure rate for affected tests
2. Wastes 2 minutes per test execution
3. Wastes CPU cycles with busy-waiting
4. Blocks test execution and CI/CD pipelines
h2. AFFECTED TESTS:

The following test methods consistently time out:
- TestAMRMProxy.testAMRMProxyTokenRenewal
- TestAMRMProxy.testAMRMProxyE2E
- TestAMRMProxy.testE2ETokenSwap

All three tests belong to TestAMRMProxy, which extends BaseAMRMProxyE2ETest, and each calls the buggy createApp() method.
h2. ROOT CAUSE:

File: hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/BaseAMRMProxyE2ETest.java
Method: createApp() (lines 176-180)

Problematic code:
{code:java}
// Wait for app attempt to reach launched state
while (true) {
    if (appAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED) {
        break;
    }
    // BUG: No sleep() call here!
    // BUG: No timeout mechanism!
    // This creates an infinite busy-wait loop
}{code}
Issues with this code:
1. *No sleep interval*: Loop spins continuously, wasting CPU
2. *No timeout*: If app never reaches LAUNCHED, loops forever
3. **No error handling**: Doesn't check why app isn't launching
4. **No diagnostic info**: Doesn't log current state for debugging
h2. STEPS TO REPRODUCE:

1. Check out Apache Hadoop 3.3.5 source code
2. Navigate to hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client
3. Run any of the affected tests:
   {code}
   mvn test -Dtest=TestAMRMProxy#testAMRMProxyTokenRenewal
   {code}
4. The test consistently times out after 120 seconds

Note: This is a 100% reproducible failure, not intermittent.
h2. EXPECTED RESULT:

The createApp() method should:
1. Wait for application attempt to reach LAUNCHED state
2. Sleep between checks to avoid busy-waiting
3. Have a reasonable timeout (e.g., 30 seconds)
4. Provide clear error message if timeout occurs
5. Complete successfully in reasonable time (< 10 seconds normally)
h2. ACTUAL RESULT:

Test times out after 120 seconds (test-level timeout):
{code:java}
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 120.073 s <<< FAILURE!
testAMRMProxyTokenRenewal(org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy)  Time elapsed: 120.035 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds
    at org.apache.hadoop.yarn.client.api.impl.BaseAMRMProxyE2ETest.createApp(BaseAMRMProxyE2ETest.java:177)
    at org.apache.hadoop.yarn.client.api.impl.TestAMRMProxy.testAMRMProxyTokenRenewal(TestAMRMProxy.java:176)
{code}
The test runs for the full 120-second timeout period, indicating the
loop never exits on its own.
h2. ROOT CAUSE ANALYSIS:

Possible reasons the application attempt never reaches LAUNCHED:
1. NodeManagers may not be registered yet
2. No resources available to launch the AM container
3. AMRMProxy configuration issue preventing normal launch
4. Scheduler unable to allocate container

The infinite loop prevents diagnosis because:
- No logging of current state
- No indication of what's blocking the launch
- Test just spins until timeout

I haven't yet determined why the attempt never reaches LAUNCHED, but I'm happy to discuss this further.

I believe a better approach for this test is to replace the unbounded loop with a
proper wait mechanism that sleeps between checks and enforces a timeout.
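
As a rough illustration of such a wait mechanism, here is a minimal, dependency-free sketch. The class name BoundedWait, the method waitFor, and its parameters are hypothetical and are not part of Hadoop; this is only one possible shape for the fix, not a proposed patch.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical helper (not a Hadoop class): polls a condition with a sleep and a deadline. */
public final class BoundedWait {
    private BoundedWait() {}

    /**
     * Checks the condition every pollMillis until it holds or timeoutMillis has elapsed.
     * On timeout, the exception message includes the last observed state to aid diagnosis.
     */
    public static void waitFor(BooleanSupplier condition,
                               long pollMillis,
                               long timeoutMillis,
                               Supplier<String> stateDescription)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison against the monotonic clock.
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Condition not met within " + timeoutMillis
                        + " ms; last observed state: " + stateDescription.get());
            }
            // Sleep between checks instead of spinning.
            Thread.sleep(pollMillis);
        }
    }
}
```

In createApp(), the condition would presumably be appAttempt.getAppAttemptState() == RMAppAttemptState.LAUNCHED, with the state description returning the current attempt state, a poll interval of around 100 ms, and a timeout of around 30 seconds. Hadoop's own test utilities also include GenericTestUtils.waitFor, which may serve the same purpose without a new helper.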



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
