[ 
https://issues.apache.org/jira/browse/YARN-11904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ConfX updated YARN-11904:
-------------------------
    Description: 
h2. DESCRIPTION:

The TestNMClient test class has a race condition in its @Before setup() method
that causes intermittent test failures with IndexOutOfBoundsException.

The issue occurs because the test fetches NodeManager reports immediately after
starting the YARN cluster, without waiting for NodeManagers to fully register
and transition to RUNNING state. This results in an empty nodeReports list,
which later causes an IndexOutOfBoundsException when the test tries to access
nodeReports.get(0) in the allocateContainers() method.

The bug is timing-dependent: the test may pass on fast hardware (where NMs
register quickly) but fail on slower systems or under load.

 
h2. ROOT CAUSE:

File: hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestNMClient.java
Method: setup() (lines 160-182)

Problematic code sequence:

 
{code:java}
@Before
public void setup() throws YarnException, IOException {
    // start minicluster
    yarnCluster = new MiniYARNCluster(TestAMRMClient.class.getName(),
        nodeCount, 1, 1);
    yarnCluster.init(conf);
    yarnCluster.start();              // ← NodeManagers start asynchronously

    // start rm client
    yarnClient = (YarnClientImpl) YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // get node info
    nodeReports = yarnClient.getNodeReports(NodeState.RUNNING);  // ← RACE CONDITION!
    // At this point, NodeManagers may not have registered yet
    // Result: nodeReports is EMPTY

    // ... rest of setup ...
}
{code}
Later in the test, the allocateContainers() method tries to access the first report:
{code:java}
String node = nodeReports.get(0).getNodeId().getHost();  // ← IndexOutOfBoundsException!
{code}
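
The failure mode can be reproduced deterministically outside of YARN. The sketch below is only an illustration: RegistrationRaceDemo and its methods are hypothetical stand-ins (not Hadoop classes), and a latch replaces the real NodeManager registration delay so the ordering is controlled rather than left to timing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stand-in for a cluster whose nodes register asynchronously. */
public final class RegistrationRaceDemo {
    // Plays the role of the ResourceManager's list of RUNNING nodes.
    private final List<String> runningNodes = new CopyOnWriteArrayList<>();
    private final CountDownLatch allowRegistration = new CountDownLatch(1);
    private final CountDownLatch registered = new CountDownLatch(1);

    /** Returns immediately; the node registers later on a background thread. */
    public void start() {
        Thread nodeManager = new Thread(() -> {
            try {
                allowRegistration.await();   // simulates registration latency
                runningNodes.add("nm-0");
                registered.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        nodeManager.setDaemon(true);
        nodeManager.start();
    }

    /** Point-in-time copy, analogous to fetching node reports once. */
    public List<String> snapshotRunningNodes() {
        return new ArrayList<>(runningNodes);
    }

    /** Lets the background registration finish and waits for it. */
    public void completeRegistration() throws InterruptedException {
        allowRegistration.countDown();
        registered.await();
    }
}
```

A snapshot taken right after start() is empty, so get(0) on it throws IndexOutOfBoundsException; a snapshot taken after completeRegistration() contains the node. In the real test the registration delay is not controlled, which is why the failure is intermittent.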
h2. STEPS TO REPRODUCE:
 # Check out Apache Hadoop 3.3.5 source code
 # Navigate to hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client
 # Run the test:
{code:bash}
mvn test -Dtest=TestNMClient#testNMClientNoCleanupOnStop
{code}
 # The test may fail intermittently depending on system performance.

Note: The failure is timing-dependent. It may pass on fast systems but fail on
slower systems or when the system is under load.

h2. EXPECTED RESULT:

The test should wait for NodeManagers to register and reach the RUNNING state
before fetching node reports, and should pass consistently regardless of system
performance.

 

h2. ACTUAL RESULT:

Test fails with IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:659)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.hadoop.yarn.client.api.impl.TestNMClient.allocateContainers(TestNMClient.java:324)
    at org.apache.hadoop.yarn.client.api.impl.TestNMClient.testNMClientNoCleanupOnStop(TestNMClient.java:290)
{code}
h2. PROPOSED FIX:

Add proper synchronization in the setup() method to wait for NodeManagers
before fetching node reports:
{code:java}
@Before
public void setup() throws YarnException, IOException {
    // start minicluster
    conf = new YarnConfiguration();
    conf.set(YarnConfiguration.NM_CONTAINER_STATE_TRANSITION_LISTENERS,
        DebugSumContainerStateListener.class.getName());
    yarnCluster =
        new MiniYARNCluster(TestAMRMClient.class.getName(), nodeCount, 1, 1);
    yarnCluster.init(conf);
    yarnCluster.start();
    assertNotNull(yarnCluster);
    assertEquals(STATE.STARTED, yarnCluster.getServiceState());

    // Wait for NodeManagers to connect
    try {
        if (!yarnCluster.waitForNodeManagersToConnect(30000)) {
            fail("NodeManagers failed to connect within 30 seconds");
        }
    } catch (InterruptedException e) {
        fail("Interrupted while waiting for NodeManagers: " + e.getMessage());
    }

    // start rm client
    yarnClient = (YarnClientImpl) YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    assertNotNull(yarnClient);
    assertEquals(STATE.STARTED, yarnClient.getServiceState());

    // get node info - wait for nodes to be in RUNNING state
    int retries = 10;
    while (retries > 0) {
        nodeReports = yarnClient.getNodeReports(NodeState.RUNNING);
        if (nodeReports != null && !nodeReports.isEmpty()) {
            break;
        }
        sleep(1000);
        retries--;
    }

    if (nodeReports == null || nodeReports.isEmpty()) {
        fail("No NodeManagers in RUNNING state after waiting");
    }

    // ... rest of setup ...
}
{code}
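
As a possible refinement of the retry loop above, the polling could be factored into a small deadline-based helper. This is a minimal sketch, not part of the proposed patch: NodeReportPoller and waitForNonEmpty are hypothetical names, and the timeout and interval values in the usage note are arbitrary. Hadoop's own test utilities may already provide an equivalent helper; that was not checked for this sketch.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: polls a fetch call until it returns a non-empty list. */
public final class NodeReportPoller {
    private NodeReportPoller() {
    }

    /**
     * Calls fetch repeatedly until it yields a non-null, non-empty list.
     * Throws TimeoutException once timeoutMillis has elapsed without success.
     */
    public static <T> List<T> waitForNonEmpty(Callable<List<T>> fetch,
            long timeoutMillis, long intervalMillis) throws Exception {
        long deadline = System.nanoTime()
            + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            List<T> result = fetch.call();
            if (result != null && !result.isEmpty()) {
                return result;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException(
                    "No non-empty result within " + timeoutMillis + " ms");
            }
            // Sleep for the interval, but never past the deadline.
            long sleepMillis = Math.max(1L, Math.min(intervalMillis,
                TimeUnit.NANOSECONDS.toMillis(remainingNanos)));
            Thread.sleep(sleepMillis);
        }
    }
}
```

In setup() this might be used as nodeReports = NodeReportPoller.waitForNonEmpty(() -> yarnClient.getNodeReports(NodeState.RUNNING), 30000, 500); which would require setup() to declare a broader throws clause. Compared with a fixed retry count, a deadline keeps the maximum wait explicit and returns as soon as a node is reported.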
I'm happy to submit a patch for this.


> TestNMClient has race condition causing intermittent IndexOutOfBoundsException
> ------------------------------------------------------------------------------
>
>                 Key: YARN-11904
>                 URL: https://issues.apache.org/jira/browse/YARN-11904
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: test, yarn-client
>    Affects Versions: 3.3.5
>         Environment: - OS: macOS Darwin 25.0.0 (also reproducible on other 
> platforms)
> - Java: OpenJDK 1.8
> - Maven: 3.6+
> - Test: org.apache.hadoop.yarn.client.api.impl.TestNMClient
>            Reporter: ConfX
>            Priority: Major


