Susheel Gupta created YARN-11817:
------------------------------------
Summary: Differentiate between container-executor and application
exit codes to prevent false NM health issues.
Key: YARN-11817
URL: https://issues.apache.org/jira/browse/YARN-11817
Project: Hadoop YARN
Issue Type: Improvement
Components: yarn
Reporter: Susheel Gupta
YARN treats container exit code 24 as a critical container-executor error
(INVALID_CONFIG_FILE) and marks the NodeManager as unhealthy. However, some
applications also use exit code 24 for their own logic, for example to signal
a missing config file. Because YARN cannot distinguish executor-level errors
from application-level exit codes, it flags healthy NodeManagers as unhealthy,
which affects other applications running on the same node.
{noformat}
2025-04-13 10:36:21,919 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_e51_1739441938175_0092_02_000001 and exit code: 24
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
...
2025-04-13 10:36:21,920 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
{noformat}
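
One possible shape for the requested distinction is sketched below. It is only
an illustration, not the NodeManager's actual code: the class ExitCodeOrigin,
the Origin enum, and the method shouldMarkNodeUnhealthy are hypothetical names,
and how the NodeManager would learn where an exit code came from is left open
here. The only value taken from this issue is 24 (INVALID_CONFIG_FILE).

```java
// Hypothetical sketch: decide on node health from the exit code AND its origin.
final class ExitCodeOrigin {

    // Value as described in this issue: the container-executor's
    // INVALID_CONFIG_FILE error.
    static final int INVALID_CONFIG_FILE = 24;

    // Assumption: the caller can tell whether the code was produced by the
    // container-executor itself or by the application process it launched.
    enum Origin { CONTAINER_EXECUTOR, APPLICATION }

    // Only an executor-level INVALID_CONFIG_FILE should affect node health;
    // the same numeric code coming from an application is just an app failure.
    static boolean shouldMarkNodeUnhealthy(int exitCode, Origin origin) {
        return origin == Origin.CONTAINER_EXECUTOR
            && exitCode == INVALID_CONFIG_FILE;
    }

    public static void main(String[] args) {
        // prints "true": executor-level configuration error
        System.out.println(
            shouldMarkNodeUnhealthy(24, Origin.CONTAINER_EXECUTOR));
        // prints "false": application chose exit code 24 for its own reasons
        System.out.println(
            shouldMarkNodeUnhealthy(24, Origin.APPLICATION));
    }
}
```

With a check of this kind, the container from the log above would still be
reported as failed, but the node would stay healthy unless the exit code is
known to come from the container-executor.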