Hello,

We recently started using CGroups with LinuxContainerExecutor, running Apache Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a YARN container will fail with a message like the following:

WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 35. Privileged Execution Operation Stderr: Could not create container dirsCould not create local files and directories

Looking at the container executor source, it's traceable to errors here:
https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604

And ultimately to:
https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672

The root failure seems to be in the underlying mkdir call, but its exit code / errno is swallowed, so we don't have more details. We tend to see this when many containers for the same application start at the same time on a host, and we suspect a race condition around the directories those containers share.

Has anyone seen similar failures when using the LinuxContainerExecutor?

This issue is compounded because LinuxContainerExecutor renders the node unhealthy in these scenarios:
https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L566

Under some circumstances this seems appropriate, but since this is a transient failure (none of these machines were at capacity for disks, inodes, etc.) we shouldn't take down the NodeManager. This blacklisting behavior was added as part of https://issues.apache.org/jira/browse/YARN-6302, which seems perfectly valid, but perhaps we should make it configurable so that certain users can opt out?

Cheers,
Jon
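
P.S. To make the suspected race concrete, here is a minimal sketch in C of how a directory-creation helper could tolerate a concurrent creator and report the errno instead of swallowing it. This is not the container-executor code: `mkdir_tolerant` is a hypothetical name, and the sketch assumes that an already-existing directory is acceptable as-is.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper, NOT taken from container-executor.c.
 * Creates a single directory and treats "another process created it
 * first" as success. On any other failure it logs the errno so the
 * root cause is visible to the caller's stderr.
 * Returns 0 on success, -1 on failure.
 */
int mkdir_tolerant(const char *path, mode_t mode) {
  assert(path != NULL);

  if (mkdir(path, mode) == 0) {
    return 0;  /* we created the directory ourselves */
  }

  /* Save errno before any later call can overwrite it. */
  int saved_errno = errno;

  if (saved_errno == EEXIST) {
    struct stat sb;
    /* Assumption: an existing directory is good enough here. */
    if (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
      return 0;  /* lost the race to a sibling, but the directory exists */
    }
  }

  fprintf(stderr, "mkdir(%s) failed: errno=%d (%s)\n",
          path, saved_errno, strerror(saved_errno));
  return -1;
}
```

Calling `mkdir_tolerant` twice on the same path succeeds both times, while a path with a missing parent fails and prints the errno (ENOENT). A privileged tool would likely also need to verify the owner and permissions of a pre-existing directory, which this sketch deliberately leaves out.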