Hello,

We recently started using CGroups with LinuxContainerExecutor, running
Apache Hadoop 3.0.0. Occasionally (roughly once in many millions of tasks) a
YARN container will fail with a message like the following:
WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit
code: 35. Privileged Execution Operation Stderr:
Could not create container dirsCould not create local files and directories

Looking at the container-executor source, the message is traceable to errors here:
https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604

And ultimately to
https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672

The root failure seems to be in the underlying mkdir call, but its return
code / errno is swallowed, so we don't have more details. We tend to see
this when many containers for the same application start at the same time
on a host, and we suspect a race condition around the directories that
those containers share.
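
To illustrate the kind of race we suspect, here is a minimal sketch of a
mkdir wrapper that treats "directory already exists" as success and reports
errno otherwise. The helper name mkdir_tolerant is hypothetical and this is
not the actual container-executor code; it only assumes plain POSIX mkdir
and stat semantics.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not taken from container-executor.c).
 * Creates a directory and tolerates losing a race with another process
 * that creates the same directory first. On a real failure it logs the
 * errno text instead of swallowing it.
 * Returns 0 on success, -1 on failure with errno preserved.
 */
static int mkdir_tolerant(const char *path, mode_t perm) {
  if (mkdir(path, perm) == 0) {
    return 0;
  }

  /* Save errno before any other call can overwrite it. */
  int saved_errno = errno;

  if (saved_errno == EEXIST) {
    /* Another container may have created it first; accept a directory. */
    struct stat sb;
    if (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
      return 0;
    }
  }

  fprintf(stderr, "Could not create directory %s: %s\n",
          path, strerror(saved_errno));
  errno = saved_errno;
  return -1;
}
```

With something along these lines, two containers racing to create the same
application directory would both succeed, and any other failure would at
least leave the errno text in the stderr that the NodeManager logs.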

Has anyone seen similar failures when using the LinuxContainerExecutor?

This issue is compounded because LinuxContainerExecutor renders the node
unhealthy in these scenarios:
https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L566

Under some circumstances this seems appropriate, but since this is a
transient failure (none of these machines were at capacity for disk space,
inodes, etc.) we shouldn't take down the NodeManager. The blacklisting
behavior was added as part of https://issues.apache.org/jira/browse/YARN-6302,
which seems perfectly valid, but perhaps we should make it configurable
so that certain users can opt out?

Cheers,
Jon
