[ https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619451#comment-16619451 ]
Jon Bender commented on YARN-8786:
----------------------------------

Ah, I figured we wanted to leave this open until the race conditions on the executor were resolved, but if we feel it's infrequent enough and the user-facing impact is minimal on 3.1.2+ I'm OK duping this into YARN-8751

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --------------------------------------------------------------
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Jon Bender
> Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl:
> Container container_1530684675517_516620_01_020846 transitioned from
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO
> monitor.ContainersMonitorImpl: Starting resource-monitoring for
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code:
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here:
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
> And ultimately to
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code
> / errno is swallowed so we don't have more details. We tend to see this when
> many containers start at the same time for the same application on a host,
> and suspect it may be related to some race conditions around those shared
> directories between containers for the same application.
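For illustration only, here is a minimal sketch of how a directory-creation helper could both report errno and tolerate a concurrent creator. It assumes the failure is a lost race on a shared per-application directory, which the report only suspects. `mkdir_tolerant` is a hypothetical name and is not the function in container-executor.c; only the POSIX calls `mkdir`, `stat`, and `strerror` are real.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not taken from container-executor.c).
 * Creates `path` with mode `perm`. Returns 0 if the directory exists when
 * the call returns, whether this caller or a concurrent one created it.
 * Returns -1 on any other failure, after logging the errno value so the
 * cause is not swallowed.
 */
static int mkdir_tolerant(const char *path, mode_t perm) {
  if (mkdir(path, perm) == 0) {
    return 0;                     /* this caller created the directory */
  }

  int saved = errno;              /* capture before any other call resets it */

  if (saved == EEXIST) {
    struct stat sb;
    /* Another process may have created it first; accept only a directory. */
    if (stat(path, &sb) == 0 && S_ISDIR(sb.st_mode)) {
      return 0;
    }
  }

  fprintf(stderr, "mkdir(%s) failed: %s (errno %d)\n",
          path, strerror(saved), saved);
  errno = saved;                  /* let the caller inspect the original cause */
  return -1;
}
```

With a helper of this shape, two containers of the same application calling `mkdir_tolerant` on the same path would both get 0, and any other failure would leave the errno in the NodeManager log rather than only exit code 35.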
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO
> nodemanager.NMAuditLogger: USER=root IP=<> Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1530684675517_559126
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO
> nodemanager.NMAuditLogger: USER=root IP=<> Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1530684675517_559126
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN
> nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished -
> Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed
> with state: EXITED_WITH_FAILURE APPID=application_1530684675517_559126
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application start in quick succession, followed
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect the upgrade to fix this;
> the only major JIRAs that affected the executor since 3.0.0 seem unrelated
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
> and
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])