Jim Brennan created YARN-8515: --------------------------------- Summary: container-executor can crash with SIGPIPE after nodemanager restart Key: YARN-8515 URL: https://issues.apache.org/jira/browse/YARN-8515 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan
When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is: # Nodemanager restarts # Containers are recovered and then run for a while # Containers are killed for some (legitimate) reason # Container-executor exits without removing the docker container. After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE. What is happening is that the shell command executor that is used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed. Then when the container-executor continues executing after the container exits with error, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this crashes the container-executor before it can actually remove the docker container. We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org