[ https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Varun Vasudev updated YARN-4309:
--------------------------------
    Attachment: YARN-4309.006.patch

Uploaded a new patch to address [~sidharta-s]'s comments.

[~leftnoteasy] -
bq. Since debug information fetch script (like copy script and list files) is at the end of launch_container.sh, is it possible that a container is killed so such script cannot be executed?

It's not at the end - it's just before the actual container process is launched, so if we reach the stage where we are ready to call launch_container.sh, the debug commands should almost always be run.

This is what the relevant lines from launch_container.sh look like with the patch:
{code}
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/directory.info"
exec /bin/bash -c "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/stdout 2>/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/stderr "
{code}

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch,
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch,
> YARN-4309.006.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it
> failed.
> My proposal is that if a container fails, we collect information about the
> container local dir and dump it into the container log dir. Ideally, I'd like
> to tar up the directory entirely, but I'm not sure of the security and space
> implications of such an approach. At the very least, we can list all the files
> in the container local dir, and dump the contents of launch_container.sh (into
> the container log dir).
> When log aggregation occurs, all this information will automatically get
> collected and make debugging such failures much easier.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
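
The ordering argument in the comment above (debug output is written before the `exec` that starts the container process) can be tried out with a small stand-alone sketch. This is not the patch itself: `log_dir`, `work_dir`, the `launch` function and the `dangling` symlink are hypothetical stand-ins, and a trivial `echo` takes the place of the real Java command.

```shell
#!/bin/bash
# Hypothetical stand-ins for the container log dir and local dir;
# in YARN these paths are chosen by the NodeManager.
log_dir="$(mktemp -d)"
work_dir="$(mktemp -d)"

launch() {
  # Subshell: exec replaces only this subshell, not the calling script.
  (
    cd "$work_dir" || exit 1
    # Create a deliberately broken symlink so find has something to report.
    ln -s does-not-exist dangling

    # Same pattern as the quoted lines: record debug info first ...
    echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"$log_dir/directory.info"
    find -L . -maxdepth 5 -type l -ls 1>>"$log_dir/directory.info"

    # ... then replace the shell with the "container" process.
    exec /bin/bash -c "echo 'container process started' 1>'$log_dir/stdout'"

    # Not reached: exec has already replaced the subshell.
    echo "never reached" 1>>"$log_dir/directory.info"
  )
}

launch
```

After the script runs, `$log_dir/directory.info` should hold the header line and the `dangling` entry, and `$log_dir/stdout` the output of the stand-in process. Nothing placed after `exec` runs, which is why the debug commands have to come before it.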