[ 
https://issues.apache.org/jira/browse/YARN-4309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-4309:
--------------------------------
    Attachment: YARN-4309.006.patch

Uploaded a new patch to address [~sidharta-s]'s comments.

[~leftnoteasy] - 
bq. Since debug information fetch script (like copy script and list files) is 
at the end of launch_container.sh, is it possible that a container is killed so 
such script cannot be executed?

It's not at the end - it's just before the actual container process is 
launched, so if we reach the stage where we are ready to call 
launch_container.sh, it should almost always be run. This is what the relevant 
lines from launch_container.sh look like with the patch:

{code}
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/directory.info"
exec /bin/bash -c "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog  -Xmx1024m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/stdout 2>/var/hadoop/hadoop-3-data/grid/log/application_1449046677123_0002/container_1449046677123_0002_01_000001/stderr "
{code}
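
To illustrate the ordering point, here is a minimal sketch that is not taken 
from the patch. The sandbox directories, the launch_sketch.sh name and the 
"container-started" echo (a stand-in for the real JVM command) are all 
assumptions made for the example. It shows that the debug lines run before the 
exec, and that a line placed after the exec would not run, since exec replaces 
the shell process:

```shell
#!/bin/bash
# Hypothetical sandbox standing in for the container work dir and log dir
# (assumption: the real paths are chosen by the NodeManager).
SANDBOX="$(mktemp -d)"
LOG_DIR="$SANDBOX/logs"
WORK_DIR="$SANDBOX/work"
mkdir -p "$LOG_DIR" "$WORK_DIR"

# Create a deliberately broken symlink so the find command has something to report.
ln -s /nonexistent/target "$WORK_DIR/broken_link"

# Write a small script that mimics the ordering used in launch_container.sh.
cat > "$SANDBOX/launch_sketch.sh" <<'EOF'
#!/bin/bash
LOG_DIR="$1"
# Debug information is written BEFORE the container process is started.
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"$LOG_DIR/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"$LOG_DIR/directory.info"
# exec replaces this shell with the "container" process (here just an echo).
exec /bin/bash -c "echo container-started 1>$LOG_DIR/stdout 2>$LOG_DIR/stderr"
# Never reached: exec does not return when it succeeds.
echo "after exec" 1>>"$LOG_DIR/directory.info"
EOF

# Run the sketch from inside the stand-in work dir.
(cd "$WORK_DIR" && bash "$SANDBOX/launch_sketch.sh" "$LOG_DIR")
```

After this runs, directory.info in the stand-in log dir holds the header line 
and the broken_link entry, but not the "after exec" line.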

> Add debug information to application logs when a container fails
> ----------------------------------------------------------------
>
>                 Key: YARN-4309
>                 URL: https://issues.apache.org/jira/browse/YARN-4309
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>         Attachments: YARN-4309.001.patch, YARN-4309.002.patch, 
> YARN-4309.003.patch, YARN-4309.004.patch, YARN-4309.005.patch, 
> YARN-4309.006.patch
>
>
> Sometimes when a container fails, it can be pretty hard to figure out why it 
> failed.
> My proposal is that if a container fails, we collect information about the 
> container local dir and dump it into the container log dir. Ideally, I'd like 
> to tar up the directory entirely, but I'm not sure of the security and space 
> implications of such an approach. At the very least, we can list all the 
> files in the container local dir, and dump the contents of 
> launch_container.sh (into the container log dir).
> When log aggregation occurs, all this information will automatically get 
> collected and make debugging such failures much easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
