[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated MAPREDUCE-6415:
-------------------------------------
    Attachment: MAPREDUCE-6415_branch-2_prelim_001.patch
                MAPREDUCE-6415_prelim_001.patch

I've uploaded a preliminary patch.  It adds a command that looks for 
eligible apps to process, generates a script that runs the 'hadoop archive' 
command, and runs that script in the distributed shell.  It also modifies the 
'yarn logs' command and the JHS so that they can read the har files.  This all 
follows the design document.

I still have to write some unit tests and split up the patch into MAPREDUCE and 
YARN (and HADOOP?) JIRAs.

We can also discuss whether we have the right criteria for eligibility.  I 
implemented the ones mentioned in the design document, but they shouldn't be 
too hard to change.

Here's the CLI usage:
{noformat}
>> bin/mapred archive-logs -help
usage: yarn archive-logs
 -help                       Prints this message
 -maxEligibleApps <n>        The maximum number of eligible apps to
                             process (default: -1 (all))
 -maxTotalLogsSize <bytes>   The maximum total logs size required to be
                             eligible (default: 1GB)
 -memory <megabytes>         The amount of memory for each container
                             (default: 1024)
 -minNumberLogFiles <n>      The minimum number of log files required to
                             be eligible (default: 20)
{noformat}
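
A rough sketch of how the two thresholds in the usage above might combine into an eligibility check. This is not the code in the patch: the function name `is_eligible` and the variable names are made up for illustration, "1GB" is assumed to mean 1024^3 bytes, and both bounds are assumed to be inclusive. Any other criteria from the design document are left out.

```shell
#!/bin/bash
# Hypothetical helper, NOT taken from the patch.
# Defaults mirror the CLI help text; 1GB is assumed to mean 1024^3 bytes.
MIN_NUMBER_LOG_FILES=20
MAX_TOTAL_LOGS_SIZE=$((1024 * 1024 * 1024))

# is_eligible <number of log files> <total log size in bytes>
# Returns 0 (true) when the app has at least the minimum number of log
# files and its total log size does not exceed the maximum.
is_eligible() {
        local fileCount="$1"
        local totalBytes="$2"
        [ "$fileCount" -ge "$MIN_NUMBER_LOG_FILES" ] &&
                [ "$totalBytes" -le "$MAX_TOTAL_LOGS_SIZE" ]
}
```

Under these assumptions, `is_eligible 25 1048576` succeeds, while an app with only 5 log files, or one with 2GB of logs, is skipped.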

I know it's a bit hard to tell from the Java code what the shell script looks 
like, so here's an example of one:
{code}
#!/bin/bash
set -e
set -x
CONTAINER_ID_NUM=`echo $CONTAINER_ID | cut -d "_" -f 5`
if [ "$CONTAINER_ID_NUM" == "000002" ]; then
        appId="application_1437514991365_0004"
        user="rkanter"
elif [ "$CONTAINER_ID_NUM" == "000003" ]; then
        appId="application_1437514991365_0005"
        user="rkanter"
elif [ "$CONTAINER_ID_NUM" == "000004" ]; then
        appId="application_1437514991365_0003"
        user="rkanter"
elif [ "$CONTAINER_ID_NUM" == "000005" ]; then
        appId="application_1437514991365_0007"
        user="rkanter"
elif [ "$CONTAINER_ID_NUM" == "000006" ]; then
        appId="application_1437514991365_0006"
        user="rkanter"
else
        echo "Unknown Mapping!"
        exit -1
fi
export HADOOP_CLIENT_OPTS="-Xmx1024m"
$HADOOP_HOME/bin/hadoop archive -Dmapreduce.framework.name=local -archiveName $appId.har -p /tmp/logs/$user/logs/$appId \* /tmp/logs/archive-logs-work
$HADOOP_HOME/bin/hadoop fs -mv /tmp/logs/archive-logs-work/$appId.har /tmp/logs/$user/logs/$appId/$appId.har
originalLogs=`$HADOOP_HOME/bin/hadoop fs -ls /tmp/logs/$user/logs/$appId | grep "^-" | awk '{print $8}'`
if [ ! -z "$originalLogs" ]; then
        $HADOOP_HOME/bin/hadoop fs -rm $originalLogs
fi
{code}
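
To make the container-to-app mapping at the top of the script above easier to picture, here is a minimal sketch of how such an if/elif chain could be generated. It is for illustration only: the patch builds the script in Java, and the function name `gen_mapping` and its `appId:user` argument format are hypothetical. Starting the numbering at 000002 follows the example; the assumption is that container 000001 is taken by the distributed shell's ApplicationMaster.

```shell
#!/bin/bash
# Hypothetical generator, for illustration only; NOT the code in the patch.
# gen_mapping <appId:user>...
# Prints an if/elif chain that maps container numbers to apps, starting at
# 000002 (assumption: container 000001 is the distributed shell AM).
gen_mapping() {
        # An empty chain would not be valid shell, so require one pair.
        [ "$#" -gt 0 ] || return 1
        local n=2
        local keyword="if"
        local pair
        for pair in "$@"; do
                # Single quotes keep $CONTAINER_ID_NUM literal in the output.
                printf '%s [ "$CONTAINER_ID_NUM" == "%06d" ]; then\n' "$keyword" "$n"
                printf '        appId="%s"\n' "${pair%%:*}"
                printf '        user="%s"\n' "${pair##*:}"
                keyword="elif"
                n=$((n + 1))
        done
        printf 'else\n        echo "Unknown Mapping!"\n        exit -1\nfi\n'
}
```

For example, `gen_mapping application_1437514991365_0004:rkanter application_1437514991365_0005:rkanter` prints a chain with the same shape as the first two branches of the script above, followed by the `else` fallback.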

> Create a tool to combine aggregated logs into HAR files
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6415
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: HAR-ableAggregatedLogs_v1.pdf, 
> MAPREDUCE-6415_branch-2_prelim_001.patch, MAPREDUCE-6415_prelim_001.patch
>
>
> While we wait for YARN-2942 to become viable, it would still be great to 
> improve the aggregated logs problem.  We can write a tool that combines 
> aggregated log files into a single HAR file per application, which should 
> solve the too many files and too many blocks problems.  See the design 
> document for details.
> See YARN-2942 for more context.



