[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723969#comment-14723969
 ] 

Jason Lowe commented on MAPREDUCE-6415:
---------------------------------------

mvn dependency:analyze says there are a number of things that should be cleaned 
up in the new pom:
{noformat}
[INFO] --- maven-dependency-plugin:2.2:analyze (default-cli) @ hadoop-archive-logs ---
[WARNING] Used undeclared dependencies found:
[WARNING]    org.apache.hadoop:hadoop-yarn-common:jar:2.8.0-SNAPSHOT:provided
[WARNING]    com.google.guava:guava:jar:11.0.2:provided
[WARNING]    commons-io:commons-io:jar:2.4:compile
[WARNING]    commons-logging:commons-logging:jar:1.1.3:provided
[WARNING]    org.apache.hadoop:hadoop-yarn-client:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:2.8.0-SNAPSHOT:test
[WARNING]    org.apache.hadoop:hadoop-yarn-api:jar:2.8.0-SNAPSHOT:provided
[WARNING]    commons-cli:commons-cli:jar:1.2:provided
[WARNING] Unused declared dependencies found:
[WARNING]    org.apache.hadoop:hadoop-annotations:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-mapreduce-client-hs:jar:2.8.0-SNAPSHOT:test
[WARNING]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:test-jar:tests:2.8.0-SNAPSHOT:test
[WARNING]    org.apache.hadoop:hadoop-hdfs:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-common:test-jar:tests:2.8.0-SNAPSHOT:test
{noformat}
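The usual fix for each "used undeclared" entry is to declare the artifact 
explicitly in the module's pom rather than relying on it transitively, and to 
drop the "unused declared" ones.  A sketch for one of the entries above 
(version inherited from the parent pom's dependencyManagement, as is the 
convention in Hadoop modules):

```xml
<!-- Declare a dependency the code actually uses, instead of picking it up
     transitively; the scope matches the warning above. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-yarn-common</artifactId>
  <scope>provided</scope>
</dependency>
```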

It would be nice if the usage output used the actual values from the code 
rather than hardcoded strings.  For example, we currently have to keep 
minNumLogFiles and the usage string in sync manually.  If the usage output 
referenced the minNumLogFiles value directly, then updating the value would 
automatically correct the usage message.  On a related note the usage 
currently mentions values like "1GB", but I don't believe the code supports 
memory units.
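To make that concrete, here's a minimal sketch of deriving the usage text from 
the same constant the validation code uses; the class name, option name, and 
default value are assumptions, not the patch's actual identifiers:

```java
public class UsageSketch {
    // Assumed default; in the real tool this would be the existing
    // minNumLogFiles constant, defined once and referenced everywhere.
    static final int MIN_NUM_LOG_FILES = 20;

    // The usage text interpolates the constant, so changing the default
    // in one place automatically corrects the help output too.
    static String usage() {
        return "-minNumberLogFiles <n>   minimum number of log files required"
            + " to be eligible (default: " + MIN_NUM_LOG_FILES + ")";
    }

    public static void main(String[] args) {
        System.out.println(usage());
    }
}
```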

Do we only want to consider aggregating logs that have totally succeeded?  What 
about the FAILED case or other terminal states?  Seems like any terminal state 
where we know there aren't going to be any more logs arriving should be 
eligible.
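A sketch of the broader eligibility check being suggested; the state names 
mirror YarnApplicationState, but the enum here is local so the example stands 
alone:

```java
public class TerminalStateSketch {
    // Local stand-in for YarnApplicationState.
    enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // Any terminal state means no more logs will arrive for the app, so its
    // logs are safe to aggregate into a HAR -- not just the FINISHED case.
    static boolean isEligible(AppState state) {
        return state == AppState.FINISHED
            || state == AppState.FAILED
            || state == AppState.KILLED;
    }
}
```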

Nit: it's wasteful for checkFiles to continue iterating the files once it finds 
an excluding condition.  We can also eliminate the need to track file counts 
explicitly and simply check files.length directly before we even start looping.
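The shape being suggested, sketched with a hypothetical excluding condition 
(zero-length file); checkFiles here is a stand-in for the patch's method, not 
its actual logic:

```java
import java.io.File;

public class CheckFilesSketch {
    static boolean checkFiles(File[] files, int minNumLogFiles) {
        // Count check up front via files.length; no explicit counter needed.
        if (files == null || files.length < minNumLogFiles) {
            return false;
        }
        for (File file : files) {
            // Return as soon as an excluding condition is found, rather
            // than continuing to iterate the remaining files.
            if (file.length() == 0) {
                return false;
            }
        }
        return true;
    }
}
```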

Is there a reason to support maxEligible being zero?  Wondering if that should 
be equivalent to a negative value and just cover everything.
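One way to normalize it, assuming maxEligible keeps its current meaning for 
positive values (the helper name is made up):

```java
public class MaxEligibleSketch {
    // Treat zero the same as a negative value: no limit, cover everything.
    static int effectiveLimit(int maxEligible, int totalApps) {
        return (maxEligible <= 0) ? totalApps : Math.min(maxEligible, totalApps);
    }
}
```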

Should the working directory contain something unique like the application ID 
in it somewhere?  This has the benefit of making it easier to clean up after a 
run without worrying about affecting other, possibly simultaneous runs.
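A sketch of that suggestion with a hypothetical base path (the real tool would 
presumably use org.apache.hadoop.fs.Path; java.nio is used here so the example 
stands alone): a per-application subdirectory can be deleted recursively after 
the run without risk to concurrent runs.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkingDirSketch {
    // Embedding the application ID makes each run's scratch space unique,
    // so cleanup is a simple recursive delete of that one subdirectory.
    static Path workingDir(String baseDir, String appId) {
        return Paths.get(baseDir, "archive-logs-work", appId);
    }
}
```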

> Create a tool to combine aggregated logs into HAR files
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6415
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>         Attachments: HAR-ableAggregatedLogs_v1.pdf, MAPREDUCE-6415.001.patch, 
> MAPREDUCE-6415_branch-2.001.patch, MAPREDUCE-6415_branch-2_prelim_001.patch, 
> MAPREDUCE-6415_branch-2_prelim_002.patch, MAPREDUCE-6415_prelim_001.patch, 
> MAPREDUCE-6415_prelim_002.patch
>
>
> While we wait for YARN-2942 to become viable, it would still be great to 
> improve the aggregated logs problem.  We can write a tool that combines 
> aggregated log files into a single HAR file per application, which should 
> solve the too many files and too many blocks problems.  See the design 
> document for details.
> See YARN-2942 for more context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
