[ https://issues.apache.org/jira/browse/MAPREDUCE-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723969#comment-14723969 ]
Jason Lowe commented on MAPREDUCE-6415:
---------------------------------------

mvn dependency:analyze says there are a number of things that should be cleaned up in the new pom:
{noformat}
[INFO] --- maven-dependency-plugin:2.2:analyze (default-cli) @ hadoop-archive-logs ---
[WARNING] Used undeclared dependencies found:
[WARNING]    org.apache.hadoop:hadoop-yarn-common:jar:2.8.0-SNAPSHOT:provided
[WARNING]    com.google.guava:guava:jar:11.0.2:provided
[WARNING]    commons-io:commons-io:jar:2.4:compile
[WARNING]    commons-logging:commons-logging:jar:1.1.3:provided
[WARNING]    org.apache.hadoop:hadoop-yarn-client:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:2.8.0-SNAPSHOT:test
[WARNING]    org.apache.hadoop:hadoop-yarn-api:jar:2.8.0-SNAPSHOT:provided
[WARNING]    commons-cli:commons-cli:jar:1.2:provided
[WARNING] Unused declared dependencies found:
[WARNING]    org.apache.hadoop:hadoop-annotations:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-mapreduce-client-hs:jar:2.8.0-SNAPSHOT:test
[WARNING]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-mapreduce-client-jobclient:test-jar:tests:2.8.0-SNAPSHOT:test
[WARNING]    org.apache.hadoop:hadoop-hdfs:jar:2.8.0-SNAPSHOT:provided
[WARNING]    org.apache.hadoop:hadoop-common:test-jar:tests:2.8.0-SNAPSHOT:test
{noformat}

It would be nice if the usage output used the actual values in the code rather than hardcoded strings. For example, we currently have to keep minNumLogFiles and the usage string manually in sync; if the usage output referenced the minNumLogFiles value directly, then updating the value would automatically correct the usage message. On a related note, the usage currently mentions values like "1GB", but I don't believe the code supports memory units.

Do we only want to consider aggregating logs that have completely succeeded? What about the FAILED case or other terminal states?
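A minimal sketch of the usage-string suggestion (all names here are invented for illustration, not the actual patch's identifiers): keep one constant as the single source of truth and build the usage text from it, so changing the constant automatically fixes the message.

```java
// Hypothetical sketch: usage text derived from the same constants the
// tool enforces, so the two can never drift out of sync.
public class ArchiveLogsUsageSketch {
    // Single source of truth for the defaults (names are illustrative).
    static final int MIN_NUM_LOG_FILES = 20;
    static final long MIN_TOTAL_LOGS_SIZE = 1024L * 1024 * 1024; // bytes, not "1GB"

    static String usage() {
        // The defaults are interpolated rather than hardcoded in the string.
        return "Usage: hadoop archive-logs\n"
            + " -minNumberLogFiles <n>  minimum number of log files required per app"
            + " (default: " + MIN_NUM_LOG_FILES + ")\n"
            + " -minTotalLogsSize <b>   minimum total logs size, in bytes, required"
            + " per app (default: " + MIN_TOTAL_LOGS_SIZE + ")";
    }

    public static void main(String[] args) {
        System.out.println(usage());
    }
}
```

Note the size default is expressed in plain bytes, matching the observation that the code does not currently parse memory units like "1GB".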
Seems like any terminal state where we know there aren't going to be any more logs arriving should be eligible.

Nit: it's wasteful for checkFiles to continue iterating the files once it finds an excluding condition. We can also eliminate the need to track file counts explicitly and simply check files.length directly before we even start looping.

Is there a reason to support maxEligible being zero? Wondering if that should be equivalent to a negative value and just cover everything.

Should the working directory contain something unique, like the application ID, in it somewhere? That would make it easier to clean up after a run without worrying about affecting other, possibly simultaneous runs.

> Create a tool to combine aggregated logs into HAR files
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6415
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6415
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.8.0
>            Reporter: Robert Kanter
>            Assignee: Robert Kanter
>      Attachments: HAR-ableAggregatedLogs_v1.pdf, MAPREDUCE-6415.001.patch,
> MAPREDUCE-6415_branch-2.001.patch, MAPREDUCE-6415_branch-2_prelim_001.patch,
> MAPREDUCE-6415_branch-2_prelim_002.patch, MAPREDUCE-6415_prelim_001.patch,
> MAPREDUCE-6415_prelim_002.patch
>
>
> While we wait for YARN-2942 to become viable, it would still be great to
> improve the aggregated logs problem. We can write a tool that combines
> aggregated log files into a single HAR file per application, which should
> solve the too many files and too many blocks problems. See the design
> document for details.
> See YARN-2942 for more context.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
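The checkFiles nit above can be sketched as follows (a hypothetical simplification, not the patch's actual code: the per-file excluding condition and the constant are stand-ins, and plain sizes stand in for FileStatus objects):

```java
// Hypothetical sketch of the early-exit shape suggested for checkFiles:
// check files.length up front instead of counting inside the loop, and
// bail out as soon as any excluding condition is found.
class EligibilityCheckSketch {
    static final int MIN_NUM_LOG_FILES = 20; // illustrative threshold

    // Stand-in for a per-file excluding condition (e.g. an empty log file).
    static boolean excluded(long fileSize) {
        return fileSize == 0;
    }

    static boolean checkFiles(long[] fileSizes) {
        // files.length handles the count check before we even start looping.
        if (fileSizes.length < MIN_NUM_LOG_FILES) {
            return false;
        }
        for (long size : fileSizes) {
            if (excluded(size)) {
                return false; // stop iterating at the first excluding condition
            }
        }
        return true;
    }
}
```

Returning immediately on the first excluding file keeps the check O(first failure) instead of always scanning every file, and dropping the explicit counter removes one piece of mutable state from the loop.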