Repository: hadoop
Updated Branches:
  refs/heads/branch-2 fa2378d86 -> 3165e778f
MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)

Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/3165e778
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/3165e778
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/3165e778

Branch: refs/heads/branch-2
Commit: 3165e778f3f000ca12717565fe03a53cd1e8ac93
Parents: fa2378d
Author: Allen Wittenauer <a...@apache.org>
Authored: Thu Jan 29 14:18:16 2015 -0800
Committer: Allen Wittenauer <a...@apache.org>
Committed: Thu Jan 29 14:18:25 2015 -0800

----------------------------------------------------------------------
 hadoop-mapreduce-project/CHANGES.txt           |   2 +
 hadoop-project/src/site/site.xml               |   1 +
 .../hadoop-rumen/src/site/markdown/Rumen.md.vm | 135 ++++++++++++-------
 3 files changed, 91 insertions(+), 47 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/3165e778/hadoop-mapreduce-project/CHANGES.txt
----------------------------------------------------------------------
diff --git a/hadoop-mapreduce-project/CHANGES.txt b/hadoop-mapreduce-project/CHANGES.txt
index d7ee8fb..1d16f78 100644
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -32,6 +32,8 @@ Release 2.7.0 - UNRELEASED
 
     MAPREDUCE-6141. History server leveldb recovery store (jlowe)
 
+    MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
+
     OPTIMIZATIONS
 
     MAPREDUCE-6169.
MergeQueue should release reference to the current item


http://git-wip-us.apache.org/repos/asf/hadoop/blob/3165e778/hadoop-project/src/site/site.xml
----------------------------------------------------------------------
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index b553489..c5b6740 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -106,6 +106,7 @@
       <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
       <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+      <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
     </menu>
 
     <menu name="MapReduce REST APIs" inherit="top">

http://git-wip-us.apache.org/repos/asf/hadoop/blob/3165e778/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
----------------------------------------------------------------------
diff --git a/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm b/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
index e25f3a7..bee976a 100644
--- a/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
+++ b/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
@@ -29,9 +29,7 @@ Rumen
   - [Components](#Components)
 - [How to use Rumen?](#How_to_use_Rumen)
   - [Trace Builder](#Trace_Builder)
-    - [Example](#Example)
   - [Folder](#Folder)
-    - [Examples](#Examples)
 - [Appendix](#Appendix)
   - [Resources](#Resources)
   - [Dependencies](#Dependencies)
@@ -128,18 +126,21 @@ can use the `Folder` utility to fold
 the current trace to the desired length. The remaining part of this section
 explains these utilities in detail.
 
-> Examples in this section assumes that certain libraries are present
-> in the java CLASSPATH. See <em>Section-3.2</em> for more details.
+Examples in this section assume that certain libraries are present
+in the java CLASSPATH. See [Dependencies](#Dependencies) for more details.
 
 $H3 Trace Builder
 
-`Command:`
+$H4 Command
 
-    java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```
 
-This command invokes the `TraceBuilder` utility of
-*Rumen*. It converts the JobHistory files into a series of JSON
+This command invokes the `TraceBuilder` utility of *Rumen*.
+
+TraceBuilder converts the JobHistory files into a series of JSON
 objects and writes them into the `<jobtrace-output>`
 file. It also extracts the cluster layout (topology) and writes it in
 the`<topology-output>` file.
@@ -169,7 +170,7 @@ Cluster topology is used as follows :
 * To extrapolate splits information for tasks with missing splits
   details or synthetically generated tasks.
 
-`Options :`
+$H4 Options
 
 <table>
   <tr>
@@ -204,33 +205,45 @@
 
 $H4 Example
 
-    java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use the
+`$HADOOP_HOME/bin/hadoop jar` command, as in the example below.
 
-This will analyze all the jobs in
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-topology.json \
+  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
 
-`/home/user/logs/history/done` stored on the
-`local` FileSystem and output the jobtraces in
-`/home/user/job-trace.json` along with topology
-information in `/home/user/topology.output`.
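The `<jobtrace-output>` file that TraceBuilder writes is described above as a series of JSON objects, one per job. As a rough illustration (not part of the patch: it assumes the trace is a stream of whitespace-separated JSON objects and uses made-up field names such as `jobID`; the real Rumen job schema is much richer), such a stream can be walked in Python:

```python
import json

def iter_trace_jobs(text):
    """Yield successive JSON objects from a Rumen-style trace string.

    Illustrative sketch: assumes whitespace-separated JSON objects,
    one per job (a hypothetical rendering of the trace format).
    """
    decoder = json.JSONDecoder()
    idx, n = 0, len(text)
    while idx < n:
        # Skip whitespace between consecutive JSON objects.
        while idx < n and text[idx].isspace():
            idx += 1
        if idx >= n:
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

# Two synthetic job records for demonstration:
trace = '{"jobID": "job_1", "mapTasks": 4}\n{"jobID": "job_2", "mapTasks": 8}'
jobs = list(iter_trace_jobs(trace))
print([j["jobID"] for j in jobs])  # ['job_1', 'job_2']
```

`json.JSONDecoder.raw_decode` is used here because it reports where each object ends, which lets the loop resume parsing at the next object without splitting the text by hand.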
+This will analyze all the jobs in
+`/tmp/hadoop-yarn/staging/history/done_intermediate/testuser`
+stored on the `HDFS` FileSystem
+and output the jobtraces in `/tmp/job-trace.json`
+along with topology information in `/tmp/job-topology.json`
+stored on the `local` FileSystem.
 
 $H3 Folder
 
-`Command`:
+$H4 Command
 
-    java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
-
-> Input and output to `Folder` is expected to be a fully
-> qualified FileSystem path. So use file:// to specify
-> files on the `local` FileSystem and hdfs:// to
-> specify files on HDFS.
+```
+java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+```
 
 This command invokes the `Folder` utility of
 *Rumen*. Folding essentially means that the output duration of the
 resulting trace is fixed and job timelines are adjusted
 to respect the final output duration.
 
-`Options :`
+> Input and output to `Folder` are expected to be fully
+> qualified FileSystem paths. So use `file://` to specify
+> files on the `local` FileSystem and `hdfs://` to
+> specify files on HDFS.
+
+
+$H4 Options
 
 <table>
   <tr>
@@ -335,14 +348,28 @@ to respect the final output duration.
 
 $H4 Examples
 
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
-
-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 If the folded jobs are out of order then the command will bail out.
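The de-skewing behaviour described next (`-allow-missorting` with `-skew-buffer-length`) can be pictured as a bounded min-heap: hold up to N out-of-order jobs, emit the smallest once the buffer is full, and bail out if an emitted job would go backwards in time. A toy sketch of that idea, not Rumen's actual implementation (`deskew` is a hypothetical helper):

```python
import heapq

def deskew(start_times, buffer_length):
    """Toy model of a skew buffer: reorder a slightly out-of-order
    stream of job start times by holding up to `buffer_length` items
    in a min-heap, bailing out when the skew exceeds the buffer.
    Illustrative only; not Rumen's source code."""
    heap, out = [], []

    def emit(value):
        # A value smaller than the last emitted one means the skew
        # was larger than the buffer could absorb.
        if out and value < out[-1]:
            raise ValueError("skew exceeds buffer; bailing out")
        out.append(value)

    for t in start_times:
        heapq.heappush(heap, t)
        if len(heap) > buffer_length:
            emit(heapq.heappop(heap))
    while heap:
        emit(heapq.heappop(heap))
    return out

print(deskew([1, 3, 2, 5, 4], buffer_length=2))  # [1, 2, 3, 4, 5]
```

With a buffer of 2, the mildly shuffled stream above is emitted in order; a stream whose displacement exceeds the buffer raises instead of silently producing a mis-sorted trace.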
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness
 
-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -allow-missorting \
+  -skew-buffer-length 100 \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 If the folded jobs are out of order, then atmost
 100 jobs will be de-skewed. If the 101<sup>st</sup> job is
@@ -350,23 +377,37 @@ If the folded jobs are out of order, then atmost
 
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode
 
-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -debug -temp-directory file:///tmp/debug \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
+`file:///tmp/job-trace.json` to finish within 1hr
 and use `file:///tmp/debug` as the temporary directory.
 The intermediate files in the temporary directory will not be cleaned
 up.
 
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.
-    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+  -output-duration 1h \
+  -input-cycle 20m \
+  -concentration 2 \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-trace-1hr.json
+```
 
 This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
-with concentration of 2. `Example-2.3.2` will retain 10%
-of the jobs. With *concentration* as 2, 20% of the total input
-jobs will be retained.
+`file:///tmp/job-trace.json` to finish within 1hr
+with a concentration of 2.
+If the 10h job-trace is folded to 1h, it retains 10% of the jobs by default.
+With a *concentration* of 2, 20% of the total input jobs will be retained.
 
 Appendix
@@ -377,21 +418,21 @@ $H3 Resources
 
 <a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
 is the main JIRA that introduced *Rumen* to *MapReduce*. Look at the MapReduce
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
-rumen-component</a>for further details.
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a>
+for further details.
 
 $H3 Dependencies
 
-*Rumen* expects certain library *JARs* to be present in
-the *CLASSPATH*. The required libraries are
-
-* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
-* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
-* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
-* `Apache Commons CLI` (`commons-cli-1.2.jar`)
-* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
-* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
-
-> One simple way to run Rumen is to use '$HADOOP_HOME/bin/hadoop jar'
-> option to run it.
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use the
+`hadoop jar` command, as in the example below.
+
+```
+$HADOOP_HOME/bin/hadoop jar \
+  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+  org.apache.hadoop.tools.rumen.TraceBuilder \
+  file:///tmp/job-trace.json \
+  file:///tmp/job-topology.json \
+  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
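The retention arithmetic in the `-concentration` example earlier (10h folded to 1h retains 10% of jobs by default; concentration 2 retains 20%) can be checked in a few lines. This is illustrative arithmetic derived from the documented behaviour, not a Rumen API (`retained_fraction` is a hypothetical helper):

```python
def retained_fraction(input_hours, output_hours, concentration=1):
    """Fraction of input jobs kept when folding a trace of
    input_hours down to output_hours, as described in the text:
    the default ratio scaled by the -concentration factor."""
    return (output_hours / input_hours) * concentration

print(retained_fraction(10, 1))     # default: 10% of jobs retained
print(retained_fraction(10, 1, 2))  # -concentration 2: 20% retained
```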