[25/34] hbase git commit: HBASE-12918 Backport asciidoc changes (apurtell and enis)

enis Tue, 27 Jan 2015 19:31:57 -0800
http://git-wip-us.apache.org/repos/asf/hbase/blob/cb77a925/src/main/asciidoc/_chapters/mapreduce.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/mapreduce.adoc 
b/src/main/asciidoc/_chapters/mapreduce.adoc
new file mode 100644
index 0000000..1228f57
--- /dev/null
+++ b/src/main/asciidoc/_chapters/mapreduce.adoc
@@ -0,0 +1,586 @@
+////
+/**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+////
+
+[[mapreduce]]
+= HBase and MapReduce
+:doctype: book
+:numbered:
+:toc: left
+:icons: font
+:experimental:
+
+Apache MapReduce is a software framework used to analyze large amounts of 
data, and is the framework used most often with 
link:http://hadoop.apache.org/[Apache Hadoop].
+MapReduce itself is out of the scope of this document.
+A good place to get started with MapReduce is 
link:http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
+MapReduce version 2 (MR2)is now part of 
link:http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/[YARN]. 
+
+This chapter discusses specific configuration steps you need to take to use 
MapReduce on data within HBase.
+In addition, it discusses other interactions and issues between HBase and 
MapReduce jobs. 
+
+.mapred and mapreduce
+[NOTE]
+====
+There are two mapreduce packages in HBase as in MapReduce itself: 
_org.apache.hadoop.hbase.mapred_      and _org.apache.hadoop.hbase.mapreduce_.
+The former does old-style API and the latter the new style.
+The latter has more facility though you can usually find an equivalent in the 
older package.
+Pick the package that goes with your mapreduce deploy.
+When in doubt or starting over, pick the _org.apache.hadoop.hbase.mapreduce_.
+In the notes below, we refer to o.a.h.h.mapreduce but replace with the 
o.a.h.h.mapred if that is what you are using. 
+====  
+
+[[hbase.mapreduce.classpath]]
+== HBase, MapReduce, and the CLASSPATH
+
+By default, MapReduce jobs deployed to a MapReduce cluster do not have access 
to either the HBase configuration under `$HBASE_CONF_DIR` or the HBase classes.
+
+To give the MapReduce jobs the access they need, you could add 
_hbase-site.xml_ to the _$HADOOP_HOME/conf/_ directory and add the HBase JARs 
to the _`$HADOOP_HOME`/conf/_        directory, then copy these changes across 
your cluster.
+You could add hbase-site.xml to `$HADOOP_HOME`/conf and add HBase jars to the 
$HADOOP_HOME/lib.
+You would then need to copy these changes across your cluster or edit 
_`$HADOOP_HOME`/conf/hadoop-env.sh_ and add them to the `HADOOP_CLASSPATH` 
variable.
+However, this approach is not recommended because it will pollute your Hadoop 
install with HBase references.
+It also requires you to restart the Hadoop cluster before Hadoop can use the 
HBase data.
+
+Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration 
itself.
+The dependencies only need to be available on the local `CLASSPATH`.
+The following example runs the bundled HBase 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
        MapReduce job against a table named [systemitem]+usertable+ If you have 
not set the environment variables expected in the command (the parts prefixed 
by a `$` sign and curly braces), you can use the actual system paths instead.
+Be sure to use the correct version of the HBase JAR for your system.
+The backticks (``` symbols) cause ths shell to execute the sub-commands, 
setting the CLASSPATH as part of the command.
+This example assumes you use a BASH-compatible shell. 
+
+[source,bash]
+----
+$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` 
${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter 
usertable
+----
+
+When the command runs, internally, the HBase JAR finds the dependencies it 
needs for zookeeper, guava, and its other dependencies on the passed 
`HADOOP_CLASSPATH`        and adds the JARs to the MapReduce job configuration.
+See the source at 
TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how 
this is done. 
+
+[NOTE]
+====
+The example may not work if you are running HBase from its build directory 
rather than an installed location.
+You may see an error like the following:
+
+----
+java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
+----
+
+If this occurs, try modifying the command as follows, so that it uses the 
HBase JARs from the _target/_ directory within the build environment.
+
+[source,bash]
+----
+$ 
HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase
 classpath` ${HADOOP_HOME}/bin/hadoop jar 
${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter 
usertable
+----
+====
+
+.Notice to Mapreduce users of HBase 0.96.1 and above
+[CAUTION]
+====
+Some mapreduce jobs that use HBase fail to launch.
+The symptom is an exception similar to the following:
+
+----
+Exception in thread "main" java.lang.IllegalAccessError: class
+    com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
+    com.google.protobuf.LiteralByteString
+    at java.lang.ClassLoader.defineClass1(Native Method)
+    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
+    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
+    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
+    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
+    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
+    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
+    at java.security.AccessController.doPrivileged(Native Method)
+    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
+    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
+    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
+    at
+    org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
+    at
+    
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
+    at
+    
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
+    at
+    
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
+    at
+    
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
+    at
+    
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
+...
+----
+
+This is caused by an optimization introduced in 
link:https://issues.apache.org/jira/browse/HBASE-9867[HBASE-9867] that 
inadvertently introduced a classloader dependency. 
+
+This affects both jobs using the `-libjars` option and "fat jar," those which 
package their runtime dependencies in a nested `lib` folder.
+
+In order to satisfy the new classloader requirements, hbase-protocol.jar must 
be included in Hadoop's classpath.
+See <<hbase.mapreduce.classpath,hbase.mapreduce.classpath>> for current 
recommendations for resolving classpath errors.
+The following is included for historical purposes.
+
+This can be resolved system-wide by including a reference to the 
hbase-protocol.jar in hadoop's lib directory, via a symlink or by copying the 
jar into the new location.
+
+This can also be achieved on a per-job launch basis by including it in the 
`HADOOP_CLASSPATH` environment variable at job submission time.
+When launching jobs that package their dependencies, all three of the 
following job launching commands satisfy this requirement:
+
+[source,bash]
+----
+$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar 
MyJob.jar MyJobMainClass
+$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar 
MyJobMainClass
+$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
+----
+
+For jars that do not package their dependencies, the following command 
structure is necessary:
+
+[source,bash]
+----
+$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar 
MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...
+----
+
+See also link:https://issues.apache.org/jira/browse/HBASE-10304[HBASE-10304] 
for further discussion of this issue.
+====
+
+== MapReduce Scan Caching
+
+TableMapReduceUtil now restores the option to set scanner caching (the number 
of rows which are cached before returning the result to the client) on the Scan 
object that is passed in.
+This functionality was lost due to a bug in HBase 0.95 
(link:https://issues.apache.org/jira/browse/HBASE-11558[HBASE-11558]), which is 
fixed for HBase 0.98.5 and 0.96.3.
+The priority order for choosing the scanner caching is as follows:
+
+. Caching settings which are set on the scan object.
+. Caching settings which are specified via the configuration option 
+hbase.client.scanner.caching+, which can either be set manually in 
_hbase-site.xml_ or via the helper method 
`TableMapReduceUtil.setScannerCaching()`.
+. The default value `HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING`, which 
is set to `100`.
+
+Optimizing the caching settings is a balance between the time the client waits 
for a result and the number of sets of results the client needs to receive.
+If the caching setting is too large, the client could end up waiting for a 
long time or the request could even time out.
+If the setting is too small, the scan needs to return results in several 
pieces.
+If you think of the scan as a shovel, a bigger cache setting is analogous to a 
bigger shovel, and a smaller cache setting is equivalent to more shoveling in 
order to fill the bucket.
+
+The list of priorities mentioned above allows you to set a reasonable default, 
and override it for specific operations.
+
+See the API documentation for 
link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
 for more details.
+
+== Bundled HBase MapReduce Jobs
+
+The HBase JAR also serves as a Driver for some bundled mapreduce jobs.
+To learn about the bundled MapReduce jobs, run the following command.
+
+[source,bash]
+----
+$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar
+An example program must be given as the first argument.
+Valid program names are:
+  copytable: Export a table from local cluster to peer cluster
+  completebulkload: Complete a bulk data load.
+  export: Write table data to HDFS.
+  import: Import data written by Export.
+  importtsv: Import data in TSV format.
+  rowcounter: Count rows in HBase table
+----
+
+Each of the valid program names are bundled MapReduce jobs.
+To run one of the jobs, model your command after the following example.
+
+[source,bash]
+----
+$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar 
rowcounter myTable
+----
+
+== HBase as a MapReduce Job Data Source and Data Sink
+
+HBase can be used as a data source, 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat],
 and data sink, 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]
        or 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html[MultiTableOutputFormat],
 for MapReduce jobs.
+Writing MapReduce jobs that read or write HBase, it is advisable to subclass 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper]
        and/or 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html[TableReducer].
+See the do-nothing pass-through classes 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html[IdentityTableMapper]
        and 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html[IdentityTableReducer]
        for basic usage.
+For a more involved example, see 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
        or review the `org.apache.hadoop.hbase.mapreduce.TestTableMapReduce` 
unit test. 
+
+If you run MapReduce jobs that use HBase as source or sink, need to specify 
source and sink table and column names in your configuration.
+
+When you read from HBase, the `TableInputFormat` requests the list of regions 
from HBase and makes a map, which is either a `map-per-region` or 
`mapreduce.job.maps` map, whichever is smaller.
+If your job only has two maps, raise `mapreduce.job.maps` to a number greater 
than the number of regions.
+Maps will run on the adjacent TaskTracker if you are running a TaskTracer and 
RegionServer per node.
+When writing to HBase, it may make sense to avoid the Reduce step and write 
back into HBase from within your map.
+This approach works when your job does not need the sort and collation that 
MapReduce does on the map-emitted data.
+On insert, HBase 'sorts' so there is no point double-sorting (and shuffling 
data around your MapReduce cluster) unless you need to.
+If you do not need the Reduce, you myour map might emit counts of records 
processed for reporting at the end of the jobj, or set the number of Reduces to 
zero and use TableOutputFormat.
+If running the Reduce step makes sense in your case, you should typically use 
multiple reducers so that load is spread across the HBase cluster.
+
+A new HBase partitioner, the 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html[HRegionPartitioner],
 can run as many reducers the number of existing regions.
+The HRegionPartitioner is suitable when your table is large and your upload 
will not greatly alter the number of existing regions upon completion.
+Otherwise use the default partitioner. 
+
+== Writing HFiles Directly During Bulk Import
+
+If you are importing into a new table, you can bypass the HBase API and write 
your content directly to the filesystem, formatted into HBase data files 
(HFiles). Your import will run faster, perhaps an order of magnitude faster.
+For more on how this mechanism works, see <<arch.bulk.load,arch.bulk.load>>.
+
+== RowCounter Example
+
+The included 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
        MapReduce job uses `TableInputFormat` and does a count of all rows in 
the specified table.
+To run it, use the following command: 
+
+[source,bash]
+----
+$ ./bin/hadoop jar hbase-X.X.X.jar
+----
+
+This will invoke the HBase MapReduce Driver class.
+Select `rowcounter` from the choice of jobs offered.
+This will print rowcouner usage advice to standard output.
+Specify the tablename, column to count, and output directory.
+If you have classpath errors, see 
<<hbase.mapreduce.classpath,hbase.mapreduce.classpath>>.
+
+[[splitter]]
+== Map-Task Splitting
+
+[[splitter.default]]
+=== The Default HBase MapReduce Splitter
+
+When 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat]
          is used to source an HBase table in a MapReduce job, its splitter 
will make a map task for each region of the table.
+Thus, if there are 100 regions in the table, there will be 100 map-tasks for 
the job - regardless of how many column families are selected in the Scan.
+
+[[splitter.custom]]
+=== Custom Splitters
+
+For those interested in implementing custom splitters, see the method 
`getSplits` in 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html[TableInputFormatBase].
+That is where the logic for map-task assignment resides. 
+
+[[mapreduce.example]]
+== HBase MapReduce Examples
+
+[[mapreduce.example.read]]
+=== HBase MapReduce Read Example
+
+The following is an example of using HBase as a MapReduce source in read-only 
manner.
+Specifically, there is a Mapper instance but no Reducer, and nothing is being 
emitted from the Mapper.
+There job would be defined as follows...
+
+[source,java]
+----
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config, "ExampleRead");
+job.setJarByClass(MyReadJob.class);     // class that contains mapper
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad 
for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+...
+
+TableMapReduceUtil.initTableMapperJob(
+  tableName,        // input HBase table name
+  scan,             // Scan instance to control CF and attribute selection
+  MyMapper.class,   // mapper
+  null,             // mapper output key
+  null,             // mapper output value
+  job);
+job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't 
emitting anything from mapper
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+  throw new IOException("error with job!");
+}
+----
+
+...and the mapper instance would extend 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper]...
+
+[source,java]
+----
+public static class MyMapper extends TableMapper<Text, Text> {
+
+  public void map(ImmutableBytesWritable row, Result value, Context context) 
throws InterruptedException, IOException {
+    // process data for the row from the Result instance.
+   }
+}
+----
+
+[[mapreduce.example.readwrite]]
+=== HBase MapReduce Read/Write Example
+
+The following is an example of using HBase both as a source and as a sink with 
MapReduce.
+This example will simply copy data from one table to another.
+
+[source,java]
+----
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleReadWrite");
+job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad 
for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+       sourceTable,      // input table
+       scan,             // Scan instance to control CF and attribute selection
+       MyMapper.class,   // mapper class
+       null,             // mapper output key
+       null,             // mapper output value
+       job);
+TableMapReduceUtil.initTableReducerJob(
+       targetTable,      // output table
+       null,             // reducer class
+       job);
+job.setNumReduceTasks(0);
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+    throw new IOException("error with job!");
+}
+----
+
+An explanation is required of what `TableMapReduceUtil` is doing, especially 
with the reducer. 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]
          is being used as the outputFormat class, and several parameters are 
being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as 
setting the reducer output key to `ImmutableBytesWritable` and reducer value to 
`Writable`.
+These could be set by the programmer on the job and conf, but 
`TableMapReduceUtil` tries to make things easier.
+
+The following is the example mapper, which will create a `Put`          and 
matching the input `Result` and emit it.
+Note: this is what the CopyTable utility does. 
+
+[source,java]
+----
+public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put>  
{
+
+       public void map(ImmutableBytesWritable row, Result value, Context 
context) throws IOException, InterruptedException {
+               // this example is just copying the data from the source 
table...
+               context.write(row, resultToPut(row,value));
+       }
+
+       private static Put resultToPut(ImmutableBytesWritable key, Result 
result) throws IOException {
+               Put put = new Put(key.get());
+               for (KeyValue kv : result.raw()) {
+                       put.add(kv);
+               }
+               return put;
+       }
+}
+----
+
+There isn't actually a reducer step, so `TableOutputFormat` takes care of 
sending the `Put` to the target table. 
+
+This is just an example, developers could choose not to use 
`TableOutputFormat` and connect to the target table themselves. 
+
+[[mapreduce.example.readwrite.multi]]
+=== HBase MapReduce Read/Write Example With Multi-Table Output
+
+TODO: example for `MultiTableOutputFormat`. 
+
+[[mapreduce.example.summary]]
+=== HBase MapReduce Summary to HBase Example
+
+The following example uses HBase as a MapReduce source and sink with a 
summarization step.
+This example will count the number of distinct instances of a value in a table 
and write those summarized counts in another table. 
+
+[source,java]
+----
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleSummary");
+job.setJarByClass(MySummaryJob.class);     // class that contains mapper and 
reducer
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad 
for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+       sourceTable,        // input table
+       scan,               // Scan instance to control CF and attribute 
selection
+       MyMapper.class,     // mapper class
+       Text.class,         // mapper output key
+       IntWritable.class,  // mapper output value
+       job);
+TableMapReduceUtil.initTableReducerJob(
+       targetTable,        // output table
+       MyTableReducer.class,    // reducer class
+       job);
+job.setNumReduceTasks(1);   // at least one, adjust as required
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+       throw new IOException("error with job!");
+}
+----          
+
+In this example mapper a column with a String-value is chosen as the value to 
summarize upon.
+This value is used as the key to emit from the mapper, and an `IntWritable` 
represents an instance counter. 
+
+[source,java]
+----
+public static class MyMapper extends TableMapper<Text, IntWritable>  {
+       public static final byte[] CF = "cf".getBytes();
+       public static final byte[] ATTR1 = "attr1".getBytes();
+
+       private final IntWritable ONE = new IntWritable(1);
+       private Text text = new Text();
+
+       public void map(ImmutableBytesWritable row, Result value, Context 
context) throws IOException, InterruptedException {
+               String val = new String(value.getValue(CF, ATTR1));
+               text.set(val);     // we can only emit Writables...
+
+               context.write(text, ONE);
+       }
+}
+----          
+
+In the reducer, the "ones" are counted (just like any other MR example that 
does this), and then emits a `Put`. 
+
+[source,java]
+----
+public static class MyTableReducer extends TableReducer<Text, IntWritable, 
ImmutableBytesWritable>  {
+       public static final byte[] CF = "cf".getBytes();
+       public static final byte[] COUNT = "count".getBytes();
+
+       public void reduce(Text key, Iterable<IntWritable> values, Context 
context) throws IOException, InterruptedException {
+               int i = 0;
+               for (IntWritable val : values) {
+                       i += val.get();
+               }
+               Put put = new Put(Bytes.toBytes(key.toString()));
+               put.add(CF, COUNT, Bytes.toBytes(i));
+
+               context.write(null, put);
+       }
+}
+----        
+
+[[mapreduce.example.summary.file]]
+=== HBase MapReduce Summary to File Example
+
+This very similar to the summary example above, with exception that this is 
using HBase as a MapReduce source but HDFS as the sink.
+The differences are in the job setup and in the reducer.
+The mapper remains the same. 
+
+[source,java]
+----
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleSummaryToFile");
+job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper 
and reducer
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad 
for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+       sourceTable,        // input table
+       scan,               // Scan instance to control CF and attribute 
selection
+       MyMapper.class,     // mapper class
+       Text.class,         // mapper output key
+       IntWritable.class,  // mapper output value
+       job);
+job.setReducerClass(MyReducer.class);    // reducer class
+job.setNumReduceTasks(1);    // at least one, adjust as required
+FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // 
adjust directories as required
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+       throw new IOException("error with job!");
+}
+----
+
+As stated above, the previous Mapper can run unchanged with this example.
+As for the Reducer, it is a "generic" Reducer instead of extending TableMapper 
and emitting Puts.
+
+[source,java]
+----
+public static class MyReducer extends Reducer<Text, IntWritable, Text, 
IntWritable>  {
+
+       public void reduce(Text key, Iterable<IntWritable> values, Context 
context) throws IOException, InterruptedException {
+               int i = 0;
+               for (IntWritable val : values) {
+                       i += val.get();
+               }
+               context.write(key, new IntWritable(i));
+       }
+}
+----
+
+[[mapreduce.example.summary.noreducer]]
+=== HBase MapReduce Summary to HBase Without Reducer
+
+It is also possible to perform summaries without a reducer - if you use HBase 
as the reducer. 
+
+An HBase target table would need to exist for the job summary.
+The Table method `incrementColumnValue` would be used to atomically increment 
values.
+From a performance perspective, it might make sense to keep a Map of values 
with their values to be incremeneted for each map-task, and make one update per 
key at during the `cleanup` method of the mapper.
+However, your milage may vary depending on the number of rows to be processed 
and unique keys. 
+
+In the end, the summary results are in HBase. 
+
+[[mapreduce.example.summary.rdbms]]
+=== HBase MapReduce Summary to RDBMS
+
+Sometimes it is more appropriate to generate summaries to an RDBMS.
+For these cases, it is possible to generate summaries directly to an RDBMS via 
a custom reducer.
+The `setup` method can connect to an RDBMS (the connection information can be 
passed via custom parameters in the context) and the cleanup method can close 
the connection. 
+
+It is critical to understand that number of reducers for the job affects the 
summarization implementation, and you'll have to design this into your reducer.
+Specifically, whether it is designed to run as a singleton (one reducer) or 
multiple reducers.
+Neither is right or wrong, it depends on your use-case.
+Recognize that the more reducers that are assigned to the job, the more 
simultaneous connections to the RDBMS will be created - this will scale, but 
only to a point. 
+
+[source,java]
+----
+ public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, 
IntWritable>  {
+
+       private Connection c = null;
+
+       public void setup(Context context) {
+               // create DB connection...
+       }
+
+       public void reduce(Text key, Iterable<IntWritable> values, Context 
context) throws IOException, InterruptedException {
+               // do summarization
+               // in this example the keys are Text, but this is just an 
example
+       }
+
+       public void cleanup(Context context) {
+               // close db connection
+       }
+
+}
+----
+
+In the end, the summary results are written to your RDBMS table/s. 
+
+[[mapreduce.htable.access]]
+== Accessing Other HBase Tables in a MapReduce Job
+
+Although the framework currently allows one HBase table as input to a 
MapReduce job, other HBase tables can be accessed as lookup tables, etc., in a 
MapReduce job via creating an Table instance in the setup method of the Mapper. 
+[source,java]
+----
+public class MyMapper extends TableMapper<Text, LongWritable> {
+  private Table myOtherTable;
+
+  public void setup(Context context) {
+    // In here create a Connection to the cluster and save it or use the 
Connection
+    // from the existing table
+    myOtherTable = connection.getTable("myOtherTable");
+  }
+
+  public void map(ImmutableBytesWritable row, Result value, Context context) 
throws IOException, InterruptedException {
+       // process Result...
+       // use 'myOtherTable' for lookups
+  }
+----      
+
+[[mapreduce.specex]]
+== Speculative Execution
+
+It is generally advisable to turn off speculative execution for MapReduce jobs 
that use HBase as a source.
+This can either be done on a per-Job basis through properties, on on the 
entire cluster.
+Especially for longer running jobs, speculative execution will create 
duplicate map-tasks which will double-write your data to HBase; this is 
probably not what you want. 
+
+See <<spec.ex,spec.ex>> for more information.
[25/34] hbase git commit: HBASE-12918 Backport asciidoc changes (apurtell and enis)

Reply via email to