[11/12] hbase git commit: Pull in documentation updates from trunk made since last 0.98 release

apurtell Mon, 02 Mar 2015 17:30:36 -0800

http://git-wip-us.apache.org/repos/asf/hbase/blob/7139c90e/src/main/asciidoc/_chapters/architecture.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/architecture.adoc 
b/src/main/asciidoc/_chapters/architecture.adoc
index 9e0b0c2..6de7208 100644
--- a/src/main/asciidoc/_chapters/architecture.adoc
+++ b/src/main/asciidoc/_chapters/architecture.adoc
@@ -35,25 +35,25 @@
 === NoSQL?
 
 HBase is a type of "NoSQL" database.
-"NoSQL" is a general term meaning that the database isn't an RDBMS which 
supports SQL as its primary access language, but there are many types of NoSQL 
databases:  BerkeleyDB is an example of a local NoSQL database, whereas HBase 
is very much a distributed database.
-Technically speaking, HBase is really more a "Data Store" than "Data Base" 
because it lacks many of the features you find in an RDBMS, such as typed 
columns, secondary indexes, triggers, and advanced query languages, etc. 
+"NoSQL" is a general term meaning that the database isn't an RDBMS which 
supports SQL as its primary access language, but there are many types of NoSQL 
databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is 
very much a distributed database.
+Technically speaking, HBase is really more a "Data Store" than "Data Base" 
because it lacks many of the features you find in an RDBMS, such as typed 
columns, secondary indexes, triggers, and advanced query languages, etc.
 
 However, HBase has many features which support both linear and modular 
scaling.
 HBase clusters expand by adding RegionServers that are hosted on commodity 
class servers.
 If a cluster expands from 10 to 20 RegionServers, for example, it doubles in 
both storage and processing capacity.
 RDBMS can scale well, but only up to a point - specifically, the size of a 
single database server - and for the best performance requires specialized 
hardware and storage devices.
-HBase features of note are: 
+HBase features of note are:
 
 * Strongly consistent reads/writes:  HBase is not an "eventually consistent" 
DataStore.
   This makes it very suitable for tasks such as high-speed counter aggregation.
-* Automatic sharding:  HBase tables are distributed on the cluster via 
regions, and regions are automatically split and re-distributed as your data 
grows.
+* Automatic sharding: HBase tables are distributed on the cluster via regions, 
and regions are automatically split and re-distributed as your data grows.
 * Automatic RegionServer failover
-* Hadoop/HDFS Integration:  HBase supports HDFS out of the box as its 
distributed file system.
-* MapReduce:  HBase supports massively parallelized processing via MapReduce 
for using HBase as both source and sink.
-* Java Client API:  HBase supports an easy to use Java API for programmatic 
access.
-* Thrift/REST API:  HBase also supports Thrift and REST for non-Java 
front-ends.
-* Block Cache and Bloom Filters:  HBase supports a Block Cache and Bloom 
Filters for high volume query optimization.
-* Operational Management:  HBase provides build-in web-pages for operational 
insight as well as JMX metrics.     
+* Hadoop/HDFS Integration: HBase supports HDFS out of the box as its 
distributed file system.
+* MapReduce: HBase supports massively parallelized processing via MapReduce 
for using HBase as both source and sink.
+* Java Client API: HBase supports an easy to use Java API for programmatic 
access.
+* Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
+* Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom 
Filters for high volume query optimization.
+* Operational Management: HBase provides built-in web pages for operational 
insight as well as JMX metrics.
 
 [[arch.overview.when]]
 === When Should I Use HBase?
@@ -62,15 +62,15 @@ HBase isn't suitable for every problem.
 
 First, make sure you have enough data.
 If you have hundreds of millions or billions of rows, then HBase is a good 
candidate.
-If you only have a few thousand/million rows, then using a traditional RDBMS 
might be a better choice due to the fact that all of your data might wind up on 
a single node (or two) and the rest of the cluster may be sitting idle. 
+If you only have a few thousand/million rows, then using a traditional RDBMS 
might be a better choice due to the fact that all of your data might wind up on 
a single node (or two) and the rest of the cluster may be sitting idle.
 
 Second, make sure you can live without all the extra features that an RDBMS 
provides (e.g., typed columns, secondary indexes, transactions, advanced query 
languages, etc.). An application built against an RDBMS cannot be "ported" to 
HBase by simply changing a JDBC driver, for example.
-Consider moving from an RDBMS to HBase as a complete redesign as opposed to a 
port. 
+Consider moving from an RDBMS to HBase as a complete redesign as opposed to a 
port.
 
 Third, make sure you have enough hardware.
-Even HDFS doesn't do well with anything less than 5 DataNodes (due to things 
such as HDFS block replication which has a default of 3), plus a NameNode. 
+Even HDFS doesn't do well with anything less than 5 DataNodes (due to things 
such as HDFS block replication which has a default of 3), plus a NameNode.
 
-HBase can run quite well stand-alone on a laptop - but this should be 
considered a development configuration only. 
+HBase can run quite well stand-alone on a laptop - but this should be 
considered a development configuration only.
 
 [[arch.overview.hbasehdfs]]
 === What Is The Difference Between HBase and Hadoop/HDFS?
@@ -80,12 +80,12 @@ Its documentation states that it is not, however, a general 
purpose file system,
 HBase, on the other hand, is built on top of HDFS and provides fast record 
lookups (and updates) for large tables.
 This can sometimes be a point of conceptual confusion.
 HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for 
high-speed lookups.
-See the <<datamodel,datamodel>> and the rest of this chapter for more 
information on how HBase achieves its goals. 
+See the <<datamodel>> and the rest of this chapter for more information on how 
HBase achieves its goals.
 
 [[arch.catalog]]
 == Catalog Tables
 
-The catalog table `hbase:meta` exists as an HBase table and is filtered out of 
the HBase shell's `list` command, but is in fact a table just like any other. 
+The catalog table `hbase:meta` exists as an HBase table and is filtered out of 
the HBase shell's `list` command, but is in fact a table just like any other.
 
 [[arch.catalog.root]]
 === -ROOT-
@@ -94,87 +94,94 @@ NOTE: The `-ROOT-` table was removed in HBase 0.96.0.
 Information here should be considered historical.
 
 The `-ROOT-` table kept track of the location of the `.META` table (the 
previous name for the table now called `hbase:meta`) prior to HBase 0.96.
-The `-ROOT-` table structure was as follows: 
+The `-ROOT-` table structure was as follows:
 
-* .Key.META.
+.Key
+
+* .META.
   region key (`.META.,,1`)
 
-* .Values`info:regioninfo` (serialized 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[HRegionInfo]
              instance of hbase:meta)
-* `info:server` (server:port of the RegionServer holding hbase:meta)
-* `info:serverstartcode` (start-time of the RegionServer process holding 
hbase:meta)
+.Values
+
+* `info:regioninfo` (serialized 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[HRegionInfo]
 instance of `hbase:meta`)
+* `info:server` (server:port of the RegionServer holding `hbase:meta`)
+* `info:serverstartcode` (start-time of the RegionServer process holding 
`hbase:meta`)
 
 [[arch.catalog.meta]]
 === hbase:meta
 
 The `hbase:meta` table (previously called `.META.`) keeps a list of all 
regions in the system.
-The location of `hbase:meta` was previously tracked within the `-ROOT-` table, 
but is now stored in Zookeeper.
+The location of `hbase:meta` was previously tracked within the `-ROOT-` table, 
but is now stored in ZooKeeper.
+
+The `hbase:meta` table structure is as follows:
 
-The `hbase:meta` table structure is as follows: 
+.Key
 
-* .KeyRegion key of the format (`[table],[region start key],[region id]`)
+* Region key of the format (`[table],[region start key],[region id]`)
 
-* .Values`info:regioninfo` (serialized 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[
-  HRegionInfo] instance for this region)
+.Values
+
+* `info:regioninfo` (serialized 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[HRegionInfo]
 instance for this region)
 * `info:server` (server:port of the RegionServer containing this region)
 * `info:serverstartcode` (start-time of the RegionServer process containing 
this region)
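
The key format above can be sketched with plain strings. This is a minimal illustration only: `MetaRowKeySketch`, `build`, and `parse` are hypothetical names, not HBase classes, and the sketch assumes that table names never contain a comma while start keys may. Real region keys are byte arrays and may carry extra components.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the HBase API. */
public class MetaRowKeySketch {

  /** Joins the parts in the documented order: [table],[region start key],[region id]. */
  static String build(String table, String startKey, long regionId) {
    return table + "," + startKey + "," + regionId;
  }

  /**
   * Splits a key back into its three parts. The start key may itself contain
   * commas, so the table is everything before the first comma and the region
   * id is everything after the last one (assumes no comma in the table name).
   */
  static String[] parse(String rowKey) {
    int first = rowKey.indexOf(',');
    int last = rowKey.lastIndexOf(',');
    if (first < 0 || first == last) {
      throw new IllegalArgumentException("not a region key: " + rowKey);
    }
    return new String[] {
      rowKey.substring(0, first),
      rowKey.substring(first + 1, last),
      rowKey.substring(last + 1)
    };
  }

  public static void main(String[] args) {
    // An empty start key marks the first region of the table.
    System.out.println(build("myTable", "", 1L));
    // A start key containing a comma still round-trips.
    String later = build("myTable", "row,500", 1425346236000L);
    System.out.println(Arrays.toString(parse(later)));
  }
}
```

Because keys sort by table and then by start key, all regions of one table sit next to each other in `hbase:meta`, ordered by the row range they cover.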
 
 When a table is in the process of splitting, two other columns will be 
created, called `info:splitA` and `info:splitB`.
 These columns represent the two daughter regions.
 The values for these columns are also serialized HRegionInfo instances.
-After the region has been split, eventually this row will be deleted. 
+After the region has been split, eventually this row will be deleted.
 
 .Note on HRegionInfo
 [NOTE]
 ====
 The empty key is used to denote table start and table end.
 A region with an empty start key is the first region in a table.
-If a region has both an empty start and an empty end key, it is the only 
region in the table 
+If a region has both an empty start and an empty end key, it is the only 
region in the table.
 ====
 
-In the (hopefully unlikely) event that programmatic processing of catalog 
metadata is required, see the 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29[Writables]
          utility. 
+In the (hopefully unlikely) event that programmatic processing of catalog 
metadata is required, see the 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29[Writables]
 utility.
 
 [[arch.catalog.startup]]
 === Startup Sequencing
 
-First, the location of `hbase:meta` is looked up in Zookeeper.
+First, the location of `hbase:meta` is looked up in ZooKeeper.
 Next, `hbase:meta` is updated with server and startcode values.
 
-For information on region-RegionServer assignment, see 
<<regions.arch.assignment,regions.arch.assignment>>. 
+For information on region-RegionServer assignment, see 
<<regions.arch.assignment>>.
 
 [[architecture.client]]
 == Client
 
 The HBase client finds the RegionServers that are serving the particular row 
range of interest.
 It does this by querying the `hbase:meta` table.
-See <<arch.catalog.meta,arch.catalog.meta>> for details.
+See <<arch.catalog.meta>> for details.
 After locating the required region(s), the client contacts the RegionServer 
serving that region, rather than going through the master, and issues the read 
or write request.
 This information is cached in the client so that subsequent requests need not 
go through the lookup process.
-Should a region be reassigned either by the master load balancer or because a 
RegionServer has died, the client will requery the catalog tables to determine 
the new location of the user region. 
+Should a region be reassigned either by the master load balancer or because a 
RegionServer has died, the client will requery the catalog tables to determine 
the new location of the user region.
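
The lookup-then-cache behavior described above can be sketched as a sorted map keyed by region start key. This is an assumption-laden illustration, not the client's actual implementation: `RegionLocationCacheSketch`, `Location`, and the `metaLookup` function are hypothetical, the server names are made up, and keys are compared as Java strings rather than as bytes.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical client-side cache; the real HBase client's internals differ. */
public class RegionLocationCacheSketch {

  /** One cached region covering [startKey, endKey). An empty endKey means "to end of table". */
  static final class Location {
    final String startKey, endKey, server;
    Location(String startKey, String endKey, String server) {
      this.startKey = startKey; this.endKey = endKey; this.server = server;
    }
    boolean contains(String row) {
      return row.compareTo(startKey) >= 0 && (endKey.isEmpty() || row.compareTo(endKey) < 0);
    }
  }

  private final TreeMap<String, Location> byStartKey = new TreeMap<>();
  private final Function<String, Location> metaLookup; // stands in for a query against hbase:meta
  int metaLookups = 0;

  RegionLocationCacheSketch(Function<String, Location> metaLookup) {
    this.metaLookup = metaLookup;
  }

  /** Returns the server for a row, consulting the catalog only on a cache miss. */
  String locate(String row) {
    Map.Entry<String, Location> e = byStartKey.floorEntry(row);
    if (e != null && e.getValue().contains(row)) {
      return e.getValue().server;
    }
    metaLookups++;
    Location fresh = metaLookup.apply(row);
    byStartKey.put(fresh.startKey, fresh);
    return fresh.server;
  }

  /** Drops the cached entry covering a row, e.g. after a request fails because the region moved. */
  void invalidate(String row) {
    Map.Entry<String, Location> e = byStartKey.floorEntry(row);
    if (e != null && e.getValue().contains(row)) {
      byStartKey.remove(e.getKey());
    }
  }

  public static void main(String[] args) {
    RegionLocationCacheSketch cache = new RegionLocationCacheSketch(
        row -> row.compareTo("m") < 0
            ? new Location("", "m", "rs1.example.com:16020")
            : new Location("m", "", "rs2.example.com:16020"));
    System.out.println(cache.locate("apple"));   // miss: goes to the lookup function
    System.out.println(cache.locate("banana"));  // hit: same region, no lookup
    cache.invalidate("apple");                   // pretend the region was reassigned
    System.out.println(cache.locate("apple"));   // miss again after invalidation
    System.out.println("catalog lookups: " + cache.metaLookups);
  }
}
```

The point of the sketch is that a steady-state read or write touches only the cache and one RegionServer. The catalog is consulted on the first access to a region and again after an invalidation.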
 
-See <<master.runtime,master.runtime>> for more information about the impact of 
the Master on HBase Client communication. 
+See <<master.runtime>> for more information about the impact of the Master on 
HBase Client communication.
 
-Administrative functions are done via an instance of 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin]
      
+Administrative functions are done via an instance of 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin]
 
 [[client.connections]]
 === Cluster Connections
 
-The API changed in HBase 1.0.
+The API changed in HBase 1.0. For connection configuration information, see 
<<client_dependencies>>.
+
+==== API as of HBase 1.0.0
+
 It's been cleaned up and users are returned Interfaces to work against rather 
than particular types.
-In HBase 1.0, obtain a cluster Connection from ConnectionFactory and 
thereafter, get from it instances of Table, Admin, and RegionLocator on an 
as-need basis.
-When done, close obtained instances.
-Finally, be sure to cleanup your Connection instance before exiting.
-Connections are heavyweight objects.
-Create once and keep an instance around.
-Table, Admin and RegionLocator instances are lightweight.
+In HBase 1.0, obtain a `Connection` object from `ConnectionFactory` and 
thereafter, get from it instances of `Table`, `Admin`, and `RegionLocator` on 
an as-need basis.
+When done, close the obtained instances.
+Finally, be sure to clean up your `Connection` instance before exiting.
+`Connections` are heavyweight objects but thread-safe so you can create one 
for your application and keep the instance around.
+`Table`, `Admin` and `RegionLocator` instances are lightweight.
 Create as you go and then let go as soon as you are done by closing them.
-See the 
link:/Users/stack/checkouts/hbase.git/target/site/apidocs/org/apache/hadoop/hbase/client/package-summary.html[Client
 Package Javadoc Description] for example usage of the new HBase 1.0 API.
+See the 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html[Client
 Package Javadoc Description] for example usage of the new HBase 1.0 API.
 
-For connection configuration information, see <<client_dependencies,client 
dependencies>>. 
+==== API before HBase 1.0.0
 
-_link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table]
-            instances are not thread-safe_.
-Only one thread can use an instance of Table at any given time.
-When creating Table instances, it is advisable to use the same 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration]
          instance.
+Instances of `HTable` are the way to interact with an HBase cluster earlier 
than 1.0.0. 
_link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table]
 instances are not thread-safe_. Only one thread can use an instance of Table 
at any given time.
+When creating Table instances, it is advisable to use the same 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration]
 instance.
 This will ensure sharing of ZooKeeper and socket instances to the 
RegionServers which is usually what you want.
 For example, this is preferred:
 
@@ -195,24 +202,24 @@ HBaseConfiguration conf2 = HBaseConfiguration.create();
 HTable table2 = new HTable(conf2, "myTable");
 ----
 
-For more information about how connections are handled in the HBase client, 
see 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html[HConnectionManager].
 
+For more information about how connections are handled in the HBase client, 
see 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ConnectionFactory.html[ConnectionFactory].
 
 [[client.connection.pooling]]
-==== Connection Pooling
+===== Connection Pooling
 
-For applications which require high-end multithreaded access (e.g., 
web-servers or application servers that may serve many application threads in a 
single JVM), you can pre-create an `HConnection`, as shown in the following 
example:
+For applications which require high-end multithreaded access (e.g., 
web-servers or application servers that may serve many application threads in a 
single JVM), you can pre-create a `Connection`, as shown in the following 
example:
 
-.Pre-Creating a `HConnection`
+.Pre-Creating a `Connection`
 ====
 [source,java]
 ----
 // Create a connection to the cluster.
-HConnection connection = HConnectionManager.createConnection(Configuration);
-HTableInterface table = connection.getTable("myTable");
-// use table as needed, the table returned is lightweight
-table.close();
-// use the connection for other access to the cluster
-connection.close();
+Configuration conf = HBaseConfiguration.create();
+try (Connection connection = ConnectionFactory.createConnection(conf)) {
+  try (Table table = connection.getTable(TableName.valueOf(tablename))) {
+    // use table as needed, the table returned is lightweight
+  }
+}
 ----
 ====
 
@@ -221,34 +228,32 @@ Constructing HTableInterface implementation is very 
lightweight and resources ar
 .`HTablePool` is Deprecated
 [WARNING]
 ====
-Previous versions of this guide discussed `HTablePool`, which was deprecated 
in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by 
link:https://issues.apache.org/jira/browse/HBASE-6580[HBASE-6500].
-Please use 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html[HConnection]
 instead.
+Previous versions of this guide discussed `HTablePool`, which was deprecated 
in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by 
link:https://issues.apache.org/jira/browse/HBASE-6580[HBASE-6580], or 
`HConnection`, which is deprecated in HBase 1.0 by `Connection`.
+Please use 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Connection.html[Connection]
 instead.
 ====
 
 [[client.writebuffer]]
 === WriteBuffer and Batch Methods
 
-If <<perf.hbase.client.autoflush,perf.hbase.client.autoflush>> is turned off 
on 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable],
 `Put`s are sent to RegionServers when the writebuffer is filled.
-The writebuffer is 2MB by default.
-Before an HTable instance is discarded, either [method]+close()+ or 
[method]+flushCommits()+ should be invoked so Puts will not be lost. 
+In HBase 1.0 and later, 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable]
 is deprecated in favor of 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table].
 `Table` does not use autoflush. To do buffered writes, use the BufferedMutator 
class.
 
-Note: `htable.delete(Delete);` does not go in the writebuffer!  This only 
applies to Puts. 
+Before a `Table` or `HTable` instance is discarded, invoke either `close()` or 
`flushCommits()`, so `Put`s will not be lost.
 
-For additional information on write durability, review the 
link:../acid-semantics.html[ACID semantics] page. 
+For additional information on write durability, review the 
link:../acid-semantics.html[ACID semantics] page.
 
-For fine-grained control of batching of `Put`s or `Delete`s, see the 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29[batch]
 methods on HTable. 
+For fine-grained control of batching of ``Put``s or ``Delete``s, see the 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch%28java.util.List%29[batch]
 methods on Table.
 
 [[client.external]]
 === External Clients
 
-Information on non-Java clients and custom protocols is covered in 
<<external_apis,external apis>>           
+Information on non-Java clients and custom protocols is covered in 
<<external_apis>>
 
 [[client.filter]]
 == Client Request Filters
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get]
 and 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
 instances can be optionally configured with 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html[filters]
 which are applied on the RegionServer. 
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get]
 and 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
 instances can be optionally configured with 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html[filters]
 which are applied on the RegionServer.
 
-Filters can be confusing because there are many different types, and it is 
best to approach them by understanding the groups of Filter functionality. 
+Filters can be confusing because there are many different types, and it is 
best to approach them by understanding the groups of Filter functionality.
 
 [[client.filter.structural]]
 === Structural
@@ -258,25 +263,25 @@ Structural Filters contain other Filters.
 [[client.filter.structural.fl]]
 ==== FilterList
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html[FilterList]
          represents a list of Filters with a relationship of 
`FilterList.Operator.MUST_PASS_ALL` or `FilterList.Operator.MUST_PASS_ONE` 
between the Filters.
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html[FilterList]
 represents a list of Filters with a relationship of 
`FilterList.Operator.MUST_PASS_ALL` or `FilterList.Operator.MUST_PASS_ONE` 
between the Filters.
 The following example shows an 'or' between two Filters (checking for either 
'my value' or 'my other value' on the same attribute).
 
 [source,java]
 ----
 FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
 SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
-       cf,
-       column,
-       CompareOp.EQUAL,
-       Bytes.toBytes("my value")
-       );
+  cf,
+  column,
+  CompareOp.EQUAL,
+  Bytes.toBytes("my value")
+  );
 list.add(filter1);
 SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
-       cf,
-       column,
-       CompareOp.EQUAL,
-       Bytes.toBytes("my other value")
-       );
+  cf,
+  column,
+  CompareOp.EQUAL,
+  Bytes.toBytes("my other value")
+  );
 list.add(filter2);
 scan.setFilter(list);
 ----
@@ -287,16 +292,16 @@ scan.setFilter(list);
 [[client.filter.cv.scvf]]
 ==== SingleColumnValueFilter
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html[SingleColumnValueFilter]
            can be used to test column values for equivalence 
(`CompareOp.EQUAL`), inequality (`CompareOp.NOT_EQUAL`), or ranges (e.g., 
`CompareOp.GREATER`). The following is example of testing equivalence a column 
to a String value "my value"...
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html[SingleColumnValueFilter]
 can be used to test column values for equivalence (`CompareOp.EQUAL`), 
inequality (`CompareOp.NOT_EQUAL`), or ranges (e.g., `CompareOp.GREATER`). The 
following is an example of testing a column for equivalence to the String value 
"my value"...
 
 [source,java]
 ----
 SingleColumnValueFilter filter = new SingleColumnValueFilter(
-       cf,
-       column,
-       CompareOp.EQUAL,
-       Bytes.toBytes("my value")
-       );
+  cf,
+  column,
+  CompareOp.EQUAL,
+  Bytes.toBytes("my value")
+  );
 scan.setFilter(filter);
 ----
 
@@ -304,44 +309,43 @@ scan.setFilter(filter);
 === Column Value Comparators
 
 There are several Comparator classes in the Filter package that deserve 
special mention.
-These Comparators are used in concert with other Filters, such as 
<<client.filter.cv.scvf,client.filter.cv.scvf>>. 
+These Comparators are used in concert with other Filters, such as 
<<client.filter.cv.scvf>>.
 
 [[client.filter.cvp.rcs]]
 ==== RegexStringComparator
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html[RegexStringComparator]
            supports regular expressions for value comparisons.
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html[RegexStringComparator]
 supports regular expressions for value comparisons.
 
 [source,java]
 ----
 RegexStringComparator comp = new RegexStringComparator("my.");   // any value 
containing 'my' followed by one more character
 SingleColumnValueFilter filter = new SingleColumnValueFilter(
-       cf,
-       column,
-       CompareOp.EQUAL,
-       comp
-       );
+  cf,
+  column,
+  CompareOp.EQUAL,
+  comp
+  );
 scan.setFilter(filter);
 ----
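
How a pattern such as `my.` behaves depends on whether it is applied as an unanchored search or as a whole-value match. The sketch below uses only `java.util.regex` to show the difference. It assumes, without asserting it here, that the comparator performs an unanchored search; check the RegexStringComparator Javadoc for your version. `RegexMatchSketch` and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

/** Illustrates anchored versus unanchored matching with the JDK regex engine. */
public class RegexMatchSketch {

  /** Unanchored search: true if the pattern occurs anywhere in the value. */
  static boolean found(String regex, String value) {
    return Pattern.compile(regex).matcher(value).find();
  }

  /** Whole-value match: true only if the entire value matches the pattern. */
  static boolean wholeMatch(String regex, String value) {
    return Pattern.compile(regex).matcher(value).matches();
  }

  public static void main(String[] args) {
    String[] values = {"my value", "not my value", "my"};
    for (String v : values) {
      System.out.printf("%-16s find(my.)=%b find(^my)=%b matches(my.*)=%b%n",
          "'" + v + "'", found("my.", v), found("^my", v), wholeMatch("my.*", v));
    }
  }
}
```

Under an unanchored search, `my.` also accepts "not my value" and rejects the bare value "my". A pattern such as `^my` expresses "starts with 'my'" under either style of search.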
 
-See the Oracle JavaDoc for 
link:http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html[supported
-              RegEx patterns in Java]. 
+See the Oracle JavaDoc for 
link:http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html[supported
 RegEx patterns in Java].
 
 [[client.filter.cvp.substringcomparator]]
 ==== SubstringComparator
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html[SubstringComparator]
            can be used to determine if a given substring exists in a value.
-The comparison is case-insensitive. 
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html[SubstringComparator]
 can be used to determine if a given substring exists in a value.
+The comparison is case-insensitive.
 
 [source,java]
 ----
 
 SubstringComparator comp = new SubstringComparator("y val");   // looking for 
'my value'
 SingleColumnValueFilter filter = new SingleColumnValueFilter(
-       cf,
-       column,
-       CompareOp.EQUAL,
-       comp
-       );
+  cf,
+  column,
+  CompareOp.EQUAL,
+  comp
+  );
 scan.setFilter(filter);
 ----
 
@@ -358,29 +362,29 @@ See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryCo
 [[client.filter.kvm]]
 === KeyValue Metadata
 
-As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters 
evaluate the existence of keys (i.e., ColumnFamily:Column qualifiers) for a 
row, as opposed to values the previous section. 
+As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters 
evaluate the existence of keys (i.e., ColumnFamily:Column qualifiers) for a 
row, as opposed to the values covered in the previous section.
 
 [[client.filter.kvm.ff]]
 ==== FamilyFilter
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html[FamilyFilter]
            can be used to filter on the ColumnFamily.
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html[FamilyFilter]
 can be used to filter on the ColumnFamily.
 It is generally a better idea to select ColumnFamilies in the Scan than to do 
it with a Filter.
 
 [[client.filter.kvm.qf]]
 ==== QualifierFilter
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html[QualifierFilter]
            can be used to filter based on Column (aka Qualifier) name. 
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html[QualifierFilter]
 can be used to filter based on Column (aka Qualifier) name.
 
 [[client.filter.kvm.cpf]]
 ==== ColumnPrefixFilter
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html[ColumnPrefixFilter]
            can be used to filter based on the lead portion of Column (aka 
Qualifier) names. 
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html[ColumnPrefixFilter]
 can be used to filter based on the lead portion of Column (aka Qualifier) 
names.
 
 A ColumnPrefixFilter seeks ahead to the first column matching the prefix in 
each row and for each involved column family.
-It can be used to efficiently get a subset of the columns in very wide rows. 
+It can be used to efficiently get a subset of the columns in very wide rows.
 
 Note: The same column qualifier can be used in different column families.
-This filter returns all matching columns. 
+This filter returns all matching columns.
 
 Example: Find all columns in a row and family that start with "abc"
 
@@ -407,10 +411,10 @@ rs.close();
 [[client.filter.kvm.mcpf]]
 ==== MultipleColumnPrefixFilter
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html[MultipleColumnPrefixFilter]
            behaves like ColumnPrefixFilter but allows specifying multiple 
prefixes. 
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html[MultipleColumnPrefixFilter]
 behaves like ColumnPrefixFilter but allows specifying multiple prefixes.
 
 Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to 
the first column matching the lowest prefix and also seeks past ranges of 
columns between prefixes.
-It can be used to efficiently get discontinuous sets of columns from very wide 
rows. 
+It can be used to efficiently get discontinuous sets of columns from very wide 
rows.
 
 Example: Find all columns in a row and family that start with "abc" or "xyz"
 
@@ -437,15 +441,15 @@ rs.close();
 [[client.filter.kvm.crf]]
 ==== ColumnRangeFilter
 
-A 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html[ColumnRangeFilter]
 allows efficient intra row scanning. 
+A 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html[ColumnRangeFilter]
 allows efficient intra row scanning.
 
 A ColumnRangeFilter can seek ahead to the first matching column for each 
involved column family.
 It can be used to efficiently get a 'slice' of the columns of a very wide row.
 i.e.
-you have a million columns in a row but you only want to look at columns 
bbbb-bbdd. 
+you have a million columns in a row but you only want to look at columns 
bbbb-bbdd.
 
 Note: The same column qualifier can be used in different column families.
-This filter returns all matching columns. 
+This filter returns all matching columns.
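
As an in-memory analogy for the 'slice' idea, a sorted map can return a qualifier range without visiting every column. This is only a sketch of the concept under that analogy, not the filter's implementation. `ColumnSliceSketch` and `slice` are hypothetical names, and qualifiers are compared as Java strings rather than bytes.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory analogy for slicing a wide row by qualifier range. */
public class ColumnSliceSketch {

  /** Returns the columns whose qualifier lies in [min, max], both ends inclusive. */
  static NavigableMap<String, String> slice(
      NavigableMap<String, String> row, String min, String max) {
    return row.subMap(min, true, max, true);
  }

  public static void main(String[] args) {
    // One "row" whose columns are kept sorted by qualifier.
    NavigableMap<String, String> row = new TreeMap<>();
    for (String q : new String[] {"aaaa", "bbbb", "bbcc", "bbdd", "bbde", "cccc"}) {
      row.put(q, "value-" + q);
    }
    System.out.println(slice(row, "bbbb", "bbdd").keySet()); // prints [bbbb, bbcc, bbdd]
  }
}
```

Because the columns are sorted, the start of the range is found by seeking and the scan stops at the upper bound, which is what makes a range query cheap on a very wide row.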
 
 Example: Find all columns in a row and family between "bbbb" (inclusive) and 
"bbdd" (inclusive)
 
@@ -493,66 +497,65 @@ See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKey
 
 `HMaster` is the implementation of the Master Server.
 The Master server is responsible for monitoring all RegionServer instances in 
the cluster, and is the interface for all metadata changes.
-In a distributed cluster, the Master typically runs on the 
<<arch.hdfs.nn,arch.hdfs.nn>>.
-J Mohamed Zahoor goes into some more detail on the Master Architecture in this 
blog posting, 
link:http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/[HBase HMaster
-          Architecture ].
+In a distributed cluster, the Master typically runs on the <<arch.hdfs.nn>>.
+J Mohamed Zahoor goes into some more detail on the Master Architecture in this 
blog posting, 
link:http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/[HBase HMaster 
Architecture ].
 
 [[master.startup]]
 === Startup Behavior
 
 If run in a multi-Master environment, all Masters compete to run the cluster.
-If the active Master loses its lease in ZooKeeper (or the Master shuts down), 
then then the remaining Masters jostle to take over the Master role. 
+If the active Master loses its lease in ZooKeeper (or the Master shuts down), 
then the remaining Masters jostle to take over the Master role.
 
 [[master.runtime]]
 === Runtime Impact
 
 A common dist-list question involves what happens to an HBase cluster when the 
Master goes down.
-Because the HBase client talks directly to the RegionServers, the cluster can 
still function in a "steady state." Additionally, per 
<<arch.catalog,arch.catalog>>, `hbase:meta` exists as an HBase table and is not 
resident in the Master.
+Because the HBase client talks directly to the RegionServers, the cluster can 
still function in a "steady state". Additionally, per <<arch.catalog>>, 
`hbase:meta` exists as an HBase table and is not resident in the Master.
 However, the Master controls critical functions such as RegionServer failover 
and completing region splits.
-So while the cluster can still run for a short time without the Master, the 
Master should be restarted as soon as possible. 
+So while the cluster can still run for a short time without the Master, the 
Master should be restarted as soon as possible.
 
 [[master.api]]
 === Interface
 
-The methods exposed by `HMasterInterface` are primarily metadata-oriented 
methods: 
+The methods exposed by `HMasterInterface` are primarily metadata-oriented 
methods:
 
-* Table (createTable, modifyTable, removeTable, enable, disable) 
-* ColumnFamily (addColumn, modifyColumn, removeColumn) 
-* Region (move, assign, unassign)          For example, when the `HBaseAdmin` 
method `disableTable` is invoked, it is serviced by the Master server. 
+* Table (createTable, modifyTable, removeTable, enable, disable)
+* ColumnFamily (addColumn, modifyColumn, removeColumn)
+* Region (move, assign, unassign)
+
+For example, when the `Admin` method `disableTable` is invoked, it is serviced 
by the Master server.
 
 [[master.processes]]
 === Processes
 
-The Master runs several background threads: 
+The Master runs several background threads:
 
 [[master.processes.loadbalancer]]
 ==== LoadBalancer
 
 Periodically, and when there are no regions in transition, a load balancer 
will run and move regions around to balance the cluster's load.
-See <<balancer_config,balancer config>> for configuring this property.
+See <<balancer_config>> for configuring this property.
 
-See <<regions.arch.assignment,regions.arch.assignment>> for more information 
on region assignment. 
+See <<regions.arch.assignment>> for more information on region assignment.
 
 [[master.processes.catalog]]
 ==== CatalogJanitor
 
-Periodically checks and cleans up the hbase:meta table.
-See <<arch.catalog.meta,arch.catalog.meta>> for more information on META.
+Periodically checks and cleans up the `hbase:meta` table.
+See <<arch.catalog.meta>> for more information on the meta table.
 
 [[regionserver.arch]]
 == RegionServer
 
 `HRegionServer` is the RegionServer implementation.
 It is responsible for serving and managing regions.
-In a distributed cluster, a RegionServer runs on a 
<<arch.hdfs.dn,arch.hdfs.dn>>. 
+In a distributed cluster, a RegionServer runs on a <<arch.hdfs.dn>>.
 
 [[regionserver.arch.api]]
 === Interface
 
-The methods exposed by `HRegionRegionInterface` contain both data-oriented and 
region-maintenance methods: 
+The methods exposed by `HRegionInterface` contain both data-oriented and 
region-maintenance methods:
 
 * Data (get, put, delete, next, etc.)
-* Region (splitRegion, compactRegion, etc.) For example, when the `HBaseAdmin` 
method `majorCompact` is invoked on a table, the client is actually iterating 
through all regions for the specified table and requesting a major compaction 
directly to each region. 
+* Region (splitRegion, compactRegion, etc.)
+
+For example, when the `Admin` method `majorCompact` is invoked on a table, the 
client is actually iterating through all regions for the specified table and 
requesting a major compaction directly from each region.
 
 [[regionserver.arch.processes]]
 === Processes
@@ -582,94 +585,92 @@ Periodically checks the RegionServer's WAL.
 === Coprocessors
 
 Coprocessors were added in 0.92.
-There is a thorough 
link:https://blogs.apache.org/hbase/entry/coprocessor_introduction[Blog Overview
-            of CoProcessors] posted.
-Documentation will eventually move to this reference guide, but the blog is 
the most current information available at this time. 
+There is a thorough 
link:https://blogs.apache.org/hbase/entry/coprocessor_introduction[Blog 
Overview of CoProcessors] posted.
+Documentation will eventually move to this reference guide, but the blog is 
the most current information available at this time.
 
 [[block.cache]]
 === Block Cache
 
-HBase provides two different BlockCache implementations: the default onheap 
LruBlockCache and BucketCache, which is (usually) offheap.
+HBase provides two different BlockCache implementations: the default on-heap 
`LruBlockCache` and the `BucketCache`, which is (usually) off-heap.
 This section discusses benefits and drawbacks of each implementation, how to 
choose the appropriate option, and configuration options for each.
 
 .Block Cache Reporting: UI
 [NOTE]
 ====
 See the RegionServer UI for detail on caching deploy.
-Since HBase-0.98.4, the Block Cache detail has been significantly extended 
showing configurations, sizings, current usage, time-in-the-cache, and even 
detail on block counts and types.
+Since HBase 0.98.4, the Block Cache detail has been significantly extended 
showing configurations, sizings, current usage, time-in-the-cache, and even 
detail on block counts and types.
 ====
 
 ==== Cache Choices
 
-`LruBlockCache` is the original implementation, and is entirely within the 
Java heap. `BucketCache` is mainly intended for keeping blockcache data 
offheap, although BucketCache can also keep data onheap and serve from a 
file-backed cache. 
+`LruBlockCache` is the original implementation, and is entirely within the 
Java heap. `BucketCache` is mainly intended for keeping block cache data 
off-heap, although `BucketCache` can also keep data on-heap and serve from a 
file-backed cache.
 
-.BucketCache is production ready as of hbase-0.98.6
+.BucketCache is production ready as of HBase 0.98.6
 [NOTE]
 ====
 To run with BucketCache, you need HBASE-11678.
-This was included in hbase-0.98.6. 
-====          
+This was included in 0.98.6.
+====
 
-Fetching will always be slower when fetching from BucketCache, as compared to 
the native onheap LruBlockCache.
+Fetching will always be slower when fetching from BucketCache, as compared to 
the native on-heap LruBlockCache.
 However, latencies tend to be less erratic across time, because there is less 
garbage collection when you use BucketCache since it is managing BlockCache 
allocations, not the GC.
-If the BucketCache is deployed in offheap mode, this memory is not managed by 
the GC at all.
+If the BucketCache is deployed in off-heap mode, this memory is not managed by 
the GC at all.
 This is why you'd use BucketCache, so your latencies are less erratic and to 
mitigate GCs and heap fragmentation.
-See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 
101] for comparisons running onheap vs offheap tests.
-Also see link:http://people.apache.org/~stack/bc/[Comparing BlockCache 
Deploys]            which finds that if your dataset fits inside your 
LruBlockCache deploy, use it otherwise if you are experiencing cache churn (or 
you want your cache to exist beyond the vagaries of java GC), use BucketCache. 
+See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 
101] for comparisons running on-heap vs off-heap tests.
+Also see link:http://people.apache.org/~stack/bc/[Comparing BlockCache 
Deploys], which finds that if your dataset fits inside your LruBlockCache 
deploy, you should use it; otherwise, if you are experiencing cache churn (or 
you want your cache to exist beyond the vagaries of Java GC), use BucketCache.
 
-When you enable BucketCache, you are enabling a two tier caching system, an L1 
cache which is implemented by an instance of LruBlockCache and an offheap L2 
cache which is implemented by BucketCache.
+When you enable BucketCache, you are enabling a two tier caching system, an L1 
cache which is implemented by an instance of LruBlockCache and an off-heap L2 
cache which is implemented by BucketCache.
 Management of these two tiers and the policy that dictates how blocks move 
between them is done by `CombinedBlockCache`.
-It keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and 
BLOOM blocks -- onheap in the L1 `LruBlockCache`.
-See <<offheap.blockcache,offheap.blockcache>> for more detail on going offheap.
+It keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and 
BLOOM blocks -- on-heap in the L1 `LruBlockCache`.
+See <<offheap.blockcache>> for more detail on going off-heap.
 
 [[cache.configurations]]
 ==== General Cache Configurations
 
 Apart from the cache implementation itself, you can set some general 
configuration options to control how the cache performs.
-See 
link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html.
+See 
http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html.
 After setting any of these options, restart or rolling restart your cluster 
for the configuration to take effect.
 Check logs for errors or unexpected behavior.
 
-See also <<blockcache.prefetch,blockcache.prefetch>>, which discusses a new 
option introduced in 
link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857].
+See also <<blockcache.prefetch>>, which discusses a new option introduced in 
link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857].
 
 [[block.cache.design]]
 ==== LruBlockCache Design
 
-The LruBlockCache is an LRU cache that contains three levels of block priority 
to allow for scan-resistance and in-memory ColumnFamilies: 
+The LruBlockCache is an LRU cache that contains three levels of block priority 
to allow for scan-resistance and in-memory ColumnFamilies:
 
 * Single access priority: The first time a block is loaded from HDFS it 
normally has this priority and it will be part of the first group to be 
considered during evictions.
   The advantage is that scanned blocks are more likely to get evicted than 
blocks that are getting more usage.
-* Mutli access priority: If a block in the previous priority group is accessed 
again, it upgrades to this priority.
+* Multi access priority: If a block in the previous priority group is accessed 
again, it upgrades to this priority.
   It is thus part of the second group considered during evictions.
 * In-memory access priority: If the block's family was configured to be 
"in-memory", it will be part of this priority disregarding the number of times 
it was accessed.
   Catalog tables are configured like this.
   This group is the last one considered during evictions.
 +
-To mark a column family as in-memory, call 
+To mark a column family as in-memory, call
 
 [source,java]
 ----
 HColumnDescriptor.setInMemory(true);
----- 
+----
+
+if creating a table from java, or set `IN_MEMORY => true` when creating or 
altering a table in the shell: e.g.
 
-if creating a table from java, or set +IN_MEMORY => true+ when creating or 
altering a table in the shell: e.g.
- 
 [source]
 ----
 hbase(main):003:0> create  't', {NAME => 'f', IN_MEMORY => 'true'}
 ----
 
-For more information, see the 
link:http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html[LruBlockCache
-              source]          
+For more information, see the 
link:http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html[LruBlockCache
 source]
 
 [[block.cache.usage]]
 ==== LruBlockCache Usage
 
 Block caching is enabled by default for all the user tables which means that 
any read operation will load the LRU cache.
 This might be good for a large number of use cases, but further tunings are 
usually required in order to achieve better performance.
-An important concept is the 
link:http://en.wikipedia.org/wiki/Working_set_size[working set size], or WSS, 
which is: "the amount of memory needed to compute the answer to a problem". For 
a website, this would be the data that's needed to answer the queries over a 
short amount of time. 
+An important concept is the 
link:http://en.wikipedia.org/wiki/Working_set_size[working set size], or WSS, 
which is: "the amount of memory needed to compute the answer to a problem". For 
a website, this would be the data that's needed to answer the queries over a 
short amount of time.
 
-The way to calculate how much memory is available in HBase for caching is: 
+The way to calculate how much memory is available in HBase for caching is:
 
 [source]
 ----
@@ -679,47 +680,46 @@ number of region servers * heap size * 
hfile.block.cache.size * 0.99
 The default value for the block cache is 0.25 which represents 25% of the 
available heap.
 The last value (99%) is the default acceptable loading factor in the LRU cache 
after which eviction is started.
 The reason it is included in this equation is that it would be unrealistic to 
say that it is possible to use 100% of the available memory since this would 
make the process blocking from the point where it loads new blocks.
-Here are some examples: 
+Here are some examples:
 
 * One region server with the default heap size (1 GB) and the default block 
cache size will have 253 MB of block cache available.
 * 20 region servers with the heap size set to 8 GB and a default block cache 
size will have 39.6 GB of block cache.
 * 100 region servers with the heap size set to 24 GB and a block cache size of 
0.5 will have about 1.16 TB of block cache.
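
The arithmetic behind these examples can be checked with a short Java sketch.
It only illustrates the formula above: the class and method names
(`BlockCacheSizing`, `availableCache`) are made up for this example and are not
part of HBase, and 0.99 is simply the default loading factor quoted in the text.

```java
final class BlockCacheSizing {
    /** Default acceptable loading factor of the LRU cache, as quoted in the text. */
    static final double ACCEPTABLE_FACTOR = 0.99;

    /**
     * Cluster-wide block cache capacity, in the same unit as heapPerServer:
     * number of region servers * heap size * hfile.block.cache.size * 0.99.
     */
    static double availableCache(int regionServers, double heapPerServer,
                                 double blockCacheFraction) {
        return regionServers * heapPerServer * blockCacheFraction * ACCEPTABLE_FACTOR;
    }

    public static void main(String[] args) {
        // One server, 1 GB heap (expressed as 1024 MB), default fraction 0.25.
        System.out.printf("%.1f MB%n", availableCache(1, 1024.0, 0.25));
        // 20 servers, 8 GB heap each, default fraction 0.25.
        System.out.printf("%.1f GB%n", availableCache(20, 8.0, 0.25));
        // 100 servers, 24 GB heap each, fraction 0.5; divide by 1024 for TB.
        System.out.printf("%.2f TB%n", availableCache(100, 24.0, 0.5) / 1024.0);
    }
}
```

The three calls reproduce the three bullet points above: roughly 253 MB, 39.6 GB
and 1.16 TB.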
 
 Your data is not the only resident of the block cache.
-Here are others that you may have to take into account: 
+Here are others that you may have to take into account:
 
 Catalog Tables::
-  The `-ROOT-` (prior to HBase 0.96.
-  See <<arch.catalog.root,arch.catalog.root>>) and `hbase:meta` tables are 
forced into the block cache and have the in-memory priority which means that 
they are harder to evict.
-  The former never uses more than a few hundreds of bytes while the latter can 
occupy a few MBs (depending on the number of regions).
+  The `-ROOT-` (prior to HBase 0.96, see <<arch.catalog.root>>) and 
`hbase:meta` tables are forced into the block cache and have the in-memory 
priority which means that they are harder to evict.
+  The former never uses more than a few hundred bytes while the latter can 
occupy a few MBs (depending on the number of regions).
 
 HFiles Indexes::
-  An [firstterm]_hfile_ is the file format that HBase uses to store data in 
HDFS.
+  An _HFile_ is the file format that HBase uses to store data in HDFS.
   It contains a multi-layered index which allows HBase to seek to the data 
without having to read the whole file.
   The size of those indexes is a factor of the block size (64KB by default), 
the size of your keys and the amount of data you are storing.
   For big data sets it's not unusual to see numbers around 1GB per region 
server, although not all of it will be in cache because the LRU will evict 
indexes that aren't used.
 
 Keys::
-  The values that are stored are only half the picture, since each value is 
stored along with its keys (row key, family qualifier, and timestamp). See 
<<keysize,keysize>>.
+  The values that are stored are only half the picture, since each value is 
stored along with its keys (row key, family qualifier, and timestamp). See 
<<keysize>>.
 
 Bloom Filters::
   Just like the HFile indexes, those data structures (when enabled) are stored 
in the LRU.
 
 Currently the recommended way to measure HFile index and bloom filter sizes 
is to look at the region server web UI and check out the relevant metrics.
 For keys, sampling can be done by using the HFile command line tool and 
looking for the average key size metric.
-Since HBase 0.98.3, you can view detail on BlockCache stats and metrics in a 
special Block Cache section in the UI.
+Since HBase 0.98.3, you can view details on BlockCache stats and metrics in a 
special Block Cache section in the UI.
 
 It's generally bad to use block caching when the WSS doesn't fit in memory.
 This is the case when you have for example 40GB available across all your 
region servers' block caches but you need to process 1TB of data.
 One of the reasons is that the churn generated by the evictions will trigger 
more garbage collections unnecessarily.
-Here are two use cases: 
+Here are two use cases:
 
 * Fully random reading pattern: This is a case where you almost never access 
the same row twice within a short amount of time such that the chance of 
hitting a cached block is close to 0.
   Setting block caching on such a table is a waste of memory and CPU cycles, 
all the more so because it generates more garbage for the JVM to collect.
-  For more information on monitoring GC, see <<trouble.log.gc,trouble.log.gc>>.
+  For more information on monitoring GC, see <<trouble.log.gc>>.
 * Mapping a table: In a typical MapReduce job that takes a table in input, 
every row will be read only once so there's no need to put them into the block 
cache.
   The Scan object has the option of turning this off via the `setCacheBlocks` 
method (set it to false). You can still keep block caching turned on for this 
table if you need fast random read access.
-  An example would be counting the number of rows in a table that serves live 
traffic, caching every block of that table would create massive churn and would 
surely evict data that's currently in use. 
+  An example would be counting the number of rows in a table that serves live 
traffic: caching every block of that table would create massive churn and would 
surely evict data that's currently in use.
 
 [[data.blocks.in.fscache]]
 ===== Caching META blocks only (DATA blocks in fscache)
@@ -727,57 +727,55 @@ Here are two use cases:
 An interesting setup is one where we cache META blocks only and we read DATA 
blocks in on each access.
 If the DATA blocks fit inside fscache, this alternative may make sense when 
access is completely random across a very large dataset.
 To enable this setup, alter your table and for each column family set 
`BLOCKCACHE => 'false'`.
-You are 'disabling' the BlockCache for this column family only you can never 
disable the caching of META blocks.
-Since link:https://issues.apache.org/jira/browse/HBASE-4683[HBASE-4683 Always 
cache index and bloom blocks], we will cache META blocks even if the BlockCache 
is disabled. 
+You are 'disabling' the BlockCache for this column family only. You can never 
disable the caching of META blocks.
+Since link:https://issues.apache.org/jira/browse/HBASE-4683[HBASE-4683 Always 
cache index and bloom blocks], we will cache META blocks even if the BlockCache 
is disabled.
 
 [[offheap.blockcache]]
-==== Offheap Block Cache
+==== Off-heap Block Cache
 
 [[enable.bucketcache]]
 ===== How to Enable BucketCache
 
-The usual deploy of BucketCache is via a managing class that sets up two 
caching tiers: an L1 onheap cache implemented by LruBlockCache and a second L2 
cache implemented with BucketCache.
+The usual deploy of BucketCache is via a managing class that sets up two 
caching tiers: an L1 on-heap cache implemented by LruBlockCache and a second L2 
cache implemented with BucketCache.
 The managing class is 
link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html[CombinedBlockCache]
 by default.
-The just-previous link describes the caching 'policy' implemented by 
CombinedBlockCache.
-In short, it works by keeping meta blocks -- INDEX and BLOOM in the L1, onheap 
LruBlockCache tier -- and DATA blocks are kept in the L2, BucketCache tier.
-It is possible to amend this behavior in HBase since version 1.0 and ask that 
a column family have both its meta and DATA blocks hosted onheap in the L1 tier 
by setting `cacheDataInL1` via `(HColumnDescriptor.setCacheDataInL1(true)`      
      or in the shell, creating or amending column families setting 
`CACHE_DATA_IN_L1`            to true: e.g. 
+The previous link describes the caching 'policy' implemented by 
CombinedBlockCache.
+In short, it works by keeping meta blocks (INDEX and BLOOM) in the L1, on-heap 
LruBlockCache tier, while DATA blocks are kept in the L2, BucketCache tier.
+It is possible to amend this behavior in HBase since version 1.0 and ask that 
a column family have both its meta and DATA blocks hosted on-heap in the L1 
tier by setting `cacheDataInL1` via `HColumnDescriptor.setCacheDataInL1(true)` 
or, in the shell, by creating or amending column families with 
`CACHE_DATA_IN_L1` set to true: e.g.
 [source]
 ----
 hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => 
{CACHE_DATA_IN_L1 => 'true'}}
 ----
 
-The BucketCache Block Cache can be deployed onheap, offheap, or file based.
+The BucketCache Block Cache can be deployed on-heap, off-heap, or file based.
 You set which via the `hbase.bucketcache.ioengine` setting.
-Setting it to `heap` will have BucketCache deployed inside the  allocated java 
heap.
-Setting it to `offheap` will have BucketCache make its allocations offheap, 
and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use a 
file caching (Useful in particular if you have some fast i/o attached to the 
box such as SSDs). 
+Setting it to `heap` will have BucketCache deployed inside the allocated Java 
heap.
+Setting it to `offheap` will have BucketCache make its allocations off-heap, 
and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use 
file-based caching (useful in particular if you have some fast I/O attached to 
the box, such as SSDs).
 
 It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache 
policy and have BucketCache working as a strict L2 cache to the L1 
LruBlockCache.
 For such a setup, set `CacheConfig.BUCKET_CACHE_COMBINED_KEY` to `false`.
 In this mode, on eviction from L1, blocks go to L2.
 When a block is cached, it is cached first in L1.
 When we go to look for a cached block, we look first in L1 and if none found, 
then search L2.
-Let us call this deploy format, 
-_(((Raw L1+L2)))_.
+Let us call this deploy format, _Raw L1+L2_.
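
The lookup and eviction order just described can be sketched as a toy two-tier
cache in plain Java. This is not HBase code: `RawTwoTierCache` and its methods
are hypothetical names, L1 is modelled as a small access-ordered map, L2 is left
unbounded for simplicity, and whether an L2 hit should be promoted back to L1 is
deliberately left out because the text does not say.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class RawTwoTierCache {
    private final int l1Capacity;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, byte[]> l1 = new LinkedHashMap<>(16, 0.75f, true);
    // Stand-in for the L2 tier; unbounded in this sketch.
    private final Map<String, byte[]> l2 = new LinkedHashMap<>();

    RawTwoTierCache(int l1Capacity) {
        this.l1Capacity = l1Capacity;
    }

    /** A block is always cached in L1 first; blocks evicted from L1 go to L2. */
    void cacheBlock(String key, byte[] block) {
        l1.put(key, block);
        while (l1.size() > l1Capacity) {
            Iterator<Map.Entry<String, byte[]>> it = l1.entrySet().iterator();
            Map.Entry<String, byte[]> eldest = it.next();
            l2.put(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    /** Look in L1 first and search L2 only if nothing was found. */
    byte[] getBlock(String key) {
        byte[] block = l1.get(key);
        return block != null ? block : l2.get(key);
    }

    boolean inL1(String key) {
        return l1.containsKey(key);
    }

    boolean inL2(String key) {
        return l2.containsKey(key);
    }
}
```

With an L1 capacity of two, caching blocks `a`, `b` and `c` in that order leaves
`b` and `c` in L1 and moves `a` to L2, where `getBlock("a")` still finds it.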
 
 Other BucketCache configs include: specifying a location to persist cache to 
across restarts, how many threads to use writing the cache, etc.
-See the 
link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig.html]
              class for configuration options and descriptions.
+See the 
link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig.html]
 class for configuration options and descriptions.
 
 
 
 ====== BucketCache Example Configuration
-This sample provides a configuration for a 4 GB offheap BucketCache with a 1 
GB onheap cache.
+This sample provides a configuration for a 4 GB off-heap BucketCache with a 1 
GB on-heap cache.
 
 Configuration is performed on the RegionServer.
 
-Setting `hbase.bucketcache.ioengine` and  `hbase.bucketcache.size` > 0 enables 
CombinedBlockCache.
-Let us presume that the RegionServer has been set to run with a 5G heap: i.e.
-HBASE_HEAPSIZE=5g. 
+Setting `hbase.bucketcache.ioengine` and `hbase.bucketcache.size` > 0 enables 
`CombinedBlockCache`.
+Let us presume that the RegionServer has been set to run with a 5G heap: i.e. 
`HBASE_HEAPSIZE=5g`.
 
 
-. First, edit the RegionServer's _hbase-env.sh_ and set `HBASE_OFFHEAPSIZE` to 
a value greater than the offheap size wanted, in this case, 4 GB (expressed as 
4G).  Lets set it to 5G.
-  That'll be 4G for our offheap cache and 1G for any other uses of offheap 
memory (there are other users of offheap memory other than BlockCache; e.g.
-  DFSClient  in RegionServer can make use of offheap memory). See 
<<direct.memory,direct.memory>>.
-  +
+. First, edit the RegionServer's _hbase-env.sh_ and set `HBASE_OFFHEAPSIZE` to 
a value greater than the off-heap size wanted, in this case, 4 GB (expressed as 
4G). Let's set it to 5G.
+  That'll be 4G for our off-heap cache and 1G for any other uses of off-heap 
memory (there are other users of off-heap memory other than BlockCache; e.g.
+  DFSClient in RegionServer can make use of off-heap memory). See 
<<direct.memory>>.
++
 [source]
 ----
 HBASE_OFFHEAPSIZE=5G
@@ -804,14 +802,15 @@ HBASE_OFFHEAPSIZE=5G
 . Restart or rolling restart your cluster, and check the logs for any issues.
 
 
-In the above, we set bucketcache to be 4G.
-The onheap lrublockcache we configured to have 0.2 of the RegionServer's heap 
size (0.2 * 5G = 1G). In other words, you configure the L1 LruBlockCache as you 
would normally, as you would when there is no L2 BucketCache present. 
+In the above, we set the BucketCache to be 4G.
+We configured the on-heap LruBlockCache to have 20% (0.2) of the 
RegionServer's heap size (0.2 * 5G = 1G). In other words, you configure the L1 
LruBlockCache as you would normally (as if there were no L2 cache present).
 
-link:https://issues.apache.org/jira/browse/HBASE-10641[HBASE-10641] introduced 
the ability to configure multiple sizes for the buckets of the bucketcache, in 
HBase 0.98 and newer.
-To configurable multiple bucket sizes, configure the new property 
+hfile.block.cache.sizes+ (instead of +hfile.block.cache.size+) to a 
comma-separated list of block sizes, ordered from smallest to largest, with no 
spaces.
+link:https://issues.apache.org/jira/browse/HBASE-10641[HBASE-10641] introduced 
the ability to configure multiple sizes for the buckets of the BucketCache, in 
HBase 0.98 and newer.
+To configure multiple bucket sizes, configure the new property 
`hfile.block.cache.sizes` (instead of `hfile.block.cache.size`) to a 
comma-separated list of block sizes, ordered from smallest to largest, with no 
spaces.
 The goal is to optimize the bucket sizes based on your data access patterns.
 The following example configures buckets of size 4096 and 8192.
 
+[source,xml]
 ----
 <property>
   <name>hfile.block.cache.sizes</name>
@@ -819,21 +818,21 @@ The following example configures buckets of size 4096 and 
8192.
 </property>
 ----
 
+[[direct.memory]]
 .Direct Memory Usage In HBase
 [NOTE]
 ====
 The default maximum direct memory varies by JVM.
 Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no 
limit at all (JDK7 apparently). HBase servers use direct memory; in particular, 
with short-circuit reading, the hosted DFSClient will allocate direct memory 
buffers.
-If you do offheap block caching, you'll be making use of direct memory.
-Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in 
_conf/hbase-env.sh_ is set to some value that is higher than what you have 
allocated to your offheap blockcache (`hbase.bucketcache.size`).  It should be 
larger than your offheap block cache and then some for DFSClient usage (How 
much the DFSClient uses is not easy to quantify; it is the number of open 
hfiles * `hbase.dfs.client.read.shortcircuit.buffer.size`                    
where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- 
see _hbase-default.xml_                    default configurations). Direct 
memory, which is part of the Java process heap, is separate from the object 
heap allocated by -Xmx.
-The value allocated by MaxDirectMemorySize must not exceed physical RAM, and 
is likely to be less than the total available RAM due to other memory 
requirements and system constraints. 
+If you do off-heap block caching, you'll be making use of direct memory.
+Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in 
_conf/hbase-env.sh_ is set to some value that is higher than what you have 
allocated to your off-heap BlockCache (`hbase.bucketcache.size`). It should be 
larger than your off-heap block cache and then some for DFSClient usage (How 
much the DFSClient uses is not easy to quantify; it is the number of open 
HFiles * `hbase.dfs.client.read.shortcircuit.buffer.size` where 
`hbase.dfs.client.read.shortcircuit.buffer.size` is set to 128k in HBase -- see 
_hbase-default.xml_ default configurations). Direct memory, which is part of 
the Java process heap, is separate from the object heap allocated by -Xmx.
+The value allocated by `MaxDirectMemorySize` must not exceed physical RAM, and 
is likely to be less than the total available RAM due to other memory 
requirements and system constraints.
 
-You can see how much memory -- onheap and offheap/direct -- a RegionServer is 
configured to use and how much it is using at any one time by looking at the 
_Server Metrics: Memory_ tab in the UI.
+You can see how much memory -- on-heap and off-heap/direct -- a RegionServer 
is configured to use and how much it is using at any one time by looking at the 
_Server Metrics: Memory_ tab in the UI.
 It can also be gotten via JMX.
 In particular the direct memory currently used by the server can be found on 
the `java.nio.type=BufferPool,name=direct` bean.
-Terracotta has a 
link:http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options[good
 write up] on using offheap memory in java.
-It is for their product BigMemory but alot of the issues noted apply in 
general to any attempt at going offheap.
-Check it out.
+Terracotta has a 
link:http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options[good
 write up] on using off-heap memory in Java.
+It is for their product BigMemory but a lot of the issues noted apply in 
general to any attempt at going off-heap. Check it out.
 ====
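
As a rough illustration of the sizing rule in the note above, the following
Java sketch adds the off-heap BlockCache size to the DFSClient estimate (open
HFiles times the short-circuit buffer size). The class and method names are
hypothetical, the count of 2000 open HFiles is an invented input, and the
result is only a lower bound: the note itself says DFSClient usage is not easy
to quantify.

```java
final class DirectMemoryEstimate {
    // 128k short-circuit read buffer, the HBase default quoted in the note.
    static final long SHORT_CIRCUIT_BUFFER_BYTES = 128L * 1024;

    /** Rough lower bound for -XX:MaxDirectMemorySize, in bytes. */
    static long minDirectMemoryBytes(long bucketCacheBytes, long openHFiles) {
        return bucketCacheBytes + openHFiles * SHORT_CIRCUIT_BUFFER_BYTES;
    }

    public static void main(String[] args) {
        long bucketCache = 4L * 1024 * 1024 * 1024; // 4 GB off-heap BucketCache
        long openHFiles = 2000;                     // hypothetical number of open HFiles
        long bytes = minDirectMemoryBytes(bucketCache, openHFiles);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

For these inputs the estimate is 4096 MB for the cache plus 250 MB for
DFSClient buffers, so `MaxDirectMemorySize` would need to be set above 4346 MB.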
 
 .hbase.bucketcache.percentage.in.combinedcache
@@ -842,24 +841,47 @@ Check it out.
 This is a pre-HBase 1.0 configuration removed because it was confusing.
 It was a float that you would set to some value between 0.0 and 1.0.
 Its default was 0.9.
-If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was 
calculated to be (1 - `hbase.bucketcache.percentage.in.combinedcache`) * 
`size-of-bucketcache`  and the BucketCache size was 
`hbase.bucketcache.percentage.in.combinedcache` * size-of-bucket-cache.
-where size-of-bucket-cache itself is EITHER the value of the configuration 
hbase.bucketcache.size IF it was specified as megabytes OR 
`hbase.bucketcache.size` * `-XX:MaxDirectMemorySize` if 
`hbase.bucketcache.size` between 0 and 1.0. 
+If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was 
calculated to be `(1 - hbase.bucketcache.percentage.in.combinedcache) * 
size-of-bucket-cache` and the BucketCache size was 
`hbase.bucketcache.percentage.in.combinedcache * size-of-bucket-cache`, 
+where size-of-bucket-cache itself is EITHER the value of the configuration 
`hbase.bucketcache.size` IF it was specified as megabytes OR 
`hbase.bucketcache.size` * `-XX:MaxDirectMemorySize` if 
`hbase.bucketcache.size` is between 0 and 1.0.
 
 In 1.0, it should be more straight-forward.
-L1 LruBlockCache size is set as a fraction of java heap using 
hfile.block.cache.size setting (not the best name) and L2 is set as above 
either in absolute megabytes or as a fraction of allocated maximum direct 
memory. 
+L1 LruBlockCache size is set as a fraction of Java heap using the 
`hfile.block.cache.size` setting (not the best name) and L2 is set as above 
either in absolute megabytes or as a fraction of allocated maximum direct 
memory.
 ====
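
For readers maintaining pre-1.0 configurations, the old calculation described
in the note above can be written out as a small Java sketch. The names are
hypothetical, and treating any `hbase.bucketcache.size` value below 1.0 as a
fraction of maximum direct memory is an assumption about how the boundary was
handled, not something the note states.

```java
final class CombinedCacheSplit {
    /** Resolves hbase.bucketcache.size to megabytes. */
    static double bucketCacheTotalMb(double configuredSize, double maxDirectMemoryMb) {
        // Assumption: values below 1.0 are fractions of -XX:MaxDirectMemorySize;
        // anything else is already an absolute size in megabytes.
        return configuredSize < 1.0 ? configuredSize * maxDirectMemoryMb : configuredSize;
    }

    /** L1 size: (1 - percentage.in.combinedcache) * size-of-bucket-cache. */
    static double l1SizeMb(double percentageInCombinedCache, double totalMb) {
        return (1.0 - percentageInCombinedCache) * totalMb;
    }

    /** L2 size: percentage.in.combinedcache * size-of-bucket-cache. */
    static double l2SizeMb(double percentageInCombinedCache, double totalMb) {
        return percentageInCombinedCache * totalMb;
    }

    public static void main(String[] args) {
        // hbase.bucketcache.size = 0.5 with 8192 MB of maximum direct memory.
        double total = bucketCacheTotalMb(0.5, 8192.0);
        // 0.9 was the default value of the removed setting.
        System.out.println("L1 MB: " + l1SizeMb(0.9, total));
        System.out.println("L2 MB: " + l2SizeMb(0.9, total));
    }
}
```

With the old default of 0.9 and a 4096 MB total, about 410 MB went to the L1
LruBlockCache and about 3686 MB to the BucketCache, which shows why a single
setting that sized both tiers was easy to misread.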
 
-==== Comprewssed Blockcache
+==== Compressed BlockCache
 
-link:https://issues.apache.org/jira/browse/HBASE-11331[HBASE-11331] introduced 
lazy blockcache decompression, more simply referred to as compressed blockcache.
-When compressed blockcache is enabled.
-data and encoded data blocks are cached in the blockcache in their on-disk 
format, rather than being decompressed and decrypted before caching.
+link:https://issues.apache.org/jira/browse/HBASE-11331[HBASE-11331] introduced 
lazy BlockCache decompression, more simply referred to as compressed BlockCache.
+When compressed BlockCache is enabled, data and encoded data blocks are cached 
in the BlockCache in their on-disk format, rather than being decompressed and 
decrypted before caching.
 
 For a RegionServer hosting more data than can fit into cache, enabling this 
feature with SNAPPY compression has been shown to result in a 50% increase in 
throughput and a 30% improvement in mean latency, while increasing garbage 
collection by 80% and increasing overall CPU load by 2%. See HBASE-11331 for 
more details about how performance was measured and achieved.
 For a RegionServer hosting data that can comfortably fit into cache, or if 
your workload is sensitive to extra CPU or garbage-collection load, you may 
receive less benefit.
 
-Compressed blockcache is disabled by default.
-To enable it, set `hbase.block.data.cachecompressed` to `true` in 
_hbase-site.xml_ on all RegionServers.
+The compressed BlockCache is disabled by default. To enable it, set 
`hbase.block.data.cachecompressed` to `true` in _hbase-site.xml_ on all 
RegionServers.
+
+[[regionserver_splitting_implementation]]
+=== RegionServer Splitting Implementation
+
+As write requests are handled by the RegionServer, they accumulate in an 
in-memory storage system called the _memstore_. Once the memstore fills, its 
contents are written to disk as additional store files. This event is called a 
_memstore flush_. As store files accumulate, the RegionServer will 
<<compaction,compact>> them into fewer, larger files. After each flush or 
compaction finishes, the amount of data stored in the region has changed. The 
RegionServer consults the region split policy to determine if the region has 
grown too large or should be split for another policy-specific reason. A region 
split request is enqueued if the policy recommends it.
+
+Logically, the process of splitting a region is simple. We find a suitable 
point in the keyspace of the region where we should divide the region in half, 
then split the region's data into two new regions at that point. The details of 
the process, however, are not simple. When a split happens, the newly created 
_daughter regions_ do not rewrite all the data into new files immediately. 
Instead, they create small files similar to symbolic link files, named 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/Reference.html[Reference 
files], which point to either the top or bottom part of the parent store file 
according to the split point. The reference file is used just like a regular 
data file, but only half of the records are considered. The region can only be 
split if there are no more references to the immutable data files of the parent 
region. Those reference files are cleaned gradually by compactions, so that the 
region will stop referring to its parent's files, and can be split further.
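
As a rough illustration of the reference-file idea, the following Python sketch models a parent store file as a sorted tuple of (row key, value) pairs, and a reference as a pointer to that file plus a split key and a top/bottom flag. The class and field names are hypothetical and are not HBase's `Reference` API; which half counts as "top" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[str, str]  # (row key, value)


@dataclass(frozen=True)
class HalfReference:
    """Hypothetical stand-in for a reference file: no data, only a pointer."""
    parent_file: Tuple[Row, ...]  # immutable parent store file, sorted by key
    split_key: str
    top: bool  # assumption: True = keys >= split_key, False = keys < split_key

    def scan(self) -> List[Row]:
        # Read the parent file, but consider only one half of its records.
        if self.top:
            return [r for r in self.parent_file if r[0] >= self.split_key]
        return [r for r in self.parent_file if r[0] < self.split_key]


parent = (("a", "1"), ("c", "2"), ("m", "3"), ("x", "4"))
daughter_a = HalfReference(parent, split_key="m", top=False)
daughter_b = HalfReference(parent, split_key="m", top=True)
```

Neither daughter copies any rows; together their scans cover the parent file exactly once, which is why the parent's files must stay in place until compactions rewrite the data.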
+
+Although splitting the region is a local decision made by the RegionServer, 
the split process itself must coordinate with many actors. The RegionServer 
notifies the Master before and after the split, updates the `.META.` table so 
that clients can discover the new daughter regions, and rearranges the 
directory structure and data files in HDFS. Splitting is a multi-task process. 
To enable rollback in case of an error, the RegionServer keeps an in-memory 
journal about the execution state. The steps taken by the RegionServer to 
execute the split are illustrated in <<regionserver_split_process_image>>. Each 
step is labeled with its step number. Actions from RegionServers or Master are 
shown in red, while actions from the clients are shown in green.
+
+[[regionserver_split_process_image]]
+.RegionServer Split Process
+image::region_split_process.png[Region Split Process]
+
+. The RegionServer decides locally to split the region, and prepares the 
split. *THE SPLIT TRANSACTION IS STARTED.* As a first step, the RegionServer 
acquires a shared read lock on the table to prevent schema modifications during 
the splitting process. Then it creates a znode in ZooKeeper under 
`/hbase/region-in-transition/region-name`, and sets the znode's state to 
`SPLITTING`.
+. The Master learns about this znode, since it has a watcher for the parent 
`region-in-transition` znode.
+. The RegionServer creates a sub-directory named `.splits` under the 
parent's `region` directory in HDFS.
+. The RegionServer closes the parent region and marks the region as offline in 
its local data structures. *THE SPLITTING REGION IS NOW OFFLINE.* At this 
point, client requests coming to the parent region will throw 
`NotServingRegionException`. The client will retry with some backoff. The 
closing region is flushed.
+. The RegionServer creates region directories under the `.splits` directory, 
for daughter regions A and B, and creates necessary data structures. Then it 
splits the store files, in the sense that it creates two 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/Reference.html[Reference] 
files per store file in the parent region. Those reference files will point to 
the parent region's files.
+. The RegionServer creates the actual region directory in HDFS, and moves the 
reference files for each daughter.
+. The RegionServer sends a `Put` request to the `.META.` table, to set the 
parent as offline in the `.META.` table and add information about daughter 
regions. At this point, there won't be individual entries in `.META.` for the 
daughters. Clients will see that the parent region is split if they scan 
`.META.`, but won't know about the daughters until they appear in `.META.`. 
Also, if this `Put` to `.META.` succeeds, the parent will be effectively split. 
If the RegionServer fails before this RPC succeeds, Master and the next 
RegionServer opening the region will clean dirty state about the region split. 
After the `.META.` update, though, the region split will be rolled forward by 
Master.
+. The RegionServer opens daughters A and B in parallel.
+. The RegionServer adds the daughters A and B to `.META.`, together with 
information that it hosts the regions. *THE SPLIT REGIONS (DAUGHTERS WITH 
REFERENCES TO PARENT) ARE NOW ONLINE.* After this point, clients can discover 
the new regions and issue requests to them. Clients cache the `.META.` entries 
locally, but when they make requests to the RegionServer or `.META.`, their 
caches will be invalidated, and they will learn about the new regions from 
`.META.`.
+. The RegionServer updates znode `/hbase/region-in-transition/region-name` in 
ZooKeeper to state `SPLIT`, so that the master can learn about it. The balancer 
can freely re-assign the daughter regions to other region servers if necessary. 
*THE SPLIT TRANSACTION IS NOW FINISHED.*
+. After the split, `.META.` and HDFS will still contain references to the 
parent region. Those references will be removed when compactions in daughter 
regions rewrite the data files. Garbage collection tasks in the master 
periodically check whether the daughter regions still refer to the parent 
region's files. If not, the parent region will be removed.
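
The in-memory journal mentioned above can be pictured with a small Python sketch. It covers only the rollback half of the story (failures before the `.META.` update); the roll-forward by the Master is not modeled. The step names and the `run_with_journal` helper are hypothetical and are not taken from the HBase code base.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, do, undo)


def run_with_journal(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo the completed steps in reverse."""
    journal: List[Step] = []  # in-memory record of what has been done so far
    try:
        for step in steps:
            _name, do, _undo = step
            do()
            journal.append(step)  # journaled only after the step succeeded
    except Exception:
        for _name, _do, undo in reversed(journal):
            undo()
        return []
    return [name for name, _do, _undo in journal]


log: List[str] = []


def fail() -> None:
    raise IOError("simulated failure while closing the parent region")


steps: List[Step] = [
    ("create_znode", lambda: log.append("znode+"), lambda: log.append("znode-")),
    ("create_splits_dir", lambda: log.append("dir+"), lambda: log.append("dir-")),
    ("close_parent", fail, lambda: log.append("reopen")),
]
completed = run_with_journal(steps)
```

Because a step is journaled only after it succeeds, the failed third step is never undone, while the two earlier steps are reverted in reverse order.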
 
 [[wal]]
 === Write Ahead Log (WAL)
@@ -867,31 +889,31 @@ To enable it, set `hbase.block.data.cachecompressed` to 
`true` in _hbase-site.xm
 [[purpose.wal]]
 ==== Purpose
 
-The [firstterm]_Write Ahead Log (WAL)_ records all changes to data in HBase, 
to file-based storage.
+The _Write Ahead Log (WAL)_ records all changes to data in HBase, to 
file-based storage.
 Under normal operations, the WAL is not needed because data changes move from 
the MemStore to StoreFiles.
 However, if a RegionServer crashes or becomes unavailable before the MemStore 
is flushed, the WAL ensures that the changes to the data can be replayed.
 If writing to the WAL fails, the entire operation to modify the data fails.
 
 HBase uses an implementation of the 
link:http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html[WAL]
 interface.
 Usually, there is only one instance of a WAL per RegionServer.
-The RegionServer records Puts and Deletes to it, before recording them to the 
<<store.memstore,store.memstore>> for the affected <<store,store>>. 
+The RegionServer records Puts and Deletes to it, before recording them to the 
<<store.memstore>> for the affected <<store>>.
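
The ordering described here (log first, then MemStore, and replay after a crash) can be shown with a toy model in Python. This is a sketch under simplifying assumptions: `TinyStore` and its methods are hypothetical names, the "WAL" is an in-memory list rather than a file in HDFS, and flushing to StoreFiles is not modeled.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str, str]  # (operation, row key, value)


class TinyStore:
    """Toy store: append each edit to a log, then apply it in memory."""

    def __init__(self) -> None:
        self.wal: List[Edit] = []          # stands in for the WAL file
        self.memstore: Dict[str, str] = {}

    def _append_wal(self, op: str, key: str, value: str) -> None:
        self.wal.append((op, key, value))

    def put(self, key: str, value: str) -> None:
        # If the log append raises, the memstore is never touched,
        # so the whole operation fails.
        self._append_wal("put", key, value)
        self.memstore[key] = value

    def delete(self, key: str) -> None:
        self._append_wal("delete", key, "")
        self.memstore.pop(key, None)

    @classmethod
    def recover(cls, wal: List[Edit]) -> "TinyStore":
        # Rebuild the in-memory state by replaying logged edits in order.
        store = cls()
        for op, key, value in wal:
            if op == "put":
                store.put(key, value)
            else:
                store.delete(key)
        return store


s = TinyStore()
s.put("row1", "a")
s.put("row2", "b")
s.delete("row1")
recovered = TinyStore.recover(list(s.wal))
```

Replaying the surviving log reproduces the same in-memory state that was lost, which is the property the WAL exists to provide.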
 
 .The HLog
 [NOTE]
 ====
 Prior to 2.0, the interface for WALs in HBase was named `HLog`.
 In 0.94, HLog was the name of the implementation of the WAL.
-You will likely find references to the HLog in documentation tailored to these 
older versions. 
+You will likely find references to the HLog in documentation tailored to these 
older versions.
 ====
 
 The WAL resides in HDFS in the _/hbase/WALs/_ directory (prior to HBase 0.94, 
they were stored in _/hbase/.logs/_), with subdirectories per region.
 
-For more general information about the concept of write ahead logs, see the 
Wikipedia link:http://en.wikipedia.org/wiki/Write-ahead_logging[Write-Ahead 
Log]            article. 
+For more general information about the concept of write ahead logs, see the 
Wikipedia link:http://en.wikipedia.org/wiki/Write-ahead_logging[Write-Ahead 
Log] article.
 
 [[wal_flush]]
 ==== WAL Flushing
 
-TODO (describe). 
+TODO (describe).
 
 ==== WAL Splitting
 
@@ -900,8 +922,7 @@ All of the regions in a region server share the same active 
WAL file.
 Each edit in the WAL file includes information about which region it belongs 
to.
 When a region is opened, the edits in the WAL file which belong to that region 
need to be replayed.
 Therefore, edits in the WAL file must be grouped by region so that particular 
sets can be replayed to regenerate the data in a particular region.
-The process of grouping the WAL edits by region is called [firstterm]_log
-              splitting_.
+The process of grouping the WAL edits by region is called _log splitting_.
 It is a critical process for recovering data if a region server fails.
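
At its core, log splitting is a group-by over the edits in a shared log. The Python sketch below shows that idea only; the `WalEdit` record and the `split_log` function are hypothetical, and real log splitting also deals with files, sequence IDs in file names, and failures, none of which is modeled here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class WalEdit(NamedTuple):
    seq_id: int    # position of the edit in the shared log
    region: str    # region the edit belongs to
    payload: str   # placeholder for the actual change


def split_log(edits: List[WalEdit]) -> Dict[str, List[WalEdit]]:
    """Group edits by region, preserving the order they had in the log."""
    by_region: Dict[str, List[WalEdit]] = defaultdict(list)
    for edit in edits:
        by_region[edit.region].append(edit)
    return dict(by_region)


wal = [
    WalEdit(1, "region-a", "put r1"),
    WalEdit(2, "region-b", "put r9"),
    WalEdit(3, "region-a", "delete r1"),
]
grouped = split_log(wal)
```

Each region's list can then be replayed on its own when that region is opened, without reading edits that belong to other regions.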
 
 Log splitting is done by the HMaster during cluster start-up or by the 
ServerShutdownHandler as a region server shuts down.
@@ -945,8 +966,7 @@ After log splitting completes, the _.temp_ file is renamed 
to the sequence ID of
 To determine whether all edits have been written, the sequence ID is compared 
to the sequence of the last edit that was written to the HFile.
 If the sequence of the last edit is greater than or equal to the sequence ID 
included in the file name, it is clear that all writes from the edit file have 
been completed.
 
-. After log splitting is complete, each affected region is assigned to a
-  RegionServer.
+. After log splitting is complete, each affected region is assigned to a 
RegionServer.
 +
 When the region is opened, the _recovered.edits_ folder is checked for 
recovered edits files.
 If any such files are present, they are replayed by reading the edits and 
saving them to the MemStore.
@@ -955,60 +975,57 @@ After all edit files are replayed, the contents of the 
MemStore are written to d
 
 ===== Handling of Errors During Log Splitting
 
-If you set the `hbase.hlog.split.skip.errors` option to [constant]+true+, 
errors are treated as follows:
+If you set the `hbase.hlog.split.skip.errors` option to `true`, errors are 
treated as follows:
 
 * Any error encountered during splitting will be logged.
-* The problematic WAL log will be moved into the _.corrupt_                  
directory under the hbase `rootdir`,
+* The problematic WAL log will be moved into the _.corrupt_ directory under 
the HBase `rootdir`.
 * Processing of the WAL will continue.
 
-If the `hbase.hlog.split.skip.errors` optionset to `false`, the default, the 
exception will be propagated and the split will be logged as failed.
-See link:https://issues.apache.org/jira/browse/HBASE-2958[HBASE-2958 When
-hbase.hlog.split.skip.errors is set to false, we fail the split but thats
-it].
+If the `hbase.hlog.split.skip.errors` option is set to `false`, the default, 
the exception will be propagated and the split will be logged as failed.
+See link:https://issues.apache.org/jira/browse/HBASE-2958[HBASE-2958 When 
hbase.hlog.split.skip.errors is set to false, we fail the split but thats it].
 We need to do more than just fail split if this flag is set.
 
-====== How EOFExceptions are treated when splitting a crashed 
RegionServers'WALs
+====== How EOFExceptions are treated when splitting a crashed RegionServer's 
WALs
 
 If an EOFException occurs while splitting logs, the split proceeds even when 
`hbase.hlog.split.skip.errors` is set to `false`.
-An EOFException while reading the last log in the set of files to split is 
likely, because the RegionServer is likely to be in the process of writing a 
record at the time of a crash.
-For background, see 
link:https://issues.apache.org/jira/browse/HBASE-2643[HBASE-2643
-                      Figure how to deal with eof splitting logs]
+An EOFException while reading the last log in the set of files to split is 
likely, because the RegionServer was likely in the process of writing a record 
at the time of a crash.
+For background, see 
link:https://issues.apache.org/jira/browse/HBASE-2643[HBASE-2643 Figure how to 
deal with eof splitting logs].
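
The error-handling rules in this section can be summarized in a short Python sketch. It is an illustration under stated assumptions, not HBase code: `split_logs` and its `skip_errors` parameter are hypothetical stand-ins for the splitting code and the `hbase.hlog.split.skip.errors` option, Python's `EOFError` and `OSError` stand in for Java's `EOFException` and other I/O exceptions, and moving a file to _.corrupt_ is represented by adding its name to a list.

```python
from typing import Callable, Dict, List


def split_logs(
    logs: Dict[str, Callable[[], List[str]]],
    skip_errors: bool,
) -> Dict[str, List[str]]:
    """Process each log; `logs` maps a log name to a reader that may raise."""
    edits: List[str] = []
    corrupt: List[str] = []
    errors: List[str] = []
    for name, read in logs.items():
        try:
            edits.extend(read())
        except EOFError as e:
            # A truncated final record is expected after a crash, so the
            # split proceeds regardless of the skip_errors setting.
            errors.append(f"{name}: {e}")
        except OSError as e:
            if not skip_errors:
                raise  # propagate: the split is reported as failed
            errors.append(f"{name}: {e}")  # log the error
            corrupt.append(name)           # stands in for the move to .corrupt
    return {"edits": edits, "corrupt": corrupt, "errors": errors}


def good() -> List[str]:
    return ["e1", "e2"]


def bad() -> List[str]:
    raise OSError("bad checksum")


def truncated() -> List[str]:
    raise EOFError("partial record")


logs = {"wal.1": good, "wal.2": bad, "wal.3": truncated}
lenient = split_logs(logs, skip_errors=True)
```

With `skip_errors=True` the bad log is set aside and processing continues; with `skip_errors=False` the same input raises on `wal.2`, while a run containing only the truncated log still completes.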
 
 ===== Performance Improvements during Log Splitting
 
-WAL log splitting and recovery can be resource intensive and take a long time, 
depending on the number of RegionServers involved in the crash and the size of 
the regions. <<distributed.log.splitting,distributed.log.splitting>> and 
<<distributed.log.replay,distributed.log.replay>> were developed to improve 
performance during log splitting. 
+WAL log splitting and recovery can be resource intensive and take a long time, 
depending on the number of RegionServers involved in the crash and the size of 
the regions. <<distributed.log.splitting>> and <<distributed.log.replay>> were 
developed to improve performance during log splitting.
 
 [[distributed.log.splitting]]
 ====== Distributed Log Splitting
 
-[firstterm]_Distributed Log Splitting_ was added in HBase version 0.92 
(link:https://issues.apache.org/jira/browse/HBASE-1364[HBASE-1364])  by Prakash 
Khemani from Facebook.
+_Distributed Log Splitting_ was added in HBase version 0.92 
(link:https://issues.apache.org/jira/browse/HBASE-1364[HBASE-1364]) by Prakash 
Khemani from Facebook.
 It reduces the time to complete log splitting dramatically, improving the 
availability of regions and tables.
 For example, recovering a crashed cluster took around 9 hours with 
single-threaded log splitting, but only about six minutes with distributed log 
splitting.
 
-The information in this section is sourced from Jimmy Xiang's blog post at 
link:http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/.
+The information in this section is sourced from Jimmy Xiang's blog post at 
http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/.
 
 .Enabling or Disabling Distributed Log Splitting
 
 Distributed log processing is enabled by default since HBase 0.92.
-The setting is controlled by the +hbase.master.distributed.log.splitting+      
            property, which can be set to `true` or `false`, but defaults to 
`true`. 
+The setting is controlled by the `hbase.master.distributed.log.splitting` 
property, which can be set to `true` or `false`, but defaults to `true`.
 
 [[log.splitting.step.by.step]]
 .Distributed Log Splitting, Step by Step
 
 After configuring distributed log splitting, the HMaster controls the process.
 The HMaster enrolls each RegionServer in the log splitting process, and the 
actual work of splitting the logs is done by the RegionServers.
-The general process for log splitting, as described in 
<<log.splitting.step.by.step,log.splitting.step.by.step>> still applies here.
+The general process for log splitting, as described in 
<<log.splitting.step.by.step>>, still applies here.
 
-. If distributed log processing is enabled, the HMaster creates a 
[firstterm]_split log manager_ instance when the cluster is started.
+. If distributed log processing is enabled, the HMaster creates a _split log 
manager_ instance when the cluster is started.
   .. The split log manager manages all log files which need to be scanned and 
split.
   .. The split log manager places all the logs into the ZooKeeper splitlog 
node (_/hbase/splitlog_) as tasks.
-  .. You can view the contents of the splitlog by issuing the following 
+zkcli+ command. Example output is shown.
+  .. You can view the contents of the splitlog by issuing the following 
`zkCli` command. Example output is shown.
 +
 [source,bash]
 ----
 ls /hbase/splitlog
-[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
 
-hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
 
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900,
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931,
 
hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
 ----
 +
@@ -1018,10 +1035,10 @@ When decoded, it looks much more simple:
 ----
 [hdfs://host2.sample.com:56020/hbase/.logs
 /host8.sample.com,57020,1340474893275-splitting
-/host8.sample.com%3A57020.1340474893900, 
+/host8.sample.com%3A57020.1340474893900,
 hdfs://host2.sample.com:56020/hbase/.logs
 /host3.sample.com,57020,1340474893299-splitting
-/host3.sample.com%3A57020.1340474893931, 
+/host3.sample.com%3A57020.1340474893931,
 hdfs://host2.sample.com:56020/hbase/.logs
 /host4.sample.com,57020,1340474893287-splitting
 /host4.sample.com%3A57020.1340474893946]
@@ -1047,12 +1064,12 @@ The split log manager is responsible for the following 
ongoing tasks:
 * The split log manager watches the HBase split log znodes constantly.
   If any split log task node data is changed, the split log manager retrieves 
the node data.
   The node data contains the current state of the task.
-  You can use the +zkcli+ +get+ command to retrieve the current state of a 
task.
+  You can use the `zkCli` `get` command to retrieve the current state of a 
task.
   In the example output below, the first line of the output shows that the 
task is currently unassigned.
 +
 ----
 get 
/hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
- 
+
 unassigned host2.sample.com:57000
 cZxid = 0Ã7115
 ctime = Sat Jun 23 11:13:40 PDT 2012
@@ -1063,43 +1080,46 @@ Based on the state of the task whose data is changed, 
the split log manager does
 +
 * Resubmit the task if it is unassigned
 * Heartbeat the task if it is assigned
-* Resubmit or fail the task if it is resigned (see 
<<distributed.log.replay.failure.reasons,distributed.log.replay.failure.reasons>>)
-* Resubmit or fail the task if it is completed with errors (see 
<<distributed.log.replay.failure.reasons,distributed.log.replay.failure.reasons>>)
-* Resubmit or fail the task if it could not complete due to errors (see 
<<distributed.log.replay.failure.reasons,distributed.log.replay.failure.reasons>>)
+* Resubmit or fail the task if it is resigned (see 
<<distributed.log.replay.failure.reasons>>)
+* Resubmit or fail the task if it is completed with errors (see 
<<distributed.log.replay.failure.reasons>>)
+* Resubmit or fail the task if it could not complete due to errors (see 
<<distributed.log.replay.failure.reasons>>)
 * Delete the task if it is successfully completed or failed
 +
-* .Reasons a Task Will FailThe task has been deleted.
+[[distributed.log.replay.failure.reasons]]
+[NOTE]
+.Reasons a Task Will Fail
+====
+* The task has been deleted.
 * The node no longer exists.
-* The log status manager failed to move the state of the task to 
TASK_UNASSIGNED.
+* The log status manager failed to move the state of the task to 
`TASK_UNASSIGNED`.
 * The number of resubmits is over the resubmit threshold.
-
+====
 
 . Each RegionServer's split log worker performs the log-splitting tasks.
 +
-Each RegionServer runs a daemon thread called the [firstterm]_split log
-                      worker_, which does the work to split the logs.
+Each RegionServer runs a daemon thread called the _split log worker_, which 
does the work to split the logs.
 The daemon thread starts when the RegionServer starts, and registers itself to 
watch HBase znodes.
 If any splitlog znode children change, it notifies a sleeping worker thread to 
wake up and grab more tasks.
 If a worker's current task's node data is changed, the worker checks to see 
if the task has been taken by another worker.
 If so, the worker thread stops work on the current task.
 +
 The worker monitors the splitlog znode constantly.
-When a new task appears, the split log worker retrieves  the task paths and 
checks each one until it finds an unclaimed task, which it attempts to claim.
-If the claim was successful, it attempts to perform the task and updates the 
task's +state+ property based on the splitting outcome.
+When a new task appears, the split log worker retrieves the task paths and 
checks each one until it finds an unclaimed task, which it attempts to claim.
+If the claim was successful, it attempts to perform the task and updates the 
task's `state` property based on the splitting outcome.
 At this point, the split log worker scans for another unclaimed task.
 +
-* .How the Split Log Worker Approaches a TaskIt queries the task state and 
only takes action if the task is in `TASK_UNASSIGNED `state.
+.How the Split Log Worker Approaches a Task
+* It queries the task state and only takes action if the task is in 
`TASK_UNASSIGNED` state.
 * If the task is in `TASK_UNASSIGNED` state, the worker attempts to set the 
state to `TASK_OWNED` by itself.
   If it fails to set the state, another worker will try to grab it.
   The split log manager will also ask all workers to rescan later if the task 
remains unassigned.
 * If the worker succeeds in taking ownership of the task, it tries to get the 
task state again to make sure it really gets it asynchronously.
-  In the meantime, it starts a split task executor to do the actual work: 
-+
-* Get the HBase root folder, create a temp folder under the root, and split 
the log file to the temp folder.
-* If the split was successful, the task executor sets the task to state 
`TASK_DONE`.
-* If the worker catches an unexpected IOException, the task is set to state 
`TASK_ERR`.
-* If the worker is shutting down, set the the task to state `TASK_RESIGNED`.
-* If the task is taken by another worker, just log it.
+  In the meantime, it starts a split task executor to do the actual work:
+** Get the HBase root folder, create a temp folder under the root, and split 
the log file to the temp folder.
+** If the split was successful, the task executor sets the task to state 
`TASK_DONE`.
+** If the worker catches an unexpected IOException, the task is set to state 
`TASK_ERR`.
+** If the worker is shutting down, set the task to state `TASK_RESIGNED`.
+** If the task is taken by another worker, just log it.
 
 
 . The split log manager monitors for uncompleted tasks.
@@ -1114,11 +1134,11 @@ If none are found, it throws an exception so that the 
log splitting can be retri
 [[distributed.log.replay]]
 ====== Distributed Log Replay
 
-After a RegionServer fails, its failed region is assigned to another 
RegionServer, which is marked as "recovering" in ZooKeeper.
-A split log worker directly replays edits from the WAL of the failed region 
server to the region at its new location.
-When a region is in "recovering" state, it can accept writes but no reads 
(including Append and Increment), region splits or merges. 
+After a RegionServer fails, its failed regions are assigned to another 
RegionServer and are marked as "recovering" in ZooKeeper.
+A split log worker directly replays edits from the WAL of the failed 
RegionServer to the regions at their new location.
+When a region is in "recovering" state, it can accept writes but no reads 
(including Append and Increment), region splits or merges.
 
-Distributed Log Replay extends the 
<<distributed.log.splitting,distributed.log.splitting>> framework.
+Distributed Log Replay extends the <<distributed.log.splitting>> framework.
 It works by directly replaying WAL edits to another RegionServer instead of 
creating _recovered.edits_ files.
 It provides the following advantages over distributed log splitting alone:
 
@@ -1129,7 +1149,7 @@ It provides the following advantages over distributed log 
splitting alone:
   It only takes seconds for a recovering region to accept writes again.
 
 .Enabling Distributed Log Replay
-To enable distributed log replay, set `hbase.master.distributed.log.replay` to 
true.
+To enable distributed log replay, set `hbase.master.distributed.log.replay` to 
`true`.
 This will be the default for HBase 0.99 
(link:https://issues.apache.org/jira/browse/HBASE-10888[HBASE-10888]).
 
 You must also enable HFile version 3 (which is the default HFile format 
starting in HBase 0.99.
@@ -1138,7 +1158,7 @@ See 
link:https://issues.apache.org/jira/browse/HBASE-10855[HBASE-10855]). Distri
 [[wal.disable]]
 ==== Disabling the WAL
 
-It is possible to disable the WAL, to improve performace in certain specific 
situations.
+It is possible to disable the WAL, to improve performance in certain specific 
situations.
 However, disabling the WAL puts your data at risk.
 The only situation where this is recommended is during a bulk load.
 This is because, in the event of a problem, the bulk load can be re-run with 
no risk of data loss.
@@ -1153,18 +1173,18 @@ WARNING: If you disable the WAL for anything other than 
bulk loads, your data is
 == Regions
 
 Regions are the basic element of availability and distribution for tables, and 
are comprised of a Store per Column Family.
-The heirarchy of objects is as follows: 
+The hierarchy of objects is as follows:
 
 ----
-Table       (HBase table)
-    Region       (Regions for the table)
-         Store          (Store per ColumnFamily for each Region for the table)
-              MemStore           (MemStore for each Store for each Region for 
the table)
-              StoreFile          (StoreFiles for each Store for each Region 
for the table)
-                    Block             (Blocks within a StoreFile within a 
Store for each Region for the table)
-----     
+Table                    (HBase table)
+    Region               (Regions for the table)
+        Store            (Store per ColumnFamily for each Region for the table)
+            MemStore     (MemStore for each Store for each Region for the 
table)
+            StoreFile    (StoreFiles for each Store for each Region for the 
table)
+                Block    (Blocks within a StoreFile within a Store for each 
Region for the table)
+----
 
-For a description of what HBase files look like when written to HDFS, see 
<<trouble.namenode.hbase.objects,trouble.namenode.hbase.objects>>. 
+For a description of what HBase files look like when written to HDFS, see 
<<trouble.namenode.hba


<TRUNCATED>
