http://git-wip-us.apache.org/repos/asf/hbase/blob/7139c90e/src/main/asciidoc/_chapters/performance.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/performance.adoc 
b/src/main/asciidoc/_chapters/performance.adoc
index 11a0f5e..2155d52 100644
--- a/src/main/asciidoc/_chapters/performance.adoc
+++ b/src/main/asciidoc/_chapters/performance.adoc
@@ -46,24 +46,24 @@ Use a 64-bit platform (and 64-bit JVM).
 === Swapping
 
 Watch out for swapping.
-Set swappiness to 0.
+Set `swappiness` to 0.
 
 [[perf.network]]
 == Network
 
-Perhaps the most important factor in avoiding network issues degrading Hadoop 
and HBase performance is the switching hardware that is used, decisions made 
early in the scope of the project can cause major problems when you double or 
triple the size of your cluster (or more). 
+Perhaps the most important factor in avoiding network issues degrading Hadoop 
and HBase performance is the switching hardware that is used; decisions made 
early in the scope of the project can cause major problems when you double or 
triple the size of your cluster (or more).
 
-Important items to consider: 
+Important items to consider:
 
 * Switching capacity of the device
 * Number of systems connected
-* Uplink capacity    
+* Uplink capacity
 
 [[perf.network.1switch]]
 === Single Switch
 
 The single most important factor in this configuration is that the switching 
capacity of the hardware is capable of handling the traffic which can be 
generated by all systems connected to the switch.
-Some lower priced commodity hardware can have a slower switching capacity than 
could be utilized by a full switch. 
+Some lower priced commodity hardware can have a slower switching capacity than 
could be utilized by a full switch.
 
 [[perf.network.2switch]]
 === Multiple Switches
@@ -71,9 +71,9 @@ Some lower priced commodity hardware can have a slower 
switching capacity than c
 Multiple switches are a potential pitfall in the architecture.
 The most common configuration of lower priced hardware is a simple 1Gbps 
uplink from one switch to another.
 This often overlooked pinch point can easily become a bottleneck for cluster 
communication.
-Especially with MapReduce jobs that are both reading and writing a lot of data 
the communication across this uplink could be saturated. 
+Especially with MapReduce jobs that are both reading and writing a lot of data 
the communication across this uplink could be saturated.
 
-Mitigation of this issue is fairly simple and can be accomplished in multiple 
ways: 
+Mitigation of this issue is fairly simple and can be accomplished in multiple 
ways:
 
 * Use appropriate hardware for the scale of the cluster which you're 
attempting to build.
 * Use larger single switch configurations i.e.
@@ -83,7 +83,7 @@ Mitigation of this issue is fairly simple and can be 
accomplished in multiple wa
 [[perf.network.multirack]]
 === Multiple Racks
 
-Multiple rack configurations carry the same potential issues as multiple 
switches, and can suffer performance degradation from two main areas: 
+Multiple rack configurations carry the same potential issues as multiple 
switches, and can suffer performance degradation from two main areas:
 
 * Poor switch capacity performance
 * Insufficient uplink to another rack
@@ -91,14 +91,25 @@ Multiple rack configurations carry the same potential 
issues as multiple switche
 If the switches in your rack have appropriate switching capacity to handle 
all the hosts at full speed, the next most likely issue will be caused by 
homing more of your cluster across racks.
 The easiest way to avoid issues when spanning multiple racks is to use port 
trunking to create a bonded uplink to other racks.
The downside of this method, however, is in the overhead of ports that could 
potentially be used.
-An example of this is, creating an 8Gbps port channel from rack A to rack B, 
using 8 of your 24 ports to communicate between racks gives you a poor ROI, 
using too few however can mean you're not getting the most out of your cluster. 
+An example of this is creating an 8Gbps port channel from rack A to rack B: 
using 8 of your 24 ports to communicate between racks gives you a poor ROI; 
using too few, however, can mean you're not getting the most out of your cluster.
 
-Using 10Gbe links between racks will greatly increase performance, and 
assuming your switches support a 10Gbe uplink or allow for an expansion card 
will allow you to save your ports for machines as opposed to uplinks. 
+Using 10GbE links between racks will greatly increase performance, and, 
assuming your switches support a 10GbE uplink or allow for an expansion card, 
will allow you to save your ports for machines as opposed to uplinks.
 
 [[perf.network.ints]]
 === Network Interfaces
 
-Are all the network interfaces functioning correctly? Are you sure? See the 
Troubleshooting Case Study in <<casestudies.slownode,casestudies.slownode>>. 
+Are all the network interfaces functioning correctly? Are you sure? See the 
Troubleshooting Case Study in <<casestudies.slownode>>.
+
+[[perf.network.call_me_maybe]]
+=== Network Consistency and Partition Tolerance
+The link:http://en.wikipedia.org/wiki/CAP_theorem[CAP Theorem] states that a 
distributed system can maintain at most two of the following three characteristics:
+- *C*onsistency -- all nodes see the same data. 
+- *A*vailability -- every request receives a response about whether it 
succeeded or failed.
+- *P*artition tolerance -- the system continues to operate even if some of its 
components become unavailable to the others.
+
+HBase favors consistency and partition tolerance when a decision has to be 
made. Coda Hale explains why partition tolerance is so important in 
http://codahale.com/you-cant-sacrifice-partition-tolerance/.
+
+Robert Yokota used an automated testing framework called 
link:https://aphyr.com/tags/jepsen[Jepsen] to test HBase's partition tolerance 
in the face of network partitions, using techniques modeled after Aphyr's 
link:https://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions[Call
 Me Maybe] series. The results, available as a 
link:http://eng.yammer.com/call-me-maybe-hbase/[blog post] and an 
link:http://eng.yammer.com/call-me-maybe-hbase-addendum/[addendum], show that 
HBase performs correctly.
 
 [[jvm]]
 == Java
@@ -109,35 +120,33 @@ Are all the network interfaces functioning correctly? Are 
you sure? See the Trou
 [[gcpause]]
 ==== Long GC pauses
 
-In his presentation, 
link:http://www.slideshare.net/cloudera/hbase-hug-presentation[Avoiding Full GCs
-            with MemStore-Local Allocation Buffers], Todd Lipcon describes two 
cases of stop-the-world garbage collections common in HBase, especially during 
loading; CMS failure modes and old generation heap fragmentation brought.
+In his presentation, 
link:http://www.slideshare.net/cloudera/hbase-hug-presentation[Avoiding Full 
GCs with MemStore-Local Allocation Buffers], Todd Lipcon describes two cases of 
stop-the-world garbage collections common in HBase, especially during loading: 
CMS failure modes, and old generation heap fragmentation brought on by the CMS collector.
+
 To address the first, start the CMS earlier than default by adding 
`-XX:CMSInitiatingOccupancyFraction` and setting it down from defaults.
-Start at 60 or 70 percent (The lower you bring down the threshold, the more 
GCing is done, the more CPU used). To address the second fragmentation issue, 
Todd added an experimental facility, 
-(((MSLAB))), that must be explicitly enabled in Apache HBase 0.90.x (Its 
defaulted to be on in Apache 0.92.x HBase). See 
`hbase.hregion.memstore.mslab.enabled` to true in your `Configuration`.
+Start at 60 or 70 percent (the lower you bring down the threshold, the more 
GCing is done and the more CPU is used). To address the second fragmentation issue, 
Todd added an experimental facility,
+(MSLAB), that must be explicitly enabled in Apache HBase 0.90.x (it is on by 
default in Apache HBase 0.92.x). Set 
`hbase.hregion.memstore.mslab.enabled` to true in your `Configuration`.
 See the cited slides for background and detail.
-The latest jvms do better regards fragmentation so make sure you are running a 
recent release.
-Read down in the message, 
link:http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html[Identifying
-            concurrent mode failures caused by fragmentation].
+The latest JVMs do better with regard to fragmentation, so make sure you are 
running a recent release.
+Read down in the message, 
link:http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html[Identifying 
concurrent mode failures caused by fragmentation].
 Be aware that when enabled, each MemStore instance will occupy at least an 
MSLAB instance of memory.
 If you have thousands of regions or lots of regions each with many column 
families, this allocation of MSLAB may be responsible for a good portion of 
your heap allocation and in an extreme case cause you to OOME.
-Disable MSLAB in this case, or lower the amount of memory it uses or float 
less regions per server. 
+Disable MSLAB in this case, lower the amount of memory it uses, or float 
fewer regions per server.
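
If you manage these settings from code rather than _hbase-site.xml_, a minimal 
sketch using the standard `HBaseConfiguration` API looks roughly like this (the 
chunk size shown is only illustrative):

[source,java]
----
// A sketch only: these keys are normally set in hbase-site.xml; the chunk size
// below is an illustrative value, not a recommendation.
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);           // on by default since 0.92
conf.setInt("hbase.hregion.memstore.mslab.chunksize", 2 * 1024 * 1024);  // 2MB chunks
----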
 
-If you have a write-heavy workload, check out 
link:https://issues.apache.org/jira/browse/HBASE-8163[HBASE-8163
-            MemStoreChunkPool: An improvement for JAVA GC when using MSLAB].
+If you have a write-heavy workload, check out 
link:https://issues.apache.org/jira/browse/HBASE-8163[HBASE-8163 
MemStoreChunkPool: An improvement for JAVA GC when using MSLAB].
 It describes configurations to lower the amount of young GC during write-heavy 
loadings.
 If you do not have HBASE-8163 installed, and you are trying to improve your 
young GC times, one trick to consider -- courtesy of our Liang Xie -- is to set 
the GC config `-XX:PretenureSizeThreshold` in _hbase-env.sh_ to be just smaller 
than the size of `hbase.hregion.memstore.mslab.chunksize` so MSLAB allocations 
happen in the tenured space directly rather than first in the young gen.
-You'd do this because these MSLAB allocations are going to likely make it to 
the old gen anyways and rather than pay the price of a copies between s0 and s1 
in eden space followed by the copy up from young to old gen after the MSLABs 
have achieved sufficient tenure, save a bit of YGC churn and allocate in the 
old gen directly. 
+You'd do this because these MSLAB allocations are likely going to make it to 
the old gen anyway, and rather than pay the price of copies between the s0 and s1 
survivor spaces followed by the copy up from young to old gen after the MSLABs 
have achieved sufficient tenure, you save a bit of YGC churn and allocate in the 
old gen directly.
 
-For more information about GC logs, see <<trouble.log.gc,trouble.log.gc>>. 
+For more information about GC logs, see <<trouble.log.gc>>.
 
-Consider also enabling the offheap Block Cache.
+Consider also enabling the off-heap Block Cache.
 This has been shown to mitigate GC pause times.
-See <<block.cache,block.cache>>
+See <<block.cache>>.
 
 [[perf.configurations]]
 == HBase Configurations
 
-See <<recommended_configurations,recommended configurations>>.
+See <<recommended_configurations>>.
 
 [[perf.compactions.and.splits]]
 === Managing Compactions
@@ -147,22 +156,22 @@ For larger systems, managing link:[compactions and 
splits] may be something you
 [[perf.handlers]]
 === `hbase.regionserver.handler.count`
 
-See <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>>. 
+See <<hbase.regionserver.handler.count>>.
 
 [[perf.hfile.block.cache.size]]
 === `hfile.block.cache.size`
 
-See <<hfile.block.cache.size,hfile.block.cache.size>>.
-A memory setting for the RegionServer process. 
+See <<hfile.block.cache.size>>.
+A memory setting for the RegionServer process.
 
 [[blockcache.prefetch]]
 === Prefetch Option for Blockcache
 
-link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857]        adds 
a new option to prefetch HFile contents when opening the blockcache, if a 
columnfamily or regionserver property is set.
+link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857] adds a new 
option to prefetch HFile contents when opening the BlockCache, if a column 
family or RegionServer property is set.
 This option is available for HBase 0.98.3 and later.
-The purpose is to warm the blockcache as rapidly as possible after the cache 
is opened, using in-memory table data, and not counting the prefetching as 
cache misses.
-This is great for fast reads, but is not a good idea if the data to be 
preloaded will not fit into the blockcache.
-It is useful for tuning the IO impact of prefetching versus the time before 
all data blocks are in cache. 
+The purpose is to warm the BlockCache as rapidly as possible after the cache 
is opened, using in-memory table data, and not counting the prefetching as 
cache misses.
+This is great for fast reads, but is not a good idea if the data to be 
preloaded will not fit into the BlockCache.
+It is useful for tuning the IO impact of prefetching versus the time before 
all data blocks are in cache.
 
 To enable prefetching on a given column family, you can use HBase Shell or use 
the API.
 
@@ -192,73 +201,73 @@ See the API documentation for 
link:https://hbase.apache.org/apidocs/org/apache/h
 [[perf.rs.memstore.size]]
 === `hbase.regionserver.global.memstore.size`
 
-See 
<<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>.
-This memory setting is often adjusted for the RegionServer process depending 
on needs. 
+See <<hbase.regionserver.global.memstore.size>>.
+This memory setting is often adjusted for the RegionServer process depending 
on needs.
 
 [[perf.rs.memstore.size.lower.limit]]
 === `hbase.regionserver.global.memstore.size.lower.limit`
 
-See 
<<hbase.regionserver.global.memstore.size.lower.limit,hbase.regionserver.global.memstore.size.lower.limit>>.
-This memory setting is often adjusted for the RegionServer process depending 
on needs. 
+See <<hbase.regionserver.global.memstore.size.lower.limit>>.
+This memory setting is often adjusted for the RegionServer process depending 
on needs.
 
 [[perf.hstore.blockingstorefiles]]
 === `hbase.hstore.blockingStoreFiles`
 
-See <<hbase.hstore.blockingstorefiles,hbase.hstore.blockingStoreFiles>>.
-If there is blocking in the RegionServer logs, increasing this can help. 
+See <<hbase.hstore.blockingstorefiles>>.
+If there is blocking in the RegionServer logs, increasing this can help.
 
 [[perf.hregion.memstore.block.multiplier]]
 === `hbase.hregion.memstore.block.multiplier`
 
-See 
<<hbase.hregion.memstore.block.multiplier,hbase.hregion.memstore.block.multiplier>>.
-If there is enough RAM, increasing this can help. 
+See <<hbase.hregion.memstore.block.multiplier>>.
+If there is enough RAM, increasing this can help.
 
 [[hbase.regionserver.checksum.verify.performance]]
 === `hbase.regionserver.checksum.verify`
 
 Have HBase write the checksum into the datablock and save having to do the 
checksum seek whenever you read.
 
-See <<hbase.regionserver.checksum.verify,hbase.regionserver.checksum.verify>>, 
<<hbase.hstore.bytes.per.checksum,hbase.hstore.bytes.per.checksum>> and 
<<hbase.hstore.checksum.algorithm,hbase.hstore.checksum.algorithm>>        For 
more information see the release note on 
link:https://issues.apache.org/jira/browse/HBASE-5074[HBASE-5074 support 
checksums in HBase block cache]. 
+See <<hbase.regionserver.checksum.verify>>, 
<<hbase.hstore.bytes.per.checksum>> and <<hbase.hstore.checksum.algorithm>>. 
For more information see the release note on 
link:https://issues.apache.org/jira/browse/HBASE-5074[HBASE-5074 support 
checksums in HBase block cache].
 
 === Tuning `callQueue` Options
 
-link:https://issues.apache.org/jira/browse/HBASE-11355[HBASE-11355]        
introduces several callQueue tuning mechanisms which can increase performance.
+link:https://issues.apache.org/jira/browse/HBASE-11355[HBASE-11355] introduces 
several callQueue tuning mechanisms which can increase performance.
 See the JIRA for some benchmarking information.
 
-* To increase the number of callqueues, set +hbase.ipc.server.num.callqueue+ 
to a value greater than `1`.
-* To split the callqueue into separate read and write queues, set 
`hbase.ipc.server.callqueue.read.ratio` to a value between `0` and `1`.
-  This factor weights the queues toward writes (if below .5) or reads (if 
above .5). Another way to say this is that the factor determines what 
percentage of the split queues are used for reads.
-  The following examples illustrate some of the possibilities.
-  Note that you always have at least one write queue, no matter what setting 
you use.
-+
+To increase the number of callqueues, set `hbase.ipc.server.num.callqueue` to 
a value greater than `1`.
+To split the callqueue into separate read and write queues, set 
`hbase.ipc.server.callqueue.read.ratio` to a value between `0` and `1`.
+This factor weights the queues toward writes (if below .5) or reads (if above 
.5). Another way to say this is that the factor determines what percentage of 
the split queues are used for reads.
+The following examples illustrate some of the possibilities.
+Note that you always have at least one write queue, no matter what setting you 
use.
+
 * The default value of `0` does not split the queue.
 * A value of `.3` uses 30% of the queues for reading and 70% for writing.
-  Given a value of `10` for +hbase.ipc.server.num.callqueue+, 3 queues would 
be used for reads and 7 for writes.
+  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 3 queues would 
be used for reads and 7 for writes.
 * A value of `.5` uses the same number of read queues and write queues.
-  Given a value of `10` for +hbase.ipc.server.num.callqueue+, 5 queues would 
be used for reads and 5 for writes.
+  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 5 queues would 
be used for reads and 5 for writes.
 * A value of `.6` uses 60% of the queues for reading and 40% for writing.
-  Given a value of `10` for +hbase.ipc.server.num.callqueue+, 7 queues would 
be used for reads and 3 for writes.
+  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 7 queues would 
be used for reads and 3 for writes.
 * A value of `1.0` uses one queue to process write requests, and all other 
queues process read requests.
-  A value higher than `1.0`                has the same effect as a value of 
`1.0`.
-  Given a value of `10` for +hbase.ipc.server.num.callqueue+, 9 queues would 
be used for reads and 1 for writes.
+  A value higher than `1.0` has the same effect as a value of `1.0`.
+  Given a value of `10` for `hbase.ipc.server.num.callqueue`, 9 queues would 
be used for reads and 1 for writes.
+
+You can also split the read queues so that separate queues are used for short 
reads (from Get operations) and long reads (from Scan operations), by setting 
the `hbase.ipc.server.callqueue.scan.ratio` option.
+This option is a factor between 0 and 1, which determines the ratio of read 
queues used for Gets and Scans.
+More queues are used for Gets if the value is below `.5` and more are used for 
scans if the value is above `.5`.
+No matter what setting you use, at least one read queue is used for Get 
operations.
 
-* You can also split the read queues so that separate queues are used for 
short reads (from Get operations) and long reads (from Scan operations), by 
setting the +hbase.ipc.server.callqueue.scan.ratio+ option.
-  This option is a factor between 0 and 1, which determine the ratio of read 
queues used for Gets and Scans.
-  More queues are used for Gets if the value is below `.5` and more are used 
for scans if the value is above `.5`.
-  No matter what setting you use, at least one read queue is used for Get 
operations.
-+
 * A value of `0` does not split the read queue.
 * A value of `.3` uses 70% of the read queues for Gets and 30% for Scans.
-  Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of 
`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 7 would be used for Gets and 3 for Scans.
+  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of 
`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 7 would be used for Gets and 3 for Scans.
 * A value of `.5` uses half the read queues for Gets and half for Scans.
-  Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of 
`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 5 would be used for Gets and 5 for Scans.
+  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of 
`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 5 would be used for Gets and 5 for Scans.
 * A value of `.6` uses 30% of the read queues for Gets and 70% for Scans.
-  Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value of 
`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 3 would be used for Gets and 7 for Scans.
+  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value of 
`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 3 would be used for Gets and 7 for Scans.
 * A value of `1.0` uses all but one of the read queues for Scans.
-  Given a value of `20` for +hbase.ipc.server.num.callqueue+ and a value 
of`.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 1 would be used for Gets and 9 for Scans.
+  Given a value of `20` for `hbase.ipc.server.num.callqueue` and a value 
of `.5` for `hbase.ipc.server.callqueue.read.ratio`, 10 queues would be used for 
reads, out of those 10, 1 would be used for Gets and 9 for Scans.
+
+You can use the new option `hbase.ipc.server.callqueue.handler.factor` to 
programmatically tune the number of queues:
 
-* You can use the new option `hbase.ipc.server.callqueue.handler.factor` to 
programmatically tune the number of queues:
-+
 * A value of `0` uses a single shared queue between all the handlers.
 * A value of `1` uses a separate queue for each handler.
 * A value between `0` and `1` tunes the number of queues against the number of 
handlers.
@@ -268,13 +277,13 @@ Having more queues, such as in a situation where you have 
one queue per handler,
 The trade-off is that if you have some queues with long-running tasks, a 
handler may end up waiting to execute from that queue rather than processing 
another queue which has waiting tasks.
 
 
-For these values to take effect on a given Region Server, the Region Server 
must be restarted.
+For these values to take effect on a given RegionServer, the RegionServer must 
be restarted.
 These parameters are intended for testing purposes and should be used 
carefully.
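
For instance, a minimal sketch of these keys via the `Configuration` API (the 
values shown are illustrative only; in practice they go in the RegionServer's 
_hbase-site.xml_):

[source,java]
----
// Illustrative values only; in practice these keys go in the RegionServer's
// hbase-site.xml and need a RegionServer restart to take effect.
Configuration conf = HBaseConfiguration.create();
conf.setInt("hbase.ipc.server.num.callqueue", 10);             // ten call queues in total
conf.setFloat("hbase.ipc.server.callqueue.read.ratio", 0.6f);  // weight the queues toward reads
conf.setFloat("hbase.ipc.server.callqueue.scan.ratio", 0.3f);  // of the read queues, favor Gets
----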
 
 [[perf.zookeeper]]
 == ZooKeeper
 
-See <<zookeeper,zookeeper>> for information on configuring ZooKeeper, and see 
the part about having a dedicated disk. 
+See <<zookeeper>> for information on configuring ZooKeeper, and see the part 
about having a dedicated disk.
 
 [[perf.schema]]
 == Schema Design
@@ -282,20 +291,20 @@ See <<zookeeper,zookeeper>> for information on 
configuring ZooKeeper, and see th
 [[perf.number.of.cfs]]
 === Number of Column Families
 
-See <<number.of.cfs,number.of.cfs>>.
+See <<number.of.cfs>>.
 
 [[perf.schema.keys]]
 === Key and Attribute Lengths
 
-See <<keysize,keysize>>.
-See also <<perf.compression.however,perf.compression.however>> for compression 
caveats.
+See <<keysize>>.
+See also <<perf.compression.however>> for compression caveats.
 
 [[schema.regionsize]]
 === Table RegionSize
 
-The regionsize can be set on a per-table basis via `setFileSize` on 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor]
        in the event where certain tables require different regionsizes than 
the configured default regionsize. 
+The regionsize can be set on a per-table basis via `setMaxFileSize` on 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor]
 in the event where certain tables require different regionsizes than the 
configured default regionsize.
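
For example, a sketch under the assumption of an open `Admin` handle named 
`admin` (the 20GB figure is only an illustration):

[source,java]
----
// Sketch: give one table larger regions than the cluster default
// (assumes an open Admin handle named 'admin'; 20GB is only an illustration).
TableName tn = TableName.valueOf("mytable");
HTableDescriptor desc = admin.getTableDescriptor(tn);
desc.setMaxFileSize(20L * 1024 * 1024 * 1024);  // split this table's regions at ~20GB
admin.modifyTable(tn, desc);
----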
 
-See <<ops.capacity.regions,ops.capacity.regions>> for more information. 
+See <<ops.capacity.regions>> for more information.
 
 [[schema.bloom]]
 === Bloom Filters
@@ -303,13 +312,13 @@ See <<ops.capacity.regions,ops.capacity.regions>> for 
more information.
 A Bloom filter, named for its creator, Burton Howard Bloom, is a data 
structure which is designed to predict whether a given element is a member of a 
set of data.
 A positive result from a Bloom filter is not always accurate, but a negative 
result is guaranteed to be accurate.
 Bloom filters are designed to be "accurate enough" for sets of data which are 
so large that conventional hashing mechanisms would be impractical.
-For more information about Bloom filters in general, refer to 
link:http://en.wikipedia.org/wiki/Bloom_filter.
+For more information about Bloom filters in general, refer to 
http://en.wikipedia.org/wiki/Bloom_filter.
 
 In terms of HBase, Bloom filters provide a lightweight in-memory structure to 
reduce the number of disk reads for a given Get operation (Bloom filters do not 
work with Scans) to only the StoreFiles likely to contain the desired Row.
-The potential performance gain increases with the number of parallel reads. 
+The potential performance gain increases with the number of parallel reads.
 
 The Bloom filters themselves are stored in the metadata of each HFile and 
never need to be updated.
-When an HFile is opened because a region is deployed to a RegionServer, the 
Bloom filter is loaded into memory. 
+When an HFile is opened because a region is deployed to a RegionServer, the 
Bloom filter is loaded into memory.
 
 HBase includes some tuning mechanisms for folding the Bloom filter to reduce 
the size and keep the false positive rate within a desired range.
 
@@ -317,8 +326,7 @@ Bloom filters were introduced in 
link:https://issues.apache.org/jira/browse/HBAS
 Since HBase 0.96, row-based Bloom filters are enabled by default.
 (link:https://issues.apache.org/jira/browse/HBASE-8450[HBASE-8450])
 
-For more information on Bloom filters in relation to HBase, see 
<<blooms,blooms>> for more information, or the following Quora discussion: 
link:http://www.quora.com/How-are-bloom-filters-used-in-HBase[How are bloom
-          filters used in HBase?]. 
+For more information on Bloom filters in relation to HBase, see <<blooms>> for 
more information, or the following Quora discussion: 
link:http://www.quora.com/How-are-bloom-filters-used-in-HBase[How are bloom 
filters used in HBase?].
 
 [[bloom.filters.when]]
 ==== When To Use Bloom Filters
@@ -327,16 +335,16 @@ Since HBase 0.96, row-based Bloom filters are enabled by 
default.
 You may choose to disable them or to change some tables to use row+column 
Bloom filters, depending on the characteristics of your data and how it is 
loaded into HBase.
 
 To determine whether Bloom filters could have a positive impact, check the 
value of `blockCacheHitRatio` in the RegionServer metrics.
-If Bloom filters are enabled, the value of `blockCacheHitRatio` should 
increase, because the Bloom filter is filtering out blocks that are definitely 
not needed. 
+If Bloom filters are enabled, the value of `blockCacheHitRatio` should 
increase, because the Bloom filter is filtering out blocks that are definitely 
not needed.
 
 You can choose to enable Bloom filters for a row or for a row+column 
combination.
 If you generally scan entire rows, the row+column combination will not provide 
any benefit.
 A row-based Bloom filter can operate on a row+column Get, but not the other 
way around.
 However, if you have a large number of column-level Puts, such that a row may 
be present in every StoreFile, a row-based filter will always return a positive 
result and provide no benefit.
 Unless you have one column per row, row+column Bloom filters require more 
space, in order to store more keys.
-Bloom filters work best when the size of each data entry is at least a few 
kilobytes in size. 
+Bloom filters work best when each data entry is at least a few kilobytes in 
size.
 
-Overhead will be reduced when your data is stored in a few larger StoreFiles, 
to avoid extra disk IO during low-level scans to find a specific row. 
+Overhead will be reduced when your data is stored in a few larger StoreFiles, 
to avoid extra disk IO during low-level scans to find a specific row.
 
 Bloom filters need to be rebuilt upon deletion, so may not be appropriate in 
environments with a large number of deletions.
 
@@ -345,7 +353,7 @@ Bloom filters need to be rebuilt upon deletion, so may not 
be appropriate in env
 Bloom filters are enabled on a Column Family.
 You can do this by using the setBloomFilterType method of HColumnDescriptor or 
using the HBase API.
 Valid values are `NONE`, `ROW` (the default since HBase 0.96), or `ROWCOL`.
-See <<bloom.filters.when,bloom.filters.when>> for more information on `ROW` 
versus `ROWCOL`.
+See <<bloom.filters.when>> for more information on `ROW` versus `ROWCOL`.
 See also the API documentation for 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
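
Through the Java API, a minimal sketch looks roughly as follows (assuming an 
open `Admin` handle named `admin`); the HBase Shell equivalent follows below.

[source,java]
----
// Sketch via HColumnDescriptor (assumes an open Admin handle named 'admin').
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
HColumnDescriptor cf = new HColumnDescriptor("colfam1");
cf.setBloomFilterType(BloomType.ROWCOL);  // ROW is the default since HBase 0.96
desc.addFamily(cf);
admin.createTable(desc);
----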
 
 The following example creates a table and enables a ROWCOL Bloom filter on the 
`colfam1` column family.
@@ -357,7 +365,7 @@ hbase> create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 
'ROWCOL'}
 
 ==== Configuring Server-Wide Behavior of Bloom Filters
 
-You can configure the following settings in the _hbase-site.xml_. 
+You can configure the following settings in the _hbase-site.xml_.
 
 [cols="1,1,1", options="header"]
 |===
@@ -367,8 +375,7 @@ You can configure the following settings in the 
_hbase-site.xml_.
 
 | io.hfile.bloom.enabled
 | yes
-| Set to no to kill bloom filters server-wide if
-                    something goes wrong
+| Set to no to kill bloom filters server-wide if something goes wrong
 
 | io.hfile.bloom.error.rate
 | .01
@@ -383,18 +390,16 @@ You can configure the following settings in the 
_hbase-site.xml_.
 
 | io.storefile.bloom.max.keys
 | 128000000
-| For default (single-block) Bloom filters, this specifies the maximum
-                    number of keys.
+| For default (single-block) Bloom filters, this specifies the maximum number 
of keys.
 
 | io.storefile.delete.family.bloom.enabled
 | true
-| Master switch to enable Delete Family Bloom filters and store them in
-                  the StoreFile.
+| Master switch to enable Delete Family Bloom filters and store them in the 
StoreFile.
 
 | io.storefile.bloom.block.size
 | 65536
 | Target Bloom block size. Bloom filter blocks of approximately this size
-                    are interleaved with data blocks.
+                  are interleaved with data blocks.
 
 | hfile.block.bloom.cacheonwrite
 | false
@@ -404,35 +409,35 @@ You can configure the following settings in the 
_hbase-site.xml_.
 [[schema.cf.blocksize]]
 === ColumnFamily BlockSize
 
-The blocksize can be configured for each ColumnFamily in a table, and this 
defaults to 64k.
+The blocksize can be configured for each ColumnFamily in a table, and defaults 
to 64k.
 Larger cell values require larger blocksizes.
-There is an inverse relationship between blocksize and the resulting StoreFile 
indexes (i.e., if the blocksize is doubled then the resulting indexes should be 
roughly halved). 
+There is an inverse relationship between blocksize and the resulting StoreFile 
indexes (i.e., if the blocksize is doubled then the resulting indexes should be 
roughly halved).
 
-See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]
        and <<store,store>>for more information. 
+See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]
 and <<store>> for more information.
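
For example, a minimal sketch (the 128KB value is only illustrative):

[source,java]
----
// Sketch: a 128KB block size (illustrative) for a family holding large cells,
// applied like any other ColumnFamily attribute when creating or altering the table.
HColumnDescriptor cf = new HColumnDescriptor("colfam1");
cf.setBlocksize(128 * 1024);  // default is 64KB
----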
 
 [[cf.in.memory]]
 === In-Memory ColumnFamilies
 
 ColumnFamilies can optionally be defined as in-memory.
 Data is still persisted to disk, just like any other ColumnFamily.
-In-memory blocks have the highest priority in the <<block.cache,block.cache>>, 
but it is not a guarantee that the entire table will be in memory. 
+In-memory blocks have the highest priority in the <<block.cache>>, but it is 
not a guarantee that the entire table will be in memory.
 
-See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]
        for more information. 
+See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor]
 for more information.
 
 [[perf.compression]]
 === Compression
 
 Production systems should use compression with their ColumnFamily definitions.
-See <<compression,compression>> for more information. 
+See <<compression>> for more information.
 
 [[perf.compression.however]]
 ==== However...
 
 Compression deflates data _on disk_.
 When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring 
between RegionServer and Client) it's inflated.
-So while using ColumnFamily compression is a best practice, but it's not going 
to completely eliminate the impact of over-sized Keys, over-sized ColumnFamily 
names, or over-sized Column names. 
+So while using ColumnFamily compression is a best practice, it's not going 
to completely eliminate the impact of over-sized Keys, over-sized ColumnFamily 
names, or over-sized Column names.
 
-See <<keysize,keysize>> on for schema design tips, and <<keyvalue,keyvalue>> 
for more information on HBase stores data internally. 
+See <<keysize>> for schema design tips, and <<keyvalue>> for more 
information on how HBase stores data internally.
 
 [[perf.general]]
 == HBase General Patterns
@@ -444,9 +449,8 @@ When people get started with HBase they have a tendency to 
write code that looks
 
 [source,java]
 ----
-
 Get get = new Get(rowkey);
-Result r = htable.get(get);
+Result r = table.get(get);
 byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns 
current version of value
 ----
 
@@ -455,12 +459,11 @@ It's better to use constants for the byte-arrays, like 
this:
 
 [source,java]
 ----
-
 public static final byte[] CF = "cf".getBytes();
 public static final byte[] ATTR = "attr".getBytes();
 ...
 Get get = new Get(rowkey);
-Result r = htable.get(get);
+Result r = table.get(get);
 byte[] b = r.getValue(CF, ATTR);  // returns current version of value
 ----
 
@@ -471,61 +474,60 @@ byte[] b = r.getValue(CF, ATTR);  // returns current 
version of value
 === Batch Loading
 
 Use the bulk load tool if you can.
-See <<arch.bulk.load,arch.bulk.load>>.
-Otherwise, pay attention to the below. 
+See <<arch.bulk.load>>.
+Otherwise, pay attention to the below.
 
 [[precreate.regions]]
-===  Table Creation: Pre-Creating Regions 
+===  Table Creation: Pre-Creating Regions
 
 Tables in HBase are initially created with one region by default.
 For bulk imports, this means that all clients will write to the same region 
until it is large enough to split and become distributed across the cluster.
 A useful pattern to speed up the bulk import process is to pre-create empty 
regions.
-Be somewhat conservative in this, because too-many regions can actually 
degrade performance. 
+Be somewhat conservative in this, because too many regions can actually 
degrade performance.
 
 There are two different approaches to pre-creating splits.
-The first approach is to rely on the default `HBaseAdmin` strategy (which is 
implemented in `Bytes.split`)... 
+The first approach is to rely on the default `Admin` strategy (which is 
implemented in `Bytes.split`)...
 
 [source,java]
 ----
 
-byte[] startKey = ...;         // your lowest key
-byte[] endKey = ...;                   // your highest key
-int numberOfRegions = ...;     // # of regions to create
+byte[] startKey = ...;      // your lowest key
+byte[] endKey = ...;        // your highest key
+int numberOfRegions = ...;  // # of regions to create
 admin.createTable(table, startKey, endKey, numberOfRegions);
 ----
 
-And the other approach is to define the splits yourself... 
+And the other approach is to define the splits yourself...
 
 [source,java]
 ----
-
 byte[][] splits = ...;   // create your own splits
 admin.createTable(table, splits);
 ----
 
-See <<rowkey.regionsplits,rowkey.regionsplits>> for issues related to 
understanding your keyspace and pre-creating regions.
-See <<manual_region_splitting_decisions,manual region splitting decisions>>    
    for discussion on manually pre-splitting regions.
+See <<rowkey.regionsplits>> for issues related to understanding your keyspace 
and pre-creating regions.
+See <<manual_region_splitting_decisions,manual region splitting decisions>>  
for discussion on manually pre-splitting regions.
 
 [[def.log.flush]]
-===  Table Creation: Deferred Log Flush 
+===  Table Creation: Deferred Log Flush
 
 The default behavior for Puts using the Write Ahead Log (WAL) is that `WAL` 
edits will be written immediately.
 If deferred log flush is used, WAL edits are kept in memory until the flush 
period.
 The benefit is aggregated and asynchronous `WAL` writes, but the potential 
downside is that if the RegionServer goes down the yet-to-be-flushed edits are 
lost.
-This is safer, however, than not using WAL at all with Puts. 
+This is safer, however, than not using WAL at all with Puts.
 
 Deferred log flush can be configured on tables via 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor].
-The default value of `hbase.regionserver.optionallogflushinterval` is 1000ms. 
+The default value of `hbase.regionserver.optionallogflushinterval` is 1000ms.
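
A minimal sketch of the table-level setting, assuming an open `Admin` handle 
named `admin` (recent versions express this through the `Durability` enum; 
older releases used `setDeferredLogFlush`):

[source,java]
----
// Sketch: request asynchronous WAL flushes for one table
// (assumes an open Admin handle named 'admin').
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
desc.addFamily(new HColumnDescriptor("cf"));
desc.setDurability(Durability.ASYNC_WAL);  // WAL edits flushed on an interval, not per write
admin.createTable(desc);
----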
 
 [[perf.hbase.client.autoflush]]
 === HBase Client: AutoFlush
 
-When performing a lot of Puts, make sure that setAutoFlush is set to false on 
your 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html[HTable]
        instance.
+When performing a lot of Puts, make sure that `setAutoFlush` is set to `false` on 
your 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table]
 instance.
 Otherwise, the Puts will be sent one at a time to the RegionServer.
-Puts added via ` htable.add(Put)` and ` htable.add( <List> Put)` wind up in 
the same write buffer.
+Puts added via `table.put(Put)` and `table.put(List<Put>)` wind up in the 
same write buffer.
 If `autoFlush = false`, these messages are not sent until the write-buffer is 
filled.
-To explicitly flush the messages, call [method]+flushCommits+.
-Calling [method]+close+ on the `HTable` instance will invoke 
[method]+flushCommits+.
+To explicitly flush the messages, call `flushCommits`.
+Calling `close` on the `Table` instance will invoke `flushCommits`.
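
A minimal sketch of the pattern, assuming an existing `Configuration` named 
`conf` and a list of `Put`s named `puts`:

[source,java]
----
// Sketch against the classic HTable client the surrounding text assumes
// ('conf' and 'puts' are placeholders); newer clients batch via BufferedMutator.
HTable htable = new HTable(conf, "mytable");
htable.setAutoFlush(false);
for (Put put : puts) {
  htable.put(put);        // buffered client-side until the write buffer fills
}
htable.flushCommits();    // explicit flush; htable.close() also flushes
htable.close();
----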
 
 [[perf.hbase.client.putwal]]
 === HBase Client: Turn off WAL on Puts
@@ -536,47 +538,46 @@ Bulk loads can be re-run in the event of a crash, with 
little risk of data loss.
 
 WARNING: If you disable the WAL for anything other than bulk loads, your data 
is at risk.
 
-In general, it is best to use WAL for Puts, and where loading throughput is a 
concern to use link:[bulk loading] techniques instead.
+In general, it is best to use WAL for Puts, and where loading throughput is a 
concern, to use bulk loading techniques instead.
 For normal Puts, you are not likely to see a performance improvement which 
would outweigh the risk.
-To disable the WAL, see <<wal.disable,wal.disable>>.
+To disable the WAL, see <<wal.disable>>.
 
 [[perf.hbase.client.regiongroup]]
 === HBase Client: Group Puts by RegionServer
 
 In addition to using the writeBuffer, grouping `Put`s by RegionServer can 
reduce the number of client RPC calls per writeBuffer flush.
-There is a utility `HTableUtil` currently on TRUNK that does this, but you can 
either copy that or implement your own version for those still on 0.90.x or 
earlier. 
+There is a utility `HTableUtil` currently on TRUNK that does this, but you can 
either copy that or implement your own version for those still on 0.90.x or 
earlier.
 
 [[perf.hbase.write.mr.reducer]]
 === MapReduce: Skip The Reducer
 
 When writing a lot of data to an HBase table from a MR job (e.g., with 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]),
 and specifically where Puts are being emitted from the Mapper, skip the 
Reducer step.
 When a Reducer step is used, all of the output (Puts) from the Mapper will get 
spooled to disk, then sorted/shuffled to other Reducers that will most likely 
be off-node.
-It's far more efficient to just write directly to HBase. 
+It's far more efficient to just write directly to HBase.
 
-For summary jobs where HBase is used as a source and a sink, then writes will 
be coming from the Reducer step (e.g., summarize values then write out result). 
This is a different processing problem than from the the above case. 
+For summary jobs where HBase is used as a source and a sink, writes will 
come from the Reducer step (e.g., summarize values, then write out the result). 
This is a different processing problem than the above case.
 
 [[perf.one.region]]
 === Anti-Pattern: One Hot Region
 
-If all your data is being written to one region at a time, then re-read the 
section on processing link:[timeseries] data.
+If all your data is being written to one region at a time, then re-read the 
section on processing timeseries data.
 
-Also, if you are pre-splitting regions and all your data is _still_        
winding up in a single region even though your keys aren't monotonically 
increasing, confirm that your keyspace actually works with the split strategy.
+Also, if you are pre-splitting regions and all your data is _still_ winding up 
in a single region even though your keys aren't monotonically increasing, 
confirm that your keyspace actually works with the split strategy.
 There are a variety of reasons that regions may appear "well split" but won't 
work with your data.
-As the HBase client communicates directly with the RegionServers, this can be 
obtained via 
link:hhttp://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation(byte[])[HTable.getRegionLocation].
 
+As the HBase client communicates directly with the RegionServers, this can be 
obtained via 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#getRegionLocation(byte[])[Table.getRegionLocation].
 
-See <<precreate.regions,precreate.regions>>, as well as 
<<perf.configurations,perf.configurations>>      
+See <<precreate.regions>>, as well as <<perf.configurations>>.
 
 [[perf.reading]]
 == Reading from HBase
 
 The mailing list can help if you are having performance issues.
-For example, here is a good general thread on what to look at addressing 
read-time issues: link:http://search-hadoop.com/m/qOo2yyHtCC1[HBase Random Read 
latency >
-      100ms]
+For example, here is a good general thread on what to look at when addressing 
read-time issues: link:http://search-hadoop.com/m/qOo2yyHtCC1[HBase Random Read 
latency > 100ms]
 
 [[perf.hbase.client.caching]]
 === Scan Caching
 
-If HBase is used as an input source for a MapReduce job, for example, make 
sure that the input 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
        instance to the MapReduce job has [method]+setCaching+ set to something 
greater than the default (which is 1). Using the default value means that the 
map-task will make call back to the region-server for every record processed.
+If HBase is used as an input source for a MapReduce job, for example, make 
sure that the input 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
 instance to the MapReduce job has `setCaching` set to something greater than 
the default (which is 1). Using the default value means that the map-task will 
make a call back to the RegionServer for every record processed.
 Setting this value to 500, for example, will transfer 500 rows at a time to 
the client to be processed.
 There is a cost/benefit to have the cache value be large because it costs more 
in memory for both client and RegionServer, so bigger isn't always better.
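
For example, a sketch of a Scan prepared as a MapReduce source (`MyMapper` and 
`job` are placeholders):

[source,java]
----
// Sketch: a Scan prepared as a MapReduce source ('MyMapper' and 'job' are placeholders).
Scan scan = new Scan();
scan.setCaching(500);        // ship 500 rows per RPC instead of the default of 1
scan.setCacheBlocks(false);  // don't churn the block cache with a full-table scan
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
    Text.class, Result.class, job);
----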
 
@@ -585,18 +586,18 @@ There is a cost/benefit to have the cache value be large 
because it costs more i
 
 Scan settings in MapReduce jobs deserve special attention.
 Timeouts can result (e.g., UnknownScannerException) in Map tasks if it takes 
longer to process a batch of records before the client goes back to the 
RegionServer for the next set of data.
-This problem can occur because there is non-trivial processing occuring per 
row.
+This problem can occur because there is non-trivial processing occurring per 
row.
 If you process rows quickly, set caching higher.
-If you process rows more slowly (e.g., lots of transformations per row, 
writes), then set caching lower. 
+If you process rows more slowly (e.g., lots of transformations per row, 
writes), then set caching lower.
 
-Timeouts can also happen in a non-MapReduce use case (i.e., single threaded 
HBase client doing a Scan), but the processing that is often performed in 
MapReduce jobs tends to exacerbate this issue. 
+Timeouts can also happen in a non-MapReduce use case (i.e., single threaded 
HBase client doing a Scan), but the processing that is often performed in 
MapReduce jobs tends to exacerbate this issue.
 
 [[perf.hbase.client.selection]]
 === Scan Attribute Selection
 
 Whenever a Scan is used to process large numbers of rows (and especially when 
used as a MapReduce source), be aware of which attributes are selected.
-If `scan.addFamily`        is called then _all_ of the attributes in the 
specified ColumnFamily will be returned to the client.
-If only a small number of the available attributes are to be processed, then 
only those attributes should be specified in the input scan because attribute 
over-selection is a non-trivial performance penalty over large datasets. 
+If `scan.addFamily` is called then _all_ of the attributes in the specified 
ColumnFamily will be returned to the client.
+If only a small number of the available attributes are to be processed, then 
only those attributes should be specified in the input scan because attribute 
over-selection is a non-trivial performance penalty over large datasets.
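
For example (`cf`, `attr1`, and `attr2` are placeholders):

[source,java]
----
// Sketch: request only the columns you will process ('cf', 'attr1', 'attr2' are placeholders).
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"));
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr2"));
// scan.addFamily(Bytes.toBytes("cf"));  // by contrast, this returns every column in 'cf'
----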
 
 [[perf.hbase.client.seek]]
 === Avoid scan seeks
@@ -610,7 +611,6 @@ The following code instructs the RegionServer to attempt 
two iterations of next
 
 [source,java]
 ----
-
 Scan scan = new Scan();
 scan.addColumn(...);
 scan.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));
@@ -620,71 +620,68 @@ table.getScanner(scan);
 [[perf.hbase.mr.input]]
 === MapReduce - Input Splits
 
-For MapReduce jobs that use HBase tables as a source, if there a pattern where 
the "slow" map tasks seem to have the same Input Split (i.e., the RegionServer 
serving the data), see the Troubleshooting Case Study in 
<<casestudies.slownode,casestudies.slownode>>. 
+For MapReduce jobs that use HBase tables as a source, if there is a pattern where 
the "slow" map tasks seem to have the same Input Split (i.e., the RegionServer 
serving the data), see the Troubleshooting Case Study in 
<<casestudies.slownode>>.
 
 [[perf.hbase.client.scannerclose]]
 === Close ResultScanners
 
-This isn't so much about improving performance but rather _avoiding_        
performance problems.
-If you forget to close 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html[ResultScanners]
        you can cause problems on the RegionServers.
-Always have ResultScanner processing enclosed in try/catch blocks...
+This isn't so much about improving performance but rather _avoiding_ 
performance problems.
+If you forget to close 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html[ResultScanners]
 you can cause problems on the RegionServers.
+Always have ResultScanner processing enclosed in try/finally blocks.
 
 [source,java]
 ----
-
 Scan scan = new Scan();
 // set attrs...
-ResultScanner rs = htable.getScanner(scan);
+ResultScanner rs = table.getScanner(scan);
 try {
   for (Result r = rs.next(); r != null; r = rs.next()) {
     // process result...
   }
 } finally {
   rs.close();  // always close the ResultScanner!
 }
-htable.close();
+table.close();
 ----
 
 [[perf.hbase.client.blockcache]]
 === Block Cache
 
-link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
        instances can be set to use the block cache in the RegionServer via the 
[method]+setCacheBlocks+ method.
+link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan]
 instances can be set to use the block cache in the RegionServer via the 
`setCacheBlocks` method.
 For input Scans to MapReduce jobs, this should be `false`.
 For frequently accessed rows, it is advisable to use the block cache.
 
-Cache more data by moving your Block Cache offheap.
-See <<offheap.blockcache,offheap.blockcache>>
+Cache more data by moving your Block Cache off-heap.
+See <<offheap.blockcache>>.
 
 [[perf.hbase.client.rowkeyonly]]
 === Optimal Loading of Row Keys
 
-When performing a table 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[scan]
        where only the row keys are needed (no families, qualifiers, values or 
timestamps), add a FilterList with a `MUST_PASS_ALL` operator to the scanner 
using [method]+setFilter+.
-The filter list should include both a 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter]
        and a 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html[KeyOnlyFilter].
-Using this filter combination will result in a worst case scenario of a 
RegionServer reading a single value from disk and minimal network traffic to 
the client for a single row. 
+When performing a table 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[scan]
 where only the row keys are needed (no families, qualifiers, values or 
timestamps), add a FilterList with a `MUST_PASS_ALL` operator to the scanner 
using `setFilter`.
+The filter list should include both a 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter]
 and a 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html[KeyOnlyFilter].
+Using this filter combination will result in a worst case scenario of a 
RegionServer reading a single value from disk and minimal network traffic to 
the client for a single row.
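
A sketch of the combination described above:

[source,java]
----
// Sketch: a row-key-only scan built from the two filters named above.
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // only the first cell of each row
filters.addFilter(new KeyOnlyFilter());       // drop the values, keep the keys
scan.setFilter(filters);
----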
 
 [[perf.hbase.read.dist]]
 === Concurrency: Monitor Data Spread
 
 When performing a high number of concurrent reads, monitor the data spread of 
the target tables.
-If the target table(s) have too few regions then the reads could likely be 
served from too few nodes. 
+If the target table(s) have too few regions then the reads could likely be 
served from too few nodes.
 
-See <<precreate.regions,precreate.regions>>, as well as 
<<perf.configurations,perf.configurations>>      
+See <<precreate.regions>>, as well as <<perf.configurations>>.
 
 [[blooms]]
 === Bloom Filters
 
 Enabling Bloom Filters can save you having to go to disk and can help improve 
read latencies.
 
-link:http://en.wikipedia.org/wiki/Bloom_filter[Bloom filters] were developed 
over in link:https://issues.apache.org/jira/browse/HBASE-1200[HBase-1200 Add
-          bloomfilters].
-For description of the development process -- why static blooms rather than 
dynamic -- and for an overview of the unique properties that pertain to blooms 
in HBase, as well as possible future directions, see the _Development Process_ 
section of the document 
link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters
-              in HBase] attached to 
link:https://issues.apache.org/jira/browse/HBASE-1200[HBase-1200].
+link:http://en.wikipedia.org/wiki/Bloom_filter[Bloom filters] were developed 
over in link:https://issues.apache.org/jira/browse/HBASE-1200[HBase-1200 Add 
bloomfilters].
+For description of the development process -- why static blooms rather than 
dynamic -- and for an overview of the unique properties that pertain to blooms 
in HBase, as well as possible future directions, see the _Development Process_ 
section of the document 
link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters
 in HBase] attached to 
link:https://issues.apache.org/jira/browse/HBASE-1200[HBASE-1200].
 The bloom filters described here are actually version two of blooms in HBase.
 In versions up to 0.19.x, HBase had a dynamic bloom option based on work done 
by the link:http://www.one-lab.org[European Commission One-Lab Project 034819].
 The core of the HBase bloom work was later pulled up into Hadoop to implement 
org.apache.hadoop.io.BloomMapFile.
 Version 1 of HBase blooms never worked that well.
 Version 2 is a rewrite from scratch though again it starts with the one-lab 
work.
 
-See also <<schema.bloom,schema.bloom>>. 
+See also <<schema.bloom>>.
 
 [[bloom_footprint]]
 ==== Bloom StoreFile footprint
@@ -698,11 +695,11 @@ Bloom filters add an entry to the `StoreFile` general 
`FileInfo` data structure
 ===== BloomFilter entries in `StoreFile` metadata
 
 `BLOOM_FILTER_META` holds Bloom Size, Hash Function used, etc.
-Its small in size and is cached on `StoreFile.Reader` load
+It's small in size and is cached on `StoreFile.Reader` load.
 
 `BLOOM_FILTER_DATA` is the actual bloomfilter data.
 Obtained on-demand.
-Stored in the LRU cache, if it is enabled (Its enabled by default).
+Stored in the LRU cache, if it is enabled (it is enabled by default).
 
 [[config.bloom]]
 ==== Bloom Filter Configuration
@@ -723,8 +720,7 @@ to .5%) == +1 bit per bloom entry.
 `io.hfile.bloom.max.fold` = guaranteed minimum fold rate.
 Most people should leave this alone.
 Default = 7, or can collapse to at least 1/128th of original size.
-See the _Development Process_ section of the document 
link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters
-              in HBase] for more on what this option means.
+See the _Development Process_ section of the document 
link:https://issues.apache.org/jira/secure/attachment/12444007/Bloom_Filters_in_HBase.pdf[BloomFilters
 in HBase] for more on what this option means.
 
 === Hedged Reads
 
@@ -736,12 +732,14 @@ Hedged reads can be helpful for times where a rare slow 
read is caused by a tran
 
 Because an HBase RegionServer is an HDFS client, you can enable hedged reads in 
HBase by adding the following properties to the RegionServer's _hbase-site.xml_ 
and tuning the values to suit your environment.
 
-* .Configuration for Hedged Reads`dfs.client.hedged.read.threadpool.size` - 
the number of threads dedicated to servicing hedged reads.
+.Configuration for Hedged Reads
+* `dfs.client.hedged.read.threadpool.size` - the number of threads dedicated 
to servicing hedged reads.
   If this is set to 0 (the default), hedged reads are disabled.
 * `dfs.client.hedged.read.threshold.millis` - the number of milliseconds to 
wait before spawning a second read thread.
 
 .Hedged Reads Configuration Example
 ====
+[source,xml]
 ----
 <property>
   <name>dfs.client.hedged.read.threadpool.size</name>
@@ -755,9 +753,10 @@ Because a HBase RegionServer is a HDFS client, you can 
enable hedged reads in HB
 ====
 
 Use the following metrics to tune the settings for hedged reads on your 
cluster.
-See <<hbase_metrics,hbase metrics>>  for more information.
+See <<hbase_metrics>> for more information.
 
-* .Metrics for Hedged ReadshedgedReadOps - the number of times hedged read 
threads have been triggered.
+.Metrics for Hedged Reads
+* hedgedReadOps - the number of times hedged read threads have been triggered.
   This could indicate that read requests are often slow, or that hedged reads 
are triggered too quickly.
 * hedgeReadOpsWin - the number of times the hedged read thread was faster than 
the original thread.
   This could indicate that a given RegionServer is having trouble servicing 
requests.
@@ -770,24 +769,24 @@ See <<hbase_metrics,hbase metrics>>  for more information.
 
 HBase tables are sometimes used as queues.
 In this case, special care must be taken to regularly perform major 
compactions on tables used in this manner.
-As is documented in <<datamodel,datamodel>>, marking rows as deleted creates 
additional StoreFiles which then need to be processed on reads.
-Tombstones only get cleaned up with major compactions. 
+As is documented in <<datamodel>>, marking rows as deleted creates additional 
StoreFiles which then need to be processed on reads.
+Tombstones only get cleaned up with major compactions.
 
-See also <<compaction,compaction>> and 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#majorCompact%28java.lang.String%29[HBaseAdmin.majorCompact].
 
+See also <<compaction>> and 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact%28java.lang.String%29[Admin.majorCompact].
 
 [[perf.deleting.rpc]]
 === Delete RPC Behavior
 
-Be aware that `htable.delete(Delete)` doesn't use the writeBuffer.
+Be aware that `Table.delete(Delete)` doesn't use the writeBuffer.
 It will execute a RegionServer RPC with each invocation.
-For a large number of deletes, consider `htable.delete(List)`. 
+For a large number of deletes, consider `Table.delete(List)`.
 
-See 
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29
      
+See 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete%28org.apache.hadoop.hbase.client.Delete%29
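
For example, a minimal sketch of batching deletes (assuming an already-established `Connection` named `connection`; the table name is hypothetical):

[source,java]
----
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;

public class BatchDeleteExample {
  // Illustrative only: issue deletes as one batched call rather than one RPC per row.
  public static void deleteRows(Connection connection, List<byte[]> rowKeys)
      throws IOException {
    List<Delete> deletes = new ArrayList<>(rowKeys.size());
    for (byte[] row : rowKeys) {
      deletes.add(new Delete(row));
    }
    try (Table table = connection.getTable(TableName.valueOf("my_table"))) {  // hypothetical table
      table.delete(deletes);  // one batched call instead of one RPC per Delete
    }
  }
}
----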
 
 [[perf.hdfs]]
 == HDFS
 
-Because HBase runs on <<arch.hdfs,arch.hdfs>> it is important to understand 
how it works and how it affects HBase. 
+Because HBase runs on <<arch.hdfs>> it is important to understand how it works 
and how it affects HBase.
 
 [[perf.hdfs.curr]]
 === Current Issues With Low-Latency Reads
@@ -795,26 +794,22 @@ Because HBase runs on <<arch.hdfs,arch.hdfs>> it is 
important to understand how
 The original use-case for HDFS was batch processing.
 As such, low-latency reads were historically not a priority.
 With the increased adoption of Apache HBase this is changing, and several 
improvements are already in development.
-See the link:https://issues.apache.org/jira/browse/HDFS-1599[Umbrella Jira 
Ticket for HDFS
-          Improvements for HBase]. 
+See the link:https://issues.apache.org/jira/browse/HDFS-1599[Umbrella Jira 
Ticket for HDFS Improvements for HBase].
 
 [[perf.hdfs.configs.localread]]
 === Leveraging local data
 
 Since Hadoop 1.0.0 (also 0.22.1, 0.23.1, CDH3u3 and HDP 1.0) via 
link:https://issues.apache.org/jira/browse/HDFS-2246[HDFS-2246], it is possible 
for the DFSClient to take a "short circuit" and read directly from the disk 
instead of going through the DataNode when the data is local.
 What this means for HBase is that the RegionServers can read directly off 
their machine's disks instead of having to open a socket to talk to the 
DataNode, the former being generally much faster.
-See JD's link:http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf[Performance
-              Talk].
-Also see link:http://search-hadoop.com/m/zV6dKrLCVh1[HBase, mail # dev - read 
short
-          circuit] thread for more discussion around short circuit reads. 
+See JD's 
link:http://files.meetup.com/1350427/hug_ebay_jdcryans.pdf[Performance Talk].
+Also see link:http://search-hadoop.com/m/zV6dKrLCVh1[HBase, mail # dev - read 
short circuit] thread for more discussion around short circuit reads.
 
 To enable "short circuit" reads, it will depend on your version of Hadoop.
 The original shortcircuit read patch was much improved upon in Hadoop 2 in 
link:https://issues.apache.org/jira/browse/HDFS-347[HDFS-347].
-See 
link:http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
        for details on the difference between the old and new implementations.
-See 
link:http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html[Hadoop
-          shortcircuit reads configuration page] for how to enable the latter, 
better version of shortcircuit.
+See 
http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
 for details on the difference between the old and new implementations.
+See 
link:http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html[Hadoop
 shortcircuit reads configuration page] for how to enable the latter, better 
version of shortcircuit.
 For example, here is a minimal config
-enabling short-circuit reads added to _hbase-site.xml_: 
+enabling short-circuit reads, added to _hbase-site.xml_:
 
 [source,xml]
 ----
@@ -837,38 +832,37 @@ enabling short-circuit reads added to _hbase-site.xml_:
 </property>
 ----
 
-Be careful about permissions for the directory that hosts the shared domain 
socket; dfsclient will complain if open to other than the hbase user. 
+Be careful about permissions for the directory that hosts the shared domain 
socket; the DFSClient will complain if it is open to users other than the hbase 
user.
 
 If you are running on an old Hadoop, one that is without 
link:https://issues.apache.org/jira/browse/HDFS-347[HDFS-347] but that has 
link:https://issues.apache.org/jira/browse/HDFS-2246[HDFS-2246], you must set 
two configurations.
 First, the hdfs-site.xml needs to be amended.
-Set the property `dfs.block.local-path-access.user` to be the _only_        
user that can use the shortcut.
+Set the property `dfs.block.local-path-access.user` to be the _only_ user that 
can use the shortcut.
 This has to be the user that started HBase.
-Then in hbase-site.xml, set `dfs.client.read.shortcircuit` to be `true`      
+Then in hbase-site.xml, set `dfs.client.read.shortcircuit` to be `true`.
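
A minimal sketch of that older-style setup, assuming HBase runs as the `hbase` user (adjust the user to match your environment):

[source,xml]
----
<!-- hdfs-site.xml (illustrative): name the one user allowed to use the shortcut -->
<property>
  <name>dfs.block.local-path-access.user</name>
  <value>hbase</value>
</property>

<!-- hbase-site.xml (illustrative): turn the shortcut on for the HBase DFSClient -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
----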
 
-Services -- at least the HBase RegionServers -- will need to be restarted in 
order to pick up the new configurations. 
+Services -- at least the HBase RegionServers -- will need to be restarted in 
order to pick up the new configurations.
 
 .dfs.client.read.shortcircuit.buffer.size
 [NOTE]
 ====
-The default for this value is too high when running on a highly trafficed 
HBase.
-In HBase, if this value has not been set, we set it down from the default of 
1M to 128k (Since HBase 0.98.0 and 0.96.1). See 
link:https://issues.apache.org/jira/browse/HBASE-8143[HBASE-8143 HBase on Hadoop
-            2 with local short circuit reads (ssr) causes OOM]). The Hadoop 
DFSClient in HBase will allocate a direct byte buffer of this size for _each_ 
block it has open; given HBase keeps its HDFS files open all the time, this can 
add up quickly.
+The default for this value is too high when running on a highly trafficked 
HBase.
+In HBase, if this value has not been set, we set it down from the default of 
1M to 128k (since HBase 0.98.0 and 0.96.1). See 
link:https://issues.apache.org/jira/browse/HBASE-8143[HBASE-8143 HBase on 
Hadoop 2 with local short circuit reads (ssr) causes OOM]. The Hadoop 
DFSClient in HBase will allocate a direct byte buffer of this size for _each_ 
block it has open; given HBase keeps its HDFS files open all the time, this can 
add up quickly.
 ====
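
If you do choose to set it explicitly, a hedged sketch of such an override in _hbase-site.xml_ could look like the following (131072 bytes is the same 128k value HBase falls back to when the property is unset):

[source,xml]
----
<!-- Illustrative override; omit it entirely to accept the HBase-chosen default. -->
<property>
  <name>dfs.client.read.shortcircuit.buffer.size</name>
  <value>131072</value>
</property>
----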
 
 [[perf.hdfs.comp]]
 === Performance Comparisons of HBase vs. HDFS
 
 A fairly common question on the dist-list is why HBase isn't as performant as 
HDFS files in a batch context (e.g., as a MapReduce source or sink). The short 
answer is that HBase is doing a lot more than HDFS (e.g., reading the 
KeyValues, returning the most current row or specified timestamps, etc.), and 
as such HBase is 4-5 times slower than HDFS in this processing context.
-There is room for improvement and this gap will, over time, be reduced, but 
HDFS will always be faster in this use-case. 
+There is room for improvement and this gap will, over time, be reduced, but 
HDFS will always be faster in this use-case.
 
 [[perf.ec2]]
 == Amazon EC2
 
 Performance questions are common on Amazon EC2 environments because it is a 
shared environment.
 You will not see the same throughput as a dedicated server.
-In terms of running tests on EC2, run them several times for the same reason 
(i.e., it's a shared environment and you don't know what else is happening on 
the server). 
+In terms of running tests on EC2, run them several times for the same reason 
(i.e., it's a shared environment and you don't know what else is happening on 
the server).
 
-If you are running on EC2 and post performance questions on the dist-list, 
please state this fact up-front that because EC2 issues are practically a 
separate class of performance issues. 
+If you are running on EC2 and post performance questions on the dist-list, 
please state this fact up-front, because EC2 issues are practically a separate 
class of performance issues.
 
 [[perf.hbase.mr.cluster]]
 == Collocating HBase and MapReduce
@@ -877,17 +871,17 @@ It is often recommended to have different clusters for 
HBase and MapReduce.
 A better qualification of this is: don't collocate a HBase that serves live 
requests with a heavy MR workload.
 OLTP and OLAP-optimized systems have conflicting requirements and one will 
lose to the other, usually the former.
 For example, short latency-sensitive disk reads will have to wait in line 
behind longer reads that are trying to squeeze out as much throughput as 
possible.
-MR jobs that write to HBase will also generate flushes and compactions, which 
will in turn invalidate blocks in the <<block.cache,block.cache>>. 
+MR jobs that write to HBase will also generate flushes and compactions, which 
will in turn invalidate blocks in the <<block.cache>>.
 
-If you need to process the data from your live HBase cluster in MR, you can 
ship the deltas with <<copy.table,copy.table>> or use replication to get the 
new data in real time on the OLAP cluster.
-In the worst case, if you really need to collocate both, set MR to use less 
Map and Reduce slots than you'd normally configure, possibly just one. 
+If you need to process the data from your live HBase cluster in MR, you can 
ship the deltas with <<copy.table>> or use replication to get the new data in 
real time on the OLAP cluster.
+In the worst case, if you really need to collocate both, set MR to use fewer 
Map and Reduce slots than you'd normally configure, possibly just one.
 
-When HBase is used for OLAP operations, it's preferable to set it up in a 
hardened way like configuring the ZooKeeper session timeout higher and giving 
more memory to the MemStores (the argument being that the Block Cache won't be 
used much since the workloads are usually long scans). 
+When HBase is used for OLAP operations, it's preferable to set it up in a 
hardened way like configuring the ZooKeeper session timeout higher and giving 
more memory to the MemStores (the argument being that the Block Cache won't be 
used much since the workloads are usually long scans).
 
 [[perf.casestudy]]
 == Case Studies
 
-For Performance and Troubleshooting Case Studies, see 
<<casestudies,casestudies>>. 
+For Performance and Troubleshooting Case Studies, see <<casestudies>>.
 
 ifdef::backend-docbook[]
 [index]

http://git-wip-us.apache.org/repos/asf/hbase/blob/7139c90e/src/main/asciidoc/_chapters/preface.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/preface.adoc 
b/src/main/asciidoc/_chapters/preface.adoc
index 4f8941a..2eb8411 100644
--- a/src/main/asciidoc/_chapters/preface.adoc
+++ b/src/main/asciidoc/_chapters/preface.adoc
@@ -29,25 +29,20 @@
 
 This is the official reference guide for the 
link:http://hbase.apache.org/[HBase] version it ships with.
 
-Herein you will find either the definitive documentation on an HBase topic as 
of its standing when the referenced HBase version shipped, or it will point to 
the location in link:http://hbase.apache.org/apidocs/index.html[javadoc], 
link:https://issues.apache.org/jira/browse/HBASE[JIRA] or 
link:http://wiki.apache.org/hadoop/Hbase[wiki] where the pertinent information 
can be found.
+Herein you will find either the definitive documentation on an HBase topic as 
of its standing when the referenced HBase version shipped, or it will point to 
the location in link:http://hbase.apache.org/apidocs/index.html[Javadoc], 
link:https://issues.apache.org/jira/browse/HBASE[JIRA] or 
link:http://wiki.apache.org/hadoop/Hbase[wiki] where the pertinent information 
can be found.
 
 .About This Guide
-This reference guide is a work in progress. The source for this guide can be 
found in the _src/main/dasciidoc_ directory of the HBase source. This reference 
guide is marked up using Asciidoc, from which the the finished guide is 
generated as part of the 'site' build target. Run 
+This reference guide is a work in progress. The source for this guide can be 
found in the _src/main/asciidoc_ directory of the HBase source. This reference 
guide is marked up using link:http://asciidoc.org/[AsciiDoc], from which the 
finished guide is generated as part of the 'site' build target. Run
 [source,bourne]
 ----
 mvn site
----- 
+----
 to generate this documentation.
 Amendments and improvements to the documentation are welcomed.
 Click 
link:https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12310753&issuetype=1&components=12312132&summary=SHORT+DESCRIPTION[this
 link] to file a new documentation bug against Apache HBase with some values 
pre-selected.
 
 .Contributing to the Documentation
-For an overview of Asciidoc and suggestions to get started contributing to the 
documentation, see <<appendix_contributing_to_documentation,appendix 
contributing to documentation>>.
-
-.Providing Feedback
-This guide allows you to leave comments or questions on any page, using Disqus.
-Look for the Comments area at the bottom of the page.
-Answering these questions is a volunteer effort, and may be delayed.
+For an overview of AsciiDoc and suggestions to get started contributing to the 
documentation, see the <<appendix_contributing_to_documentation,relevant 
section later in this documentation>>.
 
 .Heads-up if this is your first foray into the world of distributed 
computing...
 If this is your first foray into the wonderful world of Distributed Computing, 
then you are in for some interesting times.
@@ -57,8 +52,8 @@ Your cluster's operation can hiccup because of any of a 
myriad set of reasons fr
 Here is one good starting point: 
link:http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing[Fallacies 
of Distributed Computing].
 
 That said, you are welcome. +
-Its a fun place to be. +
-Yours, the HBase Community. 
+It's a fun place to be. +
+Yours, the HBase Community.
 
 
 :numbered:
