http://git-wip-us.apache.org/repos/asf/hbase/blob/cb77a925/src/main/docbkx/troubleshooting.xml ---------------------------------------------------------------------- diff --git a/src/main/docbkx/troubleshooting.xml b/src/main/docbkx/troubleshooting.xml deleted file mode 100644 index d57bb08..0000000 --- a/src/main/docbkx/troubleshooting.xml +++ /dev/null @@ -1,1700 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<chapter - version="5.0" - xml:id="trouble" - xmlns="http://docbook.org/ns/docbook" - xmlns:xlink="http://www.w3.org/1999/xlink" - xmlns:xi="http://www.w3.org/2001/XInclude" - xmlns:svg="http://www.w3.org/2000/svg" - xmlns:m="http://www.w3.org/1998/Math/MathML" - xmlns:html="http://www.w3.org/1999/xhtml" - xmlns:db="http://docbook.org/ns/docbook"> - <!-- -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ ---> - <title>Troubleshooting and Debugging Apache HBase</title> - <section - xml:id="trouble.general"> - <title>General Guidelines</title> - <para> Always start with the master log (TODO: Which lines?). Normally it's just printing the same lines over and over again. If not, then there's an issue.
Google or <link xlink:href="http://search-hadoop.com">search-hadoop.com</link> should return some hits for those exceptions you're seeing. </para> - <para> An error rarely comes alone in Apache HBase; usually when something goes wrong, what follows is hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they print some metrics when aborting, so grepping for <emphasis>Dump</emphasis> should get you close to the start of the problem. </para> - <para> RegionServer suicides are "normal", as this is what they do when something goes wrong. For example, if ulimit and max transfer threads (the two most important initial settings, see <xref linkend="ulimit" /> and <xref linkend="dfs.datanode.max.transfer.threads" />) aren't changed, it will at some point become impossible for DataNodes to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database were suddenly unable to access files on your local file system; it's the same with HBase and HDFS. Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout. For more information on GC pauses, see the <link xlink:href="http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/">3 part blog post</link> by Todd Lipcon and <xref linkend="gcpause" /> above. </para> - </section> - <section - xml:id="trouble.log"> - <title>Logs</title> - <para> The key process logs are as follows...
(replace &lt;user&gt; with the user that started the service, and &lt;hostname&gt; with the machine name) </para> - <para> NameNode: <filename>$HADOOP_HOME/logs/hadoop-&lt;user&gt;-namenode-&lt;hostname&gt;.log</filename> </para> - <para> DataNode: <filename>$HADOOP_HOME/logs/hadoop-&lt;user&gt;-datanode-&lt;hostname&gt;.log</filename> </para> - <para> JobTracker: <filename>$HADOOP_HOME/logs/hadoop-&lt;user&gt;-jobtracker-&lt;hostname&gt;.log</filename> </para> - <para> TaskTracker: <filename>$HADOOP_HOME/logs/hadoop-&lt;user&gt;-tasktracker-&lt;hostname&gt;.log</filename> </para> - <para> HMaster: <filename>$HBASE_HOME/logs/hbase-&lt;user&gt;-master-&lt;hostname&gt;.log</filename> </para> - <para> RegionServer: <filename>$HBASE_HOME/logs/hbase-&lt;user&gt;-regionserver-&lt;hostname&gt;.log</filename> </para> - <para> ZooKeeper: <filename>TODO</filename> </para> - <section - xml:id="trouble.log.locations"> - <title>Log Locations</title> - <para>For stand-alone deployments the logs are obviously going to be on a single machine; however, this is a development configuration only. Production deployments need to run on a cluster.</para> - <section - xml:id="trouble.log.locations.namenode"> - <title>NameNode</title> - <para>The NameNode log is on the NameNode server.
The HBase Master is typically run on the NameNode server, as is ZooKeeper.</para> - <para>For smaller clusters the JobTracker is typically run on the NameNode server as well.</para> - </section> - <section - xml:id="trouble.log.locations.datanode"> - <title>DataNode</title> - <para>Each DataNode server will have a DataNode log for HDFS, as well as a RegionServer log for HBase.</para> - <para>Additionally, each DataNode server will also have a TaskTracker log for MapReduce task execution.</para> - </section> - </section> - <section - xml:id="trouble.log.levels"> - <title>Log Levels</title> - <section - xml:id="rpc.logging"> - <title>Enabling RPC-level logging</title> - <para>Enabling RPC-level logging on a RegionServer can often give insight on timings at the server. Once enabled, the amount of log spewed is voluminous. It is not recommended that you leave this logging on for more than short bursts of time. To enable RPC-level logging, browse to the RegionServer UI and click on <emphasis>Log Level</emphasis>. Set the log level to <varname>DEBUG</varname> for the package <classname>org.apache.hadoop.ipc</classname> (That's right, for <classname>hadoop.ipc</classname>, NOT <classname>hbase.ipc</classname>). Then tail the RegionServer's log. Analyze.</para> - <para>To disable, set the logging level back to <varname>INFO</varname>. </para> - </section> - </section> - <section - xml:id="trouble.log.gc"> - <title>JVM Garbage Collection Logs</title> - <para>HBase is memory intensive, and using the default GC you can see long pauses in all threads, including the <emphasis>Juliet Pause</emphasis>, aka "GC of Death". To help debug this, or to confirm it is happening, GC logging can be turned on in the Java virtual machine. </para> - <para> To enable, in <filename>hbase-env.sh</filename>, uncomment one of the below lines:</para> - <programlisting language="bourne"> -# This enables basic gc logging to the .out file.
-# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" - -# This enables basic gc logging to its own file. -# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>" - -# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+. -# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M" - -# If <FILE-PATH> is not replaced, the log file (.gc) will be generated in the HBASE_LOG_DIR. - </programlisting> - <para> At this point you should see logs like so:</para> - <programlisting> -64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] -64898.953: [CMS-concurrent-mark-start] -64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] - </programlisting> - <para> In this output, the first line indicates a 0.0007360 second pause for the CMS to initially mark. This pauses the entire VM, all threads, for that period of time. </para> - <para> The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds, aka 10 milliseconds. It has reduced the "ParNew" from about 5.5m to 576k.
Later on in this cycle - we see:</para> - <programlisting> -64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs] -64901.445: [CMS-concurrent-preclean-start] -64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] -64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs] -64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] -64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs] -64901.578: [CMS-concurrent-abortable-preclean-start] -64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] -64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs] -64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] -64901.621: [CMS-concurrent-sweep-start] - </programlisting> - <para> The first line indicates that the CMS concurrent mark (finding garbage) 
has taken 2.4 seconds. But this is a <emphasis>concurrent</emphasis> 2.4 seconds; Java has not been paused at any point in time. </para> - <para> There are a few more minor GCs, then there is a pause at the 2nd last line: - <programlisting> -64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] - </programlisting> - </para> - <para> The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap. </para> - <para> At this point the sweep starts, and you can watch the heap size go down:</para> - <programlisting> -64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -... lines removed ... -64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs] -64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs] - </programlisting> - <para>At this point, the CMS sweep took 3.332 seconds, and the heap went from about 2.8 GB to about 1.3 GB. </para> - <para> The key point here is to keep all these pauses low. CMS pauses are always low, but if your ParNew starts growing, you can see minor GC pauses approach 100ms, exceed 100ms, and hit as high as 400ms. </para> - <para> This can be due to the size of the ParNew, which should be relatively small. If your ParNew is very large after running HBase for a while (in one example a ParNew was about 150MB), then you might have to constrain the size of ParNew (the larger it is, the longer the collections take, but if it's too small, objects are promoted to old gen too quickly). In the below we constrain new gen size to 64m.
</para> - <para> Add the below line in <filename>hbase-env.sh</filename>: - <programlisting language="bourne"> -export SERVER_GC_OPTS="$SERVER_GC_OPTS -XX:NewSize=64m -XX:MaxNewSize=64m" - </programlisting> - </para> - <para> Similarly, to enable GC logging for client processes, uncomment one of the below lines in <filename>hbase-env.sh</filename>:</para> - <programlisting language="bourne"> -# This enables basic gc logging to the .out file. -# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps" - -# This enables basic gc logging to its own file. -# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>" - -# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+. -# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M" - -# If <FILE-PATH> is not replaced, the log file (.gc) will be generated in the HBASE_LOG_DIR. - </programlisting> - <para> For more information on GC pauses, see the <link xlink:href="http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/">3 part blog post</link> by Todd Lipcon and <xref linkend="gcpause" /> above. </para> - </section> - </section> - <section - xml:id="trouble.resources"> - <title>Resources</title> - <section - xml:id="trouble.resources.searchhadoop"> - <title>search-hadoop.com</title> - <para> <link xlink:href="http://search-hadoop.com">search-hadoop.com</link> indexes all the mailing lists and is great for historical searches. Search here first when you have an issue, as it's more than likely someone has already had your problem.
</para> - </section> - <section - xml:id="trouble.resources.lists"> - <title>Mailing Lists</title> - <para>Ask a question on the <link xlink:href="http://hbase.apache.org/mail-lists.html">Apache HBase mailing lists</link>. The 'dev' mailing list is aimed at the community of developers actually building Apache HBase and at features currently under development, while 'user' is generally used for questions on released versions of Apache HBase. Before going to the mailing list, make sure your question has not already been answered by searching the mailing list archives. Use <xref linkend="trouble.resources.searchhadoop"/>. Take some time crafting your question. See <link xlink:href="http://www.mikeash.com/getting_answers.html">Getting Answers</link> for ideas on crafting good questions. A quality question that includes all context and exhibits evidence the author has tried to find answers in the manual and out on lists is more likely to get a prompt response. </para> - </section> - <section - xml:id="trouble.resources.irc"> - <title>IRC</title> - <para>#hbase on irc.freenode.net</para> - </section> - <section - xml:id="trouble.resources.jira"> - <title>JIRA</title> - <para> <link xlink:href="https://issues.apache.org/jira/browse/HBASE">JIRA</link> is also really helpful when looking for Hadoop/HBase-specific issues. </para> - </section> - </section> - <section - xml:id="trouble.tools"> - <title>Tools</title> - <section - xml:id="trouble.tools.builtin"> - <title>Builtin Tools</title> - <section - xml:id="trouble.tools.builtin.webmaster"> - <title>Master Web Interface</title> - <para>The Master starts a web interface on port 16010 by default. (Up to and including 0.98 this was port 60010.) </para> - <para>The Master web UI lists created tables and their definitions (e.g., ColumnFamilies, blocksize, etc.).
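The same information can also be pulled as JSON for scripting via the web UI's <code>/jmx</code> servlet; the host name below is illustrative (the port is the default noted above): - <programlisting language="bourne"> -# host name is illustrative; substitute your own Master host -curl http://master.example.com:16010/jmx - </programlisting>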
Additionally, the available RegionServers in the cluster are listed along with selected high-level metrics (requests, number of regions, usedHeap, maxHeap). The Master web UI allows navigation to each RegionServer's web UI. </para> - </section> - <section - xml:id="trouble.tools.builtin.webregion"> - <title>RegionServer Web Interface</title> - <para>RegionServers start a web interface on port 16030 by default. (Up to and including 0.98 this was port 60030.) </para> - <para>The RegionServer web UI lists online regions and their start/end keys, as well as point-in-time RegionServer metrics (requests, regions, storeFileIndexSize, compactionQueueSize, etc.). </para> - <para>See <xref linkend="hbase_metrics" /> for more information on metric definitions. </para> - </section> - <section - xml:id="trouble.tools.builtin.zkcli"> - <title>zkcli</title> - <para><code>zkcli</code> is a very useful tool for investigating ZooKeeper-related issues. To invoke: - <programlisting language="bourne"> -./hbase zkcli -server host:port &lt;cmd&gt; &lt;args&gt; -</programlisting> - The commands (and arguments) are:</para> - <programlisting> - connect host:port - get path [watch] - ls path [watch] - set path data [version] - delquota [-n|-b] path - quit - printwatches on|off - create [-s] [-e] path data acl - stat path [watch] - close - ls2 path [watch] - history - listquota path - setAcl path acl - getAcl path - sync path - redo cmdno - addauth scheme auth - delete path [version] - setquota -n|-b val path -</programlisting> - </section> - </section> - <section - xml:id="trouble.tools.external"> - <title>External Tools</title> - <section - xml:id="trouble.tools.tail"> - <title>tail</title> - <para> <code>tail</code> is the command line tool that lets you look at the end of a file. Add the "-f" option and it will refresh when new data is available.
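For example, to follow a master log as it is written (the user and host name in the file name are illustrative; substitute your own): - <programlisting language="bourne"> -# user (hadoop) and hostname (sv4borg12) are illustrative -tail -f $HBASE_HOME/logs/hbase-hadoop-master-sv4borg12.log - </programlisting>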
It's useful when you are wondering what's happening, for example when a cluster is taking a long time to shut down or start up, as you can just fire up a new terminal and tail the master log (and maybe a few RegionServers' logs). </para> - </section> - <section - xml:id="trouble.tools.top"> - <title>top</title> - <para> <code>top</code> is probably one of the most important tools when first trying to see what's running on a machine and how the resources are consumed. Here's an example from a production system:</para> - <programlisting> -top - 14:46:59 up 39 days, 11:55, 1 user, load average: 3.75, 3.57, 3.84 -Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie -Cpu(s): 4.5%us, 1.6%sy, 0.0%ni, 91.7%id, 1.4%wa, 0.1%hi, 0.6%si, 0.0%st -Mem: 24414432k total, 24296956k used, 117476k free, 7196k buffers -Swap: 16008732k total, 14348k used, 15994384k free, 11106908k cached - - PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND -15558 hadoop 18 -2 3292m 2.4g 3556 S 79 10.4 6523:52 java -13268 hadoop 18 -2 8967m 8.2g 4104 S 21 35.1 5170:30 java - 8895 hadoop 18 -2 1581m 497m 3420 S 11 2.1 4002:32 java -… - </programlisting> - <para> Here we can see that the system load average during the last five minutes is 3.75, which very roughly means that on average 3.75 threads were waiting for CPU time during these 5 minutes. In general, the "perfect" utilization equals the number of cores; below that number the machine is underutilized, and above it the machine is overutilized. This is an important concept; see this article to understand it more: <link xlink:href="http://www.linuxjournal.com/article/9001">http://www.linuxjournal.com/article/9001</link>. </para> - <para> Apart from load, we can see that the system is using almost all its available RAM, but most of it is used for the OS cache (which is good).
The swap has only a few KB in it, and this is desirable; high numbers would indicate swapping activity, which is the nemesis of performance for Java systems. Another way to detect swapping is when the load average goes through the roof (although this could also be caused by things like a dying disk, among others). </para> - <para> The list of processes isn't super useful by default; all we know is that 3 java processes are using about 111% of the CPUs. To know which is which, simply type "c" and each line will be expanded. Typing "1" will give you the detail of how each CPU is used instead of the average for all of them as shown here. </para> - </section> - <section - xml:id="trouble.tools.jps"> - <title>jps</title> - <para> <code>jps</code> is shipped with every JDK and gives the java process ids for the current user (if root, then it gives the ids for all users). Example:</para> - <programlisting language="bourne"> -hadoop@sv4borg12:~$ jps -1322 TaskTracker -17789 HRegionServer -27862 Child -1158 DataNode -25115 HQuorumPeer -2950 Jps -19750 ThriftServer -18776 jmx - </programlisting> - <para>In order, we see a: </para> - <itemizedlist> - <listitem> - <para>Hadoop TaskTracker, manages the local Childs</para> - </listitem> - <listitem> - <para>HBase RegionServer, serves regions</para> - </listitem> - <listitem> - <para>Child, a MapReduce task; cannot tell which type exactly</para> - </listitem> - <listitem> - <para>Hadoop DataNode, serves blocks</para> - </listitem> - <listitem> - <para>HQuorumPeer, a ZooKeeper ensemble member</para> - </listitem> - <listitem> - <para>Jps, well… it's the current process</para> - </listitem> - <listitem> - <para>ThriftServer, a special one that will be running only if thrift was started</para> - </listitem> - <listitem> - <para>jmx, this is a local process that's part of our monitoring platform (poorly named maybe).
You probably don't have that.</para> - </listitem> - </itemizedlist> - <para> You can then do stuff like checking out the full command line that started the process:</para> - <programlisting language="bourne"> -hadoop@sv4borg12:~$ ps aux | grep HRegionServer -hadoop 17789 155 35.2 9067824 8604364 ? S&lt;l Mar04 9855:48 /usr/java/jdk1.6.0_14/bin/java -Xmx8000m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/export1/hadoop/logs/gc-hbase.log -Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/home/hadoop/hbase/conf/jmxremote.password -Dcom.sun.management.jmxremote -Dhbase.log.dir=/export1/hadoop/logs -Dhbase.log.file=hbase-hadoop-regionserver-sv4borg12.log -Dhbase.home.dir=/home/hadoop/hbase -Dhbase.id.str=hadoop -Dhbase.root.logger=INFO,DRFA -Djava.library.path=/home/hadoop/hbase/lib/native/Linux-amd64-64 -classpath /home/hadoop/hbase/bin/../conf:[many jars]:/home/hadoop/hadoop/conf org.apache.hadoop.hbase.regionserver.HRegionServer start - </programlisting> - </section> - <section - xml:id="trouble.tools.jstack"> - <title>jstack</title> - <para> <code>jstack</code> is one of the most important tools when trying to figure out what a java process is doing, apart from looking at the logs. It has to be used in conjunction with jps in order to give it a process id. It shows a list of threads, each with a name, and they appear in the order that they were created (so the top ones are the most recent threads).
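For example, to dump the RegionServer from the <code>jps</code> listing above: - <programlisting language="bourne"> -# 17789 is the HRegionServer pid from the jps example above -jstack 17789 - </programlisting>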
Here are a few examples: </para> - <para> The main thread of a RegionServer that's waiting for something to do from the master:</para> - <programlisting> -"regionserver60020" prio=10 tid=0x0000000040ab4000 nid=0x45cf waiting on condition [0x00007f16b6a96000..0x00007f16b6a96a70] -java.lang.Thread.State: TIMED_WAITING (parking) - at sun.misc.Unsafe.park(Native Method) - - parking to wait for &lt;0x00007f16cd5c2f30&gt; (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) - at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) - at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) - at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395) - at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:647) - at java.lang.Thread.run(Thread.java:619) - - The MemStore flusher thread that is currently flushing to a file: -"regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000040f4e000 nid=0x45eb in Object.wait() [0x00007f16b5b86000..0x00007f16b5b87af0] -java.lang.Thread.State: WAITING (on object monitor) - at java.lang.Object.wait(Native Method) - at java.lang.Object.wait(Object.java:485) - at org.apache.hadoop.ipc.Client.call(Client.java:803) - - locked &lt;0x00007f16cb14b3a8&gt; (a org.apache.hadoop.ipc.Client$Call) - at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221) - at $Proxy1.complete(Unknown Source) - at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) - at java.lang.reflect.Method.invoke(Method.java:597) - at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) - at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) - at $Proxy1.complete(Unknown Source) - at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3390) - - locked &lt;0x00007f16cb14b470&gt; (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream) - at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3304) - at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) - at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) - at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:650) - at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:853) - at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:467) - - locked &lt;0x00007f16d00e6f08&gt; (a java.lang.Object) - at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:427) - at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:80) - at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1359) - at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:907) - at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834) - at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786) - at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250) - at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224) - at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146) - </programlisting> - <para> A handler thread that's waiting for stuff to do (like put, delete, scan, etc.):</para> - <programlisting> -"IPC Server handler 16 on 60020" daemon prio=10 tid=0x00007f16b011d800 nid=0x4a5e waiting on condition [0x00007f16afefd000..0x00007f16afefd9f0] - java.lang.Thread.State: WAITING (parking) - at sun.misc.Unsafe.park(Native Method) - - parking to wait for &lt;0x00007f16cd3f8dd8&gt; (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) - at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) - at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) - at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358) - at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1013) - </programlisting> - <para> And one that's busy doing an increment of a counter (it's in the phase where it's trying to create a scanner in order to read the last value):</para> - <programlisting> -"IPC Server handler 66 on 60020" daemon prio=10 tid=0x00007f16b006e800 nid=0x4a90 runnable [0x00007f16acb77000..0x00007f16acb77cf0] - java.lang.Thread.State: RUNNABLE - at org.apache.hadoop.hbase.regionserver.KeyValueHeap.&lt;init&gt;(KeyValueHeap.java:56) - at org.apache.hadoop.hbase.regionserver.StoreScanner.&lt;init&gt;(StoreScanner.java:79) - at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1202) - at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.&lt;init&gt;(HRegion.java:2209) - at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1063) - at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1055) - at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1039) - at org.apache.hadoop.hbase.regionserver.HRegion.getLastIncrement(HRegion.java:2875) - at org.apache.hadoop.hbase.regionserver.HRegion.incrementColumnValue(HRegion.java:2978) - at org.apache.hadoop.hbase.regionserver.HRegionServer.incrementColumnValue(HRegionServer.java:2433) - at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) - at java.lang.reflect.Method.invoke(Method.java:597) - at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:560) - at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1027) - </programlisting> - <para> A thread that receives
data from HDFS:</para> - <programlisting> -"IPC Client (47) connection to sv4borg9/10.4.24.40:9000 from hadoop" daemon prio=10 tid=0x00007f16a02d0000 nid=0x4fa3 runnable [0x00007f16b517d000..0x00007f16b517dbf0] - java.lang.Thread.State: RUNNABLE - at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) - at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215) - at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) - at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) - - locked <0x00007f17d5b68c00> (a sun.nio.ch.Util$1) - - locked <0x00007f17d5b68be8> (a java.util.Collections$UnmodifiableSet) - - locked <0x00007f1877959b50> (a sun.nio.ch.EPollSelectorImpl) - at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) - at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332) - at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157) - at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) - at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) - at java.io.FilterInputStream.read(FilterInputStream.java:116) - at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:304) - at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) - at java.io.BufferedInputStream.read(BufferedInputStream.java:237) - - locked <0x00007f1808539178> (a java.io.BufferedInputStream) - at java.io.DataInputStream.readInt(DataInputStream.java:370) - at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:569) - at org.apache.hadoop.ipc.Client$Connection.run(Client.java:477) - </programlisting> - <para> And here is a master trying to recover a lease after a RegionServer died:</para> - <programlisting> -"LeaseChecker" daemon prio=10 tid=0x00000000407ef800 nid=0x76cd waiting on condition [0x00007f6d0eae2000..0x00007f6d0eae2a70] --- - java.lang.Thread.State: WAITING (on object monitor) - at java.lang.Object.wait(Native 
Method) - at java.lang.Object.wait(Object.java:485) - at org.apache.hadoop.ipc.Client.call(Client.java:726) - - locked &lt;0x00007f6d1cd28f80&gt; (a org.apache.hadoop.ipc.Client$Call) - at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) - at $Proxy1.recoverBlock(Unknown Source) - at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2636) - at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.&lt;init&gt;(DFSClient.java:2832) - at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:529) - at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186) - at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530) - at org.apache.hadoop.hbase.util.FSUtils.recoverFileLease(FSUtils.java:619) - at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1322) - at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1210) - at org.apache.hadoop.hbase.master.HMaster.splitLogAfterStartup(HMaster.java:648) - at org.apache.hadoop.hbase.master.HMaster.joinCluster(HMaster.java:572) - at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:503) - </programlisting> - </section> - <section - xml:id="trouble.tools.opentsdb"> - <title>OpenTSDB</title> - <para> <link xlink:href="http://opentsdb.net">OpenTSDB</link> is an excellent alternative to Ganglia, as it uses Apache HBase to store all the time series and doesn't have to downsample. Monitoring your own HBase cluster that hosts OpenTSDB is a good exercise. </para> - <para> Here's an example of a cluster that's suffering from hundreds of compactions launched almost all around the same time, which severely affects the IO performance: (TODO: insert graph plotting compactionQueueSize) </para> - <para> It's a good practice to build dashboards with all the important graphs per machine and per cluster so that debugging issues can be done with a single quick look.
For - example, at StumbleUpon there's one dashboard per cluster with the most important metrics - from both the OS and Apache HBase. You can then go down to the machine level and get even - more detailed metrics. </para> - </section> - <section - xml:id="trouble.tools.clustersshtop"> - <title>clusterssh+top</title> - <para> clusterssh+top is like a poor man's monitoring system, and it can be quite useful - when you have only a few machines as it's very easy to set up. Starting clusterssh will - give you one terminal per machine and another terminal in which whatever you type will be - retyped in every window. This means that you can type "top" once and it will start it for - all of your machines at the same time, giving you a full view of the current state of your - cluster. You can also tail all the logs at the same time, edit files, etc. </para> - </section> - </section> - </section> - - <section - xml:id="trouble.client"> - <title>Client</title> - <para>For more information on the HBase client, see <xref - linkend="client" />. </para> - <section - xml:id="trouble.client.scantimeout"> - <title>ScannerTimeoutException or UnknownScannerException</title> - <para>This is thrown if the time between RPC calls from the client to RegionServer exceeds the - scan timeout. For example, if <code>Scan.setCaching</code> is set to 500, then there will be - an RPC call to fetch the next batch of rows every 500 <code>.next()</code> calls on the - ResultScanner because data is being transferred in blocks of 500 rows to the client. - Reducing the setCaching value may be an option, but setting this value too low makes for - inefficient processing of large numbers of rows. </para> - <para>See <xref - linkend="perf.hbase.client.caching" />.
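A back-of-the-envelope check can tell you whether a given caching value risks a timeout. The sketch below is illustrative only; the function names and the 150 ms per-row figure are assumptions for the example, not part of any HBase API.

```python
# Illustrative sketch: will a scan's caching value risk a
# ScannerTimeoutException? Measure your own per-row processing time;
# the figures below are hypothetical.

def time_between_rpcs_ms(caching, per_row_processing_ms):
    """With Scan.setCaching(caching), the client fetches `caching` rows
    per RPC, so the gap between RPCs is roughly caching times the
    per-row client processing time."""
    return caching * per_row_processing_ms

def risks_timeout(caching, per_row_processing_ms, scanner_timeout_ms=60000):
    # 60000 ms mirrors a common default timeout period (assumption).
    return time_between_rpcs_ms(caching, per_row_processing_ms) >= scanner_timeout_ms

# A client spending 150 ms per row with caching=500 waits 75 s between RPCs:
assert time_between_rpcs_ms(500, 150) == 75000
assert risks_timeout(500, 150)      # 75 s > 60 s: expect a timeout
assert not risks_timeout(100, 150)  # 15 s between RPCs stays well inside it
```

The point of the arithmetic: caching trades fewer RPCs for a longer gap between them, and the gap is what the scanner timeout measures.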
</para> - </section> - <section> - <title>Performance Differences in Thrift and Java APIs</title> - <para>Poor performance, or even <code>ScannerTimeoutExceptions</code>, can occur if - <code>Scan.setCaching</code> is too high, as discussed in <xref - linkend="trouble.client.scantimeout"/>. If the Thrift client uses the wrong caching - settings for a given workload, performance can suffer compared to the Java API. To set - caching for a given scan in the Thrift client, use the <code>scannerGetList(scannerId, - numRows)</code> method, where <code>numRows</code> is an integer representing the number - of rows to cache. In one case, it was found that reducing the cache for Thrift scans from - 1000 to 100 increased performance to near parity with the Java API given the same - queries.</para> - <para>See also Jesse Andersen's <link xlink:href="http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/">blog post</link> - about using Scans with Thrift.</para> - </section> - <section - xml:id="trouble.client.lease.exception"> - <title><classname>LeaseException</classname> when calling - <classname>Scanner.next</classname></title> - <para> In some situations clients that fetch data from a RegionServer get a LeaseException - instead of the usual <xref - linkend="trouble.client.scantimeout" />. Usually the source of the exception is - <classname>org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)</classname> - (line number may vary). It tends to happen in the context of a slow/freezing - RegionServer#next call. It can be prevented by having <varname>hbase.rpc.timeout</varname> > - <varname>hbase.regionserver.lease.period</varname>. 
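The invariant above can be expressed in hbase-site.xml. The values below are purely illustrative, chosen only so that the RPC timeout exceeds the lease period; tune them for your workload.

```xml
<!-- Illustrative values only: keep hbase.rpc.timeout greater than
     hbase.regionserver.lease.period so the lease outlives the RPC wait. -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.regionserver.lease.period</name>
  <value>60000</value>
</property>
```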
Harsh J investigated the issue as part - of the mailing list thread <link - xlink:href="http://mail-archives.apache.org/mod_mbox/hbase-user/201209.mbox/%3CCAOcnVr3R-LqtKhFsk8Bhrm-YW2i9O6J6Fhjz2h7q6_sxvwd2yw%40mail.gmail.com%3E">HBase, - mail # user - Lease does not exist exceptions</link> - </para> - </section> - <section - xml:id="trouble.client.scarylogs"> - <title>Shell or client application throws lots of scary exceptions during normal - operation</title> - <para>Since 0.20.0, the default log level for <code>org.apache.hadoop.hbase.*</code> is DEBUG. </para> - <para> On your clients, edit <filename>$HBASE_HOME/conf/log4j.properties</filename> and change - this: <code>log4j.logger.org.apache.hadoop.hbase=DEBUG</code> to this: - <code>log4j.logger.org.apache.hadoop.hbase=INFO</code>, or even - <code>log4j.logger.org.apache.hadoop.hbase=WARN</code>. </para> - </section> - <section - xml:id="trouble.client.longpauseswithcompression"> - <title>Long Client Pauses With Compression</title> - <para>This is a fairly frequent question on the Apache HBase dist-list. The scenario is that a - client is typically inserting a lot of data into a relatively un-optimized HBase cluster.
- Compression can exacerbate the pauses, although it is not the source of the problem.</para> - <para>See <xref - linkend="precreate.regions" /> on the pattern for pre-creating regions and confirm that - the table isn't starting with a single region.</para> - <para>See <xref - linkend="perf.configurations" /> for cluster configuration, particularly - <code>hbase.hstore.blockingStoreFiles</code>, - <code>hbase.hregion.memstore.block.multiplier</code>, <code>MAX_FILESIZE</code> (region - size), and <code>MEMSTORE_FLUSHSIZE.</code> - </para> - <para>A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes - blocked on the MemStores, which are blocked by the flusher thread, which is blocked because - there are too many files to compact, because the compactor is given too many small files to - compact and has to compact the same data repeatedly. This situation can occur even with - minor compactions. Compounding this situation, Apache HBase doesn't compress data in memory. - Thus, the 64MB that lives in the MemStore could become a 6MB file after compression - which - results in a smaller StoreFile. The upside is that more data is packed into the same region, - but performance is achieved by being able to write larger files - which is why HBase waits - until the flush size is reached before writing a new StoreFile. And smaller StoreFiles become targets for - compaction. Without compression the files are much bigger and don't need as much compaction; - however, this is at the expense of I/O. </para> - <para> For additional information, see this thread on <link - xlink:href="http://search-hadoop.com/m/WUnLM6ojHm1/Long+client+pauses+with+compression&subj=Long+client+pauses+with+compression">Long - client pauses with compression</link>.
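The arithmetic behind the explanation above can be sketched as follows. The figures (64 MB memstore, roughly 6 MB on disk after compression, 1 GB region) mirror the text's example and are illustrative, not measurements.

```python
# Illustrative arithmetic for why compression can multiply compaction work.
# HBase compresses on flush, not in memory, so a 64 MB memstore may land
# on disk as a ~6 MB StoreFile (hypothetical figures from the text).

def flushes_to_fill_region(region_size_mb, on_disk_flush_mb):
    """How many flush files accumulate before a region reaches its
    configured size, ignoring compactions along the way."""
    return region_size_mb // on_disk_flush_mb

# For a hypothetical 1 GB region:
compressed = flushes_to_fill_region(1024, 6)     # ~6 MB files after compression
uncompressed = flushes_to_fill_region(1024, 64)  # raw 64 MB flushes
assert compressed == 170
assert uncompressed == 16
# Roughly ten times the files means the compactor re-reads and re-writes
# the same data far more often, which is what stalls the flusher and,
# eventually, client Puts.
```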
</para> - </section> - <section xml:id="trouble.client.security.rpc.krb"> - <title>Secure Client Connect ([Caused by GSSException: No valid credentials provided...])</title> - <para>You may encounter the following error:</para> - <screen>Secure Client Connect ([Caused by GSSException: No valid credentials provided - (Mechanism level: Request is a replay (34) V PROCESS_TGS)])</screen> - <para> This issue is caused by bugs in the MIT Kerberos replay_cache component, <link - xlink:href="http://krbdev.mit.edu/rt/Ticket/Display.html?id=1201">#1201</link> and <link - xlink:href="http://krbdev.mit.edu/rt/Ticket/Display.html?id=5924">#5924</link>. These bugs - caused the old version of krb5-server to erroneously block subsequent requests sent from a - Principal. This caused krb5-server to block the connections sent from one Client (one HTable - instance with multi-threading connection instances for each regionserver); messages such as - <literal>Request is a replay (34)</literal> are logged in the client log. You can ignore - the messages, because HTable will retry 5 * 10 (50) times for each failed connection by - default. HTable will throw IOException if any connection to the regionserver fails after the - retries, so that the user client code for the HTable instance can handle it further. </para> - <para> Alternatively, update krb5-server to a version which solves these issues, such as - krb5-server-1.10.3. See JIRA <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-10379">HBASE-10379</link> for more - details.
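The retry budget described above can be sketched numerically. The 5 and 10 come from the text's "5 * 10 (50)"; treat them as an example of the behavior, not as authoritative defaults for your HBase version.

```python
# Illustrative sketch of the retry budget described in the text: the
# client retries each failed connection some number of times before
# surfacing an IOException to user code.

def total_attempts(retries_per_round=5, rounds=10):
    # 5 * 10 = 50, per the text's example (assumed figures).
    return retries_per_round * rounds

def should_surface_error(failed_attempts, retries_per_round=5, rounds=10):
    """Only after the whole budget is exhausted does user code see an
    IOException; transient 'Request is a replay (34)' noise before that
    can be ignored."""
    return failed_attempts >= total_attempts(retries_per_round, rounds)

assert total_attempts() == 50
assert not should_surface_error(34)  # still inside the retry budget
assert should_surface_error(50)      # budget exhausted: handle the IOException
```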
</para> - </section> - <section - xml:id="trouble.client.zookeeper"> - <title>ZooKeeper Client Connection Errors</title> - <para>Errors like this...</para> - <programlisting> -11/07/05 11:26:41 WARN zookeeper.ClientCnxn: Session 0x0 for server null, - unexpected error, closing socket connection and attempting reconnect - java.net.ConnectException: Connection refused: no further information - at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) - at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) - at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) - 11/07/05 11:26:43 INFO zookeeper.ClientCnxn: Opening socket connection to - server localhost/127.0.0.1:2181 - 11/07/05 11:26:44 WARN zookeeper.ClientCnxn: Session 0x0 for server null, - unexpected error, closing socket connection and attempting reconnect - java.net.ConnectException: Connection refused: no further information - at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) - at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) - at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078) - 11/07/05 11:26:45 INFO zookeeper.ClientCnxn: Opening socket connection to - server localhost/127.0.0.1:2181 -</programlisting> - <para>... are either due to ZooKeeper being down, or unreachable due to network issues. </para> - <para>The utility <xref - linkend="trouble.tools.builtin.zkcli" /> may help investigate ZooKeeper issues. 
</para> - </section> - <section - xml:id="trouble.client.oome.directmemory.leak"> - <title>Client running out of memory though heap size seems to be stable (but the - off-heap/direct heap keeps growing)</title> - <para> You are likely running into the issue that is described and worked through in the mail - thread <link - xlink:href="http://search-hadoop.com/m/ubhrX8KvcH/Suspected+memory+leak&subj=Re+Suspected+memory+leak">HBase, - mail # user - Suspected memory leak</link> and continued over in <link - xlink:href="http://search-hadoop.com/m/p2Agc1Zy7Va/MaxDirectMemorySize+Was%253A+Suspected+memory+leak&subj=Re+FeedbackRe+Suspected+memory+leak">HBase, - mail # dev - FeedbackRe: Suspected memory leak</link>. A workaround is passing your - client-side JVM a reasonable value for <code>-XX:MaxDirectMemorySize</code>. By default, the - <varname>MaxDirectMemorySize</varname> is equal to your <code>-Xmx</code> max heapsize - setting (if <code>-Xmx</code> is set). Try setting it to something smaller (for example, one - user had success setting it to <code>1g</code> when they had a client-side heap of - <code>12g</code>). If you set it too small, it will bring on <code>FullGCs</code>, so keep - it a bit hefty. You want to make this setting client-side only, especially if you are running - the new experimental server-side off-heap cache, since this feature depends on being able to - use big direct buffers. (You may have to keep separate client-side and server-side config - dirs.) </para> - - </section> - <section - xml:id="trouble.client.slowdown.admin"> - <title>Client Slowdown When Calling Admin Methods (flush, compact, etc.)</title> - <para> This is a client issue fixed by <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-5073">HBASE-5073</link> in 0.90.6. - There was a ZooKeeper leak in the client and the client was getting pummeled by ZooKeeper - events with each additional invocation of the admin API.
</para> - </section> - - <section - xml:id="trouble.client.security.rpc"> - <title>Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided - (Mechanism level: Failed to find any Kerberos tgt)])</title> - <para> There can be several causes that produce this symptom. </para> - <para> First, check that you have a valid Kerberos ticket. One is required in order to set up - communication with a secure Apache HBase cluster. Examine the ticket currently in the - credential cache, if any, by running the klist command line utility. If no ticket is listed, - you must obtain a ticket by running the kinit command with either a keytab specified, or by - interactively entering a password for the desired principal. </para> - <para> Then, consult the <link - xlink:href="http://docs.oracle.com/javase/1.5.0/docs/guide/security/jgss/tutorials/Troubleshooting.html">Java - Security Guide troubleshooting section</link>. The most common problem addressed there is - resolved by setting the javax.security.auth.useSubjectCredsOnly system property value to false. </para> - <para> Because of a change in the format in which MIT Kerberos writes its credentials cache, - there is a bug in the Oracle JDK 6 Update 26 and earlier that causes Java to be unable to - read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher. If - you have this problematic combination of components in your environment, to work around this - problem, first log in with kinit and then immediately refresh the credential cache with - kinit -R. The refresh will rewrite the credential cache without the problematic formatting. </para> - <para> Finally, depending on your Kerberos configuration, you may need to install the <link - xlink:href="http://docs.oracle.com/javase/1.4.2/docs/guide/security/jce/JCERefGuide.html">Java - Cryptography Extension</link>, or JCE. Ensure the JCE jars are on the classpath on both - server and client systems.
</para> - <para> You may also need to download the <link - xlink:href="http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html">unlimited - strength JCE policy files</link>. Uncompress and extract the downloaded file, and install - the policy jars into <java-home>/lib/security. </para> - </section> - - </section> - - <section - xml:id="trouble.mapreduce"> - <title>MapReduce</title> - <section - xml:id="trouble.mapreduce.local"> - <title>You Think You're On The Cluster, But You're Actually Local</title> - <para>The following stacktrace happened using <code>ImportTsv</code>, but things like this - can happen on any job with a mis-configuration.</para> - <programlisting> - WARN mapred.LocalJobRunner: job_local_0001 -java.lang.IllegalArgumentException: Can't read partitions file - at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:111) - at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) - at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) - at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560) - at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639) - at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) - at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) -Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
- at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383) - at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) - at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:776) - at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424) - at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419) - at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:296) -</programlisting> - <para>Do you see the critical portion of the stack? It's this line:</para> - <programlisting> -at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) -</programlisting> - <para>LocalJobRunner means the job is running locally, not on the cluster. </para> - - <para>To solve this problem, you should run your MR job with your - <code>HADOOP_CLASSPATH</code> set to include the HBase dependencies. The "hbase classpath" - utility can be used to do this easily. For example (substitute VERSION with your HBase - version):</para> - <programlisting language="bourne"> - HADOOP_CLASSPATH=`hbase classpath` hadoop jar $HBASE_HOME/hbase-server-VERSION.jar rowcounter usertable - </programlisting> - <para>See <link - xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath"> - http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath</link> - for more information on HBase MapReduce jobs and classpaths.
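If you collect job output programmatically, the symptom above is easy to check for mechanically. The helper below is a sketch for illustration, not part of any HBase or Hadoop API.

```python
# A small illustrative check for the symptom above: if LocalJobRunner
# appears in a job's stack trace, the job ran locally rather than on
# the cluster. This helper is a sketch, not part of any HBase API.

def ran_locally(stacktrace: str) -> bool:
    return "org.apache.hadoop.mapred.LocalJobRunner" in stacktrace

trace = """\
java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
"""
assert ran_locally(trace)
assert not ran_locally(
    "at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)")
```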
</para> - </section> - <section xml:id="trouble.hbasezerocopybytestring"> - <title>Launching a job, you get java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString or class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString</title> - <para>See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304 Running an hbase job jar: IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString</link> and <link xlink:href="https://issues.apache.org/jira/browse/HBASE-11118">HBASE-11118 non environment variable solution for "IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString"</link>. The issue can also show up - when trying to run spark jobs. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-10877">HBASE-10877 HBase non-retriable exception list should be expanded</link>. - </para> - </section> - </section> - - <section - xml:id="trouble.namenode"> - <title>NameNode</title> - <para>For more information on the NameNode, see <xref - linkend="arch.hdfs" />. </para> - <section - xml:id="trouble.namenode.disk"> - <title>HDFS Utilization of Tables and Regions</title> - <para>To determine how much space HBase is using on HDFS use the <code>hadoop</code> shell - commands from the NameNode. For example... </para> - <para><programlisting language="bourne">hadoop fs -dus /hbase/</programlisting> ...returns the summarized disk - utilization for all HBase objects. </para> - <para><programlisting language="bourne">hadoop fs -dus /hbase/myTable</programlisting> ...returns the summarized - disk utilization for the HBase table 'myTable'. 
</para> - <para><programlisting language="bourne">hadoop fs -du /hbase/myTable</programlisting> ...returns a list of the - regions under the HBase table 'myTable' and their disk utilization. </para> - <para>For more information on HDFS shell commands, see the <link - xlink:href="http://hadoop.apache.org/common/docs/current/file_system_shell.html">HDFS - FileSystem Shell documentation</link>. </para> - </section> - <section - xml:id="trouble.namenode.hbase.objects"> - <title>Browsing HDFS for HBase Objects</title> - <para>Sometimes it will be necessary to explore the HBase objects that exist on HDFS. These - objects could include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc. The - easiest way to do this is with the NameNode web application that runs on port 50070. The - NameNode web application will provide links to all the DataNodes in the cluster so that - they can be browsed seamlessly. </para> - <para>The HDFS directory structure of HBase tables in the cluster is... - <programlisting> -<filename>/hbase</filename> - <filename>/<Table></filename> (Tables in the cluster) - <filename>/<Region></filename> (Regions for the table) - <filename>/<ColumnFamily></filename> (ColumnFamilies for the Region for the table) - <filename>/<StoreFile></filename> (StoreFiles for the ColumnFamily for the Regions for the table) - </programlisting> - </para> - <para>The HDFS directory structure of HBase WAL is... - <programlisting> -<filename>/hbase</filename> - <filename>/.logs</filename> - <filename>/<RegionServer></filename> (RegionServers) - <filename>/<WAL></filename> (WAL files for the RegionServer) - </programlisting> - </para> - <para>See the <link - xlink:href="http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html">HDFS User - Guide</link> for other non-shell diagnostic utilities like <code>fsck</code>.
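Given a listing of paths in the layout described above, counting StoreFiles per ColumnFamily is a quick way to spot compaction candidates. The sketch below assumes the /hbase/&lt;Table&gt;/&lt;Region&gt;/&lt;ColumnFamily&gt;/&lt;StoreFile&gt; layout; the paths in the example are made up.

```python
# Illustrative sketch: given HDFS paths laid out as
# /hbase/<Table>/<Region>/<ColumnFamily>/<StoreFile>, count StoreFiles
# per ColumnFamily. A high count can indicate a pending major
# compaction. The paths below are invented for the example.
from collections import Counter

def storefiles_per_cf(paths):
    counts = Counter()
    for p in paths:
        parts = p.strip("/").split("/")
        if len(parts) == 5 and parts[0] == "hbase":
            _, table, region, cf, _storefile = parts
            counts[(table, region, cf)] += 1
    return counts

listing = [
    "/hbase/myTable/r1/cf1/sf-a",
    "/hbase/myTable/r1/cf1/sf-b",
    "/hbase/myTable/r1/cf1/sf-c",
    "/hbase/myTable/r1/cf2/sf-a",
]
counts = storefiles_per_cf(listing)
assert counts[("myTable", "r1", "cf1")] == 3
assert counts[("myTable", "r1", "cf2")] == 1
```

In practice you would feed it the paths from `hadoop fs -du /hbase/myTable` (shown earlier in this section) rather than a hard-coded list.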
</para> - <section - xml:id="trouble.namenode.0size.hlogs"> - <title>Zero size WALs with data in them</title> - <para>Problem: when getting a listing of all the files in a region server's .logs directory, - one file has a size of 0 but it contains data.</para> - <para>Answer: It's an HDFS quirk. A file that's currently being written to will appear to - have a size of 0, but once it's closed it will show its true size.</para> - </section> - <section - xml:id="trouble.namenode.uncompaction"> - <title>Use Cases</title> - <para>Two common use cases for querying HDFS for HBase objects are researching the degree of - uncompaction of a table and evaluating the results of a major compaction. If there are a large number of StoreFiles for each ColumnFamily, - it could indicate the need for a major compaction. Additionally, after a major compaction, - if the resulting StoreFile is "small", it could indicate the need for a reduction of - ColumnFamilies for the table. </para> - </section> - - </section> - </section> - - <section - xml:id="trouble.network"> - <title>Network</title> - <section - xml:id="trouble.network.spikes"> - <title>Network Spikes</title> - <para>If you are seeing periodic network spikes you might want to check the - <code>compactionQueues</code> to see if major compactions are happening. </para> - <para>See <xref - linkend="managed.compactions" /> for more information on managing compactions. </para> - </section> - <section - xml:id="trouble.network.loopback"> - <title>Loopback IP</title> - <para>HBase expects the loopback IP Address to be 127.0.0.1. See the Getting Started section - on <xref - linkend="loopback.ip" />. </para> - </section> - <section - xml:id="trouble.network.ints"> - <title>Network Interfaces</title> - <para>Are all the network interfaces functioning correctly? Are you sure? See the - Troubleshooting Case Study in <xref - linkend="trouble.casestudy" />.
</para> - </section> - - </section> - - <section - xml:id="trouble.rs"> - <title>RegionServer</title> - <para>For more information on the RegionServers, see <xref - linkend="regionserver.arch" />. </para> - <section - xml:id="trouble.rs.startup"> - <title>Startup Errors</title> - <section - xml:id="trouble.rs.startup.master-no-region"> - <title>Master Starts, But RegionServers Do Not</title> - <para>The Master believes the RegionServers have the IP of 127.0.0.1 - which is localhost - and resolves to the master's own localhost. </para> - <para>The RegionServers are erroneously informing the Master that their IP addresses are - 127.0.0.1. </para> - <para>Modify <filename>/etc/hosts</filename> on the region servers, from...</para> - <programlisting> -# Do not remove the following line, or various programs -# that require network functionality will fail. -127.0.0.1 fully.qualified.regionservername regionservername localhost.localdomain localhost -::1 localhost6.localdomain6 localhost6 - </programlisting> - <para>... to (removing the master node's name from localhost)...</para> - <programlisting> -# Do not remove the following line, or various programs -# that require network functionality will fail. -127.0.0.1 localhost.localdomain localhost -::1 localhost6.localdomain6 localhost6 - </programlisting> - </section> - - <section - xml:id="trouble.rs.startup.compression"> - <title>Compression Link Errors</title> - <para> Since compression algorithms such as LZO need to be installed and configured on each - cluster this is a frequent source of startup error. If you see messages like - this...</para> - <programlisting> -11/02/20 01:32:15 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library -java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path - at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734) - at java.lang.Runtime.loadLibrary0(Runtime.java:823) - at java.lang.System.loadLibrary(System.java:1028) - </programlisting> - <para>.. 
then there is a path issue with the compression libraries. See the Configuration - section on <link - linkend="lzo.compression">LZO compression configuration</link>. </para> - </section> - </section> - <section - xml:id="trouble.rs.runtime"> - <title>Runtime Errors</title> - - <section - xml:id="trouble.rs.runtime.hang"> - <title>RegionServer Hanging</title> - <para> Are you running an old JVM (< 1.6.0_u21?)? When you look at a thread dump, does it - look like threads are BLOCKED but no thread holds the lock they are all blocked on? See <link - xlink:href="https://issues.apache.org/jira/browse/HBASE-3622">HBASE 3622 Deadlock in - HBaseServer (JVM bug?)</link>. Adding <code>-XX:+UseMembar</code> to the HBase - <varname>HBASE_OPTS</varname> in <filename>conf/hbase-env.sh</filename> may fix it. - </para> - </section> - <section - xml:id="trouble.rs.runtime.filehandles"> - <title>java.io.IOException...(Too many open files)</title> - <para> If you see log messages like this...</para> - <programlisting> -2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: -Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files - at java.io.UnixFileSystem.createFileExclusively(Native Method) - at java.io.File.createNewFile(File.java:883) -</programlisting> - <para>... see the Getting Started section on <link - linkend="ulimit">ulimit and nproc configuration</link>. </para> - </section> - <section - xml:id="trouble.rs.runtime.xceivers"> - <title>xceiverCount 258 exceeds the limit of concurrent xcievers 256</title> - <para> This typically shows up in the DataNode logs. </para> - <para> See the Getting Started section on <link - linkend="dfs.datanode.max.transfer.threads">xceivers configuration</link>.
</para> - </section> - <section - xml:id="trouble.rs.runtime.oom-nt"> - <title>System instability, and the presence of "java.lang.OutOfMemoryError: unable to create - new native thread" exceptions in HDFS DataNode logs or those of any system daemon</title> - <para> See the Getting Started section on <link - linkend="ulimit">ulimit and nproc configuration</link>. The default on recent Linux - distributions is 1024 - which is far too low for HBase. </para> - </section> - <section - xml:id="trouble.rs.runtime.gc"> - <title>DFS instability and/or RegionServer lease timeouts</title> - <para> If you see warning messages like this...</para> - <programlisting> -2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000 -2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000 -2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying - </programlisting> - <para>... or see messages about full GC compactions, then you may be experiencing full GCs. </para> - </section> - <section - xml:id="trouble.rs.runtime.nolivenodes"> - <title>"No live nodes contain current block" and/or YouAreDeadException</title> - <para> These errors can happen either when running out of OS file handles or in periods of - severe network problems where the nodes are unreachable. </para> - <para> See the Getting Started section on <link - linkend="ulimit">ulimit and nproc configuration</link> and check your network.
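Raising the limits is typically done in /etc/security/limits.conf. The entries below are illustrative; adjust the user name to whichever account actually runs the HBase and HDFS daemons, and pick limits appropriate to your hardware.

```
# /etc/security/limits.conf -- illustrative values; "hadoop" is an
# assumed user name for the account running the HBase/HDFS daemons.
hadoop  -  nofile  32768
hadoop  -  nproc   32000
```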
</para> - </section> - <section - xml:id="trouble.rs.runtime.zkexpired"> - <title>ZooKeeper SessionExpired events</title> - <para>Master or RegionServers shutting down with messages like those in the logs: </para> - <programlisting> -WARN org.apache.zookeeper.ClientCnxn: Exception -closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec -java.io.IOException: TIMED OUT - at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906) -WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000 -INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT -INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT] -INFO org.apache.zookeeper.ClientCnxn: Server connection successful -WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e -java.io.IOException: Session Expired - at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589) - at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709) - at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945) -ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired - </programlisting> - <para> The JVM is doing a long-running garbage collection which pauses every thread - (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send - heartbeats, the session times out. By design, we shut down any node that isn't able to - contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that - may already be assigned elsewhere.
</para> - - <itemizedlist> - <listitem> - <para>Make sure you give plenty of RAM (in <filename>hbase-env.sh</filename>), the - default of 1GB won't be able to sustain long running imports.</para> - </listitem> - <listitem> - <para>Make sure you don't swap, the JVM never behaves well under swapping.</para> - </listitem> - <listitem> - <para>Make sure you are not CPU starving the RegionServer thread. For example, if you - are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you - are probably starving the RegionServer enough to create longer garbage collection - pauses.</para> - </listitem> - <listitem> - <para>Increase the ZooKeeper session timeout.</para> - </listitem> - </itemizedlist> - <para>If you wish to increase the session timeout, add the following to your - <filename>hbase-site.xml</filename> to increase the timeout from the default of 60 - seconds to 120 seconds. </para> - <programlisting language="xml"> -<![CDATA[<property> - <name>zookeeper.session.timeout</name> - <value>120000</value> -</property> -<property> - <name>hbase.zookeeper.property.tickTime</name> - <value>6000</value> -</property>]]> - </programlisting> - - <para> - Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least - that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead - recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (hence having - less garbage to collect per machine). - </para> - <para> - If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading. - </para> -<para>See <xref linkend="trouble.zookeeper.general"/> for other general information about ZooKeeper troubleshooting.
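The reason the configuration above raises hbase.zookeeper.property.tickTime along with the session timeout is that ZooKeeper only grants session timeouts within 2 to 20 times its tickTime (per the ZooKeeper administrator documentation). The sketch below illustrates the clamp; it is a model of the negotiation, not ZooKeeper code.

```python
# Illustrative model of ZooKeeper session-timeout negotiation: the
# server clamps the client's requested timeout to [2*tickTime, 20*tickTime].

def negotiated_timeout(requested_ms, tick_ms):
    lo, hi = 2 * tick_ms, 20 * tick_ms
    return max(lo, min(requested_ms, hi))

# With the default tickTime of 2000 ms, a 120 s request is clamped to 40 s:
assert negotiated_timeout(120000, 2000) == 40000
# Raising tickTime to 6000 ms lets the full 120 s take effect:
assert negotiated_timeout(120000, 6000) == 120000
```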
-</para> </section> - <section xml:id="trouble.rs.runtime.notservingregion"> - <title>NotServingRegionException</title> - <para>This exception is "normal" when found in the RegionServer logs at DEBUG level. This exception is returned back to the client - and then the client goes back to hbase:meta to find the new location of the moved region.</para> - <para>However, if the NotServingRegionException is logged at ERROR, then the client ran out of retries and something is probably wrong.</para> - </section> - <section xml:id="trouble.rs.runtime.double_listed_regions"> - <title>Regions listed by domain name, then IP</title> - <para> - Fix your DNS. In versions of Apache HBase before 0.92.x, reverse DNS needs to give the same answer - as the forward lookup. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-3431">HBASE 3431 - RegionServer is not using the name given it by the master; double entry in master listing of servers</link> for gory details. - </para> - </section> - <section xml:id="brand.new.compressor"> - <title>Logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got - brand-new compressor' messages</title> - <para>This means you are not using the native versions of the compression - libraries. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-1900">HBASE-1900 Put back native support when hadoop 0.21 is released</link>. - Copy the native libs from Hadoop under the HBase lib dir, or - symlink them into place, and the message should go away. - </para> - </section> - <section xml:id="trouble.rs.runtime.client_went_away"> - <title>Server handler X on 60020 caught: java.nio.channels.ClosedChannelException</title> - <para> - If you see this type of message it means that the region server was trying to read/send data from/to a client but - it already went away.
Typical causes for this are if the client was killed (you see a storm of messages like this when a MapReduce
-          job is killed or fails) or if the client receives a SocketTimeoutException. It's harmless, but you should consider digging in
-          a bit more if you aren't doing something to trigger them.
-        </para>
-      </section>
-
-    </section>
-    <section>
-      <title>Snapshot Errors Due to Reverse DNS</title>
-      <para>Several operations within HBase, including snapshots, rely on properly configured
-        reverse DNS. Some environments, such as Amazon EC2, have trouble with reverse DNS. If you
-        see errors like the following on your RegionServers, check your reverse DNS configuration:</para>
-      <screen>
-2013-05-01 00:04:56,356 DEBUG org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'backup1'
-coordinator notified of 'acquire', waiting on 'reached' or 'abort' from coordinator.
-      </screen>
-      <para>In general, the hostname reported by the RegionServer needs to be the same as the
-        hostname the Master is trying to reach. You can see a hostname mismatch by looking for the
-        following type of message in the RegionServer's logs at start-up.</para>
-      <screen>
-2013-05-01 00:03:00,614 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us hostname
-to use. Was=myhost-1234, Now=ip-10-55-88-99.ec2.internal
-      </screen>
-    </section>
-    <section xml:id="trouble.rs.shutdown">
-      <title>Shutdown Errors</title>
-      <para />
-    </section>
-
-  </section>
-
-  <section xml:id="trouble.master">
-    <title>Master</title>
-    <para>For more information on the Master, see <xref linkend="master"/>.
- </para>
-    <section xml:id="trouble.master.startup">
-      <title>Startup Errors</title>
-      <section xml:id="trouble.master.startup.migration">
-        <title>Master says that you need to run the hbase migrations script</title>
-        <para>Upon running that, the hbase migrations script says no files in root directory.</para>
-        <para>HBase expects the root directory to either not exist, or to have already been initialized by a previous run of HBase. If you create the directory for HBase yourself using Hadoop DFS, so that it exists but is empty and uninitialized, this error will occur.
-          Make sure the HBase root directory either does not currently exist or has been initialized by a previous run of HBase. A surefire solution is to use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
-        </para>
-      </section>
-      <section xml:id="trouble.master.startup.zk.buffer">
-        <title>Packet len6080218 is out of range!</title>
-        <para>If you have many regions on your cluster and you see an error
-          like the one reported in this section's title in your logs, see
-          <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4246">HBASE-4246 Cluster with too many regions cannot withstand some master failover scenarios</link>.</para>
-      </section>
-
-    </section>
-    <section xml:id="trouble.master.shutdown">
-      <title>Shutdown Errors</title>
-      <para/>
-    </section>
-
-  </section>
-
-  <section xml:id="trouble.zookeeper">
-    <title>ZooKeeper</title>
-    <section xml:id="trouble.zookeeper.startup">
-      <title>Startup Errors</title>
-      <section xml:id="trouble.zookeeper.startup.address">
-        <title>Could not find my address: xyz in list of ZooKeeper quorum servers</title>
-        <para>A ZooKeeper server wasn't able to start and throws this error, where xyz is the name of your server.</para>
-        <para>This is a name lookup problem. HBase tries to start a ZooKeeper server on some machine but that machine isn't able to find itself in the <varname>hbase.zookeeper.quorum</varname> configuration.
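</para>
-        <para>A quick way to check whether a host can find itself by name (an illustrative sketch; the hostnames involved are your own) is:</para>

```shell
# Illustrative check that this host can resolve its own name, the same kind
# of lookup HBase performs when matching itself against hbase.zookeeper.quorum.
name="$(hostname)"
echo "local hostname: ${name}"
# getent consults the same NSS sources (hosts file, DNS) the JVM resolver uses.
getent hosts "${name}" || echo "WARNING: ${name} does not resolve locally"
```

-        <para>On a correctly configured host, the name printed matches one of the entries in <varname>hbase.zookeeper.quorum</varname>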
- </para>
-        <para>Use the hostname presented in the error message instead of the value you used. If you have a DNS server, you can set <varname>hbase.zookeeper.dns.interface</varname> and <varname>hbase.zookeeper.dns.nameserver</varname> in <filename>hbase-site.xml</filename> to make sure it resolves to the correct FQDN.
-        </para>
-      </section>
-
-    </section>
-    <section xml:id="trouble.zookeeper.general">
-      <title>ZooKeeper, The Cluster Canary</title>
-      <para>ZooKeeper is the cluster's "canary in the mineshaft". It will be the first to notice issues, if any, so making sure it is happy is the shortcut to a humming cluster.
-      </para>
-      <para>
-        See the <link xlink:href="http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting">ZooKeeper Operating Environment Troubleshooting</link> page. It has suggestions and tools for checking disk and networking performance; i.e. the operating environment your ZooKeeper and HBase are running in.
-      </para>
-      <para>Additionally, the utility <xref linkend="trouble.tools.builtin.zkcli"/> may help investigate ZooKeeper issues.
-      </para>
-    </section>
-
-  </section>
-
-  <section xml:id="trouble.ec2">
-    <title>Amazon EC2</title>
-    <section xml:id="trouble.ec2.zookeeper">
-      <title>ZooKeeper does not seem to work on Amazon EC2</title>
-      <para>HBase does not start when deployed as Amazon EC2 instances. Exceptions like the following appear in the Master and/or RegionServer logs: </para>
-      <programlisting>
-  2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting
-  connection to server ec2-174-129-15-236.compute-1.amazonaws.com/10.244.9.171:2181
-  2009-10-19 11:52:27,032 WARN org.apache.zookeeper.ClientCnxn: Exception
-  closing session 0x0 to sun.nio.ch.SelectionKeyImpl@656dc861
-  java.net.ConnectException: Connection refused
-      </programlisting>
-      <para>
-        Security group policy is blocking the ZooKeeper port on a public address.
-        Use the internal EC2 host names when configuring the ZooKeeper quorum peer list.
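</para>
-      <para>For example (the internal DNS names below are illustrative, not from any real cluster), the quorum could be configured in <filename>hbase-site.xml</filename> as:</para>

```xml
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-10-244-9-171.ec2.internal,ip-10-244-9-172.ec2.internal,ip-10-244-9-173.ec2.internal</value>
</property>
```

-      <para>The internal names resolve to private addresses reachable between instances in the same security group, whereas connections to the public names may be refused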
- </para>
-    </section>
-    <section xml:id="trouble.ec2.instability">
-      <title>Instability on Amazon EC2</title>
-      <para>Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list. Search for old threads using <link xlink:href="http://search-hadoop.com/">Search Hadoop</link>.
-      </para>
-    </section>
-    <section xml:id="trouble.ec2.connection">
-      <title>Remote Java Connection into EC2 Cluster Not Working</title>
-      <para>
-        See Andrew's answer here, up on the user list: <link xlink:href="http://search-hadoop.com/m/sPdqNFAwyg2">Remote Java client connection into EC2 instance</link>.
-      </para>
-    </section>
-
-  </section>
-
-  <section xml:id="trouble.versions">
-    <title>HBase and Hadoop version issues</title>
-    <section xml:id="trouble.versions.205">
-      <title><code>NoClassDefFoundError</code> when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)</title>
-      <para>Apache HBase 0.90.x does not ship with hadoop-0.20.205.x, etc. To make it run, you need to replace the Hadoop
-        jars that Apache HBase shipped with in its <filename>lib</filename> directory with those of the Hadoop you want to
-        run HBase on.
If, even after replacing the Hadoop jars, you get the exception below:</para>
-<programlisting>
-sv4r6s38: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
-sv4r6s38: at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
-sv4r6s38: at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
-sv4r6s38: at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
-sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
-sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)
-sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:229)
-sv4r6s38: at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:83)
-sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:202)
-sv4r6s38: at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)
-</programlisting>
-      <para>you need to copy the
-        <filename>commons-configuration-X.jar</filename> found in your Hadoop's
-        <filename>lib</filename> directory into <filename>hbase/lib</filename>. That should fix the above complaint. </para>
-    </section>
-
-    <section
-      xml:id="trouble.wrong.version">
-      <title>...cannot communicate with client version...</title>
-      <para>If you see something like the following in your logs <computeroutput>... 2012-09-24
-          10:20:52,168 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting
-          shutdown. org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate
-          with client version 4 ...</computeroutput> ...are you trying to talk to Hadoop 2.0.x
-        from an HBase that has a Hadoop 1.0.x client?
Use the HBase built against Hadoop 2.0 or
-        rebuild your HBase passing the <command>-Dhadoop.profile=2.0</command> property to Maven
-        (See <xref
-          linkend="maven.build.hadoop" /> for more). </para>
-
-    </section>
-  </section>
-  <section>
-    <title>IPC Configuration Conflicts with Hadoop</title>
-    <para>If the Hadoop configuration is loaded after the HBase configuration, and you have
-      configured custom IPC settings in both HBase and Hadoop, the Hadoop values may overwrite the
-      HBase values. There is normally no need to change these settings for HBase, so this problem is
-      an edge case. However, <link
-        xlink:href="https://issues.apache.org/jira/browse/HBASE-11492">HBASE-11492</link> renames
-      these settings for HBase to remove the chance of a conflict. Each of the setting names has
-      been prefixed with <literal>hbase.</literal>, as shown in the following table. No action is
-      required related to these changes unless you are already experiencing a conflict.</para>
-    <para>These changes were backported to HBase 0.98.x and apply to all newer versions.</para>
-    <informaltable>
-      <tgroup
-        cols="2">
-        <thead>
-          <row>
-            <entry>Pre-0.98.x</entry>
-            <entry>0.98.x and Newer</entry>
-          </row>
-        </thead>
-        <tbody>
-          <row>
-            <entry><para><code>ipc.server.listen.queue.size</code></para></entry>
-            <entry><para><code>hbase.ipc.server.listen.queue.size</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.max.callqueue.size</code></para></entry>
-            <entry><para><code>hbase.ipc.server.max.callqueue.size</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.callqueue.handler.factor</code></para></entry>
-            <entry><para><code>hbase.ipc.server.callqueue.handler.factor</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.callqueue.read.share</code></para></entry>
-            <entry><para><code>hbase.ipc.server.callqueue.read.share</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.callqueue.type</code></para></entry>
-
<entry><para><code>hbase.ipc.server.callqueue.type</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.queue.max.call.delay</code></para></entry>
-            <entry><para><code>hbase.ipc.server.queue.max.call.delay</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.max.callqueue.length</code></para></entry>
-            <entry><para><code>hbase.ipc.server.max.callqueue.length</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.read.threadpool.size</code></para></entry>
-            <entry><para><code>hbase.ipc.server.read.threadpool.size</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.tcpkeepalive</code></para></entry>
-            <entry><para><code>hbase.ipc.server.tcpkeepalive</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.tcpnodelay</code></para></entry>
-            <entry><para><code>hbase.ipc.server.tcpnodelay</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.client.call.purge.timeout</code></para></entry>
-            <entry><para><code>hbase.ipc.client.call.purge.timeout</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.client.connection.maxidletime</code></para></entry>
-            <entry><para><code>hbase.ipc.client.connection.maxidletime</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.client.idlethreshold</code></para></entry>
-            <entry><para><code>hbase.ipc.client.idlethreshold</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.client.kill.max</code></para></entry>
-            <entry><para><code>hbase.ipc.client.kill.max</code></para></entry>
-          </row>
-          <row>
-            <entry><para><code>ipc.server.scan.vtime.weight</code></para></entry>
-            <entry><para><code>hbase.ipc.server.scan.vtime.weight</code></para></entry>
-          </row>
-        </tbody>
-      </tgroup>
-    </informaltable>
-  </section>
-
-  <section>
-    <title>HBase and HDFS</title>
-    <para>General configuration guidance for Apache HDFS is outside the scope of this guide.
Refer to - the documentation available at <link - xlink:href="http://hadoop.apache.org/">http://hadoop.apache.org/</link> for extensive - information about configuring HDFS. This section deals with HDFS in terms of HBase. </para> - - <para>In most cases, HBase stores its data in Apache HDFS. This includes the HFiles containing - the data, as well as the write-ahead logs (WALs) which store data before it is written to the - HFiles and protect against RegionServer crashes. HDFS provides reliability and protection to - data in HBase because it is distributed. To operate with the most efficiency, HBase needs data - to be available locally. Therefore, it is a good practice to run an HDFS datanode on each - RegionServer.</para> - <variablelist> - <title>Important Information and Guidelines for HBase and HDFS</title> - <varlistentry> - <term>HBase is a client of HDFS.</term> - <listitem> - <para>HBase is an HDFS client, using the HDFS <code>DFSClient</code> class, and references - to this class appear in HBase logs with other HDFS client log messages.</para> - </listitem> - </varlistentry> - <varlistentry> - <term>Configuration is necessary in multiple places.</term> - <listitem> - <para>Some HDFS configurations relating to HBase need to be done at the HDFS (server) side. - Others must be done within HBase (at the client side). Other settings need - to be set at both the server and client side. - </para> - </listitem> - </varlistentry> - <varlistentry> - <term>Write errors which affect HBase may be logged in the HDFS logs rather than HBase logs.</term> - <listitem> - <para>When writing, HDFS pipelines communications from one datanode to another. HBase - communicates to both the HDFS namenode and datanode, using the HDFS client classes. - Communication problems between datanodes are logged in the HDFS logs, not the HBase - logs.</para> - <para>HDFS writes are always local when possible. 
HBase RegionServers should not
-            experience many write errors, because they write to the local datanode. If the datanode
-            cannot replicate the blocks, the errors are logged in HDFS, not in the HBase
-            RegionServer logs.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>HBase communicates with HDFS using two different ports.</term>
-
<TRUNCATED>