[Hadoop Wiki] Update of "Hbase/Troubleshooting" by Misty

Apache Wiki Mon, 19 Oct 2015 21:35:03 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hbase/Troubleshooting" page has been changed by Misty:
https://wiki.apache.org/hadoop/Hbase/Troubleshooting?action=diff&rev1=50&rev2=51

  This page is OBSOLETE.  See the Troubleshooting section in the HBase book 
(http://hbase.apache.org/book.html#trouble)
  
- == Contents ==
-  1. [[#A1|Problem: Master initializes, but Region Servers do not]]
-  1. [[#A2|Problem: Created Root Directory for HBase through Hadoop DFS]]
-  1. [[#A3|Problem: On migration, no files in root directory]]
-  1. [[#A4|Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 
256"]]
-  1. [[#A5|Problem: "No live nodes contain current block"]]
-  1. [[#A6|Problem: DFS instability and/or regionserver lease timeouts]]
-  1. [[#A7|Problem: Instability on Amazon EC2]]
-  1. [[#A8|Problem: Zookeeper SessionExpired events]]
-  1. [[#A9|Problem: Could not find my address: xyz in list of ZooKeeper quorum 
servers]]
-  1. [[#A10|Problem: Zookeeper does not seem to work on Amazon EC2]]
-  1. [[#A11|Problem: General operating environment issues -- zookeeper session 
timeouts, regionservers shutting down, etc.]]
-  1. [[#A12|Problem: Scanner performance is low]]
-  1. [[#A13|Problem: My shell or client application throws lots of scary 
exceptions during normal operation]]
-  1. [[#A14|Problem: Running a Scan or a MapReduce job over a full table fails 
with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes]]
-  1. [[#A15|Problem: System instability, and the presence of 
"java.lang.OutOfMemoryError: unable to create new native thread" exceptions in 
HDFS datanode logs or that of any system daemon]]
- 
- <<Anchor(1)>>
- 
- == 1. Problem: Master initializes, but Region Servers do not ==
- 
-  * The Master believes the Region Servers have the IP address 127.0.0.1, which is 
localhost and therefore resolves to the Master's own machine.
- 
- === Causes ===
-  * The Region Servers are erroneously informing the Master that their IP 
addresses are 127.0.0.1.
- 
- === Resolution ===
-  * Modify '''/etc/hosts''' on the region servers, from
-   . {{{
- # Do not remove the following line, or various programs
- # that require network functionality will fail.
- 127.0.0.1               fully.qualified.regionservername regionservername  
localhost.localdomain localhost
- ::1             localhost6.localdomain6 localhost6
- }}}
- 
-  * To the following (removing the region server's own name from the 127.0.0.1 line)
-   . {{{
- # Do not remove the following line, or various programs
- # that require network functionality will fail.
- 127.0.0.1               localhost.localdomain localhost
- ::1             localhost6.localdomain6 localhost6
- }}}
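
To confirm the symptom described above on a region server, it can help to check what the machine's own host name resolves to. The following is a minimal Python sketch, not part of HBase: the helper name `resolves_to_loopback` is made up for illustration, and it uses only the standard `socket` and `ipaddress` modules.

```python
import ipaddress
import socket


def resolves_to_loopback(hostname, resolver=socket.gethostbyname):
    """Return True if hostname resolves to a loopback address such as 127.0.0.1.

    resolver is injectable so the check can be exercised without touching DNS
    or /etc/hosts; by default the system resolver is used.
    """
    address = resolver(hostname)
    return ipaddress.ip_address(address).is_loopback


if __name__ == "__main__":
    # Hypothetical usage: run this on each region server.
    name = socket.getfqdn()
    if resolves_to_loopback(name):
        print(f"{name} resolves to loopback; check the 127.0.0.1 line in /etc/hosts")
    else:
        print(f"{name} resolves to a non-loopback address")
```

If the script reports a loopback address for the region server's fully qualified name, the /etc/hosts change shown above is the likely fix.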
- 
- <<Anchor(2)>>
- 
- == 2. Problem: Created Root Directory for HBase through Hadoop DFS ==
-  * On Startup, Master says that you need to run the hbase migrations script. 
Upon running that, the hbase migrations script says no files in root directory.
- 
- === Causes ===
-  * HBase expects the root directory to either not exist, or to have already 
been initialized by hbase running a previous time. If you create a new 
directory for HBase using Hadoop DFS, this error will occur.
- 
- === Resolution ===
-  * Make sure the HBase root directory either does not exist or has been 
initialized by a previous run of HBase. The surefire solution is to use 
Hadoop DFS to delete the HBase root directory and let HBase create and 
initialize it itself.
- 
- 
- <<Anchor(3)>>
- 
- == 3. Problem: On startup, Master says that you need to run the hbase 
migrations script ==
-  * On Startup, Master says that you need to run the hbase migrations script. 
Upon running that, the hbase migrations script says no files in root directory.
- 
- === Causes ===
-  * HBase expects the root directory to either not exist, or to have already 
been initialized by hbase running a previous time. If you create a new 
directory for HBase using Hadoop DFS, this error will occur.
- 
- === Resolution ===
-  * Make sure the HBase root directory either does not exist or has been 
initialized by a previous run of HBase. The surefire solution is to use 
Hadoop DFS to delete the HBase root directory and let HBase create and 
initialize it itself.
- 
- <<Anchor(4)>>
- 
- == 4. Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 
256" ==
-  * See the Troubleshooting section in the HBase book 
http://hbase.apache.org/book.html#trouble
- 
- <<Anchor(5)>>
- 
- == 5. Problem: "No live nodes contain current block" ==
-  * See the Troubleshooting section in the HBase book 
http://hbase.apache.org/book.html#trouble
- 
- 
- <<Anchor(6)>>
- 
- == 6. Problem: DFS instability and/or regionserver lease timeouts ==
-  * HBase regionserver leases expire during start up
-  * HBase daemons cannot find block locations in HDFS during start up or other 
periods of load
-  * HBase regionserver restarts after being unable to report to master:
- 
- {{{
- 2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
xxx ms, ten times longer than scheduled: 10000
- 2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
xxx ms, ten times longer than scheduled: 15000
- 2009-02-24 10:01:36,472 WARN 
org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master 
for xxx milliseconds - retrying
- }}}
- === Causes ===
-  * Slow host name resolution
-  * Very long garbage collection pauses in the RegionServer JVM: the ''default 
garbage collector'' of the HotSpot(TM) Java(TM) Virtual Machine runs a ''full gc'' 
on the ''old space'' when that space is full, and the old space can represent 
about 90% of the allocated heap. During a full gc the running program is 
stopped, and its timers with it. If the heap is mostly swapped out, 
and especially if it is larger than physical memory, the resulting swapping can 
overload I/O and take several minutes.
-  * Network bandwidth overcommitment
- 
- === Resolution ===
-  * Ensure that host name resolution latency is low, or use static entries in 
/etc/hosts
-  * Monitor the network and ensure that adequate bandwidth is available for 
HRPC transactions
-  * In accordance with your hardware, tune your heap space / garbage collector 
settings in the HBASE_OPTS variable of {{{$HBASE_CONF/hbase-env.sh}}}. Try the 
''concurrent garbage collector'' {{{(-XX:+UseConcMarkSweepGC)}}} to avoid 
stopping the threads during GC. Read these articles for more info about HotSpot GC 
settings:
-   * [[http://java.sun.com/docs/hotspot/gc1.4.2/faq.html|Garbage collector 
FAQ]] Quick overview
-   * 
[[http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html|Tuning 
garbage collector in Java SE 6]]
-  * For Java SE 6, some users have had success with {{{ 
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8 }}}.
-  * See HBase [[PerformanceTuning|Performance Tuning]] for more on JVM GC 
tuning.
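
One way to see how often and how long a daemon was stalled is to pull the Sleeper warnings out of its log. The sketch below is an illustration only: the function name `find_long_pauses` is hypothetical, and the regular expression is an assumption based on the log excerpts on this page, so the exact wording may differ between HBase versions.

```python
import re

# Assumed format, taken from the Sleeper warnings quoted on this page.
SLEEPER_RE = re.compile(
    r"We slept (\d+)\s*ms, ten times longer than scheduled: (\d+)"
)


def find_long_pauses(lines):
    """Yield (slept_ms, scheduled_ms) for every Sleeper warning in lines."""
    for line in lines:
        match = SLEEPER_RE.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python find_pauses.py < regionserver.log
    for slept, scheduled in find_long_pauses(sys.stdin):
        print(f"slept {slept} ms, scheduled {scheduled} ms")
```

Pauses of tens of seconds that line up with GC activity or swapping point at the JVM causes listed above, while short, frequent stalls are more consistent with name resolution or network problems.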
- 
- <<Anchor(7)>>
- 
- == 7. Problem: Instability on Amazon EC2 ==
-  * Questions on HBase and Amazon EC2 come up frequently on the HBase 
dist-list.  Search for old threads using SearchHadoop: 
http://www.search-hadoop.com
- 
- <<Anchor(8)>>
- 
- == 8. Problem: ZooKeeper SessionExpired events ==
-  * Master or Region Servers shut down with messages like these in the 
logs:
- 
- {{{
- WARN org.apache.zookeeper.ClientCnxn: Exception
- closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
- java.io.IOException: TIMED OUT
-        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
- WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer 
than scheduled: 5000
- INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server 
hostname/IP:PORT
- INFO org.apache.zookeeper.ClientCnxn: Priming connection to 
java.nio.channels.SocketChannel[connected local=/IP:PORT 
remote=hostname/IP:PORT]
- INFO org.apache.zookeeper.ClientCnxn: Server connection successful
- WARN org.apache.zookeeper.ClientCnxn: Exception closing session 
0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
- java.io.IOException: Session Expired
-        at 
org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
-        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
-        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
- ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session 
expired
- }}}
- === Causes ===
-  * The JVM is doing a long-running garbage collection, which pauses every 
thread (aka "stop the world").
-  * Since the region server's local zookeeper client cannot send heartbeats, 
the session times out.
-  * By design, we shut down any node that isn't able to contact the Zookeeper 
ensemble after getting a timeout so that it stops serving data that may already 
be assigned elsewhere.
- 
- === Resolution ===
-  * Make sure you give the JVM plenty of RAM (in hbase-env.sh); the default of 1GB 
won't be able to sustain long-running imports.
-  * Make sure you don't swap; the JVM never behaves well under swapping.
-  * Make sure you are not CPU starving the region server thread. For example, 
if you are running a mapreduce job using 6 CPU-intensive tasks on a machine 
with 4 cores, you are probably starving the region server enough to create 
longer garbage collection pauses.
-  * If you wish to increase the session timeout, add the following to your 
hbase-site.xml to increase the timeout from the default of 60 seconds to 120 
seconds.
- 
- {{{
-   <property>
-     <name>zookeeper.session.timeout</name>
-     <value>120000</value>
-   </property>
-   <property>
-     <name>hbase.zookeeper.property.tickTime</name>
-     <value>6000</value>
-   </property>
- }}}
-  * Be aware that setting a higher timeout means that the regions served by a 
failed region server will take at least that amount of time to be transferred to 
another region server. For a production system serving live requests, we would 
instead recommend setting it lower than 1 minute and over-provisioning your 
cluster in order to lower the memory load on each machine (hence having less 
garbage to collect per machine).
-  * If this is happening during an upload which only happens once (like 
initially loading all your data into HBase), consider 
[[http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|importing
 into HFiles directly]].
-  * HBase ships with some GC tuning, for more information see 
[[PerformanceTuning|Performance Tuning]].
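
The tickTime property matters because a ZooKeeper server does not simply accept the session timeout a client asks for. As far as the ZooKeeper administrator's guide describes it, the server bounds the value between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The Python sketch below illustrates that arithmetic under the assumption that those defaults have not been overridden; the function name is hypothetical.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms,
                               min_ticks=2, max_ticks=20):
    """Clamp a requested session timeout to the server's allowed range.

    min_ticks and max_ticks mirror the assumed ZooKeeper defaults of
    2 * tickTime and 20 * tickTime for the minimum and maximum timeout.
    """
    lower = min_ticks * tick_time_ms
    upper = max_ticks * tick_time_ms
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # A 120 second request is honoured with a 6000 ms tick ...
    print(negotiated_session_timeout(120000, 6000))
    # ... but is cut down when the tick is only 2000 ms.
    print(negotiated_session_timeout(120000, 2000))
```

Under these assumptions, raising zookeeper.session.timeout without also raising tickTime would leave the effective timeout capped at 20 ticks.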
- 
- 
- <<Anchor(9)>>
- 
- == 9. Problem: Could not find my address: xyz in list of ZooKeeper quorum 
servers ==
-  * A ZooKeeper server wasn't able to start and throws this error, where xyz is 
the name of your server.
- 
- === Causes ===
-  * This is a name lookup problem. HBase tries to start a ZK server on some 
machine but that machine isn't able to find itself in the 
'''hbase.zookeeper.quorum''' configuration.
- 
- === Resolution ===
-  * Use the hostname presented in the error message instead of the value you 
used. If you have a DNS server, you can set '''hbase.zookeeper.dns.interface''' 
and '''hbase.zookeeper.dns.nameserver''' in hbase-site.xml to make sure it 
resolves to the correct FQDN.
- 
- <<Anchor(10)>>
- 
- == 10. Problem: Zookeeper does not seem to work on Amazon EC2 ==
-  * HBase does not start when deployed as Amazon EC2 instances.
-  * Exceptions like the below appear in the master and/or region server logs:
- 
- {{{
-   2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting
-   connection to server 
ec2-174-129-15-236.compute-1.amazonaws.com/10.244.9.171:2181
-   2009-10-19 11:52:27,032 WARN org.apache.zookeeper.ClientCnxn: Exception
-   closing session 0x0 to sun.nio.ch.SelectionKeyImpl@656dc861
-   java.net.ConnectException: Connection refused
- }}}
- === Causes ===
-  * Security group policy is blocking the Zookeeper port on a public address.
- 
- === Resolution ===
-  * Use the internal EC2 host names when configuring the Zookeeper quorum peer 
list.
- 
- <<Anchor(11)>>
- 
- == 11. Problem: General operating environment issues -- zookeeper session 
timeouts, regionservers shutting down, etc ==
- === Causes ===
-  . Various.
- 
- === Resolution ===
- See the [[http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting|ZooKeeper 
Operating Environment Troubleshooting]] page.  It has suggestions and tools for 
checking disk and networking performance; i.e. the operating environment your 
zookeeper and hbase are running in.  ZooKeeper is the cluster's "canary".  
It will be the first to notice any issues, so making sure it is happy is the 
shortcut to a humming cluster.
- 
- <<Anchor(12)>>
- 
- == 12. Problem: Scanner performance is low ==
- === Causes ===
- Default scanner caching (prefetching) is set to 1. The default is low because 
if a job takes too long processing, a scanner can time out, which causes 
unhappy jobs/people/emails. See item #10 above.
- 
- === Resolution ===
-  * Increase the amount of prefetching on the scanner, to 10, or 100, or 1000, 
as appropriate for your workload: 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching|HTable.scannerCaching]]
-  * This change can be accomplished globally by setting the 
hbase.client.scanner.caching property in hbase-site.xml to the desired value.
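
To see why the caching value matters, consider how many client-to-server round trips a full scan needs. The Python sketch below is a deliberately simplified model, not HBase code: it assumes one round trip fetches at most `caching` rows and ignores the calls that open and close the scanner as well as region boundaries. The function name is hypothetical.

```python
def scanner_round_trips(total_rows, caching):
    """Estimate the number of fetch round trips for a scan.

    Simplified model: each round trip returns at most `caching` rows.
    """
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Integer ceiling division avoids floating point rounding.
    return (total_rows + caching - 1) // caching


if __name__ == "__main__":
    for caching in (1, 10, 100, 1000):
        trips = scanner_round_trips(1_000_000, caching)
        print(f"caching={caching}: {trips} round trips")
```

The trade-off is that each round trip then carries more rows, so the client holds more data in memory and spends longer between calls to the server, which is the timeout risk mentioned under Causes.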
- 
- <<Anchor(13)>>
- 
- == 13. Problem: My shell or client application throws lots of scary 
exceptions during normal operation ==
- === Causes ===
- Since 0.20.0 the default log level for org.apache.hadoop.hbase.* is DEBUG.
- 
- === Resolution ===
- On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: 
{{{log4j.logger.org.apache.hadoop.hbase=DEBUG}}} to this: 
{{{log4j.logger.org.apache.hadoop.hbase=INFO}}}, or even 
{{{log4j.logger.org.apache.hadoop.hbase=WARN}}} .
- 
- 
- <<Anchor(14)>>
- 
- == 14. Problem: Running a Scan or a MapReduce job over a full table fails 
with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes ==
- 
- === Causes ===
- This problem is generally a symptom of a mis-configured or underpowered 
cluster. 
- 
- === Resolution ===
-  * See the Troubleshooting section in the HBase book 
http://hbase.apache.org/book.html#trouble on xceivers configuration.
-  * See the configuration section in the HBase book 
http://hbase.apache.org/book.html on '''hbase.hregion.max.filesize'''
-  * Add machines to your cluster.
- 
- <<Anchor(15)>>
- 
- == 15. Problem: System instability, and the presence of 
"java.lang.OutOfMemoryError: unable to create new native thread" exceptions in 
HDFS datanode logs or that of any system daemon ==
- 
- === Causes ===
- 
- The user under which the daemons are running has an nproc limit (default) set 
too low. The default on recent Linux distributions is 1024.
-  
- === Resolution ===
- 
- See the HBase book http://hbase.apache.org/book.html on nproc configuration.
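
Before changing anything, it can be useful to check which limit the daemon user actually has. The following Python sketch reads the soft nproc limit of the current process with the standard `resource` module (Unix only) and compares it with a target value. The helper names and the target of 32000 are assumptions made for illustration, not values taken from the HBase book.

```python
def read_nproc_soft_limit():
    """Return the soft RLIMIT_NPROC of this process, or None if unlimited."""
    import resource  # Unix-only; RLIMIT_NPROC is available on Linux and BSDs

    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


def limit_is_too_low(soft_limit, required):
    """Return True if a finite soft limit is below the required value."""
    return soft_limit is not None and soft_limit < required


if __name__ == "__main__":
    # Hypothetical target; choose a value suited to your cluster.
    required = 32000
    soft = read_nproc_soft_limit()
    if limit_is_too_low(soft, required):
        print(f"nproc soft limit {soft} is below {required}")
    else:
        print("nproc soft limit looks sufficient")
```

Run it as the same user that starts the HDFS and HBase daemons, since the limit is applied per user.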
-
