Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hbase/Troubleshooting" page has been changed by GabrielReid.
The comment on this change is: Added information about scans or MapReduce jobs 
failing over a large table with many regions.
http://wiki.apache.org/hadoop/Hbase/Troubleshooting?action=diff&rev1=44&rev2=45

--------------------------------------------------

  == Contents ==
-  1. [[#1|Problem: Master initializes, but Region Servers do not]]
+  1. [[#A1|Problem: Master initializes, but Region Servers do not]]
-  1. [[#2|Problem: Created Root Directory for HBase through Hadoop DFS]]
+  1. [[#A2|Problem: Created Root Directory for HBase through Hadoop DFS]]
-  1. [[#3|Problem: Replay of hlog required, forcing regionserver restart]]
+  1. [[#A3|Problem: Replay of hlog required, forcing regionserver restart]]
-  1. [[#4|Problem: On migration, no files in root directory]]
+  1. [[#A4|Problem: On migration, no files in root directory]]
-  1. [[#5|Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 
256"]]
+  1. [[#A5|Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 
256"]]
-  1. [[#6|Problem: "No live nodes contain current block"]]
+  1. [[#A6|Problem: "No live nodes contain current block"]]
-  1. [[#7|Problem: DFS instability and/or regionserver lease timeouts]]
+  1. [[#A7|Problem: DFS instability and/or regionserver lease timeouts]]
-  1. [[#8|Problem: Instability on Amazon EC2]]
+  1. [[#A8|Problem: Instability on Amazon EC2]]
-  1. [[#9|Problem: Zookeeper SessionExpired events]]
+  1. [[#A9|Problem: Zookeeper SessionExpired events]]
-  1. [[#10|Problem: Scanners keep getting timeouts]]
+  1. [[#A10|Problem: Scanners keep getting timeouts]]
-  1. [[#11|Problem: Client says no such table but it exists]]
+  1. [[#A11|Problem: Client says no such table but it exists]]
-  1. [[#12|Problem: Could not find my address: xyz in list of ZooKeeper quorum 
servers]]
+  1. [[#A12|Problem: Could not find my address: xyz in list of ZooKeeper 
quorum servers]]
-  1. [[#13|Problem: Long client pauses under high load; or deadlock if using 
THBase]]
+  1. [[#A13|Problem: Long client pauses under high load; or deadlock if using 
THBase]]
-  1. [[#14|Problem: Zookeeper does not seem to work on Amazon EC2]]
+  1. [[#A14|Problem: Zookeeper does not seem to work on Amazon EC2]]
-  1. [[#15|Problem: General operating environment issues -- zookeeper session 
timeouts, regionservers shutting down, etc.]]
+  1. [[#A15|Problem: General operating environment issues -- zookeeper session 
timeouts, regionservers shutting down, etc.]]
-  1. [[#16|Problem: Scanner performance is low]]
+  1. [[#A16|Problem: Scanner performance is low]]
-  1. [[#17|Problem: My shell or client application throws lots of scary 
exceptions during normal operation]]
+  1. [[#A17|Problem: My shell or client application throws lots of scary 
exceptions during normal operation]]
-  1. [[#18|Problem: The HBase or Hadoop daemons crash after some days of 
uptime with no errors logged]]
+  1. [[#A18|Problem: The HBase or Hadoop daemons crash after some days of 
uptime with no errors logged]]
- 
+  1. [[#A19|Problem: Running a Scan or a MapReduce job over a full table fails 
with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes]]
  
  <<Anchor(1)>>
+ 
  == 1. Problem: Master initializes, but Region Servers do not ==
   * Master's log contains repeated instances of the following block:
+   . ~-INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 
/127.0.0.1:60020. Already tried 1 time(s).<<BR>> INFO 
org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:60020. 
Already tried 2 time(s).<<BR>> -~
-   ~-INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 
/127.0.0.1:60020. Already tried 1 time(s).<<BR>>
-   INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 
/127.0.0.1:60020. Already tried 2 time(s).<<BR>>
-   ...<<BR>>
-   INFO org.apache.hadoop.ipc.RPC: Server at /127.0.0.1:60020 not available 
yet, Zzzzz...-~
+   . ~-..<<BR>> INFO org.apache.hadoop.ipc.RPC: Server at /127.0.0.1:60020 not 
available yet, Zzzzz...-~
   * Region Servers' logs contains repeated instances of the following block:
+   . ~-INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 
masternode/192.168.100.50:60000. Already tried 1 time(s).<<BR>> INFO 
org.apache.hadoop.ipc.Client: Retrying connect to server: 
masternode/192.168.100.50:60000. Already tried 2 time(s).<<BR>> -~
-   ~-INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 
masternode/192.168.100.50:60000. Already tried 1 time(s).<<BR>>
-   INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 
masternode/192.168.100.50:60000. Already tried 2 time(s).<<BR>>
-   ...<<BR>>
-   INFO org.apache.hadoop.ipc.RPC: Server at masternode/192.168.100.50:60000 
not available yet, Zzzzz...-~
+   . ~-..<<BR>> INFO org.apache.hadoop.ipc.RPC: Server at 
masternode/192.168.100.50:60000 not available yet, Zzzzz...-~
   * Note that the Master believes the Region Servers have the IP of 127.0.0.1 
- which is localhost and resolves to the master's own localhost.
+ 
  === Causes ===
   * The Region Servers are erroneously informing the Master that their IP 
addresses are 127.0.0.1.
+ 
  === Resolution ===
   * Modify '''/etc/hosts''' on the region servers, from
-   {{{
+   . {{{
  # Do not remove the following line, or various programs
  # that require network functionality will fail.
- 127.0.0.1             fully.qualified.regionservername regionservername  
localhost.localdomain localhost
+ 127.0.0.1               fully.qualified.regionservername regionservername  
localhost.localdomain localhost
- ::1           localhost6.localdomain6 localhost6
+ ::1             localhost6.localdomain6 localhost6
  }}}
  
   * To (removing the master node's name from localhost)
-   {{{
+   . {{{
  # Do not remove the following line, or various programs
  # that require network functionality will fail.
- 127.0.0.1             localhost.localdomain localhost
+ 127.0.0.1               localhost.localdomain localhost
- ::1           localhost6.localdomain6 localhost6
+ ::1             localhost6.localdomain6 localhost6
  }}}
  
- 
- <<Anchor(2)>> 
+ <<Anchor(2)>>
+ 
  == 2. Problem: Created Root Directory for HBase through Hadoop DFS ==
   * On startup, the Master says that you need to run the HBase migrations script. Upon running it, the migrations script says there are no files in the root directory.
+ 
  === Causes ===
   * HBase expects the root directory either not to exist, or to have already been initialized by a previous run of HBase. If you create a new directory for HBase using Hadoop DFS, this error will occur.
+ 
  === Resolution ===
   * Make sure the HBase root directory does not currently exist, or that it was initialized by a previous run of HBase. A surefire solution is to use Hadoop DFS to delete the HBase root and let HBase create and initialize the directory itself.
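   . A minimal sketch of that cleanup, assuming the HBase root is {{{/hbase}}} on the default HDFS filesystem (adjust the path to match your '''hbase.rootdir''', and stop HBase first):
   . {{{
  # remove the HBase root so HBase can recreate and initialize it on the next start
  hadoop fs -rmr /hbase
  }}}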
  
- 
  <<Anchor(3)>>
+ 
  == 3. Problem: Replay of hlog required, forcing regionserver restart ==
-  * Under a heavy write load, some regions servers will go down with the 
following exception: 
+  * Under a heavy write load, some region servers will go down with the following exception:
+ 
  {{{
  WARN org.apache.hadoop.dfs.DFSClient: Exception while reading from 
blk_xxxxxxxxxxxxxxx of /hbase/some_repository from IP_address:50010: 
java.io.IOException: Premeture EOF from inputStream
  then later
@@ -81, +83 @@

  }}}
  === Causes ===
   * RPC timeouts may happen because of IO contention, which blocks processes during file swapping.
+ 
  === Resolution ===
   * Configure your system to avoid swapping. Set vm.swappiness to 0. 
(http://kerneltrap.org/node/1044)
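   . For example, on a Linux system with sysctl (the first command applies the setting immediately, the second persists it across reboots):
   . {{{
  # take effect immediately
  sysctl -w vm.swappiness=0
  # persist the setting across reboots
  echo "vm.swappiness = 0" >> /etc/sysctl.conf
  }}}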
  
- 
  <<Anchor(4)>>
+ 
  == 4. Problem: On migration, no files in root directory ==
   * On startup, the Master says that you need to run the HBase migrations script. Upon running it, the migrations script says there are no files in the root directory.
+ 
  === Causes ===
   * HBase expects the root directory either not to exist, or to have already been initialized by a previous run of HBase. If you create a new directory for HBase using Hadoop DFS, this error will occur.
+ 
  === Resolution ===
   * Make sure the HBase root directory does not currently exist, or that it was initialized by a previous run of HBase. A surefire solution is to use Hadoop DFS to delete the HBase root and let HBase create and initialize the directory itself (the example under problem 2 above applies here as well).
  
- 
  <<Anchor(5)>>
+ 
  == 5. Problem: "xceiverCount 258 exceeds the limit of concurrent xcievers 
256" ==
   * See an exception with the above message in the logs, usually the datanode logs.
+ 
  === Causes ===
   * An upper bound on connections was added in Hadoop 
(HADOOP-3633/HADOOP-3859).
+ 
  === Resolution ===
-  * Up the maximum by setting '''dfs.datanode.max.xcievers''' (sic).  See 
[[http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/%3c20126171.p...@talk.nabble.com%3e|message
 from jean-adrien]] for some background. Values of 2048 or 4096 are common. 
+  * Up the maximum by setting '''dfs.datanode.max.xcievers''' (sic).  See [[http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200810.mbox/<20126171.p...@talk.nabble.com>|message from jean-adrien]] for some background. Values of 2048 or 4096 are common; a sample configuration entry is shown after this list.
- 
+  * This may be a symptom of having an unrealistically high number of regions 
in a table and/or an underpowered cluster; see [[#A19|the discussion of this 
below]]
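   . The sample configuration entry mentioned above, assuming the setting goes in each datanode's hdfs-site.xml (hadoop-site.xml on older Hadoop releases); the value 4096 is illustrative and the datanodes must be restarted for it to take effect:
   . {{{
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>
  }}}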
  
  <<Anchor(6)>>
+ 
  == 6. Problem: "No live nodes contain current block" ==
   * See an exception with the above message in the logs.
+ 
  === Causes ===
   * Insufficient file descriptors available at the OS level for DFS DataNodes
   * Patch for HDFS-127 is not present (Should not be an issue for HBase >= 
0.20.0 as a private Hadoop jar is shipped with the client side fix applied)
   * Slow datanodes are marked as down by DFSClient; eventually all replicas 
are marked as 'bad' (HADOOP-3831).
+ 
  === Resolution ===
-  * Increase the file descriptor limit of the user account under which the DFS 
DataNode processes are operating. On most Linux systems, adding the following 
lines to /etc/security/limits.conf will increase the file descriptor limit from 
the default of 1024 to 32768. Substitute the actual user name for {{{<user>}}}. 
+  * Increase the file descriptor limit of the user account under which the DFS 
DataNode processes are operating. On most Linux systems, adding the following 
lines to /etc/security/limits.conf will increase the file descriptor limit from 
the default of 1024 to 32768. Substitute the actual user name for {{{<user>}}}.
-    {{{
+   . {{{
  <user>          soft    nofile          32768
  <user>          hard    nofile          32768
  }}}
   * RedHat based distributions also may have a maximum total open files across 
the whole system, so you will also need to edit /etc/sysctl.conf to include the 
line:
-    {{{
+   . {{{
  fs.file-max = 32768
  }}}
-   Run the commands {{{sysctl -p /etc/sysctl.conf}}} and {{{service network 
restart}}} to make the change immediately effective.
+   . Run the commands {{{sysctl -p /etc/sysctl.conf}}} and {{{service network 
restart}}} to make the change immediately effective.
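   . To verify the new limit is in effect, log back in as the DataNode user and run the following; it should report 32768:
   . {{{
  ulimit -n
  }}}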
- 
  
  <<Anchor(7)>>
+ 
  == 7. Problem: DFS instability and/or regionserver lease timeouts ==
   * HBase regionserver leases expire during start up
   * HBase daemons cannot find block locations in HDFS during start up or other 
periods of load
   * HBase regionserver restarts after being unable to report to master:
+ 
  {{{
  2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
xxx ms, ten times longer than scheduled: 10000
  2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 
xxx ms, ten times longer than scheduled: 15000
@@ -137, +148 @@

   * Slow host name resolution
   * Very long garbage collection pauses in the RegionServer JVM: the ''default garbage collector'' of the HotSpot Java Virtual Machine runs a ''full gc'' on the ''old space'' when that space fills up, which can represent about 90% of the allocated heap. During this task the running program is stopped, and so are its timers. If the heap is mostly in the swap partition, and especially if it is larger than physical memory, the resulting swapping can lead to I/O overload and take several minutes.
   * Network bandwidth overcommitment
+ 
  === Resolution ===
   * Ensure that host name resolution latency is low, or use static entries in /etc/hosts
   * Monitor the network and ensure that adequate bandwidth is available for HRPC transactions
   * In accordance with your hardware, tune your heap space / garbage collector settings in the HBASE_OPTS variable of {{{$HBASE_CONF/hbase-env.sh}}}. Try the ''concurrent garbage collector'' {{{(-XX:+UseConcMarkSweepGC)}}} to avoid stopping all threads during GC; a sample hbase-env.sh fragment is shown after this list. Read these articles for more info about Hotspot GC settings:
-     * [[http://java.sun.com/docs/hotspot/gc1.4.2/faq.html|Garbage collector 
FAQ]] Quick overview
+   * [[http://java.sun.com/docs/hotspot/gc1.4.2/faq.html|Garbage collector 
FAQ]] Quick overview
-     * 
[[http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html|Tuning 
garbage collector in Java SE 6]]
+   * 
[[http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html|Tuning 
garbage collector in Java SE 6]]
   * For Java SE 6, some users have had success with {{{ 
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:ParallelGCThreads=8 }}}.
   * See HBase [[PerformanceTuning|Performance Tuning]] for more on JVM GC 
tuning.
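   . The hbase-env.sh fragment mentioned above, as a sketch only -- the heap size is illustrative and both values should be sized to your hardware and workload:
   . {{{
  # heap size in MB for the HBase daemons (illustrative value)
  export HBASE_HEAPSIZE=4000
  # use the concurrent collector to shorten stop-the-world pauses
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
  }}}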
  
- 
  <<Anchor(8)>>
+ 
  == 8. Problem: Instability on Amazon EC2 ==
-  * Various problems suggesting overloading on Amazon EC2 deployments: Scanner 
timeouts, problems locating HDFS blocks, missed heartbeats, "We slept xxx ms, 
ten times longer than scheduled" messages, and so on. 
+  * Various problems suggesting overloading on Amazon EC2 deployments: Scanner 
timeouts, problems locating HDFS blocks, missed heartbeats, "We slept xxx ms, 
ten times longer than scheduled" messages, and so on.
-  * These problems continue after following the other relevant advice on this 
page. 
+  * These problems continue after following the other relevant advice on this 
page.
   * Or, you are trying to use Small or Medium instance types. (Do not.)
+ 
  === Causes ===
-  * Hadoop and HBase daemons require 1GB heap, therefore RAM, per daemon. For 
load intensive environments, HBase regionservers may require more heap than 
this. There must be enough available RAM to comfortably hold the working sets 
of all Java processes running on the instance. This includes any mapper or 
reducer tasks which may run co-located with system daemons. Small and Medium 
instances do not have enough available RAM to contain typical Hadoop+HBase 
deployments. 
+  * Hadoop and HBase daemons require 1GB heap, therefore RAM, per daemon. For 
load intensive environments, HBase regionservers may require more heap than 
this. There must be enough available RAM to comfortably hold the working sets 
of all Java processes running on the instance. This includes any mapper or 
reducer tasks which may run co-located with system daemons. Small and Medium 
instances do not have enough available RAM to contain typical Hadoop+HBase 
deployments.
-  * Hadoop and HBase daemons are latency sensitive. There should be enough 
free RAM so no swapping occurs. Swapping during garbage collection may cause 
JVM threads to be suspended for a critically long time. Also, there should be 
sufficient virtual cores to service the JVM threads whenever they become 
runnable. Large instances have two virtual cores, so they can run HDFS and 
HBase daemons concurrently, but nothing more. X-Large instances have four 
virtual cores, so they can run in addition to HDFS and HBase daemons two 
mappers or reducers concurrently. Configure TaskTracker concurrency limits 
accordingly, or separate mapreduce computation from storage functions. 
+  * Hadoop and HBase daemons are latency sensitive. There should be enough free RAM so no swapping occurs. Swapping during garbage collection may cause JVM threads to be suspended for a critically long time. Also, there should be sufficient virtual cores to service the JVM threads whenever they become runnable. Large instances have two virtual cores, so they can run HDFS and HBase daemons concurrently, but nothing more. X-Large instances have four virtual cores, so in addition to the HDFS and HBase daemons they can run two mappers or reducers concurrently. Configure TaskTracker concurrency limits accordingly, or separate mapreduce computation from storage functions.
+ 
  === Resolution ===
   * Use X-Large (c1.xlarge) instances
-  * Consider splitting storage and computational function over disjoint 
instance sets. 
+  * Consider splitting storage and computational function over disjoint 
instance sets.
- 
  
  <<Anchor(9)>>
+ 
  == 9. Problem: ZooKeeper SessionExpired events ==
   * Master or Region Servers shutting down with messages like those in the 
logs:
+ 
  {{{
- WARN org.apache.zookeeper.ClientCnxn: Exception 
+ WARN org.apache.zookeeper.ClientCnxn: Exception
  closing session 0x278bd16a96000f to sun.nio.ch.selectionkeyi...@355811ec
  java.io.IOException: TIMED OUT
         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
@@ -183, +198 @@

   * The JVM is doing a long running garbage collection which is pausing all threads (aka "stop the world").
   * Since the region server's local zookeeper client cannot send heartbeats, 
the session times out.
   * By design, we shut down any node that isn't able to contact the Zookeeper 
ensemble after getting a timeout so that it stops serving data that may already 
be assigned elsewhere.
+ 
  === Resolution ===
   * Make sure you give HBase plenty of RAM (in hbase-env.sh); the default of 1GB won't be able to sustain long-running imports.
   * Make sure you don't swap; the JVM never behaves well under swapping.
-  * Make sure you are not CPU starving the region server thread. For example, 
if you are running a mapreduce job using 6 CPU-intensive tasks on a machine 
with 4 cores, you are probably starving the region server enough to create 
longer garbage collection pauses. 
+  * Make sure you are not CPU starving the region server thread. For example, 
if you are running a mapreduce job using 6 CPU-intensive tasks on a machine 
with 4 cores, you are probably starving the region server enough to create 
longer garbage collection pauses.
   * If you wish to increase the session timeout, add the following to your 
hbase-site.xml to increase the timeout from the default of 60 seconds to 120 
seconds.
+ 
  {{{
    <property>
      <name>zookeeper.session.timeout</name>
@@ -198, +215 @@

      <value>6000</value>
    </property>
  }}}
-  * Be aware that setting a higher timeout means that the regions served by a 
failed region server will take at least that amount of time to be transfered to 
another region server. For a production system serving live requests, we would 
instead recommend setting it lower than 1 minute and over-provision your 
cluster in order the lower the memory load on each machines (hence having less 
garbage to collect per machine). 
+  * Be aware that setting a higher timeout means that the regions served by a failed region server will take at least that amount of time to be transferred to another region server. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (and hence have less garbage to collect per machine).
   * If this is happening during an upload which only happens once (like 
initially loading all your data into HBase), consider 
[[http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk|importing
 into HFiles directly]].
   * HBase ships with some GC tuning, for more information see 
[[PerformanceTuning|Performance Tuning]].
  
- 
  <<Anchor(10)>>
+ 
  == 10. Problem: Scanners keep getting timeouts ==
   * Client receives org.apache.hadoop.hbase.UnknownScannerException or timeouts even if the region server lease is set really high. (Fixed in HBase 0.20.0.)
+ 
  === Causes ===
   * The client, by default, fetches 30 rows when issuing the first next(); the other 29 calls simply return rows from local memory, so no RPC reaches the region server while those rows are being processed. So if it takes 3 minutes to process each row and the timeout is set to 30 minutes, that is still not enough to cover the 30 * 3 = 90 minutes before the next RPC.
+ 
  === Resolution ===
   * Set hbase.client.scanner.caching in hbase-site.xml to a very low value like 1, or use HTable.setScannerCaching(1).
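   . For instance, from a Java client against the 0.20 client API (a sketch only; the table name is made up):
   . {{{
  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  
  public class SlowRowProcessor {
    public static void main(String[] args) throws IOException {
      HTable table = new HTable(new HBaseConfiguration(), "mytable");
      // fetch a single row per next() RPC so that slow per-row work
      // cannot outlast the scanner lease on the region server
      table.setScannerCaching(1);
      ResultScanner scanner = table.getScanner(new Scan());
      try {
        for (Result row : scanner) {
          // slow per-row processing goes here
        }
      } finally {
        scanner.close();
      }
    }
  }
  }}}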
  
- 
  <<Anchor(11)>>
+ 
  == 11. Problem: Client says no such table but it exists ==
   * Client can't find region in table, says no such table.
+ 
  === Causes ===
   * Just deleted a large table
+ 
  === Resolution ===
   * Run major compaction on the .META. table.  In the shell, type '''tool''' 
to learn how to run a major compaction from the shell.
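   . For example, from the HBase shell (assuming a 0.20.x shell where the '''major_compact''' admin command is available; '''tool''' lists it along with the other admin commands):
   . {{{
  hbase(main):001:0> major_compact '.META.'
  }}}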
  
- 
  <<Anchor(12)>>
+ 
  == 12. Problem: Could not find my address: xyz in list of ZooKeeper quorum 
servers ==
   * A ZooKeeper server wasn't able to start and throws that error, where xyz is the name of your server.
+ 
  === Causes ===
   * This is a name lookup problem. HBase tries to start a ZK server on some machine but that machine isn't able to find itself in the '''hbase.zookeeper.quorum''' configuration.
+ 
  === Resolution ===
   * Use the hostname presented in the error message instead of the value you 
used. If you have a DNS server, you can set '''hbase.zookeeper.dns.interface''' 
and '''hbase.zookeeper.dns.nameserver''' in hbase-site.xml to make sure it 
resolves to the correct FQDN.
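   . As a sketch, in hbase-site.xml -- the interface and nameserver values below are only examples and must be replaced with your own:
   . {{{
    <property>
      <name>hbase.zookeeper.dns.interface</name>
      <value>eth0</value>
    </property>
    <property>
      <name>hbase.zookeeper.dns.nameserver</name>
      <value>ns1.example.com</value>
    </property>
  }}}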
  
- 
  <<Anchor(13)>>
+ 
  == 13. Problem: Long client pauses under high load; or deadlock if using 
transactional HBase (THBase) ==
   * Under high load, some client operations take a long time; waiting appears 
uneven
   * If using THBase, apparent deadlocks: for example, in thread dumps IPC 
Server handlers are blocked in 
org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion.updateIndex()
+ 
  === Causes ===
   * The default number of regionserver RPC handlers is insufficient.
+ 
  === Resolution ===
   * Increase the value of "hbase.regionserver.handler.count" in 
hbase-site.xml. The default is 10. Try 100.
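   . For example, in hbase-site.xml:
   . {{{
    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>100</value>
    </property>
  }}}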
  
- 
  <<Anchor(14)>>
+ 
  == 14. Problem: Zookeeper does not seem to work on Amazon EC2 ==
   * HBase does not start when deployed as Amazon EC2 instances.
   * Exceptions like the below appear in the master and/or region server logs:
+ 
  {{{
    2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting
    connection to server 
ec2-174-129-15-236.compute-1.amazonaws.com/10.244.9.171:2181
@@ -253, +279 @@

  }}}
  === Causes ===
   * Security group policy is blocking the Zookeeper port on a public address.
+ 
  === Resolution ===
-  * Use the internal EC2 host names when configuring the Zookeeper quorum peer 
list. 
+  * Use the internal EC2 host names when configuring the Zookeeper quorum peer 
list.
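   . For example, in hbase-site.xml, where the host names below are illustrative internal EC2 names:
   . {{{
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>ip-10-244-9-171.ec2.internal,ip-10-244-9-172.ec2.internal,ip-10-244-9-173.ec2.internal</value>
    </property>
  }}}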
- 
  
  <<Anchor(15)>>
+ 
  == 15. Problem: General operating environment issues -- zookeeper session 
timeouts, regionservers shutting down, etc ==
  === Causes ===
-  Various.
+  . Various.
+ 
  === Resolution ===
  See the [[http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting|ZooKeeper Operating Environment Troubleshooting]] page.  It has suggestions and tools for checking disk and networking performance; i.e. the operating environment your ZooKeeper and HBase are running in.  ZooKeeper is the cluster's "canary": it will be the first to notice issues, so making sure it is happy is the shortcut to a humming cluster.
  
- 
  <<Anchor(16)>>
+ 
  == 16. Problem: Scanner performance is low ==
  === Causes ===
  Default scanner caching (prefetching) is set to 1. The default is low because if a job takes too long processing the rows it has fetched, the scanner can time out, which causes unhappy jobs/people/emails. See item #10 above.
+ 
  === Resolution ===
   * Increase the amount of prefetching on the scanner, to 10, or 100, or 1000, 
as appropriate for your workload: 
[[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#scannerCaching|HTable.scannerCaching]]
   * This change can be accomplished globally by setting the 
hbase.client.scanner.caching property in hbase-site.xml to the desired value.
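   . For example, to set a global default of 100 in hbase-site.xml (the value 100 is illustrative; mind the timeout caveat in item #10 above):
   . {{{
    <property>
      <name>hbase.client.scanner.caching</name>
      <value>100</value>
    </property>
  }}}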
  
- 
  <<Anchor(17)>>
+ 
  == 17. Problem: My shell or client application throws lots of scary 
exceptions during normal operation ==
  === Causes ===
  Since 0.20.0 the default log level for org.apache.hadoop.hbase.* is DEBUG.
+ 
  === Resolution ===
+ On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: 
{{{log4j.logger.org.apache.hadoop.hbase=DEBUG}}} to this: 
{{{log4j.logger.org.apache.hadoop.hbase=INFO}}}, or even 
{{{log4j.logger.org.apache.hadoop.hbase=WARN}}} .
- On your clients, edit $HBASE_HOME/conf/log4j.properties and change this: 
{{{log4j.logger.org.apache.hadoop.hbase=DEBUG}}}
- to this: {{{log4j.logger.org.apache.hadoop.hbase=INFO}}}, or even 
{{{log4j.logger.org.apache.hadoop.hbase=WARN}}} .
  
  <<Anchor(18)>>
+ 
  == 18. Problem: The HBase or Hadoop daemons crash after some days of uptime 
with no errors logged ==
  === Causes ===
  HBase and Hadoop have stability issues on certain versions of the JVM that 
can cause this issue. In particular, Sun Java 1.6.0_18 is known to be buggy. 
The current recommended version for production usage is 1.6.0_16.
+ 
  === Resolution ===
  Downgrade your JVM to 1.6.0_16.
  
+ <<Anchor(19)>>
+ 
+ == 19. Problem: Running a Scan or a MapReduce job over a full table fails 
with "xceiverCount xx exceeds..." or OutOfMemoryErrors in the HDFS datanodes ==
+ 
+ === Causes ===
+ HBase keeps a number of files per region open on HDFS. When you have a large number of regions in a single table, HDFS needs to keep a correspondingly large number of files open, which can cause you to run into the [[#A5|"xceiverCount xx exceeds..."]] issue, or conversely into OutOfMemoryErrors in the datanodes if you raise the '''dfs.datanode.max.xcievers''' setting too high in order to escape that issue.
+ 
+ Each region in HBase corresponds to 0 to N store files in HDFS. The '''dfs.datanode.max.xcievers''' setting controls the maximum number of handler threads that are allowed per HDFS datanode. Each store file consumes at least one thread on the datanode. Once a store file is opened, it stays open until a compaction or split is needed; this can result in the '''dfs.datanode.max.xcievers''' limit being reached surprisingly quickly if you have a lot of regions in a single table.
+ 
+ This problem is generally a symptom of an underpowered cluster. 
+ 
+ === Resolution ===
+ 
+ This can be resolved in the following ways:
+  * Increase the maximum file size per region; this is the '''hbase.hregion.max.filesize''' setting in hbase-site.xml, and it defaults to 268435456 (256 MB). Keep in mind that this will only apply to future region splits, and will not result in existing regions being merged. A sample entry is shown after this list.
+  * Adjust configuration that affects RAM -- e.g. thread stack sizes or, depending on what your query path looks like, shrink the amount of memory given over to the block cache (this will slow your reads, though).
+  * Add machines to your cluster.
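+   . The sample hbase-site.xml entry mentioned in the first item above, as a sketch -- 536870912 bytes (512 MB) doubles the default and is only illustrative:
+   . {{{
+   <property>
+     <name>hbase.hregion.max.filesize</name>
+     <value>536870912</value>
+   </property>
+ }}}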
+ 
