First, there is work under way for 0.21 which will shorten the time necessary
for region redeployment. Part of the delay in 0.20 is less-than-ideal
performance by the master in that regard.

Beyond that, just as a general operational principle, I recommend that you host
no more than 200-250 regions per region server. The Bigtable paper talks about
each tablet server hosting only 100 tablets, with only 200 MB of data each.
While that is not cost-effective for folks who do not build their own hardware
in bulk, it should cause you to think about why:
   - Limiting the number of regions per tablet server limits time to recovery 
upon node failure -- you can engineer this to be within some threshold
   - Limiting the amount of data per region means that servers with reasonable 
RAM can cache and serve a lot of the data out of memory for sub-disk data 
access latencies

So the advice here is to opt for more servers, not fewer; more RAM, not less;
and smaller disks, not larger.
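
To make "engineer this to be within some threshold" concrete, here is a rough
back-of-the-envelope sketch in Java. The region count, region size, and
redeployment rate are assumed numbers for illustration, not measurements of any
particular cluster; region size itself is governed by
hbase.hregion.max.filesize in hbase-site.xml.

    // Back-of-the-envelope region sizing and recovery arithmetic.
    // All inputs below are illustrative assumptions.
    public class RegionSizingEstimate {
        public static void main(String[] args) {
            int regionsPerServer = 250;               // upper end of the suggested range
            long bytesPerRegion = 200L * 1024 * 1024; // Bigtable-paper-sized regions

            long dataPerServer = regionsPerServer * bytesPerRegion;
            System.out.printf("Data hosted per region server: ~%d GB%n",
                    dataPerServer / (1024L * 1024 * 1024));

            // Assumed rate at which the surviving servers can split logs and
            // reopen regions after a failure -- purely hypothetical.
            long redeployBytesPerSec = 100L * 1024 * 1024;
            System.out.printf("Rough time to redeploy after one server failure: ~%d s%n",
                    dataPerServer / redeployBytesPerSec);
        }
    }

Plug in your own region size and a measured redeployment rate to see whether
the recovery window fits your availability target.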

You should also consider the impact of server failure on HDFS -- loss of block
replicas. For each under-replicated block, HDFS must work to make additional
copies. This re-replication can come at a bad time if the blocks were lost in
the first place because of overloading.
Smaller disks mean fewer lost block replicas. For example, attach 4 x 160 GB
drives as JBOD (as opposed to 4 x 1 TB or similar). Losing one disk means a
loss of only 160 GB worth of block replicas (as opposed to 1 TB). Loss of a
whole server means losing only 640 GB worth of block replicas (as opposed to 4
TB).
You can also consider attaching six, eight, or even more modest-sized disks per
server to increase I/O parallelism (number of spindles) while also constraining
the amount of block replica loss per disk failure.
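
If it helps, the same arithmetic for the two layouts as a small sketch; the
re-replication rate here is purely an assumption:

    // Replica-loss comparison for the two JBOD layouts above. The assumed
    // re-replication rate is for illustration only.
    public class ReplicaLossCompare {
        static void report(String layout, long diskGb, int disks) {
            long assumedReReplicateGbPerMin = 10; // hypothetical cluster-wide rate
            System.out.printf("%s: lose %d GB per disk (~%d min), %d GB per server (~%d min)%n",
                    layout, diskGb, diskGb / assumedReReplicateGbPerMin,
                    diskGb * disks, (diskGb * disks) / assumedReReplicateGbPerMin);
        }

        public static void main(String[] args) {
            report("4 x 160 GB", 160, 4);  // smaller disks: bounded replica loss
            report("4 x 1 TB  ", 1024, 4); // bigger disks: much more to re-replicate
        }
    }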

Even so, blocked reads and writes over some interval during region
redeployment, whether due to server failure or load rebalancing, are part of
the Bigtable architecture, and so of HBase. We could take additional steps such
as setting up active-passive region server pairs, but that would have
complications affecting consistency and performance, and might not provide
enough benefit anyway (there is still time needed to detect the failure and
fail over). This is not an unavailability of the Bigtable service; other
regions are not affected. It is graceful/proportional service degradation in
the face of partial failures.
There are alternatives to Bigtable which degrade differently given partial
failures. Such options can give you no waiting on the write path at any time,
and possibly no waiting on the read path, but you will lose strong consistency
as the trade-off. So you may get stale answers over some (unbounded, IIRC)
period, but that is the choice you make.

HBase also has options like Stargate or the Thrift connector, which can block
and retry on behalf of your clients so they are never blocked for writes. For
the read path, I could look at having Stargate serve (possibly stale) answers
out of a cache -- with some flag that indicates noncanonical state -- if that
would be useful, and/or return an immediate "try again" indication, so your
clients are at least not stalled.
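
If you want similar never-block-the-caller behavior from a plain Java client
rather than Stargate or Thrift, something along these lines is possible. This
is only a rough sketch against the 0.20-era client API; the table name, column
family/qualifier, and backoff interval are made up for illustration:

    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch of a client-side write buffer that retries in the background, so
    // callers never block while a region is being redeployed.
    public class NonBlockingWriter {
        private final BlockingQueue<Put> queue = new LinkedBlockingQueue<Put>();
        private final HTable table;

        public NonBlockingWriter(String tableName) throws IOException {
            this.table = new HTable(new HBaseConfiguration(), tableName);
            Thread drainer = new Thread(new Runnable() {
                public void run() { drain(); }
            }, "hbase-write-drainer");
            drainer.setDaemon(true);
            drainer.start();
        }

        // Callers return immediately; the put is applied later by the drainer.
        public void write(String row, String value) {
            Put p = new Put(Bytes.toBytes(row));
            p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value));
            queue.add(p);
        }

        private void drain() {
            while (true) {
                try {
                    Put p = queue.take();
                    while (true) {
                        try {
                            table.put(p);       // blocks/retries here, not in the caller
                            break;
                        } catch (IOException e) {
                            Thread.sleep(2000); // back off while regions redeploy
                        }
                    }
                } catch (InterruptedException ie) {
                    return;
                }
            }
        }
    }

The trade-off is the one described above: the caller gets an immediate return,
but the write is only durable once the background thread manages to apply it.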

Best regards,

  - Andy




________________________________
From: Murali Krishna. P <muralikpb...@yahoo.com>
To: hbase-user@hadoop.apache.org
Sent: Wed, November 25, 2009 1:31:45 AM
Subject: HBase High Availability

Hi,
    This is regarding region unavailability when a region server goes down.
There will be cases where we have thousands of regions per RS, and it takes a
considerable amount of time to redistribute the regions when a node fails. The
service will be unavailable during that period. I am evaluating HBase for an
application where we need to guarantee close to 100% availability (the namenode
is still a SPOF; leave that aside).
    
    One simple idea would be to replicate the regions in memory. Can we load
the same region in multiple region servers? I am not sure about the feasibility
yet; there will be issues like consistency across these in-memory replicas. I
wanted to know whether there are any thoughts / work already going on in this
area. I saw some related discussion here:
http://osdir.com/ml/hbase-user-hadoop-apache/2009-09/msg00118.html, but I am
not sure what its state is.

  The same needs to be done with the master as well, or is it already handled
by ZK? How fast are the master re-election and catalog load currently? Do we
always have multiple masters in a ready-to-run state?


Thanks,
Murali Krishna



      
