Hi Stack,

We have been running a small cluster (namenode + 5 RS) on 0.20.3 for a long time now. We are currently at 1100 regions per RS. As far as I can tell, I have not seen any problems or changes in behavior due to this.

What kind of problems can I expect with 1K+ regions per RS? And what are the consequences of upping the region size from 256M to, let's say, 512M?
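
If I understand it right, upping the region size is just a matter of raising hbase.hregion.max.filesize in hbase-site.xml (value in bytes). Something like this for 512M; existing regions would only stop splitting as often, they would not get merged back:

  <property>
    <name>hbase.hregion.max.filesize</name>
    <!-- 512M = 536870912 bytes; a region splits once a store file grows past this -->
    <value>536870912</value>
  </property>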

Thanks,
i.

On 12/14/2010 09:50 AM, Stack wrote:
Can you do w/ fewer regions?  1k plus per server is pushing it, I'd say.
Can you up your region sizes, for instance?
St.Ack

On Mon, Dec 13, 2010 at 8:36 AM, Jan Lukavský
<jan.lukav...@firma.seznam.cz>  wrote:
Hi all,

we are using HBase 0.20.6 on a cluster of about 25 nodes with about 30k
regions and are experiencing an issue which causes running M/R jobs to
fail.
When we restart a single RegionServer, the following happens:
  1) all regions of that RS get reassigned to the remaining (say 24) nodes
  2) when the restarted RegionServer comes back up, the HMaster closes about 60
regions on each of the 24 nodes and assigns them back to the restarted node

Now, step 1) is usually very quick (if we can assign 10 regions per heartbeat
per server, that is 240 regions per heartbeat across the 24 remaining servers).
Step 2) seems problematic, because first about 1200 regions get
unassigned, and then they get slowly assigned to the single RS (again at
10 regions per heartbeat). During this time, the map tasks' clients connected to those
regions throw RetriesExhaustedException.

I'm aware that we can limit the number of regions closed per RegionServer
heartbeat with hbase.regions.close.max, but this config option seems a bit
unsatisfactory, because as the cluster grows we will get more
and more regions unassigned in a single cluster heartbeat (say we limit it
to 1; then we still get 24 unassigned regions, but only 10 assigned, per
heartbeat). This led us to a solution which seems quite simple. We have
introduced a new config option which limits the number of regions in
transition. When regionsInTransition.size() crosses that boundary, we temporarily
stop the load balancer. This seems to resolve our issue, because no region stays
unassigned for a long time and clients manage to recover within their number
of retries.
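
To illustrate the idea, this is roughly the shape of the check (simplified; the
property name hbase.regions.max.regionsintransition is just what we called it,
and the surrounding code here is sketched rather than copied from our patch):

  // Guard added around the master's balancing step.
  // "hbase.regions.max.regionsintransition" is our own config key.
  int maxInTransition = conf.getInt("hbase.regions.max.regionsintransition", 100);

  // regionsInTransition is the master's set of regions currently being
  // opened or closed; while it is above the limit we skip rebalancing,
  // so no further regions get closed until the backlog drains.
  if (regionsInTransition.size() >= maxInTransition) {
    LOG.debug("Too many regions in transition (" + regionsInTransition.size()
        + "), skipping load balancing this round");
    return;
  }
  // ... otherwise proceed with the usual close/reassign of regions ...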

My question is: is this a general issue for which a new config option should be
proposed, or am I missing something and we could have resolved the issue with
some other config option tuning?

Thanks.
  Jan



