Hi Jan,

That area of HBase was reworked a lot in the upcoming 0.90.0 and
region opening and closing can now be done in parallel for multiple
regions.

Also, the balancer works differently and may not assign even a single
region to a new region server (or a dead one that was restarted) until
the balancer runs (it now runs every 5 minutes).
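If the 5-minute default is too slow for your case, the period should be
tunable in hbase-site.xml (assuming the property name
hbase.balancer.period, in milliseconds, exists in your build; check
hbase-default.xml to confirm):

```xml
<property>
  <name>hbase.balancer.period</name>
  <!-- Run the balancer every 60 seconds instead of the 5-minute default -->
  <value>60000</value>
</property>
```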

Those behaviors are completely new, so they will probably need better
tuning, and there's still a lot to do regarding region balancing in
general, but it's probably worth trying it out.

Regarding limiting the number of regions, you probably want to use LZO
compression (99% of the time it's faster for your tables) and set
MAX_FILESIZE to something like 1GB, since the default is pretty low.
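In the shell that would look something like the following ('mytable' and
the family name 'cf' are placeholders; MAX_FILESIZE is in bytes):

```
disable 'mytable'
alter 'mytable', {NAME => 'cf', COMPRESSION => 'LZO'}
alter 'mytable', {METHOD => 'table_att', MAX_FILESIZE => '1073741824'}
enable 'mytable'
```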

Maybe your new config would be useful in the new master too; I have to
give it more thought.
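For what it's worth, the guard you describe could be sketched roughly
like this (a minimal, self-contained illustration only; BalancerGate and
all method names are hypothetical, not actual HBase classes):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed guard: the master skips a
// balancer run while too many regions are still in transition.
class BalancerGate {
    private final int maxRegionsInTransition;
    private final Set<String> regionsInTransition =
        Collections.synchronizedSet(new HashSet<String>());

    BalancerGate(int maxRegionsInTransition) {
        this.maxRegionsInTransition = maxRegionsInTransition;
    }

    // Called when the master starts moving a region off a server.
    void regionOffline(String regionName) {
        regionsInTransition.add(regionName);
    }

    // Called once the region is open on its new server.
    void regionOnline(String regionName) {
        regionsInTransition.remove(regionName);
    }

    // The balancer consults this before each run; balancing is
    // temporarily stopped while the in-flight count is at the cap.
    boolean shouldBalance() {
        return regionsInTransition.size() < maxRegionsInTransition;
    }
}
```

The point is that the cap bounds how many regions are ever unassigned
at once, so clients recover within their retry budget.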

J-D

On Mon, Dec 13, 2010 at 8:36 AM, Jan Lukavský
<jan.lukav...@firma.seznam.cz> wrote:
> Hi all,
>
> we are using HBase 0.20.6 on a cluster of about 25 nodes with about 30k
> regions and are experiencing an issue which causes running M/R jobs to
> fail.
> When we restart a single RegionServer, the following happens:
>  1) all regions of that RS get reassigned to the remaining (say 24) nodes
>  2) when the restarted RegionServer comes up, HMaster closes about 60
> regions on each of the 24 nodes and assigns them back to the restarted node
>
> Now, step 1) is usually very quick (if we can assign 10 regions per
> heartbeat, that's 240 regions per heartbeat across the whole cluster).
> Step 2) seems problematic, because first about 1200 regions get
> unassigned, and then they get slowly assigned to the single RS (again at
> 10 regions per heartbeat). During this time, clients of the Maps connected
> to those regions throw RetriesExhaustedException.
>
> I'm aware that we can limit the number of regions closed per RegionServer
> heartbeat via hbase.regions.close.max, but this config option seems a bit
> unsatisfactory, because as we increase the size of the cluster, we will get
> more and more regions unassigned in a single cluster heartbeat (say we limit
> it to 1; then we get 24 unassigned regions, but only 10 assigned per
> heartbeat). This led us to a solution which seems quite simple: we have
> introduced a new config option which limits the number of regions in
> transition. When regionsInTransition.size() crosses the boundary, we
> temporarily stop the load balancer. This seems to resolve our issue, because
> no region stays unassigned for a long time and clients manage to recover
> within their number of retries.
>
> My question is: is this a general issue and should a new config option be
> proposed, or am I missing something and could we have resolved the issue
> with some other config option tuning?
>
> Thanks.
>  Jan
>
>