On Fri, Jul 1, 2011 at 12:21 PM, Sam Seigal <selek...@yahoo.com> wrote:
> On Thu, Jun 30, 2011 at 11:33 PM, Stack <st...@duboce.net> wrote:
>> Well, splitting is fast in that it just writes out reference files;
>> it does not actually rewrite data, so size shouldn't matter.
>>
>
> This is interesting.  I always thought of a region split as a single file
> being copied out as two. Is there more documentation on this ? If not, what
> code can I look at to better understand splits ?

Start here: 
http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/Reference.html#36
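
To make the "reference file" idea concrete, here is a toy model in Python. It is only a sketch of the concept, not HBase code: the `Reference` dataclass, the `split` and `read` helpers, the `store` dict and the file name "hfile-1" are all made up for illustration. The one thing taken from the real class is the idea of a split key plus a marker saying which half of the parent file the reference covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Toy stand-in for a reference file: a pointer plus a split key."""
    parent_file: str   # name of the parent store file (hypothetical)
    split_key: bytes   # row key at which the parent region was split
    top: bool          # True: rows >= split_key, False: rows < split_key


def split(parent_file, split_key):
    """Create one reference per daughter region; no row data is copied."""
    bottom = Reference(parent_file, split_key, top=False)
    top = Reference(parent_file, split_key, top=True)
    return bottom, top


def read(ref, store):
    """Read through a reference by filtering the parent file's rows."""
    rows = store[ref.parent_file]
    if ref.top:
        return {k: v for k, v in rows.items() if k >= ref.split_key}
    return {k: v for k, v in rows.items() if k < ref.split_key}


# One parent file; splitting at b"m" adds two tiny references and
# leaves the parent data where it is.
store = {"hfile-1": {b"a": 1, b"m": 2, b"z": 3}}
bottom, top = split("hfile-1", b"m")
print(read(bottom, store))
print(read(top, store))
```

The cost of `split` here does not depend on how many rows the parent holds, which is the point being made about split speed. In this model the data only gets rewritten into per-daughter files later, when the daughters compact.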

> Also, when a region is moved from one regionserver to another, doesn't that
> need to move the data local to the new regionserver for better performance
> and reducing I/O ?
>

Yes, local is better, but nothing is copied at move time.

As HBase runs flushing and compacting, the tendency is for files to
migrate local, given that the dfsclient places the first replica on the
local datanode (when we compact, we pull from remotes and the new
compacted file has one of its replicas placed local).  In the
meantime, the regionserver happily serves from the 'remote' locations.
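
A small Python sketch of that drift toward locality, under simplifying assumptions: each store file is modelled as just the list of hosts holding its replicas, a compaction merges everything into a single new file, and the non-local replicas are picked naively. The host names and the `locality` and `compact` helpers are hypothetical.

```python
def locality(files, host):
    """Fraction of store files that have a replica on `host`."""
    if not files:
        return 0.0
    return sum(host in replicas for replicas in files) / len(files)


def compact(files, host, other_hosts):
    """Merge all files into one new file written by `host`.

    The writer places the first replica on its own node; the other two
    replicas go to other nodes (chosen naively in this sketch).
    """
    new_replicas = [host] + [h for h in other_hosts if h != host][:2]
    return [new_replicas]


# A region has just been opened on rs3; its files were written elsewhere.
files = [["rs1", "rs2", "rs4"], ["rs2", "rs4", "rs5"]]
before = locality(files, "rs3")

files = compact(files, "rs3", ["rs1", "rs2", "rs4", "rs5"])
after = locality(files, "rs3")

print(before, after)
```

Before the compaction none of the region's files have a replica on rs3, so every read is remote; after it, the single compacted file does.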

> If this is true, when will you ever want to have multiple regions for the
> same table and column family being served by a single regionserver ?

I suppose you could make a version of the balancer that assigns one
region per regionserver.

The general notion is that it's easier to achieve good distribution
when regions are not massive.  For example, say you do achieve one
region per server and one portion of the table is hot.  With big
regions only, one regionserver carries all the load, whereas with more
regions the hot section could be distributed about the cluster.
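
The arithmetic behind that example, as a toy Python calculation. The assumptions are mine: the key space is cut into 16 equal slots, the first 4 slots are the hot portion, regions are equal-sized, and region i lives on server i % num_servers (a plain round-robin, not what the real balancer does).

```python
def load_per_server(num_regions, num_servers, hot_slots,
                    total_slots=16, hits_per_slot=100):
    """Requests landing on each server when only the first `hot_slots`
    slots of the key space receive traffic."""
    slots_per_region = total_slots // num_regions
    load = [0] * num_servers
    for slot in range(hot_slots):
        region = slot // slots_per_region      # region covering this slot
        load[region % num_servers] += hits_per_slot
    return load


# One big region per server: the hot quarter sits inside a single region.
big = load_per_server(num_regions=4, num_servers=4, hot_slots=4)

# Sixteen smaller regions: the hot quarter spans four regions.
small = load_per_server(num_regions=16, num_servers=4, hot_slots=4)

print(big)    # [400, 0, 0, 0]
print(small)  # [100, 100, 100, 100]
```

Same total traffic in both cases; only the region count changes how it spreads.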

> I'd
> rather keep the region size unlimited and, if the region gets hot,
> manually split and move? Any risk associated with this approach?

Sure. You could do this.  (In one of the FB talks, they mention they've
turned off splitting and instead split regions manually rather than
let HBase do it.  Apparently they have someone do the balance/split
process every "tuesday afternoon" or something.)


> I guess
> this ties into my previous question, if during a move, a lot of data is
> physically moved from one location to another, you probably do not want to
> run into a situation where you are moving very large regions around in the
> cluster at the same time ...
>


On region move, we do not move data.  After the move, when the region
opens on the new regionserver, it serves from the remote datanodes.
As it runs, the data tends to migrate local in the background (via
flushes and compactions).

Hope this helps,
St.Ack
