On Thu, Jun 30, 2011 at 11:33 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jun 27, 2011 at 11:37 PM, Aditya Karanth A
> <aditya_kara...@mindtree.com> wrote:
> >> I have heard that bigger the size of the regionserver, more time it
> >> takes for region splitting and slower the reads are. Is this true?
> > (I have not been able to experiment with all these in our environments
> > yet, but if anyone has been there and done that, would be good to know)
> >
>
> Well, splitting is fast in that it just writes out reference files;
> it does not actually rewrite data so size shouldn't matter.
>

This is interesting. I always thought of a region split as a single file
being copied out as two. Is there more documentation on this? If not, what
code can I look at to better understand splits?
Also, when a region is moved from one regionserver to another, doesn't the
data need to be moved local to the new regionserver for better performance
and reduced I/O?
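
To make sure I am reading the quoted description correctly, here is a toy
sketch of a split that only writes references. This is not HBase code: the
names StoreFile, Reference and split_region are made up for illustration,
and I am assuming each daughter region gets one small marker per parent
file that says "lower half" or "upper half" relative to the split key.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

Row = Tuple[str, str]  # (row_key, value)


@dataclass(frozen=True)
class StoreFile:
    """Toy stand-in for an immutable data file with rows sorted by key."""
    name: str
    rows: Tuple[Row, ...]


@dataclass(frozen=True)
class Reference:
    """Tiny marker: one half of `parent`, divided at `split_key` (hypothetical)."""
    parent: StoreFile
    split_key: str
    upper: bool  # True selects keys >= split_key, False selects keys < split_key

    def scan(self) -> Iterator[Row]:
        # Reads go through to the parent's rows; no data was copied at split time.
        for key, value in self.parent.rows:
            if (key >= self.split_key) == self.upper:
                yield key, value


def split_region(files: List[StoreFile], split_key: str):
    """Return (lower, upper) daughters, each a list of references.

    The work done here depends on the number of files, not on their size.
    """
    lower = [Reference(f, split_key, upper=False) for f in files]
    upper = [Reference(f, split_key, upper=True) for f in files]
    return lower, upper


parent = StoreFile("hfile-1", (("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")))
lower, upper = split_region([parent], "c")
```

If that picture is right, the split cost would not grow with region size,
and the actual rewrite of each half into its own file would presumably
happen later (my understanding is that compaction does this).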


>
> Scan reads don't care about file size (bigger may actually be slightly
> faster).  Random read performance also is unrelated to file/region
> size (We consult the in-memory index to figure where to jump to to
> start the read -- this should be the same for big or small files).
>

If this is true, when would you ever want multiple regions for the same
table and column family served by a single regionserver? I'd rather leave
the region size unlimited and, if a region gets hot, manually split and
move it. Is there any risk associated with this approach? I guess this ties
into my previous question: if a move physically relocates a lot of data,
you probably do not want to end up moving very large regions around the
cluster at the same time ...
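
For reference, this is how I picture the in-memory index lookup mentioned
in the quoted reply above. It is only a sketch under my own assumptions:
find_block and the sample keys are hypothetical, and I am assuming the
index holds the first row key of each block in sorted order.

```python
import bisect
from typing import List, Optional


def find_block(first_keys: List[str], row_key: str) -> Optional[int]:
    """Return the index of the only block that could hold row_key.

    first_keys[i] is the first row key stored in block i, in sorted order.
    Returns None when row_key sorts before the first block.
    """
    # The rightmost block whose first key is <= row_key is the candidate.
    i = bisect.bisect_right(first_keys, row_key) - 1
    return i if i >= 0 else None


index = ["a", "g", "m", "t"]  # hypothetical file with four blocks
block = find_block(index, "h")  # falls in the block that starts at "g"
```

A bigger file would have more index entries, but the binary search grows
only logarithmically and the read still touches a single block, which
would fit the claim that random reads do not depend much on file size.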



>
> St.Ack
>
