On Thu, Jun 30, 2011 at 11:33 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jun 27, 2011 at 11:37 PM, Aditya Karanth A
> <aditya_kara...@mindtree.com> wrote:
> >> I have heard that bigger the size of the regionserver, more time it takes
> >> for region splitting and slower the reads are. Is this true?
> > (I have not been able to experiment with all these in our environments yet,
> > but if anyone has been there and done that, would be good to know)
> >
>
> Well, splitting is fast in that it just writes out references files;
> it does not actually rewrite data so size shouldn't matter.
>

This is interesting. I always thought of a region split as a single file
being copied out as two. Is there more documentation on this? If not, what
code can I look at to better understand splits?

Also, when a region is moved from one regionserver to another, doesn't the
data need to be moved local to the new regionserver for better performance
and reduced I/O?

> Scan reads don't care about file size (bigger may actually be slightly
> faster). Random read performance also is unrelated to file/region
> size (We consult the in-memory index to figure where to jump to to
> start the read -- this should be the same for big or small files).
>

If this is true, when would you ever want to have multiple regions for the
same table and column family served by a single regionserver? I'd rather
keep the region size unlimited and, if a region gets hot, manually split
and move it. Is there any risk associated with this approach?

I guess this ties into my previous question: if a lot of data is physically
moved from one location to another during a move, you probably do not want
to run into a situation where you are moving very large regions around in
the cluster at the same time ...

> St.Ack
>
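
To make the "reference files" point concrete, here is a toy model of a
split that writes two references instead of copying data. It is only a
sketch of the idea described above, not HBase code: the names StoreFile,
Reference and split are hypothetical, and the real on-disk format and
split logic are more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    """Toy stand-in for an immutable, sorted data file (hypothetical)."""
    rows: tuple  # (row_key, value) pairs, sorted by row_key


@dataclass(frozen=True)
class Reference:
    """Points at one half of a parent file; holds no row data itself."""
    parent: StoreFile
    split_key: str
    top: bool  # True: keys >= split_key; False: keys < split_key

    def scan(self):
        # Reads go through to the parent file and filter by the split key.
        for key, value in self.parent.rows:
            if (key >= self.split_key) == self.top:
                yield key, value


def split(parent, split_key):
    """Create two daughter references; no rows are copied or rewritten."""
    bottom = Reference(parent, split_key, top=False)
    top = Reference(parent, split_key, top=True)
    return bottom, top


if __name__ == "__main__":
    parent = StoreFile((("a", 1), ("c", 2), ("m", 3), ("z", 4)))
    bottom, top = split(parent, "m")
    print(list(bottom.scan()))
    print(list(top.scan()))
```

In this model the cost of split() is the same whether the parent holds
four rows or four billion, which is the sense in which size "shouldn't
matter". In HBase the data is only rewritten into the daughter regions
later, when they are compacted.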