That's almost exactly what Mozilla is doing with Socorro (search for
their presentations).

Also, you seem to assume things about the region balancer that are, at
least at the moment, untrue:

> Then the assumption is this process would continue until every server in the
> cluster has one region of data

That's more like the end result rather than the goal.

> Then during retrieval I could use ten threads; each would use a start and
> end row with its prefix, and the query should be distributed evenly
> among the servers.

Nothing is done to make sure that your regions will be distributed
that way; the last region for each salt key may very well end up on
the same region server. That's why it's better to use more salt
values than you have servers.
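
For what it's worth, here is a minimal sketch of the salted-key layout
being discussed, in plain Python with no HBase client involved. The
names salt_for, salted_key and scan_ranges, the MD5-based choice of
salt, and the "<salt>-<date>-<id>" key layout are all assumptions made
for illustration; none of this is something HBase provides.

```python
import hashlib

NUM_SALTS = 10  # assumption: ten prefixes, 00 to 09, as in the original plan


def salt_for(record_id: str, num_salts: int = NUM_SALTS) -> int:
    """Pick a salt deterministically from a (hypothetical) record id."""
    digest = hashlib.md5(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_salts


def salted_key(date: str, record_id: str, num_salts: int = NUM_SALTS) -> str:
    """Build a row key of the form '<salt>-<YYYY-MM-DD>-<record_id>'.

    The two-digit salt field limits this sketch to at most 100 salts.
    """
    return f"{salt_for(record_id, num_salts):02d}-{date}-{record_id}"


def scan_ranges(start_date: str, end_date: str, num_salts: int = NUM_SALTS):
    """Return one (start_row, stop_row) pair per salt.

    start_date is inclusive and end_date is exclusive, so each pair can be
    handed to its own scanner thread.
    """
    return [
        (f"{s:02d}-{start_date}", f"{s:02d}-{end_date}")
        for s in range(num_salts)
    ]
```

Each pair from scan_ranges would feed one scanner, and the client merges
the results. Raising NUM_SALTS above the number of servers gives more,
smaller key ranges, but it still does not control which region server a
given region lands on.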

And have you seen this?
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/

J-D

On Fri, Apr 22, 2011 at 1:18 PM, Peter Haidinyak <[email protected]> wrote:
> Thanks, that's the way I visualized it happening. Then the assumption is this
> process would continue until every server in the cluster has one region of
> data (more or less). My underlying issue is that I need to store my data
> with the key starting with the date (YYYY-MM-DD). I know this means I will
> have hot spots during inserts, but it makes retrieval more efficient by using
> a scan with start and end rows. I was thinking of adding a prefix number of 00
> to 09, for the ten servers. In theory, each server should only end up with
> one of the prefixes. Then during retrieval I could use ten threads; each
> would use a start and end row with its prefix, and the query should be
> distributed evenly among the servers. I'm not sure if using ten threads to
> insert the data would buy me anything or not. Anyway, I'm going to try this
> out at home on my own cluster to see how it performs.
>
> Thanks
>
> -Pete
>
> -----Original Message-----
> From: Buttler, David [mailto:[email protected]]
> Sent: Friday, April 22, 2011 12:10 PM
> To: [email protected]
> Subject: RE: Row Key Question
>
> Regions split when they grow larger than the configured maximum region
> size.  Your data is small enough to fit in a single region.
>
> Keys are sorted within a region.  When a region splits, the new regions are
> about half the size of the original region, and each contains half the key
> space.
>
> Dave
>
> -----Original Message-----
> From: Peter Haidinyak [mailto:[email protected]]
> Sent: Friday, April 22, 2011 10:41 AM
> To: [email protected]
> Subject: Row Key Question
>
> I have a question on how HBase decides to save rows based on row keys. Say I
> have a million rows to insert into a new table in a ten-node cluster. Each
> row's key is some random 32-byte value and there are two columns per row,
> each column containing some random 32-byte value.
> My question is how does HBase know when to 'split' the table between the ten
> nodes? Or how does HBase 'split' the random keys between the ten nodes?
>
> Thanks
>
> -Pete
>
