I'm sorry to butt in, but I have a question about that "split region in
half" growth strategy. If it's too off-topic, you can ignore it..
I know that people always complain about using HBase for monotonously
increasing inputs, and recommend to use a random or hashed key strategy
to distribute the load across servers.
I've always wondered if you can change the "split region in half"
strategy to "create new region". If such a strategy could be useful for
people that want to have monotonously increasing inputs. It won't be as
efficient as full randomly distributed keys, but might be a good enough
compromise to keep code complexity down.
From your example: If you have an almost full region with keys 1-10.
And I go to add key 11. Right not it would split in half: 1-5, 6-11;
And have to incur the overhead of creating new regions and doing lots of
data copying to create all of the new region data files, etc. And to
add further insult once that new region is close to full, we have 1-5,
6-16; once 17 is added, then it will go through a region split once
again: 1-5, 6-10, 11-16, leaving half filled regions.
What if instead, you had a nearly full region, keys 1-10. Then add key
11. And because it's in a "monotonously increasing inputs" mode, it
would leave region 1-10 alone, and simply create a brand new region,
11-11 (hopefully in a different region server).
I know that someone must have thought about this already, and must have
good reasons to not follow it, but I've never seen this discussed in the
architecture docs, nor in the email list.. so just wondering... :)
On 6/15/11 10:58 AM, Chris Tarnas wrote:
There are a few ways:
1) Dynamically as data added. You start with one region and all data goes
there. When a region grows to big, it gets split in half. So if a region had
keys 1-10 we now have 1-5 and 5-10.