Is there a reason you have the split size set to 2MB? That's rather small and you'll end up constantly splitting, even once you have good distribution.

I'd go for pre-splitting, as others suggest, but with larger region sizes.
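For example, something along these lines from the Java client would create a
pre-split "Inbox" table with a larger per-table max region size (the split
points, family name, and sizes here are just placeholders, and the
createTable overload that takes split keys may not exist in older client
versions, in which case you'd script the splits from the shell instead):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitInbox {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

        HTableDescriptor desc = new HTableDescriptor("Inbox");
        desc.addFamily(new HColumnDescriptor("items"));
        // Use a much larger max region size than 2MB, e.g. 256MB.
        desc.setMaxFileSize(256L * 1024 * 1024);

        // Pre-split at chosen key boundaries so writes spread across
        // region servers from the start instead of after many splits.
        byte[][] splitKeys = new byte[][] {
            Bytes.toBytes("00250000"),
            Bytes.toBytes("00500000"),
            Bytes.toBytes("00750000"),
        };
        admin.createTable(desc, splitKeys);
      }
    }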

Ryan Rawson wrote:
An interesting thing about HBase is that it really performs better with
more data.  Pre-splitting tables is one way to get that benefit sooner.

Another performance bottleneck is the write-ahead-log. You can disable
it by calling:
Put.setWriteToWAL(false);

and you will achieve a substantial speedup.
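
For example (a minimal sketch using the 0.20-era client API; the table,
family, and qualifier names are placeholders, and keep in mind that edits
written without the WAL are lost if a region server dies before a flush):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalPut {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "Inbox");

        Put put = new Put(Bytes.toBytes("row-00000001"));
        put.add(Bytes.toBytes("items"), Bytes.toBytes("subject"),
                Bytes.toBytes("hello"));
        // Skip the write-ahead log for this edit: faster, but the edit
        // is gone if the region server crashes before it is flushed.
        put.setWriteToWAL(false);
        table.put(put);
      }
    }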

Good luck!
-ryan

On Tue, Sep 22, 2009 at 3:39 PM, stack <st...@duboce.net> wrote:
Split your table in advance?  You can do it from the UI or shell (Script
it?)

Regarding the same performance for 10 nodes as for 5, how many regions are
in your table?  What happens if you pile on more data?

The split algorithm will be sped up in coming versions for sure.  Two
minutes seems like a long time.  Is it under load at the time?

St.Ack



On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <guy.molin...@disney.com>wrote:

Hello all,

    I've been working with HBase for the past few months on a proof of
concept/technology adoption evaluation.    I wanted to describe my
scenario to the user/development community to get some input on my
observations.



I've written an application that consists of two tables.  It models
a classic many-to-many relationship.  One table stores "User" data and
the other represents an "Inbox" of items assigned to that user.  The
key for the user is a string generated by the JDK's UUID.randomUUID()
method.  The key for the "Inbox" is a monotonically increasing value.
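
In other words, key generation looks roughly like this (simplified; the
zero-padding and helper names are just for illustration):

    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicLong;

    public class KeySketch {
      private static final AtomicLong INBOX_SEQ = new AtomicLong();

      // "User" row key: a random UUID string, so rows scatter evenly
      // across the key space.
      static String newUserKey() {
        return UUID.randomUUID().toString();
      }

      // "Inbox" row key: a monotonically increasing counter, so each
      // new row lands at the tail of the key space.
      static String newInboxKey() {
        return String.format("%020d", INBOX_SEQ.incrementAndGet());
      }
    }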



It works just fine.  I've reviewed the performance tuning info on the
HBase wiki page.  The client application spins up 100 threads, each
grabbing a range of keys (for the "Inbox").  The I/O mix is about
50/50 read/write.  The test client inserts 1,000,000 "Inbox" items and
verifies the existence of a "User" (FK check).  It uses column families
to maintain the integrity of the relationships.
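
Each thread's inner loop is roughly the following (a simplified sketch
against the old-style client API; table, family, and qualifier names are
illustrative):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InboxWorker {
      // One thread's share: insert Inbox rows [startId, endId) that
      // reference the given User row, checking the FK on every insert.
      static void runRange(HTable users, HTable inbox, String userKey,
                           long startId, long endId) throws IOException {
        for (long id = startId; id < endId; id++) {
          // Read side: verify the referenced "User" row exists (FK check).
          if (users.get(new Get(Bytes.toBytes(userKey))).isEmpty()) {
            continue;
          }
          // Write side: insert the "Inbox" item keyed by the sequence value.
          Put put = new Put(Bytes.toBytes(String.format("%020d", id)));
          put.add(Bytes.toBytes("items"), Bytes.toBytes("user"),
                  Bytes.toBytes(userKey));
          inbox.put(put);
        }
      }
    }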



I'm running versions 0.19.3 and 0.20.0.    The behavior is basically the
same.   The cluster consists of 10 nodes.   I'm running my namenode and
HBase master on one dedicated box.   The other 9 run datanodes/region
servers.



I'm seeing around 1,000 "Inbox" transactions per second (dividing the
total count inserted by the total time for the batch).  The problem is
that I get the same results with 5 nodes as with 10.  Not quite what I
was expecting.



The bottleneck seems to be the splitting algorithm.  I've set my
region size to 2 MB.  I can see that as the process moves forward, HBase
pauses, redistributes the data, and splits regions.  It does this
first for the "Inbox" table and then pauses again and redistributes the
"User" table.  This pause can be quite long, often 2 minutes or more.
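
(For reference, the region size here is the hbase.hregion.max.filesize
setting in hbase-site.xml; 2 MB looks like the snippet below, and the
shipped default is much larger.)

    <property>
      <name>hbase.hregion.max.filesize</name>
      <!-- 2 MB; the default that ships with 0.20 is 256 MB (268435456) -->
      <value>2097152</value>
    </property>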



Can the key ranges be pre-defined somehow in advance to avoid this?  I
would rather not burden application developers/DBAs with this.
Perhaps the divvy algorithms could be sped up?  Any configuration
recommendations?



Thanks in advance,

Guy


