Split your table in advance? You can do it from the UI or shell (Script it?)
Regards same performance for 10 nodes as for 5 nodes, how many regions in your table? What happens if you pile on more data? The split algorithm will be sped up in coming versions for sure. Two minutes seems like a long time. Its under load at this time? St.Ack On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <[email protected]>wrote: > Hello all, > > I've been working with HBase for the past few months on a proof of > concept/technology adoption evaluation. I wanted to describe my > scenario to the user/development community to get some input on my > observations. > > > > I've written an application that is comprised of two tables. It models > a classic many-to-many relationship. One table stores "User" data and > the other represents an "Inbox" of items assigned to that user. The > key for the user is a string generated by the JDK's UUID.randomUUID() > method. The key for the "Inbox" is a monotonically increasing value. > > > > It works just fine. I've reviewed the performance tuning info on the > HBase WIKI page. The client application spins up 100 threads each one > grabbing a range of keys (for the "Inbox"). The I/O mix is about > 50/50 read/write. The test client inserts 1,000,000 "Inbox" items and > verifies the existence of a "User" (FK check). It uses column families > to maintain integrity of the relationships. > > > > I'm running versions 0.19.3 and 0.20.0. The behavior is basically the > same. The cluster consists of 10 nodes. I'm running my namenode and > HBase master on one dedicated box. The other 9 run datanodes/region > servers. > > > > I'm seeing around ~1000 "Inbox" transactions per second (dividing total > time for the batch by total count inserted). The problem is that I > get the same results with 5 nodes as with 10. Not quite what I was > expecting. > > > > The bottleneck seems to be the splitting algorithms. I've set my > region size to 2M. I can see that as the process moves forward, HBase > pauses and re-distributes the data and splits regions. It does this > first for the "Inbox" table and then pauses again and redistributes the > "User" table. This pause can be quite long. Often 2 minutes or > more. > > > > Can the key ranges be pre-defined somehow in advance to avoid this? I > would rather not burden application developers/DBA's with this. > Perhaps the divvy algorithms could be sped up? Any configuration > recommendations? > > > > Thanks in advance, > > Guy > >
