Hello all, I've been working with HBase for the past few months on a proof-of-concept / technology-adoption evaluation. I wanted to describe my scenario to the user/development community and get some input on my observations.
I've written an application that comprises two tables modeling a classic many-to-many relationship. One table stores "User" data and the other represents an "Inbox" of items assigned to that user. The key for a "User" row is a string generated by the JDK's UUID.randomUUID() method; the key for an "Inbox" row is a monotonically increasing value. Column families are used to maintain the integrity of the relationships. Functionally it works just fine, and I've reviewed the performance tuning information on the HBase wiki.

The test client spins up 100 threads, each grabbing a range of "Inbox" keys. The I/O mix is about 50/50 read/write: the client inserts 1,000,000 "Inbox" items and, for each one, verifies the existence of the referenced "User" (an FK check).

I've run this against both 0.19.3 and 0.20.0, and the behavior is basically the same. The cluster consists of 10 nodes: the namenode and HBase master share one dedicated box, and the other 9 run datanodes/region servers. I'm seeing roughly 1,000 "Inbox" transactions per second (total items inserted divided by total time for the batch).

The problem is that I get the same results with 5 nodes as with 10, which is not quite what I was expecting. The bottleneck appears to be region splitting. I've set my region size to 2 MB, and as the load progresses HBase pauses to split regions and redistribute the data, first for the "Inbox" table and then again for the "User" table. These pauses can be quite long, often 2 minutes or more.

Can the key ranges be pre-defined in advance to avoid this? I'd rather not burden application developers/DBAs with that. Or could the split/redistribution algorithms be sped up? Any configuration recommendations?

Thanks in advance,
Guy
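
P.S. For concreteness, here is a stripped-down version of what each client thread does per "Inbox" item. This uses the 0.20 client API with simplified table/family/qualifier names (the 0.19 run uses BatchUpdate instead, but the shape is the same):

    import java.util.UUID;
    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InboxLoadSketch {
        // Monotonically increasing "Inbox" key, shared across threads in the real client
        private static final AtomicLong NEXT_INBOX_ID = new AtomicLong(0);

        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable users = new HTable(conf, "User");
            HTable inbox = new HTable(conf, "Inbox");

            // One iteration of the per-thread loop (the real client does 1M of these
            // across 100 threads, each owning a range of inbox keys).
            // In the real client the UUID comes from an existing "User" row; a fresh
            // one is generated here only to show the key format.
            String userId = UUID.randomUUID().toString();
            long inboxId = NEXT_INBOX_ID.incrementAndGet();
            // Zero-padded so the byte-wise row-key order matches numeric order
            String inboxKey = String.format("%010d", inboxId);

            // Read half of the 50/50 mix: FK check that the referenced user exists
            Result user = users.get(new Get(Bytes.toBytes(userId)));
            if (!user.isEmpty()) {
                // Write half: store the inbox item pointing back at its user.
                // The "item" family and "user" qualifier are simplified names.
                Put put = new Put(Bytes.toBytes(inboxKey));
                put.add(Bytes.toBytes("item"), Bytes.toBytes("user"), Bytes.toBytes(userId));
                inbox.put(put);
            }
        }
    }

And this is roughly what I was hoping to be able to do instead: hand the region boundaries to the master at table-creation time so the initial load never has to split. The createTable overload taking split keys below is hypothetical; I don't know whether anything like it exists in 0.19/0.20, which is really my question:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitSketch {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("Inbox");
            desc.addFamily(new HColumnDescriptor("item"));

            // One boundary per desired region, matching the zero-padded,
            // monotonically increasing inbox keys (values are placeholders).
            byte[][] splitKeys = new byte[][] {
                Bytes.toBytes("0000250000"),
                Bytes.toBytes("0000500000"),
                Bytes.toBytes("0000750000"),
            };

            // Hypothetical call: create the table already split along the given keys
            admin.createTable(desc, splitKeys);
        }
    }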