Hi Stack (and others),

The small initial region size was intended to force splits so that the load would be evenly distributed. If I could pre-define the key ranges for the splits, then I could go to a much larger region size. So, say I have 10 nodes and a 100MB data set: a region size of 10MB would be ideal (as I understand it), since that works out to roughly one region per node.

I can see the "split" button on the UI as suggested. How do I specify the key ranges, and how do I assign those regions to specific nodes?

Thanks for the quick responses,
Guy
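For illustration, pre-defining the split points might look roughly like the sketch below, which uses the admin createTable overload that accepts explicit split keys. That overload arrived after 0.20 (on 0.20 the equivalent is splitting from the UI/shell once data starts arriving), and the family name "items" plus the zero-padded key scheme are placeholders rather than anything from the setup described in this thread:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitInbox {
        public static void main(String[] args) throws Exception {
            // Create "Inbox" pre-split into 10 regions by handing the admin
            // 9 explicit split points up front, so no splits are needed while
            // the test client loads its 1,000,000 rows.
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            HTableDescriptor desc = new HTableDescriptor("Inbox");
            desc.addFamily(new HColumnDescriptor("items"));   // placeholder family name

            byte[][] splitKeys = new byte[9][];
            for (int i = 1; i <= 9; i++) {
                // "0100000", "0200000", ... assuming zero-padded sequential row keys
                splitKeys[i - 1] = Bytes.toBytes(String.format("%07d", i * 100000));
            }
            admin.createTable(desc, splitKeys);
        }
    }

As for pinning regions to particular nodes: as far as I know the master's balancer decides region placement on its own, so you choose the split points and HBase spreads the resulting regions across the region servers for you.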
-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of stack
Sent: Tuesday, September 22, 2009 5:17 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Hbase and linear scaling with small write intensive clusters

(Funny, I read the 2MB as 2GB -- yeah, why so small Guy?)

On Tue, Sep 22, 2009 at 4:59 PM, Jonathan Gray <jl...@streamy.com> wrote:

> Is there a reason you have the split size set to 2MB? That's rather small,
> and you'll end up constantly splitting, even once you have good distribution.
>
> I'd go for pre-splitting, as others suggest, but with larger region sizes.
>
> Ryan Rawson wrote:
>
>> An interesting thing about HBase is that it really performs better with
>> more data. Pre-splitting tables is one way.
>>
>> Another performance bottleneck is the write-ahead log. You can disable
>> it by calling:
>>
>>   Put.setWriteToWAL(false);
>>
>> and you will achieve a substantial speedup.
>>
>> Good luck!
>> -ryan
>>
>> On Tue, Sep 22, 2009 at 3:39 PM, stack <st...@duboce.net> wrote:
>>
>>> Split your table in advance? You can do it from the UI or shell (Script it?)
>>>
>>> Regarding the same performance for 10 nodes as for 5, how many regions are
>>> in your table? What happens if you pile on more data?
>>>
>>> The split algorithm will be sped up in coming versions for sure. Two
>>> minutes seems like a long time. Is it under load at this time?
>>>
>>> St.Ack
>>>
>>> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <guy.molin...@disney.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I've been working with HBase for the past few months on a proof-of-concept /
>>>> technology-adoption evaluation. I wanted to describe my scenario to the
>>>> user/development community to get some input on my observations.
>>>>
>>>> I've written an application that is comprised of two tables. It models a
>>>> classic many-to-many relationship. One table stores "User" data and the
>>>> other represents an "Inbox" of items assigned to that user. The key for the
>>>> user is a string generated by the JDK's UUID.randomUUID() method. The key
>>>> for the "Inbox" is a monotonically increasing value.
>>>>
>>>> It works just fine. I've reviewed the performance tuning info on the HBase
>>>> wiki page. The client application spins up 100 threads, each one grabbing a
>>>> range of keys (for the "Inbox"). The I/O mix is about 50/50 read/write. The
>>>> test client inserts 1,000,000 "Inbox" items and verifies the existence of a
>>>> "User" (FK check). It uses column families to maintain the integrity of the
>>>> relationships.
>>>>
>>>> I'm running versions 0.19.3 and 0.20.0. The behavior is basically the same.
>>>> The cluster consists of 10 nodes. I'm running my namenode and HBase master
>>>> on one dedicated box. The other 9 run datanodes/region servers.
>>>>
>>>> I'm seeing around 1,000 "Inbox" transactions per second (total count
>>>> inserted divided by total time for the batch). The problem is that I get
>>>> the same results with 5 nodes as with 10. Not quite what I was expecting.
>>>>
>>>> The bottleneck seems to be the splitting algorithm. I've set my region size
>>>> to 2MB. I can see that as the process moves forward, HBase pauses,
>>>> re-distributes the data, and splits regions. It does this first for the
>>>> "Inbox" table and then pauses again and redistributes the "User" table.
>>>> This pause can be quite long, often 2 minutes or more.
>>>>
>>>> Can the key ranges be pre-defined somehow in advance to avoid this? I would
>>>> rather not burden application developers/DBAs with this. Perhaps the divvy
>>>> algorithms could be sped up? Any configuration recommendations?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Guy
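Ryan's write-ahead-log suggestion from earlier in the thread, sketched against the older HTable/Put client API (exact constructors differ a bit between 0.20 and later releases; the family "items", qualifier "payload", and value format are placeholders, and skipping the WAL trades durability for speed, so unflushed edits can be lost if a region server dies):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InboxWriter {
        public static void main(String[] args) throws Exception {
            HTable inbox = new HTable(HBaseConfiguration.create(), "Inbox");
            for (long id = 1; id <= 1000; id++) {
                Put put = new Put(Bytes.toBytes(String.format("%07d", id)));
                put.setWriteToWAL(false);  // skip the WAL: faster, but not durable
                put.add(Bytes.toBytes("items"),       // column family (placeholder)
                        Bytes.toBytes("payload"),     // qualifier (placeholder)
                        Bytes.toBytes("item-" + id));
                inbox.put(put);
            }
            inbox.close();
        }
    }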