Hi Stack (and others),

The small initial region size was intended to force splits so that the load would be evenly distributed. If I could pre-define the key ranges for the splits, then I could go to a much larger region size. So, say I have 10 nodes and a 100MB data set: a region size of 10MB would be ideal (as I understand it), since that works out to roughly one region per node.

I can see the "split" button on the UI as suggested. How do I specify the key ranges, and how do I assign those regions to specific nodes?

Thanks for the quick responses,
Guy
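For illustration, pre-defining the split points might look roughly like the sketch below, which uses the admin createTable overload that accepts explicit split keys. That overload arrived after 0.20 (on 0.20 the equivalent is splitting from the UI/shell once data starts arriving), and the family name "items" plus the zero-padded key scheme are placeholders rather than anything from the setup described in this thread:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitInbox {
        public static void main(String[] args) throws Exception {
            // Create "Inbox" pre-split into 10 regions by handing the admin
            // 9 explicit split points up front, so no splits are needed while
            // the test client loads its 1,000,000 rows.
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            HTableDescriptor desc = new HTableDescriptor("Inbox");
            desc.addFamily(new HColumnDescriptor("items"));   // placeholder family name

            byte[][] splitKeys = new byte[9][];
            for (int i = 1; i <= 9; i++) {
                // "0100000", "0200000", ... assuming zero-padded sequential row keys
                splitKeys[i - 1] = Bytes.toBytes(String.format("%07d", i * 100000));
            }
            admin.createTable(desc, splitKeys);
        }
    }

As for pinning regions to particular nodes: as far as I know the master's balancer decides region placement on its own, so you choose the split points and HBase spreads the resulting regions across the region servers for you.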
-----Original Message-----
From: saint....@gmail.com [mailto:saint....@gmail.com] On Behalf Of stack
Sent: Tuesday, September 22, 2009 5:17 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Hbase and linear scaling with small write intensive clusters

(Funny, I read the 2MB as 2GB -- yeah, why so small Guy?)

On Tue, Sep 22, 2009 at 4:59 PM, Jonathan Gray <jl...@streamy.com> wrote:

> Is there a reason you have the split size set to 2MB? That's rather small,
> and you'll end up constantly splitting, even once you have good distribution.
>
> I'd go for pre-splitting, as others suggest, but with larger region sizes.
>
> Ryan Rawson wrote:
>
>> An interesting thing about HBase is that it really performs better with
>> more data. Pre-splitting tables is one way.
>>
>> Another performance bottleneck is the write-ahead log. You can disable
>> it by calling:
>>
>>   Put.setWriteToWAL(false);
>>
>> and you will achieve a substantial speedup.
>>
>> Good luck!
>> -ryan
>>
>> On Tue, Sep 22, 2009 at 3:39 PM, stack <st...@duboce.net> wrote:
>>
>>> Split your table in advance? You can do it from the UI or shell (Script it?)
>>>
>>> Regarding the same performance for 10 nodes as for 5, how many regions are
>>> in your table? What happens if you pile on more data?
>>>
>>> The split algorithm will be sped up in coming versions for sure. Two
>>> minutes seems like a long time. Is it under load at this time?
>>>
>>> St.Ack
>>>
>>> On Tue, Sep 22, 2009 at 3:14 PM, Molinari, Guy <guy.molin...@disney.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I've been working with HBase for the past few months on a proof-of-concept /
>>>> technology-adoption evaluation. I wanted to describe my scenario to the
>>>> user/development community to get some input on my observations.
>>>>
>>>> I've written an application that is comprised of two tables. It models a
>>>> classic many-to-many relationship. One table stores "User" data and the
>>>> other represents an "Inbox" of items assigned to that user. The key for the
>>>> user is a string generated by the JDK's UUID.randomUUID() method. The key
>>>> for the "Inbox" is a monotonically increasing value.
>>>>
>>>> It works just fine. I've reviewed the performance tuning info on the HBase
>>>> wiki page. The client application spins up 100 threads, each one grabbing a
>>>> range of keys (for the "Inbox"). The I/O mix is about 50/50 read/write. The
>>>> test client inserts 1,000,000 "Inbox" items and verifies the existence of a
>>>> "User" (FK check). It uses column families to maintain the integrity of the
>>>> relationships.
>>>>
>>>> I'm running versions 0.19.3 and 0.20.0. The behavior is basically the same.
>>>> The cluster consists of 10 nodes. I'm running my namenode and HBase master
>>>> on one dedicated box. The other 9 run datanodes/region servers.
>>>>
>>>> I'm seeing around 1,000 "Inbox" transactions per second (total count
>>>> inserted divided by total time for the batch). The problem is that I get
>>>> the same results with 5 nodes as with 10. Not quite what I was expecting.
>>>>
>>>> The bottleneck seems to be the splitting algorithm. I've set my region size
>>>> to 2MB. I can see that as the process moves forward, HBase pauses,
>>>> re-distributes the data, and splits regions. It does this first for the
>>>> "Inbox" table and then pauses again and redistributes the "User" table.
>>>> This pause can be quite long, often 2 minutes or more.
>>>>
>>>> Can the key ranges be pre-defined somehow in advance to avoid this? I would
>>>> rather not burden application developers/DBAs with this. Perhaps the divvy
>>>> algorithms could be sped up? Any configuration recommendations?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Guy
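Ryan's write-ahead-log suggestion from earlier in the thread, sketched against the older HTable/Put client API (exact constructors differ a bit between 0.20 and later releases; the family "items", qualifier "payload", and value format are placeholders, and skipping the WAL trades durability for speed, so unflushed edits can be lost if a region server dies):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InboxWriter {
        public static void main(String[] args) throws Exception {
            HTable inbox = new HTable(HBaseConfiguration.create(), "Inbox");
            for (long id = 1; id <= 1000; id++) {
                Put put = new Put(Bytes.toBytes(String.format("%07d", id)));
                put.setWriteToWAL(false);  // skip the WAL: faster, but not durable
                put.add(Bytes.toBytes("items"),       // column family (placeholder)
                        Bytes.toBytes("payload"),     // qualifier (placeholder)
                        Bytes.toBytes("item-" + id));
                inbox.put(put);
            }
            inbox.close();
        }
    }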