Schubert,
Sounds like you know what you're doing.
There are two different LRU implementations in current trunk, but they
do more than you need them to. Both act on objects that implement
HeapSize, so they are a heapsize-bound LRU. May not be that bad of an
idea. The algorithm is universal though... Check out:
org.apache.hadoop.hbase.regionserver.LruHashMap
There is a newer, fancier, concurrent one specifically for the block cache:
org.apache.hadoop.hbase.io.hfile.LruBlockCache
Both are in trunk.
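For a feel of the underlying algorithm those classes share, here is a minimal count-bound sketch using java.util.LinkedHashMap in access-order mode. This is only an illustration: the actual HBase classes are heapsize-bound (via HeapSize), and LruBlockCache is concurrent, which this sketch is not.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache: bounded by entry count, NOT by heap size like
// LruHashMap / LruBlockCache. accessOrder = true makes iteration order
// least-recently-used first, so evicting the eldest entry gives LRU behavior.
class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries; // assumption: a count bound stands in for a heap-size bound

    SimpleLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true => LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict least-recently-used on overflow
    }
}
```

The HBase versions replace the simple count check with a running heap-size total computed from each entry's HeapSize.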
As far as 0.19 vs 0.20, we are MUCH better with our awareness and management
of memory in 0.20. I'd strongly encourage you to try to go the
0.20 route... especially in order to get the support you need from us;
we are really focused on 0.20 improvements rather than continued
improvements to the 0.19 branch.

We expect to cut an RC very soon; just one or two bugs we're working on
(they seem last-mile) and the migration (which is of no concern to you,
right?).
Our indexes are much improved, and can be easily controlled by playing
with block sizes, region sizes, etc... You will be able to get way
further and tweak things in your favor on the latest HBase.
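As an example of the kind of tuning meant here, an hbase-site.xml fragment along these lines adjusts region size and compaction behavior (values shown are illustrations only, not recommendations; check hbase-default.xml in your version for the authoritative property names and defaults):

```xml
<!-- Example values only, not recommendations -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value> <!-- 1GB regions instead of the 256MB default -->
</property>
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value> <!-- number of HStoreFiles that triggers a compaction -->
</property>
```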
Keep us in the loop! Best of luck.
JG
zsongbo wrote:
J-G,
Thanks for your reply. You really understand my question.
I am planning to experiment with partitioned tables (day-by-day partitions).
Yes, it will make our query application more complex. But there is a reward:
it becomes easy to delete old data, and easy to run mapreduce jobs
which only access the data of specified days. To get good ingest
performance, I should not use time as the first part of the rowkey.
Yes, HBase-0.20.0 implements the new HFile. The MapFile solution is for
currently released versions, such as 0.19.x. We will check the LRU
implementations in HBase to replace SoftReferences. Could you please point me
to one of those LRU implementations?
This problem is very important in our project. We had considered and
experimented with building it straight on HDFS via mapreduce. But we
found we would still need to implement many functions ourselves, such as data
partitioning, data merging, distributed indexing, etc. Then we confirmed our
data will be queried randomly, and came back to the Bigtable world.
Regards,
Schubert
On Tue, Jul 21, 2009 at 12:02 AM, Jonathan Gray <[email protected]> wrote:
You're right about the conflict between randomized keys or time-ordered
keys. Getting the best load distribution vs isolating regions being written
to.
There's a number of different ways you could deal with this, some being
fairly complex (you could partition tables by time). You could add a random
number (0 through #nodes) in front of the stamp, so you'd be hitting #nodes
different sections of your table at a time.
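A minimal sketch of that prefixing idea (names and the key format are hypothetical, not HBase API): derive a bucket number from the record id so the same record always maps to the same bucket, then lead the rowkey with it so writes spread across #buckets sections of the table.

```java
// Hypothetical rowkey-salting helper, illustrating the prefix idea above.
// The zero-padded bucket prefix keeps lexicographic sort order sane within
// each bucket; readers must fan out a query across all buckets.
class SaltedKey {
    static String makeKey(int numBuckets, long timestamp, String recordId) {
        // floorMod keeps the bucket non-negative even for negative hashCodes
        int bucket = Math.floorMod(recordId.hashCode(), numBuckets);
        return String.format("%02d-%013d-%s", bucket, timestamp, recordId);
    }
}
```

The trade-off is exactly the one discussed above: scans over a time range now need one scan per bucket instead of one scan total.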
But I'd say you should experiment with your dataset and patterns to see how
HBase handles it and what works best for you. We'd all be interested to see
how far you can get, and of course will help you along the way.
As for the index, it's not actually a MapFile index anymore; it's our own
block index from HFile. It might be reasonable to use SoftReferences, but
in our experience they tend to misbehave under load (leaving you open to an
OOME anyway). We have a couple of different LRU implementations in the HBase
codebase which could be used to implement that.
Once you start to push the limits, I think you're going to need more
intelligent load-balancing. Right now, it's just balancing the # of regions
per server, with no consideration to some being heavily written to, others
heavily read from, others idle, etc... That may be an issue. But you might
be able to pull something simple together once you find a certain issue
biting you.
Your problem is interesting so I know I'm interested to see how far HBase
can be pushed in this respect.
JG
zsongbo wrote:
Ryan,
Thanks, maybe my previous email did not describe this clearly. We do know
that the 'memcache'/'memstore' is a write buffer; read operations will not
need such a cache. :-)
So, following your answer that the "strict limiting factor is the index size",
I am considering using SoftReference to hold the index (the MapFile
index) as a read cache, since my query application does not need a very fast
response (a second is ok).
Do you think this approach is reasonable?
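The idea, roughly, could look like this (a hypothetical sketch, not HBase code; the loader callback is an assumption for illustration): hold index data behind SoftReferences so the GC may reclaim it under memory pressure, reloading on demand.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical SoftReference-backed index cache. Entries may be cleared by
// the GC under memory pressure; get() transparently reloads them. Note the
// caveat elsewhere in this thread: under heavy load SoftReferences can
// misbehave and still leave you exposed to an OOME.
class SoftIndexCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> cache = new ConcurrentHashMap<>();

    V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) { // missing or GC-cleared: reload and re-cache
            value = loader.apply(key);
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```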
Another question about the memstore:
If there are too many regions (thousands) on one regionserver, then memstore
flushes will be very frequent and the HStoreFiles will be very small.
Compactions will then be frequent, and performance will suffer.
To let old data/regions become "static" and occupy fewer
resources (RAM), we must choose a time-based rowkey. Then only the regions
receiving current inserts are active for writes. But there is another
problem: only the regionserver hosting the latest region is busy, so we lose
the performance of ingesting data in parallel. Do you have any good ideas?
My application needs random reads for low-latency (a second is ok) random
queries. But it still needs fast data ingest with high throughput. We have
about 2TB of uncompressed data, or billions of records, to load in near
real-time.
Regards,
Schubert
On Mon, Jul 20, 2009 at 3:12 PM, Ryan Rawson <[email protected]> wrote:
I think you might be misunderstanding what the 'memcache' is; we are
calling it 'memstore' now. It is a write buffer, not a cache. It is
also memory sensitive, so as you insert more data, hbase will flush
the 'memcache' to HDFS. By default memcache is limited to 64MB per
store, 40% of Xmx, and we also limit how many outstanding write-ahead
logs there are, so as these fill up, we flush. The strict limiting
factor is the index size. You can control that with block size and
compression.
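Roughly, those flush triggers combine like this (a simplified sketch using the default thresholds quoted above; the real regionserver logic is more involved and the names here are illustrative):

```java
// Sketch of the memcache/memstore flush decision: flush when a single
// store's buffer exceeds its limit, when the global buffer exceeds a
// fraction of the heap (Xmx), or when too many write-ahead logs are
// outstanding. Thresholds mirror the defaults mentioned in this thread.
class FlushPolicy {
    static final long PER_STORE_LIMIT = 64L * 1024 * 1024; // 64MB per store
    static final double GLOBAL_FRACTION = 0.40;            // 40% of heap

    static boolean shouldFlush(long storeBytes, long globalMemcacheBytes,
                               long heapBytes, int outstandingWals, int maxWals) {
        return storeBytes >= PER_STORE_LIMIT
            || globalMemcacheBytes >= (long) (heapBytes * GLOBAL_FRACTION)
            || outstandingWals >= maxWals;
    }
}
```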
To answer some other questions:
1. The ram used in indexes is the primary limiting factor of how much
a regionserver can serve.
2. A region under no write load only occupies the index ram. Only
regions actually under write load will have memcache size > 0. This
depends on your application.
If your application doesn't use random reads, you might consider
storing data straight up in HDFS. If you want fast random reads to
any part of the dataset, then HBase is your thing.
On Mon, Jul 20, 2009 at 12:01 AM, zsongbo<[email protected]> wrote:
Ryan,
I know we can store more than 250GB on one region server. But how about 3TB,
or even 10TB?
Besides the memory used by indexes, there may be other factors, such as
the memcache.
If 5000 regions are open, the total memcache heap will be very large.
So, I am thinking about two questions:
1. What is the key factor in system resource usage in HBase?
2. Is there any way to make the majority of regions static (not using much
memory or CPU), with only a few regions active to serve data ingest?
The above considerations are based on the following situation:
1. Usually we are processing and managing time-series datasets. The old data
(e.g. the data of one week or one day ago) can be static and is rarely
queried.
2. The workload on the dataset may be ingest-heavy, with only a few clients
querying.
3. The total dataset is very big (500TB uncompressed), but we cannot run a
2000-node cluster to hold it. A large cluster is too expensive; we can only
run a cluster of 20-50 nodes.
Schubert
On Fri, Jul 10, 2009 at 3:35 PM, Ryan Rawson <[email protected]>
wrote:
By dedicating more ram to the situation you can achieve more regions
under a single regionserver. I have noticed that in my own region
servers, 200-600MB = 1-2MB of index. This value, however, is
dependent on the size of your keys and values. I have very small keys
and values. You can also tune the index size with the blocksize
setting, which directly impacts several things:
- larger blocks mean fewer index entries (more efficient ram use).
- larger blocks mean larger IO units, which can reduce random read perf.
If you have larger data sizes, you can end up with essentially 1 index
entry per value (eg: if each value is 5MB, you end up with 1 entry per
5MB of data).
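As a back-of-the-envelope sketch of that relationship (the bytes-per-entry figure is an assumption for illustration, not a measured HBase number): the block index has roughly one entry per block, so doubling the block size halves the index.

```java
// Rough index-size estimate: one index entry per block, times an assumed
// per-entry overhead (key bytes + offsets). Purely illustrative arithmetic.
class IndexEstimate {
    // entries = regionBytes / blockBytes; indexBytes = entries * bytesPerEntry
    static long indexBytes(long regionBytes, long blockBytes, long bytesPerEntry) {
        return (regionBytes / blockBytes) * bytesPerEntry;
    }
}
```

With a 256MB region, 64KB blocks, and an assumed ~96 bytes per entry, this lands in the hundreds-of-KB-to-1MB range per region, consistent with the ~1MB-per-region figure quoted later in this thread.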
So depends on the size of your data, some tuning factors, etc.
Putting this all together means, 1 regionserver can most certainly be
in charge of more than 250GB of stored data.
-ryan
On Fri, Jul 10, 2009 at 12:23 AM, zsongbo<[email protected]> wrote:
Ryan,
Yes. you are right.
But my question is that, even through 1000 regions (250MB)) per
regionserver, each regionserver can only support 250GB storage.
Please also check this thread "Help needed - Adding HBase to
architecture",
Stack and Andrew have put some talk there.
Schubert
On Fri, Jul 10, 2009 at 12:55 PM, Ryan Rawson <[email protected]>
wrote:
That size is not memory-resident, so the total data size is not an
issue. The index size is what limits you with RAM, and it's about 1 MB
per region (256MB region).
-ryan
On Thu, Jul 9, 2009 at 9:51 PM, zsongbo<[email protected]> wrote:
Hi Ryan,
Thanks.
If your regionsize is about 250MB, then 400 regions can store 100GB of data
on each regionserver.
Now, if you have 100TB of data, then you need 1000 regionservers.
We are not Google or Yahoo, who have that many nodes.
Schubert
On Fri, Jul 10, 2009 at 12:29 PM, Ryan Rawson <[email protected]>
wrote:
re: #2: in fact we don't know that... I know that I can run 200-400
regions on a regionserver with a heap size of 4-5gb. More, even. I
bet I could have 1000 regions open on 4gb of ram. Each region is ~1mb
of always-resident data, so there we go.
As for compactions, they are fairly fast, 0-30s or so depending on a
number of factors. Practically speaking it has not been a problem for
me, and I've put 1200 gb into hbase so far.
On Thu, Jul 9, 2009 at 8:58 PM, zsongbo<[email protected]> wrote:
Hi all,
1. Regarding this configuration property:

<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>3</value>
  <description>
    If more than this number of HStoreFiles in any one HStore
    (one HStoreFile is written per flush of memcache) then a compaction
    is run to rewrite all HStoreFiles files as one. Larger numbers
    put off compaction but when it runs, it takes longer to complete.
    During a compaction, updates cannot be flushed to disk. Long
    compactions require memory sufficient to carry the logging of
    all updates across the duration of the compaction.
    If too large, clients timeout during compaction.
  </description>
</property>
It says "During a compaction, updates cannot be flushed to disk."
Does it mean that, during a compaction, the memcache cannot be flushed to
disk? I think that is not good.
2. We know that HBase cannot serve too many regions on each regionserver.
With only 200 regions (256MB each), only 50GB of storage can be used.
In my test with a 1.5GB heap and a 256MB regionsize, each regionserver
could support 150 regions before hitting OutOfMemory.
Can anybody explain the reason in more detail?
To use more storage, can I set a larger regionsize, such as 1GB or 10GB?
I worry that compaction time would be long with such large regions.
Schubert