You should consider looking over the available HBase resources.
There's the online book: http://hbase.apache.org/book.html

And there's Lars George's book from O'Reilly
(http://shop.oreilly.com/product/0636920014348.do)

On 11/16/11 6:39 AM, "Stuti Awasthi" <stutiawas...@hcl.com> wrote:

>Hi all,
>
>I have a scenario in which my Hbase tables will be fed with data size
>more than 250GB every day. I have to do analysis on that data using MR
>jobs and save the output in Hbase table itself.
>
>1.      My concern is will Hbase be able to handle such data as it is
>built to handle big data?
Yes, it should be able to handle this amount of data. However, you need to
determine the number of simultaneous requests and the size of each request
so you can work out the minimum number of region servers and their
configuration.

You could do some testing on a small cluster once you've decided on the
hardware you're going to use.
        

>
>2.      What hardware /hbase configuration points I must keep in mind to
>create a cluster for such requirement.?

It depends on the data access patterns: e.g. whether you run a map-reduce
job incrementally on the new data or need the data available for lots of
random reads.
It also depends on the desired duration of the map-reduce job and the
average latency you want for the random reads.

Generally, depending on what you need, you'll have to tune a cores x
spindles x RAM formula.
If you have too few disks you'll end up with an I/O bottleneck; if you add
too many you'll either saturate the CPU or the NIC and leave some disks
idle.
I'm not sure a golden rule is what you should be relying on, but 1 core
x 1 spindle x 4 GB RAM is common (e.g. an 8-core box with 8 data disks and
32 GB of RAM), so you can use that as a baseline and adapt.
Optimizing the map-reduce code will generally change things dramatically
:).

You need to take bandwidth utilization into account as well. All data
written through the HBase API initially goes to:
  - a Write-Ahead Log (WAL) in HDFS, replicated on 3 machines (writing to
    the WAL can optionally be turned off)
  - the HRegionServer cache (RAM), which is flushed to HDFS as well (3
    replicas)
One of the replicas will always be on the local machine (given that you
run the DataNode and HRegionServer on the same machines), but the other
two will go over the network to different machines.
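
As a rough back-of-the-envelope sketch (assuming the 250 GB/day is all
written through the API with the WAL left on, and ignoring compactions):

    250 GB/day into the WAL       x 3 HDFS replicas ~= 750 GB/day
    250 GB/day flushed as HFiles  x 3 HDFS replicas ~= 750 GB/day
    total                                           ~= 1.5 TB/day ~= 17-18 MB/s

of sustained write traffic across the cluster, roughly two thirds of which
crosses the network; compactions will rewrite the data again on top of
this.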


>
>a.      How many region server?

This depends a lot on the data access pattern and on the hardware that the
HRegionServer runs on (how much RAM, how many cores, how many spindles).
Normally, if you don't access the old data much, then it's OK to keep it on
fewer region servers with more disk space, as it won't take up resources.

>
>b.      How many regions per region server ?

There are some points on this in the books. By default the maximum region
size is 256 MB (hbase.hregion.max.filesize) and it's configurable to larger
values. Facebook has some interesting points on this as well.
There's a balance between avoiding region splits (if a region grows larger
than the configured size it will be split in two) and having a good data
distribution on the cluster (e.g. if you have one huge region and all the
writes go to it, you'll end up using a single region server for all
writes) - so you need to decide on a good key distribution.
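
If you do want larger regions, this is roughly what it would look like in
hbase-site.xml (the 1 GB value here is just an illustration, not a
recommendation):

    <!-- hbase-site.xml: raise the maximum region size (value in bytes) -->
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>  <!-- 1 GB; the default is 268435456 (256 MB) -->
    </property>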

>
>3.      My schema is such that in one table with one cf , there will be
>millions of column qualifier. What can be the consequences of such design.

It means that you need to make sure a single row doesn't exceed the region
size, since a row is never split across regions.
You also have to consider that getting an entire row will be an expensive
operation. You should look at the batching options for Scans
(incrementally retrieving batches of columns from a row).
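
For example, a minimal sketch with the Java client (the table name, column
family and start row are made up; setBatch() caps how many columns come
back per Result, so a row with millions of qualifiers is returned in
pieces):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");        // hypothetical table
    Scan scan = new Scan(Bytes.toBytes("row-0001"));   // hypothetical start row
    scan.addFamily(Bytes.toBytes("cf"));
    scan.setBatch(1000);   // at most 1000 columns per Result
    scan.setCaching(10);   // number of Results fetched per RPC
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result partial : scanner) {
        // a wide row comes back as several Results of <= 1000 columns each
        System.out.println(partial.size() + " columns");
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}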

Again, testing is key :)


You could consider writing the MR job output to HFiles that you then
bulk-load into HBase, instead of going through the HBase API - especially
if the resulting data is large.
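
A rough sketch of such a job (the table name, column family and the
tab-separated input are assumptions - adapt them to your schema;
HFileOutputFormat.configureIncrementalLoad() wires in the partitioner and
reducer so the generated HFiles line up with the table's current region
boundaries):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {

  // Toy mapper: assumes each input line is "<rowkey>\t<value>".
  public static class TsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
          Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "analysis-to-hfiles");
    job.setJarByClass(BulkLoadJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(TsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sets the reducer, total-order partitioner and HFileOutputFormat
    // based on the target table's region boundaries.
    HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "mytable"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Afterwards you'd point the completebulkload tool
(org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles) at the output
directory and the table to move the HFiles into place.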


Cosmin


>
>Please suggest.
>
>Regards,
>Stuti Awasthi
>HCL Comnet Systems and Services Ltd
>F-8/9 Basement, Sec-3,Noida.
>
>
