Re: Cassandra vs HBase

Jonathan Gray Wed, 02 Sep 2009 11:33:01 -0700

@Sylvain

If you describe your use case, perhaps we can help you to understand what others are doing / have done similarly. Event logging is certainly something many of us have done.

If you're wondering about how much load HBase can handle, provide some numbers of what you expect. How much data in bytes is associated with each event, how many events per hour, and what operations do you want to do on it? We could help you determine how big of a cluster you might need and the kind of write/read throughput you might see.
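
As a rough illustration of the numbers that would help, here is a minimal back-of-the-envelope sketch. The function name and the sample figures (500 bytes per event, 36 million events per hour) are made up for the example, and the replication factor of 3 is an assumption based on the usual HDFS default. It ignores HBase key/value overhead and compression, so treat the output as a starting point for discussion only.

```python
def estimate_ingest(bytes_per_event, events_per_hour, replication=3):
    """Return (bytes per second, raw bytes per day, replicated bytes per day).

    Hypothetical helper: ignores per-cell overhead, compression and
    compaction, and assumes every block is stored `replication` times.
    """
    bytes_per_hour = bytes_per_event * events_per_hour
    bytes_per_second = bytes_per_hour / 3600
    raw_per_day = bytes_per_hour * 24
    return bytes_per_second, raw_per_day, raw_per_day * replication


# Made-up workload, purely for illustration.
rate, raw_day, stored_day = estimate_ingest(bytes_per_event=500,
                                            events_per_hour=36_000_000)
print(f"{rate / 1e6:.1f} MB/s sustained, "
      f"{raw_day / 1e9:.0f} GB/day raw, "
      f"{stored_day / 1e12:.3f} TB/day on disk")
```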


@Schubert

You do not need to partition your tables by stamp. One possibility is to put the stamp as the first part of your rowkeys, and in that way you will have the table sorted by time. Using Scan's start/stop keys, you can prevent doing a full table scan.
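
To make the idea concrete, here is a small sketch that does not use the HBase client at all. It models a table as a sorted list of byte-string rowkeys and shows how a start/stop key pair selects a time window without touching the other rows. `make_rowkey`, `time_bound` and `SortedTable` are hypothetical names for this example, and the 13-digit zero-padded millisecond timestamp is just one assumed encoding. The start key is inclusive and the stop key exclusive, as with Scan.

```python
import bisect


def make_rowkey(timestamp_ms, event_id):
    # Zero-pad the timestamp so byte-wise (lexicographic) order
    # matches chronological order.
    return f"{timestamp_ms:013d}-{event_id}".encode()


def time_bound(timestamp_ms):
    # A bare timestamp prefix sorts before every rowkey carrying that timestamp.
    return f"{timestamp_ms:013d}".encode()


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by rowkey."""

    def __init__(self):
        self._keys = []   # sorted rowkeys
        self._rows = {}   # rowkey -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start_key, stop_key):
        # Binary-search both bounds, then read only the rows in between.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = SortedTable()
table.put(make_rowkey(1000, "a"), "first")
table.put(make_rowkey(3000, "c"), "third")
table.put(make_rowkey(2000, "b"), "second")

# Only rows with 2000 <= timestamp < 3000 are visited.
window = [value for _, value in table.scan(time_bound(2000), time_bound(3000))]
```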

For both of you... If you are storing massive amounts of streaming log-type data, do you need full random read access to it? If you just need to process on subsets of time, that's easily partitioned by file. HBase should be used if you need to *read* from it randomly, not streaming. If you have processing that HBase's inherent sorting, grouping, and indexing can benefit from, then it also can make sense to use HBase in order to avoid full-scans of data.

HBase is not the answer just because HDFS lacks append. You could buffer in something outside HDFS, close files after a certain size/time (this is what hbase does now, we can have data loss because of no appends as well), etc...
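
A minimal sketch of the "close files after a certain size/time" idea follows. It writes to the local filesystem as a stand-in for HDFS, and the class name, file naming scheme and thresholds are all assumptions made for illustration. It is not how HBase implements its own log rolling.

```python
import os
import tempfile
import time


class RollingLogWriter:
    """Append records to a file and roll to a new one by size or age.

    Hypothetical example: a file is only considered durable once it has
    been closed and renamed, so records in the open file can be lost on
    a crash, which is the trade-off when there is no append support.
    """

    def __init__(self, directory, max_bytes=64 * 1024 * 1024,
                 max_age_s=300, clock=time.time):
        self.directory = directory
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.clock = clock
        self.closed_files = []
        self._file = None
        self._seq = 0
        os.makedirs(directory, exist_ok=True)

    def _open(self):
        self._seq += 1
        self._path = os.path.join(self.directory,
                                  f"events-{self._seq:06d}.log.tmp")
        self._file = open(self._path, "wb")
        self._size = 0
        self._opened_at = self.clock()

    def _roll(self):
        # Close the file and drop the .tmp suffix to mark it complete.
        self._file.close()
        final_path = self._path[:-len(".tmp")]
        os.rename(self._path, final_path)
        self.closed_files.append(final_path)
        self._file = None

    def write(self, record):
        if self._file is None:
            self._open()
        self._file.write(record + b"\n")
        self._size += len(record) + 1
        # The age check only runs on write; an idle file stays open.
        too_big = self._size >= self.max_bytes
        too_old = self.clock() - self._opened_at >= self.max_age_s
        if too_big or too_old:
            self._roll()

    def close(self):
        if self._file is not None:
            self._roll()


demo_dir = tempfile.mkdtemp()
writer = RollingLogWriter(demo_dir, max_bytes=10, max_age_s=3600)
for record in (b"12345", b"12345", b"x"):
    writer.write(record)
writer.close()
```

With the tiny 10-byte limit used in the demo, the first two records fill one file and the third ends up in a second file that is finalized by `close()`.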

Reads/writes of lots of streaming data to HBase will always be slower than HDFS. HBase adds additional buffering, and the compaction/split processes actually mean you copy the same data multiple times (probably 3-4 times on average, which lines up with the 3-4x slowdown you see).
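
To show why compactions multiply the bytes written, here is a deliberately simplified toy model. It assumes fixed-size flushes and a policy that merges all store files into one whenever their count reaches a threshold. That policy, the function name and the parameters are assumptions for illustration, not HBase's actual compaction algorithm, and the model ignores the write-ahead log, splits and HDFS replication. The ratio it returns depends entirely on the parameters chosen.

```python
def write_amplification(num_flushes, flush_bytes, compact_threshold):
    """Return total bytes written divided by bytes ingested (toy model)."""
    files = []     # sizes of the store files currently on disk
    written = 0    # every byte written by a flush or a compaction
    for _ in range(num_flushes):
        files.append(flush_bytes)
        written += flush_bytes
        if len(files) >= compact_threshold:
            # Merging rewrites every byte of every input file.
            merged = sum(files)
            written += merged
            files = [merged]
    return written / (num_flushes * flush_bytes)


# Hypothetical parameters: 9 flushes, compaction once 3 files exist.
ratio = write_amplification(num_flushes=9, flush_bytes=1, compact_threshold=3)
```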

And there is currently a patch in development (that works at least partially) to do direct-to-hdfs imports to HBase which would then be nearly as fast as a normal HDFS writing job.


Issue here:  https://issues.apache.org/jira/browse/HBASE-48


JG

Sylvain Hellegouarch wrote:

I must admit, I'm left as puzzled as you are. Our current use case at work involves a large amount of small event log writing. Of course HDFS was quickly out of the question, since it's not there yet when it comes to appending to a file and, more generally, handling a large number of small write ops.
So we went with HBase because we trust the Hadoop/HBase infrastructure will offer us the robustness and reliability we need. That being said, I'm not feeling at ease with regard to the capacity of HBase to handle the potential load we are looking at inputting.
In fact, it's a common trait of such systems: they've been designed with a certain use case in mind, and sometimes I feel like their design and implementation leak way too much into our infrastructure, leading us down the path of a virtual lock-in.
Now I am not accusing anyone here, just observing that I find it really hard to locate any industrial story of those systems in a use case similar to the one we have at hand.
The number of nodes this or that company has doesn't interest me as much as the way they are actually using HBase and Hadoop.
RDBMSs don't scale as well, but they've got a long history and people do know how to optimise, use and manage them. It seems column-oriented database systems are still young :)
- Sylvain

Schubert Zhang wrote:
Regardless of Cassandra, I want to discuss some questions about
HBase/Bigtable.  Any advice is welcome.

Regarding running MapReduce to scan/analyze big data in HBase.

Compared to sequentially reading data from HDFS files directly,
scanning/sequentially reading data from HBase is slower. (In my tests, at
least 3:1 or 4:1.)

For the data in HBase, it is difficult to analyze only a specified part of
the data. For example, it is difficult to analyze only the most recent day of
data. In my application, I am considering partitioning data into different
HBase tables (e.g. one day - one table), so that I only touch one table
when analyzing via MapReduce.
In Google's Bigtable paper, in section "8.1 Google Analytics", they also
describe this usage, but I don't know how they do it.

It is also slower to put a flood of data into an HBase table than to write it to
files. (In my tests, at least 3:1 or 4:1 too.) So, maybe in the future, HBase
can provide a bulk-load feature, like PNUTS?
Many people suggest that we store only metadata in HBase tables and leave
the data in HDFS files, because our time-series dataset is very big.  I
understand this idea makes sense for some simple application requirements.
But usually, I want different indexes on the raw data. It is difficult to
build such indexes if the raw data files (which are raw or are
reconstructed via MapReduce periodically on recent data) are not totally
sorted.  .... HBase can provide us many expected features: sorted,
distributed b-tree, compact/merge.

So, it is very difficult for me to make the trade-off.
If I store data in HDFS files (maybe partitioned) and the metadata/index in
HBase, the metadata/index is very difficult to build.
If I rely on HBase totally, the performance of ingesting data and
scanning data is not good. Is it reasonable to do MapReduce on HBase? We know
the goal of HBase is to provide random access over HDFS, and it is an
extension or adaptor over HDFS.

----
Many a time I think that maybe we need a data storage engine which does
not need such strong consistency and can provide better write and
read throughput, like HDFS. Maybe we can design another system, like a
simpler HBase?

Schubert
On Wed, Sep 2, 2009 at 8:56 AM, Andrew Purtell <[email protected]> wrote:
To be precise, S3. http://status.aws.amazon.com/s3-20080720.html

  - Andy




________________________________
From: Andrew Purtell <[email protected]>
To: [email protected]
Sent: Tuesday, September 1, 2009 5:53:09 PM
Subject: Re: Cassandra vs HBase
Right... I recall an incident in AWS where a malformed gossip packet took
down all of Dynamo. Seems that even P2P doesn't mitigate against corner
cases.
On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis <[email protected]> wrote:
The big win for Cassandra is that its p2p distribution model -- which
drives the consistency model -- means there is no single point of
failure.  SPF can be mitigated by failover but it's really, really
hard to get all the corner cases right with that approach.  Even
Google with their 3 year head start and huge engineering resources
still has trouble with that occasionally.  (See e.g.
http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179.)
