Re: Cassandra vs HBase

Schubert Zhang Wed, 02 Sep 2009 09:54:25 -0700

Setting Cassandra aside, I want to discuss some questions about
HBase/Bigtable.  Any advice is welcome.

Regarding running MapReduce to scan/analyze big data in HBase:

Compared to sequentially reading data from HDFS files directly,
scanning (sequentially reading) data from HBase is slower (in my tests, by at
least 3:1 or 4:1).

For data in HBase, it is difficult to analyze only a specified part of the
data. For example, it is difficult to analyze only the most recent day of
data. In my application, I am considering partitioning the data into
different HBase tables (e.g. one table per day), so that a MapReduce analysis
only has to touch one table.
Section 8.1 ("Google Analytics") of Google's Bigtable paper also describes
this kind of usage, but I don't know how it is done.
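
To make the one-table-per-day idea concrete, here is a minimal sketch of how
records could be routed to daily tables and how a job could pick only the
tables it needs. The "events_" prefix and both function names are
hypothetical, and nothing here calls HBase itself; it only computes table
names.

```python
from datetime import date, datetime, timedelta, timezone
from typing import List

# Hypothetical naming scheme: one table per UTC day, e.g. "events_20090902".
TABLE_PREFIX = "events_"


def table_for(ts: datetime) -> str:
    """Return the daily table name for a timezone-aware timestamp."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return TABLE_PREFIX + ts.astimezone(timezone.utc).strftime("%Y%m%d")


def tables_between(start: date, end: date) -> List[str]:
    """List daily table names for the inclusive range [start, end]."""
    days = (end - start).days
    return [
        TABLE_PREFIX + (start + timedelta(days=i)).strftime("%Y%m%d")
        for i in range(days + 1)
    ]
```

A job over the most recent day would then take only table_for(now) as input,
and a job over a week would take the seven names from tables_between(...).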

It is also slower to load a flood of data into an HBase table than to write
it to files (in my tests, again at least 3:1 or 4:1). So maybe in the future
HBase can provide a bulk-load feature, like PNUTS?

Many people suggest that we store only metadata in HBase tables and leave
the data in HDFS files, because our time-series dataset is very big.  I
understand that this idea makes sense for some simple application
requirements. But usually I want different indexes on the raw data. It is
difficult to build such indexes if the raw data files (which are either raw
or periodically reconstructed via MapReduce on recent data) are not totally
sorted.  HBase can provide many of the features we want: sorted,
distributed B-tree, compaction/merge.

So it is very difficult for me to make the trade-off.
If I store the data in HDFS files (possibly partitioned) and the
metadata/index in HBase, the metadata/index is very difficult to build.
If I rely on HBase entirely, the performance of ingesting and scanning data
is not good. Is it reasonable to do MapReduce on HBase? We know that the goal
of HBase is to provide random access over HDFS, and that it is an extension
of or adaptor over HDFS.

----
I often think that maybe we need a data storage engine that does not need
such strong consistency and can provide better write and read throughput,
like HDFS. Maybe we can design another system, like a simpler HBase?

Schubert

On Wed, Sep 2, 2009 at 8:56 AM, Andrew Purtell <[email protected]> wrote:

> To be precise, S3. http://status.aws.amazon.com/s3-20080720.html
>
>   - Andy
>
>
>
>
> ________________________________
> From: Andrew Purtell <[email protected]>
> To: [email protected]
> Sent: Tuesday, September 1, 2009 5:53:09 PM
> Subject: Re: Cassandra vs HBase
>
>
> Right... I recall an incident in AWS where a malformed gossip packet took
> down all of Dynamo. Seems that even P2P doesn't mitigate against corner
> cases.
>
>
> On Tue, Sep 1, 2009 at 3:12 PM, Jonathan Ellis <[email protected]> wrote:
>
> > The big win for Cassandra is that its p2p distribution model -- which
> > drives the consistency model -- means there is no single point of
> > failure.  SPF can be mitigated by failover but it's really, really
> > hard to get all the corner cases right with that approach.  Even
> > Google with their 3 year head start and huge engineering resources
> > still has trouble with that occasionally.  (See e.g.
> > http://groups.google.com/group/google-appengine/msg/ba95ded980c8c179.)
>
>
>
>
