Re: HBase mention in VLDB keynote

Jonathan Gray Tue, 25 Aug 2009 14:08:39 -0700

If you are just looking for numbers, they can vary quite drastically depending on the cluster configuration, cluster hardware, jvm/gc configuration, dataset properties, read patterns, and load patterns. The ones I provided in that presentation are on a very small cluster but with simple data and low load, my attempt at getting some base numbers.

You really need to load up some of your own data and see how it behaves on your own cluster. And tuning is increasingly important now as we are limited by Java GC quite a bit.


JG

Schubert Zhang wrote:

@stack
We know HIVE-705, and already have good communication with the contributor,
since we are all Chinese. :-)
In fact, some code from the patch is used and tested in our project. But we
need a more flexible data store schema to resolve engineering problems,
especially performance and practicability.

@andy
Do Ryan's results differ from JG's?

On Wed, Aug 26, 2009 at 2:50 AM, Andrew Purtell <apurt...@apache.org> wrote:

Hi Schubert,

Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
contradiction." Can you provide more references, such as a URL/link for this
contradiction?

For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime

I'm sure you have seen this already.

Ryan has posted some information on the list now and again.

Also I think your work with performance evaluation is very important
feedback and data points. Thanks for that.

We are doing an interesting thing to let Hive use HBase as its data
store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
and also we can directly query/scan data from HBase.

That sounds REALLY interesting!

  - Andy




________________________________
From: Schubert Zhang <zson...@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Tuesday, August 25, 2009 8:26:50 PM
Subject: Re: HBase mention in VLDB keynote

hi andy,

Even though current HBase is not yet ready for production, we know its
data model and architecture can really be tested and evaluated.

Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
contradiction." Can you provide more references, such as a URL/link for this
contradiction?

Regarding Hive, it's really a good design, especially its abstraction of
MapReduce workflows mapped to SQL. Hive has been a good success inside
Facebook; the report says 29% of Facebook employees use Hive, and 51% of
those users are from outside engineering. That is probably because SQL is
easier to learn than other languages such as Pig Latin. In fact, Pig is now
adding metadata and SQL features, which Hive already provides. But Hive is
still not very flexible about using data stores other than HDFS files. We are
doing an interesting thing to let Hive use HBase as its data store. Now
we can use Hive's SQL to query/mapreduce data stored in HBase, and also we
can directly query/scan data from HBase.

I believe HBase can be a data store that works as a storage adapter layer
above HDFS. It is not a database; it is just a data storage adapter system
above HDFS, with a distributed b-tree clustered index. BigTable is designed
to provide easier ways to store small data objects and provide random
access, since GFS is designed for
sequential-access/batch-processing/large-data storage and is not
appropriate for storing small data objects or for random access.

I also believe HBase can be a data store that makes MapReduce over HBase
possible. If we review the Bigtable paper, especially section 8, we can
find it is widely used for mapreduce analysis/summary in many Google
applications.


In the recent ACM Queue interview with Sean Quinlan, Google GFS leader, we
can find that Google's new GFS integrates some data models of Bigtable.
http://queue.acm.org/detail.cfm?id=1594206


Schubert

On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens <
bradfordsteph...@gmail.com> wrote:

Interesting. I need to see what sort of eval was going on for that
presentation...

He probably forgot to tweak GC :)

On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <apurt...@apache.org>
wrote:

Can we write him to figure more on how evaluation was done?


This was one interaction with that group, maybe the only other one aside
from a question about sizing memstore:
http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
Now I wonder if the eval was done via the REST gateway... A followup might
be useful. If I run into someone from Yahoo Research here I'll ask.
Otherwise we should try mailing them, yes.

Should we try and get into VLDB next year?

We can certainly submit a candidate paper given a novel contribution of
some kind which moves the state of the art forward. There are also other
venues besides VLDB we can consider. Regardless, I think one of us should
attend VLDB every year.

Any thing else interesting at the conference?

Yes.

ETH Zurich presented a system which tailors consistency to the needs of
various data items -- "consistency rationing in the cloud: pay only when it
matters" -- choosing eventual (session) consistency or pessimistic 2PC on
demand according to a cost model, with good results. Made me think of
possibilities with THBase. Also, I watched a demo of HIVE, something I
hadn't seen to date. Their query planner and mapreduce scheduler are
interesting in concept and in detail. We're looking at Cascading for batch
analytics on top of HBase instead, but knowing more about alternatives is
always good.

The Hadoop-y track is really tomorrow.

Outside of direct relevance to things HBase I attended talks on aspects of
data fusion, ETL, and complex event processing / stream processing, wearing
my TM hat. Lots of good stuff here.

  - Andy

________________________________
From: Stack <saint....@gmail.com>
To: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
Sent: Tuesday, August 25, 2009 4:47:57 PM
Subject: Re: HBase mention in VLDB keynote

The same fella did keynote at apachecon eu on a similar topic.  Then he
talked mostly of Sherpa/pnuts yahoo tech.   In that presentation we got no
mention.  There the comparison strangely was to couchdb and perhaps
Cassandra (iirc).

So, mention is an improvement (do you think the kick up the behind I
rendered him after his amsterdam talk could have had anything to do with
it?).

Can we write him to figure more on how evaluation was done?

Should we try and get into vldb next year?

Good stuff Andy.  Any thing else interesting at the conference?

Stack



On Aug 25, 2009, at 6:17 AM, Andrew Purtell <apurt...@apache.org> wrote:

In this keynote address here at VLDB 2009 (http://vldb2009.org/?q=node/22)
Raghu Ramakrishnan, Yahoo! Research's Chief Scientist, made prominent
mention of HBase, much to my surprise (and later chagrin). This happened
near the end of the talk when a number of the new elastic/scalable/"nosql"
storage systems were discussed to make concrete some of the architectural
and data model points made earlier. The alternatives considered were
Yahoo's PNUTS, sharded MySQL, HBase, and Cassandra. I don't know what
version of HBase was used exactly but unfortunately the message was "not
ready yet". Perhaps it was a configuration or provisioning issue but HBase
did not really survive the evaluation, leading to short hyperbolic
performance curves terminating on the far left of the various graphs. This
was quite disappointing to see as the other alternatives were apparently
successfully tested on what can be presumed to be the same resources. It
stands to reason there is opportunity for HBase to improve here if only we
know what that is.

It was also a little disappointing that it appears through a mailing list
search that these issues were not brought to either hbase-dev@ or
hbase-users@, only a minor question relating to the REST interface.
Perhaps the community could have identified a specific configuration
problem, recommended a correction for a deployment/provisioning error, or
resolved a bug. To future evaluators of HBase: on behalf of the community,
I humbly request that you share your results, good or bad, so we can take
the feedback, or the bug reports and their artifacts (logs, etc.), and
improve our software.

At least, the story has already changed from what was presented today --
for example, the multimaster architecture of 0.20 was not presented, rather
the older one (circa 0.19); and JG's/Ryan's performance test results for
0.20 stand as a contradiction. We should look into opportunities to produce
a peer reviewed positive contribution. I think we have opportunities to take
some novel approaches in the system itself and/or produce a novel vertical
contribution and 0.20 is a good substrate for that.

Though this was unfortunately a missed opportunity for a good showing for
HBase in particular, the keynote in general was a well formulated
introduction of the emerging area of "cloud scale" storage / "nosql" systems
to the largest elite gathering of database and data processing researchers
in the world. The presentation was importantly also a call for participation
in the future development and directions of the new and growing "nosql"
constellation. Such participation, whether it is specific involvement with
the HBase project or not, would be and is most welcome, as the problem of
serving data at very large scale under "cloud" constraints is an area of
both significant challenge and significant promise. HBase, like other
projects in this area, is in an early stage of development. These projects
cover the use cases of their creators but, as answers to the larger set of
problems, they are not -- that space is untapped and only waiting for
creativity and effort. I think I can speak for HBase in particular: we
welcome this and would be pleased to assist at every opportunity.

   - Andy



--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science
