Hi Schubert,
Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
contradiction." Can you provide more references, such as a URL/link for
these contradicting results?
For JG: http://www.docstoc.com/docs/7493304/HBase-Goes-Realtime
I'm sure you have seen this already.
Ryan has posted some information on the list now and again.
Also I think your work with performance evaluation is very important
feedback and data points. Thanks for that.
We are doing an interesting thing: making Hive able to use HBase as its
data store. Now we can use Hive's SQL to query/mapreduce data stored in
HBase, and we can also directly query/scan data from HBase.
That sounds REALLY interesting!
- Andy
________________________________
From: Schubert Zhang <zson...@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Tuesday, August 25, 2009 8:26:50 PM
Subject: Re: HBase mention in VLDB keynote
hi andy,
Even though the current HBase is not yet ready for production, we know its
data model and architecture can really be tested and evaluated.
Regarding "...and JG's/Ryan's performance test results for 0.20 stand as a
contradiction." Can you provide more references, such as a URL/link for
these contradicting results?
Regarding Hive, it's really a good design, especially its abstraction of
the MapReduce workflow mapped to SQL. Hive has been a real success inside
Facebook: the report says 29% of Facebook employees use Hive, and 51% of
those users are from outside engineering. That is probably because SQL is
easier to learn than other languages such as Pig Latin. In fact, Pig is now
adding metadata and SQL features, which Hive already provides. But Hive is
still not very flexible about using data stores other than HDFS files. We
are doing an interesting thing: making Hive able to use HBase as its data
store. Now we can use Hive's SQL to query/mapreduce data stored in HBase,
and we can also directly query/scan data from HBase.
I believe HBase can serve as a storage adapter layer above HDFS. It is not
a database; it is just a data storage adapter system above HDFS, with a
distributed b-tree clustered index. BigTable is designed to provide
easier-to-use ways to store small data objects with random access, since
GFS is designed for sequential-access/batch-processing/large-data storage
and is not appropriate for storing small data objects or for random access.
I also believe HBase can be a data store that makes MapReduce over HBase
possible. If we review the Bigtable paper, especially section 8, we find
that Bigtable is widely used for MapReduce analysis/summary in many Google
applications.
In the recent ACM Queue interview with Sean Quinlan, Google's GFS lead, we
can see that Google's new GFS integrates some of Bigtable's data models.
http://queue.acm.org/detail.cfm?id=1594206
Schubert
On Wed, Aug 26, 2009 at 12:36 AM, Bradford Stephens <
bradfordsteph...@gmail.com> wrote:
Interesting. I need to see what sort of eval was going on for that
presentation...
He probably forgot to tweak GC :)
On Tue, Aug 25, 2009 at 9:32 AM, Andrew Purtell <apurt...@apache.org>
wrote:
Can we write him to figure more on how evaluation was done?
This was one interaction with that group, maybe the only other one aside
from a question about sizing memstore:
http://osdir.com/ml/hbase-user-hadoop-apache/2009-07/msg00552.html
Now I wonder if the eval was done via the REST gateway... A followup might
be useful. If I run into someone from Yahoo Research here I'll ask.
Otherwise we should try mailing them, yes.
Should we try and get into VLDB next year?
We can certainly submit a candidate paper given a novel contribution of
some kind which moves the state of the art forward. There are also other
venues besides VLDB that we can consider. Regardless, I think one of us
should attend VLDB every year.
Any thing else interesting at the conference?
Yes.
ETH Zurich presented a system which tailors consistency to the needs of
various data items -- "consistency rationing in the cloud: pay only when
it matters" -- choosing eventual (session) consistency or pessimistic 2PC
on demand according to a cost model, with good results. Made me think of
possibilities with THBase. Also, I watched a demo of Hive, something I
hadn't seen to date. Their query planner and mapreduce scheduler are
interesting in concept and in detail. We're looking at Cascading for batch
analytics on top of HBase instead, but knowing more about alternatives is
always good.
The Hadoop-y track is really tomorrow.
Outside of direct relevance to things HBase I attended talks on aspects of
data fusion, ETL, and complex event processing / stream processing,
wearing my TM hat. Lots of good stuff here.
- Andy
________________________________
From: Stack <saint....@gmail.com>
To: "hbase-user@hadoop.apache.org" <hbase-user@hadoop.apache.org>
Sent: Tuesday, August 25, 2009 4:47:57 PM
Subject: Re: HBase mention in VLDB keynote
The same fella did keynote at apachecon eu on a similar topic. Then he
talked mostly of Sherpa/pnuts yahoo tech. In that presentation we got no
mention. There the comparison strangely was to couchdb and perhaps
Cassandra (iirc).
So, mention is an improvement (do you think the kick up the behind I
rendered him after his amsterdam talk could have had anything to do with
it?).
Can we write him to figure more on how evaluation was done?
Should we try and get into vldb next year?
Good stuff Andy. Any thing else interesting at the conference?
Stack
On Aug 25, 2009, at 6:17 AM, Andrew Purtell <apurt...@apache.org>
wrote:
In this keynote address here at VLDB 2009 (http://vldb2009.org/?q=node/22)
Raghu Ramakrishnan, Yahoo! Research's Chief Scientist, made prominent
mention of HBase, much to my surprise (and later chagrin). This happened
near the end of the talk when a number of the new
elastic/scalable/"nosql" storage systems were discussed to make concrete
some of the architectural and data model points made earlier. The
alternatives considered were Yahoo's PNUTS, sharded MySQL, HBase, and
Cassandra. I don't know what version of HBase was used exactly but
unfortunately the message was "not ready yet". Perhaps it was a
configuration or provisioning issue but HBase did not really survive the
evaluation, leading to short hyperbolic performance curves terminating on
the far left of the various graphs. This was quite disappointing to see as
the other alternatives were apparently successfully tested on what can be
presumed to be the same resources. It stands to reason there is
opportunity for HBase to improve here if only we know what that is.
It was also a little disappointing that it appears through a mailing list
search that these issues were not brought to either hbase-dev@ or
hbase-users@, only a minor question relating to the REST interface.
Perhaps the community could have identified a specific configuration
problem, recommended a correction for a deployment/provisioning error, or
resolved a bug. To future evaluators of HBase, on behalf of the community
I humbly request that you share your results, good or bad, so we can take
the feedback, or the bug reports and their artifacts (logs, etc.) and
improve our software.
At least, the story has already changed from what was presented today --
for example, the multimaster architecture of 0.20 was not presented,
rather the older one (circa 0.19); and JG's/Ryan's performance test
results for 0.20 stand as a contradiction. We should look into
opportunities to produce a peer reviewed positive contribution. I think we
have opportunities to take some novel approaches in the system itself
and/or produce a novel vertical contribution and 0.20 is a good substrate
for that.
Though this was unfortunately a missed opportunity for a good showing for
HBase in particular, the keynote in general was a well formulated
introduction of the emerging area of "cloud scale" storage / "nosql"
systems to the largest elite gathering of database and data processing
researchers in the world. The presentation was importantly also a call for
participation in the future development and directions of the new and
growing "nosql" constellation. Such participation, whether it is specific
involvement with the HBase project or not, would be and is most welcome,
as the problem of serving data at very large scale under "cloud"
constraints is an area of both significant challenge and significant
promise. HBase, like other projects in this area, is at an early stage of
development. These projects cover the use cases of their creators, but
they are not yet answers to the larger set of problems -- that space is
untapped and only waiting for creativity and effort. I think I can speak
for HBase in particular: we welcome this and would be pleased to assist at
every opportunity.
- Andy
--
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science