Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

Brian Bockelman Tue, 14 Apr 2009 11:03:18 -0700


On Apr 14, 2009, at 12:47 PM, Guilherme Germoglio wrote:

Hi Brian,
I'm sorry but it is not my paper. :-) I've posted the link here because
we're always looking for comparison data -- so, I thought this benchmark
would be welcome.

Ah, sorry, I guess I was being dense when looking at the author list. I'm dense a lot.

Also, I won't attend the conference. However, it would be a good idea for
someone who will attend to put all these questions and comments directly
to the authors and then post their answers here.


It would be interesting!

In the particular field I work in (HEP), databases have a long, colorful history of failure, while MapReduce-like approaches (although couched in very, very different terms and with some interesting alternate optimizations ... perhaps the best way to describe it would be Map-Reduce on a partially column-oriented, unstructured data store) have survived.

I'm a big fan of "use the right tool for the job". There are jobs for Map-Reduce, there are jobs for DBMS, and (as the authors point out), there is overlap and possible cross-pollination between the two. In the end, if the tools get better, everyone wins.


Brian

On Tue, Apr 14, 2009 at 2:26 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
Hey Guilherme,

It's good to see comparisons, especially as they help folks understand
better what tool is the best for their problem. As you show in your paper,
a MapReduce system is hideously bad at performing tasks that column-store
databases were designed for (selecting a single value along an index,
joining tables).

Some comments:
1) For some of your graphs, you show Hadoop's numbers in half-grey,
half-white. I can't figure out for the life of me what this signifies!
What have I overlooked?
2) I see that one of your co-authors is the CEO/inventor of the Vertica DB.
Out of curiosity, how did you interact with Vertica versus Hadoop versus
DBMS-X? Did you get help tuning the systems from the experts? I.e., if you
sat down with a Hadoop expert for a few days, I'm certain you could squeeze
out more performance, just like whenever I sit down with an Oracle DBA for a
few hours, my DB queries are much faster. You touch upon the sociological
issues (having to program your own code versus having to only know SQL, as
well as the comparative time it took to set up the DB) - I'd like to hear
how much time you spent "tweaking" and learning the best practices for the
three, very different approaches. If you added a 5th test, what's the
marginal effort required?
3) It would be nice to see how some of your more DB-like tasks perform on
something like HBase. That'd be a much more apples-to-apples comparison of
column-store DBMS versus column-store data system, although the HBase work
is just now revving up. I'm a bit uninformed in that area, so I don't have
a good gut feeling for how that'd do.
4) I think that the UDF aggregation task (calculating the inlink count for
each document in a sample) is interesting - it's a more Map-Reduce oriented
task, and it sounds like it was fairly miserable to hack around the
limitations / bugs in the DBMS.
5) I really think you undervalue the benefits of replication and
reliability, especially in terms of cost. As someone who helps with a small
site (about 300 machines, ranging from commodity workers to Sun Thumpers),
I can tell you that if your site depends on all your storage nodes
functioning, then your costs go way up. You can't make cheap hardware scale
unless your software can account for it.
- Yes, I realize this is a different approach than you take. There are
pros and cons to large expensive hardware versus lots of cheap hardware ...
the argument has been going on since the dawn of time. However, it's a bit
unfair to just outright dismiss one approach. I am a bit wary of the claims
that your results can scale up to Google/Yahoo scale, but I do agree that
there are truly few users that are that large!
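
To make point 4 concrete, here is a minimal sketch of how an inlink count
can be expressed as a map step and a reduce step. It is plain Python run
in a single process, not Hadoop code and not the benchmark's actual
implementation. The URL pattern, the function names, and the in-memory
"shuffle" are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: the thread does not say how the benchmark extracts URLs.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def map_inlinks(doc_url, contents):
    """Map step: emit (target_url, 1) for every link found in one document."""
    for target in URL_PATTERN.findall(contents):
        yield target, 1


def reduce_inlinks(target, counts):
    """Reduce step: sum the ones emitted for a single target URL."""
    return target, sum(counts)


def count_inlinks(documents):
    """Run map, group by key (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for doc_url, contents in documents.items():
        for target, one in map_inlinks(doc_url, contents):
            grouped[target].append(one)
    return dict(reduce_inlinks(t, c) for t, c in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample documents, keyed by their own URL.
    sample = {
        "http://a.example/": "see http://b.example/x and http://c.example/",
        "http://b.example/x": "link http://c.example/ again",
    }
    print(count_inlinks(sample))
```

The per-document parsing lives entirely in the map function, which is
arbitrary code. That is the part a DBMS has to push into a UDF.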

I love your last paragraph, it's a very good conclusion. It kind of
reminds me of the grid computing field which was (is?) completely shocked by
the emergence of cloud computing. After you cut through the hype
surrounding the new fads, you find (a) that there are some very good reasons
that the fads are popular - they have definite strengths that the existing
field was missing (or didn't want to hear) and (b) there's a lot of common
ground and learning that has to be done, even to get a good common
terminology :)

Enjoy your conference!

Brian

On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:

(Hadoop is used in the benchmarks)
http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and
development complexity. To this end, we define a benchmark
consisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of
parallelism on a cluster of 100 nodes. Our results reveal some
interesting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future
systems should take from both kinds of architectures.


--
Guilherme

msn: guigermog...@hotmail.com
homepage: http://germoglio.googlepages.com
