couchdb Comparison

Rob Stewart Wed, 07 Oct 2009 05:53:58 -0700

Hi all.
Two things about me:
1. I am new to CouchDB (and Erlang).
2. I am doing a Masters in Software Engineering, and for my dissertation
I am keen to compare distributed systems.


At my university, I have a cluster that I am going to be able to use for my
research. I am intending to deploy Hadoop across the cluster.

I have some ideas about what I want to analyse in detail, but I would
really value some sound advice from you guys as to what is relevant,
interesting, and feasible. I have had a look at Pig, the abstraction layer
on top of Hadoop, which, for those of you who are less familiar with it, is
well suited to log processing (web server access log files and so forth).
Pig taps into the Hadoop cluster, and intelligently distributes the
execution and data across the DataNodes.

So... CouchDB. Over the last few days I have looked into CouchDB: what it
is best suited to and, more importantly, how it can be distributed across a
cluster for comparative analysis. Unfortunately, what I have found is a
mixed bag. As far as I can tell, there is no de facto way to distribute
both the data *and* the execution of a CouchDB job seamlessly across a
cluster. Instead, the approaches that currently exist rely on a proxy
server. See the Google Summer of Code project for a much better explanation
of CouchDB's lack of formal distribution features (
http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster ).

So here's my question, or need for advice. What would be the best way to
carry out a relevant comparative study of Pig (or another Hadoop package if
necessary, e.g. HBase) against CouchDB in its current form? Ideally I would
want to compare the performance of the two when each has full use of the
cluster at university, although with CouchDB's current limitations I
realise this may not be possible.

Failing that, would it be unfair to compare the performance of a single
CouchDB server against the computational performance of Pig running on a
Hadoop cluster, or is that a reasonable test?

Finally... the dumbest question of all. I realise that CouchDB is a
document-based database system. But what tools are at a user's disposal to
manipulate the returned data? For example, in a typical DBMS one could
write " select * from CarList list where list.colour='blue' ", or perhaps a
SQL join statement, or a filter statement similar to the Pig filter
command. Is there functionality within CouchDB for this sort of record
manipulation, or do I need to write code in some other form to create these
sorts of functions?
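
To make the question concrete, here is a rough sketch in Python of the kind
of filter I have in mind. It runs over a plain in-memory list and says
nothing about how CouchDB itself works; the records, the field names and
the select_where helper are all made up for illustration.

```python
# Hypothetical records standing in for documents; field names are invented.
car_list = [
    {"make": "Ford", "colour": "blue"},
    {"make": "Fiat", "colour": "red"},
    {"make": "Saab", "colour": "blue"},
]


def select_where(records, field, value):
    """Return every record whose `field` equals `value`, like a SQL WHERE clause."""
    return [record for record in records if record.get(field) == value]


# Equivalent in spirit to: select * from CarList list where list.colour='blue'
blue_cars = select_where(car_list, "colour", "blue")
print([car["make"] for car in blue_cars])
```

What I am asking is whether CouchDB offers something built in that plays
the role of select_where, or whether I would have to write it myself.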


Many thanks. I look forward to following this mailing list, and to my study
over the coming year, wherever that may take me.

Thanks

Rob Stewart
