Hi all. Two things about me: 1. I am new to CouchDB (and Erlang). 2. I am doing a Master's in Software Engineering, and for my dissertation I am keen to compare distributed systems.
At my university, I have access to a cluster that I can use for my research, and I intend to deploy Hadoop across it. I have some ideas about what I want to analyse in detail, but I really need some sound advice from you as to what is relevant, interesting, and feasible.

I have had a look at Pig, the abstraction layer on top of Hadoop, which, for those of you less familiar with it, is ideal for processing log data (web server access logs and so forth). Pig taps into the Hadoop cluster and intelligently distributes the execution and data across the DataNodes.

So... CouchDB. Over the last few days I have looked into CouchDB: what it is best suited for and, more importantly, how it can be distributed across a cluster for comparative analysis. Unfortunately, what I have found is a mixed bag. As far as I can tell, there is no de facto way to distribute both the data *and* the execution of a CouchDB job seamlessly across a cluster. Instead, the methods that currently exist rely on a proxy server. See the Google Summer of Code project for a much better explanation of CouchDB's lack of formal distribution features ( http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster ).

So here's my question, or need for advice: what would be the best way to produce a relevant comparative study of Pig (or another Hadoop package if necessary, e.g. HBase) against CouchDB in its current form? Ideally I would compare the performance of the two when each has full use of the university cluster, although with CouchDB's current limitations I realise this may not be possible. Failing that, is it unfair to compare the performance of CouchDB running on a single server against the computational performance of Pig running on a Hadoop cluster, or is that a reasonable test?

Finally, the dumbest question of all. I realise that CouchDB is a document-based database system.
But what tools are at a user's disposal to manipulate the returned data? E.g. in a typical DBMS, one could write " select * from CarList list where list.colour='blue' ", or perhaps a SQL join statement, or a filter statement similar to the Pig filter command. So is there functionality within CouchDB for this sort of record manipulation, or do I need to code these sorts of functions myself in some other format?

Many thanks. I look forward to following this mailing list over the course of the year, and indeed to my study over the coming year, wherever that may take me.

Thanks,
Rob Stewart
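
P.S. To make that last question concrete, here is a minimal sketch of the kind of filter I mean, written in plain Python rather than anything CouchDB-specific. The car_list records and the select_where helper are made up purely for illustration; I am not suggesting this is how CouchDB does (or should do) it.

```python
# Hypothetical records standing in for the "CarList" table in the SQL
# example; the field names and values are invented for illustration.
car_list = [
    {"make": "Ford", "colour": "blue"},
    {"make": "Fiat", "colour": "red"},
    {"make": "Saab", "colour": "blue"},
]


def select_where(records, field, value):
    """Return every record whose `field` equals `value`,
    roughly what a SQL WHERE clause does."""
    return [r for r in records if r.get(field) == value]


# Equivalent in spirit to:
#   select * from CarList list where list.colour='blue'
blue_cars = select_where(car_list, "colour", "blue")
print(blue_cars)
```

What I am asking is whether CouchDB offers something built in that plays the role of select_where here, or whether I would write this sort of logic myself.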
