Hi there, I am currently trying to dive into Map/Reduce, clustering (k-means, canopy, etc.), large data sets and so on.
To be honest, I am a bit confused by all the stuff out there: I had a brief look at Apache Hadoop, the Pythonic Disco framework and others. Today, I found CouchDB. And, what can I say, I really like the lightweight feeling of CouchDB. I experimented a bit with the CouchDB Map/Reduce views, but I am wondering whether I can take map/reduce further in this context. More precisely, is it possible to run more than one map/reduce job over a complete dataset?

The simplest example I can imagine is borrowed from the usual map/reduce example: word_count. Imagine I have a database where some (not all!) documents have a field called "fulltext", and I want to count all words in that field. The common map/reduce approach would consist of two jobs: first, get all documents with that field, and second, count the words in those documents. I know that with CouchDB I could run that in one job, but if you think of more complex examples, it would be nice to further map/reduce the query result set.

Another example: I have a document set where all documents have the fields "link_to", "permalink" and "date_published". Now I want to know which articles got a backlink last Sunday. So, first I would create a view giving me all documents with "date_published" = last Sunday. In a second step I would emit all documents that a "link_to" in this query result points to. That sounds a bit like a relational database issue, and I know CouchDB isn't designed to replace an RDBMS, but a query like that should be possible.

I know there are workarounds for those examples so that you can handle them with one single map/reduce view, but if you look at more complex map/reduce algorithms (see also: Apache Mahout), it would be great if one could combine the accessibility of CouchDB with a full-featured map/reduce framework. Is that possible with CouchDB?

Thank you in advance for your comments.

Kind regards,
Hendrijk
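
P.S. To make the word_count case concrete, here is a rough sketch in plain Python of the kind of chaining I have in mind. It is not CouchDB code and does not use any CouchDB API; the helper run_map_reduce and all function names are made up for illustration, and the documents are toy data. The point is only that the output of the first job becomes the input of the second.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Run one map/reduce job in memory (hypothetical helper, not a CouchDB API).

    Maps every record to (key, value) pairs, groups the values by key,
    then reduces each group to a single value.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Job 1: keep only the documents that have a "fulltext" field.
def map_has_fulltext(doc):
    if "fulltext" in doc:
        yield doc["_id"], doc["fulltext"]


def reduce_first(key, values):
    # Document ids are unique here, so each group holds exactly one value.
    return values[0]


# Job 2: count the words in the result of job 1.
def map_words(item):
    _doc_id, text = item
    for word in text.lower().split():
        yield word, 1


def reduce_sum(key, values):
    return sum(values)


# Toy data; only documents "a" and "c" carry a "fulltext" field.
docs = [
    {"_id": "a", "fulltext": "the quick brown fox"},
    {"_id": "b", "title": "no fulltext here"},
    {"_id": "c", "fulltext": "the lazy dog"},
]

# The result set of the first job is fed straight into the second job.
stage1 = run_map_reduce(docs, map_has_fulltext, reduce_first)
word_counts = run_map_reduce(stage1.items(), map_words, reduce_sum)
```

What I am asking is whether CouchDB offers something equivalent to that last line, i.e. running a second map/reduce over the result of a first view.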
