I've been doing web-spidering with CouchDB for quite a while now. Here's the approach I would take:
Ideally, you'll want to run your views on an already parsed set of data, e.g. JSON representations of the web pages. So perhaps you'll crawl into CouchDB (I use Nutch to crawl into Hadoop, and then have Hpricot/CouchRest push the parsed data into CouchDB). The upshot is that you'd have one database for raw crawled data and another for parsed data. Storing the raw data as attachments is roughly equivalent, if you do the parsing at crawl time. The important thing is to keep the raw HTML out of the map functions, as shuffling it through them will gum up the works quite a bit.

Once you have the parsed data in CouchDB documents, you can start to work on views. For performance, it is important to avoid emitting the entire doc; emitting just the data you need for a given application function will be much faster. The notes above on reduce are correct: your reduce should never aim to return a list of values, just an aggregate (sum, average, count, that sort of thing). There's a rough sketch of such a view at the end of this answer.

As far as geo queries go, having a view with emit(lat, null) and another with emit(long, null), and computing the intersection in the client, is probably the most straightforward way to go. It may be that a custom geo-index will be better, but for the time being, that is how I'd do it. Note that all view rows include the doc id, so emitting null as the value is good when you just want to know which docs match. Once you've computed the intersection of doc ids within the lat/long range you're searching, it's straightforward to query CouchDB for the set of documents. (And it will get even easier when the multi-key and include-docs patches are added, which should be real soon now.) A sketch of the geo views and the client-side intersection follows below as well.
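To make the view advice concrete, here is a minimal sketch of a map/reduce pair that emits only what the application needs and reduces to an aggregate. The field names (type, domain, word_count) are made up for illustration; substitute whatever your parser actually extracts.

    // Both functions live in a design document as the "map" and "reduce"
    // members of a view. Map: emit only the fields the application needs,
    // never the whole doc (and never the raw HTML).
    function(doc) {
      if (doc.type === "page" && doc.domain) {
        emit(doc.domain, doc.word_count);
      }
    }

    // Reduce: return an aggregate, never a list of values. sum() is
    // provided by CouchDB's JavaScript view server; the built-in "_sum"
    // or "_count" reduce would do the same job with less code.
    function(keys, values, rereduce) {
      return sum(values);
    }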
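And here is a rough sketch of the geo approach: two plain views plus a client-side intersection. The doc.lat / doc.lng field names, the "_design/geo" view paths, and the use of fetch() are assumptions; adjust them to your own design docs and HTTP client.

    // Two views, one keyed on latitude and one on longitude. Emitting
    // null as the value keeps the index small; each row still carries
    // the doc id, which is all the intersection needs.
    function(doc) { if (doc.lat != null) emit(doc.lat, null); }
    function(doc) { if (doc.lng != null) emit(doc.lng, null); }

    // Client side: query each view with startkey/endkey and intersect
    // the returned doc ids.
    async function docsInBox(db, minLat, maxLat, minLng, maxLng) {
      const ids = async (view, lo, hi) => {
        const res = await fetch(
          db + "/_design/geo/_view/" + view +
          "?startkey=" + lo + "&endkey=" + hi);
        return (await res.json()).rows.map(function(r) { return r.id; });
      };
      const inLat = new Set(await ids("by_lat", minLat, maxLat));
      const inLng = await ids("by_lng", minLng, maxLng);
      return inLng.filter(function(id) { return inLat.has(id); });
    }

Hope this helps and please keep us updated with your progress.

Chris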
