I've been doing web-spidering with CouchDB for quite a while now. Here's the approach I would take:
Ideally, you'll want to run your views on an already parsed set of data, e.g. JSON representations of the web pages. So perhaps you'll crawl into CouchDB (I use Nutch to crawl into Hadoop, and then have Hpricot/CouchRest push the parsed data into CouchDB). The upshot is that you'd have one database for raw crawled data and another for parsed data. Storing the raw data as attachments is roughly equivalent, if you do the parsing at crawl time. The important thing is to keep the raw HTML out of the map functions, as shuffling it through them will gum up the works quite a bit.

Once you have the parsed data in CouchDB documents, you can start to work on views. For performance, it is important to avoid emitting the entire doc; emitting just the data you need for a given application function will be much faster. The notes above on reduce are correct: your reduce should never aim to return a list of values, just an aggregate (sum, average, count, that sort of thing). There's a rough sketch of such a view at the end of this answer.

As far as geo queries go, having a view with emit(lat, null) and another with emit(long, null), and computing the intersection in the client, is probably the most straightforward way to go. It may be that a custom geo-index will be better, but for the time being, that is how I'd do it. Note that all view rows include the doc id, so emitting null as the value is good when you just want to know which docs match. Once you've computed the intersection of doc ids within the lat/long range you're searching, it's straightforward to query CouchDB for the set of documents. (And it will get even easier when the multi-key and include-docs patches are added, which should be real soon now.) A sketch of the geo views and the client-side intersection follows below as well.
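To make the view advice concrete, here is a minimal sketch of a map/reduce pair that emits only what the application needs and reduces to an aggregate. The field names (type, domain, word_count) are made up for illustration; substitute whatever your parser actually extracts.

    // Both functions live in a design document as the "map" and "reduce"
    // members of a view. Map: emit only the fields the application needs,
    // never the whole doc (and never the raw HTML).
    function(doc) {
      if (doc.type === "page" && doc.domain) {
        emit(doc.domain, doc.word_count);
      }
    }

    // Reduce: return an aggregate, never a list of values. sum() is
    // provided by CouchDB's JavaScript view server; the built-in "_sum"
    // or "_count" reduce would do the same job with less code.
    function(keys, values, rereduce) {
      return sum(values);
    }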
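And here is a rough sketch of the geo approach: two plain views plus a client-side intersection. The doc.lat / doc.lng field names, the "_design/geo" view paths, and the use of fetch() are assumptions; adjust them to your own design docs and HTTP client.

    // Two views, one keyed on latitude and one on longitude. Emitting
    // null as the value keeps the index small; each row still carries
    // the doc id, which is all the intersection needs.
    function(doc) { if (doc.lat != null) emit(doc.lat, null); }
    function(doc) { if (doc.lng != null) emit(doc.lng, null); }

    // Client side: query each view with startkey/endkey and intersect
    // the returned doc ids.
    async function docsInBox(db, minLat, maxLat, minLng, maxLng) {
      const ids = async (view, lo, hi) => {
        const res = await fetch(
          db + "/_design/geo/_view/" + view +
          "?startkey=" + lo + "&endkey=" + hi);
        return (await res.json()).rows.map(function(r) { return r.id; });
      };
      const inLat = new Set(await ids("by_lat", minLat, maxLat));
      const inLng = await ids("by_lng", minLng, maxLng);
      return inLng.filter(function(id) { return inLat.has(id); });
    }

Hope this helps and please keep us updated with your progress.

Chris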
