On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:
Couchers,

I've been diving into CouchDB lately, and seeing how it's a great fit for my application. I've run into some questions about how to record information in an efficient way. Here's a description of what I'm trying to do, and a couple of methods I've come up with. Hopefully someone on the list can give me some insight into how to determine the pros and cons of each approach.

Let's say I'm crawling the web, looking for embeds of YouTube videos on blogs and such. When I come across one, I'll be recording:

- the YouTube video URL
- the URL the video was embedded on
- a snippet of the context in which it was found

In the end I'd like to give people the ability to see where their videos are being embedded, e.g. start from a video and find the embeds of it from across the web.

I'll be recrawling some blogs quite frequently, so I have this idea about how to avoid duplicate content: I calculate an MD5 hash from the information I want to store in a deterministic way, so processing the same page twice creates identical computed hash values. I use the hash values as document_ids, and PUT the data to CouchDB with no _rev attribute. CouchDB will reject the PUTs of duplicate data with a conflict. In my application I just ignore the conflict, as all it means is that I've already put that data there (maybe in an earlier crawl).

The alternative approach is to forgo the MD5 hash calculation and POST the parsed data into CouchDB, creating a new record with an arbitrary id. I imagine that I would end up with a lot of identical data in this case, and it would become the job of the Map/Combine/Reduce process to filter duplicates while creating the lookup indexes.

I suppose my question boils down to this: are there unforeseen costs to building a high percentage of failing PUTs into my application design? It seems like the most elegant way to ensure unique data. But perhaps I am putting too high a premium on unique data - I suppose in the end it depends on the cost of computing a conflict vs. the ongoing cost of calculating reductions across redundant data sets.
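A minimal sketch of the hash-as-id approach, assuming Python with the requests library and a hypothetical local database at http://localhost:5984/embeds; the field names and the record_embed helper are illustrative, not from the original message.

    import hashlib
    import json

    import requests  # assumption: plain HTTP against CouchDB via requests

    COUCH = "http://localhost:5984/embeds"  # hypothetical database URL

    def record_embed(video_url, page_url, snippet):
        """Store one observed embed, skipping duplicates via a deterministic id."""
        doc = {"video_url": video_url, "page_url": page_url, "snippet": snippet}

        # Serialize deterministically so the same page always hashes the same way.
        doc_id = hashlib.md5(
            json.dumps(doc, sort_keys=True).encode("utf-8")
        ).hexdigest()

        # PUT with no _rev: CouchDB creates the document, or answers
        # 409 Conflict if a document with this id already exists.
        resp = requests.put(f"{COUCH}/{doc_id}", json=doc)
        if resp.status_code == 409:
            return False          # duplicate from an earlier crawl; ignore it
        resp.raise_for_status()   # surface any other error
        return True               # 201 Created: new embed recorded

The only subtlety is serializing the document deterministically (sort_keys=True here), so that recrawling the same page always produces the same id.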
I don't see any problems with this approach. For your purposes, using MD5 hashes of data should work just fine.
Thanks for any insights!

Chris

--
Chris Anderson
http://jchris.mfdz.com
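For comparison, a rough sketch of the alternative approach Chris describes: POST every observation under an arbitrary id and let a map/reduce view collapse duplicates while building the lookup index. This assumes the same hypothetical database plus an illustrative design document _design/embeds with a by_video view; CouchDB view functions are JavaScript strings stored inside the design document.

    import requests  # assumption: plain HTTP against CouchDB via requests

    COUCH = "http://localhost:5984/embeds"  # hypothetical database URL

    # The map keys each row by [video_url, page_url]; the built-in _count
    # reduce tallies how many times the crawler saw that pair.
    design_doc = {
        "_id": "_design/embeds",
        "views": {
            "by_video": {
                "map": """
                    function(doc) {
                      if (doc.video_url && doc.page_url) {
                        emit([doc.video_url, doc.page_url], 1);
                      }
                    }
                """,
                "reduce": "_count",
            }
        },
    }
    requests.put(f"{COUCH}/_design/embeds", json=design_doc)

    # Querying with group=true collapses the redundant documents: each distinct
    # (video_url, page_url) pair comes back once, with its observation count.
    rows = requests.get(
        f"{COUCH}/_design/embeds/_view/by_video",
        params={"group": "true"},
    ).json()["rows"]

Keying by video_url first also gives the lookup the original question asks for: a startkey/endkey range on a given video_url returns all the pages that video was embedded on.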
