Totally off topic: if you're using the MD5 as a CRC, why not simply use a CRC?
Just a thought :) Have a nice weekend

On Mon, Mar 17, 2008 at 12:42 AM, Damien Katz <[EMAIL PROTECTED]> wrote:
>
> On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:
>
> > Couchers,
> >
> > I've been diving into CouchDB lately, and seeing how it's a great fit
> > for my application. I've run into some questions about how to record
> > information in an efficient way. Here's a description of what I'm
> > trying to do, and a couple of methods I've come up with. Hopefully
> > someone on the list can give me some insight into how to determine
> > what the pros and cons of each approach are.
> >
> > Let's say I'm crawling the web, looking for embeds of YouTube videos
> > on blogs and such. When I come across one, I'll be recording:
> >
> > the YouTube video URL.
> > the URL the video was embedded on.
> > a snippet of the context in which it was found.
> >
> > In the end I'd like to give people the ability to see where their
> > videos are being embedded. E.g. start from a video and find the
> > embeds of it from across the web.
> >
> > I'll be recrawling some blogs quite frequently, so I have this idea
> > about how to avoid duplicate content:
> >
> > I calculate an MD5 hash from the information I want to store in a
> > deterministic way, so processing the same page twice creates
> > identical computed hash values. I use the hash values as
> > document_ids, and PUT the data to CouchDB, with no _rev attribute.
> > CouchDB will reject the PUTs of duplicate data with a conflict. In my
> > application I just ignore the conflict, as all it means is that I've
> > already put that data there (maybe in an earlier crawl).
> >
> > The alternative approach is to forgo the MD5 hash calculation, and
> > POST the parsed data into CouchDB, creating a new record with an
> > arbitrary id. I imagine that I would end up with a lot of identical
> > data in this case, and it would become the job of the
> > Map/Combine/Reduce process to filter duplicates while creating the
> > lookup indexes.
> >
> > I suppose my question boils down to this: Are there unforeseen costs
> > to building a high percentage of failing PUTs into my application
> > design? It seems like the most elegant way to ensure unique data. But
> > perhaps I am putting too high a premium on unique data - I suppose in
> > the end it depends on the cost to compute a conflict, vs. the ongoing
> > cost of calculating reductions across redundant data sets.
>
> I don't see any problems with this approach. For your purposes, using
> MD5 hashes of data should work just fine.
>
> > Thanks for any insights!
> > Chris
> >
> > --
> > Chris Anderson
> > http://jchris.mfdz.com

--
Nuno Job
IBM DB2 Student Ambassador [http://caos.di.uminho.pt/~db2]
Open Source Support Center Member [http://caos.di.uminho.pt]
Blog [http://nunojob.wordpress.com]
LinkedIn [http://www.linkedin.com/in/njpinto]
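
For anyone reading along, here is a minimal sketch of the PUT-and-ignore-conflict
approach Chris describes. It assumes Python with the third-party requests library,
a local CouchDB at http://localhost:5984, a database named "embeds", and made-up
field names (video_url, page_url, context) - all of which are illustrative, not
anything from the thread.

import hashlib
import json

import requests  # third-party HTTP client, assumed to be installed

COUCH_URL = "http://localhost:5984/embeds"  # hypothetical database


def put_embed(video_url, page_url, context):
    """PUT one crawled embed; True if stored, False if it already existed."""
    # Serialize the document deterministically so identical content always
    # produces identical bytes (and therefore an identical hash).
    doc = {"video_url": video_url, "page_url": page_url, "context": context}
    body = json.dumps(doc, sort_keys=True, separators=(",", ":"))

    # Use the MD5 of the canonical serialization as the CouchDB document _id.
    doc_id = hashlib.md5(body.encode("utf-8")).hexdigest()

    # PUT with no _rev: CouchDB creates the document, or answers
    # 409 Conflict if a document with this _id already exists.
    resp = requests.put(COUCH_URL + "/" + doc_id, data=body,
                        headers={"Content-Type": "application/json"})
    if resp.status_code == 409:
        return False  # duplicate from an earlier crawl; safe to ignore
    resp.raise_for_status()  # any other error is unexpected
    return True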
