Nuno, I guess the relative cost vs the assurance of no collisions makes MD5 worth it for me, especially as it is computed at the edges, so it should parallelize well. Perhaps I am underestimating the power of CRC for these purposes...
Thanks, Damien, for the assurances.

Chris

On Sun, Mar 16, 2008 at 5:46 PM, Nuno Job <[EMAIL PROTECTED]> wrote:
> Totally off topic:
>
> If you're using the MD5 as a CRC, why don't you simply use CRC?
>
> Just a thought :)
>
> Have a nice weekend
>
>
> On Mon, Mar 17, 2008 at 12:42 AM, Damien Katz <[EMAIL PROTECTED]> wrote:
> >
> > On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:
> >
> > > Couchers,
> > >
> > > I've been diving into CouchDB lately, and seeing how it's a great fit
> > > for my application. I've run into some questions about how to record
> > > information in an efficient way. Here's a description of what I'm
> > > trying to do, and a couple of methods I've come up with. Hopefully
> > > someone on the list can give me some insight into the pros and cons
> > > of each approach.
> > >
> > > Let's say I'm crawling the web, looking for embeds of YouTube videos
> > > on blogs and such. When I come across one, I'll be recording:
> > >
> > > the YouTube video URL.
> > > the URL the video was embedded on.
> > > a snippet of the context in which it was found.
> > >
> > > In the end I'd like to give people the ability to see where their
> > > videos are being embedded. E.g. start from a video and find the embeds
> > > of it from across the web.
> > >
> > > I'll be recrawling some blogs quite frequently, so I have this idea
> > > about how to avoid duplicate content:
> > >
> > > I calculate an MD5 hash from the information I want to store in a
> > > deterministic way, so processing the same page twice creates identical
> > > computed hash values. I use the hash values as document _ids, and PUT
> > > the data to CouchDB with no _rev attribute. CouchDB will reject the
> > > PUTs of duplicate data with a conflict. In my application I just
> > > ignore the conflict, as all it means is that I've already put that
> > > data there (maybe in an earlier crawl).
> > >
> > > The alternative approach is to forgo the MD5 hash calculation and
> > > POST the parsed data into CouchDB, creating a new record with an
> > > arbitrary id. I imagine that I would end up with a lot of identical
> > > data in this case, and it would become the job of the
> > > Map/Combine/Reduce process to filter duplicates while creating the
> > > lookup indexes.
> > >
> > > I suppose my question boils down to this: Are there unforeseen costs
> > > to building a high percentage of failing PUTs into my application
> > > design? It seems like the most elegant way to ensure unique data. But
> > > perhaps I am putting too high a premium on unique data - I suppose in
> > > the end it depends on the cost of computing a conflict vs the ongoing
> > > cost of calculating reductions across redundant data sets.
> >
> > I don't see any problems with this approach. For your purposes, using
> > MD5 hashes of data should work just fine.
> >
> > > Thanks for any insights!
> > > Chris
> > >
> > > --
> > > Chris Anderson
> > > http://jchris.mfdz.com
>
> --
> Nuno Job
> IBM DB2 Student Ambassador [http://caos.di.uminho.pt/~db2]
> Open Source Support Center Member [http://caos.di.uminho.pt]
> Blog [http://nunojob.wordpress.com] LinkedIn [http://www.linkedin.com/in/njpinto]

--
Chris Anderson
http://jchris.mfdz.com
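
For reference, a minimal sketch of the "deterministic MD5 as _id, PUT and ignore conflicts" approach discussed in the thread might look like the following. The CouchDB URL, the database name ("embeds"), the document field names, and the use of Python's requests library are illustrative assumptions, not details from the thread.

    # Sketch of the deduplicating PUT described above (assumptions noted in the lead-in).
    import hashlib
    import json

    import requests

    COUCH_URL = "http://localhost:5984/embeds"  # hypothetical database URL

    def record_embed(video_url, page_url, snippet):
        # Build the document deterministically so the same page crawled twice
        # produces byte-identical JSON, and therefore the same MD5 / _id.
        doc = {
            "video_url": video_url,
            "page_url": page_url,
            "snippet": snippet,
        }
        body = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        doc_id = hashlib.md5(body.encode("utf-8")).hexdigest()

        # PUT with no _rev: a new document is created (201); a duplicate is
        # rejected with 409 Conflict, which the application simply ignores.
        resp = requests.put(
            f"{COUCH_URL}/{doc_id}",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        if resp.status_code == 201:
            return "created"
        if resp.status_code == 409:
            return "duplicate"  # already stored, perhaps on an earlier crawl
        resp.raise_for_status()

The key design point is that the hash is computed from a canonical serialization of the data (sorted keys, fixed separators), since any variation in the serialized bytes would defeat the deduplication.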
