Hi Ken,

It's correct that uncommon words will most likely not show up in the signature. However, my point was that if two documents share 99% of their tokens and differ in just one token whose change crosses a quantised-frequency boundary, the two resulting hashes are completely different. For true near-dup detection, what you would like to have is two hashes that differ in only 1-2 bytes. That way, the signatures truly reflect the content of the documents they represent. The trade-off is that with this approach you need a bit more work to cluster near-dup documents. Basically, once you have a hash function as described above, finding similar documents comes down to a Hamming distance problem: two docs are near dups if their hashes differ in at most k positions (with k small, perhaps < 3).
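To make the idea concrete, here is a rough sketch of one such similarity-preserving hash (essentially Charikar's simhash; the 64-bit size, the helper names, and the use of MD5 as the per-token hash are my own illustrative choices, not something from this thread):

```python
import hashlib

def simhash(tokens, bits=64):
    """Similarity-preserving fingerprint: documents with nearly the
    same token multiset get hashes that differ in only a few bits."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position,
            # according to the corresponding bit of its own hash.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def near_dup(doc_a, doc_b, k=3):
    """Two docs are near dups if their fingerprints differ in <= k bits."""
    return hamming(simhash(doc_a.split()), simhash(doc_b.split())) <= k
```

With a scheme like this, two documents sharing 99% of their tokens land a small Hamming distance apart, whereas an MD5 of the full text would be completely different after a one-letter change.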
On Nov 22, 2007 2:35 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> >The duplication detection mechanism in Nutch is quite primitive. I
> >think it uses a MD5 signature generated from the content of a field.
> >The generation algorithm is described here:
> >http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.
> >
> >The problem with this approach is MD5 hash is very sensitive: one
> >letter difference will generate completely different hash.
>
> I'm confused by your answer, assuming it's based on the page
> referenced by the URL you provided.
>
> The approach by TextProfileSignature would only generate a different
> MD5 hash with a single letter change if that change resulted in a
> change in the quantized frequency for that word. And if it's an
> uncommon word, then it wouldn't even show up in the signature.
>
> -- Ken
>
> >You probably have to roll your own near duplication detection algorithm.
> >My advice is have a look at existing literature on near duplication
> >detection techniques and then implement one of them. I know Google has
> >some papers that describe a technique called minhash. I read the paper
> >and found it's very interesting. I'm not sure if you can implement the
> >algorithm because they have patented it. That said, there are plenty
> >literature on near dup detection so you should be able to get one for
> >free!
> >
> >On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> >> Otis,
> >>
> >> Thanks for your response.
> >>
> >> I just gave a quick look to the Nutch Forum and find that there is an
> >> implementation to obtain de-duplicate documents/pages but none for Near
> >> Duplicates documents. Can you guide me a little further as to where
> >> exactly under Nutch I should be concentrating, regarding near duplicate
> >> documents?
> >> Regards,
> >> Rishabh
> >>
> >> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> >>
> >> > To whomever started this thread: look at Nutch. I believe something
> >> > related to this already exists in Nutch for near-duplicate detection.
> >> >
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >> >
> >> > ----- Original Message ----
> >> > From: Mike Klaas <[EMAIL PROTECTED]>
> >> > To: solr-user@lucene.apache.org
> >> > Sent: Sunday, November 18, 2007 11:08:38 PM
> >> > Subject: Re: Near Duplicate Documents
> >> >
> >> > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >> >
> >> > > Is there any idea implementing that feature in the up coming
> >> > > releases?
> >> >
> >> > Not currently. Feel free to contribute something if you find a good
> >> > solution <g>.
> >> >
> >> > -Mike
> >> >
> >> >
> >> > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >> > >>> We have a scenario, where we want to find out documents which are
> >> > >>> similar in content. To elaborate a little more on what we mean
> >> > >>> here, lets take an example.
> >> > >>>
> >> > >>> The example of this email chain in which we are interacting on,
> >> > >>> can be best used for illustrating the concept of near dupes (We
> >> > >>> are not getting confused with threads, they are two different
> >> > >>> things.). Each email in this thread is treated as a document by
> >> > >>> the system. A reply to the original mail also includes the
> >> > >>> original mail in which case it becomes a near duplicate of the
> >> > >>> orginal mail (depending on the percentage of similarity).
> >> > >>> Similarly it goes on. The near dupes need not be limited to
> >> > >>> emails.
> >> > >>
> >> > >> I think this is what's known as "shingling." See
> >> > >> http://en.wikipedia.org/wiki/W-shingling
> >> > >> Lucene (and therefore Solr) does not implement shingling. The
> >> > >> "MoreLikeThis" query might be close enough, however.
> >> > >>
> >> > >> -Stuart
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

--
Regards,
Cuong Hoang