Otis, Thanks for your response.
I had a quick look at the Nutch forum and found that there is an implementation for de-duplicating documents/pages, but none for near-duplicate documents.

Can you guide me a little further as to where exactly in Nutch I should be looking for near-duplicate detection?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> To whomever started this thread: look at Nutch. I believe something
> related to this already exists in Nutch for near-duplicate detection.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, November 18, 2007 11:08:38 PM
> Subject: Re: Near Duplicate Documents
>
> On 18-Nov-07, at 8:17 AM, Eswar K wrote:
>
> > Is there any idea of implementing that feature in the upcoming
> > releases?
>
> Not currently. Feel free to contribute something if you find a good
> solution <g>.
>
> -Mike
>
> > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >
> >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>> We have a scenario where we want to find documents which are
> >>> similar in content. To elaborate a little more on what we mean
> >>> here, let's take an example.
> >>>
> >>> This email chain in which we are interacting can best be used to
> >>> illustrate the concept of near dupes (we are not confusing them
> >>> with threads; they are two different things). Each email in this
> >>> thread is treated as a document by the system. A reply to the
> >>> original mail also includes the original mail, in which case it
> >>> becomes a near duplicate of the original mail (depending on the
> >>> percentage of similarity). Similarly it goes on. The near dupes
> >>> need not be limited to emails.
> >>
> >> I think this is what's known as "shingling." See
> >> http://en.wikipedia.org/wiki/W-shingling
> >> Lucene (and therefore Solr) does not implement shingling. The
> >> "MoreLikeThis" query might be close enough, however.
> >>
> >> -Stuart
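
For anyone following the thread, here is a minimal sketch of the w-shingling idea Stuart refers to, written in plain Python. It is not taken from Nutch, Lucene, or Solr; the function names (`shingles`, `resemblance`), the whitespace tokenization, and the window size `w=4` are assumptions made only for illustration. It compares two texts by the Jaccard overlap of their sets of w-word windows, which is how a reply that quotes the original mail ends up scoring as a near duplicate of it.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text.

    Tokenization is a plain lowercase whitespace split; this is an
    illustrative assumption, not what any particular library does.
    """
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets, a value in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


# A reply that embeds the original text shares most of its shingles with it.
original = "we want to find documents which are similar in content"
reply = "thanks for your mail " + original
unrelated = "lucene is a full text search library written in java"

print(resemblance(original, reply))      # high: reply contains the original
print(resemblance(original, unrelated))  # no shared 4-word windows
```

A pair of documents would then be flagged as near duplicates when the score exceeds some threshold chosen for the corpus. Comparing every pair this way does not scale to a large index, so real systems usually compare compact sketches of the shingle sets instead.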