Otis,

Thanks for your response.

I took a quick look at the Nutch forum and found that there is an
implementation for de-duplicating documents/pages, but none for
near-duplicate documents. Can you guide me a little further as to where
exactly in Nutch I should be looking for near-duplicate detection?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> To whomever started this thread: look at Nutch.  I believe something
> related to this already exists in Nutch for near-duplicate detection.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Mike Klaas <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, November 18, 2007 11:08:38 PM
> Subject: Re: Near Duplicate Documents
>
> On 18-Nov-07, at 8:17 AM, Eswar K wrote:
>
> > Is there any plan to implement that feature in the upcoming
> > releases?
>
> Not currently.  Feel free to contribute something if you find a good
> solution <g>.
>
> -Mike
>
>
> > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >
> >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>> We have a scenario where we want to find documents that are similar
> >>> in content. To elaborate a little on what we mean, let's take an
> >>> example.
> >>>
> >>> The email chain in which we are interacting can best be used to
> >>> illustrate the concept of near dupes (we are not confusing them with
> >>> threads; they are two different things). Each email in this thread is
> >>> treated as a document by the system. A reply to the original mail
> >>> also includes the original mail, in which case it becomes a near
> >>> duplicate of the original mail (depending on the percentage of
> >>> similarity), and so on down the chain. Near dupes need not be limited
> >>> to emails.
> >>
> >> I think this is what's known as "shingling."  See
> >> http://en.wikipedia.org/wiki/W-shingling
> >> Lucene (and therefore Solr) does not implement shingling.  The
> >> "MoreLikeThis" query might be close enough, however.
> >>
> >> -Stuart
> >>
>
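
For readers following the shingling pointer above, here is a minimal sketch
of w-shingling with Jaccard resemblance in plain Python. It is not Nutch,
Lucene, or Solr code. The function names, the shingle width w=4, and the 0.5
threshold are all hypothetical choices made for illustration, and comparing
every pair of documents this way would not scale to a large index without
further work.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard resemblance of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are considered identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, w=4, threshold=0.5):
    """Hypothetical cutoff: the threshold value is an assumption to tune."""
    return resemblance(a, b, w) >= threshold


if __name__ == "__main__":
    original = "We want to find documents which are similar in content."
    reply = "Thanks. " + original
    print(resemblance(original, reply))
    print(is_near_duplicate(original, reply))
```

A reply that quotes the original mail shares most of its shingles with the
original, so its resemblance is high; where to draw the "percentage of
similarity" line is left to the caller.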
