Hi Ken,

It's correct that uncommon words will most likely not show up in the signature. However, my point was that if two documents share 99% of their tokens and differ in just one token whose change crosses a quantised-frequency boundary, the two resulting hashes are completely different. For true near-dup detection, what you would like to have is two hashes that differ in only 1-2 bytes. That way, the signatures truly reflect the content of the documents they represent. The trade-off is that with this approach you need a bit more work to cluster near-dup documents. Basically, once you have a hash function as described above, finding similar documents comes down to a Hamming distance problem: two docs are near dups if their hashes differ in at most k positions (with k small, perhaps < 3).
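To make the idea concrete, here is a rough sketch of one such similarity-preserving hash (essentially Charikar's simhash; the 64-bit size, the helper names, and the use of MD5 as the per-token hash are my own illustrative choices, not something from this thread):

```python
import hashlib

def simhash(tokens, bits=64):
    """Similarity-preserving fingerprint: documents with nearly the
    same token multiset get hashes that differ in only a few bits."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position,
            # according to the corresponding bit of its own hash.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def near_dup(doc_a, doc_b, k=3):
    """Two docs are near dups if their fingerprints differ in <= k bits."""
    return hamming(simhash(doc_a.split()), simhash(doc_b.split())) <= k
```

With a scheme like this, two documents sharing 99% of their tokens land a small Hamming distance apart, whereas an MD5 of the full text would be completely different after a one-letter change.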
On Nov 22, 2007 2:35 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> >The duplication detection mechanism in Nutch is quite primitive. I
> >think it uses a MD5 signature generated from the content of a field.
> >The generation algorithm is described here:
> >http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.
> >
> >The problem with this approach is MD5 hash is very sensitive: one
> >letter difference will generate completely different hash.
>
> I'm confused by your answer, assuming it's based on the page
> referenced by the URL you provided.
>
> The approach by TextProfileSignature would only generate a different
> MD5 hash with a single letter change if that change resulted in a
> change in the quantized frequency for that word. And if it's an
> uncommon word, then it wouldn't even show up in the signature.
>
> -- Ken
>
> >You probably have to roll your own near duplication detection algorithm.
> >My advice is have a look at existing literature on near duplication
> >detection techniques and then implement one of them. I know Google has
> >some papers that describe a technique called minhash. I read the paper
> >and found it's very interesting. I'm not sure if you can implement the
> >algorithm because they have patented it. That said, there are plenty
> >literature on near dup detection so you should be able to get one for
> >free!
> >
> >On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote:
> >> Otis,
> >>
> >> Thanks for your response.
> >>
> >> I just gave a quick look to the Nutch Forum and find that there is an
> >> implementation to obtain de-duplicate documents/pages but none for Near
> >> Duplicates documents. Can you guide me a little further as to where
> >> exactly under Nutch I should be concentrating, regarding near duplicate
> >> documents?
> >> Regards,
> >> Rishabh
> >>
> >> On Nov 21, 2007 12:41 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> >>
> >> > To whomever started this thread: look at Nutch. I believe something
> >> > related to this already exists in Nutch for near-duplicate detection.
> >> >
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >> >
> >> > ----- Original Message ----
> >> > From: Mike Klaas <[EMAIL PROTECTED]>
> >> > To: solr-user@lucene.apache.org
> >> > Sent: Sunday, November 18, 2007 11:08:38 PM
> >> > Subject: Re: Near Duplicate Documents
> >> >
> >> > On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> >> >
> >> > > Is there any idea implementing that feature in the up coming
> >> > > releases?
> >> >
> >> > Not currently. Feel free to contribute something if you find a good
> >> > solution <g>.
> >> >
> >> > -Mike
> >> >
> >> >
> >> > > On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> >> > >
> >> > >> On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >> > >>> We have a scenario, where we want to find out documents which are
> >> > >>> similar in content. To elaborate a little more on what we mean
> >> > >>> here, lets take an example.
> >> > >>>
> >> > >>> The example of this email chain in which we are interacting on,
> >> > >>> can be best used for illustrating the concept of near dupes (We
> >> > >>> are not getting confused with threads, they are two different
> >> > >>> things.). Each email in this thread is treated as a document by
> >> > >>> the system. A reply to the original mail also includes the
> >> > >>> original mail in which case it becomes a near duplicate of the
> >> > >>> orginal mail (depending on the percentage of similarity).
> >> > >>> Similarly it goes on. The near dupes need not be limited to
> >> > >>> emails.
> >> > >>
> >> > >> I think this is what's known as "shingling." See
> >> > >> http://en.wikipedia.org/wiki/W-shingling
> >> > >> Lucene (and therefore Solr) does not implement shingling. The
> >> > >> "MoreLikeThis" query might be close enough, however.
> >> > >>
> >> > >> -Stuart
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

--
Regards,
Cuong Hoang