Miles,

I'm assuming that you want to detect documents that are "almost"
exactly the same (since if they were identical, you could just do a
straight string compare or md5 compare, etc).

If you're storing term vectors in your index, you could compare the
term vectors for the search results, and if their similarity is greater
than some threshold, consider them duplicates.  How you define
"similarity" is up to you, but using the cosine similarity between
each pair of documents is probably a good start.
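
To make the cosine measure concrete, here is a minimal sketch.  It
assumes you have already copied each document's terms and frequencies
out of its term vector into a Map; the class and method names are just
ones I made up for illustration, not anything in Lucene.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class CosineSimilarity {
    private CosineSimilarity() {}

    /**
     * Cosine of the angle between two sparse term-frequency vectors.
     * Returns a value in [0, 1] for non-negative frequencies; 1 means
     * the vectors point in the same direction, 0 means no shared terms.
     */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Iterate over the smaller map so the dot product costs O(min(|a|, |b|)).
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;

        double dot = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }

        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;  // an empty vector is treated as similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }
}
```

With this measure, two documents would count as near-duplicates when
cosine(a, b) is above a cutoff you choose (something close to 1.0);
the exact value is something you'd have to tune.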

You don't want to compare each document in your result set to each
other document (this would be on the order of n^2 comparisons!).  But
you know the documents are returned in order of relevance to your
query (assuming you didn't sort on some other field), and
near-duplicate documents should get nearly identical scores.  So if
two or more documents are near-duplicates, they should appear
consecutively, or at least close together, in the Hits.  This saves
you not only from the n^2 comparisons, but from comparing documents
whose scores are so different that they can't possibly be duplicates.

Thus, I think the following logic may perform fairly efficiently:

final float DUPE_THRESHOLD = 0.05f;  // you'll have to tune this parameter
float previousScore = Float.MAX_VALUE;

for (int i=0; i < hits.length(); ++i) {
  Document doc = hits.doc(i);
  // assumption: scores are non-increasing
  float delta = previousScore - hits.score(i);
  if (delta < DUPE_THRESHOLD) {
    // scores are very close, now check for dupes
    // using cosine similarity or some other measure
    ...
    // probably don't set previous score here, since you
    // want to compare each doc to the first non-duplicate doc in the group
  } else {
    // scores are too far apart to be duplicates
    // so don't even compute cosine similarity
    previousScore = hits.score(i);
  }
}
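
For anyone who wants to experiment without an index, here is a
self-contained variation on the loop above.  It is only a sketch under
some assumptions: scores arrive as a plain array in non-increasing
order, and the duplicate check is passed in as a callback.  Note that
it differs slightly from the loop above: each document is compared
against every already-kept document whose score is within the
threshold, not just the first one in the group.  ScoreWindowDedup and
DuplicateTest are made-up names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class ScoreWindowDedup {
    private ScoreWindowDedup() {}

    /** Caller-supplied check, e.g. cosine similarity above some cutoff. */
    interface DuplicateTest {
        boolean isDuplicate(int keptIndex, int candidateIndex);
    }

    /**
     * Returns the indices of the results to keep, in their original order.
     * Assumes scores are sorted in non-increasing order.
     */
    static List<Integer> keep(float[] scores, float scoreThreshold, DuplicateTest test) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            boolean dupe = false;
            // Walk backwards through the kept docs while their scores are
            // still within the threshold of the current doc's score.
            for (int k = kept.size() - 1; k >= 0; k--) {
                int j = kept.get(k);
                if (scores[j] - scores[i] >= scoreThreshold) {
                    break;  // scores too far apart; earlier docs are even further
                }
                if (test.isDuplicate(j, i)) {
                    dupe = true;
                    break;
                }
            }
            if (!dupe) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

The expensive similarity check only runs for pairs inside the score
window, so the cost depends on how many results share nearly the same
score rather than on the size of the whole result set.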


I haven't tried this approach out yet.  If anyone sees any big flaws
in it, please let me know, and if you try it, I'd like to hear how it
works out.

-chris


On Mon, 14 Mar 2005 17:19:24 +0000, Miles Barr
<[EMAIL PROTECTED]> wrote:
> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which it thinks are too similar, i.e. when you see:
> 
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
> 
> at the bottom of the page.
> 
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?
> 
> --
> Miles Barr <[EMAIL PROTECTED]>
> Runtime Collective Ltd.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>
