Miles, I'm assuming that you want to detect documents that are "almost" exactly the same (since if they were identical, you could just do a straight string compare or md5 compare, etc).
If you're storing term vectors in your index, you could compare the term vectors for the search results, and if their similarity is less than some threshold, consider them duplicates. How you define "similarity" is up to you, but using the cosine similarity between each document is probably a good start. You don't want to compare each document in your result set to each other document (this would be n^2 comparisons!). But you know the documents are returned in order of relevance to your query (assuming you didn't sort on some other field). So you know that if two or more documents are going to be near-duplicates, then they should appear consecutively in the Hits. This saves you not only from the n^2 comparisons, but from comparing documents whose scores are so different that they can't possibly be duplicates. Thus, I think the following logic may perform fairly efficiently: final float DUPE_THRESHOLD = 0.05f; // you'll have to tune this parameter float previousScore = Float.MAX_VALUE; for (int i=0; i < hits.length(); ++i) { Document doc = hits.doc(i); // assumption: scores are non-increasing float delta = previousScore - hits.score(i); if (delta < DUPE_THRESHOLD) { // docs score is very close, now check for dupes // using cosine similarity or some other measure ... // probably don't set previous score here, since you // want to compare each doc to the first non-duplicate doc in the group } else { // scores are too far apart to be duplicates // so don't even compute cosine similarity previousScore = hits.score(i); } } I haven't tried this approach out yet. If anyone sees any big flaws in it, please let me know, and if you try it, I'd like to hear how it works out. -chris On Mon, 14 Mar 2005 17:19:24 +0000, Miles Barr <[EMAIL PROTECTED]> wrote: > Has anyone tried to remove similar documents from their search results? > It looks like Google does some on the fly filtering of the results, > hiding pages which is thinks are too similar, i.e. when you see: > > "In order to show you the most relevant results, we have omitted some > entries very similar to the 7 already displayed. > If you like, you can repeat the search with the omitted results > included." > > at the bottom of the page. > > Is there anything in Lucene or one of the contrib packages that compares > two documents? > > -- > Miles Barr <[EMAIL PROTECTED]> > Runtime Collective Ltd. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]