See http://wiki.apache.org/solr/Deduplication. That is the Solr implementation, but it should be fairly easy to pull the relevant code out if you are using just Lucene.
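
To illustrate the general idea independently of Solr, here is a minimal sketch in plain Python (it does not use the Lucene or Solr API). It hashes the chosen fields of each document and flags any document whose hash has already been seen, recording which earlier document it duplicates, instead of dropping it. The names `signature`, `mark_duplicates`, `is_duplicate` and `duplicate_of` are made up for this example, and SHA-256 is just one possible choice of hash.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields of a document into a hex digest."""
    h = hashlib.sha256()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        # A separator keeps ("ab", "c") distinct from ("a", "bc").
        h.update(b"\x00")
    return h.hexdigest()


def mark_duplicates(docs, fields):
    """Return copies of docs, flagged as duplicates rather than removed."""
    first_seen = {}  # signature -> id of the first doc carrying it
    marked = []
    for doc in docs:
        sig = signature(doc, fields)
        out = dict(doc, signature=sig)
        if sig in first_seen:
            out["is_duplicate"] = True
            out["duplicate_of"] = first_seen[sig]
        else:
            first_seen[sig] = doc["id"]
            out["is_duplicate"] = False
            out["duplicate_of"] = None
        marked.append(out)
    return marked


# The four documents from the question.
docs = [
    {"id": 1, "text": "This is my test"},
    {"id": 2, "text": "This is my test"},
    {"id": 3, "text": "Another test"},
    {"id": 4, "text": "This is my test"},
]
marked = mark_duplicates(docs, ["text"])
for d in marked:
    print(d["id"], d["is_duplicate"], d["duplicate_of"])
```

With the four documents from the question, docs 1 and 3 come out unmarked and docs 2 and 4 are marked as duplicates of doc 1. In Lucene the same pattern would presumably mean indexing the signature as an untokenized field and looking it up before adding each document; that part is not shown here.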

On Mar 5, 2011, at 1:49 AM, Mark wrote:

> Is there a way one could detect duplicates (say, by using some unique hash of
> certain fields) and mark a document as a duplicate without removing it?
>
> Here is an example:
>
> Doc 1) This is my test
> Doc 2) This is my test
> Doc 3) Another test
> Doc 4) This is my test
>
> Docs 1 and 3 should be considered unique, whereas 2 and 4 should be marked as
> duplicates (of doc 1).
>
> Can this be easily accomplished?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search