Have you considered removing them at index time? See: http://wiki.apache.org/solr/Deduplication
Best
Erick

On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> See http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> The obvious thought that I had just after hitting send was that you could
> put the LSH signatures on the documents. That would let you do the scan at
> low volume and using LSH would make the duplicate scan almost as fast as
> your score scan idea.
>
> Whether Solr will do this for you is really neither here nor there. Solr
> does an awful lot of stuff for an awful lot of people who find it very
> congenial. They probably don't have lots of duplicate documents. If you
> really think that this capability is core, then you can contribute an
> implementation to Solr and all will be made whole. In the short-term, I
> would recommend you prototype independently.
>
> On Fri, Nov 25, 2011 at 4:47 AM, Fred Zimmerman <zimzaz....@gmail.com> wrote:
>
>> thanks. i did consider postprocessing and may wind up doing that, i was
>> hoping there was a way to have Solr do it for me! that I have to ask this
>> question is probably not a good sign, but what is LSH clustering?
>>
>> On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>
>> > You can do that pretty easily by just retrieving extra documents and post
>> > processing the results list.
>> >
>> > You are likely to have a significant number of apparent duplicates this
>> > way.
>> >
>> > To really get rid of duplicates in results, it might be better to remove
>> > them from the corpus by deploying something like LSH clustering.
>> >
>> > On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman <zimzaz....@gmail.com>
>> > wrote:
>> >
>> > > I have a corpus that has a lot of identical or nearly identical
>> > > documents. I'd like to return only the unique ones (excluding the
>> > > "nearly identical" which are redirects). I notice that all the
>> > > identical/nearly identicals have identical Solr scores. How can I tell
>> > > Solr to throw out all the successive documents in an answer set that
>> > > have identical scores?
>> > >
>> > > doc 1 score 5.0
>> > > doc 2 score 5.0
>> > > doc 3 score 5.0
>> > > doc 4 score 4.9
>> > >
>> > > skip docs 2 and 3
>> > >
>> > > bring back 10 docs with unique scores
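
For reference, the post-processing approach suggested in the thread (retrieve extra documents, then filter the results list on the client) might look roughly like the sketch below. It is only an illustration under stated assumptions: the hits are assumed to have already been fetched from Solr with the score included (for example `fl=*,score` and a `rows` value larger than 10) and parsed into a list of dicts sorted by descending score. The function name `collapse_equal_scores`, the `wanted` and `tol` parameters, and the dict layout are hypothetical, not part of any Solr API.

```python
from math import isclose


def collapse_equal_scores(results, wanted=10, tol=1e-6):
    """Keep only the first document of each run of consecutive equal scores.

    `results` is assumed to be a list of dicts with a "score" key, already
    sorted by descending score (hypothetical layout, not a Solr structure).
    """
    kept = []
    previous_score = None
    for doc in results:
        score = doc["score"]
        # Scores are floats, so compare with a small tolerance.
        if previous_score is not None and isclose(score, previous_score, abs_tol=tol):
            continue  # same score as the last kept hit: treat as a duplicate
        kept.append(doc)
        previous_score = score
        if len(kept) == wanted:
            break  # stop once enough distinct-score documents are collected
    return kept


# The example from the original question.
hits = [
    {"id": "doc 1", "score": 5.0},
    {"id": "doc 2", "score": 5.0},
    {"id": "doc 3", "score": 5.0},
    {"id": "doc 4", "score": 4.9},
]
unique_hits = collapse_equal_scores(hits)
print([doc["id"] for doc in unique_hits])  # ['doc 1', 'doc 4']
```

Note that an equal score does not prove two documents are duplicates, so this filter can also drop distinct documents that happen to score the same. If fewer than `wanted` documents survive, the caller would need to fetch more rows and repeat.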