Have you considered removing them at index time? See:
http://wiki.apache.org/solr/Deduplication

Best
Erick
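
For anyone reading this in the archive: the idea behind index-time
deduplication can be sketched outside Solr in a few lines of Python. This is
only an illustration of hashing selected fields into a signature and keeping
the first document per signature. It is not Solr's implementation, and the
function names (signature, drop_duplicates), the field name "body", and the
choice of MD5 are assumptions made for the example.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields after lowercasing and collapsing whitespace."""
    parts = [" ".join(str(doc.get(f, "")).lower().split()) for f in fields]
    # Join with a separator that cannot appear in the normalized parts.
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def drop_duplicates(docs, fields):
    """Keep the first document seen for each signature, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        sig = signature(doc, fields)
        if sig not in seen:
            seen.add(sig)
            unique.append(doc)
    return unique


docs = [
    {"id": 1, "body": "Hello  World"},
    {"id": 2, "body": "hello world"},   # same text after normalization
    {"id": 3, "body": "something else"},
]
unique_ids = [d["id"] for d in drop_duplicates(docs, ["body"])]
```

An exact hash like this only catches documents that are identical after
normalization; near-duplicates need a fuzzier signature.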

On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> See http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> The obvious thought that I had just after hitting send was that you could
> put the LSH signatures on the documents.  That would let you do the scan at
> low volume and using LSH would make the duplicate scan almost as fast as
> your score scan idea.
>
> Whether Solr will do this for you is really neither here nor there.  Solr
> does an awful lot of stuff for an awful lot of people who find it very
> congenial.  They probably don't have lots of duplicate documents.  If you
> really think that this capability is core, then you can contribute an
> implementation to Solr and all will be made whole.  In the short-term, I
> would recommend you prototype independently.
>
> On Fri, Nov 25, 2011 at 4:47 AM, Fred Zimmerman <zimzaz....@gmail.com> wrote:
>
>> Thanks.  I did consider postprocessing and may wind up doing that; I was
>> hoping there was a way to have Solr do it for me!  That I have to ask this
>> question is probably not a good sign, but what is LSH clustering?
>>
>> On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>>
>> > You can do that pretty easily by just retrieving extra documents and
>> > post-processing the results list.
>> >
>> > You are likely to have a significant number of apparent duplicates this
>> > way.
>> >
>> > To really get rid of duplicates in results, it might be better to remove
>> > them from the corpus by deploying something like LSH clustering.
>> >
>> > On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman <zimzaz....@gmail.com>
>> > wrote:
>> >
>> > > I have a corpus that has a lot of identical or nearly identical
>> > > documents.  I'd like to return only the unique ones (excluding the
>> > > "nearly identical" which are redirects).  I notice that all the
>> > > identical/nearly identicals have identical Solr scores.  How can I
>> > > tell Solr to throw out all the successive documents in an answer set
>> > > that have identical scores?
>> > >
>> > > doc 1 score 5.0
>> > > doc 2 score 5.0
>> > > doc 3 score 5.0
>> > > doc 4 score 4.9
>> > >
>> > > skip docs 2 and 3
>> > >
>> > > bring back 10 docs with unique scores
>> > >
>> >
>>
>
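
The post-processing approach from the thread (retrieve extra documents, then
skip successive hits whose score equals the previous one) can be sketched as
below. This is a client-side illustration only: the function name
unique_by_score and the (doc_id, score) tuple layout are assumptions, and the
hits are assumed to arrive already sorted by descending score.

```python
def unique_by_score(hits, wanted=10):
    """Return up to `wanted` hits, skipping any hit whose score equals
    the score of the hit immediately before it in the list."""
    kept = []
    last_score = None
    for doc_id, score in hits:
        if last_score is not None and score == last_score:
            continue  # same score as the previous hit: treat as a duplicate
        kept.append((doc_id, score))
        last_score = score
        if len(kept) == wanted:
            break
    return kept


# The example from the original question.
hits = [("doc 1", 5.0), ("doc 2", 5.0), ("doc 3", 5.0), ("doc 4", 4.9)]
result = unique_by_score(hits)
```

Keep in mind that unrelated documents can tie on score, so this filter can
also drop legitimate results; it treats equal scores as a proxy for
duplication, nothing more.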

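The thread does not name a particular LSH scheme for the document
signatures. As one possibility, here is a minimal MinHash sketch over word
shingles. MinHash, the shingle size, the number of hash functions, and all
function names here are assumptions made for the illustration, not something
proposed in the thread.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word shingles from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text, num_hashes=64):
    """MinHash signature: for each salted hash function, the minimum
    hash value over all shingles of the text."""
    items = [s.encode("utf-8") for s in shingles(text)]
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode("utf-8")  # distinct salt per hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return sig


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
c = "solr returns ranked search results for every query a user submits"
sim_same = estimated_similarity(minhash(a), minhash(a))
sim_near = estimated_similarity(minhash(a), minhash(b))
sim_far = estimated_similarity(minhash(a), minhash(c))
```

Signatures like these could be stored alongside each document; the usual
next step is to split each signature into bands and bucket documents by
band, so that only documents sharing a bucket need to be compared.
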