Indeed, take a look at this: http://wiki.apache.org/solr/Deduplication
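The page linked above works by adding a SignatureUpdateProcessorFactory to the update chain in solrconfig.xml. A minimal sketch (the chain name, the `signature` field, and the `url` source field are placeholders to adapt; the chain must also be attached to the update handler via the `update.chain`/`update.processor` parameter, depending on Solr version):

```xml
<!-- solrconfig.xml: compute a signature from the url field at index time -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that stores the computed signature; must be declared in schema.xml -->
    <str name="signatureField">signature</str>
    <!-- true = a new doc with the same signature replaces the old one -->
    <bool name="overwriteDupes">true</bool>
    <!-- source field(s) the signature is computed from -->
    <str name="fields">url</str>
    <!-- exact signature; solr.processor.TextProfileSignature gives fuzzy matching -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```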
I have not used it but it looks like it will do the trick.

François

On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:

> I found the deduplication feature really useful, although I have not yet
> started to work on it, as there are some other low-hanging fruits I have to
> capture first. Will share my thoughts soon.
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
>
> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>
>> Maybe there is a way to get Solr to reject documents that already exist in
>> the index, but I doubt it; maybe someone else can chime in here. You
>> could do a search for each document prior to indexing it to see if it is
>> already in the index, but that is probably non-optimal. It is probably
>> easiest to check whether the document exists in your Riak repository:
>> if not, add it and index it; drop it if it already exists.
>>
>> François
>>
>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>>
>>> I am making the hash from the URL, but I can't use this as the uniqueKey
>>> because I am using a UUID as the uniqueKey.
>>> Since I am using Solr as the index engine only, with Riak (key-value
>>> storage) as the storage engine, I don't want to overwrite on duplicates;
>>> I just need to discard the duplicates.
>>>
>>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>>
>>>> Create a hash from the URL and use that as the unique key; md5 or sha1
>>>> would probably be good enough.
>>>>
>>>> Cheers
>>>>
>>>> François
>>>>
>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>>
>>>>> I also have the problem of duplicate docs.
>>>>> I am indexing news articles, and every news article has a source URL.
>>>>> If two news articles have the same URL, only one needs to be indexed:
>>>>> removal of duplicates at index time.
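The hash-the-URL suggestion above is a few lines in any language. A Python sketch using the standard-library hashlib (the normalization step is a hypothetical example; real URL canonicalization needs more care):

```python
import hashlib

def url_signature(url: str) -> str:
    """Return a stable hex digest of a lightly normalized URL, usable as a dedup key."""
    # Naive normalization: trim whitespace, lowercase, drop a trailing slash.
    normalized = url.strip().lower().rstrip("/")
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Two submissions of the same article URL map to the same 40-char signature,
# so the second one can be detected and discarded before indexing.
print(url_signature("http://example.com/news/article-1"))
print(url_signature("HTTP://example.com/news/article-1/"))  # same key after normalization
```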
>>>>>
>>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>>
>>>>>> Have you checked out the deduplication process that's available at
>>>>>> indexing time? It includes a fuzzy hash algorithm.
>>>>>>
>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>>
>>>>>> -Simon
>>>>>>
>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>>> This approach would definitely work if the two documents are *exactly*
>>>>>>> the same, but it is very fragile. Even if one extra space has been
>>>>>>> added, the whole hash would change. What I am really looking for is some
>>>>>>> percentage of similarity between documents, so I can remove those
>>>>>>> documents which are more than 95% similar.
>>>>>>>
>>>>>>> *Pranav Prakash*
>>>>>>>
>>>>>>> "temet nosce"
>>>>>>>
>>>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>>>>>> Google <http://www.google.com/profiles/pranny>
>>>>>>>
>>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>>
>>>>>>>> What you need to do is calculate some hash (using any message digest
>>>>>>>> algorithm you want: md5, sha-1, and so on), then do some reading on
>>>>>>>> Solr's field collapsing capabilities. It should not be too complicated.
>>>>>>>>
>>>>>>>> *Omri Cohen*
>>>>>>>>
>>>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
>>>>>>>>
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>
>>>>>>>> How can I remove very similar documents from search results?
>>>>>>>>
>>>>>>>> My scenario is that there are documents in the index which are almost
>>>>>>>> identical (people submitting the same stuff multiple times, sometimes
>>>>>>>> different people submitting the same stuff). Now when a search is
>>>>>>>> performed for "keyword", in the top N results, quite frequently, the
>>>>>>>> same document comes up multiple times. I want to remove those duplicate
>>>>>>>> (or possible duplicate) documents, very similar to what Google does
>>>>>>>> when it says "In order to show you the most relevant results,
>>>>>>>> duplicates have been removed". How can I achieve this functionality
>>>>>>>> using Solr? Does Solr have a built-in feature or a plugin which could
>>>>>>>> help me with it?
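The field-collapsing pointer above maps onto Solr's result grouping parameters: group on the stored signature field and keep one document per group, which hides duplicates at query time instead of rejecting them at index time. A sketch that just builds the query URL (the host, core, and `signature` field name are assumptions):

```python
from urllib.parse import urlencode

def collapse_query(base_url: str, q: str, signature_field: str = "signature") -> str:
    """Build a Solr result-grouping (field collapsing) query URL that
    returns at most one document per signature value."""
    params = {
        "q": q,
        "group": "true",                 # enable result grouping
        "group.field": signature_field,  # collapse on the dedup signature
        "group.limit": 1,                # one document per group
        "group.main": "true",            # flatten groups into a plain result list
    }
    return base_url + "/select?" + urlencode(params)

print(collapse_query("http://localhost:8983/solr", "keyword"))
```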
>>>>>>>>
>>>>>>>> *Pranav Prakash*
>>>>>>>>
>>>>>>>> "temet nosce"
>>>>>>>>
>>>>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>>>>>>> Google <http://www.google.com/profiles/pranny>
>>>>>
>>>>> --
>>>>> Thanks and Regards
>>>>> Mohammad Shariq
>>>
>>> --
>>> Thanks and Regards
>>> Mohammad Shariq
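On the "more than 95% similar" requirement discussed above: Solr's fuzzy TextProfileSignature works from a quantized term-frequency profile of the text, so small edits still yield the same signature. As a rough outside-of-Solr illustration of near-duplicate scoring (not Solr's actual algorithm), a word-shingle Jaccard similarity makes the idea concrete:

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles from a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two texts' shingle sets (1.0 = identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "solr makes it easy to remove duplicate documents at index time"
doc2 = "solr makes it easy to remove duplicate documents at indexing time"
print(similarity(doc1, doc2))  # high but below 1.0: a near-duplicate
```

A whole-document hash flips on a single changed character, while a score like this degrades gradually, which is what lets a "similarity above some threshold" policy work.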