Re: Removing duplicate documents from search results

Paul Libbrecht Tue, 28 Jun 2011 06:35:36 -0700

Mohammad,

Just in case you meant it: I would discourage you from trying to deduplicate 
*the search results*.
Many things go wrong if you do that; we had it in one version of the ActiveMath 
search environment (which uses Lucene):
- paging no longer works properly
- the total count is wrong unless you go through all the results
- performance can get really bad if you do go through all the results
- performance does get bad for some searches if you try to fill the page 
(you need to keep fetching until you have found enough distinct results)
- you have to go through the search results again and again when delivering 
the next page of results
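
To illustrate the points above, here is a minimal sketch (not the ActiveMath 
code) of what deduplicating at result time amounts to. The function name 
dedup_page, the "url" field and the 0-based page number are assumptions made 
for the example only.

```python
def dedup_page(ranked_hits, page, page_size, key=lambda hit: hit["url"]):
    """Return (page_hits, hits_scanned) for a 0-based page of deduplicated hits.

    ranked_hits is any iterable of hits in ranking order. Every call starts
    scanning from the first hit again, which is the cost described above.
    """
    seen = set()
    unique = []
    scanned = 0
    wanted = (page + 1) * page_size
    for hit in ranked_hits:
        scanned += 1
        k = key(hit)
        if k in seen:
            continue  # duplicate: skipped, so the page is still not full
        seen.add(k)
        unique.append(hit)
        if len(unique) == wanted:
            break  # enough distinct hits to fill the requested page
    return unique[page * page_size:wanted], scanned
```

The number of raw hits scanned grows with the page number and with the share 
of duplicates, and the true total of distinct hits stays unknown until the 
whole result list has been read.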


So, as others have suggested, please be sure to deduplicate somehow at indexing 
time.
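
As a rough sketch of what "deduplicate at indexing time" could look like on 
the client side, assuming each article is a dict with a "url" field: hash the 
URL, drop any article whose hash has already been seen, and keep a fresh UUID 
as the document id. The names url_signature, discard_duplicates and url_hash 
are made up for the example, and this is not Solr's own deduplication 
mechanism.

```python
import hashlib
import uuid


def url_signature(url):
    """Return the hex SHA-1 digest of the URL, ignoring surrounding whitespace."""
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


def discard_duplicates(articles, seen_signatures=None):
    """Yield only articles whose URL hash has not been seen before.

    Duplicates are dropped rather than overwritten. Each surviving article
    gets a new UUID as "id" (unless it already carries one) and its URL hash
    in "url_hash".
    """
    seen = set() if seen_signatures is None else seen_signatures
    for article in articles:
        sig = url_signature(article["url"])
        if sig in seen:
            continue  # same URL already sent to the index: discard
        seen.add(sig)
        yield {"id": str(uuid.uuid4()), "url_hash": sig, **article}
```

In a real setup the set of seen hashes would have to survive restarts, for 
example by looking the hash up in the index or in the key-value store before 
adding the document.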

paul

On 28 June 2011 at 14:24, Mohammad Shariq wrote:

> I am making the hash from the URL, but I can't use it as the uniqueKey
> because I am already using a UUID as the uniqueKey.
> Since I am using Solr only as the index engine and Riak (a key-value
> store) as the storage engine, I don't want to overwrite on duplicates.
> I just need to discard the duplicates.
> 
> 
> 
> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
> 
>> Create a hash from the url and use that as the unique key, md5 or sha1
>> would probably be good enough.
>> 
>> Cheers
>> 
>> François
>> 
>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>> 
>>> I also have the problem of duplicate docs.
>>> I am indexing news articles. Every news article has a source URL;
>>> if two news articles have the same URL, only one needs to be indexed,
>>> i.e. duplicates should be removed at index time.
>>> 
>>> 
>>> 
>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>> 
>>>> Have you checked out the deduplication process that's available at
>>>> indexing time? This includes a fuzzy hash algorithm.
>>>> 
>>>> http://wiki.apache.org/solr/Deduplication
>>>> 
>>>> -Simon
>>>> 
>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>> This approach would definitely work if the two documents are *exactly*
>>>>> the same. But this is very fragile: if even one extra space has been
>>>>> added, the whole hash changes. What I am really looking for is some
>>>>> percentage similarity between documents, so that I can remove those
>>>>> documents which are more than 95% similar.
>>>>> 
>>>>> *Pranav Prakash*
>>>>> 
>>>>> "temet nosce"
>>>>> 
>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>>>> Google <http://www.google.com/profiles/pranny>
>>>>> 
>>>>> 
>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>> 
>>>>>> What you need to do is calculate a hash (using any message digest
>>>>>> algorithm you want: MD5, SHA-1 and so on), then do some reading on
>>>>>> Solr's field collapsing capabilities. It should not be too complicated.
>>>>>> 
>>>>>> *Omri Cohen*
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
>>>>>> 
>>>>>> ---------- Forwarded message ----------
>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>> Subject: Removing duplicate documents from search results
>>>>>> To: solr-user@lucene.apache.org
>>>>>> 
>>>>>> 
>>>>>> How can I remove very similar documents from search results?
>>>>>> 
>>>>>> My scenario is that there are documents in the index which are almost
>>>>>> identical (people submitting the same stuff multiple times, sometimes
>>>>>> different people submitting the same stuff). Now when a search is
>>>>>> performed for "keyword", the same document quite frequently comes up
>>>>>> multiple times in the top N results. I want to remove those duplicate
>>>>>> (or possibly duplicate) documents, very much like what Google does when
>>>>>> they say "In order to show you most relevant result, duplicates have
>>>>>> been removed". How can I achieve this functionality using Solr? Does
>>>>>> Solr have a built-in feature or a plugin which could help me with it?
>>>>>> 
>>>>>> 
>>>>>> *Pranav Prakash*
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Thanks and Regards
>>> Mohammad Shariq
>> 
>> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq
