Re: Removing duplicate documents from search results

François Schiettecatte Tue, 28 Jun 2011 06:37:33 -0700

Yeah, I read the overview, which suggests that duplicates can be prevented from 
entering the index, and scanned the rest. It does not look like you can actually 
drop the document entirely. Maybe I am missing something here.
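
For what it is worth, the discard-on-duplicate behaviour asked for further down the thread could be handled on the client side, before anything is sent to Solr. The following is only a sketch under assumptions: `seen` is a plain dict standing in for the Riak bucket, `index` is a list standing in for the Solr index, and `index_document` is a hypothetical placeholder. None of these are real Riak or Solr APIs.

```python
import hashlib
import uuid


def url_hash(url):
    """Return the SHA-1 hex digest of a URL (MD5 would work the same way)."""
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


def index_document(doc, index):
    """Hypothetical stand-in for posting a document to Solr."""
    index.append(doc)


def add_if_new(doc, seen, index):
    """Index doc only if its URL hash is not already in the store.

    seen  -- dict standing in for the key-value store (assumption, not Riak's API)
    index -- list standing in for the Solr index (assumption)
    Returns True if the document was indexed, False if it was discarded.
    """
    key = url_hash(doc["url"])
    if key in seen:
        return False  # duplicate: discard it, do not overwrite
    doc_id = str(uuid.uuid4())  # the UUID stays the uniqueKey
    seen[key] = doc_id
    index_document(dict(doc, id=doc_id, url_hash=key), index)
    return True


seen, index = {}, []
first = add_if_new({"url": "http://example.com/a", "title": "A"}, seen, index)
second = add_if_new({"url": "http://example.com/a", "title": "A again"}, seen, index)
print(first, second, len(index))
```

Note that the check and the write are two separate steps here, so two indexers running at the same time could still both add the same URL; a real setup would need to guard against that.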


François

On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote:

> Hey François,
> thanks for your suggestion. I followed the same link
> (http://wiki.apache.org/solr/Deduplication).
> 
> They offer two solutions: either make the hash the uniqueKey OR overwrite
> on duplicate. I don't need either.
> 
> I need discard on duplicate.
> 
>> 
>> 
>> I have not used it but it looks like it will do the trick.
>> 
>> François
>> 
>> On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:
>> 
>>> I found the deduplication thing really useful, although I have not yet
>>> started to work on it, as there is other low-hanging fruit I have to
>>> pick first. Will share my thoughts soon.
>>> 
>>> 
>>> *Pranav Prakash*
>>> 
>>> "temet nosce"
>>> 
>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>> Google <http://www.google.com/profiles/pranny>
>>> 
>>> 
>>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>> 
>>>> Maybe there is a way to get Solr to reject documents that already exist in
>>>> the index, but I doubt it; maybe someone else can chime in here. You
>>>> could do a search for each document prior to indexing it to see if it is
>>>> already in the index, but that is probably non-optimal. Maybe it is easiest
>>>> to check whether the document exists in your Riak repository: if not, add it
>>>> and index it, and drop it if it already exists.
>>>> 
>>>> François
>>>> 
>>>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>>>> 
>>>>> I am making the hash from the URL, but I can't use it as the uniqueKey
>>>>> because I am using a UUID as the uniqueKey.
>>>>> Since I am using Solr as the index engine only and Riak (key-value
>>>>> storage) as the storage engine, I don't want to overwrite on duplicate.
>>>>> I just need to discard the duplicates.
>>>>> 
>>>>> 
>>>>> 
>>>>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>>>> 
>>>>>> Create a hash from the URL and use that as the unique key; MD5 or SHA-1
>>>>>> would probably be good enough.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> François
>>>>>> 
>>>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>>>> 
>>>>>>> I also have the problem of duplicate docs.
>>>>>>> I am indexing news articles, and every news article has a source URL.
>>>>>>> If two news articles have the same URL, only one needs to be indexed,
>>>>>>> i.e. duplicates should be removed at index time.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Have you checked out the deduplication process that's available at
>>>>>>>> indexing time? This includes a fuzzy hash algorithm.
>>>>>>>> 
>>>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>>>> 
>>>>>>>> -Simon
>>>>>>>> 
>>>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>>>>> This approach would definitely work if the two documents are *exactly*
>>>>>>>>> the same. But this is very fragile. Even if one extra space has been
>>>>>>>>> added, the whole hash would change. What I am really looking for is
>>>>>>>>> some percentage similarity between documents, so that I can remove
>>>>>>>>> those documents which are more than 95% similar.
>>>>>>>>> 
>>>>>>>>> *Pranav Prakash*
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>>>> 
>>>>>>>>>> What you need to do is calculate some hash (using any message digest
>>>>>>>>>> algorithm you want: MD5, SHA-1 and so on), then do some reading on
>>>>>>>>>> Solr's field collapsing capabilities. It should not be too complicated.
>>>>>>>>>> 
>>>>>>>>>> *Omri Cohen*
>>>>>>>>>> 
>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> How can I remove very similar documents from search results?
>>>>>>>>>> 
>>>>>>>>>> My scenario is that there are documents in the index which are almost
>>>>>>>>>> identical (people submitting the same stuff multiple times, sometimes
>>>>>>>>>> different people submitting the same stuff). Now when a search is
>>>>>>>>>> performed for "keyword", the same document quite frequently comes up
>>>>>>>>>> multiple times in the top N results. I want to remove those duplicate
>>>>>>>>>> (or possibly duplicate) documents, very similar to what Google does
>>>>>>>>>> when they say "In order to show you most relevant result, duplicates
>>>>>>>>>> have been removed". How can I achieve this functionality using Solr?
>>>>>>>>>> Does Solr have something built in, or a plugin, which could help me
>>>>>>>>>> with it?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> *Pranav Prakash*
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Thanks and Regards
>>>>>>> Mohammad Shariq
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks and Regards
>>>>> Mohammad Shariq
>>>> 
>>>> 
>> 
>> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq
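
On the earlier point in the thread that an exact hash changes as soon as one extra space is added: one common way to get a percentage similarity is to compare sets of word shingles using the Jaccard index. This is a sketch only, and not a description of what Solr's deduplication component does internally. The shingle size of 3 and the function names are assumptions made for illustration; the 0.95 threshold is taken from the 95% figure mentioned above.

```python
import re


def shingles(text, size=3):
    """Return the set of word n-grams, ignoring case, punctuation and spacing."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}  # short texts become a single shingle
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity (0.0 to 1.0) of two texts over their shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.95):
    """True if the two texts are at least `threshold` similar."""
    return jaccard(a, b) >= threshold


doc_a = "Solr can remove duplicate documents at index time"
doc_b = "Solr   can remove duplicate documents at index time"  # extra spaces
doc_c = "Completely different text about something else entirely"

print(jaccard(doc_a, doc_b), jaccard(doc_a, doc_c))
```

Comparing every pair of documents this way gets expensive quickly, which is presumably why index-time fuzzy hashing, as Simon suggests, is the more practical route for a large index.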
