Hey François, thanks for your suggestion. I followed the same link
(http://wiki.apache.org/solr/Deduplication); they have two solutions:
either make the hash the uniqueKey, or overwrite on duplicate. I don't
need either. I need "discard on duplicate".

> I have not used it but it looks like it will do the trick.
>
> François
>
> On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:
>
>> I found the deduplication thing really useful, although I have not yet
>> started to work on it, as there are some other low-hanging fruits I
>> have to capture. Will share my thoughts soon.
>>
>> *Pranav Prakash*
>>
>> "temet nosce"
>>
>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>> Google <http://www.google.com/profiles/pranny>
>>
>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>
>>> Maybe there is a way to get Solr to reject documents that already
>>> exist in the index, but I doubt it; maybe someone else can chime in
>>> here. You could do a search for each document prior to indexing it to
>>> see if it is already in the index, but that is probably non-optimal.
>>> Maybe it is easiest to check whether the document exists in your Riak
>>> repository: if not, add it and index it, and drop it if it already
>>> exists.
>>>
>>> François
>>>
>>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>>>
>>>> I am making the hash from the URL, but I can't use this as the
>>>> uniqueKey because I am using a UUID as the uniqueKey. Since I am
>>>> using Solr as the index engine only, and Riak (key-value storage) as
>>>> the storage engine, I don't want to overwrite on duplicate. I just
>>>> need to discard the duplicates.
>>>>
>>>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>>>
>>>>> Create a hash from the URL and use that as the unique key; md5 or
>>>>> sha1 would probably be good enough.
>>>>>
>>>>> Cheers
>>>>>
>>>>> François
>>>>>
>>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>>>
>>>>>> I also have the problem of duplicate docs.
>>>>>> I am indexing news articles, and every news article has a source
>>>>>> URL. If two news articles have the same URL, only one needs to be
>>>>>> indexed: removal of duplicates at index time.
>>>>>>
>>>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you checked out the deduplication process that's available
>>>>>>> at indexing time? It includes a fuzzy hash algorithm.
>>>>>>>
>>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>>>
>>>>>>> -Simon
>>>>>>>
>>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This approach would definitely work if the two documents are
>>>>>>>> *exactly* the same, but it is very fragile. Even if one extra
>>>>>>>> space has been added, the whole hash changes. What I am really
>>>>>>>> looking for is some percentage similarity between documents, so
>>>>>>>> I can remove those documents which are more than 95% similar.
>>>>>>>>
>>>>>>>> *Pranav Prakash*
>>>>>>>>
>>>>>>>> "temet nosce"
>>>>>>>>
>>>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>>>
>>>>>>>>> What you need to do is calculate some hash (using any message
>>>>>>>>> digest algorithm you want: md5, sha-1 and so on), then do some
>>>>>>>>> reading on Solr's field collapse capabilities. Should not be
>>>>>>>>> too complicated.
>>>>>>>>>
>>>>>>>>> *Omri Cohen*
>>>>>>>>>
>>>>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
>>>>>>>>>
>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>
>>>>>>>>> How can I remove very similar documents from search results?
>>>>>>>>>
>>>>>>>>> My scenario is that there are documents in the index which are
>>>>>>>>> almost similar (people submitting the same stuff multiple
>>>>>>>>> times, sometimes different people submitting the same stuff).
>>>>>>>>> Now when a search is performed for "keyword", the same
>>>>>>>>> document quite frequently comes up multiple times in the top N
>>>>>>>>> results. I want to remove those duplicate (or possible
>>>>>>>>> duplicate) documents, very similar to what Google does when it
>>>>>>>>> says "In order to show you the most relevant results,
>>>>>>>>> duplicates have been removed". How can I achieve this
>>>>>>>>> functionality using Solr? Does Solr have a built-in feature or
>>>>>>>>> plugin which could help me with it?
>>>>>>>>>
>>>>>>>>> *Pranav Prakash*
>>>>>>>>>
>>>>>>>>> "temet nosce"

--
Thanks and Regards
Mohammad Shariq
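[Editor's note] The Deduplication wiki page the thread keeps pointing at is driven by an update-processor chain in solrconfig.xml. A minimal sketch, following that page; the field names "sig" and "url" are illustrative assumptions, and note that overwriteDupes=true makes the *newest* copy win (overwrite), which is exactly why Mohammad's "keep the first, discard the new one" behavior is not available out of the box:

```xml
<!-- solrconfig.xml: dedupe chain sketch, per the Solr Deduplication wiki.
     "sig" must be an indexed field in schema.xml; "fields" lists the
     source fields hashed into the signature (here, the article URL). -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">url</str>
    <!-- TextProfileSignature instead gives the fuzzy, near-duplicate
         hashing simon mentions -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Alternatively, with overwriteDupes=false the duplicates stay in the index and can be collapsed at query time on the "sig" field, which is the field-collapse route Omri suggests.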
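[Editor's note] François's hash-the-URL suggestion can be sketched in a few lines; assuming Python and sha1 (md5 would work the same way):

```python
import hashlib

def url_signature(url: str) -> str:
    """Derive a stable document key from the source URL.

    A light normalization (strip surrounding whitespace) keeps trivially
    different submissions of the same URL from hashing differently; a
    fuller version might also lowercase the scheme and host.
    """
    normalized = url.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# The same URL always yields the same 40-char hex digest, so two news
# articles with an identical source URL map to the same key.
print(url_signature("http://example.com/story") ==
      url_signature("http://example.com/story"))  # prints True
```

Using this digest as the uniqueKey gives overwrite-on-duplicate for free; Mohammad's setup cannot do that because the uniqueKey is already a UUID.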
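[Editor's note] The "discard on duplicate" behavior Mohammad wants has to live on the client side, as François outlines: check for the document before indexing it. A minimal sketch, where the `seen_keys` set stands in for an existence check against Riak (or a Solr query on a signature field) and the `index` list stands in for the actual Solr add call; both names are illustrative:

```python
import hashlib

def index_if_new(doc, seen_keys, index):
    """Index a document only if its URL signature is unseen; otherwise
    discard it and keep the original. Returns True if indexed."""
    key = hashlib.sha1(doc["url"].strip().encode("utf-8")).hexdigest()
    if key in seen_keys:
        return False              # duplicate: discard the new submission
    seen_keys.add(key)
    index.append(doc)             # stand-in for the real Solr add
    return True

seen, solr_index = set(), []
index_if_new({"url": "http://example.com/a", "body": "first"}, seen, solr_index)
index_if_new({"url": "http://example.com/a", "body": "second"}, seen, solr_index)
print(len(solr_index))  # prints 1: the second submission was discarded
```

This keeps the *oldest* copy, the opposite of what overwriteDupes=true does inside Solr.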
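[Editor's note] Pranav's "more than 95% similar" requirement is about near-duplicates, not exact hashes. One standard approach is word-shingle Jaccard similarity, a rough stand-in for the kind of fuzzy matching Solr's TextProfileSignature aims at; a sketch, with the 0.95 threshold and k=4 shingle size as illustrative choices:

```python
def jaccard_similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity over k-word shingles of two texts.
    Near-identical texts score close to 1.0 even when whitespace
    or a couple of words differ, unlike an exact md5/sha1 hash."""
    def shingles(text):
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "solr deduplication removes near duplicate documents at index time"
doc2 = "solr deduplication removes near duplicate documents at  index time"
# The extra space is invisible to the shingle set, exactly the case
# that breaks a whole-document hash.
print(jaccard_similarity(doc1, doc2))  # prints 1.0
```

Documents scoring above the chosen threshold (e.g. 0.95) would be dropped at index time or collapsed at query time.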