Re: Removing duplicate documents from search results

Mohammad Shariq Tue, 28 Jun 2011 05:25:23 -0700

I am already generating a hash from the URL, but I can't use it as the
uniqueKey because I am using a UUID as the uniqueKey.
Since I am using Solr only as the index engine and Riak (a key-value
store) as the storage engine, I don't want to overwrite on duplicates.
I just need to discard the duplicates.
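
A minimal sketch of the discard-on-duplicate idea, assuming the URL hash is
kept in a separate field next to the UUID key. The field name `url_hash`, the
`DuplicateFilter` class, and the sample documents are hypothetical, and the
in-memory set only stands in for a lookup against the index before adding:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_signature(url: str) -> str:
    """Return a SHA-1 hex digest of a lightly normalized URL."""
    parts = urlsplit(url.strip())
    # Lower-case scheme and host, drop the fragment; keep path and query as-is.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Discards documents whose URL signature has been seen before.

    The set is an assumption for illustration; a real setup would check
    a stored signature field in the index instead.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, doc: dict) -> bool:
        sig = url_signature(doc["url"])
        if sig in self._seen:
            return False  # duplicate: discard, never overwrite
        self._seen.add(sig)
        doc["url_hash"] = sig  # hypothetical field; the UUID stays the key
        return True


docs = [
    {"id": "uuid-1", "url": "http://Example.com/news/1"},
    {"id": "uuid-2", "url": "http://example.com/news/1#comments"},
    {"id": "uuid-3", "url": "http://example.com/news/2"},
]
dedup = DuplicateFilter()
kept = [d["id"] for d in docs if dedup.accept(d)]
print(kept)
```

Here the second document is dropped because its URL normalizes to the same
string as the first, and the first document is left untouched.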




2011/6/28 François Schiettecatte <fschietteca...@gmail.com>

> Create a hash from the url and use that as the unique key, md5 or sha1
> would probably be good enough.
>
> Cheers
>
> François
>
> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>
> > I also have the problem of duplicate docs.
> > I am indexing news articles. Every news article has a source URL.
> > If two news articles have the same URL, only one needs to be indexed,
> > so duplicates should be removed at index time.
> >
> >
> >
> > On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
> >
> >> have you checked out the deduplication process that's available at
> >> indexing time ? This includes a fuzzy hash algorithm .
> >>
> >> http://wiki.apache.org/solr/Deduplication
> >>
> >> -Simon
> >>
> >> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
> >>> This approach would definitely work if the two documents are *Exactly*
> >>> the same. But this is very fragile. Even if one extra space has been
> >>> added, the whole hash would change. What I am really looking for is some
> >>> percentage similarity between documents, and a way to remove those
> >>> documents which are more than 95% similar.
> >>>
> >>> *Pranav Prakash*
> >>>
> >>> "temet nosce"
> >>>
> >>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> >>> Google <http://www.google.com/profiles/pranny>
> >>>
> >>>
> >>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
> >>>
> >>>> What you need to do is calculate a hash (using any message digest
> >>>> algorithm you want: MD5, SHA-1, and so on), then do some reading on
> >>>> Solr's field collapsing capabilities. It should not be too complicated.
> >>>>
> >>>> *Omri Cohen*
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Pranav Prakash <pra...@gmail.com>
> >>>> Date: Thu, Jun 23, 2011 at 12:26 PM
> >>>> Subject: Removing duplicate documents from search results
> >>>> To: solr-user@lucene.apache.org
> >>>>
> >>>>
> >>>> How can I remove very similar documents from search results?
> >>>>
> >>>> My scenario is that there are documents in the index which are almost
> >>>> identical (people submitting the same stuff multiple times, sometimes
> >>>> different people submitting the same stuff). Now when a search is
> >>>> performed for "keyword", the same document quite frequently comes up
> >>>> multiple times in the top N results. I want to remove those duplicate
> >>>> (or possible duplicate) documents, very similar to what Google does when
> >>>> they say "In order to show you most relevant result, duplicates have
> >>>> been removed". How can I achieve this functionality using Solr? Does
> >>>> Solr have a built-in feature or a plugin which could help me with it?
> >>>>
> >>>>
> >>>> *Pranav Prakash*
> >>>>
> >>>
> >>
> >
> >
> >
> > --
> > Thanks and Regards
> > Mohammad Shariq
>
>
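
For the near-duplicate case raised earlier in the thread (documents more than
95% similar, where one extra space should not matter), here is a rough sketch
of a similarity check. It is not the fuzzy hash from the Solr Deduplication
wiki page; it uses word-shingle Jaccard similarity as a stand-in, and the
function names, shingle size, and sample texts are assumptions:

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles; whitespace and case are ignored."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str,
                      threshold: float = 0.95, k: int = 3) -> bool:
    """True when the shingle sets overlap by at least `threshold`."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


original = "Solr removes duplicate documents at index time"
extra_space = "Solr  removes duplicate documents at index time"
unrelated = "Riak stores the full article body as a value"

print(is_near_duplicate(original, extra_space))
print(is_near_duplicate(original, unrelated))
```

Comparing every pair of documents this way does not scale to a large index,
so in practice the comparison would have to be limited to candidate pairs.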


-- 
Thanks and Regards
Mohammad Shariq
