Re: Removing duplicate documents from search results

Mohammad Shariq Tue, 28 Jun 2011 04:29:41 -0700

I have the same problem with duplicate documents.
I am indexing news articles, and every article has a source URL.
If two articles have the same URL, only one of them should be indexed,
so I need to remove duplicates at index time.
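
To make the requirement concrete, here is a rough sketch of URL-based
de-duplication done on the client before documents are sent for indexing.
It is only an illustration under my own assumptions: the function names
(url_signature, dedupe_by_url), the "url" and "id" keys, and the light URL
normalization are hypothetical and not part of Solr. If the resulting
signature were used as the uniqueKey field, a later document with the same
URL should replace the earlier one rather than add a second copy.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_signature(url: str) -> str:
    """Return a stable hex digest for a lightly normalized URL (hypothetical helper)."""
    parts = urlsplit(url.strip())
    # Lower-case scheme and host, trim a trailing slash, and drop the fragment.
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/"),
        parts.query,
        "",
    ))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def dedupe_by_url(articles):
    """Keep the first article seen for each URL signature."""
    seen = set()
    unique = []
    for article in articles:
        sig = url_signature(article["url"])
        if sig in seen:
            continue  # an article with this source URL is already queued
        seen.add(sig)
        # Assumption: the signature doubles as the document id.
        unique.append({**article, "id": sig})
    return unique
```

This only catches articles whose URLs match after normalization. It does
nothing for the same story published under two different URLs, which is the
near-duplicate case discussed in the quoted messages below.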




On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:

> Have you checked out the deduplication process that's available at
> indexing time? It includes a fuzzy hash algorithm.
>
> http://wiki.apache.org/solr/Deduplication
>
> -Simon
>
> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
> > This approach would definitely work if the two documents are *exactly* the
> > same, but it is very fragile: even one extra space changes the whole
> > hash. What I am really looking for is a percentage similarity between
> > documents, so that I can remove those that are more than 95% similar.
> >
> > *Pranav Prakash*
> >
> > "temet nosce"
> >
> > Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> > Google <http://www.google.com/profiles/pranny>
> >
> >
> > On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
> >
> >> What you need to do is calculate a hash of each document (using any message
> >> digest algorithm you like: MD5, SHA-1 and so on), then read up on Solr's
> >> field collapsing capabilities. It should not be too complicated.
> >>
> >> *Omri Cohen*
> >>
> >>
> >> ---------- Forwarded message ----------
> >> From: Pranav Prakash <pra...@gmail.com>
> >> Date: Thu, Jun 23, 2011 at 12:26 PM
> >> Subject: Removing duplicate documents from search results
> >> To: solr-user@lucene.apache.org
> >>
> >>
> >> How can I remove very similar documents from search results?
> >>
> >> My scenario is that the index contains documents that are almost
> >> identical (people submitting the same content multiple times, or
> >> different people submitting the same content). When a search is performed
> >> for "keyword", the same document quite frequently comes up multiple times
> >> in the top N results. I want to remove those duplicate (or possibly
> >> duplicate) documents, much as Google does when it says "In order to show
> >> you the most relevant results, duplicates have been removed". How can I
> >> achieve this with Solr? Does Solr have a built-in feature or a plugin
> >> that could help?
> >>
> >>
> >> *Pranav Prakash*
> >>
> >
>



-- 
Thanks and Regards
Mohammad Shariq
