I also have a problem with duplicate documents. I am indexing news articles, and every article has a source URL. If two articles have the same URL, only one of them should be indexed, so I need the duplicates removed at index time.
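
A minimal sketch of the behaviour I am after, done on the client side before the documents are posted to Solr. The `url` and `signature` field names and both helper functions are made up for illustration; none of this is Solr API, and the URL normalization rules are only an assumption about what should count as "the same URL".

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_signature(url: str) -> str:
    """Return an MD5 hex digest of a lightly normalized URL (hypothetical helper)."""
    parts = urlsplit(url.strip())
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",  # treat /news/1 and /news/1/ as equal
        parts.query,
        "",  # drop the fragment, it does not identify a different article
    ))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_duplicate_urls(articles):
    """Keep only the first article seen for each URL signature."""
    seen = set()
    unique = []
    for article in articles:
        sig = url_signature(article["url"])
        if sig in seen:
            continue  # same source URL already queued for indexing
        seen.add(sig)
        unique.append({**article, "signature": sig})
    return unique
```

If the URL (or a signature like this one) is used as the schema's uniqueKey, Solr replaces the existing document when another one with the same key is added, which may already be enough for the exact-URL case.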

--
Thanks and Regards
Mohammad Shariq

On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:

> Have you checked out the deduplication process that's available at
> indexing time? This includes a fuzzy hash algorithm.
>
> http://wiki.apache.org/solr/Deduplication
>
> -Simon
>
> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>
> > This approach would definitely work if the two documents are *exactly*
> > the same. But this is very fragile. Even if one extra space has been
> > added, the whole hash would change. What I am really looking for is
> > some percentage similarity between documents, so that I can remove
> > those documents which are more than 95% similar.
> >
> > *Pranav Prakash*
> >
> > On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
> >
> >> What you need to do is calculate some hash (using any message digest
> >> algorithm you want: md5, sha-1 and so on), then do some reading on
> >> Solr's field collapsing capabilities. It should not be too complicated.
> >>
> >> *Omri Cohen*
> >>
> >> ---------- Forwarded message ----------
> >> From: Pranav Prakash <pra...@gmail.com>
> >> Date: Thu, Jun 23, 2011 at 12:26 PM
> >> Subject: Removing duplicate documents from search results
> >> To: solr-user@lucene.apache.org
> >>
> >> How can I remove very similar documents from search results?
> >>
> >> My scenario is that there are documents in the index which are almost
> >> identical (people submitting the same stuff multiple times, sometimes
> >> different people submitting the same stuff). Now when a search is
> >> performed for "keyword", the same document quite frequently comes up
> >> multiple times in the top N results. I want to remove those duplicate
> >> (or possibly duplicate) documents, very similar to what Google does
> >> when they say "In order to show you most relevant result, duplicates
> >> have been removed". How can I achieve this functionality using Solr?
> >> Does Solr have a built-in feature or a plugin which could help me
> >> with it?
> >>
> >> *Pranav Prakash*