I am already generating a hash from the URL, but I can't use it as the uniqueKey because I use a UUID as the uniqueKey. I use Solr only as the index engine and Riak (a key-value store) as the storage engine, so I don't want a duplicate to overwrite the existing document. I just need to discard the duplicates.
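A minimal sketch of the "discard, don't overwrite" behaviour described above, in Python. It assumes the URL hash is kept as a separate field next to the UUID key, and that a lookup on that hash happens before a document is sent to the index. The names `url_hash`, `DiscardingIndexer`, and the `url_hash` field are hypothetical, and the in-memory set only stands in for whatever persistent lookup is actually available. It is not a Solr or Riak API.

```python
import hashlib
import uuid


def url_hash(url: str) -> str:
    """Return the SHA-1 hex digest of a URL with surrounding whitespace removed."""
    normalized = url.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DiscardingIndexer:
    """Keeps the first document seen for each URL and drops later ones.

    Assumption: the in-memory set is a stand-in for a lookup against a
    persistent store of already-indexed URL hashes.
    """

    def __init__(self):
        self.seen_hashes = set()
        self.indexed = []  # stand-in for the documents sent to the index

    def add(self, doc: dict) -> bool:
        """Index doc unless its URL was seen before; return True if indexed."""
        h = url_hash(doc["url"])
        if h in self.seen_hashes:
            return False  # duplicate: discard it, leave the first copy untouched
        self.seen_hashes.add(h)
        # The UUID remains the unique key; the hash is only an extra field.
        self.indexed.append({**doc, "id": str(uuid.uuid4()), "url_hash": h})
        return True
```

With this shape the first article for a URL is kept and any later article with the same URL is dropped before it reaches the index, so nothing already stored under its UUID is overwritten.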

2011/6/28 François Schiettecatte <fschietteca...@gmail.com>

> Create a hash from the url and use that as the unique key, md5 or sha1
> would probably be good enough.
>
> Cheers
>
> François
>
> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>
> > I also have the problem of duplicate docs.
> > I am indexing news articles. Every news article has a source URL.
> > If two news articles have the same URL, only one needs to be indexed,
> > i.e. duplicates should be removed at index time.
> >
> > On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
> >
> >> Have you checked out the deduplication process that's available at
> >> indexing time? This includes a fuzzy hash algorithm.
> >>
> >> http://wiki.apache.org/solr/Deduplication
> >>
> >> -Simon
> >>
> >> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
> >>> This approach would definitely work if the two documents are *exactly*
> >>> the same. But this is very fragile. Even if one extra space has been
> >>> added, the whole hash would change. What I am really looking for is some
> >>> percentage similarity between documents, and to remove those documents
> >>> which are more than 95% similar.
> >>>
> >>> *Pranav Prakash*
> >>>
> >>> "temet nosce"
> >>>
> >>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> >>> Google <http://www.google.com/profiles/pranny>
> >>>
> >>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
> >>>
> >>>> What you need to do is calculate some hash (using any message digest
> >>>> algorithm you want, md5, sha-1 and so on), then do some reading on Solr
> >>>> field collapse capabilities. Should not be too complicated.
> >>>>
> >>>> *Omri Cohen*
> >>>> Co-founder @ yotpo.com
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Pranav Prakash <pra...@gmail.com>
> >>>> Date: Thu, Jun 23, 2011 at 12:26 PM
> >>>> Subject: Removing duplicate documents from search results
> >>>> To: solr-user@lucene.apache.org
> >>>>
> >>>> How can I remove very similar documents from search results?
> >>>>
> >>>> My scenario is that there are documents in the index which are almost
> >>>> similar (people submitting same stuff multiple times, sometimes different
> >>>> people submitting same stuff). Now when a search is performed for
> >>>> "keyword", in the top N results, quite frequently, same document comes up
> >>>> multiple times. I want to remove those duplicate (or possible duplicate)
> >>>> documents.
> >>>> Very similar to what Google does when they say "In order to show you the
> >>>> most relevant results, duplicates have been removed". How can I achieve
> >>>> this functionality using Solr? Does Solr have a built-in feature or a
> >>>> plugin which could help me with it?
> >>>>
> >>>> *Pranav Prakash*
> >
> > --
> > Thanks and Regards
> > Mohammad Shariq

--
Thanks and Regards
Mohammad Shariq