Hey François, thanks for your suggestion. I followed the same link
(http://wiki.apache.org/solr/Deduplication); they have two solutions:
either make the hash the uniqueKey, or overwrite on duplicate. I don't
need either. I need "discard on duplicate".

> I have not used it but it looks like it will do the trick.
>
> François
>
> On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:
>
>> I found the deduplication thing really useful, although I have not yet
>> started to work on it, as there are some other low-hanging fruits I
>> have to capture. Will share my thoughts soon.
>>
>> *Pranav Prakash*
>>
>> "temet nosce"
>>
>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>> Google <http://www.google.com/profiles/pranny>
>>
>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>
>>> Maybe there is a way to get Solr to reject documents that already
>>> exist in the index, but I doubt it; maybe someone else can chime in
>>> here. You could do a search for each document prior to indexing it to
>>> see if it is already in the index, but that is probably non-optimal.
>>> Maybe it is easiest to check whether the document exists in your Riak
>>> repository: if not, add it and index it, and drop it if it already
>>> exists.
>>>
>>> François
>>>
>>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>>>
>>>> I am making the hash from the URL, but I can't use this as the
>>>> uniqueKey because I am using a UUID as the uniqueKey. Since I am
>>>> using Solr as the index engine only, and Riak (key-value storage) as
>>>> the storage engine, I don't want to overwrite on duplicate. I just
>>>> need to discard the duplicates.
>>>>
>>>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>>>
>>>>> Create a hash from the URL and use that as the unique key; md5 or
>>>>> sha1 would probably be good enough.
>>>>>
>>>>> Cheers
>>>>>
>>>>> François
>>>>>
>>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>>>
>>>>>> I also have the problem of duplicate docs.
>>>>>> I am indexing news articles, and every news article has a source
>>>>>> URL. If two news articles have the same URL, only one needs to be
>>>>>> indexed: removal of duplicates at index time.
>>>>>>
>>>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you checked out the deduplication process that's available
>>>>>>> at indexing time? It includes a fuzzy hash algorithm.
>>>>>>>
>>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>>>
>>>>>>> -Simon
>>>>>>>
>>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This approach would definitely work if the two documents are
>>>>>>>> *exactly* the same, but it is very fragile. Even if one extra
>>>>>>>> space has been added, the whole hash changes. What I am really
>>>>>>>> looking for is some percentage similarity between documents, so
>>>>>>>> I can remove those documents which are more than 95% similar.
>>>>>>>>
>>>>>>>> *Pranav Prakash*
>>>>>>>>
>>>>>>>> "temet nosce"
>>>>>>>>
>>>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>>>
>>>>>>>>> What you need to do is calculate some hash (using any message
>>>>>>>>> digest algorithm you want: md5, sha-1 and so on), then do some
>>>>>>>>> reading on Solr's field collapse capabilities. Should not be
>>>>>>>>> too complicated.
>>>>>>>>>
>>>>>>>>> *Omri Cohen*
>>>>>>>>>
>>>>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
>>>>>>>>>
>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>
>>>>>>>>> How can I remove very similar documents from search results?
>>>>>>>>>
>>>>>>>>> My scenario is that there are documents in the index which are
>>>>>>>>> almost similar (people submitting the same stuff multiple
>>>>>>>>> times, sometimes different people submitting the same stuff).
>>>>>>>>> Now when a search is performed for "keyword", the same
>>>>>>>>> document quite frequently comes up multiple times in the top N
>>>>>>>>> results. I want to remove those duplicate (or possible
>>>>>>>>> duplicate) documents, very similar to what Google does when it
>>>>>>>>> says "In order to show you the most relevant results,
>>>>>>>>> duplicates have been removed". How can I achieve this
>>>>>>>>> functionality using Solr? Does Solr have a built-in feature or
>>>>>>>>> plugin which could help me with it?
>>>>>>>>>
>>>>>>>>> *Pranav Prakash*
>>>>>>>>>
>>>>>>>>> "temet nosce"

--
Thanks and Regards
Mohammad Shariq
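[Editor's note] The Deduplication wiki page the thread keeps pointing at is driven by an update-processor chain in solrconfig.xml. A minimal sketch, following that page; the field names "sig" and "url" are illustrative assumptions, and note that overwriteDupes=true makes the *newest* copy win (overwrite), which is exactly why Mohammad's "keep the first, discard the new one" behavior is not available out of the box:

```xml
<!-- solrconfig.xml: dedupe chain sketch, per the Solr Deduplication wiki.
     "sig" must be an indexed field in schema.xml; "fields" lists the
     source fields hashed into the signature (here, the article URL). -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">url</str>
    <!-- TextProfileSignature instead gives the fuzzy, near-duplicate
         hashing simon mentions -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Alternatively, with overwriteDupes=false the duplicates stay in the index and can be collapsed at query time on the "sig" field, which is the field-collapse route Omri suggests.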
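[Editor's note] François's hash-the-URL suggestion can be sketched in a few lines; assuming Python and sha1 (md5 would work the same way):

```python
import hashlib

def url_signature(url: str) -> str:
    """Derive a stable document key from the source URL.

    A light normalization (strip surrounding whitespace) keeps trivially
    different submissions of the same URL from hashing differently; a
    fuller version might also lowercase the scheme and host.
    """
    normalized = url.strip()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# The same URL always yields the same 40-char hex digest, so two news
# articles with an identical source URL map to the same key.
print(url_signature("http://example.com/story") ==
      url_signature("http://example.com/story"))  # prints True
```

Using this digest as the uniqueKey gives overwrite-on-duplicate for free; Mohammad's setup cannot do that because the uniqueKey is already a UUID.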
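[Editor's note] The "discard on duplicate" behavior Mohammad wants has to live on the client side, as François outlines: check for the document before indexing it. A minimal sketch, where the `seen_keys` set stands in for an existence check against Riak (or a Solr query on a signature field) and the `index` list stands in for the actual Solr add call; both names are illustrative:

```python
import hashlib

def index_if_new(doc, seen_keys, index):
    """Index a document only if its URL signature is unseen; otherwise
    discard it and keep the original. Returns True if indexed."""
    key = hashlib.sha1(doc["url"].strip().encode("utf-8")).hexdigest()
    if key in seen_keys:
        return False              # duplicate: discard the new submission
    seen_keys.add(key)
    index.append(doc)             # stand-in for the real Solr add
    return True

seen, solr_index = set(), []
index_if_new({"url": "http://example.com/a", "body": "first"}, seen, solr_index)
index_if_new({"url": "http://example.com/a", "body": "second"}, seen, solr_index)
print(len(solr_index))  # prints 1: the second submission was discarded
```

This keeps the *oldest* copy, the opposite of what overwriteDupes=true does inside Solr.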
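[Editor's note] Pranav's "more than 95% similar" requirement is about near-duplicates, not exact hashes. One standard approach is word-shingle Jaccard similarity, a rough stand-in for the kind of fuzzy matching Solr's TextProfileSignature aims at; a sketch, with the 0.95 threshold and k=4 shingle size as illustrative choices:

```python
def jaccard_similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity over k-word shingles of two texts.
    Near-identical texts score close to 1.0 even when whitespace
    or a couple of words differ, unlike an exact md5/sha1 hash."""
    def shingles(text):
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "solr deduplication removes near duplicate documents at index time"
doc2 = "solr deduplication removes near duplicate documents at  index time"
# The extra space is invisible to the shingle set, exactly the case
# that breaks a whole-document hash.
print(jaccard_similarity(doc1, doc2))  # prints 1.0
```

Documents scoring above the chosen threshold (e.g. 0.95) would be dropped at index time or collapsed at query time.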