Indeed, take a look at this: http://wiki.apache.org/solr/Deduplication
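The page linked above works by adding a SignatureUpdateProcessorFactory to the update chain in solrconfig.xml. A minimal sketch (the chain name, the `signature` field, and the `url` source field are placeholders to adapt; the chain must also be attached to the update handler via the `update.chain`/`update.processor` parameter, depending on Solr version):

```xml
<!-- solrconfig.xml: compute a signature from the url field at index time -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that stores the computed signature; must be declared in schema.xml -->
    <str name="signatureField">signature</str>
    <!-- true = a new doc with the same signature replaces the old one -->
    <bool name="overwriteDupes">true</bool>
    <!-- source field(s) the signature is computed from -->
    <str name="fields">url</str>
    <!-- exact signature; solr.processor.TextProfileSignature gives fuzzy matching -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```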
I have not used it but it looks like it will do the trick.

François

On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:

> I found the deduplication feature really useful, although I have not yet
> started to work on it, as there are some other low-hanging fruits I have to
> capture first. Will share my thoughts soon.
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
>
> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>
>> Maybe there is a way to get Solr to reject documents that already exist in
>> the index, but I doubt it; maybe someone else can chime in here. You
>> could do a search for each document prior to indexing it to see if it is
>> already in the index, but that is probably non-optimal. It is probably
>> easiest to check whether the document exists in your Riak repository:
>> if not, add it and index it; drop it if it already exists.
>>
>> François
>>
>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>>
>>> I am making the hash from the URL, but I can't use this as the uniqueKey
>>> because I am using a UUID as the uniqueKey.
>>> Since I am using Solr as the index engine only, with Riak (key-value
>>> storage) as the storage engine, I don't want to overwrite on duplicates;
>>> I just need to discard the duplicates.
>>>
>>> 2011/6/28 François Schiettecatte <fschietteca...@gmail.com>
>>>
>>>> Create a hash from the URL and use that as the unique key; md5 or sha1
>>>> would probably be good enough.
>>>>
>>>> Cheers
>>>>
>>>> François
>>>>
>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>>
>>>>> I also have the problem of duplicate docs.
>>>>> I am indexing news articles, and every news article has a source URL.
>>>>> If two news articles have the same URL, only one needs to be indexed:
>>>>> removal of duplicates at index time.
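The hash-the-URL suggestion above is a few lines in any language. A Python sketch using the standard-library hashlib (the normalization step is a hypothetical example; real URL canonicalization needs more care):

```python
import hashlib

def url_signature(url: str) -> str:
    """Return a stable hex digest of a lightly normalized URL, usable as a dedup key."""
    # Naive normalization: trim whitespace, lowercase, drop a trailing slash.
    normalized = url.strip().lower().rstrip("/")
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Two submissions of the same article URL map to the same 40-char signature,
# so the second one can be detected and discarded before indexing.
print(url_signature("http://example.com/news/article-1"))
print(url_signature("HTTP://example.com/news/article-1/"))  # same key after normalization
```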
>>>>>
>>>>> On 23 June 2011 21:24, simon <mtnes...@gmail.com> wrote:
>>>>>
>>>>>> Have you checked out the deduplication process that's available at
>>>>>> indexing time? It includes a fuzzy hash algorithm.
>>>>>>
>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>>
>>>>>> -Simon
>>>>>>
>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pra...@gmail.com> wrote:
>>>>>>> This approach would definitely work if the two documents are *exactly*
>>>>>>> the same, but it is very fragile. Even if one extra space has been
>>>>>>> added, the whole hash would change. What I am really looking for is some
>>>>>>> percentage of similarity between documents, so I can remove those
>>>>>>> documents which are more than 95% similar.
>>>>>>>
>>>>>>> *Pranav Prakash*
>>>>>>>
>>>>>>> "temet nosce"
>>>>>>>
>>>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>>>>>> Google <http://www.google.com/profiles/pranny>
>>>>>>>
>>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <o...@yotpo.com> wrote:
>>>>>>>
>>>>>>>> What you need to do is calculate some hash (using any message digest
>>>>>>>> algorithm you want: md5, sha-1, and so on), then do some reading on
>>>>>>>> Solr's field collapsing capabilities. It should not be too complicated.
>>>>>>>>
>>>>>>>> *Omri Cohen*
>>>>>>>>
>>>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295
>>>>>>>>
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: Pranav Prakash <pra...@gmail.com>
>>>>>>>> Date: Thu, Jun 23, 2011 at 12:26 PM
>>>>>>>> Subject: Removing duplicate documents from search results
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>
>>>>>>>> How can I remove very similar documents from search results?
>>>>>>>>
>>>>>>>> My scenario is that there are documents in the index which are almost
>>>>>>>> identical (people submitting the same stuff multiple times, sometimes
>>>>>>>> different people submitting the same stuff). Now when a search is
>>>>>>>> performed for "keyword", in the top N results, quite frequently, the
>>>>>>>> same document comes up multiple times. I want to remove those duplicate
>>>>>>>> (or possible duplicate) documents, very similar to what Google does
>>>>>>>> when it says "In order to show you the most relevant results,
>>>>>>>> duplicates have been removed". How can I achieve this functionality
>>>>>>>> using Solr? Does Solr have a built-in feature or a plugin which could
>>>>>>>> help me with it?
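The field-collapsing pointer above maps onto Solr's result grouping parameters: group on the stored signature field and keep one document per group, which hides duplicates at query time instead of rejecting them at index time. A sketch that just builds the query URL (the host, core, and `signature` field name are assumptions):

```python
from urllib.parse import urlencode

def collapse_query(base_url: str, q: str, signature_field: str = "signature") -> str:
    """Build a Solr result-grouping (field collapsing) query URL that
    returns at most one document per signature value."""
    params = {
        "q": q,
        "group": "true",                 # enable result grouping
        "group.field": signature_field,  # collapse on the dedup signature
        "group.limit": 1,                # one document per group
        "group.main": "true",            # flatten groups into a plain result list
    }
    return base_url + "/select?" + urlencode(params)

print(collapse_query("http://localhost:8983/solr", "keyword"))
```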
>>>>>>>>
>>>>>>>> *Pranav Prakash*
>>>>>>>>
>>>>>>>> "temet nosce"
>>>>>>>>
>>>>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>>>>>>> Google <http://www.google.com/profiles/pranny>
>>>>>
>>>>> --
>>>>> Thanks and Regards
>>>>> Mohammad Shariq
>>>
>>> --
>>> Thanks and Regards
>>> Mohammad Shariq
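On the "more than 95% similar" requirement discussed above: Solr's fuzzy TextProfileSignature works from a quantized term-frequency profile of the text, so small edits still yield the same signature. As a rough outside-of-Solr illustration of near-duplicate scoring (not Solr's actual algorithm), a word-shingle Jaccard similarity makes the idea concrete:

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles from a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two texts' shingle sets (1.0 = identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

doc1 = "solr makes it easy to remove duplicate documents at index time"
doc2 = "solr makes it easy to remove duplicate documents at indexing time"
print(similarity(doc1, doc2))  # high but below 1.0: a near-duplicate
```

A whole-document hash flips on a single changed character, while a score like this degrades gradually, which is what lets a "similarity above some threshold" policy work.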