Re: Deduplication in 1.4

Otis Gospodnetic Thu, 26 Nov 2009 03:10:14 -0800

Hi Martijn,

 
----- Original Message ----


> From: Martijn v Groningen <martijn.is.h...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, November 26, 2009 3:19:40 AM
> Subject: Re: Deduplication in 1.4
> 
> Field collapsing has been used by many in their production
> environment. 

Got any pointers to public sites you know use it?  I know of a high-traffic
site that used an early version, and it caused performance problems.  Is
double-tripping still required?

> Over the last few months the stability of the patch has improved, as
> quite a few bugs were fixed. The only big feature currently missing is
> caching of the collapsing algorithm. I'm currently working on that and

Is it also fully distributed-search-ready?

> I will put it in a new patch in the coming days.  So yes, the
> patch is very near being production ready.

Thanks,
Otis

> Martijn
> 
> 2009/11/26 KaktuChakarabati :
> >
> > Hey Otis,
> > Yep, I realized this myself after playing with the dedupe feature a bit
> > yesterday.
> > So it does look like field collapsing is pretty much what I need.
> > Any idea how close it is to being production-ready?
> >
> > Thanks,
> > -Chak
> >
> > Otis Gospodnetic wrote:
> >>
> >> Hi,
> >>
> >> As far as I know, the point of deduplication in Solr (
> >> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
> >> document before indexing it in order to avoid duplicates in the index in
> >> the first place.
> >>
> >> What you are describing is closer to the field collapsing patch in SOLR-236.
> >>
> >>  Otis
> >> --
> >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >>
> >>
> >>
> >> ----- Original Message ----
> >>> From: KaktuChakarabati 
> >>> To: solr-user@lucene.apache.org
> >>> Sent: Tue, November 24, 2009 5:29:00 PM
> >>> Subject: Deduplication in 1.4
> >>>
> >>>
> >>> Hey,
> >>> I've been trying to find some documentation on using this feature in 1.4,
> >>> but
> >>> the wiki page is a little sparse.
> >>> Specifically, here's what I'm trying to do:
> >>>
> >>> I have a field, say 'duplicate_group_id', that I'll populate based on an
> >>> offline document deduplication process I have.
> >>>
> >>> All I want is for Solr to compute a 'duplicate_signature' field based on
> >>> this one at update time, so that when I search for documents later, all
> >>> documents with the same original 'duplicate_group_id' value will be rolled up
> >>> (e.g. I'll just get the first one that came back according to relevancy).
> >>>
> >>> I enabled the deduplication processor and put it into the updater, but I'm
> >>> not
> >>> seeing any difference in returned results (i.e. results with the same
> >>> duplicate_id are returned separately).
> >>>
> >>> Is there anything I need to supply at query time for this to take effect?
> >>> What should the behaviour be? Is there any working example of this?
> >>>
> >>> Anything will be helpful.
> >>>
> >>> Thanks,
> >>> Chak
> >>> --
> >>> View this message in context:
> >>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
> >>> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >>
> >
> >
> >
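
To make the distinction in this thread concrete, here is a rough Python sketch of the two behaviours being contrasted: dropping duplicates before they reach the index (what the deduplication processor is for) versus rolling up duplicates in an already ranked result list (what field collapsing is for). This is an illustration only, not Solr code and not the SOLR-236 patch. The function names, the MD5 choice, and the sample documents are all assumptions made up for the example; only the 'duplicate_group_id' field name comes from the original question.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen field values into one hex digest (hypothetical helper)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def dedupe_at_index_time(docs, fields):
    """Keep one document per signature; a later duplicate overwrites an earlier one."""
    index = {}
    for doc in docs:
        index[signature(doc, fields)] = doc
    return list(index.values())


def collapse_at_query_time(ranked_docs, field):
    """Keep only the first (highest-ranked) hit for each value of `field`."""
    seen = set()
    collapsed = []
    for doc in ranked_docs:
        key = doc.get(field)
        # Documents without the field are never collapsed.
        if key is None or key not in seen:
            collapsed.append(doc)
        if key is not None:
            seen.add(key)
    return collapsed


# Made-up sample data, listed in relevancy order.
docs = [
    {"id": "a", "duplicate_group_id": "g1"},
    {"id": "b", "duplicate_group_id": "g2"},
    {"id": "c", "duplicate_group_id": "g1"},
]

indexed = dedupe_at_index_time(docs, ["duplicate_group_id"])
collapsed = collapse_at_query_time(docs, "duplicate_group_id")
```

In the first case the duplicate never exists in the index, so nothing changes at query time. In the second case all three documents stay indexed and only the returned list is rolled up, which is the behaviour asked about in the original question.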
