Re: Deduplication in 1.4

Martijn v Groningen Thu, 26 Nov 2009 00:20:08 -0800

Field collapsing is already used by many people in their production
environments. Over the last few months the patch has become more stable
as quite a few bugs were fixed. The only big feature still missing is
caching for the collapsing algorithm. I'm working on that now and
will post it in a new patch in the next few days. So yes, the
patch is very close to being production-ready.
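
To illustrate the behaviour being asked about (one result per
'duplicate_group_id' value, keeping the highest-ranked document), here
is a minimal client-side sketch in Python. It is not the SOLR-236
implementation and does not call any Solr API; the function name
collapse_by_field and the sample documents are hypothetical, and the
input is assumed to be already sorted by relevancy.

```python
def collapse_by_field(ranked_docs, field="duplicate_group_id"):
    """Keep only the first (highest-ranked) document per field value.

    ranked_docs: list of dicts, assumed to be sorted by relevancy already.
    Documents that lack the field are never collapsed (an assumption made
    for this sketch, not a statement about how the patch behaves).
    """
    seen = set()
    collapsed = []
    for doc in ranked_docs:
        key = doc.get(field)
        if key is None:
            # No group value: keep the document as its own result.
            collapsed.append(doc)
            continue
        if key in seen:
            # A higher-ranked document from this group was already kept.
            continue
        seen.add(key)
        collapsed.append(doc)
    return collapsed


if __name__ == "__main__":
    # Hypothetical search results, best match first.
    results = [
        {"id": "a", "duplicate_group_id": "g1"},
        {"id": "b", "duplicate_group_id": "g1"},
        {"id": "c", "duplicate_group_id": "g2"},
        {"id": "d"},
    ]
    for doc in collapse_by_field(results):
        print(doc["id"])
```

Doing this on the client only works on the page of results already
fetched, which is why collapsing inside the search engine is the more
useful place for it.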


Martijn

2009/11/26 KaktuChakarabati <jimmoe...@gmail.com>:
>
> Hey Otis,
> Yep, I realized this myself after playing with the dedupe feature
> yesterday.
> So it does look like field collapsing is pretty much what I need.
> Any idea how close it is to being production-ready?
>
> Thanks,
> -Chak
>
> Otis Gospodnetic wrote:
>>
>> Hi,
>>
>> As far as I know, the point of deduplication in Solr (
>> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
>> document before indexing it in order to avoid duplicates in the index in
>> the first place.
>>
>> What you are describing is closer to the field collapsing patch in SOLR-236.
>>
>>  Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> ----- Original Message ----
>>> From: KaktuChakarabati <jimmoe...@gmail.com>
>>> To: solr-user@lucene.apache.org
>>> Sent: Tue, November 24, 2009 5:29:00 PM
>>> Subject: Deduplication in 1.4
>>>
>>>
>>> Hey,
>>> I've been trying to find some documentation on using this feature in 1.4,
>>> but the wiki page is a little sparse.
>>> Specifically, here's what I'm trying to do:
>>>
>>> I have a field, say 'duplicate_group_id', that I'll populate based on an
>>> offline document deduplication process I have.
>>>
>>> All I want is for Solr to compute a 'duplicate_signature' field based on
>>> this one at update time, so that when I search for documents later, all
>>> documents with the same original 'duplicate_group_id' value will be rolled up
>>> (e.g. I'll just get the first one that came back according to relevancy).
>>>
>>> I enabled the deduplication processor and put it into the updater, but I'm
>>> not seeing any difference in the returned results (i.e. results with the same
>>> duplicate_id are still returned separately).
>>>
>>> Is there anything I need to supply at query time for this to take effect?
>>> What should the behaviour be? Is there a working example of this?
>>>
>>> Any pointers would be helpful.
>>>
>>> Thanks,
>>> Chak
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
