Pass Analyzed Field to SignatureUpdateProcessorFactory

2017-01-25 Thread Leonidas Zagkaretos
Hi all,

We have successfully integrated Solr in our application, and now we are
facing a requirement where the application should be able to search for
duplicate records in a Solr core based on equality in 3 distinct fields.

We tried SignatureUpdateProcessorFactory with Lookup3Signature, as described
in https://cwiki.apache.org/confluence/display/solr/De-Duplication, and
everything seems to work fine: the signature field is being filled with
unique hash values.

One issue we have is that we need to pass to
SignatureUpdateProcessorFactory the stemmed value of 1 of the 3 fields.
Currently, the following documents produce different hash values, and we
need them to produce the same one.
Analysis of field1 for the values "value1_a" and "value1_b" produces the
stemmed value "value1" in both cases.

documentA: {
field1: value1_a,
field2: value2,
field3: value3,
signature: hash_value1
}

documentB: {
field1: value1_b,
field2: value2,
field3: value3,
signature: hash_value2
}

I would like to ask whether it is possible to get the required behavior, and
for some tips on how to accomplish this.

Thank you in advance,

Leonidas


RE: Pass Analyzed Field to SignatureUpdateProcessorFactory

2017-01-25 Thread Markus Jelsma
Hello, 

This is not possible out of the box. You would need to manually pass the input
through an analyzer with a tokenizer and your stemming token filter, and put the
output together again.
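
To make the "analyze, then put the output together again" idea concrete, here
is a minimal Python sketch. It is only an illustration under stated
assumptions: the whitespace tokenizer and the toy_stem function are
hypothetical stand-ins for a real Solr analyzer chain, and the suffix rule is
chosen only so that it matches the "value1_a" / "value1_b" example from the
question.

```python
import re


def tokenize(text):
    """Split on whitespace (stand-in for a real tokenizer)."""
    return text.split()


def toy_stem(token):
    """Hypothetical stemmer: strip a trailing '_<letter>' suffix."""
    return re.sub(r"_[a-z]$", "", token)


def analyze(text):
    """Lowercase, tokenize, stem each token, and join the tokens again."""
    return " ".join(toy_stem(token) for token in tokenize(text.lower()))


print(analyze("value1_a"))  # value1
print(analyze("value1_b"))  # value1
```

The joined string, rather than the raw field value, is what would then be fed
into the signature computation.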

Markus

 
 


Re: Pass Analyzed Field to SignatureUpdateProcessorFactory

2017-01-25 Thread Alexandre Rafalovitch
It might be possible by adding update request processors (URPs) before the
signature one. For example: clone the field, apply a regex to the clone
instead of tokenizing it, then compute the signature. If keeping the clone is
too much of a burden, it may even be possible to then add the IgnoreField URP
to remove it, or to map it in the schema to a field with indexed, stored and
docValues all set to false.
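
The order of steps in such a chain can be sketched in plain Python. This is a
simulation only, not Solr code: the function names, the temporary field name
"field1_tmp", and the regex are assumptions made for illustration, and MD5
stands in for whatever signature class is actually configured.

```python
import hashlib
import re


def clone_field(doc, source, dest):
    """Copy the source field into a new field."""
    doc[dest] = doc[source]


def regex_replace(doc, field, pattern, replacement):
    """Normalize a field in place with a regex."""
    doc[field] = re.sub(pattern, replacement, doc[field])


def add_signature(doc, fields, signature_field):
    """Hash the chosen fields into one signature (MD5 as a stand-in)."""
    joined = "\x00".join(str(doc[name]) for name in fields)
    doc[signature_field] = hashlib.md5(joined.encode("utf-8")).hexdigest()


def ignore_field(doc, field):
    """Drop the temporary field so it never reaches the index."""
    doc.pop(field, None)


def run_chain(doc):
    """Apply the steps in order: clone, regex, signature, ignore."""
    doc = dict(doc)  # work on a copy, leave the input untouched
    clone_field(doc, "field1", "field1_tmp")
    regex_replace(doc, "field1_tmp", r"_[a-z]$", "")
    add_signature(doc, ["field1_tmp", "field2", "field3"], "signature")
    ignore_field(doc, "field1_tmp")
    return doc


doc_a = run_chain({"field1": "value1_a", "field2": "value2", "field3": "value3"})
doc_b = run_chain({"field1": "value1_b", "field2": "value2", "field3": "value3"})
print(doc_a["signature"] == doc_b["signature"])  # True
print(doc_a["field1"], "field1_tmp" in doc_a)    # value1_a False
```

The original field1 value is kept as sent; only the normalized clone takes
part in the signature, and it is removed before the document is stored.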

Regards,
   Alex.
P.s. The full all-in-one list of URPs is available at:
http://www.solr-start.com/info/update-request-processors/


http://www.solr-start.com/ - Resources for Solr users, new and experienced




Re: Pass Analyzed Field to SignatureUpdateProcessorFactory

2017-01-26 Thread Leonidas Zagkaretos
Finally, I was able to implement the desired behavior using your suggestions,
as follows:

- Added StatelessScriptUpdateProcessorFactory before
SignatureUpdateProcessorFactory in order to analyze "field1" and write the
analyzed value to "field1_tmp_ss"
- Passed "field1_tmp_ss" to SignatureUpdateProcessorFactory
- Used IgnoreFieldUpdateProcessorFactory to drop "field1_tmp_ss" from the
stored document

Everything seems to work fine and as expected.

Thank you very much,
Have a nice day,

Leonidas
