(11/05/29 8:47), tinman wrote:
Hi all,

I've followed the instructions at
http://wiki.apache.org/solr/Deduplication and got the basic dedupe field
working. However, it doesn't seem to ignore differences in case or
whitespace, even though I've defined the type of the fields used for
dedupe, as well as the signature field, as follows in schema.xml:

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
           name="text_ws_lower" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name" type="text_ws_lower"/>
<field name="signatureField" type="text_ws_lower"/>

and in solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signatureField</str>
    <str name="fields">name</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I know one possible solution is to lowercase the field "name" and remove
whitespace before submitting documents to Solr, but are there any
alternatives, so that when the following data is given
Name: JOHN SMITH and jOhn      SMITh, the documents end up with the same
value in signatureField?

I can't believe this. Those signatures should be different.

Are you sure you see the same signatures in signatureField (it should be
stored=true in order to see the resulting signature)? Or did you just see
that those duplicate documents were registered, without checking
signatureField yourself? If the latter, that is a feature: you set
overwriteDupes=false, which means the duplication check works on the
uniqueKey field.
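
For reference, the client-side workaround mentioned in the question could look roughly like the sketch below. Update processors see the field value as it was submitted, before the field type's analyzer runs, so the value has to be normalized before it reaches the signature processor. The function name normalize_for_signature is hypothetical, and collapsing whitespace runs to a single space (rather than removing whitespace entirely) is an assumption; either choice would make the two example names identical.

```python
import re


def normalize_for_signature(value: str) -> str:
    """Lowercase a value and collapse whitespace runs to a single space.

    Hypothetical helper: apply it to the "name" value on the client
    before sending the document to Solr, so the signature is computed
    from the normalized text.
    """
    # Trim leading/trailing whitespace, then collapse internal runs.
    collapsed = re.sub(r"\s+", " ", value.strip())
    return collapsed.lower()


if __name__ == "__main__":
    for raw in ("JOHN SMITH", "jOhn      SMITh"):
        print(repr(normalize_for_signature(raw)))
```

Both example names from the question normalize to the same string, so a signature computed from the normalized value would match for the two documents.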

koji
--
http://www.rondhuit.com/en/
