How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread tinman
Hi all,

I've followed the instructions at
http://wiki.apache.org/solr/Deduplication and got the basic dedupe field
working. However, it doesn't seem to recognize case differences or
whitespace differences, even though I've defined the type of the fields
used for dedupe, as well as the signature field, as follows in schema.xml:

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
           name="text_ws_lower" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name" type="text_ws_lower"/>
<field name="signatureField" type="text_ws_lower"/>

and in solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signatureField</str>
    <str name="fields">name</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I know a possible solution is to lowercase the name field and remove its
whitespace before submitting documents to Solr, but are there any other
alternatives, so that given the following data
Name: JOHN SMITH and jOhn  SMITh
the documents get the same value in signatureField?
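
For reference, the client-side workaround mentioned above could look roughly like the sketch below. It is only an illustration: the field name name_dedupe and the document ids are hypothetical, and the extra field would have to be declared in schema.xml and listed in the processor's "fields" setting in place of name. Keeping the normalized copy in a separate field leaves the original name value untouched for display.

```python
def normalize_for_dedupe(value: str) -> str:
    """Lowercase the value and remove all whitespace."""
    return "".join(value.lower().split())


def with_dedupe_key(doc: dict) -> dict:
    """Return a copy of doc with a normalized copy of 'name' added.

    'name_dedupe' is a hypothetical field name, not part of the
    configuration shown in this thread.
    """
    prepared = dict(doc)
    prepared["name_dedupe"] = normalize_for_dedupe(doc["name"])
    return prepared


# Hypothetical ids; the names are the two examples from the question.
docs = [
    {"id": "1", "name": "JOHN SMITH"},
    {"id": "2", "name": "jOhn  SMITh"},
]
prepared = [with_dedupe_key(d) for d in docs]

# Both documents now carry the same dedupe key.
print(prepared[0]["name_dedupe"] == prepared[1]["name_dedupe"])  # True
```

The prepared documents would then be sent to Solr in place of the originals.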

Thanks heaps
Cheers
tinman







--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997624.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread Koji Sekiguchi

(11/05/29 8:47), tinman wrote:

> I know a possible solution is to lowercase the name field and remove its
> whitespace before submitting documents to Solr, but are there any other
> alternatives, so that given the following data
> Name: JOHN SMITH and jOhn  SMITh
> the documents get the same value in signatureField?


I can't believe this. Those signatures should be different.

Are you sure you see the same signatures in signatureField (it must be
stored=true for you to see the computed signature)? Or did you just see
that the duplicate documents were indexed, without checking signatureField
yourself? If the latter, that is a feature: because you set
overwriteDupes=false, the duplicate check works on the uniqueKey field.

koji
--
http://www.rondhuit.com/en/


Re: How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread tinman
By default, stored=true and indexed=true. In any case, here is an example
of the output from the Solr search console:

<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1234</str>
    <str name="name">JOHN   SMITH </str>
    <str name="signatureField">5430fbe9e6374611</str>
  </doc>
  <doc>
    <str name="id">1233</str>
    <str name="name">   john SMITh</str>
    <str name="signatureField">49867a7835ff6741</str>
  </doc>
</result>

As you can see, the two signature fields are different. And I want
overwriteDupes=false because I want to use field collapsing to remove
duplicates at query time.
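
The output above is consistent with the signature being computed from the name value exactly as submitted, before the text_ws_lower analyzer is applied; I have not confirmed that in the Solr source, so treat it as an assumption. The sketch below only illustrates the effect. It uses a truncated MD5 as a stand-in hash (not Solr's Lookup3Signature), so the values it produces will not match the ones shown above.

```python
import hashlib


def signature(value: str) -> str:
    """Stand-in hash (truncated MD5); NOT Solr's Lookup3Signature."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:16]


def normalize(value: str) -> str:
    """Lowercase the value and remove all whitespace."""
    return "".join(value.lower().split())


# The two name values from the search output above.
a = "JOHN   SMITH "
b = "   john SMITh"

# Hashing the raw values gives two different signatures.
print(signature(a) == signature(b))  # False

# Hashing the normalized values gives a single signature.
print(signature(normalize(a)) == signature(normalize(b)))  # True
```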

Thanks
tinman

