How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread tinman
Hi all,

I've followed the instructions at
http://wiki.apache.org/solr/Deduplication and got the basic dedupe field
working. However, it doesn't seem to ignore differences in case or
whitespace, even though I've defined the type of the fields used for
dedupe, as well as the signature field, as follows in schema.xml:

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
           name="text_ws_lower" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name" type="text_ws_lower"/>
<field name="signatureField" type="text_ws_lower"/>

and in solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signatureField</str>
    <str name="fields">name</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I know a possible solution is to lowercase the name field and strip its
whitespace before submitting documents to Solr, but are there any other
alternatives, so that given the following data
Name: "JOHN SMITH" and "jOhn  SMITh"
the documents end up with the same value in signatureField?
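
For reference, the client-side workaround mentioned above could look
roughly like the sketch below. As far as I understand, the signature is
computed from the raw submitted field value in the update chain, before
the field type's analyzer runs, which would explain why the tokenizer and
lowercase filter have no effect on it. The function names and the
name_dedupe field are hypothetical, made up for illustration. The sketch
collapses runs of whitespace to a single space rather than removing them
entirely, which is a design choice, not something Solr requires.

```python
import re


def normalize_name(value: str) -> str:
    """Trim, collapse runs of whitespace to one space, and lowercase."""
    return re.sub(r"\s+", " ", value.strip()).lower()


def with_dedupe_key(doc: dict) -> dict:
    """Return a copy of doc with a normalized copy of its name field.

    "name_dedupe" is a hypothetical field name; the original "name"
    value is left untouched so it can still be stored and displayed.
    """
    out = dict(doc)
    out["name_dedupe"] = normalize_name(doc["name"])
    return out


if __name__ == "__main__":
    for raw in ("JOHN   SMITH ", "   john SMITh"):
        print(repr(raw), "->", repr(normalize_name(raw)))
```

With something like this in place, the fields setting of the dedupe chain
would point at the normalized copy (name_dedupe here, again an assumed
name) instead of name, so both documents hash to the same signature.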

Thanks heaps
Cheers
tinman

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997624.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread tinman
By default, stored = true and indexed = true. In any case, here is some
example output from the Solr search console:

<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1234</str>
    <str name="name">JOHN   SMITH </str>
    <str name="signatureField">5430fbe9e6374611</str>
  </doc>
  <doc>
    <str name="id">1233</str>
    <str name="name">   john SMITh</str>
    <str name="signatureField">49867a7835ff6741</str>
  </doc>
</result>

As you can see, the two signature fields are different. I want
overwriteDupes = false because I want to use field collapsing to remove
duplicates at query time.

Thanks
tinman

