Re: WordDelimiterFilterFactory removes words when options set to 0

2009-04-28 Thread Chris Hostetter

: In trying to understand the various options for 
: WordDelimiterFilterFactory, I tried setting all options to 0. This seems 
: to prevent a number of words from being output at all. In particular 
: can't and 99dxl don't get output, nor do any words containing hyphens. 
: Is this correct behavior?

For the record: there are other options you haven't set... splitOnNumerics 
defaults to 1; preserveOriginal defaults to 0 ... I'm guessing if you 
set splitOnNumerics=0 you'd see a lot more tokens come through, and if 
you set preserveOriginal=1 you'd definitely see a lot more tokens come 
through.
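
For reference, a rough sketch of the filter with those two options set 
explicitly. This is untested, it assumes your Solr version supports both 
attributes, and the values are only meant to make the suppressed tokens 
visible, not to be a recommended config:

```xml
<!-- Same filter as in the schema quoted below, plus the two options that
     were left at their defaults. Values here are illustrative only. -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="0"
        generateWordParts="0"
        generateNumberParts="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        splitOnNumerics="0"
        preserveOriginal="1"
/>
```

Changing one of the two at a time and re-running the same sample text 
through the analysis page should show which setting brings each token back.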

: <fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
:   <analyzer>
:     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
:     <filter class="solr.WordDelimiterFilterFactory"
:             splitOnCaseChange="0"
:             generateWordParts="0"
:             generateNumberParts="0"
:             catenateWords="0"
:             catenateNumbers="0"
:             catenateAll="0"
:     />
:     <filter class="solr.LowerCaseFilterFactory"/>
:   </analyzer>
: </fieldtype>

-Hoss



WordDelimiterFilterFactory removes words when options set to 0

2009-04-17 Thread Burton-West, Tom
In trying to understand the various options for WordDelimiterFilterFactory, I 
tried setting all options to 0.
This seems to prevent a number of words from being output at all. In particular 
can't and 99dxl don't get output, nor do any words containing hyphens. Is 
this correct behavior?


Here is what the Solr Analyzer output

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position   1       2         3       4           5          6         7       8       9
term text       ca-55   99_3_a9   55-67   powerShot   ca999x15   foo-bar   can't   joe's   99dxl

 org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=0, 
generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, 
catenateNumbers=0}

term position      1           5
term text          powerShot   joe
term type          word        word
source start,end   20,29       53,56

Here is the schema
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

Tom