Re: WordDelimiterFilterFactory removes words when options set to 0
: In trying to understand the various options for
: WordDelimiterFilterFactory, I tried setting all options to 0. This seems
: to prevent a number of words from being output at all. In particular
: can't and 99dxl don't get output, nor do any words containing hyphens.
: Is this correct behavior?

For the record: there are other options you haven't set... splitOnNumerics defaults to 1; preserveOriginal defaults to 0 ... I'm guessing if you set splitOnNumerics=0 you'd see a lot more tokens come through, and if you set preserveOriginal=1 you'd definitely see a lot more tokens come through by default.

: <fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
:   <analyzer>
:     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
:     <filter class="solr.WordDelimiterFilterFactory"
:       splitOnCaseChange="0"
:       generateWordParts="0"
:       generateNumberParts="0"
:       catenateWords="0"
:       catenateNumbers="0"
:       catenateAll="0"
:     />
:     <filter class="solr.LowerCaseFilterFactory"/>
:   </analyzer>
: </fieldtype>

-Hoss
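For illustration, here is a minimal sketch of the quoted field type with the two additional options set explicitly. It assumes the same field type name and filter chain as the quoted schema, and it has not been run against the sample terms in this thread, so treat the effect on the output as a guess to verify in the analysis page rather than a confirmed result.

```xml
<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splitOnNumerics="0": do not split at letter/digit transitions
         (it defaults to 1 when omitted).
         preserveOriginal="1": also emit the original, unsplit token
         (it defaults to 0 when omitted). -->
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="0"
      splitOnNumerics="0"
      generateWordParts="0"
      generateNumberParts="0"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
      preserveOriginal="1"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

Changing one option at a time and re-running the same input through the analysis page should show which of the two settings brings back which tokens.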
WordDelimiterFilterFactory removes words when options set to 0
In trying to understand the various options for WordDelimiterFilterFactory, I tried setting all options to 0. This seems to prevent a number of words from being output at all. In particular can't and 99dxl don't get output, nor do any words containing hyphens. Is this correct behavior?

Here is what the Solr Analyzer output:

org.apache.solr.analysis.WhitespaceTokenizerFactory {}

  term position  1 2 3 4 5 6 7 8 9
  term text      ca-55 99_3_a9 55-67 powerShot ca999x15foo-bar can't joe's 99dxl

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=0, generateNumberParts=0, catenateWords=0, generateWordParts=0, catenateAll=0, catenateNumbers=0}

  term position     1          5
  term text         powerShot  joe
  term type         word       word
  source start,end  20,29      53,56

Here is the schema:

<fieldtype name="mbooksOcrXPatLike" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="0"
      generateWordParts="0"
      generateNumberParts="0"
      catenateWords="0"
      catenateNumbers="0"
      catenateAll="0"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

Tom