SOLR Tokenizer “solr.SimplePatternSplitTokenizerFactory” splits at unexpected characters

Stephan Damson Mon, 25 Feb 2019 23:19:16 -0800

Hi!

I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. 
The pattern used is actually from an example in the SOLR documentation and I do 
not understand where I made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during 
indexing the input gets split into the tokens "ope", "a" and "ive". That is, 
the tokenizer splits at the characters "r" and "t", and not at the expected 
whitespace characters (CR, TAB). Just to be sure, I also tried using more than 
one backslash in the pattern (e.g. \t and \\t), but this did not change how 
the input is tokenized during indexing.
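
To illustrate the symptom outside of SOLR, here is a minimal Python sketch. It 
is not SOLR code and says nothing definite about how the tokenizer parses its 
pattern; it only shows that splitting on the literal letters "t", "r" and "n" 
reproduces exactly the tokens I see. The idea that the backslash escapes are 
being read as plain letters is an assumption on my part, and the helper name 
split_tokens is made up for this example.

```python
import re

# Assumption (not confirmed): the pattern is read as the literal characters
# space, "t", "r", "n" instead of space, TAB, CR, LF.
LITERAL_LETTERS = re.compile(r"[ trn]+")

# What I intended: split on runs of real whitespace characters.
WHITESPACE = re.compile(r"[ \t\r\n]+")


def split_tokens(pattern, text):
    """Split text on the pattern and drop empty tokens."""
    return [tok for tok in pattern.split(text) if tok]


observed = split_tokens(LITERAL_LETTERS, "operative")
expected = split_tokens(WHITESPACE, "operative")

print(observed)  # matches the tokens shown by the analyzer
print(expected)  # the single token I expected
```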


What am I missing?
SOLR version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Many thanks in advance for any help you can provide!
