I have a Solr 3.6 deployment I inherited.

The schema.xml specifies the use of StandardTokenizerFactory like so ...

    <fieldType name="text_general" class="solr.TextField"
      <tokenizer class="solr.StandardTokenizerFactory"/>

According to this reference guide (
https://home.apache.org/~ctargett/RefGuidePOC/jekyll/Tokenizers.html) ...
the StandardTokenizer will treat punctuation as delimiters.

However, here is my content that gets indexed:

    "IOM-1:BA9ATS0FAB,\"Company Name

    CM Rear Module\",B-6,000009XP12133407,"

This piece `B-A,000006KB09029932` gets tokenized into two words ... `|B-A|`
and `|000006KB09029932|`.

But this piece `B-6,000009XP12133407` gets tokenized into one word ...

What I've observed is that the comma is not considered a delimiter when it is
preceded by a digit ... almost like it considers "6,000" to be currency or

QUESTION: Is this a bug in StandardTokenizer, or do I misunderstand how
commas are used as delimiters?
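
The pattern described above can be modelled with a small Python sketch. It only illustrates the hypothesis (a comma splits tokens unless the character before it is a digit) and is not how StandardTokenizer is implemented. The function name `split_like_observed` and the regex are made up for this example.

```python
import re

# Hypothetical model of the OBSERVED behaviour, not Lucene's actual grammar:
# a comma acts as a delimiter unless the character right before it is a digit.
COMMA_NOT_AFTER_DIGIT = re.compile(r"(?<!\d),")


def split_like_observed(text):
    """Split on commas that are not preceded by a digit; drop empty pieces."""
    return [piece for piece in COMMA_NOT_AFTER_DIGIT.split(text) if piece]


# The two pieces from the post: one splits at the comma, the other does not.
print(split_like_observed("B-A,000006KB09029932"))
print(split_like_observed("B-6,000009XP12133407"))
```

If the real tokenizer follows a rule of this shape, the second piece stays in one token because the comma sits right after the digit `6`.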

