Re: Search term automatically split at non-alphanumeric

Scott Derrick Wed, 31 Dec 2025 14:19:16 -0800

Hi Ciprian,

I tried to replace the solr.StandardTokenizerFactory with solr.WhitespaceTokenizerFactory.


My original (default)

    <!-- A general text field that has reasonable, generic
         cross-language defaults: it tokenizes with StandardTokenizer,
         removes stop words from case-insensitive "stopwords.txt"
         (empty by default), and down cases.  At query time only, it
         also applies synonyms.
    -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

If I replace the index and query analyzers like so, I get no results from soul#person or "soul#person", but lots from soul person:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

If I replace the index analyzer but not the query analyzer, like so, it behaves like the default setup:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

If I replace the query analyzer but not the index analyzer, like so, I get no results from soul#person or "soul#person", but lots from soul person:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

None of the modified TextField definitions produces what I would like to see.

I will try the WordDelimiterGraphFilter....

Though I'm somewhat confused as to which analyzer type, index or query, I should be changing.


Also, there are numerous solr.TextField types defined in the schema: text_ws, text_general (the one I modified), text_gen_sort, etc. Are they all used?


thanks again,

Scott


On 12/27/25 13:41, Ciprian Dimofte - Opensolr.com wrote:

Hi Scott,

This is a classic Solr text analysis issue. The default tokenizer (usually 
StandardTokenizer or ClassicTokenizer) treats # as a delimiter, so soul#person 
gets split into two separate tokens: soul and person.

Where to look:
Your field type definition in schema.xml (or managed-schema) - specifically the 
<analyzer> section for your text field.

Options to fix it:
        1.      Use WhitespaceTokenizer - Only splits on whitespace, so
                soul#person stays as a single token. In your field type,
                change the tokenizer to: solr.WhitespaceTokenizerFactory
        2.      Use PatternTokenizer with a custom regex - Gives you
                fine-grained control over what characters split tokens.
        3.      Add a WordDelimiterGraphFilter with specific settings - You can
                configure exactly which characters cause splits. Set
                splitOnNumerics="0", splitOnCaseChange="0",
                generateWordParts="0", generateNumberParts="0",
                catenateWords="1", preserveOriginal="1".
        4.      Use a MappingCharFilter - Map # to something that won’t cause a
                split before tokenization.
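
A rough, untested sketch of what option 3 might look like in the schema. It assumes a WhitespaceTokenizer in front of the filter so that soul#person reaches it as a single token, and it keeps the LowerCaseFilter from the default text_general type. The field type name text_keep_hash is made up for illustration and is not part of the default schema.

```xml
<!-- Hypothetical field type name; pick whatever fits your schema. -->
<fieldType name="text_keep_hash" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Splits on whitespace only, so soul#person arrives as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Settings taken from option 3 above -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            splitOnNumerics="0" splitOnCaseChange="0"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" preserveOriginal="1"/>
    <!-- Graph filters should be flattened at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same settings as the index analyzer, without the flatten step -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            splitOnNumerics="0" splitOnCaseChange="0"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With these settings the filter should keep the original token (soul#person) and add a joined form (soulperson) instead of emitting soul and person separately. The Analysis screen in the Solr admin UI is a quick way to check what each stage actually produces.
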

Documentation links:
        ∙       Tokenizers: https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html
        ∙       Filters: https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html

Important: Whatever you change at index time, you need to apply the same 
analysis at query time, then reindex your data.
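
One way to keep index and query analysis identical is to declare a single analyzer with no type attribute, which Solr then uses for both. A minimal sketch, assuming whitespace tokenization plus lowercasing is enough for the field; the name text_ws_lower is hypothetical, and stop words and synonyms are left out to keep it short:

```xml
<!-- Hypothetical field type name; not part of the default schema. -->
<fieldType name="text_ws_lower" class="solr.TextField" positionIncrementGap="100">
  <!-- No type="index" / type="query": this analyzer applies to both -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before the change still carry tokens produced by the old analyzer, so the new behaviour only shows up after a reindex.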

Ciprian

Opensolr SRL
Your Path to Ai Search
https://opensolr.com

On 27 Dec 2025, at 23:36, Scott Derrick <[email protected]> wrote:

Hi,

     I just noticed that when searching for a term that has an embedded 
non-alphanumeric character, the default schema for Solr splits it into 
multiple terms.

     The example was soul#person, which caused a search for soul or person.  
The behavior we want would be the equivalent of "soul#person".  We don't want 
the user to have to enter their search term in quotes.

     Looking for directions to the specific documentation so I can get this 
fixed...

thanks

Scott
