Hi Ciprian,

    I tried to replace the solr.StandardTokenizerFactory with solr.WhitespaceTokenizerFactory

My original (default)

    <!-- A general text field that has reasonable, generic
         cross-language defaults: it tokenizes with StandardTokenizer,
               removes stop words from case-insensitive "stopwords.txt"
               (empty by default), and down cases.  At query time only, it
               also applies synonyms.
          -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

if I replace the index and query analyzer like so I get no results from soul#person or "soul#person", but lots from soul person

   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
<!--        <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
<!--        <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

If I replace the index analyzer but not the query analyzer like so it behaves like the default setup

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
<!--        <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

if I replace the query analyzer but not the index like so I get no results from soul#person or "soul#person", but lots from soul person

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymGraphFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
<!--        <tokenizer class="solr.StandardTokenizerFactory"/> -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />         <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

None of the modified TextField's produce what I would like to see.

I will try the WordDelimiterGraphFilter....

Though I'm somewhat confused as to which analyzer type, index or query,  I should be changing.

Also there are numerous solr.TextField's defined in the schema

text_ws, text_general(the one I modified),  text_gen_sort, etc... Are they all used?

thanks again,

Scott


On 12/27/25 13:41, Ciprian Dimofte - Opensolr.com wrote:
Hi Scott,

This is a classic Solr text analysis issue. The default tokenizer (usually 
StandardTokenizer or ClassicTokenizer) treats # as a delimiter, so soul#person 
gets split into two separate tokens: soul and person.

Where to look:
Your field type definition in schema.xml (or managed-schema) - specifically the 
<analyzer> section for your text field.

Options to fix it:
        1.      Use WhitespaceTokenizer - Only splits on whitespace, so 
soul#person stays as a single token. In your field type, change the tokenizer 
to: solr.WhitespaceTokenizerFactory
        2.      Use PatternTokenizer with a custom regex - Gives you 
fine-grained control over what characters split tokens.
        3.      Add a WordDelimiterGraphFilter with specific settings - You can 
configure exactly which characters cause splits. Set splitOnNumerics=“0”, 
splitOnCaseChange=“0”, generateWordParts=“0”, generateNumberParts=“0”, 
catenateWords=“1”, preserveOriginal=“1”.
        4.      Use a MappingCharFilter - Map # to something that won’t cause a 
split before tokenization.
Documentation links:
        ∙       Tokenizers: 
https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html
        ∙       Filters: 
https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html
Important: Whatever you change at index time, you need to apply the same 
analysis at query time, then reindex your data.

Ciprian

Opensolr SRL
Your Path to Ai Search
https://opensolr.com

On 27 Dec 2025, at 23:36, Scott Derrick <[email protected]> wrote:

Hi,

     I just noticed that when searching for a term that has an embedded 
non-alphanumeric, the default schema for solr splits it into multiple terms.

     The example was soul#person, which caused a search for soul or person.  The behavior 
we want would be the equivalent of "soul#person".  We don't want the user to 
have to enter their search term in quotes .

     Looking for directions to the specific documentation so I can get this 
fixed...

thanks

Scott

Reply via email to