Hi Ciprian,
I tried replacing solr.StandardTokenizerFactory with
solr.WhitespaceTokenizerFactory.
My original (default) field type:
<!-- A general text field that has reasonable, generic
cross-language defaults: it tokenizes with StandardTokenizer,
removes stop words from case-insensitive "stopwords.txt"
(empty by default), and down cases. At query time only, it
also applies synonyms.
-->
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymGraphFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.FlattenGraphFilterFactory"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If I replace the tokenizer in both the index and query analyzers like so, I get
no results for soul#person or "soul#person", but lots for soul person:
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymGraphFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.FlattenGraphFilterFactory"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If I replace the tokenizer in the index analyzer but not in the query analyzer,
like so, it behaves like the default setup:
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymGraphFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.FlattenGraphFilterFactory"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If I replace the tokenizer in the query analyzer but not in the index analyzer,
like so, I get no results for soul#person or "soul#person", but lots for soul person:
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymGraphFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.FlattenGraphFilterFactory"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
None of the modified TextField definitions produces what I would like to see.
I will try the WordDelimiterGraphFilter next.
Though I'm somewhat confused as to which analyzer type, index or query,
I should be changing.
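On the index-versus-query question: an <analyzer> element with no type attribute is applied at both index time and query time, which keeps the two sides from drifting apart. A minimal sketch, assuming a hypothetical field type name text_ws_lower that is not part of the default schema:

```xml
<!-- Hypothetical field type; the name text_ws_lower is an assumption. -->
<fieldType name="text_ws_lower" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain runs at index time and at query time. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before such a change still hold the tokens produced by the old index analyzer, so they need to be reindexed before queries against the new chain can match them.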
Also, there are numerous solr.TextField types defined in the schema:
text_ws, text_general (the one I modified), text_gen_sort, etc. Are
they all used?
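A field type only takes effect for fields that reference it through their type attribute, so one way to check is to look at the <field> and <dynamicField> declarations in the schema. A sketch, assuming a hypothetical field named body and a hypothetical dynamic field pattern *_ws:

```xml
<!-- Hypothetical field; its type attribute decides which analyzer chain runs. -->
<field name="body" type="text_general" indexed="true" stored="true"/>
<!-- Dynamic fields bind to a type the same way; the *_ws pattern is an assumption. -->
<dynamicField name="*_ws" type="text_ws" indexed="true" stored="true"/>
```

A field type that no <field> or <dynamicField> refers to is defined but has no effect on indexing or search, so the type to modify is the one used by the field actually being queried.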
thanks again,
Scott
On 12/27/25 13:41, Ciprian Dimofte - Opensolr.com wrote:
Hi Scott,
This is a classic Solr text analysis issue. The default tokenizer (usually
StandardTokenizer or ClassicTokenizer) treats # as a delimiter, so soul#person
gets split into two separate tokens: soul and person.
Where to look:
Your field type definition in schema.xml (or managed-schema) - specifically the
<analyzer> section for your text field.
Options to fix it:
1. Use WhitespaceTokenizer - Only splits on whitespace, so
soul#person stays as a single token. In your field type, change the tokenizer
to: solr.WhitespaceTokenizerFactory
2. Use PatternTokenizer with a custom regex - Gives you
fine-grained control over what characters split tokens.
3. Add a WordDelimiterGraphFilter with specific settings - You can
configure exactly which characters cause splits. Set splitOnNumerics="0",
splitOnCaseChange="0", generateWordParts="0", generateNumberParts="0",
catenateWords="1", preserveOriginal="1".
4. Use a MappingCharFilter - Map # to something that won’t cause a
split before tokenization.
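Option 3 could look roughly like the sketch below. It assumes a hypothetical field type name text_hash, and it pairs the filter with WhitespaceTokenizer, since a tokenizer that has already split on # would leave nothing for the filter to preserve. FlattenGraphFilterFactory appears at index time only, following the Solr documentation's guidance for graph filters.

```xml
<!-- Hypothetical field type; the name text_hash is an assumption. -->
<fieldType name="text_hash" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so soul#person reaches the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            splitOnNumerics="0" splitOnCaseChange="0"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" preserveOriginal="1"/>
    <!-- Graph filters used at index time must be followed by a flatten step. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            splitOnNumerics="0" splitOnCaseChange="0"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With settings like these, soul#person should survive as an original token alongside a catenated form. The Analysis screen in the Solr admin UI is a quick way to confirm what each stage actually emits for a given input.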
Documentation links:
∙ Tokenizers:
https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html
∙ Filters:
https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html
Important: Whatever you change at index time, you need to apply the same
analysis at query time, then reindex your data.
Ciprian
Opensolr SRL
Your Path to Ai Search
https://opensolr.com
On 27 Dec 2025, at 23:36, Scott Derrick wrote:
Hi,
I just noticed that when searching for a term that has an embedded
non-alphanumeric character, the default Solr schema splits it into multiple terms.
The example was soul#person, which caused a search for soul or person. The behavior
we want would be the equivalent of "soul#person". We don't want the user to
have to enter their search term in quotes.
Looking for directions to the specific documentation so I can get this
fixed...
thanks
Scott