Hi Scott,
This is a classic Solr text analysis issue. The default tokenizer (usually
StandardTokenizer or ClassicTokenizer) treats # as a delimiter, so soul#person
gets split into two separate tokens: soul and person.
Where to look:
Your field type definition in schema.xml (or managed-schema) - specifically the
<analyzer> section for your text field.
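For reference, the part to look for has roughly the shape below. This is only
an illustrative sketch: the field type name text_general and the filter chain
are assumptions, and your schema may use the shorter name="standard" form
instead of the factory class.

```xml
<!-- Illustrative only: field type name and filter chain are assumptions -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- This tokenizer is what splits soul#person into soul and person -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```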
Options to fix it:
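1. Use WhitespaceTokenizer - Only splits on whitespace, so
soul#person stays as a single token. In your field type, change the tokenizer
to: solr.WhitespaceTokenizerFactory. Note that it leaves all other punctuation
attached to tokens as well (a trailing comma, for example), so check the
output on the Analysis screen in the admin UI.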
2. Use PatternTokenizer with a custom regex - Gives you
fine-grained control over what characters split tokens.
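3. Add a WordDelimiterGraphFilter with specific settings - This filter runs
after the tokenizer, so it only helps when the tokenizer keeps soul#person
intact (e.g. WhitespaceTokenizer), and by default it splits on # itself.
Set preserveOriginal="1" to keep soul#person as a token, or use the types
attribute to point at a file that reclassifies # as ALPHA so it never causes
a split. The other flags (splitOnNumerics="0", splitOnCaseChange="0",
generateWordParts="0", generateNumberParts="0", catenateWords="1") control
which additional sub-tokens are emitted.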
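4. Use a MappingCharFilter - Char filters run before the tokenizer, so you can
map # to a character the tokenizer does not split on (an underscore, for
example, with StandardTokenizer). The same mapping has to be applied to
queries.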
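As a rough sketch of options 1 and 2, two alternative field types might look
like this. The names text_ws and text_hash are made up for illustration, and
the regex in the second one is just one possible choice, so please verify
both on the Analysis screen before reindexing.

```xml
<!-- Option 1 (sketch): split on whitespace only; "text_ws" is a hypothetical name -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Option 2 (sketch): split on any run of characters that are not
     letters, digits or #; "text_hash" is a hypothetical name -->
<fieldType name="text_hash" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\p{L}\p{N}#]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A single <analyzer> element with no type attribute is used for both indexing
and querying, which keeps the two sides in sync.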
Documentation links:
∙ Tokenizers:
https://solr.apache.org/guide/solr/latest/indexing-guide/tokenizers.html
∙ Filters:
https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html
Important: Whatever you change in the index-time analyzer, make the matching
change in the query-time analyzer, then reindex your data so existing
documents pick up the new tokens.
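If your field type declares separate index and query analyzers, the idea is
that both get the same tokenizer change. A minimal sketch, assuming a
hypothetical field type text_keep_hash and a placeholder field named content:

```xml
<fieldType name="text_keep_hash" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same tokenizer as the index side so soul#person matches as one term -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- "content" is a placeholder; point your real field at the new type -->
<field name="content" type="text_keep_hash" indexed="true" stored="true"/>
```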
Ciprian
Opensolr SRL
Your Path to Ai Search
https://opensolr.com
> On 27 Dec 2025, at 23:36, Scott Derrick <[email protected]> wrote:
>
> Hi,
>
> I just noticed that when searching for a term that has an embedded
> non-alphanumeric character, the default schema for Solr splits it into
> multiple terms.
>
> The example was soul#person, which caused a search for soul or person.
> The behavior we want would be the equivalent of "soul#person". We don't want
> the user to have to enter their search term in quotes.
>
> Looking for directions to the specific documentation so I can get this
> fixed...
>
> thanks
>
> Scott