Re: standard tokenizer seemingly splitting on dot

Shawn Heisey Tue, 02 May 2023 13:56:24 -0700

On 5/2/23 13:16, Bill Tantzen wrote:

> This tokenizer splits the text field into tokens, treating whitespace and
> punctuation as delimiters.
> Delimiter characters are discarded, with the following exceptions:
> Periods (dots) that are not followed by whitespace are kept as part of the
> token, including Internet domain names.

I checked on a dev version (9.3.0-SNAPSHOT) and StandardTokenizer does indeed do exactly what the docs say.

The analysis definition in the fieldType probably contains more than just the StandardTokenizer: one or more filters that DO break up terms on a period.
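
To picture how that can happen, here is a rough Python sketch of a two-stage analysis chain. It is NOT Lucene or Solr code: tokenize_keep_inner_dots is a toy stand-in that only imitates the documented rule (a period not followed by whitespace stays in the token), and split_on_dots is a hypothetical stand-in for whatever filter in the chain might be splitting on periods. Both names are made up for illustration.

```python
import re


def tokenize_keep_inner_dots(text):
    """Toy imitation of the documented rule, not the real StandardTokenizer.

    Whitespace and punctuation other than '.' act as delimiters. A period
    is kept unless it ends up at the end of a piece, which is where a
    period followed by a delimiter or the end of the text lands.
    """
    tokens = []
    for chunk in text.split():
        # Split on anything that is neither a word character nor a dot.
        for piece in re.split(r"[^\w.]+", chunk):
            piece = piece.rstrip(".")  # drop trailing periods only
            if piece:
                tokens.append(piece)
    return tokens


def split_on_dots(tokens):
    """Hypothetical downstream filter that breaks every token on periods."""
    result = []
    for token in tokens:
        result.extend(part for part in token.split(".") if part)
    return result


if __name__ == "__main__":
    after_tokenizer = tokenize_keep_inner_dots("Visit example.com today.")
    print(after_tokenizer)
    print(split_on_dots(after_tokenizer))
```

The tokenizer stage leaves example.com intact, and only the second stage breaks it apart. If indexed terms are split on the dot, it is worth looking at each stage of the chain separately rather than assuming the tokenizer is responsible.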


Thanks,
Shawn
