On 2021-06-11 05:49, Tom Van Cuyck wrote:
> I have an issue with the ClassicTokenizer. According to the
> documentation
> (https://solr.apache.org/guide/8_8/tokenizers.html#classic-tokenizer)
> this should work as follows:
> - Words are split at hyphens, unless there is a number in the word, in
> which case the token is not split and the numbers and hyphen(s) are
> preserved.
> If I run the analysis on 'abc-123' it properly returns a single token.
> However if I enter 'abc-def-123' it returns 2 tokens: 'abc' and
> 'def-123' which is unexpected to me.
> Is there a tokenizer or setting that can keep this as a single token?

ClassicTokenizer is a Lucene class, so its javadoc is here:
https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html
And that says what you found in the Solr docs. I suspect that it's
doing exactly as advertised. It sees the first part and emits the token
"abc" ... then continues on. Then when it is working on the next part,
it sees the number after the delimiter and the documented behavior where
numbers are concerned kicks in.
Getting the tokenizer to look ahead through multiple delimiters to do
what you're expecting would probably be a lot harder than it sounds.
I'm not an expert in analyzer code, though.
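
To illustrate the behavior I am guessing at, here is a toy model in plain
Java. It is NOT Lucene's code and does not use the real tokenizer grammar;
the class name, the regular expressions, and the restriction to ASCII
letters, digits, and a single hyphen are all my own assumptions. It only
shows how a matcher that works left to right, joining two hyphen-separated
parts when one of them contains a digit, ends up with the split you saw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Lucene implementation.
public class GreedySplitSketch {
    // A run of ASCII letters and digits.
    private static final String WORD = "[A-Za-z0-9]+";
    // A run of ASCII letters and digits that contains at least one digit.
    private static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";

    // Tried in order at each position: word-hyphen-digitpart,
    // digitpart-hyphen-word, and finally a plain word.
    private static final Pattern TOKEN = Pattern.compile(
        WORD + "-" + HAS_DIGIT + "|" + HAS_DIGIT + "-" + WORD + "|" + WORD);

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        // Each find() resumes where the previous token ended, so the
        // matcher never goes back to reconsider a token already emitted.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("abc-123"));     // [abc-123]
        System.out.println(tokenize("abc-def-123")); // [abc, def-123]
    }
}
```

In this model, 'abc' is followed by '-def', which has no digit, so 'abc' is
emitted on its own. The next match starts at 'def', sees '-123', and keeps
'def-123' together.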
I do not have any idea about the token type. That does sound a little
bit wrong, but I can't speak for the code author's intent.
Thanks,
Shawn