Re: ClassicTokenizer not working as expected

Shawn Heisey Fri, 11 Jun 2021 18:20:17 -0700

On 2021-06-11 05:49, Tom Van Cuyck wrote:

I have an issue with the ClassicTokenizer. According to the
documentation
(https://solr.apache.org/guide/8_8/tokenizers.html#classic-tokenizer)
this should work as follows:


- Words are split at hyphens, unless there is a number in the word, in
which case the token is not split and the numbers and hyphen(s) are
preserved.

If I run the analysis on 'abc-123' it properly returns a single token.
However if I enter 'abc-def-123' it returns 2 tokens: 'abc' and
'def-123' which is unexpected to me.

Is there a tokenizer or setting that can keep this as a single token?


As ClassicTokenizer is a Lucene class, the javadoc is there:

https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html

And that says what you found in the Solr docs. I suspect that it's doing exactly as advertised. It sees the first part and emits the token "abc" ... then continues on. Then when it is working on the next part, it sees the number after the delimiter and the documented behavior where numbers are concerned kicks in.
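
To make that left-to-right reading concrete, here is a small toy model in Python. It is only a sketch of the behavior described above, not Lucene's actual grammar: the function names are made up for illustration, it only understands hyphens, and it assumes the tokenizer decides about each hyphen by looking at the two parts directly next to it, with no further look-ahead.

```python
def has_digit(segment):
    """Return True if the segment contains at least one digit."""
    return any(ch.isdigit() for ch in segment)


def toy_hyphen_split(text):
    """Toy model of the hyphen rule (hypothetical, NOT Lucene code).

    Walks the hyphen-separated parts left to right. A hyphen is kept
    only when the part on either side of it contains a digit; otherwise
    the token built so far is emitted and a new one is started.
    """
    # Empty parts (from doubled hyphens) are simply dropped in this sketch.
    segments = [s for s in text.split("-") if s]
    tokens = []
    current = []  # parts of the token currently being built
    for segment in segments:
        if current and (has_digit(current[-1]) or has_digit(segment)):
            # A digit sits next to this hyphen: keep the hyphen.
            current.append(segment)
        else:
            # No digit on either side: emit what we have and start over.
            if current:
                tokens.append("-".join(current))
            current = [segment]
    if current:
        tokens.append("-".join(current))
    return tokens


print(toy_hyphen_split("abc-123"))      # ['abc-123']
print(toy_hyphen_split("abc-def-123"))  # ['abc', 'def-123']
```

With 'abc-def-123' the model has already emitted "abc" by the time it reaches the part containing the number, which matches the two tokens reported in the question.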

Getting the tokenizer to look ahead through multiple delimiters to do what you're expecting would probably be a lot harder than it sounds. I'm not an expert in analyzer code, though.

I do not have any idea about the token type. That does sound a little bit wrong, but I can't speak for the code author's intent.


Thanks,
Shawn
