You provided a list of TokenFilters that you use in your Analyzer, 
but you didn't mention anything about what Tokenizer you are using.

You also mentioned seeing a difference in the "tokenization result", and 
the example output you gave does in fact seem to be the output of the 
tokenizer -- not the output of the TokenFilters you mentioned -- since 
ShingleFilter would produce more output tokens than you listed.

All of which suggests that the discrepancy you are seeing is in your 
tokenizer.
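
A quick way to sanity check that is to run the Tokenizer by itself, w/o 
any of the TokenFilters, and dump exactly what it emits for your input.  
Rough sketch below -- note that StandardTokenizer is purely a guess on 
my part, swap in whatever Tokenizer your Analyzer actually constructs:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder: replace with the Tokenizer your Analyzer really uses.
    Tokenizer tok = new StandardTokenizer();
    tok.setReader(new StringReader("Google's biologist’s"));

    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // one line per token the tokenizer emits, before any filters run
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
  }
}

If the 7.x and 9.x output of that alone already differs, you've narrowed 
the problem down before the filters ever get involved.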

Generally speaking: the best way to ensure folks on the mailing list can 
make sense of your situation and offer assistance is to provide 
reproducible snippets of code w/hardcoded input (a la unit tests) that 
demonstrate what you're seeing.
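
For example, something along the lines of the sketch below would let 
anyone on the list run your exact input through your exact chain against 
both 7.x and 9.x and see where the outputs diverge.  (To be clear: the 
Tokenizer, the filter order, and the one-entry SynonymMap are all guesses 
on my part since none of that was in your mail -- substitute your real 
components.)

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;

public class AnalyzerRepro {
  public static void main(String[] args) throws Exception {
    // Trivial one-entry synonym map, just so the chain runs end-to-end.
    SynonymMap.Builder builder = new SynonymMap.Builder(true);
    builder.add(new CharsRef("biologist"), new CharsRef("scientist"), true);
    SynonymMap synonyms = builder.build();

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // Guessed tokenizer and filter order -- replace with your real chain.
        Tokenizer source = new StandardTokenizer();
        TokenStream sink = new SynonymGraphFilter(source, synonyms, true);
        sink = new FlattenGraphFilter(sink);
        sink = new ShingleFilter(sink);
        return new TokenStreamComponents(source, sink);
      }
    };

    // Hardcoded input taken straight from your mail.
    try (TokenStream ts =
             analyzer.tokenStream("f", new StringReader("Google's biologist’s"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}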

: Our current code is based on Lucene 7.
: In some analyzer test case, given the string "Google's biologist’s", the
: tokenization result is ["google", "biologist"]
: 
: But after migrating the codebase to Lucene 9,
: the result becomes ["googles", "biologist’s"]


: The analyzer uses the following three Lucene libraries:
: 
: org.apache.lucene.analysis.core.FlattenGraphFilter;
: 
: org.apache.lucene.analysis.shingle.ShingleFilter;
: 
: org.apache.lucene.analysis.synonym.SynonymGraphFilter;


-Hoss
http://www.lucidworks.com/
