[ https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894944#comment-16894944 ]
Tomoko Uchida commented on LUCENE-8937:
---------------------------------------

Hi [~agallou],

the added isLetter() check looks okay to me.

Can you please merge the two patches (0001-xxxx.patch and 0002-xxxx.patch) into one patch? "{{LUCENE-8937.patch}}" is the correct file name here. And please remove "SOLR-8937.patch" to avoid confusion.

Also, can you add a few more tests for regressions and edge cases? I think the same kind of tests as for LUCENE-4063 would be needed.

[~steve_rowe] and [~rcmuir], do you have any thoughts or comments about this change?

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ----------------------------------------------------------------
>
>                 Key: LUCENE-8937
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8937
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Gallou
>            Priority: Major
>         Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch,
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch,
> SOLR-8937.patch
>
>
> Here is the discussion on the mailing list:
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
> characters are identical.
> We can see that here:
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
> In this light stemmer, there is a check to avoid altering the token if the
> token is a number.
> The minimal stemmer also removes the last character of a word if the last
> two characters are identical.
> We can see that here:
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]
> But in this minimal stemmer there is no check to see whether the character
> is a letter or not.
> So numeric tokens whose last two characters are identical
> are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer:
> https://issues.apache.org/jira/browse/LUCENE-4063
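
For readers following the thread, the rule described in the issue can be sketched in isolation. This is only an illustration, not the Lucene source or the attached patch: the class `DoubledSuffixSketch`, the method `trimDoubledLast` and its `requireLetter` flag are hypothetical names. It assumes the proposed fix amounts to guarding the "drop a doubled last character" step with `Character.isLetter()`, as the comment above suggests.

```java
// Hypothetical sketch of the doubled-last-character rule only.
// The real stemmers apply other suffix rules that are not modeled here.
final class DoubledSuffixSketch {

    // Returns the new token length after the rule is applied.
    // requireLetter = false: drop a doubled last character unconditionally.
    // requireLetter = true:  drop it only if that character is a letter.
    static int trimDoubledLast(char[] s, int len, boolean requireLetter) {
        if (len < 2) {
            return len; // no previous character to compare against
        }
        char last = s[len - 1];
        boolean doubled = last == s[len - 2];
        boolean allowed = !requireLetter || Character.isLetter(last);
        return (doubled && allowed) ? len - 1 : len;
    }

    public static void main(String[] args) {
        char[] number = "1234567899".toCharArray();
        char[] word = "baronn".toCharArray(); // a token ending in a doubled letter

        // Without the letter check, the numeric token loses its last digit.
        int n1 = trimDoubledLast(number, number.length, false);
        System.out.println(new String(number, 0, n1)); // prints 123456789

        // With the letter check, the numeric token is left untouched.
        int n2 = trimDoubledLast(number, number.length, true);
        System.out.println(new String(number, 0, n2)); // prints 1234567899

        // Alphabetic tokens are still trimmed when the check is enabled.
        int n3 = trimDoubledLast(word, word.length, true);
        System.out.println(new String(word, 0, n3)); // prints baron
    }
}
```

The method works on a `char[]` plus a length, returning the shortened length rather than a new string, so the token buffer is never copied. Regression tests of the kind requested above could assert on both the numeric and the alphabetic case.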