[
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894944#comment-16894944
]
Tomoko Uchida commented on LUCENE-8937:
---------------------------------------
Hi [~agallou],
the added isLetter() check looks okay to me.
Can you please merge the two patches (0001-xxxx.patch and 0002-xxxx.patch) into
one patch? "{{LUCENE-8937.patch}}" is the correct file name here. And please remove
"SOLR-8937.patch" to avoid confusion.
Also, can you add a few more tests for regressions and edge cases? I think the
same kind of tests as for LUCENE-4063 would be needed.
[~steve_rowe] and [~rcmuir], do you have any thoughts or comments about this
change?
> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ----------------------------------------------------------------
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Gallou
> Priority: Major
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch,
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch,
> SOLR-8937.patch
>
>
> Here is the discussion on the mailing list:
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
> characters are identical.
> We can see that here:
>
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
> In this light stemmer, there is a check to avoid altering the token if the
> token is a number.
> The minimal stemmer also removes the last character of a word if the last
> two characters are identical.
> We can see that here:
>
> [https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]
> But in this minimal stemmer there is no check to see if the character is a
> letter or not.
> So when we have numeric tokens with the last two characters identical they
> are altered.
> For example "1234567899" will be stemmed as "123456789".
> It would be great if it were not altered.
> Here is the same issue for the LightStemmer:
> https://issues.apache.org/jira/browse/LUCENE-4063
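
For illustration, the behavior described in the issue can be sketched as below. This is a simplified, stand-alone sketch and not the actual FrenchMinimalStemmer source: the class name DoubledCharRuleSketch and its methods are hypothetical, and only the doubled-character rule is modeled (the other suffix rules of the stemmer are left out). It assumes the fix amounts to guarding that rule with Character.isLetter(), as the comment above suggests.

```java
public class DoubledCharRuleSketch {

    // Rule as described in the issue: drop the last character whenever the
    // last two characters are identical, without checking what kind of
    // character it is.
    static int withoutLetterCheck(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Same rule, but only applied when the doubled character is a letter
    // (assumed shape of the proposed isLetter() check).
    static int withLetterCheck(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
            len--;
        }
        return len;
    }

    // Hypothetical helper: applies one of the two variants to a token and
    // returns the resulting string.
    static String apply(String token, boolean letterCheck) {
        char[] s = token.toCharArray();
        int len = letterCheck
                ? withLetterCheck(s, s.length)
                : withoutLetterCheck(s, s.length);
        return new String(s, 0, len);
    }

    public static void main(String[] args) {
        System.out.println(apply("1234567899", false)); // 123456789
        System.out.println(apply("1234567899", true));  // 1234567899
        System.out.println(apply("abcdd", true));       // abcd
    }
}
```

With the letter check in place, the numeric token from the issue is left untouched, while a token ending in a doubled letter is still shortened by one character.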