[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-30 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8937:
--
Component/s: modules/analysis

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Adrien Gallou
> Assignee: Tomoko Uchida
> Priority: Minor
> Fix For: master (9.0)
>
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>
> Here is the discussion on the mailing list:
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]
> The light stemmer removes the last character of a word if the last two
>  characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
>  In this light stemmer, there is a check to avoid altering the token if the
>  token is a number.
> The minimal stemmer also removes the last character of a word if the last
>  two characters are identical.
>  We can see that here:
>  
> https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> But the minimal stemmer does not check whether that character is a letter,
>  so numeric tokens whose last two characters are identical are altered.
> For example, "1234567899" is stemmed to "123456789".
> It would be great if such tokens were not altered.
> Here is the same issue for the light stemmer:
> https://issues.apache.org/jira/browse/LUCENE-4063
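
The behaviour described above can be illustrated with a small self-contained
sketch. It is not the Lucene source: the class name TrailingDuplicateSketch and
its methods are hypothetical, and the guard shown (Character.isLetter on the
last character) is only one plausible way to skip numeric tokens. It compares
an unguarded "drop the last character when the last two are identical" rule
with a guarded variant.

```java
// Hypothetical sketch, not the actual FrenchMinimalStemmer code.
final class TrailingDuplicateSketch {

    // Unguarded rule, as described in the issue: drop the last character
    // whenever the last two characters are identical.
    static int stripUnguarded(char[] s, int len) {
        if (len > 1 && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Guarded rule (assumption): apply the same rule only when the last
    // character is a letter, so digits are left untouched.
    static int stripGuarded(char[] s, int len) {
        if (len > 1 && Character.isLetter(s[len - 1]) && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Convenience wrapper that works on strings instead of char buffers.
    static String apply(String token, boolean guarded) {
        char[] buf = token.toCharArray();
        int len = guarded ? stripGuarded(buf, buf.length) : stripUnguarded(buf, buf.length);
        return new String(buf, 0, len);
    }

    public static void main(String[] args) {
        System.out.println(apply("1234567899", false)); // 123456789
        System.out.println(apply("1234567899", true));  // 1234567899
        System.out.println(apply("baronn", true));      // baron
    }
}
```

With the guard in place, a numeric token such as "1234567899" is returned
unchanged, while a word ending in a doubled letter is still shortened.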



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-29 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
Attachment: (was: 
0002-check-if-the-last-character-is-a-letter-before-remov.patch)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Minor
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-29 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
Attachment: (was: SOLR-8937.patch)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Minor
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-29 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
Attachment: LUCENE-8937.patch
0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Minor
> Attachments: 
> 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch, 
> LUCENE-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-29 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
Attachment: (was: 0001-adds-test-cases-on-french-minimal-stemmer.patch)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Minor
> Attachments: 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-29 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
Description: 
Here is the discussion on the mailing list : 
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]

The light stemmer removes the last character of a word if the last two
 characters are identical.
 We can see that here:
 
https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
 In this light stemmer, there is a check to avoid altering the token if the
 token is a number.

The minimal stemmer also removes the last character of a word if the last
 two characters are identical.
 We can see that here:
 
https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

But the minimal stemmer does not check whether that character is a letter,
 so numeric tokens whose last two characters are identical are altered.

For example, "1234567899" is stemmed to "123456789".

It would be great if such tokens were not altered.

Here is the same issue for the light stemmer:
https://issues.apache.org/jira/browse/LUCENE-4063

  was:
Here is the discussion on the mailing list : 
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]

The light stemmer removes the last character of a word if the last two
 characters are identical.
 We can see that here:
 
[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
 In this light stemmer, there is a check to avoid altering the token if the
 token is a number.

The minimal stemmer also removes the last character of a word if the last
 two characters are identical.
 We can see that here:
 
[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]

But the minimal stemmer does not check whether that character is a letter,
 so numeric tokens whose last two characters are identical are altered.

For example, "1234567899" is stemmed to "123456789".

It would be great if such tokens were not altered.

Here is the same issue for the light stemmer:
https://issues.apache.org/jira/browse/LUCENE-4063


> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Minor
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8937:
--
Priority: Minor  (was: Major)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Minor
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Tomoko Uchida (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-8937:
--
Issue Type: Improvement  (was: Bug)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Gallou
> Priority: Major
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
Description: 
Here is the discussion on the mailing list : 
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]

The light stemmer removes the last character of a word if the last two
 characters are identical.
 We can see that here:
 
[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263]
 In this light stemmer, there is a check to avoid altering the token if the
 token is a number.

The minimal stemmer also removes the last character of a word if the last
 two characters are identical.
 We can see that here:
 
[https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77]

But the minimal stemmer does not check whether that character is a letter,
 so numeric tokens whose last two characters are identical are altered.

For example, "1234567899" is stemmed to "123456789".

It would be great if such tokens were not altered.

Here is the same issue for the light stemmer:
https://issues.apache.org/jira/browse/LUCENE-4063

  was:
Here is the discussion on the mailing list : 
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser]

The light stemmer removes the last character of a word if the last two
characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
In this light stemmer, there is a check to avoid altering the token if the
token is a number.

The minimal stemmer also removes the last character of a word if the last
two characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

But the minimal stemmer does not check whether that character is a letter,
so numeric tokens whose last two characters are identical are altered.

For example, "1234567899" is stemmed to "123456789".

It would be great if such tokens were not altered.


> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Gallou
> Priority: Major
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>






[jira] [Updated] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

2019-07-28 Thread Adrien Gallou (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Gallou updated LUCENE-8937:
--
   Attachment: 
0002-check-if-the-last-character-is-a-letter-before-remov.patch
   0001-adds-test-cases-on-french-minimal-stemmer.patch
   SOLR-8937.patch
Lucene Fields: New,Patch Available  (was: New)
   Status: Open  (was: Open)

> Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
> ---
>
> Key: LUCENE-8937
> URL: https://issues.apache.org/jira/browse/LUCENE-8937
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Gallou
> Priority: Major
> Attachments: 0001-adds-test-cases-on-french-minimal-stemmer.patch, 
> 0002-check-if-the-last-character-is-a-letter-before-remov.patch, 
> SOLR-8937.patch
>
>


