[ https://issues.apache.org/jira/browse/LUCENE-8876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871423#comment-16871423 ]
Mark Harwood commented on LUCENE-8876:
--------------------------------------

{quote}
but then doesn't it mean that exceptions of the 2nd rule are always ignored?
{quote}

Good point. Rule 1 exceptions are odd too - I have not found a single common English word that ends in aies or eies.

> EnglishMinimalStemmer does not implement s-stemmer paper correctly?
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8876
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8876
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>
> The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and
> employees.
> The [original
> paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf]
> has this table of rules:
> !https://user-images.githubusercontent.com/170925/59616454-5dc7d580-911c-11e9-80b0-c7a59458c5a7.png!
> The notes accompanying the table state:
> {quote}"the first applicable rule encountered is the only one used"
> {quote}
>
> For the {{ees}} and {{oes}} suffixes I think EnglishMinimalStemmer
> misinterpreted the rule logic, and consequently {{bees != bee}} and
> {{tomatoes != tomato}}. The {{oes}} and {{ees}} suffixes are left intact.
> "The first applicable rule" for {{ees}} could be interpreted as rule 2 or 3
> in the table, depending on whether you take {{applicable}} to mean "the THEN
> part of the rule has fired" or just that the suffix was referenced in the
> rule. EnglishMinimalStemmer has assumed the latter and I think it should be
> the former. We should fall through into rule 3 for {{ees}} and {{oes}}
> (remove any trailing S). That's certainly the conclusion I came to
> independently when testing on real data.
> There are some additional changes I'd like to see in a plural stemmer but I
> won't list them here - the focus should be on making the code here match the
> original paper it references.
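
The fall-through reading argued for in the description can be sketched in Java roughly as follows. This is an illustration only, not the EnglishMinimalStemmer source or a proposed patch: the class and method names (SStemmerSketch, stem) are hypothetical, and the three rules are written out as they are commonly cited from the paper, since the table is only present as an image here. Treat the exact exception lists as an assumption.

```java
// Hypothetical sketch of the s-stemmer with the "fall-through" reading:
// a rule counts as applicable only if its suffix matches AND none of its
// exceptions match; otherwise the next rule is tried.
final class SStemmerSketch {
    private SStemmerSketch() {}

    static String stem(String w) {
        // Rule 1 (assumed): "ies" -> "y", unless the word ends in "aies" or "eies".
        if (w.endsWith("ies") && !w.endsWith("aies") && !w.endsWith("eies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // Rule 2 (assumed): "es" -> "e", unless the word ends in "aes", "ees" or "oes".
        if (w.endsWith("es") && !w.endsWith("aes")
                && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);
        }
        // Rule 3 (assumed): drop a trailing "s", unless the word ends in "us" or "ss".
        // Words excluded by rule 2 ("ees", "oes") fall through to here.
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        // No rule applicable: leave the word unchanged.
        return w;
    }
}
```

Under this reading "bees", "trees" and "employees" lose only their trailing s. Note that "tomatoes" becomes "tomatoe" rather than "tomato", because rule 3 removes just the final s.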