Hello,
I see that the stopwords_fr.txt list included in the Solr default configset
is out of date compared to the equivalent file in Lucene (french_stop.txt).
https://github.com/apache/solr/blob/a42c605fb916439222a086356f368f02cf80304a/solr/server/solr/configsets/_default/conf/lang/stopwords_fr.txt
https://github.com/apache/lucene/blame/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt

Specifically, I ran into an issue when indexing été (summer): the token was
removed as a stop word because it is also the past participle of
être ("to be").
It appears that the Snowball list (
https://snowballstem.org/algorithms/french/stop.txt) has been updated to
resolve this specific issue, and the commit history in the Lucene
repository shows that Lucene picked up the change years ago (
https://issues.apache.org/jira/browse/LUCENE-9354).
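
For illustration, here is a minimal Python sketch of how the two lists could be compared and how the difference affects a token like été. It assumes both files use the Snowball format, where "|" starts a comment. The two excerpts are made up for the example and are not the real file contents, and the helper names (parse_stopwords, removed_entries, drop_stopwords) are hypothetical.

```python
def parse_stopwords(text):
    """Parse Snowball-style stop word text into a set of words.

    Assumes '|' starts a comment and blank lines are ignored.
    """
    words = set()
    for line in text.splitlines():
        entry = line.split("|", 1)[0].strip()
        if entry:
            words.update(entry.split())
    return words


def removed_entries(old_text, new_text):
    """Return the words present in the old list but absent from the new one."""
    return sorted(parse_stopwords(old_text) - parse_stopwords(new_text))


def drop_stopwords(tokens, stopwords):
    """Mimic a stop filter: keep only tokens that are not stop words."""
    return [t for t in tokens if t not in stopwords]


# Hypothetical excerpts, NOT the real files: the old list still has "été".
OLD_EXCERPT = """
 | forms of être
suis
es
est
été
"""

NEW_EXCERPT = """
 | forms of être
suis
es
est
"""

if __name__ == "__main__":
    print(removed_entries(OLD_EXCERPT, NEW_EXCERPT))
    print(drop_stopwords(["été", "chaud"], parse_stopwords(OLD_EXCERPT)))
    print(drop_stopwords(["été", "chaud"], parse_stopwords(NEW_EXCERPT)))
```

Running the same comparison on the full text of the two linked files should list every entry that differs, not just été.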

Does it make sense to update this list in Solr as well? I have an Apache Jira
account and would be happy to open the issue and contribute the update
if that helps speed up the process.

Regards,
Alastair
