On Fri, Feb 12, 2021 at 7:05 AM Peter Gromov <[email protected]> wrote:
> Robert, for n=20 the speedup is quite small, 2-8% for me depending on the
> language. Unfortunately Hunspell dictionaries don't have stop word
> information, it'd be quite useful.

OK, maybe with a cache size that small it won't cache the stopwords, I don't know. Was just mentioning it on the side.

We do have stopword information for a lot of languages as resource files in lucene:
https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis

Some users will remove them before they get to the hunspell, some users won't.

But we also have a way in the analysis chain to override stemming for particular words. It stems them the way you want and then sets a marker so that Hunspell wouldn't even be called on them:
https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html

So if the user really wants to keep the stopwords, they could put this "thing" in front of it to prevent them from slowing stuff down.
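
To make the marker idea concrete, here is a rough stand-alone sketch in plain Java. It does not use the Lucene classes: `StemOverrideSketch`, `Token`, `expensiveStem` and the toy "strip a trailing s" rule are all made up for illustration, and the real filter chain works on token streams, where (as far as I know) the marker is the keyword attribute. The only point is the ordering: the override step runs first and flags the word, and the expensive stemmer skips anything that is flagged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StemOverrideSketch {
    /** A token plus a "keyword" marker telling later stages to leave it alone. */
    static final class Token {
        final String text;
        final boolean keyword;

        Token(String text, boolean keyword) {
            this.text = text;
            this.keyword = keyword;
        }
    }

    /** Counts how often the expensive stemmer actually ran. */
    static int expensiveCalls = 0;

    /** Hypothetical stand-in for a dictionary-based stemmer; the rule is a toy. */
    static String expensiveStem(String word) {
        expensiveCalls++;
        boolean plural = word.length() > 3 && word.endsWith("s");
        return plural ? word.substring(0, word.length() - 1) : word;
    }

    /** Override step: if the word is listed, emit the chosen stem and set the marker. */
    static Token applyOverride(String word, Map<String, String> overrides) {
        String replacement = overrides.get(word);
        return replacement != null ? new Token(replacement, true) : new Token(word, false);
    }

    /** Stemming step: marked tokens pass through without touching the stemmer. */
    static Token stem(Token token) {
        return token.keyword ? token : new Token(expensiveStem(token.text), false);
    }

    /** Runs the two steps in order, override first, then the stemmer. */
    static List<String> analyze(List<String> words, Map<String, String> overrides) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            out.add(stem(applyOverride(word, overrides)).text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stopwords mapped to themselves: kept in the output, never stemmed.
        Map<String, String> overrides = Map.of("this", "this", "is", "is", "the", "the");
        List<String> out = analyze(List.of("this", "is", "the", "dogs"), overrides);
        System.out.println(out + ", expensive stemmer calls: " + expensiveCalls);
    }
}
```

With the three stopwords mapped to themselves, only "dogs" reaches the stemmer in this sketch, so the cost of the slow step scales with the non-stopword tokens only.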
