Hi,

I was checking some details of the Hunspell implementation and it seems
to me that there might be some useful opportunities for performance
improvements.

Currently, when HunspellStemmer.stem() is called, the dictionary is
checked for all possible prefixes/suffixes of the given word.

For example, when we want to stem the word "externalization", we check
the following suffixes:
[ externalization, xternalization, ternalization, ... , ion, on, n, "" ]
(and similarly for prefixes).
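
To illustrate, here is a minimal sketch of how I read the current
enumeration (hypothetical names, not the actual Lucene code; lookupSuffix
stands in for the CharArrayMap dictionary check):

class CurrentEnumeration {
  static void checkAllSuffixes(String word) {
    // Every trailing substring is tried, from the whole word down to
    // the empty string (16 lookups for "externalization").
    for (int i = 0; i <= word.length(); i++) {
      lookupSuffix(word.substring(i));
    }
  }

  static void lookupSuffix(String candidate) {
    System.out.println("lookup: \"" + candidate + "\"");
  }
}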

My questions are:

1) Why do we test longer suffixes than is necessary?
Would it make sense to learn the maximum prefix/suffix length when
loading the dictionary and then apply that bound when generating
candidate suffixes? For example, if the English dictionary's longest
suffix were 4 characters, then for the word "externalization" above we
could safely test only the following suffixes: tion, ion, on, n, "".
Given that the same algorithm is applied recursively to all generated
folded word forms until we hit the recursion limit, this can add up to a
lot of dictionary lookups per token. I noticed that the lookup is
implemented using CharArrayMap (which is documented as fast), but we
could still skip many lookup calls simply because we KNOW in advance
that they cannot return any result. Do you think such an optimization is
worth the effort?
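
A rough sketch of what I have in mind (maxSuffixLength would be learned
while parsing the affix file; names are hypothetical, this is not the
actual Lucene code):

class BoundedEnumeration {
  static void checkBoundedSuffixes(String word, int maxSuffixLength) {
    // Candidates longer than the longest suffix in the dictionary can
    // never match, so skip them entirely.
    int start = Math.max(0, word.length() - maxSuffixLength);
    for (int i = start; i <= word.length(); i++) {
      lookupSuffix(word.substring(i));
    }
  }

  static void lookupSuffix(String candidate) {
    System.out.println("lookup: \"" + candidate + "\"");
  }
}

With word "externalization" and maxSuffixLength = 4 this performs only 5
lookups ("tion", "ion", "on", "n", "") instead of 16.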

2) Can a suffix be the whole input word?
Given that we generate candidate strings for suffix lookup from the
input word, why don't we skip the first letter? For example, if the
input word is "hey", does it make sense to consider only the possible
suffixes "ey", "y", and ""? If I read the code correctly we include
"hey" itself in the set as well (the whole string). Is this correct? My
understanding is that if I cut a suffix from a word I shouldn't be left
with an empty string, no? Maybe this is language/dictionary specific,
dunno.
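
If that reading is right, the change would amount to starting the loop
at index 1 (again a hypothetical sketch, not the actual code):

class ProperSuffixEnumeration {
  static void checkProperSuffixes(String word) {
    // Starting at 1 means word.substring(i) is always a proper suffix,
    // so stripping it never leaves an empty stem.
    for (int i = 1; i <= word.length(); i++) {
      lookupSuffix(word.substring(i));
    }
  }

  static void lookupSuffix(String candidate) {
    System.out.println("lookup: \"" + candidate + "\"");
  }
}

For "hey" this considers only "ey", "y" and "".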

(I am happy to open an issue for any of these and provide patches.)

Regards,
Lukas
