>
> That's an interesting idea about storing shortest (easy) or most frequent
> (hard) words separately. Unfortunately the distribution isn't entirely
> Zipfian, as the dictionaries tend to contain a lot of short but uncommon
> words, like abbreviations. Still, something for me to explore, thanks!
>

I meant taking advantage of the empirical distribution of words occurring
in an average text (not their lengths) and providing a faster code path
for them.
I'd assume a typical run spends the majority of its time looking up the
same words over and over. A small cache could be added to the dictionary,
but also anywhere else - even on top of Stemmer or SpellChecker. This
cache could be very, very simplistic - even a dumb throw-away HashMap
discarded after it reaches a certain size (cache hits would occur only
until it fills up). The distribution of the input will still make it
perform relatively well.
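
To make this concrete, here's a minimal sketch of such a throw-away
cache (the class and names below are made up for illustration; the
actual patch [1] may wire this in differently):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// A dumb, no-eviction-policy cache: once it reaches maxSize it is
// simply discarded and refilled. The skewed word distribution of
// real text keeps the hit rate high regardless.
final class ThrowAwayCache<V> implements Function<String, V> {
  private final Function<String, V> delegate;
  private final int maxSize;
  private Map<String, V> cache = new HashMap<>();

  ThrowAwayCache(Function<String, V> delegate, int maxSize) {
    this.delegate = delegate;
    this.maxSize = maxSize;
  }

  @Override
  public V apply(String word) {
    V hit = cache.get(word);
    if (hit != null) {
      return hit;
    }
    if (cache.size() >= maxSize) {
      cache = new HashMap<>(); // throw the full cache away
    }
    V computed = delegate.apply(word);
    cache.put(word, computed);
    return computed;
  }
}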

Here's an example - I used Mike McCandless's enwiki and ran the English
test from TestPerformance on it. My master branch results:

Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945, 1955, 1959, 1972, 1952]
Spellchecking en: average 2438.1428571428573, all times = [2465, 2425, 2429, 2434, 2452, 2429, 2433]

I then added [1] a dumb no-eviction-policy cache on top of stemming and
spell checking.
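
(Roughly, the cache from the sketch above would sit in front of the
stem/spell calls like this - the slowSpell stand-in below is
hypothetical; the real delegate would wrap the Stemmer or SpellChecker
call:)

import java.util.function.Function;

public class CacheDemo {
  public static void main(String[] args) {
    // Stand-in for an expensive Hunspell lookup; in the patch the
    // delegate would wrap the actual spell-check or stemming call.
    Function<String, Boolean> slowSpell = word -> !word.isEmpty();

    ThrowAwayCache<Boolean> cachedSpell =
        new ThrowAwayCache<>(slowSpell, 10_000);

    System.out.println(cachedSpell.apply("distribution")); // computed
    System.out.println(cachedSpell.apply("distribution")); // cache hit
  }
}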

Results for cache size of 2500 entries:
Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889, 890, 886, 882]
Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187, 1179, 1170, 1170]

Results for cache size of 5000 entries:
Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766, 763, 766, 767]
Spellchecking en: average 1065.2857142857142, all times = [1070, 1068, 1061, 1065, 1064, 1067, 1062]

Results for cache size of 10000 entries:
Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
Spellchecking en: average 926.8571428571429, all times = [924, 925, 930, 927, 925, 930, 927]

The sizes of these caches are literally nothing, and they don't require
any pretraining or eviction policies - yet even the smallest one roughly
halves the run times above. This is all I meant by "hybrid" lookup based
on the distribution... this could be built into the dictionary but
doesn't have to be.

D.

[1]
https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
