> That's an interesting idea about storing the shortest (easy) or most frequent (hard) words separately. Unfortunately the distribution isn't entirely Zipfian, as the dictionaries tend to contain a lot of short but uncommon words, like abbreviations. Still, something for me to explore, thanks!
I meant taking advantage of the empirical distribution of words occurring in an average text (not their lengths) and providing a faster code path for these. I'd assume a typical run spends the majority of its time looking up the same words over and over. A small cache could be added to the dictionary, but also anywhere else - even on top of Stemmer or SpellChecker. This cache could be very, very simplistic - even a dumb throw-away HashMap discarded after it reaches a certain size (cache hits would occur only until it fills up). The distribution of the input will still make it perform relatively well.

Here's an example - I used Mike McCandless's enwiki and ran the English test from TestPerformance on it. My master branch results:

```
Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945, 1955, 1959, 1972, 1952]
Spellchecking en: average 2438.1428571428573, all times = [2465, 2425, 2429, 2434, 2452, 2429, 2433]
```

I then added [1] a dumb no-eviction-policy cache on top of stemming and spell checking. Results for a cache size of 2500 entries:

```
Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889, 890, 886, 882]
Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187, 1179, 1170, 1170]
```

Results for a cache size of 5000 entries:

```
Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766, 763, 766, 767]
Spellchecking en: average 1065.2857142857142, all times = [1070, 1068, 1061, 1065, 1064, 1067, 1062]
```

Results for a cache size of 10000 entries:

```
Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
Spellchecking en: average 926.8571428571429, all times = [924, 925, 930, 927, 925, 930, 927]
```

The sizes of these caches are literally nothing, and they don't require any pretraining or eviction policies. This is all I meant by a "hybrid" lookup based on the distribution... this could be built into the dictionary but doesn't have to be.

D.
[1] https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
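For anyone curious, the throw-away cache described above could be sketched roughly like this (an illustrative, hypothetical class - for the actual change see the commit in [1]; the `ThrowAwayCache` name and `computeIfAbsentCapped` method are made up for this example):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * A deliberately simplistic cache with no eviction policy: entries are added
 * until the map reaches maxSize, at which point the whole map is discarded
 * and filling starts over. Because real text is dominated by a small set of
 * very frequent words, even this naive scheme gets a high hit rate.
 */
class ThrowAwayCache<K, V> {
  private final int maxSize;
  private final Map<K, V> map = new HashMap<>();

  ThrowAwayCache(int maxSize) {
    this.maxSize = maxSize;
  }

  /** Returns the cached value, or computes and caches it on a miss. */
  V computeIfAbsentCapped(K key, Function<K, V> compute) {
    V value = map.get(key);
    if (value == null) {
      value = compute.apply(key);
      if (map.size() >= maxSize) {
        map.clear(); // throw the whole cache away once it's full
      }
      map.put(key, value);
    }
    return value;
  }
}
```

Such a wrapper could sit on top of a stem or spell-check call, e.g. `cache.computeIfAbsentCapped(word, spellChecker::spell)`, without touching the dictionary internals at all.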
