Dawid, I like these numbers very much! Did you put the caching inside Dictionary#lookupWord? Did you cache negative results as well?
Peter

On Thu, Feb 11, 2021 at 12:17 PM Dawid Weiss <[email protected]> wrote:

>> That's an interesting idea about storing the shortest (easy) or most
>> frequent (hard) words separately. Unfortunately the distribution isn't
>> entirely Zipfian, as the dictionaries tend to contain a lot of short but
>> uncommon words, like abbreviations. Still, something for me to explore,
>> thanks!
>
> I meant taking advantage of the empirical distribution of words occurring
> in an average text (not their lengths) and providing a faster code path
> for these.
>
> I'd assume a typical run would spend the majority of its time looking up
> the same words over and over. A small cache could be added to the
> dictionary, but also anywhere else - even on top of Stemmer or
> SpellChecker. This cache could be very, very simplistic - even a dumb
> throw-away HashMap discarded after it reaches a certain size (cache hits
> would occur only until it's filled up). The distribution of the input
> will still make it perform relatively well.
>
> Here's an example - I used Mike McCandless's enwiki and ran the English
> test from TestPerformance on it. My master branch results:
>
> Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
> Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945, 1955, 1959, 1972, 1952]
> Spellchecking en: average 2438.1428571428573, all times = [2465, 2425, 2429, 2434, 2452, 2429, 2433]
>
> I then added [1] a dumb no-eviction-policy cache on top of stemming and
> spell checking.
>
> Results for a cache size of 2500 entries:
> Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889, 890, 886, 882]
> Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187, 1179, 1170, 1170]
>
> Results for a cache size of 5000 entries:
> Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766, 763, 766, 767]
> Spellchecking en: average 1065.2857142857142, all times = [1070, 1068, 1061, 1065, 1064, 1067, 1062]
>
> Results for a cache size of 10000 entries:
> Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
> Spellchecking en: average 926.8571428571429, all times = [924, 925, 930, 927, 925, 930, 927]
>
> Caches of these sizes cost next to nothing, and they don't require any
> pretraining or eviction policies. This is all I meant by "hybrid" lookup
> based on the distribution... this could be built into the dictionary, but
> it doesn't have to be.
>
> D.
>
> [1] https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
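For reference, a minimal sketch of the throw-away cache Dawid describes above:
a plain HashMap that simply stops admitting entries once it reaches a fixed
size, wrapped around an arbitrary word-level check. The CachedWordCheck name
and the Predicate-based delegate are illustrative, not the API of the actual
change in commit [1]; this is one reading of "discarded after it reaches a
certain size" (freezing the map once full - clearing and refilling it would
behave similarly).

import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// A size-capped HashMap cache with no eviction policy: once full, it stops
// admitting new entries, and existing entries keep serving hits. The skewed
// word distribution of real text makes even this naive scheme effective.
final class CachedWordCheck {
  private final Predicate<String> delegate; // e.g. speller::spell
  private final Map<String, Boolean> cache = new HashMap<>();
  private final int maxSize;

  CachedWordCheck(Predicate<String> delegate, int maxSize) {
    this.delegate = delegate;
    this.maxSize = maxSize;
  }

  boolean check(String word) {
    Boolean cached = cache.get(word);
    if (cached != null) {
      return cached;
    }
    boolean result = delegate.test(word);
    if (cache.size() < maxSize) {
      // Negative results are stored too, per Peter's question above -
      // repeated lookups of the same misspelling also hit the cache.
      cache.put(word, result);
    }
    return result;
  }
}

Hypothetical usage, assuming the new Hunspell SpellChecker exposes a boolean
spell(String) method:

SpellChecker speller = ...;
CachedWordCheck cached = new CachedWordCheck(speller::spell, 10_000);
boolean ok = cached.check("distribution");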
