Dawid,

I like these numbers very much! Did you put the caching inside
Dictionary#lookupWord? Did you cache negative results as well?

Peter

On Thu, Feb 11, 2021 at 12:17 PM Dawid Weiss <[email protected]> wrote:

>> That's an interesting idea about storing the shortest (easy) or most
>> frequent (hard) words separately. Unfortunately the distribution isn't
>> entirely Zipfian, as the dictionaries tend to contain a lot of short but
>> uncommon words, like abbreviations. Still, something for me to explore,
>> thanks!
>>
>
> I meant taking advantage of the empirical distribution of words occurring
> in an average text (not their lengths) and providing a faster code path
> for those.
> I'd assume a typical run spends the majority of its time looking up the
> same words over and over. A small cache could be added to the dictionary,
> but also anywhere else - even on top of Stemmer or SpellChecker. This
> cache could be very, very simplistic - even a dumb throw-away HashMap
> discarded once it reaches a certain size (cache hits would occur only
> until it fills up). The distribution of the input will still make it
> perform relatively well; see the sketch below.
>
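> Roughly, something like this - just a sketch of the idea, not the
> actual patch (that's in [1]); the class name here is made up:
>
>   import java.util.HashMap;
>   import java.util.Map;
>   import java.util.function.Function;
>
>   // A deliberately dumb, bounded cache: once it fills up, the whole
>   // map is thrown away and refilled. No eviction policy, no statistics
>   // - the skewed word distribution alone yields a decent hit rate.
>   class ThrowawayCache<K, V> {
>     private final int maxSize;
>     private Map<K, V> map = new HashMap<>();
>
>     ThrowawayCache(int maxSize) {
>       this.maxSize = maxSize;
>     }
>
>     V computeIfAbsent(K key, Function<K, V> compute) {
>       V cached = map.get(key);
>       if (cached != null) {
>         return cached; // hit: skip the expensive dictionary lookup
>       }
>       V value = compute.apply(key);
>       if (map.size() >= maxSize) {
>         map = new HashMap<>(); // cap reached: discard and start over
>       }
>       map.put(key, value);
>       return value;
>     }
>   }
>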
> Here's an example: I used Mike McCandless's enwiki corpus and ran the
> English test from TestPerformance on it. My results on master:
>
> Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
> Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945, 1955, 1959, 1972, 1952]
> Spellchecking en: average 2438.1428571428573, all times = [2465, 2425, 2429, 2434, 2452, 2429, 2433]
>
> I then added [1] a dumb, no-eviction-policy cache on top of stemming and
> spell checking.
>
> Results for cache size of 2500 entries:
> Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889, 890, 886, 882]
> Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187, 1179, 1170, 1170]
>
> Results for cache size of 5000 entries:
> Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766, 763, 766, 767]
> Spellchecking en: average 1065.2857142857142, all times = [1070, 1068, 1061, 1065, 1064, 1067, 1062]
>
> Results for cache size of 10000 entries:
> Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
> Spellchecking en: average 926.8571428571429, all times = [924, 925, 930, 927, 925, 930, 927]
>
> The memory footprint of these caches is negligible, and they require no
> pretraining or eviction policies. This is all I meant by a "hybrid"
> lookup based on the distribution... it could be built into the
> dictionary, but it doesn't have to be.
>
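> For example, plugged in on top of spell checking (assuming a
> boolean-returning SpellChecker#spell here; the exact signature may
> differ):
>
>   ThrowawayCache<String, Boolean> cache = new ThrowawayCache<>(10_000);
>   boolean correct = cache.computeIfAbsent(word, speller::spell);
>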
> D.
>
> [1]
> https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
>
>
