for TokenStreams this is simple, since each one is accessed by only one
thread: you could also do a proper LRU via LinkedHashMap, which might
even be less code.
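just to illustrate what I mean (a rough sketch, not a patch - the class
name and generic shape are made up), LinkedHashMap gives you LRU almost
for free via accessOrder plus removeEldestEntry:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// hypothetical per-TokenStream cache: no synchronization needed because
// a TokenStream instance is only ever used by one thread at a time.
class StemCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  StemCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true -> iteration order is LRU
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // evict the least-recently-accessed entry once we exceed the cap
    return size() > maxEntries;
  }
}
```

so the whole eviction policy is one overridden method.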

but putting caches around the tokenstream/stemmer has implications if
the user has huge numbers of threads, and especially high thread churn
(like solr with its dynamic threadpool maxing out at 10000 threads, see
my earlier mail): each thread gets its own cold cache. another
alternative would be to centralize the cache around the hunspell
"Dictionary", but then it needs to be thread-safe and so on.
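for the centralized variant, something in the spirit of Dawid's dumb
no-eviction cache could stay lock-free with ConcurrentHashMap - again
just a sketch, names invented, and the size check is deliberately racy
(it can overshoot slightly under contention, which is harmless here):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// hypothetical thread-safe cache shared via a Dictionary: no eviction,
// it simply stops admitting entries once full, relying on the skewed
// word distribution of real text for a decent hit rate.
class BoundedComputeCache<K, V> {
  private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
  private final int maxEntries;

  BoundedComputeCache(int maxEntries) {
    this.maxEntries = maxEntries;
  }

  V get(K key, Function<K, V> compute) {
    V v = map.get(key);
    if (v != null) {
      return v;
    }
    v = compute.apply(key);
    if (map.size() < maxEntries) {
      map.put(key, v); // racy cap check: may overshoot by a few entries
    }
    return v;
  }
}
```

the upside is one warm cache for all threads; the downside is the
shared-map traffic, which is why I'm not sure which way is better.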

On Thu, Feb 11, 2021 at 1:44 PM Dawid Weiss <[email protected]> wrote:
>
>
> Hi Peter,
>
> I provided a link to the commit diff? It was just a dumb wrapper in perf. 
> test - take a look. I'm sure it can be done better but the approach itself is 
> sound (and I've done it before). The cache-no-eviction-policy approach 
> essentially relies on non-uniform distribution of the elements on input 
> (which is the case in many situations). I've used it before on token streams 
> (to cache token attributes of frequent tokens). I'm sure it'd do some good 
> here as well.
>
> It doesn't negate the need to work on speeding up the core of the algorithm 
> itself, but it's an easy way to squeeze some additional performance without 
> making things much more complex.
>
> Dawid
>
> On Thu, Feb 11, 2021 at 6:28 PM Peter Gromov 
> <[email protected]> wrote:
>>
>> Dawid,
>>
>> I like these numbers very much! Did you put the caching inside 
>> Dictionary#lookupWord? Did you cache negative results as well?
>>
>> Peter
>>
>> On Thu, Feb 11, 2021 at 12:17 PM Dawid Weiss <[email protected]> wrote:
>>>>
>>>> That's an interesting idea about storing shortest (easy) or most frequent 
>>>> (hard) words separately. Unfortunately the distribution isn't entirely 
>>>> zipfian, as the dictionaries tend to contain a lot of short but uncommon 
>>>> words, like abbreviations. Still, something for me to explore, thanks!
>>>
>>>
>>> I meant taking advantage of empirical distribution of words occurring in an 
>>> average text (not their lengths) and providing a faster code path for these.
>>> I'd assume a typical run would spend the majority of the time looking up 
>>> the same words over and over. A small cache could be added to the 
>>> dictionary but also anywhere else - even on top of Stemmer or SpellChecker. 
>>> This cache could be very, very simplistic - even a dumb throw-away HashMap 
>>> discarded after it reaches a certain size (cache hits would occur only 
>>> until it's filled up). The distribution of the input still will make it 
>>> perform relatively well.
>>>
>>> Here's an example - I used Mike McCandless's enwiki and ran English test 
>>> from TestPerformance on it. My master branch results:
>>>
>>> Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
>>> Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945, 
>>> 1955, 1959, 1972, 1952]
>>> Spellchecking en: average 2438.1428571428573, all times = [2465, 2425, 
>>> 2429, 2434, 2452, 2429, 2433]
>>>
>>> I then added [1] a dumb no-eviction-policy caching on top of stemming and 
>>> spell checking.
>>>
>>> Results for cache size of 2500 entries:
>>> Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889, 
>>> 890, 886, 882]
>>> Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187, 
>>> 1179, 1170, 1170]
>>>
>>> Results for cache size of 5000 entries:
>>> Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766, 
>>> 763, 766, 767]
>>> Spellchecking en: average 1065.2857142857142, all times = [1070, 1068, 
>>> 1061, 1065, 1064, 1067, 1062]
>>>
>>> Results for cache size of 10000 entries:
>>> Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
>>> Spellchecking en: average 926.8571428571429, all times = [924, 925, 930, 
>>> 927, 925, 930, 927]
>>>
>>> The sizes of these caches are literally nothing and they don't require any 
>>> pretraining or eviction policies. This is all I meant by "hybrid" lookup 
>>> based on the distribution... this could be built in the dictionary but 
>>> doesn't have to be.
>>>
>>> D.
>>>
>>> [1] 
>>> https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]