Dawid, I didn't notice the commit link then, thanks for pointing that out!

This "// TODO: make sure these returned charsref are immutable?" is a good
point, because right now they're very mutable: they refer to internal
preallocated buffers in Stemmer that are constantly reused.
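To illustrate the hazard, here's a minimal, self-contained sketch (the "stemmer", its buffer, and the fake stemming logic are stand-ins I made up, not Lucene's actual classes): a caller that keeps a reference to the shared buffer sees it silently overwritten by the next call, while a defensive copy stays stable.

```java
import java.util.Arrays;

public class BufferReuseDemo {
    // Simulates a stemmer writing every result into one shared, reused buffer.
    static final char[] buffer = new char[16];

    static char[] stemShared(String word) {
        Arrays.fill(buffer, '\0');
        // Fake "stemming": drop the last character.
        word.getChars(0, word.length() - 1, buffer, 0);
        return buffer; // caller sees later mutations!
    }

    // Defensive copy: safe for the caller to keep or cache.
    static char[] stemCopied(String word) {
        return Arrays.copyOf(stemShared(word), word.length() - 1);
    }

    public static void main(String[] args) {
        char[] first = stemShared("walking");
        stemShared("runs"); // overwrites the shared buffer
        System.out.println(new String(first).trim()); // no longer "walkin", but "run"

        char[] safe = stemCopied("walking");
        stemShared("runs");
        System.out.println(new String(safe)); // still "walkin"
    }
}
```

Copying on every call costs an allocation, which is exactly why cached results would need to be immutable snapshots rather than views over the reused buffer.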

In the cache-all case, you ignore maxSize intentionally, right?

I've reproduced your results for English. I also checked German and French,
which have compounds and more advanced inflection. They improve as well,
but not by as much (30-40% with cache=10000, while calling native Hunspell
via JNI is 2-4 times faster).

I dream of making the Hunspell Stemmer thread-safe, and I've even gotten rid
of some of the preallocated state there, but some still remains, so for the
near future it'll stay thread-unsafe, and a cache can fit in there.
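For a thread-unsafe stemmer, a per-instance, size-bounded cache is enough. Here's a rough sketch of that idea (class and method names are illustrative, not the actual patch): an access-ordered LinkedHashMap gives simple LRU eviction, and the fake stemming logic stands in for the real Hunspell work.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CachingStemmer {
    private final int maxSize;
    private final Map<String, List<String>> cache;
    private int misses = 0;

    CachingStemmer(int maxSize) {
        this.maxSize = maxSize;
        // Access-order LinkedHashMap gives simple LRU eviction.
        this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > CachingStemmer.this.maxSize;
            }
        };
    }

    List<String> stem(String word) {
        List<String> cached = cache.get(word);
        if (cached != null) return cached;
        misses++;
        List<String> result = expensiveStem(word);
        cache.put(word, result); // results must be immutable snapshots, per the CharsRef point above
        return result;
    }

    // Stand-in for the real (expensive) Hunspell stemming.
    private List<String> expensiveStem(String word) {
        return List.of(word.endsWith("s") ? word.substring(0, word.length() - 1) : word);
    }

    int misses() { return misses; }

    public static void main(String[] args) {
        CachingStemmer s = new CachingStemmer(100);
        System.out.println(s.stem("cats")); // [cat], computed
        System.out.println(s.stem("cats")); // [cat], from the cache
        System.out.println(s.misses());     // 1
    }
}
```

Since the stemmer is confined to one thread anyway, the cache needs no synchronization, which keeps the overhead on a miss down to one map lookup and one put.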

Clients might deduplicate the requests themselves; I've done that a couple
of times. In that case the cache inside Hunspell would be useless and just
add some overhead (luckily not much, judging by my CPU snapshots).
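Client-side deduplication can be as simple as stemming each unique token once and keying the results by the token. A small sketch of what I mean (again, the names and the fake stemmer are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupClient {
    static int stemCalls = 0;

    // Stand-in for a real (expensive) stemmer call.
    static String stemOnce(String word) {
        stemCalls++;
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Each distinct token is stemmed exactly once, no matter how often it repeats.
    static Map<String, String> stemAll(List<String> tokens) {
        Map<String, String> stems = new LinkedHashMap<>();
        for (String t : tokens) {
            stems.computeIfAbsent(t, DedupClient::stemOnce);
        }
        return stems;
    }

    public static void main(String[] args) {
        Map<String, String> stems =
            stemAll(List.of("cats", "dogs", "cats", "dogs", "cats"));
        System.out.println(stems);     // {cats=cat, dogs=dog}
        System.out.println(stemCalls); // 2, not 5
    }
}
```

With this pattern on the client side, an internal cache in Hunspell can only ever re-answer questions the client already deduplicated, hence the wasted (if small) overhead.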

Robert, for n=20 the speedup is quite small: 2-8% for me, depending on the
language. Unfortunately, Hunspell dictionaries don't carry stop-word
information; it'd be quite useful.

On Fri, Feb 12, 2021 at 12:56 PM Robert Muir <[email protected]> wrote:

>
> On Fri, Feb 12, 2021 at 4:01 AM Dawid Weiss <[email protected]> wrote:
>
>>
>> It's all an intellectual exercise though. I consider the initial
>> performance drop on small cache windows a nice, cheap win. Increasing
>> the cache leads to other problems that may sting later (gc activity,
>> memory consumption on multiple parallel threads, etc.).
>>
>>
> Is that because of stopwords that aren't being removed? I guess what I'm
> asking is, for this test is n=20 enough? :)
>
> If so, leads to many other potential solutions (without caches). Also it
> would suggest the benchmark might be a bit too biased.
>
