Dawid, I didn't notice the commit link then, thanks for pointing that out! The "// TODO: make sure these returned charsref are immutable?" is a good point, because right now they're very mutable: they refer to internal preallocated buffers in Stemmer that are constantly reused.
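To make the hazard concrete, here's a minimal sketch (a stand-in, not Lucene's actual CharsRef/Stemmer code) of a stemmer that hands out views into a reused scratch buffer, so a result held by the caller is silently clobbered by the next call:

```java
import java.nio.CharBuffer;

// Hypothetical sketch of the hazard: results are views over a shared,
// reused scratch buffer rather than independent copies.
final class BufferReusingStemmer {
  private char[] scratch = new char[32];

  // Returns a view over the shared scratch buffer (mutable, reused).
  CharSequence stem(String word) {
    String result = word.toLowerCase(); // stand-in for real stemming
    int len = result.length();
    if (scratch.length < len) scratch = new char[len];
    result.getChars(0, len, scratch, 0);
    return CharBuffer.wrap(scratch, 0, len); // a view, NOT a copy
  }
}
```

A caller that keeps the first returned CharSequence and then stems another word will see the first result change underneath it; the fix is either a defensive copy on return or documenting that callers must copy before the next call.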
In the cache-all configuration you ignore maxSize intentionally, right?

I've reproduced your results for English. I also checked German and French, which have compounds and more advanced inflection. They improve as well, but not as much (30-40% with cache=10000, while calling native Hunspell via JNI is 2-4 times faster).

I dream of making Hunspell Stemmer thread-safe, and have even gotten rid of some of the preallocated state there, but some still remains, so in the near future it'll stay thread-unsafe, and caching can fit in there.

Clients might deduplicate the requests themselves; I've done that a couple of times. In that case the cache inside Hunspell would be useless and just add some overhead (luckily not much, as per my CPU snapshots).

Robert, for n=20 the speedup is quite small, 2-8% for me depending on the language. Unfortunately Hunspell dictionaries don't carry stop word information; it'd be quite useful.

On Fri, Feb 12, 2021 at 12:56 PM Robert Muir <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 4:01 AM Dawid Weiss <[email protected]> wrote:
>
>> It's all an intellectual exercise though. I consider the initial
>> performance drop on small cache windows a nice, cheap, win. Increasing
>> the cache leads to other problems that may sting later (gc activity,
>> memory consumption on multiple parallel threads, etc.).
>
> Is that because of stopwords that aren't being removed? I guess what I'm
> asking is, for this test is n=20 enough? :)
>
> If so, leads to many other potential solutions (without caches). Also it
> would suggest the benchmark might be a bit too biased.
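For reference, the kind of cache being discussed can be sketched as a bounded LRU map in front of the thread-unsafe stemming call, with a cache-all mode that deliberately ignores maxSize. This is a minimal illustration, not Lucene's actual implementation; the stemming function here is a stand-in:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: an LRU cache in front of an expensive,
// thread-unsafe stemming function. Not the actual Lucene Hunspell code.
final class CachingStemmer {
  private final Function<String, List<String>> stemmer;
  private final Map<String, List<String>> cache;

  CachingStemmer(Function<String, List<String>> stemmer, int maxSize, boolean cacheAll) {
    this.stemmer = stemmer;
    // accessOrder=true gives LRU eviction; cacheAll disables the size bound
    this.cache = new LinkedHashMap<String, List<String>>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
        return !cacheAll && size() > maxSize;
      }
    };
  }

  List<String> stem(String word) {
    List<String> cached = cache.get(word); // also refreshes LRU order
    if (cached != null) return cached;
    List<String> result = stemmer.apply(word);
    cache.put(word, result); // may evict the eldest entry
    return result;
  }
}
```

Since the underlying stemmer is thread-unsafe anyway, an unsynchronized per-instance cache like this adds no new concurrency constraints; clients deduplicating requests themselves would just make it redundant.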
