Peter, looks like you are way ahead of me :) Thanks for all the work
you have been doing here, and thanks to Dawid for helping!

You probably know a lot of this code better than me at this point, but
I remember a couple of these pain points, inline below:

On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
<[email protected]> wrote:
>
> Hi Robert,
>
> Yes, having multiple dictionaries in the same process would increase the 
> memory significantly. Do you have any idea about how many of them people are 
> loading, and how much memory they give to Lucene?

Yeah, in many cases the user is running a server such as solr or elasticsearch.
Let's use solr as an example; others here can correct me if I'm wrong.

Example to understand the challenges: the user uses one of solr's 3
mechanisms to detect language and route documents to different pipelines:
https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
Now we know these language detectors are imperfect: if the user maps a
lot of languages to hunspell pipelines, they may end up loading lots of
dictionaries, triggered by even just one stray miscategorized document.
So it doesn't have to be some extreme "enterprise" use-case like
wikipedia.org, it can happen to a little guy faced with a
multilingual corpus.
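
To make the fan-out concrete, here is a rough sketch of what ends up
resident on the heap (the language list and file paths are made up,
but the Dictionary constructor is the real one from lucene 8.x):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.hunspell.Dictionary;
import org.apache.lucene.store.ByteBuffersDirectory;

public class DictionaryFanOut {
  public static void main(String[] args) throws Exception {
    // made-up language list; one stray miscategorized doc per language
    // is enough to pull each of these in
    String[] langs = {"en_US", "de_DE", "fr_FR", "pt_PT", "uk_UA"};
    Map<String, Dictionary> loaded = new HashMap<>();
    for (String lang : langs) {
      try (InputStream aff = Files.newInputStream(Path.of(lang + ".aff"));
           InputStream dic = Files.newInputStream(Path.of(lang + ".dic"))) {
        // each instance decompiles .aff/.dic into its own in-heap structures
        loaded.put(lang, new Dictionary(new ByteBuffersDirectory(), "tmp", aff, dic));
      }
    }
    System.out.println(loaded.size() + " dictionaries resident on the heap");
  }
}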

Imagine the user decides to go further and host solr search in this
way for a couple of local businesses or govt agencies.
They support many languages and possibly use the detection scheme
above to try to make language a "non-issue".
The user may assign each customer a solr "core" (separate index) with
this configuration.
Does each solr core load its own HunspellStemFilterFactory? I think it
might (in an isolated classloader), but I could be wrong.

For the elasticsearch case, maybe the resource usage in the same
scenario is lower, because they reuse dictionaries per-node?
I think this is how it works, but I honestly can't remember.
Still the problem remains: it's easy to end up with dozens of these
things in memory.
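
If per-node reuse is what elasticsearch does, the shape would be
something like this hypothetical cache (not their actual code; the
DictionaryLoader interface is invented for the sketch):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.lucene.analysis.hunspell.Dictionary;

public final class NodeDictionaryCache {
  // hypothetical loader abstraction, invented for this sketch
  public interface DictionaryLoader {
    Dictionary load(String locale);
  }

  private static final ConcurrentMap<String, Dictionary> CACHE =
      new ConcurrentHashMap<>();

  // every core/index asking for "de_DE" gets the same read-only
  // instance, so N cores cost one dictionary instead of N
  public static Dictionary get(String locale, DictionaryLoader loader) {
    return CACHE.computeIfAbsent(locale, loader::load);
  }
}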

Also we have the problem that memory usage for a specific dictionary
can blow up in several ways.
Some languages have a bigger .aff file than .dic!
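
If anyone wants to measure it, a quick-and-dirty heap check per
dictionary could look like this (the de_DE file names are an
assumption, and RamUsageEstimator only gives estimates):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.hunspell.Dictionary;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.util.RamUsageEstimator;

public class DictionaryHeapCheck {
  public static void main(String[] args) throws Exception {
    // de_DE file names are an assumption; substitute any dictionary
    try (InputStream aff = Files.newInputStream(Path.of("de_DE.aff"));
         InputStream dic = Files.newInputStream(Path.of("de_DE.dic"))) {
      Dictionary d = new Dictionary(new ByteBuffersDirectory(), "tmp", aff, dic);
      // sizeOfObject walks the object graph and is only an estimate;
      // the .aff-derived structures can dominate, which is why a big
      // .aff can hurt more than a big .dic
      long bytes = RamUsageEstimator.sizeOfObject(d);
      System.out.println(RamUsageEstimator.humanReadableUnits(bytes));
    }
  }
}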

> Thanks for the idea about root arcs. I've done some quick sampling and 
> tracing (for German). 80% of root arc processing time is spent in direct 
> addressing, and the remainder is linear scan (so root arcs don't seem to 
> present major issues). For non-root arcs, ~50% is directly addressed, ~45% 
> linearly-scanned, and the remainder binary-searched. Overall there's about 
> 60% of direct addressing, both in time and invocation counts, which doesn't 
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing 
> that might increase the number of directly addressed arcs, but I'm not sure 
> that'd speed up much given that time and invocation counts seem to correlate.
>

Sure, but that 20% spent in linear scans is maybe 7x slower; it's
O(log2(alphabet_size)), right (assuming alphabet size ~ 128)?
Hard to reason about, but maybe worth testing out. It would still help
all the other segmenters (japanese, korean) that use the FST.
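
Back-of-the-envelope of what I mean, comparing the three arc lookup
strategies (a toy, not lucene's actual FST code):

import java.util.Arrays;

public class ArcLookupCosts {
  // direct addressing: O(1), a single array index
  static int direct(int[] targets, int label, int firstLabel) {
    return targets[label - firstLabel];
  }

  // linear scan: O(alphabet_size) probes in the worst case
  static int linear(int[] labels, int[] targets, int label) {
    for (int i = 0; i < labels.length; i++) {
      if (labels[i] == label) return targets[i];
    }
    return -1;
  }

  // binary search: O(log2(alphabet_size)), ~7 probes for ~128 labels
  static int binary(int[] labels, int[] targets, int label) {
    int idx = Arrays.binarySearch(labels, label);
    return idx >= 0 ? targets[idx] : -1;
  }
}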
