Peter, looks like you are way ahead of me :) Thanks for all the work you have been doing here, and thanks to Dawid for helping!
You probably know a lot of this code better than me at this point, but I remember a couple of these pain points, inline below:

On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov <[email protected]> wrote:
>
> Hi Robert,
>
> Yes, having multiple dictionaries in the same process would increase the
> memory significantly. Do you have any idea about how many of them people are
> loading, and how much memory they give to Lucene?

Yeah, in many cases the user is running a server such as Solr or Elasticsearch. Let's use Solr as an example; others are here to correct me if I am wrong.

An example to understand the challenges: the user uses one of Solr's three mechanisms to detect language and route documents to different pipelines: https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html

Now, we know these language detectors are imperfect. If the user maps a lot of languages to Hunspell pipelines, they may load lots of dictionaries, even from just one stray miscategorized document. So it doesn't have to be some extreme "enterprise" use case like wikipedia.org; it can happen to a little guy faced with a multilingual corpus.

Imagine the user decides to go further and host Solr search in this way for a couple of local businesses or government agencies. They support many languages and possibly use the detection scheme above to try to make language a "non-issue". The user may assign each customer a Solr "core" (separate index) with this configuration. Does each Solr core load its own HunspellStemFactory? I think it might (in an isolated classloader), but I could be wrong.

For the Elasticsearch case, maybe the resource usage in the same scenario is lower, because they reuse dictionaries per-node? I think this is how it works, but I honestly can't remember. Still the problem remains: it's easy to end up with dozens of these things in memory. Also, memory usage for a specific dictionary can blow up in several ways. Some languages have a bigger .aff file than .dic!
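To make the "dozens of copies" concern concrete, here is a minimal sketch of a process-wide dictionary cache keyed by language, so that N cores configured with the same language would share one loaded dictionary instead of N copies. The class names (`LoadedDictionary`, `DictionaryCache`) are hypothetical, not Lucene's or Solr's actual API, and the stand-in "dictionary" skips real .aff/.dic parsing:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for an expensive-to-load Hunspell dictionary. The real Lucene
// Dictionary parses .aff/.dic files and builds an FST; here we only count loads.
final class LoadedDictionary {
    static int loadCount = 0; // how many "real" loads happened in this process
    final String language;

    LoadedDictionary(String language) {
        this.language = language;
        loadCount++; // in reality: parse .aff/.dic, build the FST, allocate memory
    }
}

// Hypothetical process-wide cache: one dictionary per language, shared across
// all cores/analyzers in the JVM instead of one copy per core.
final class DictionaryCache {
    private static final Map<String, LoadedDictionary> CACHE = new ConcurrentHashMap<>();

    static LoadedDictionary get(String language) {
        return CACHE.computeIfAbsent(language, LoadedDictionary::new);
    }
}

public class SharedDictionaryDemo {
    public static void main(String[] args) {
        // Three "cores": two configured for German, one for English.
        LoadedDictionary a = DictionaryCache.get("de_DE");
        LoadedDictionary b = DictionaryCache.get("de_DE");
        LoadedDictionary c = DictionaryCache.get("en_US");
        System.out.println(a == b);                     // true: same shared instance
        System.out.println(LoadedDictionary.loadCount); // 2 loads, not 3
    }
}
```

Of course, whether this is feasible depends on the isolated-classloader question above; a per-core classloader would defeat a static cache like this.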
> Thanks for the idea about root arcs. I've done some quick sampling and
> tracing (for German). 80% of root arc processing time is spent in direct
> addressing, and the remainder is linear scan (so root arcs don't seem to
> present major issues). For non-root arcs, ~50% is directly addressed, ~45%
> linearly-scanned, and the remainder binary-searched. Overall there's about
> 60% of direct addressing, both in time and invocation counts, which doesn't
> seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
> that might increase the number of directly addressed arcs, but I'm not sure
> that'd speed up much given that time and invocation counts seem to correlate.
>

Sure, but those 20% of linear scans are maybe 7x slower; it's O(log2(alphabet_size)), right (assuming alphabet size ~ 128)? Hard to reason about, but maybe worth testing out. It still helps all the other segmenters (Japanese, Korean) using the FST.
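To make the cost comparison tangible, here is a toy model of the three arc-lookup strategies being discussed, over a single node with a sorted set of arc labels. It counts label comparisons rather than wall time, and the method names are illustrative, not Lucene's actual FST internals:

```java
// Toy model of FST arc lookup at one node: direct addressing vs. linear scan
// vs. binary search over a sorted label array. Counts comparisons, not time.
public class ArcLookupDemo {
    // Linear scan: worst case examines every label (O(alphabet_size)).
    static int linearScan(int[] labels, int target) {
        int comparisons = 0;
        for (int label : labels) {
            comparisons++;
            if (label == target) break;
        }
        return comparisons;
    }

    // Binary search: O(log2(alphabet_size)) comparisons.
    static int binarySearch(int[] labels, int target) {
        int lo = 0, hi = labels.length - 1, comparisons = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            comparisons++;
            if (labels[mid] == target) break;
            if (labels[mid] < target) lo = mid + 1; else hi = mid - 1;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // A dense node with 128 consecutive labels (think a ~7-bit alphabet).
        int[] labels = new int[128];
        for (int i = 0; i < 128; i++) labels[i] = i;
        int target = 127; // worst case for the linear scan

        // Direct addressing: with consecutive labels, the arc index is just
        // (target - firstLabel) -- one subtraction, no scan at all.
        int directIndex = target - labels[0];

        System.out.println(directIndex);                  // 127: a single array access
        System.out.println(linearScan(labels, target));   // 128 comparisons
        System.out.println(binarySearch(labels, target)); // 8 comparisons ~ log2(128)+1
    }
}
```

This is the shape of the argument above: whether linear scan costs ~alphabet_size or binary search costs ~log2(alphabet_size) comparisons, both lose badly to direct addressing's single lookup, so shifting more arcs into the directly-addressed bucket is where the win would come from.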
