I was hoping for some numbers :) In the meantime, I've got some of my own. I loaded 90 dictionaries from https://github.com/wooorm/dictionaries (there are more, but I ignored dialects of the same base language). Together they currently consume a humble 166MB. With one of my less memory-hungry approaches they'd take ~500MB (maybe less if I optimize, but probably not significantly). Is this very bad, or tolerable for, say, a 50% speedup? (A rough sketch of such a measurement is below.)
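For anyone who wants to reproduce a ballpark number, something along these lines works. This is a simplified sketch only: it assumes the Lucene 8.x Dictionary constructor (the signature differs between versions) and the wooorm repo layout of index.aff/index.dic per language; heap deltas via Runtime are approximate at best.

    import java.io.InputStream;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.hunspell.Dictionary;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class DictMemory {
      public static void main(String[] args) throws Exception {
        // args[0] = checkout of https://github.com/wooorm/dictionaries
        Path root = Paths.get(args[0]).resolve("dictionaries");
        List<Dictionary> loaded = new ArrayList<>(); // keep them reachable
        long before = usedHeap();
        try (DirectoryStream<Path> langs = Files.newDirectoryStream(root)) {
          for (Path dir : langs) {
            Path aff = dir.resolve("index.aff"), dic = dir.resolve("index.dic");
            if (!Files.exists(aff) || !Files.exists(dic)) continue;
            try (InputStream a = Files.newInputStream(aff);
                 InputStream d = Files.newInputStream(dic)) {
              loaded.add(new Dictionary(new ByteBuffersDirectory(), "tmp", a, d));
            }
          }
        }
        System.out.printf("%d dictionaries, ~%d MB%n",
            loaded.size(), (usedHeap() - before) >> 20);
      }

      private static long usedHeap() {
        System.gc(); // crude, but fine for a ballpark figure
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
      }
    }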
I've seen huge *.aff files, and I'm planning to do something with affix FSTs, too. Parsing them takes noticeable time as well, but much less than the *.dic files, so for now I'm concentrating on *.dic.

> Sure, but 20% of those linear scans are maybe 7x slower

Checked that. The distribution appears to decrease monotonically: no linear scan is longer than 8 arcs, and ~85% of all linear scans end after no more than 1 miss. (An illustrative sketch of the three lookup strategies follows the quoted mail below.)

I'll try BYTE1 if I manage to do it. It turned out to be surprisingly complicated :( (A construction sketch for that experiment is also below.)

On Wed, Feb 10, 2021 at 5:04 PM Robert Muir <[email protected]> wrote:

> Peter, looks like you are way ahead of me :) Thanks for all the work
> you have been doing here, and thanks to Dawid for helping!
>
> You probably know a lot of this code better than me at this point, but
> I remember a couple of these pain points, inline below:
>
> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
> <[email protected]> wrote:
> >
> > Hi Robert,
> >
> > Yes, having multiple dictionaries in the same process would increase
> > the memory significantly. Do you have any idea how many of them people
> > are loading, and how much memory they give to Lucene?
>
> Yeah, in many cases the user is using a server such as Solr or
> Elasticsearch. Let's use Solr as an example; others are here to correct
> me if I am wrong.
>
> An example to understand the challenges: the user uses one of Solr's 3
> mechanisms to detect language and route documents to different pipelines:
> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
>
> Now we know these language detectors are imperfect; if the user maps a
> lot of languages to hunspell pipelines, they may load lots of
> dictionaries, even from just one stray miscategorized document.
> So it doesn't have to be some extreme "enterprise" use case like
> wikipedia.org; it can happen for a little guy faced with a
> multilingual corpus.
>
> Imagine the user decides to go further, and hosts Solr search in this
> way for a couple of local businesses or government agencies.
> They support many languages and possibly use this detection scheme to
> try to make language a "non-issue".
> The user may assign each customer a Solr "core" (separate index) with
> this configuration.
> Does each Solr core load its own HunspellStemFactory? I think it might
> (in an isolated classloader), but I could be wrong.
>
> For the Elasticsearch case, maybe the resource usage in the same
> scenario is lower, because they reuse dictionaries per node?
> I think this is how it works, but I honestly can't remember.
> Still, the problem remains: it's easy to end up with dozens of these
> things in memory.
>
> Also, we have the problem that memory usage for a specific language can
> blow up in several ways. Some languages have a bigger .aff file than
> .dic!
>
> > Thanks for the idea about root arcs. I've done some quick sampling and
> > tracing (for German). 80% of root-arc processing time is spent in
> > direct addressing, and the remainder in linear scan (so root arcs
> > don't seem to present major issues). For non-root arcs, ~50% is
> > directly addressed, ~45% linearly scanned, and the remainder
> > binary-searched. Overall, direct addressing accounts for about 60%,
> > both in time and in invocation counts, which doesn't seem too bad (or
> > am I mistaken?). Currently BYTE4 inputs are used. Reducing that might
> > increase the number of directly addressed arcs, but I'm not sure
> > that'd speed things up much, given that time and invocation counts
> > seem to correlate.
> Sure, but 20% of those linear scans are maybe 7x slower; it's
> O(log2(alphabet_size)), right (assuming alphabet size ~ 128)?
> Hard to reason about, but maybe worth testing out. It still helps
> all the other segmenters (Japanese, Korean) using the FST.
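P.S. For anyone following along without the FST code in front of them, the three arc-matching strategies being profiled differ roughly like this. This is an illustrative sketch for cost intuition only: in Lucene the arcs live in a packed byte stream, not an int[], and the names here are made up.

    class ArcLookup {
      // Direct addressing: O(1), the label indexes a table directly.
      static int directAddress(int[] targetByLabel, int firstLabel, int label) {
        int idx = label - firstLabel;
        return (idx >= 0 && idx < targetByLabel.length) ? targetByLabel[idx] : -1;
      }

      // Linear scan: O(n) worst case, but cheap when the hit comes within
      // the first entry or two, which matches the distribution reported above.
      static int linearScan(int[] sortedLabels, int label) {
        for (int i = 0; i < sortedLabels.length; i++) {
          if (sortedLabels[i] == label) return i;
          if (sortedLabels[i] > label) return -1; // labels are sorted
        }
        return -1;
      }

      // Binary search: O(log n), i.e. ~log2(128) = 7 probes for a 128-symbol
      // alphabet, presumably where the "7x" above comes from.
      static int binarySearch(int[] sortedLabels, int label) {
        int i = java.util.Arrays.binarySearch(sortedLabels, label);
        return i >= 0 ? i : -1;
      }
    }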
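And for the BYTE1 experiment, the construction side would look roughly like this. A minimal sketch using the classic Builder API of Lucene 8.x (newer versions replace it with FSTCompiler); the real dictionary build adds outputs, sorting, and affix data on top.

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    public class Byte1Demo {
      public static void main(String[] args) throws Exception {
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        // BYTE1: arc labels are UTF-8 bytes, so each node's label space is
        // small, giving the direct-addressing encoding more chances to apply.
        Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
        IntsRefBuilder scratch = new IntsRefBuilder();
        long ord = 0;
        for (String word : new String[] {"aal", "aas", "ab"}) { // pre-sorted
          builder.add(Util.toIntsRef(new BytesRef(word), scratch), ord++);
        }
        FST<Long> fst = builder.finish();
        // The current BYTE4 variant labels arcs with whole codepoints instead:
        //   new Builder<>(FST.INPUT_TYPE.BYTE4, outputs), then
        //   builder.add(Util.toUTF32(word, scratch), ord++);
        System.out.println("~" + fst.ramBytesUsed() + " bytes");
      }
    }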
