50% speedup for the HunspellStemmer use case? for 3x the memory space?

Just my opinion: Seems like the correct tradeoff to me.
The analysis chain is a serious bottleneck for indexing speed, and this
hunspell stemmer is one of the slower ones.

To me the challenge with such a change is just trying to prevent
strange dictionaries from blowing up to 30x the space :)
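
A minimal sketch of one way to guard against that, with entirely
hypothetical names (WordStorage, expand, and the 5x cutoff are all made
up for illustration): build the expanded form, but fall back to the
compact one when the blow-up crosses a threshold:

    // Hypothetical guard -- not actual Lucene code. Trade memory for speed
    // only while the expansion stays within a sane bound.
    static final double MAX_EXPANSION_RATIO = 5.0; // made-up cutoff

    static WordStorage chooseStorage(byte[] compact) {
      byte[] expanded = expand(compact); // hypothetical fast representation
      if (expanded.length > compact.length * MAX_EXPANSION_RATIO) {
        // "Strange" dictionary: keep the compact form, accept slower lookups.
        return new CompactStorage(compact);
      }
      return new ExpandedStorage(expanded);
    }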

On Wed, Feb 10, 2021 at 12:53 PM Peter Gromov
<[email protected]> wrote:
>
> I was hoping for some numbers :) In the meantime, I've got some of my own. I 
> loaded 90 dictionaries from https://github.com/wooorm/dictionaries (there's 
> more, but I ignored dialects of the same base language). Together they 
> currently consume a humble 166MB. With one of my less memory-hungry 
> approaches, they'd take ~500MB (maybe less if I optimize, but probably not 
> significantly). Is this very bad or tolerable for, say, 50% speedup?
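
A minimal sketch of how such numbers can be gathered, assuming Lucene
8.x, the Lucene test framework on the classpath (for RamUsageTester),
and the file layout of the wooorm/dictionaries repo (the paths are
hypothetical):

    // Sketch: deep-measure one loaded Hunspell Dictionary.
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.lucene.analysis.hunspell.Dictionary;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.util.RamUsageTester;

    public class DictionaryRam {
      public static void main(String[] args) throws Exception {
        Path dir = Path.of(args[0]); // e.g. dictionaries/de (hypothetical path)
        try (InputStream aff = Files.newInputStream(dir.resolve("index.aff"));
             InputStream dic = Files.newInputStream(dir.resolve("index.dic"))) {
          // Dictionary needs a scratch Directory for its offline sorting step.
          Dictionary dictionary =
              new Dictionary(new ByteBuffersDirectory(), "hunspell", aff, dic);
          System.out.println(RamUsageTester.sizeOf(dictionary) + " bytes");
        }
      }
    }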
>
> I've seen huge *.aff files, and I'm planning to do something with affix FSTs, 
> too. They take some noticeable time as well, but much less than the *.dic 
> files do, so for now I'm concentrating on *.dic.
>
> > Sure, but 20% of those linear scans are maybe 7x slower
>
> Checked that. The distribution appears to be decreasing monotonically. No 
> linear scans are longer than 8, and ~85% of all linear scans end after no 
> more than 1 miss.
>
> I'll try BYTE1 if I manage to do it. It turned out to be surprisingly 
> complicated :(
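
For reference, a minimal sketch of the two input types under discussion,
assuming the FSTCompiler API of recent Lucene 8.x (older releases spell
it Builder): BYTE4 labels arcs with whole code points, while BYTE1 labels
them with UTF-8 bytes, which caps the per-node alphabet at 256 and makes
direct addressing more likely:

    // Fragment, not a full program: the same key under both input types.
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.FSTCompiler;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    IntsRefBuilder scratch = new IntsRefBuilder();

    // BYTE4: one arc label per Unicode code point (alphabet can be huge).
    FSTCompiler<Long> byte4 = new FSTCompiler<>(FST.INPUT_TYPE.BYTE4, outputs);
    byte4.add(Util.toUTF32("straße", scratch), 1L);
    FST<Long> fst4 = byte4.compile();

    // BYTE1: one arc label per UTF-8 byte (alphabet <= 256).
    FSTCompiler<Long> byte1 = new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, outputs);
    byte1.add(Util.toIntsRef(new BytesRef("straße"), scratch), 1L);
    FST<Long> fst1 = byte1.compile();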
>
> On Wed, Feb 10, 2021 at 5:04 PM Robert Muir <[email protected]> wrote:
>>
>> Peter, looks like you are way ahead of me :) Thanks for all the work
>> you have been doing here, and thanks to Dawid for helping!
>>
>> You probably know a lot of this code better than me at this point, but
>> I remember a couple of these pain points, inline below:
>>
>> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>> <[email protected]> wrote:
>> >
>> > Hi Robert,
>> >
>> > Yes, having multiple dictionaries in the same process would increase the 
>> > memory significantly. Do you have any idea about how many of them people 
>> > are loading, and how much memory they give to Lucene?
>>
>> Yeah, in many cases the user is running a server such as Solr or
>> Elasticsearch. Let's use Solr as an example; others are here to correct
>> me if I am wrong.
>>
>> An example to understand the challenges: the user uses one of Solr's
>> three mechanisms to detect language and route documents to different
>> pipelines:
>> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
>> Now, we know these language detectors are imperfect: if the user maps a
>> lot of languages to hunspell pipelines, they may load lots of
>> dictionaries, even from just one stray miscategorized document.
>> So it doesn't have to be some extreme "enterprise" use-case like
>> wikipedia.org; it can happen to a little guy faced with a
>> multilingual corpus.
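
A hypothetical sketch (not Solr's actual code) of why one stray document
is enough: analyzers are typically created lazily per detected language
and then cached, so every dictionary that is ever touched stays resident
(loadHunspellAnalyzer is a made-up helper):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.analysis.Analyzer;

    class PerLanguageAnalyzers {
      private final Map<String, Analyzer> cache = new ConcurrentHashMap<>();

      Analyzer forLanguage(String lang) {
        // One miscategorized document per language is enough to populate
        // this cache -- and each entry can hold a whole Hunspell dictionary.
        return cache.computeIfAbsent(lang, this::loadHunspellAnalyzer);
      }

      private Analyzer loadHunspellAnalyzer(String lang) {
        // Hypothetical: parse the language's .aff/.dic pair here -- the
        // expensive, memory-hungry step.
        throw new UnsupportedOperationException("sketch only");
      }
    }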
>>
>> Imagine the user decides to go further and host Solr search in this
>> way for a couple of local businesses or government agencies.
>> They support many languages and possibly use the detection scheme
>> above to try to make language a "non-issue".
>> The user may assign each customer a Solr "core" (separate index) with
>> this configuration.
>> Does each Solr core load its own HunspellStemFilterFactory? I think it
>> might (in an isolated classloader), but I could be wrong.
>>
>> For the Elasticsearch case, maybe the resource usage in the same
>> scenario is lower, because they reuse dictionaries per node?
>> I think this is how it works, but I honestly can't remember.
>> Still, the problem remains: it's easy to end up with dozens of these
>> things in memory.
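
The per-node reuse idea, as a hedged sketch (hypothetical code, not what
Elasticsearch actually does): key a cache by the dictionary's source
files so that N indices referencing the same pair share one instance:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.analysis.hunspell.Dictionary;

    class NodeDictionaryCache {
      private final Map<String, Dictionary> byPath = new ConcurrentHashMap<>();

      Dictionary get(String affPath, String dicPath) {
        // Dozens of cores pointing at the same files get one copy, not N.
        return byPath.computeIfAbsent(affPath + "|" + dicPath,
            key -> load(affPath, dicPath));
      }

      private Dictionary load(String affPath, String dicPath) {
        // Hypothetical: open the streams and call the Dictionary constructor.
        throw new UnsupportedOperationException("sketch only");
      }
    }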
>>
>> Also we have the problem that memory usage for a specific language can
>> blow up in several ways.
>> Some languages have a bigger .aff file than .dic!
>>
>> > Thanks for the idea about root arcs. I've done some quick sampling and 
>> > tracing (for German). 80% of root arc processing time is spent in direct 
>> > addressing, and the remainder is linear scan (so root arcs don't seem to 
>> > present major issues). For non-root arcs, ~50% is directly addressed, ~45% 
>> > linearly-scanned, and the remainder binary-searched. Overall there's about 
>> > 60% of direct addressing, both in time and invocation counts, which 
>> > doesn't seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. 
>> > Reducing that might increase the number of directly addressed arcs, but 
>> > I'm not sure that'd speed up much given that time and invocation counts 
>> > seem to correlate.
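
For readers following along, an illustrative comparison of the three
arc-lookup strategies mentioned (simplified pseudo-Lucene, not the real
FST internals):

    import java.util.Arrays;

    class ArcLookup {
      // Direct addressing: pure index arithmetic, O(1) -- valid only when
      // every label in the node's range is present.
      static int direct(int firstLabel, int label) {
        return label - firstLabel;
      }

      // Binary search: O(log n) comparisons over the sorted arc labels.
      static int binary(int[] sortedLabels, int label) {
        return Arrays.binarySearch(sortedLabels, label);
      }

      // Linear scan: O(n) worst case, but cheap when the match comes early;
      // per the measurements above, ~85% of scans stop after at most one miss.
      static int linear(int[] labels, int label) {
        for (int i = 0; i < labels.length; i++) {
          if (labels[i] == label) return i;
        }
        return -1;
      }
    }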
>> >
>>
>> Sure, but 20% of those linear scans are maybe 7x slower; it's
>> O(log2(alphabet_size)), right (assuming alphabet size ~ 128)?
>> Hard to reason about, but maybe worth testing out. It still helps for
>> all the other segmenters (Japanese, Korean) using FSTs.
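
For concreteness: with an alphabet of ~128 labels, log2(128) = 7, so a
lookup costing O(log2(alphabet_size)) comparisons is roughly 7x the
single step of direct addressing, and a linear scan can cost up to 128
comparisons in the worst case. Peter's histogram above (no scan longer
than 8, ~85% ending after at most one miss) suggests that worst case is
rarely hit in practice.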
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
