Just throwing out another random idea: if you are doing a lot of FST traversals (e.g. for inexact matching or decomposition), you may end up "hammering" the root arcs of the FST heavily, depending on how the algorithm works. Because root arcs are "busy", they end up being O(log N) lookups in the FST and get slow. The Japanese and Korean analyzers do "decompounding" too, and have hacks that waste some RAM to ensure the heavy root-arc traversals are O(1): https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java
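The caching trick could be sketched roughly like this (a minimal illustration with hypothetical class and method names and plain arrays, not Lucene's real FST API): precompute the root node's arcs for the lowest labels into a direct-indexed array, so the hot lookups skip the binary search.

```java
import java.util.Arrays;

// Sketch of the root-arc caching idea (hypothetical types, not Lucene's FST):
// labels below a cutoff get a direct array slot, everything else falls back
// to binary search over the sorted root labels.
class RootArcCache {
    // Sorted labels of the root node's outgoing arcs (binary-search fallback).
    private final int[] rootLabels;
    // Direct-indexed cache for small labels; -1 means "no such arc".
    private final int[] cachedArcIndex;
    private static final int CACHE_SIZE = 0x80; // e.g. ASCII; Kuromoji caches far more

    RootArcCache(int[] sortedRootLabels) {
        this.rootLabels = sortedRootLabels;
        this.cachedArcIndex = new int[CACHE_SIZE];
        Arrays.fill(cachedArcIndex, -1);
        for (int i = 0; i < sortedRootLabels.length; i++) {
            int label = sortedRootLabels[i];
            if (label < CACHE_SIZE) {
                cachedArcIndex[label] = i; // trade a little RAM for O(1) access
            }
        }
    }

    /** Returns the arc index for this label, or -1 if the root has no such arc. */
    int findRootArc(int label) {
        if (label < CACHE_SIZE) {
            return cachedArcIndex[label];              // O(1) hot path
        }
        int slot = Arrays.binarySearch(rootLabels, label); // O(log N) cold path
        return slot >= 0 ? slot : -1;
    }
}
```

The memory cost is a fixed-size array per cached node, which is why this only pays off for the few nodes (like the root) that every traversal touches.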
Bruno did some FST improvements across the board here, but last time we checked, these hacks were still needed for segmentation use cases like this: see his benchmark here: https://s.apache.org/ffelc

For example, maybe it makes sense to cache a few hundred nodes here in a similar way, depending on the dictionary's alphabet size, to accelerate segmentation; I don't know if it will help. Maybe the current FST "INPUT_TYPE" is also inappropriate, and it would work better as a BYTE1 FST rather than the BYTE2 or BYTE4 or whatever it is using now. The current stemming doesn't put much pressure on this, so it isn't optimized.

On Wed, Feb 10, 2021 at 7:53 AM Robert Muir <[email protected]> wrote:
>
> The RAM usage used to be bad as you describe; it blows up way worse
> for other languages than German. There were many issues :)
>
> For Lucene, one common issue was that users wanted to have a lot of
> these things in RAM: e.g. supporting many different languages on a
> single server (multilingual data) and so forth.
> Can we speed up your use case by using the FST in a smarter way? Why
> are there so many traversals... is it the way it is doing inexact
> matching? decomposition?
>
> That was the trick done with stemming, and the stemming was
> accelerated with some side data structures. For example, the "patternIndex"
> thing, which is a scary precomputed list of tableized DFAs: it's
> wasting a "little" space with these tables to speed up the hotspot for
> stemming. In that patternIndex example, some assumptions/limits had
> to be set that hopefully no dictionary would ever exceed: that's all
> the "please report this to [email protected]" checks in the code.
> Some tests were run against all the crazy OO dictionaries out there to
> examine the memory usage when looking at changes like this. Some of
> these are really, really crazy and do surprising things.
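The INPUT_TYPE point can be made concrete (a rough illustration, not Lucene's FST API): the input type decides what a single arc label is, so the same word yields different label sequences, and a smaller label alphabet means less fan-out per node even though the paths get longer.

```java
import java.nio.charset.StandardCharsets;

// Illustration of what the FST input-type choice means for arc labels:
// BYTE1 feeds the FST UTF-8 bytes (labels 0..255, longer paths), while a
// code-point input (BYTE4-style) uses whole code points (labels up to
// 0x10FFFF, shorter paths but wider nodes).
class InputTypes {
    static int[] byte1Labels(String word) {
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        int[] labels = new int[utf8.length];
        for (int i = 0; i < utf8.length; i++) {
            labels[i] = utf8[i] & 0xFF; // every label fits in 0..255
        }
        return labels;
    }

    static int[] byte4Labels(String word) {
        return word.codePoints().toArray(); // one label per code point
    }
}
```

For example, German "Straße" is 7 BYTE1 labels (ß is two UTF-8 bytes) but 6 code-point labels; the BYTE1 alphabet is small enough that a per-node label cache stays cheap.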
>
> On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov
> <[email protected]> wrote:
> >
> > Hi there,
> >
> > I'm mostly done with supporting the major Hunspell features necessary for
> > most European languages (https://issues.apache.org/jira/browse/LUCENE-9687)
> > (but of course I anticipate more minor fixes to come). Thanks, Dawid Weiss,
> > for thorough reviews and prompt accepting of my PRs so far!
> >
> > Now I'd like to make this Hunspell implementation at least as fast as the
> > native Hunspell called via JNI, ideally faster. Currently it seems 1.5-3
> > times slower for me, depending on the language (I've checked en/de/fr so
> > far). I've profiled it, done some minor optimizations, and now it appears
> > that most of the time is taken by FST traversals. I've prototyped
> > decreasing the number of these traversals, and the execution time goes
> > down noticeably (e.g. 30%), but it's still not enough, and the code
> > becomes complicated.
> >
> > So I'm considering other data structures instead of FSTs (Hunspell/C++
> > itself doesn't bother with tries: it uses hash tables and linear searches
> > instead). The problem is, the FST is very well space-optimized, and other
> > data structures consume more memory.
> >
> > So my question is: what's the relative importance of speed and memory in
> > Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be OK
> > to use a CharArrayMap taking 20-25MB but be much faster on lookup (45%
> > improvement in stemming)? Or, with a BytesRefHash plus an array, I can
> > make it ~9MB, with almost the same speedup (but more complex code).
> >
> > How much memory usage is acceptable at all?
> >
> > Maybe there are other suitable data structures in Lucene core that I'm not
> > aware of? I basically need a Map<String, Object>, which would be better
> > queried with char[]+offset+length keys (like CharArrayMap does).
> >
> > Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
