[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965215#comment-13965215 ]
Christian Ziech commented on LUCENE-5584:
-----------------------------------------

Thanks for the very quick and helpful replies. It seems I owe you some more hard and concrete information about our use case, what exactly we do, and our environment.

About the environment - the tests were run on CentOS 6.5 with
{quote}
java version "1.7.0_45"
OpenJDK Runtime Environment (rhel-2.4.3.4.el6_5-x86_64 u45-b15)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
{quote}
Our VM options don't enable the TLAB right now, but I'm definitely considering it for other reasons. Currently we are running with the following (GC-relevant) arguments: -Xmx6g -XX:MaxNewSize=700m -XX:+UseConcMarkSweepGC -XX:MaxDirectMemorySize=35g.

I'm not so much worried about the get performance, although that could be improved as well. We use Lucene's LevenshteinAutomata class to generate a couple of Levenshtein automata with edit distance 1 or 2 (one for each search term), build their union, and intersect that with our FST using a modified version of org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() which pushes every matched entry to a callback instead of returning the whole list of paths (also for efficiency reasons: we don't actually need the byte arrays, we parse them into a value object, so reusing the output byte array is fine for us).

Our FST has about 500M entries and each entry has a value of approx. 10-20 bytes. For a random query with 4 terms (and hence a union of 4 Levenshtein automata) that produces ~2M visited nodes with an output (hence 2M created temporary byte[] instances) and a total size of ~7.5M for the temporary byte arrays (plus the per-instance overhead). In that experiment I matched about 10k terms in the FST. Those numbers already take into account that we use our own add implementation which always writes outputs to the same BytesRef instance.
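For readers following along, the "always write into the same BytesRef instance" pattern the comment describes can be sketched without any Lucene dependency. This is a minimal illustration, not Lucene's actual BytesRef; all names are illustrative. The key point is that the holder's backing array only grows and is otherwise reused, so copying millions of small outputs causes almost no allocation:

```java
// Dependency-free sketch of a reusable byte holder (BytesRef-like).
// Illustrative only; Lucene's real BytesRef has a richer API.
import java.util.Arrays;

public class ReusableBytes {
    public byte[] bytes = new byte[16];
    public int length;

    /** Copy len bytes from src into this holder, growing only when needed. */
    public void copyFrom(byte[] src, int off, int len) {
        if (bytes.length < len) {
            // grow with head-room so repeated copies rarely reallocate
            bytes = new byte[Math.max(len, bytes.length * 2)];
        }
        System.arraycopy(src, off, bytes, 0, len);
        length = len;
    }

    /** Materialize the current contents as a fresh array (only when needed). */
    public byte[] toArray() {
        return Arrays.copyOf(bytes, length);
    }

    public static void main(String[] args) {
        ReusableBytes ref = new ReusableBytes();
        byte[] backing = ref.bytes;
        ref.copyFrom(new byte[]{1, 2, 3}, 0, 3);
        // small payloads reuse the existing array: no allocation per entry
        assert ref.bytes == backing && ref.length == 3;
        ref.copyFrom(new byte[64], 0, 64);
        // only a payload larger than the buffer triggers a reallocation
        assert ref.bytes != backing && ref.length == 64;
        System.out.println("ok");
    }
}
```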
The overall impact on the GC and also on the execution speed of the method was rather significant in total - I can try to dig up numbers for that, but they would be rather application specific. Does this help and answer all the questions so far?

Btw: experimenting a little with the change, I noticed that things may be slightly more complicated, since the output of a node is often overwritten with "NO_OUTPUT" from the Outputs - so the method would need to recycle the current output as well if possible. That may have interesting side effects, but hopefully it should be manageable.

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>
> The FST class heavily reuses Arc instances when traversing the FST. The
> output of an Arc, however, is not reused. This can be especially important when
> traversing large portions of an FST and using the ByteSequenceOutputs and
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every
> node read (which has an output).
> In our use case we intersect a Lucene Automaton with an FST&lt;BytesRef&gt; much
> like it is done in
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and
> since the Automaton and the FST are both rather large, tens or even hundreds
> of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if
> you don't want to change the existing methods for compatibility):
> {code}
> /** Decode an output value previously written with {@link
>  *  #write(Object, DataOutput)} reusing the object passed in if possible */
> public abstract T read(DataInput in, T reuse) throws IOException;
>
> /** Decode an output value previously written with {@link
>  *  #writeFinalOutput(Object, DataOutput)}. By default this
>  *  just calls {@link #read(DataInput)}. This tries to reuse the object
>  *  passed in if possible */
> public T readFinalOutput(DataInput in, T reuse) throws IOException {
>   return read(in, reuse);
> }
> {code}
> The new methods could then be used in the FST readNextRealArc() method,
> passing in the output of the reused Arc. For most inputs they could even just
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch
> and/or tests for the feature.

--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
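To make the proposal concrete, here is a self-contained sketch of what a reuse-enabled read could look like for byte-sequence outputs. This is not Lucene code: DataInput here is a minimal stand-in interface (Lucene's real DataInput is richer), BytesHolder mimics BytesRef, and the simplified length encoding is hypothetical. The point is only the shape of read(in, reuse): decode into the caller's buffer and allocate solely on growth.

```java
// Hedged sketch of the proposed Outputs.read(DataInput, T reuse) idea
// for byte-sequence outputs. All types and names are illustrative.
import java.util.Arrays;

public class ReuseRead {
    /** Minimal stand-in for a data source; not Lucene's DataInput. */
    interface DataInput {
        int readLength();
        void readBytes(byte[] b, int off, int len);
    }

    /** Holder mimicking BytesRef: a byte[] plus a logical length. */
    static final class BytesHolder {
        byte[] bytes = new byte[0];
        int length;
    }

    /** Proposed-style read: reuse the holder's array when it is big enough. */
    static BytesHolder read(DataInput in, BytesHolder reuse) {
        int len = in.readLength();
        if (reuse == null) reuse = new BytesHolder();
        if (reuse.bytes.length < len) {
            reuse.bytes = new byte[len];   // only allocate on growth
        }
        in.readBytes(reuse.bytes, 0, len);
        reuse.length = len;
        return reuse;
    }

    /** Tiny in-memory DataInput over one pre-decoded payload (for the demo). */
    static DataInput inputOver(byte[] payload) {
        return new DataInput() {
            int pos;
            public int readLength() { return payload.length; }
            public void readBytes(byte[] b, int off, int len) {
                System.arraycopy(payload, pos, b, off, len);
                pos += len;
            }
        };
    }

    public static void main(String[] args) {
        BytesHolder reuse = new BytesHolder();
        read(inputOver(new byte[]{10, 20, 30}), reuse);
        byte[] backing = reuse.bytes;
        read(inputOver(new byte[]{1, 2}), reuse);
        // the second read reused the same backing array: no new byte[]
        assert reuse.bytes == backing && reuse.length == 2;
        assert Arrays.equals(Arrays.copyOf(reuse.bytes, reuse.length),
                             new byte[]{1, 2});
        System.out.println("ok");
    }
}
```

With this shape, a traversal loop can hold one BytesHolder per reused Arc and pass it back into every read, which is exactly the allocation pattern the issue asks for.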