[ https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996247#comment-13996247 ]
Michael McCandless commented on LUCENE-5584:
--------------------------------------------

bq. If the outputs of the FST wouldn't actually be a field of the FST itself but if they would be under control of the caller of the FST read*Arc methods just like the BytesReader is, we wouldn't have the problem (maybe instead of the BytesReader).

This would essentially push thread-privateness of the Outputs out to the caller. It's true we did this for the BytesReader (and we've wondered in the past about using a ThreadPrivate instead), but it makes me nervous to also push thread-privateness of Outputs to the caller.

I'm also confused about why a custom Outputs impl that "secretly" reuses isn't sufficient here.

Actually, let me amend the suggestion I made before, to this:

{noformat}
  public BytesRef add(BytesRef a, BytesRef b) {
    BytesRef result;
    if (a == NO_OUTPUT) {
      result = new BytesRef();
    } else {
      result = a;
    }
    result.append(b);
    return result;
  }
{noformat}

(Not tested). I think something like this would not require any thread-privateness, yet would allow multiple threads to work correctly, because each thread would first do that "result = new BytesRef()" and then re-use that output from then on, without requiring an explicit ThreadLocal anywhere?

> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5584
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>    Affects Versions: 4.7.1
>            Reporter: Christian Ziech
>         Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The
> output of an Arc however is not reused. This can especially be important when
> traversing large portions of a FST and using the ByteSequenceOutputs and
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every
> node read (which has an output).
>
> In our use case we intersect a Lucene Automaton with a FST<BytesRef> much
> like it is done in
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths() and,
> since the Automaton and the FST are both rather large, tens or even hundreds
> of thousands of temporary byte array objects are created.
>
> One possible solution to the problem would be to change the
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if
> you don't want to change the existing methods for compatibility):
> {code}
>   /** Decode an output value previously written with {@link
>    *  #write(Object, DataOutput)} reusing the object passed in if possible */
>   public abstract T read(DataInput in, T reuse) throws IOException;
>
>   /** Decode an output value previously written with {@link
>    *  #writeFinalOutput(Object, DataOutput)}.  By default this
>    *  just calls {@link #read(DataInput)}. This tries to reuse the object
>    *  passed in if possible */
>   public T readFinalOutput(DataInput in, T reuse) throws IOException {
>     return read(in, reuse);
>   }
> {code}
>
> The new methods could then be used in the FST in the readNextRealArc() method
> passing in the output of the reused Arc. For most inputs they could even just
> invoke the original read(in) method.
>
> If you should decide to make that change I'd be happy to supply a patch
> and/or tests for the feature.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
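
The reuse-aware read proposed in the issue description can be sketched outside of Lucene roughly as follows. This is only an illustration under stated assumptions, not the patch discussed in the issue: `ReuseSketch`, `Bytes`, `allocations` and `allocationsFor` are hypothetical names, `Bytes` merely stands in for BytesRef, and the fixed 4-byte length prefix is a simplification rather than the encoding Lucene's Outputs classes use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ReuseSketch {

    /** Hypothetical stand-in for BytesRef: a backing array plus a valid length. */
    static final class Bytes {
        byte[] bytes = new byte[0];
        int length;
    }

    /** Counts backing arrays allocated by read(); for illustration only. */
    static int allocations = 0;

    /**
     * Decodes one length-prefixed record. Writes into `reuse` when it is
     * non-null and grows its array only if the record does not fit.
     */
    static Bytes read(DataInput in, Bytes reuse) throws IOException {
        int len = in.readInt(); // assumption: simplified 4-byte length prefix
        Bytes result = (reuse != null) ? reuse : new Bytes();
        if (result.bytes.length < len) {
            result.bytes = new byte[len];
            allocations++;
        }
        in.readFully(result.bytes, 0, len);
        result.length = len;
        return result;
    }

    /** Encodes `count` records of `size` bytes, decodes them, returns arrays allocated. */
    static int allocationsFor(int count, int size, boolean recycle) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (int i = 0; i < count; i++) {
                out.writeInt(size);
                out.write(new byte[size]);
            }
            out.flush();
            DataInput in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            allocations = 0;
            Bytes current = null;
            for (int i = 0; i < count; i++) {
                // Pass the previous result back in only when recycling is enabled.
                current = read(in, recycle ? current : null);
            }
            return allocations;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("fresh:    " + allocationsFor(1000, 8, false));
        System.out.println("recycled: " + allocationsFor(1000, 8, true));
    }
}
```

Note that a recycled object is overwritten by every subsequent read, so a caller that wants to keep a decoded value has to copy it before reading the next one.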