FST doesn't fully share common prefix across all outputs
--------------------------------------------------------
Key: LUCENE-3297
URL: https://issues.apache.org/jira/browse/LUCENE-3297
Project: Lucene - Java
Issue Type: Improvement
Reporter: Michael McCandless
Priority: Minor
FST tries to share prefixes of outputs when possible. However, in the [I
think unusual in practice] case where all outputs share a common prefix, FST
really ought to store that prefix just once, on the root arc; instead it is
only able to push it back as far as the N root arcs, so the prefix ends up
stored N times. It's sort of an off-by-one in how far back the pushing goes...
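To make the off-by-one concrete, here is a minimal sketch in plain Java. It is
a toy model, not Lucene's FST code: the names OutputPrefixSketch,
commonPrefixLength and stripPrefix are invented for illustration, and it
assumes outputs are plain byte[] values. It computes the prefix shared by
every output (the part that could be stored once at the root) and the
remainder each arc would still have to carry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these names exist in Lucene.
final class OutputPrefixSketch {

    /** Length of the longest prefix shared by every output in the list. */
    static int commonPrefixLength(List<byte[]> outputs) {
        if (outputs.isEmpty()) {
            return 0;
        }
        byte[] first = outputs.get(0);
        int len = first.length;
        for (byte[] out : outputs) {
            // Never compare past the prefix already known to be shared.
            int max = Math.min(len, out.length);
            int i = 0;
            while (i < max && out[i] == first[i]) {
                i++;
            }
            len = i;
        }
        return len;
    }

    /** What each arc would still carry once the shared prefix is stored a single time. */
    static List<byte[]> stripPrefix(List<byte[]> outputs, int prefixLength) {
        List<byte[]> remainders = new ArrayList<>();
        for (byte[] out : outputs) {
            remainders.add(Arrays.copyOfRange(out, prefixLength, out.length));
        }
        return remainders;
    }

    public static void main(String[] args) {
        List<byte[]> outputs = Arrays.asList(
            new byte[] {1, 1, 1, 7},
            new byte[] {1, 1, 1, 8},
            new byte[] {1, 1, 1});
        int shared = commonPrefixLength(outputs);
        System.out.println("shared prefix length: " + shared);
        for (byte[] rest : stripPrefix(outputs, shared)) {
            System.out.println("remainder: " + Arrays.toString(rest));
        }
    }
}
```

For the three outputs in main, the shared prefix is three bytes long and the
remainders are [7], [8] and an empty array. Under the behaviour described
above, those three shared bytes would instead be repeated on every root arc.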
One [synthetic] example where this makes a big difference is the new
Test2BPostings test when it uses MemoryCodec: the test has 26 terms (the
letters of the alphabet), and each term has exactly the same long (~85 MB)
all-1s byte[] as its postings. If we fixed this issue, the resulting FST
would be only ~85 MB; as it stands, it needs ~85 * 26 MB (roughly 2.2 GB).
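The size estimate above can be checked with a short back-of-the-envelope
sketch. It assumes each of the 26 terms starts with a distinct letter (so
there are 26 root arcs), counts only output bytes, and ignores all other FST
overhead. The class and method names are hypothetical.

```java
// Rough estimate only: counts output bytes, ignores arc and node overhead.
final class FstOutputSizeEstimate {

    /** Shared prefix copied onto each root arc (the behaviour described in this issue). */
    static long prefixPerRootArc(int rootArcs, long sharedPrefixSize) {
        return rootArcs * sharedPrefixSize;
    }

    /** Shared prefix stored a single time, however many root arcs there are. */
    static long prefixStoredOnce(int rootArcs, long sharedPrefixSize) {
        return rootArcs == 0 ? 0 : sharedPrefixSize;
    }

    public static void main(String[] args) {
        int rootArcs = 26;     // one per letter of the alphabet
        long postingsMb = 85;  // approximate size of the shared all-1s byte[]
        System.out.println(prefixPerRootArc(rootArcs, postingsMb) + " MB"); // 2210 MB
        System.out.println(prefixStoredOnce(rootArcs, postingsMb) + " MB"); // 85 MB
    }
}
```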