[
https://issues.apache.org/jira/browse/LUCENE-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081557#comment-13081557
]
Michael McCandless commented on LUCENE-3297:
--------------------------------------------
If indeed we can make the code more generic and not lose (too much)
perf then that would be awesome... I'm just having trouble seeing how
adding an explicit <eps> label would be more generic, since <eps> would
only (and always) be used in exactly one special-cased place (the
root arc), I think?
I must be missing something in your proposal...
Or, are you suggesting we actually make a "before start" symbol (hmm,
the mirror image of FST.END_LABEL) and always forcefully/explicitly
insert this in front of every byte[] passed to Builder? This would in
fact fix this issue, since Builder should push a global output prefix
onto that first arc... and then that first arc would become the FST's
root arc.
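
To make the "before start" symbol idea concrete, here is a minimal toy sketch in Java. It is not Lucene's Builder or FST code: it builds a plain trie, pushes each output prefix as far toward the root as possible, and counts how many output characters end up stored on arcs. The class name PrefixPushSketch, the START symbol, and the method storedOutputChars are hypothetical names introduced only for this illustration, and outputs are modeled as Strings rather than byte[].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a plain trie, NOT Lucene's FST/Builder implementation.
final class PrefixPushSketch {
    // Hypothetical "before start" symbol, assumed never to occur in real inputs.
    static final char START = '\u0001';

    static final class Node {
        final Map<Character, Node> arcs = new TreeMap<>();
        // Outputs of every input that passes through the arc leading to this node.
        final List<String> outputsBelow = new ArrayList<>();
        String finalOutput; // non-null if some input ends at this node
    }

    // Length of the longest common prefix of all values (values must be non-empty).
    static int lcpLength(List<String> values) {
        String first = values.get(0);
        int len = first.length();
        for (String v : values) {
            int max = Math.min(len, v.length());
            int i = 0;
            while (i < max && first.charAt(i) == v.charAt(i)) {
                i++;
            }
            len = i;
        }
        return len;
    }

    // Total output characters stored once prefixes are pushed toward the root.
    // If prependStart is true, START is inserted in front of every input.
    static long storedOutputChars(String[] inputs, String[] outputs, boolean prependStart) {
        Node root = new Node();
        for (int i = 0; i < inputs.length; i++) {
            String key = prependStart ? START + inputs[i] : inputs[i];
            Node node = root;
            for (char c : key.toCharArray()) {
                node = node.arcs.computeIfAbsent(c, k -> new Node());
                node.outputsBelow.add(outputs[i]);
            }
            node.finalOutput = outputs[i];
        }
        return count(root, 0);
    }

    // 'emitted' is how many leading output characters ancestor arcs already store.
    private static long count(Node node, int emitted) {
        long total = 0;
        if (node.finalOutput != null) {
            total += node.finalOutput.length() - emitted; // leftover final output
        }
        for (Node child : node.arcs.values()) {
            int pushed = lcpLength(child.outputsBelow); // prefix shared below this arc
            total += (pushed - emitted) + count(child, pushed);
        }
        return total;
    }
}
```

In this toy model, 26 single-letter inputs that all map to the same 10-character output store 260 output characters without the start symbol (one copy on each of the 26 root arcs) and 10 with it (one copy on the single START arc). That mirrors the ~85 * 26 MB versus ~85 MB difference described in the issue.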
> FST doesn't fully share common prefix across all outputs
> --------------------------------------------------------
>
> Key: LUCENE-3297
> URL: https://issues.apache.org/jira/browse/LUCENE-3297
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Michael McCandless
> Priority: Minor
>
> FST will try to share prefixes of outputs when possible, however in the [I
> think unusual in practice] case where all outputs share a common prefix, FST
> really ought to store this just once, on the root arc, but instead it's only
> able to push back to the N root arcs. It's sort of an off-by-one on how far
> back the pushing goes...
> One [synthetic] example where this makes a big difference is the new
> Test2BPostings test, when it uses MemoryCodec, because this test has 26 terms
> (letters of alphabet) and each term has exactly the same long (~85 MB) all 1s
> byte[] as the postings. If we fixed this issue, then the resulting FST would
> only be ~85 MB but now instead it needs to be ~85 * 26 MB.