[ https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557229#comment-13557229 ]
Commit Tag Bot commented on LUCENE-4682:
----------------------------------------

[branch_4x commit] Robert Muir
http://svn.apache.org/viewvc?view=revision&revision=1435141

LUCENE-4677, LUCENE-4682, LUCENE-4678, LUCENE-3298: Merged /lucene/dev/trunk:r1432459,1432466,1432472,1432474,1432522,1432646,1433026,1433109

> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST
> writes the arcs as an array (each arc gets N bytes), so we can e.g. do a
> binary search on lookup.
> The problem is that N is set to max(numBytesPerArc), so if you have an
> outlier arc, e.g. one with a big output, you can waste many bytes on all the
> other arcs that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs. a total
> size of 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that the
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass
> about which nodes wasted the most, and then in the second pass (pack) we
> could set the threshold based on the top X% of nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is
> more uniform ...

--
This message is automatically generated by JIRA.
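To make the wasted-bytes arithmetic in the issue description concrete, here is a minimal Java sketch of the threshold check suggested for addNode. It is an illustration only, not Lucene's code and not the attached patch: the class `ArrayArcWaste`, its methods, and the `maxWastedFraction` parameter are hypothetical names, and the per-arc byte sizes are made-up inputs.

```java
// Hypothetical illustration; none of these names exist in Lucene.
final class ArrayArcWaste {

    /**
     * Bytes lost when every arc is padded to the size of the widest arc,
     * i.e. numArcs * max(bytesPerArc) - sum(bytesPerArc).
     */
    static long wastedBytes(int[] bytesPerArc) {
        int max = 0;
        long used = 0;
        for (int b : bytesPerArc) {
            max = Math.max(max, b);
            used += b;
        }
        return (long) max * bytesPerArc.length - used;
    }

    /**
     * Returns true if the fixed-width array form wastes no more than
     * maxWastedFraction of its total size (assumed threshold policy).
     */
    static boolean shouldExpandToArray(int[] bytesPerArc, double maxWastedFraction) {
        if (bytesPerArc.length == 0) {
            return false; // nothing to expand
        }
        int max = 0;
        for (int b : bytesPerArc) {
            max = Math.max(max, b);
        }
        long fixedSize = (long) max * bytesPerArc.length;
        return wastedBytes(bytesPerArc) <= maxWastedFraction * fixedSize;
    }

    public static void main(String[] args) {
        int[] withOutlier = {2, 2, 3, 2, 10}; // one arc carries a large output
        int[] uniform = {3, 3, 3, 3};
        System.out.println(wastedBytes(withOutlier));
        System.out.println(shouldExpandToArray(withOutlier, 0.25));
        System.out.println(shouldExpandToArray(uniform, 0.25));
    }
}
```

In this sketch a node whose arcs are all about the same size is still written as an array, so lookups can use binary search, while a node dominated by one outlier arc falls back to the variable-length encoding. The 0.25 threshold is arbitrary; the issue leaves the actual value open.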