[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

Michael McCandless (JIRA) Mon, 14 Jan 2013 05:02:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552633#comment-13552633
 ]


Michael McCandless commented on LUCENE-4682:
--------------------------------------------

{quote}
bq. So FST would be ~39% larger if we remove NEXT

But according to your notes above, we have 28% waste for this (with a long 
output).
Are we making the right tradeoff?
{quote}

Wait: the 28% waste comes from the array arcs (unrelated to NEXT?).  To fix 
that I think we should use a skip list?  Surely the bytes required to encode 
the skip list are less than our waste today.
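
To make the size comparison concrete, here is a minimal sketch of the idea, assuming a single-level skip index (a simplification of a full skip list) over variable-length arcs. Everything in it is hypothetical and not Lucene's FST code: the class name, the 5-byte skip entry (1 label byte plus 4 offset bytes), and the per-arc arrays, which only stand in for decoding arcs from the packed bytes.

```java
/** Hypothetical sketch, not Lucene code: packed variable-length arcs plus a sparse skip index. */
final class SkipIndexedArcs {
    // Assumed cost of one skip entry: 1 label byte + 4 offset bytes.
    static final int SKIP_ENTRY_BYTES = 5;

    final int[] labels;   // sorted arc labels (a real reader would decode these from bytes)
    final int[] offsets;  // byte offset of each arc in the packed layout (simulated)
    final int interval;   // one skip entry for every `interval` arcs
    final int packedBytes;

    SkipIndexedArcs(int[] sortedLabels, int[] arcSizes, int interval) {
        if (interval <= 0) throw new IllegalArgumentException("interval must be > 0");
        this.labels = sortedLabels;
        this.interval = interval;
        this.offsets = new int[arcSizes.length];
        int pos = 0;
        for (int i = 0; i < arcSizes.length; i++) {
            offsets[i] = pos;      // arcs are written back to back, with no padding
            pos += arcSizes[i];
        }
        this.packedBytes = pos;
    }

    int skipEntryCount() {
        return (labels.length + interval - 1) / interval;
    }

    /** Packed arc bytes plus the bytes spent on skip entries. */
    int totalBytes() {
        return packedBytes + skipEntryCount() * SKIP_ENTRY_BYTES;
    }

    /** Returns the byte offset of the arc with this label, or -1 if there is none. */
    int findArcOffset(int label) {
        // Binary search over skip entries (the arcs at index 0, interval, 2*interval, ...)
        // for the last block whose first label is <= the target.
        int lo = 0, hi = skipEntryCount() - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (labels[mid * interval] <= label) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) return -1;
        // Linear scan inside the block, at most `interval` arcs.
        int end = Math.min(labels.length, (block + 1) * interval);
        for (int i = block * interval; i < end; i++) {
            if (labels[i] == label) return offsets[i];
            if (labels[i] > label) break;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] labels = {97, 98, 99, 100, 101, 102};
        int[] sizes = {2, 2, 3, 2, 11, 2};  // made-up arc sizes with one outlier
        SkipIndexedArcs arcs = new SkipIndexedArcs(labels, sizes, 3);
        System.out.println("packed + skip bytes: " + arcs.totalBytes());
        System.out.println("fixed-width bytes:   " + 11 * sizes.length);
        System.out.println("offset of label 101: " + arcs.findArcOffset(101));
    }
}
```

For this one made-up node the packed layout plus skip entries is smaller than the padded array, but whether that holds across a real FST depends on the arc size distribution and on the skip entry encoding, so it would need measuring.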

{quote}
bq. Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?

Or, not do it at all if we can't figure it out? The reversing holds back other 
improvements so
benchmarking it by itself could be misleading.
{quote}

I don't think we should drop NEXT unless we have some alternative?  A 39% 
increase in size is non-trivial!

I know reversing held back delta-code of the node target, but, that looks like 
it won't gain much.  What else is it holding back?
                
> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST 
> writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
> search on lookup.
> The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
> arc e.g. with a big output, you can waste many bytes for all the other arcs 
> that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that 
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass 
> about which nodes wasted the most, and then in the second pass (pack) we 
> could set the threshold based on the top X% nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
> more uniform ...
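
The waste accounting and the threshold idea from the description can be sketched as follows. This is only an illustration under stated assumptions: the class, the method names, and the 25% threshold are hypothetical, and this is not how Lucene's addNode is actually written. It assumes each arc's encoded size in bytes is already known.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Lucene: estimates padding waste for fixed-width array arcs. */
final class ArcArrayWaste {

    /** Bytes used when every arc is padded to the widest arc in the node (N = max(numBytesPerArc)). */
    static long fixedArrayBytes(int[] arcSizes) {
        int max = Arrays.stream(arcSizes).max().orElse(0);
        return (long) max * arcSizes.length;
    }

    /** Padding bytes: the fixed-width total minus the bytes the arcs actually need. */
    static long wastedBytes(int[] arcSizes) {
        long needed = Arrays.stream(arcSizes).asLongStream().sum();
        return fixedArrayBytes(arcSizes) - needed;
    }

    /** Expand to an array only if the wasted fraction is at or below maxWasteRatio. */
    static boolean shouldExpandToArray(int[] arcSizes, double maxWasteRatio) {
        long total = fixedArrayBytes(arcSizes);
        if (total == 0) return false;
        return (double) wastedBytes(arcSizes) / total <= maxWasteRatio;
    }

    public static void main(String[] args) {
        int[] arcSizes = {2, 2, 3, 2, 11, 2};  // made-up sizes; one arc has a big output
        System.out.println("fixed-width bytes: " + fixedArrayBytes(arcSizes));
        System.out.println("wasted bytes:      " + wastedBytes(arcSizes));
        System.out.println("expand at 25%?     " + shouldExpandToArray(arcSizes, 0.25));
        // The Kuromoji figures quoted in the description, as a percentage.
        System.out.println("Kuromoji waste %:  " + 100.0 * 271187 / 1535612);
    }
}
```

A single outlier arc dominates the padded size here, which is the case the threshold is meant to catch; nodes with uniform arc sizes waste nothing and would still be expanded.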
