[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

Michael McCandless (JIRA) Mon, 14 Jan 2013 04:40:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552614#comment-13552614
 ]


Michael McCandless commented on LUCENE-4682:
--------------------------------------------

I tried removing NEXT opto in building the all-English-Wikipedia-terms FST and 
it was a big hit:

  * With NEXT: 59267794 bytes

  * Without NEXT: 82543993 bytes

So FST would be ~39% larger if we remove NEXT ... however lookup sped up from 
726 ns per lookup to 636 ns.  But: we could get this speedup today, if we just 
fixed setting of a NEXT arc's target to be lazy instead.  Today it's very 
costly for non-array arcs because we scan to the end of all nodes to set the 
target, even if the caller isn't going to use it, which is really ridiculous.

I also "tested" delta-coding the arc target instead of the abs vInt we have 
today ... it wasn't a real test, instead I just measured how many bytes the 
vInt delta would be vs how many bytes the vInt abs it today, and the results 
were disappointing:

  * Abs vInt (what we do today): 26681349 bytes

  * Delta vInt: 25479316 bytes

Which is surprising ... I guess we don't see much "locality" for the nodes ... 
or, eg the common suffixes freeze early on and then lots of future nodes refer 
to them.

Maybe, we can find a way to do NEXT without the confusing 
per-node-reverse-bytes?
                
> Reduce wasted bytes in FST due to array arcs
> --------------------------------------------
>
>                 Key: LUCENE-4682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4682
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: kuromoji.wasted.bytes.txt, LUCENE-4682.patch
>
>
> When a node is close to the root, or it has many outgoing arcs, the FST 
> writes the arcs as an array (each arc gets N bytes), so we can e.g. bin 
> search on lookup.
> The problem is N is set to the max(numBytesPerArc), so if you have an outlier 
> arc e.g. with a big output, you can waste many bytes for all the other arcs 
> that didn't need so many bytes.
> I generated Kuromoji's FST and found it has 271187 wasted bytes vs total size 
> 1535612 = ~18% wasted.
> It would be nice to reduce this.
> One thing we could do without packing is: in addNode, if we detect that 
> number of wasted bytes is above some threshold, then don't do the expansion.
> Another thing, if we are packing: we could record stats in the first pass 
> about which nodes wasted the most, and then in the second pass (paack) we 
> could set the threshold based on the top X% nodes that waste ...
> Another idea is maybe to deref large outputs, so that the numBytesPerArc is 
> more uniform ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-4682) Reduce wasted bytes in FST due to array arcs

Reply via email to