[ 
https://issues.apache.org/jira/browse/LUCENE-8084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288954#comment-16288954
 ] 

Dawid Weiss commented on LUCENE-8084:
-------------------------------------

Hi Mike! I did see the todos in the logic of whether to expand a node or not, 
but I honestly didn't have an idea what would be a good improvement. We 
sometimes do need those branches to be expanded for fast lookups, especially 
close to the root level (where the cost of a linear scan is huge when you 
consider millions of lookups). In fact, even the limit on the root arc cache is 
something we had to bypass manually (root level fanout is huge if you have 
varied int keys) and we expanded all of those arcs. 

Perhaps we have a specific use case, I don't know. I'll experiment a bit and 
see, just wanted to mark the fact that the current storage of expanded nodes is 
fairly inefficient sometimes.

> FSTs can be very space-inefficient on array-expanded nodes
> ----------------------------------------------------------
>
>                 Key: LUCENE-8084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8084
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: capture-4.png
>
>
> We have FSTs which operate on a larger alphabet (keys in int) space and emit 
> character sequence outputs. I noticed that certain nodes get expanded into 
> fixed-size arrays to accelerate lookups (binary search). This has a potential 
> problem, however, when the outputs emit larger blobs of data (say, one of the 
> outputs is very long, all the others are small). Then the fixed-size array is 
> very much overallocated, as evident on the attached picture.
> I wonder if it'd be better to encode the array as fixed-size, but without the 
> outputs and use a local fixed-size pointer into a "value pool" somewhere next 
> to the node's arcs. Theoretically this "value pool" could even be shared by 
> all of automaton's nodes (saved once at the end or flushed periodically). 
> Just a thought.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to