I've been experimenting with a new FST encoding, and the performance
gains are exciting on FST-intensive benchmarks like for the suggesters
and for PKLookup in luceneutil. In our production system we see some
gains in regular search performance as well, although these are modest
since FST lookup is not a major component of our costs there. The
gains vary a lot depending on the workload and the makeup of the terms
encoded in the FST, but for example I've seen as much as a 2x speedup
in Fuzzy Suggester performance. I'd like to open an issue to discuss
further, and share a PR, but here's a quick summary of what the
proposed change is:

FST is basically a graph composed of Arcs. Today, outgoing Arcs of a
given Arc can be encoded in two ways: either as a list of
variable-length-encoded Arcs, or as an array of fixed-length-encoded
Arcs. When seeking in the FST (eg looking up a term), one matches
successive characters against the Arc labels in the graph. If Arcs are
encoded in the list format, we scan each item in the list to find a
matching label, terminating early since they are ordered (by label).
When Arcs are in the array format, we use a binary search to find an
Arc with a matching label. The array format is used when there are a
relatively larger number of Arcs, and more so nearer the start of FST
graph.The new "direct array" encoding stores outgoing Arcs in a
full-sized array big enough to cover the complete span of outgoing
labels, so we can address directly by label and avoid the binary
search. Generally such an array will have some gaps, so this
fundamentally offers a space/time tradeoff.

Unless there is some strenuous objection (FST must not be touched!),
I'll open an issue soon and post a patch for comments.

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to