I've been experimenting with a new FST encoding, and the performance gains are exciting on FST-intensive benchmarks like for the suggesters and for PKLookup in luceneutil. In our production system we see some gains in regular search performance as well, although these are modest since FST lookup is not a major component of our costs there. The gains vary a lot depending on the workload and the makeup of the terms encoded in the FST, but for example I've seen as much as a 2x speedup in Fuzzy Suggester performance. I'd like to open an issue to discuss further, and share a PR, but here's a quick summary of what the proposed change is:
FST is basically a graph composed of Arcs. Today, outgoing Arcs of a given Arc can be encoded in two ways: either as a list of variable-length-encoded Arcs, or as an array of fixed-length-encoded Arcs. When seeking in the FST (eg looking up a term), one matches successive characters against the Arc labels in the graph. If Arcs are encoded in the list format, we scan each item in the list to find a matching label, terminating early since they are ordered (by label). When Arcs are in the array format, we use a binary search to find an Arc with a matching label. The array format is used when there are a relatively larger number of Arcs, and more so nearer the start of FST graph.The new "direct array" encoding stores outgoing Arcs in a full-sized array big enough to cover the complete span of outgoing labels, so we can address directly by label and avoid the binary search. Generally such an array will have some gaps, so this fundamentally offers a space/time tradeoff. Unless there is some strenuous objection (FST must not be touched!), I'll open an issue soon and post a patch for comments. -Mike --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org