[
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839307#action_12839307
]
Robert Muir commented on LUCENE-2089:
-------------------------------------
Mike, this is awesome. I ran benchmarks: we are just as fast as before (with
only Lev1 and Lev2 enabled), but with smaller generated code.
When i turn on Lev3, it speeds up the worst-case ones (no prefix, pq=1024,
fuzzy of n=3, n=4), but slows down some of the "better-case" n=3/n=4 cases
where there is a prefix or PQ.
I think this is because the benchmark is contrived, but realistically n=3 (with
seeking!) should be a win for users. A less-contrived benchmark (a 'typical'
massive term dictionary) would help for tuning.
separately, I think we can add heuristics: e.g. for n > 3 WITH a prefix, use
the DFA in "linear mode" until you drop to n=2, as you already have a nice
prefix anyway, stuff like that. But if the user doesn't supply a prefix, i
think seeking is always a win.
Here are the results anyway: I ran it many times and its consistent (obviously
differences of just a few MS are not significant). I bolded the ones i think
illustrate the differences I am talking about.
Its cool to be at the point where we are actually able to measure these kinds
of tradeoffs!
{{Minimum Sim = 0.73f (edit distance of 1)}}
||Prefix Length||PQ Size||Avg MS (flex trunk)||Avg MS (1,2)||Avg MS (1,2,3)||
|0|1024|3286.0|7.8|7.6
|0|64|3320.4|7.6|8.0
|1|1024|316.8|5.6|5.3
|1|64|314.3|5.6|5.2
|2|1024|31.8|3.8|4.2
|2|64|31.9|3.7|4.5
{{Minimum Sim = 0.58f (edit distance of 2)}}
||Prefix Length||PQ Size||Avg MS (flex trunk)||Avg MS (1,2)||Avg MS (1,2,3)||
|0|1024|4223.3|87.7|91.2
|0|64|4199.7|12.6|13.2
|1|1024|430.1|56.4|62.0
|1|64|392.8|9.3|8.5
|2|1024|82.5|45.5|48.0
|2|64|38.4|6.2|6.3
{{Minimum Sim = 0.43f (edit distance of 3)}}
||Prefix Length||PQ Size||Avg MS (flex trunk)||Avg MS (1,2)||Avg MS (1,2,3)||
|0|1024|5299.9|424.0|*199.8*
|0|64|5231.8|54.1|*93.2*
|1|1024|522.9|103.6|107.9
|1|64|480.9|14.5|*49.3*
|2|1024|89.0|67.9|70.8
|2|64|46.3|6.8|*19.7*
{{Minimum Sim = 0.29f (edit distance of 4)}}
||Prefix Length||PQ Size||Avg MS (flex trunk)||Avg MS (1,2)||Avg MS (1,2,3)||
|0|1024|6258.1|363.7|*206.5*
|0|64|6247.6|75.6|78.8
|1|1024|609.9|108.3|110.0
|1|64|567.1|13.3|*45.5*
|2|1024|98.6|66.6|73.8
|2|64|55.6|6.8|*22.3*
> explore using automaton for fuzzyquery
> --------------------------------------
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: Flex Branch
> Reporter: Robert Muir
> Assignee: Mark Miller
> Priority: Minor
> Fix For: Flex Branch
>
> Attachments: ContrivedFuzzyBenchmark.java, gen.py, gen.py, gen.py,
> gen.py, gen.py, gen.py, Lev2ParametricDescription.java,
> Lev2ParametricDescription.java, Lev2ParametricDescription.java,
> Lev2ParametricDescription.java, LUCENE-2089.patch, LUCENE-2089.patch,
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch,
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch,
> LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
>
>
> we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed
> up the core FuzzyQuery in similar fashion to Wildcard and Regex speedups,
> maintaining all backwards compatibility.
> The advantages are:
> * we can seek to terms that are useful, instead of brute-forcing the entire
> terms dict
> * we can determine matches faster, as true/false from a DFA is array lookup,
> don't even need to run levenshtein.
> We build Levenshtein DFAs in linear time with respect to the length of the
> word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> To implement support for 'prefix' length, we simply concatenate two DFAs,
> which doesn't require us to do NFA->DFA conversion, as the prefix portion is
> a singleton. the concatenation is also constant time with respect to the size
> of the fuzzy DFA, it only need examine its start state.
> with this algorithm, parametric tables are precomputed so that DFAs can be
> constructed very quickly.
> if the required number of edits is too large (we don't have a table for it),
> we use "dumb mode" at first (no seeking, no DFA, just brute force like now).
> As the priority queue fills up during enumeration, the similarity score
> required to be a competitive term increases, so, the enum gets faster and
> faster as this happens. This is because terms in core FuzzyQuery are sorted
> by boost value, then by term (in lexicographic order).
> For a large term dictionary with a low minimal similarity, you will fill the
> pq very quickly since you will match many terms.
> This not only provides a mechanism to switch to more efficient DFAs (edit
> distance of 2 -> edit distance of 1 -> edit distance of 0) during
> enumeration, but also to switch from "dumb mode" to "smart mode".
> With this design, we can add more DFAs at any time by adding additional
> tables. The tradeoff is the tables get rather large, so for very high K, we
> would start to increase the size of Lucene's jar file. The idea is we don't
> have include large tables for very high K, by using the 'competitive boost'
> attribute of the priority queue.
> For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]