[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

Robert Muir (JIRA) Mon, 01 Mar 2010 12:48:36 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839783#action_12839783
 ]


Robert Muir edited comment on LUCENE-2089 at 3/1/10 8:46 PM:
-------------------------------------------------------------

this is mike's patch, except with the additional tests (that fail for lev2/lev3 
due to the off-by-one bug).

edit: found the bug, it is in moman, the author is taking a look. the problem 
was i ran tests against moman with oflazer, (a slower alternative algorithm), 
and the bug in moman doesn't happen there.


      was (Author: rcmuir):
    this is mike's patch, except with the additional tests (that fail for 
lev2/lev3 due to the off-by-one bug).

i checked moman, the problem is not moman, its something we are doing :)

  
> explore using automaton for fuzzyquery
> --------------------------------------
>
>                 Key: LUCENE-2089
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2089
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: ContrivedFuzzyBenchmark.java, gen.py, gen.py, gen.py, 
> gen.py, gen.py, gen.py, Lev2ParametricDescription.java, 
> Lev2ParametricDescription.java, Lev2ParametricDescription.java, 
> Lev2ParametricDescription.java, LUCENE-2089.patch, LUCENE-2089.patch, 
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, 
> Moman-0.2.1.tar.gz, TestFuzzy.java
>
>
> we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed 
> up the core FuzzyQuery in similar fashion to Wildcard and Regex speedups, 
> maintaining all backwards compatibility.
> The advantages are:
> * we can seek to terms that are useful, instead of brute-forcing the entire 
> terms dict
> * we can determine matches faster, as true/false from a DFA is array lookup, 
> don't even need to run levenshtein.
> We build Levenshtein DFAs in linear time with respect to the length of the 
> word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> To implement support for 'prefix' length, we simply concatenate two DFAs, 
> which doesn't require us to do NFA->DFA conversion, as the prefix portion is 
> a singleton. the concatenation is also constant time with respect to the size 
> of the fuzzy DFA, it only need examine its start state.
> with this algorithm, parametric tables are precomputed so that DFAs can be 
> constructed very quickly.
> if the required number of edits is too large (we don't have a table for it), 
> we use "dumb mode" at first (no seeking, no DFA, just brute force like now).
> As the priority queue fills up during enumeration, the similarity score 
> required to be a competitive term increases, so, the enum gets faster and 
> faster as this happens. This is because terms in core FuzzyQuery are sorted 
> by boost value, then by term (in lexicographic order).
> For a large term dictionary with a low minimal similarity, you will fill the 
> pq very quickly since you will match many terms. 
> This not only provides a mechanism to switch to more efficient DFAs (edit 
> distance of 2 -> edit distance of 1 -> edit distance of 0) during 
> enumeration, but also to switch from "dumb mode" to "smart mode".
> With this design, we can add more DFAs at any time by adding additional 
> tables. The tradeoff is the tables get rather large, so for very high K, we 
> would start to increase the size of Lucene's jar file. The idea is we don't 
> have include large tables for very high K, by using the 'competitive boost' 
> attribute of the priority queue.
> For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Issue Comment Edited: (LUCENE-2089) explore using automaton for fuzzyquery

Reply via email to