[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790368#action_12790368 ]
Mark Miller commented on LUCENE-2089: ------------------------------------- bq. If you do take hold of it, do not hesitate to share The original paper and C++ code likewise melt my brain, and I needed the algo in some other place. The java impl I was onto was about 75% complete according to the author, but I have not yet looked at the code. Robert was convinced it was a different less efficient algorithm last I heard though. We have cracked much of the paper - thats how Robert implemented n=1 here - thats from the paper. The next step is to work out how to construct the tables for n as Robert says above. And store those tables efficiently as they start getting quite large rather fast - though we might only use as high as n=3 or 4 in Lucene - Robert suspects term seeking will outweigh any gains at that point. I think we know how to do the majority of the work for the n case, but I don't really have much/any time for this, so it probably depends on if/when Robert gets to it. If he loses interest on finishing, I def plan to come back to it someday. I'd like to complete my understanding of the paper and see a full n java impl of this in either case. The main piece left that I don't understand fully (computing all possible states for n), can be computed with just a brute force check (thats how the python impl is doing it), so there may not be much more to understand. I would like to know how the paper is getting 'i' parametrized state generators though - thats much more efficient. The paper shows them for n=1 and n=2. > explore using automaton for fuzzyquery > -------------------------------------- > > Key: LUCENE-2089 > URL: https://issues.apache.org/jira/browse/LUCENE-2089 > Project: Lucene - Java > Issue Type: Wish > Components: Search > Reporter: Robert Muir > Assignee: Mark Miller > Priority: Minor > Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java > > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least small common E up to some max K (1,2,3, etc) we should create > a DFA for each E. > if the required E is above our supported max, we use "dumb mode" at first (no > seeking, no DFA, just brute force like now). > As the pq fills, we swap progressively lower DFAs into the enum, based upon > the lowest score in the pq. > This should work well on avg, at high E, you will typically fill the pq very > quickly since you will match many terms. > This not only provides a mechanism to switch to more efficient DFAs during > enumeration, but also to switch from "dumb mode" to "smart mode". > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNNNNNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. > instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org