Thanks David, That would be super duper great! <Regex1><LEV><Regex2>...
the trick with "smart skipping" terms enum is not changing with automaton construction? These two steps are independent... if you build infinite Automaton like REgex(".*"), your problem :) This cannot be better at the moment. One day, TermDict will become automaton as well... but for now we live with this brilliant idea to "use sorted list of terms as other automaton for matching" . But even than, REgex(".*") should match everything :) My proposal is about performance and flexibility Compare simple use case: in order to support transposition in the first two characters like in my example, you would need Lev. Automaton that has maxDistance 2, so it would waste time to scan *much much more* than needed for given constraint. It does not matter if you match against "sorted list" or another automaton. Alternative would be to run two fuzzy enums with Lev automata with maxDistance = 1. and filter them out on prefix later. But these are two separate "term enum scans" With concatenation, "composite automaton" would scan prefix part like normal finite regex prefixes (fast!) and the rest with Lev. automaton with maxDistance 1, using whatever is at the moment possible (sorted list, or another automation like mikes FST TermDictionary) Another example would be to add some phonetic variations here and there... "CARIN" <Regex("C|K")><Lev("ARIN")> Point being, this enables us to restrict search space of Lev automaton significantly (which is normally huge as simple transposition needs minDistance 2 ...) to achieve what we want. (e.g. most of CARIN matches with LEV(2) are mostly nonsense) one could also restrict suffixes easily, with supported intersection, one could do other wild things... I'll definitely have to look at this "scary" code as Mike indicated in his blog Cheers, eks On Wed, Aug 10, 2011 at 11:07 AM, Dawid Weiss <dawid.we...@cs.put.poznan.pl> wrote: > > The thing you're describing is a regular composition of automata (as it > exists, for example, when composing clauses of a regular expression). If I > recall right the Levenshtein automaton in Lucene is built on modified brics > code... if so then this should be not a problem. The problem may be that > currently automatons are used in enums in a way that skips from one accepted > sequence to another accepted sequence (if possible). If the automaton has * > operators then there is no way to establish these and everything falls back > to full matching strategy. > Dawid > > On Wed, Aug 10, 2011 at 10:54 AM, eks dev <eks...@yahoo.co.uk> wrote: >> >> Hi Robert, Mike & other FS(A|T) gurus, >> >> a challenge for you ;) >> >> Would it be possible to combine these brilliant peaces of functionality >> with normal Automaton somehow... >> >> Example to illustrate. >> DirectSpellChecker: >> - where instead of minPrefix, we would specify Regex (other Automaton) >> pfxAutiomaton = Regex("(AB)|(BA)") // e.g. Saying, >> levAutomaton = LevenshteinAutomata("XYZ") >> >> spell(pfxAutomaton, levAutomaton); >> >> would match terms that start with "AB" or "BA" and suffix part are normal >> edit distance matches, like ABXY, with one delete >> This would support wild things, like "enable only transpositions in first >> three characters"... In order to gat these matches today, you need to make >> Lev. Automata with maxDistance = 2 (which is then HUGE space to search >> without prefix)... Or generate more Lev. automata and make union of results >> (expensive to itterate) >> >> Other good use cases are simple to construct... >> >> The most general question, can we support at least concatenation between >> LevenshteinAutomata and normal Automata. Intersection/union would be crazy >> thing as well? Where we would have: >> FilteringAutomata.intersect(LevenshteinAutomata)... but I guess I am >> dreaming with this one, but concatenation sounds doable (at least prefix >> side) >> >> Cheers, >> Eks >> > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org