[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786557#action_12786557 ]
Robert Muir commented on LUCENE-1606: ------------------------------------- here is an explanation of the cleanupPosition, and the cases it handles Symbols: A = \u0000 - \uD7FFF (lower BMP) H = \uD800 - \uDBFFF (high/lead surrogates) L = \uDC00 - \uDFFFF (low/trail surrogates) Z = \uE000 - \uFFFF (upper BMP) case 1: // an illegal combination, where we have not yet enumerated into the supp. plane // so we increment to H + \uDC00 (the lowest possible trail surrogate) HA -> H \uDC00 HH -> H \uDC00 case 2: // an illegal combination where we have already enumerated the supp. plane // we must replace both H and L with \uE000 (the lowest possible "upper BMP") HZ -> \uE000 case 3: // an illegal combination where we have a trailing lead surrogate. // we have not yet enumerated the supp plane, so append \uDC000 (lowest possible trail surrogate) // this is just like case 1, except in final position. H$ -> H \uDC00 case 4/5: // an unpaired low surrogate. this is invalid when not preceded by lead surrogate // (and if there was one, the above rules would have dealt with it already) // in this case we have to bump to \uE0000 (the lowest possible "upper BMP") unpaired L -> \uE000 > Automaton Query/Filter (scalable regex) > --------------------------------------- > > Key: LUCENE-1606 > URL: https://issues.apache.org/jira/browse/LUCENE-1606 > Project: Lucene - Java > Issue Type: New Feature > Components: Search > Reporter: Robert Muir > Assignee: Robert Muir > Priority: Minor > Fix For: 3.1 > > Attachments: automaton.patch, automatonMultiQuery.patch, > automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, > automatonWithWildCard.patch, automatonWithWildCard2.patch, > BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, > LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, > LUCENE-1606_nodep.patch > > > Attached is a patch for an AutomatonQuery/Filter (name can change if its not > suitable). > Whereas the out-of-box contrib RegexQuery is nice, I have some very large > indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. > Additionally all of the existing RegexQuery implementations in Lucene are > really slow if there is no constant prefix. This implementation does not > depend upon constant prefix, and runs the same query in 640ms. > Some use cases I envision: > 1. lexicography/etc on large text corpora > 2. looking for things such as urls where the prefix is not constant (http:// > or ftp://) > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert > regular expressions into a DFA. Then, the filter "enumerates" terms in a > special way, by using the underlying state machine. Here is my short > description from the comments: > The algorithm here is pretty basic. Enumerate terms but instead of a > binary accept/reject do: > > 1. Look at the portion that is OK (did not enter a reject state in the > DFA) > 2. Generate the next possible String and seek to that. > the Query simply wraps the filter with ConstantScoreQuery. > I did not include the automaton.jar inside the patch but it can be downloaded > from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org