[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Robert Muir (JIRA) Sat, 21 Nov 2009 10:21:04 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated LUCENE-1606:
--------------------------------

    Attachment: BenchWildcard.java

Attached is a benchmark that generates random wildcard queries.
It builds an index of 10 million docs, each with a term from 0 to 10 million.
It takes a pattern such as N?N?N?N? and substitutes a random digit for each 
N, leaving the wildcards in place.
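
The substitution step described above might look roughly like this. It is a minimal sketch, not code from BenchWildcard.java: the class name WildcardPatternFiller and the method fill are hypothetical, and the fixed seed is only there to make runs repeatable.

```java
import java.util.Random;

// Hypothetical helper for illustration; not taken from BenchWildcard.java.
final class WildcardPatternFiller {
    private WildcardPatternFiller() {}

    /** Replaces every 'N' in the template with a random digit; '?' and '*' are kept as-is. */
    static String fill(String template, Random random) {
        StringBuilder sb = new StringBuilder(template.length());
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            sb.append(c == 'N' ? (char) ('0' + random.nextInt(10)) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // assumption: fixed seed for repeatable runs
        for (String template : new String[] {"N?N?N?N", "?NNNNNN", "NN?N*", "*N"}) {
            System.out.println(template + " -> " + fill(template, random));
        }
    }
}
```

Each generated string can then be handed to the wildcard query under test.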

||Pattern||Iter||AvgHits||AvgMS (old)||AvgMS (new)||
|N?N?N?N|10|1000.0|288.6|38.5|
|?NNNNNN|10|10.0|2453.1|6.4|
|??NNNNN|10|100.0|2484.2|10.1|
|???NNNN|10|1000.0|2821.3|47.8|
|????NNN|10|10000.0|2346.9|299.8|
|NN??NNN|10|100.0|34.8|6.3|
|NN?N*|10|10000.0|26.5|9.4|
|?NN*|10|100000.0|2009.0|73.5|
|*N|10|1000000.0|6837.4|6087.9|
|NNNNN??|10|100.0|1.9|2.3|
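
The AvgHits column is consistent with every term being zero-padded to seven digits. The message does not state this, so it is an assumption here. Under that assumption each N pins one position, and every other position (a ? or a position absorbed by *) can be any of ten digits, so a pattern with k N's matches 10^(7-k) terms. A small sketch of that arithmetic, with a hypothetical class name, valid only for patterns that fit the term width and contain at most one *:

```java
// Hypothetical helper for illustration; assumes fixed-width, zero-padded numeric terms.
final class ExpectedHits {
    private ExpectedHits() {}

    /** Number of termWidth-digit terms matched by a template built from 'N', '?' and at most one '*'. */
    static long expectedHits(String template, int termWidth) {
        int pinned = 0;
        for (int i = 0; i < template.length(); i++) {
            if (template.charAt(i) == 'N') {
                pinned++; // an N becomes one concrete digit
            }
        }
        long hits = 1;
        for (int i = 0; i < termWidth - pinned; i++) {
            hits *= 10; // every unpinned position may hold any digit
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String template : new String[] {"N?N?N?N", "NN?N*", "?NN*", "*N"}) {
            System.out.println(template + " -> " + expectedHits(template, 7));
        }
    }
}
```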

I would like to incorporate part of this logic into the JUnit tests, perhaps 
on a smaller index, because it's how I found the recursion bug.


> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
>                 Key: LUCENE-1606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: automaton.patch, automatonMultiQuery.patch, 
> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
> automatonWithWildCard.patch, automatonWithWildCard2.patch, 
> BenchWildcard.java, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, 
> LUCENE-1606.patch, LUCENE-1606_nodep.patch
>
>
> Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
> suitable).
> Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
> indexes (100M+ unique tokens) where queries are quite slow, on the order of 
> 2 minutes. Additionally, all of the existing RegexQuery implementations in 
> Lucene are really slow if there is no constant prefix. This implementation 
> does not depend on a constant prefix, and runs the same query in 640ms.
> Some use cases I envision:
>  1. lexicography/etc on large text corpora
>  2. looking for things such as urls where the prefix is not constant (http:// 
> or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
> regular expressions into a DFA. Then, the filter "enumerates" terms in a 
> special way, by using the underlying state machine. Here is my short 
> description from the comments:
>      The algorithm here is pretty basic. Enumerate terms but instead of a 
> binary accept/reject do:
>       
>      1. Look at the portion that is OK (did not enter a reject state in the 
> DFA)
>      2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch but it can be downloaded 
> from http://www.brics.dk/automaton/ and is BSD-licensed.
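
The two-step enumeration quoted above (keep the portion of the term that is OK, generate the next possible string, seek to it) can be illustrated with a deliberately simplified sketch. This is not the patch's code and does not use the BRICS automaton: it handles only fixed-length patterns made of digits and ?, stands in a java.util.TreeSet for the sorted term dictionary, and the class name SeekingWildcardEnum and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Simplified illustration: fixed-length patterns of digits and '?' only.
final class SeekingWildcardEnum {
    private final String pattern;

    SeekingWildcardEnum(String pattern) { this.pattern = pattern; }

    // Smallest and largest character allowed at position i of a matching term.
    private char lo(int i) { char p = pattern.charAt(i); return p == '?' ? '0' : p; }
    private char hi(int i) { char p = pattern.charAt(i); return p == '?' ? '9' : p; }

    /** Step 1: length of the leading portion of term that has not hit a reject state. */
    int acceptedPrefix(String term) {
        int k = 0;
        while (k < term.length() && k < pattern.length()
                && term.charAt(k) >= lo(k) && term.charAt(k) <= hi(k)) {
            k++;
        }
        return k;
    }

    boolean matches(String term) {
        return term.length() == pattern.length() && acceptedPrefix(term) == term.length();
    }

    /** Step 2: smallest matching string that sorts at or after term, or null if none exists. */
    String nextCandidate(String term) {
        int n = pattern.length();
        int k = acceptedPrefix(term);
        if (k == n && term.length() == n) return term; // term itself matches
        StringBuilder sb = new StringBuilder(n);
        if (k < n && (k == term.length() || term.charAt(k) < lo(k))) {
            sb.append(term, 0, k); // the OK portion can be extended in place
        } else {
            int j = k - 1; // otherwise bump the last position that still has room
            while (j >= 0 && term.charAt(j) >= hi(j)) j--;
            if (j < 0) return null; // nothing larger can match
            sb.append(term, 0, j).append((char) (term.charAt(j) + 1));
        }
        for (int i = sb.length(); i < n; i++) sb.append(lo(i)); // smallest possible tail
        return sb.toString();
    }

    /** Walks the sorted terms, seeking past ranges that cannot contain a match. */
    List<String> collect(TreeSet<String> terms) {
        List<String> hits = new ArrayList<>();
        String seek = nextCandidate("");
        while (seek != null) {
            String term = terms.ceiling(seek); // "seek" in the term dictionary
            if (term == null) break;
            if (matches(term)) {
                hits.add(term);
                seek = term + '\0'; // smallest string sorting after term
            } else {
                seek = nextCandidate(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 10000; i++) terms.add(String.format(Locale.ROOT, "%04d", i));
        List<String> hits = new SeekingWildcardEnum("?7?3").collect(terms);
        System.out.println(hits.size() + " hits, first = " + hits.get(0));
    }
}
```

The point of the sketch is that a rejected term costs one seek instead of a scan over every term up to the next match, which is why a leading wildcard need not force a full enumeration. A real DFA-driven version has to handle arbitrary alphabets, loops, and variable-length matches, none of which this sketch attempts.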

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


