Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

Robert Muir Sat, 05 Dec 2009 14:11:32 -0800

Hi Ghazal,

I am sorry this one is a bit confusing. I think it is because a lot of
people are working on it (which is great) and a lot of ideas are going back
and forth, causing lots of files to be uploaded, etc.


Can you tell us more about your interest in working with NFA/DFA in Lucene?
I am very curious to hear any use cases you might have, or why you are
interested!

In general, for contributing to Lucene this link is helpful:
http://wiki.apache.org/lucene-java/HowToContribute

It tells you how the patch submission process works, how to get the latest
code from subversion, etc.

On Sat, Dec 5, 2009 at 4:58 PM, Ghazal Gharooni <[email protected]> wrote:

> Hello,
>
> I am new to the community and I am completely confused. Could anybody
> help me understand which part of the code you are working on? How should I
> participate in the work? Thank you!
>
>
>
>
>
> On Sat, Dec 5, 2009 at 1:02 PM, Uwe Schindler (JIRA) <[email protected]> wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>
>> Uwe Schindler updated LUCENE-1606:
>> ----------------------------------
>>
>>     Attachment:     (was: LUCENE-1606-flex.patch)
>>
>> > Automaton Query/Filter (scalable regex)
>> > ---------------------------------------
>> >
>> >                 Key: LUCENE-1606
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>> >             Project: Lucene - Java
>> >          Issue Type: New Feature
>> >          Components: Search
>> >            Reporter: Robert Muir
>> >            Assignee: Robert Muir
>> >            Priority: Minor
>> >             Fix For: 3.1
>> >
>> >         Attachments: automaton.patch, automatonMultiQuery.patch,
>> automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch,
>> automatonWithWildCard.patch, automatonWithWildCard2.patch,
>> BenchWildcard.java, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>> LUCENE-1606-flex.patch, LUCENE-1606-flex.patch, LUCENE-1606-flex.patch,
>> LUCENE-1606-flex.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606.patch, LUCENE-1606.patch, LUCENE-1606.patch,
>> LUCENE-1606_nodep.patch
>> >
>> >
>> > Attached is a patch for an AutomatonQuery/Filter (name can change if it's
>> not suitable).
>> > Whereas the out-of-box contrib RegexQuery is nice, I have some very
>> large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes,
>> etc. Additionally all of the existing RegexQuery implementations in Lucene
>> are really slow if there is no constant prefix. This implementation does not
>> depend upon constant prefix, and runs the same query in 640ms.
>> > Some use cases I envision:
>> >  1. lexicography/etc on large text corpora
>> >  2. looking for things such as urls where the prefix is not constant
>> (http:// or ftp://)
>> > The Filter uses the BRICS package (http://www.brics.dk/automaton/) to
>> convert regular expressions into a DFA. Then, the filter "enumerates" terms
>> in a special way, by using the underlying state machine. Here is my short
>> description from the comments:
>> >      The algorithm here is pretty basic. Enumerate terms but instead of
>> a binary accept/reject do:
>> >
>> >      1. Look at the portion that is OK (did not enter a reject state in
>> the DFA)
>> >      2. Generate the next possible String and seek to that.
>> > The Query simply wraps the filter with ConstantScoreQuery.
>> > I did not include the automaton.jar inside the patch but it can be
>> downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
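
For anyone new to the issue, the two steps in the description quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: the class `DfaTermSeekSketch`, the nested `Dfa` type, and the methods `matchingTerms` and `nextPossible` are all hypothetical names, a hand-built transition table stands in for the BRICS automaton, and a `TreeSet<String>` stands in for the sorted term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative sketch only: a tiny hand-rolled DFA, not the BRICS or Lucene API. */
class DfaTermSeekSketch {

    /** Hypothetical DFA: state 0 is the start state; a missing transition means "reject". */
    static final class Dfa {
        final List<TreeMap<Character, Integer>> transitions = new ArrayList<>();
        final boolean[] accept;

        Dfa(int states, int... acceptStates) {
            accept = new boolean[states];
            for (int i = 0; i < states; i++) transitions.add(new TreeMap<>());
            for (int s : acceptStates) accept[s] = true;
        }

        void add(int from, char c, int to) {
            transitions.get(from).put(c, to);
        }
    }

    /** Enumerates the sorted terms, seeking past ranges the DFA can never accept. */
    static List<String> matchingTerms(TreeSet<String> terms, Dfa dfa) {
        List<String> hits = new ArrayList<>();
        String term = terms.isEmpty() ? null : terms.first();
        while (term != null) {
            // Step 1: find the portion that is OK (did not enter a reject state),
            // remembering the DFA state reached after each accepted character.
            int[] states = new int[term.length() + 1]; // states[0] == 0, the start state
            int ok = 0;
            while (ok < term.length()) {
                Integer next = dfa.transitions.get(states[ok]).get(term.charAt(ok));
                if (next == null) break;
                states[++ok] = next;
            }
            if (ok == term.length()) {
                // Whole term consumed: keep it if we ended in an accept state,
                // then simply move on to the next term in sorted order.
                if (dfa.accept[states[ok]]) hits.add(term);
                term = terms.higher(term);
                continue;
            }
            // Step 2: generate the next possible string and seek to it.
            String target = nextPossible(term, ok, states, dfa);
            term = (target == null) ? null : terms.ceiling(target);
        }
        return hits;
    }

    /**
     * Smallest string greater than {@code term} that the DFA does not reject outright:
     * keep the OK prefix and bump the failing character to the next one that has a
     * transition, backing up one position at a time when no such character exists.
     * Returns null when nothing greater can match.
     */
    static String nextPossible(String term, int ok, int[] states, Dfa dfa) {
        for (int pos = ok; pos >= 0; pos--) {
            Character c = dfa.transitions.get(states[pos]).higherKey(term.charAt(pos));
            if (c != null) return term.substring(0, pos) + c;
        }
        return null;
    }
}
```

With a four-state DFA for `(b|c)at` and the terms apple, bat, bath, bay, cat, cot, dog, zebra, this loop seeks from "apple" straight to "bat", from "bath" straight to "cat" (never visiting "bay"), and stops after "cot" without touching "dog" or "zebra". That skipping is why the approach does not depend on a constant prefix.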
>>
>


-- 
Robert Muir
[email protected]
