Could you show us some examples of the kinds of things
you're using regex for? I.e. the raw text and the regex you
use to match the example?

The reason I ask is that perhaps there are other approaches,
especially some clever analysis at index time.

For instance, perhaps NGrams are an option. Perhaps just letting
WordDelimiterFilterFactory do its tricks would be enough. Perhaps...

In other words, this could be an "XY problem"....
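
To make the NGram idea a bit more concrete, here is a rough sketch in plain
Python (not Lucene or Solr code). It assumes every match must contain some
known literal, and the function names and grant-style identifiers are made up
for illustration. The trigram index is built once at index time, so the regex
only has to run over the few terms that contain the literal:

```python
import re
from collections import defaultdict


def trigrams(term):
    """Return the set of 3-character substrings of a term."""
    return {term[i:i + 3] for i in range(len(term) - 2)}


def build_ngram_index(terms):
    """Map each trigram to the terms containing it (done once, at index time)."""
    index = defaultdict(set)
    for term in terms:
        for gram in trigrams(term):
            index[gram].add(term)
    return index


def search(index, all_terms, literal, pattern):
    """Narrow candidates with the literal's trigrams, then apply the regex."""
    grams = trigrams(literal)
    if not grams:
        # Literal shorter than 3 characters: no narrowing is possible.
        candidates = set(all_terms)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    regex = re.compile(pattern)
    return sorted(t for t in candidates if regex.fullmatch(t))


# Hypothetical terms; the identifiers are invented for this example.
terms = ["NNX09AB12G", "NNX10AC34G", "galaxy", "AST-0908246", "redshift"]
index = build_ngram_index(terms)
print(search(index, terms, "NNX", r"NNX\d{2}[A-Z]{2}\d{2}G"))
```

Whether something like this helps depends entirely on whether your patterns
contain a usable literal, which is why the examples would be useful.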


On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir wrote:
> On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker wrote:
>> Hi,
>> I am trying to provide a means to search our corpus of nearly 2
>> million fulltext astronomy and physics articles using regular
>> expressions. A small percentage of our users need to be able to
>> locate, for example, certain types of identifiers that are present
>> within the fulltext (grant numbers, dataset identifiers, etc.).
>> My straightforward attempts to do this using RegexQuery have been
>> successful only in the sense that I get the results I'm looking for.
>> The performance, however, is pretty terrible, with most queries taking
>> five minutes or longer. Is this the performance I should expect
>> considering the size of my index and the massive number of terms? Are
>> there any alternative approaches I could try?
>> Things I've already tried:
>>  * reducing the sheer number of terms by adding a LengthFilter,
>> min=6, to my index analysis chain
>>  * swapping in the JakartaRegexpCapabilities
>> Things I intend to try if no one has any better suggestions:
>>  * chunk up the index and search concurrently, either by sharding or
>> using a RangeQuery based on document id
>> Any suggestions appreciated.
> This RegexQuery is not really scalable in my opinion; it's always
> linear in the number of terms except in super-rare circumstances where
> it can compute a "common prefix" (and slow to boot).
> You can instead try svn trunk's RegexpQuery (don't forget the "p")
> from lucene core (it works from the queryparser: /[ab]foo/,
> myfield:/bar/, etc.)
> The performance is faster, but keep in mind it's only as good as the
> regular expressions: if the regular expressions are like /.*foo.*/,
> then it's just as slow as a wildcard of *foo*.
> --
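
To illustrate the "common prefix" point above, here is a small plain-Python
sketch (not Lucene code; the function names and the term list are invented for
the example). It assumes the term dictionary is a sorted list. Without a known
prefix every term has to be tested, but with one a binary search jumps to the
first candidate and the scan stops as soon as the prefix no longer matches:

```python
import bisect
import re


def scan_all(sorted_terms, pattern):
    """No usable prefix: the regex is tested against every term."""
    regex = re.compile(pattern)
    return [t for t in sorted_terms if regex.fullmatch(t)]


def scan_with_prefix(sorted_terms, prefix, pattern):
    """Known prefix: seek to it, then test only the contiguous matching range."""
    regex = re.compile(pattern)
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can share the prefix
        if regex.fullmatch(term):
            matches.append(term)
    return matches


terms = sorted(["bar", "barfoo", "foo", "foobar", "foobaz", "zfoo"])
print(scan_with_prefix(terms, "foo", r"foo.*"))  # visits only the "foo..." range
print(scan_all(terms, r".*foo.*"))               # visits every term
```

A pattern like /.*foo.*/ has no prefix to seek to, so it always falls into the
first case.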
