On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker <lb...@reallywow.com> wrote:
> Hi,
>
> I am trying to provide a means to search our corpus of nearly 2
> million fulltext astronomy and physics articles using regular
> expressions. A small percentage of our users need to be able to
> locate, for example, certain types of identifiers that are present
> within the fulltext (grant numbers, dataset identifiers, etc).
>
> My straightforward attempts to do this using RegexQuery have been
> successful only in the sense that I get the results I'm looking for.
> The performance, however, is pretty terrible, with most queries taking
> five minutes or longer. Is this the performance I should expect
> considering the size of my index and the massive number of terms? Are
> there any alternative approaches I could try?
>
> Things I've already tried:
> * reducing the sheer number of terms by adding a LengthFilter,
>   min=6, to my index analysis chain
> * swapping in the JakartaRegexpCapabilities
>
> Things I intend to try if no one has any better suggestions:
> * chunk up the index and search concurrently, either by sharding or
>   using a RangeQuery based on document id
>
> Any suggestions appreciated.
RegexQuery is not really scalable, in my opinion: its cost is always linear in the number of terms (and it's slow to boot), except in super-rare circumstances where it can compute a "common prefix".

You can instead try svn trunk's RegexpQuery <-- don't forget the "p" -- from lucene core (it works from the query parser: /[ab]foo/, myfield:/bar/, etc). The performance is much better, but keep in mind it's only as good as the regular expression you give it: if the regular expression is something like /.*foo.*/, then it's just as slow as the wildcard *foo*.

--
lucidimagination.com
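To illustrate why the shape of the regular expression matters so much, here is a small, hypothetical Python sketch (not Lucene code; the term list and helper names are made up for illustration). It models the term dictionary as a sorted list, the way terms are stored in an index: a pattern with a literal prefix can seek directly to that prefix and stop early, while a pattern with a leading wildcard must test every term.

```python
import bisect

# Toy model of a sorted term dictionary (real indexes hold millions of terms).
terms = sorted(["alpha", "barfoo", "bfoo", "foobar", "fooqux", "zeta"])

def match_with_prefix(prefix, pred):
    """For a pattern with a literal prefix (e.g. /foo.*/): binary-search to
    the prefix, then scan only the terms that share it -- sub-linear."""
    start = bisect.bisect_left(terms, prefix)
    hits = []
    for t in terms[start:]:
        if not t.startswith(prefix):
            break  # sorted order: no later term can match the prefix
        if pred(t):
            hits.append(t)
    return hits

def match_full_scan(pred):
    """For a pattern like /.*foo.*/ with no usable prefix: every term in the
    dictionary must be tested -- linear, just like the wildcard *foo*."""
    return [t for t in terms if pred(t)]

print(match_with_prefix("foo", lambda t: t.startswith("foo")))
# -> ['foobar', 'fooqux']  (only 3 terms examined)
print(match_full_scan(lambda t: "foo" in t))
# -> ['barfoo', 'bfoo', 'foobar', 'fooqux']  (all 6 terms examined)
```

The same intuition carries over to the automaton-based RegexpQuery: it can skip ahead in the term dictionary only when the automaton constrains what the term can start with.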