Re: regex-based query contribution

Paul Elschot Thu, 13 Oct 2005 00:15:58 -0700

On Thursday 13 October 2005 01:44, Erik Hatcher wrote:
> I've developed normal and span-based Query implementations that use  
> regex to match index terms rather than the simplified WildcardQuery.   
> This allows for queries like "abc[0-9]xyz" that would match abc1xyz,  
> but not abc12xyz for example.
> 
> I've seen a lot of interest lately in being able to do a phrase query  
> with a nested wildcard term inside, such as "the q.*k brown f.x".  I  
> turn a query like that into a SpanNearQuery of SpanTermQuery("the"),  
> SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery 
> ("f.x") with a slop of 0.
> 
> The code is fairly minimal thanks to the wonderful infrastructure  
> already provided.  I'm ready to contribute it to Lucene.  The  
> question is, where?  Should this be part of the core?  Or should it  
> reside in a contrib area?  If in contrib, shall it be a new area  
> called "regex" perhaps, or "regex-query"?
> 
> I'm inclined to put it in the core, so if I don't hear otherwise I'll  
> start with it there.
> 
> The main negative to this query, just like with WildcardQuery and  
> FuzzyQuery, is the possible performance issue.  However, just like  
> WildcardQuery, this really depends on how clever the indexing side of  
> things is and matching that cleverness with an appropriate regex.  My  
> actual use of these queries involves doing overlapped rotated term  
> indexing and also rotating the query term to have the best possible  
> prefix for term enumeration.  Naive use of this query using ".*foo"  
> of course will have the same impact as WildcardQuery using *foo - and  
> perhaps slightly slower with regex matching involved.
> 
> Overall, I think it is a good addition and will allow users to be  
> more expressive than the lower-level MultiPhraseQuery (aka  
> PhrasePrefixQuery).
> 
> Thoughts?


In the surround language, this was done by splitting the query term
into a fixed prefix and a remainder starting with a truncation character.
A regular expression is built from this remainder and used for matching.
The prefix limits the number of terms fed to the regular expression
matcher. The code is in SrndTruncQuery.java here:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query/
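
As a rough illustration of the prefix/remainder split described above, here
is a minimal, self-contained sketch. It is not the SrndTruncQuery code and
does not use the Lucene API: a TreeSet stands in for the sorted term
dictionary, and the class name, method names, and the choice of '*' (any
run of characters) and '?' (exactly one character) as truncation characters
are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch: prefix-limited regex matching over a sorted term set.
public class PrefixRegexSketch {

    // Index of the first truncation character, or the term length if there is none.
    static int prefixLength(String truncated) {
        for (int i = 0; i < truncated.length(); i++) {
            char c = truncated.charAt(i);
            if (c == '*' || c == '?') {
                return i;
            }
        }
        return truncated.length();
    }

    // Translate the remainder: '*' becomes ".*", '?' becomes ".", the rest is literal.
    static Pattern remainderPattern(String remainder) {
        StringBuilder sb = new StringBuilder();
        for (char c : remainder.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Only terms sharing the fixed prefix ever reach the regex matcher.
    static List<String> matchTerms(SortedSet<String> terms, String truncated) {
        int n = prefixLength(truncated);
        String prefix = truncated.substring(0, n);
        Pattern pattern = remainderPattern(truncated.substring(n));
        List<String> hits = new ArrayList<>();
        // Terms are sorted, so those with the prefix form one contiguous run.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the end of the prefix run
            }
            if (pattern.matcher(term.substring(n)).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("abc", "abc1xyz", "abc12xyz", "abd1xyz", "quick"));
        System.out.println(matchTerms(terms, "abc?xyz")); // prints [abc1xyz]
    }
}
```

With an empty prefix (a term that starts with a truncation character) the
loop visits every term, which is the slow case the prefix is meant to avoid.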

So, with an addition to the javadocs noting that the length of the prefix
is important for performance, I think a regular-expression-based query term
would be very useful, especially when combined with an analyzer that does
appropriate term rotation.

Regards,
Paul Elschot



