On Thursday 13 October 2005 01:44, Erik Hatcher wrote:
> I've developed normal and span-based Query implementations that use
> regex to match index terms rather than the simplified WildcardQuery.
> This allows for queries like "abc[0-9]xyz" that would match abc1xyz,
> but not abc12xyz for example.
>
> I've seen a lot of interest lately in being able to do a phrase query
> with a nested wildcard term inside, such as "the q.*k brown f.x". I
> turn a query like that into a SpanNearQuery of SpanTermQuery("the"),
> SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery
> ("f.x") with a slop of 0.
>
> The code is fairly minimal thanks to the wonderful infrastructure
> already provided. I'm ready to contribute it to Lucene. The
> question is, where? Should this be part of the core? Or should it
> reside in a contrib area? If in contrib, shall it be a new area
> called "regex" perhaps, or "regex-query"?
>
> I'm inclined to put it in the core, so if I don't hear otherwise I'll
> start with it there.
>
> The main negative to this query, just like with WildcardQuery and
> FuzzyQuery, is the possible performance issue. However, just like
> WildcardQuery, this really depends on how clever the indexing side of
> things is and matching that cleverness with an appropriate regex. My
> actual use of these queries involves doing overlapped rotated term
> indexing and also rotating the query term to have the best possible
> prefix for term enumeration. Naive use of this query using ".*foo"
> of course will have the same impact as WildcardQuery using *foo - and
> perhaps slightly slower with regex matching involved.
>
> Overall, I think it is a good addition and will allow users to be
> more expressive than the lower-level MultiPhraseQuery (aka
> PhrasePrefixQuery).
>
> Thoughts?
In the surround language, this was done by splitting the query term
into a fixed prefix and a remainder starting with a truncation character.
For this remainder a regular expression is built and used.
The prefix is used to limit the number of terms fed to the regular expression
matcher. The code is in SrndTruncQuery.java here:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/surround/src/java/org/apache/lucene/queryParser/surround/query/
So, with an addition to the javadocs that the length of the prefix is
important for performance, I think a regular expression based query term
would be very useful, especially when combined with an analyzer that does
appropriate term rotation.
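
To illustrate the prefix/remainder split described above, here is a rough
sketch in plain Java. It is not the SrndTruncQuery code: the class and method
names are made up for this example, a TreeSet stands in for the index's
sorted term enumeration, and I assume '*' (any run of characters) and '?'
(exactly one character) as the truncation characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Lucene.
public class PrefixRegexSketch {

    /** Length of the leading run of characters before the first truncation character. */
    static int prefixLength(String truncated) {
        int i = 0;
        while (i < truncated.length()
                && truncated.charAt(i) != '*'
                && truncated.charAt(i) != '?') {
            i++;
        }
        return i;
    }

    /** Builds a regex for the remainder: '*' becomes ".*", '?' becomes ".", the rest is quoted. */
    static Pattern remainderPattern(String remainder) {
        StringBuilder sb = new StringBuilder();
        for (char c : remainder.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Visits only the terms that share the fixed prefix, then applies the regex to their tails. */
    static List<String> matchTerms(SortedSet<String> terms, String truncated) {
        int n = prefixLength(truncated);
        String prefix = truncated.substring(0, n);
        Pattern tail = remainderPattern(truncated.substring(n));
        List<String> hits = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so start
        // at the prefix and stop at the first term that no longer has it.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (tail.matcher(term.substring(n)).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(List.of(
                "abacus", "abc12xyz", "abc1xyz", "abcaxyz", "abd1xyz", "zebra"));
        System.out.println(matchTerms(terms, "abc?xyz")); // [abc1xyz, abcaxyz]
        System.out.println(matchTerms(terms, "abc*xyz")); // [abc12xyz, abc1xyz, abcaxyz]
    }
}
```

With an empty prefix (a query such as "*foo") the loop visits every term,
which is the slow case mentioned in the original message.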
Regards,
Paul Elschot