I've developed normal and span-based Query implementations that match
index terms with regular expressions rather than the simpler
WildcardQuery syntax. This allows queries like "abc[0-9]xyz", which
matches abc1xyz but not abc12xyz.
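
To make the matching rule concrete, here is a minimal sketch using only
java.util.regex. It assumes the regex must match the whole term rather
than a substring of it, which is what the abc12xyz example implies. The
class and method names are made up for illustration and are not the
contributed code.

```java
import java.util.regex.Pattern;

// Illustrative only: TermRegexDemo and matchesTerm are hypothetical names.
class TermRegexDemo {
    // True only when the entire term matches the regex (no partial matches).
    static boolean matchesTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesTerm("abc[0-9]xyz", "abc1xyz"));  // true
        System.out.println(matchesTerm("abc[0-9]xyz", "abc12xyz")); // false
    }
}
```
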
I've seen a lot of interest lately in being able to do a phrase query
with a nested wildcard term inside, such as "the q.*k brown f.x". I
turn a query like that into a SpanNearQuery of SpanTermQuery("the"),
SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and
SpanPatternQuery("f.x") with a slop of 0.
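
The decomposition can be sketched without any Lucene dependency. The
code below only decides, per whitespace-separated token, whether it
would become a plain term clause or a pattern clause, and returns the
clause descriptions in phrase order. The metacharacter test is an
assumption of this sketch, not necessarily how the real query parsing
decides, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: PhraseClauseSketch is a hypothetical helper.
class PhraseClauseSketch {
    // Assumption: a token is treated as a pattern if it contains any of these.
    private static final String META = ".*+?[](){}|\\^$";

    static boolean looksLikePattern(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (META.indexOf(token.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // Describes, in order, the clause each token of the phrase would become.
    static List<String> describeClauses(String phrase) {
        List<String> clauses = new ArrayList<>();
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return clauses; // nothing to decompose
        }
        for (String token : trimmed.split("\\s+")) {
            String kind = looksLikePattern(token) ? "SpanPatternQuery" : "SpanTermQuery";
            clauses.add(kind + "(\"" + token + "\")");
        }
        return clauses;
    }
}
```

In a real implementation the resulting clauses would be handed, in the
same order, to a SpanNearQuery with a slop of 0 so that the terms must
be adjacent.
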
The code is fairly minimal thanks to the wonderful infrastructure
already provided. I'm ready to contribute it to Lucene. The
question is, where? Should this be part of the core? Or should it
reside in a contrib area? If in contrib, shall it be a new area
called "regex" perhaps, or "regex-query"?
I'm inclined to put it in the core, so if I don't hear otherwise I'll
start with it there.
The main drawback of this query, as with WildcardQuery and
FuzzyQuery, is the potential performance cost. However, as with
WildcardQuery, this really depends on how clever the indexing side of
things is and on matching that cleverness with an appropriate regex.
My actual use of these queries involves overlapped rotated term
indexing and also rotating the query term to have the best possible
prefix for term enumeration. Naive use of this query with ".*foo"
will of course have the same impact as WildcardQuery with *foo, and
will perhaps be slightly slower because of the regex matching.
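
To illustrate why the leading part of the regex matters, here is a
rough sketch that extracts the literal prefix a term enumeration could
seek to before any regex matching is done. It is deliberately
conservative and is an assumption of mine about one way to do this, not
the contributed code: it gives up on alternation, escapes and anchors,
and it does not attempt the rotated indexing itself. The names are
hypothetical.

```java
// Illustrative only: RegexPrefixSketch is a hypothetical helper.
class RegexPrefixSketch {
    private static final String META = ".*+?[](){}|\\^$";

    // Returns leading literal characters that every matching term must start with.
    static String literalPrefix(String regex) {
        // Alternation means there may be no single required prefix; give up.
        if (regex.indexOf('|') >= 0) {
            return "";
        }
        int i = 0;
        while (i < regex.length() && META.indexOf(regex.charAt(i)) < 0) {
            i++;
        }
        // A following *, ? or {..} may allow zero copies of the last literal,
        // so that literal is not guaranteed to appear; drop it to be safe.
        if (i > 0 && i < regex.length() && "*?{".indexOf(regex.charAt(i)) >= 0) {
            i = regex.offsetByCodePoints(i, -1);
        }
        return regex.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println("[" + literalPrefix("abc[0-9]xyz") + "]"); // [abc]
        System.out.println("[" + literalPrefix(".*foo") + "]");       // []
    }
}
```

An empty prefix, as with ".*foo", means every term in the field has to
be enumerated and tested, which is the naive case described above.
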
Overall, I think it is a good addition and will allow users to be
more expressive than the lower-level MultiPhraseQuery (aka
PhrasePrefixQuery).
Thoughts?
Erik