On Thursday 24 November 2005 10:25, Erik Hatcher wrote:
>
> On 24 Nov 2005, at 03:17, Paul Elschot wrote:
> >> I must admit that I haven't used the surround parser. For my custom
> >> parser (a legacy syntax that no one here would want), I take any term
> >> that has an *, ?, or [...] syntax as a regex term.
> >
> > I had another look at the javadocs of java regex package.
> > The normal brackets in a regex are not needed for queries, so they
> > can be left as they are.
>
> I don't understand what you mean about brackets not being needed.
> Brackets certainly factor in greatly into regular expressions:
>
> <http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html>
>
> What am I missing?
Capturing groups and special constructs need normal brackets ().
Capturing groups are used for replacements, and I don't see a use
for that in a query language.
The special constructs written with () brackets are non-capturing groups,
match flags, and lookahead/lookbehind.
Would you know a use for these in a query language?
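
To make the point about capturing concrete, here is a minimal sketch using
only java.util.regex. The class and method names (GroupDemo, accepts) are
made up for illustration. It only shows that a capturing group and its
non-capturing form accept the same terms, and that the captured text shows
up only in a replacement:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Lucene or the surround parser.
class GroupDemo {
    /** True when the whole term matches the regex, which is all a query needs to know. */
    static boolean accepts(String regex, String term) {
        return Pattern.matches(regex, term);
    }

    public static void main(String[] args) {
        System.out.println(accepts("(ab)+c", "ababc"));    // capturing group
        System.out.println(accepts("(?:ab)+c", "ababc"));  // non-capturing group
        // The captured text only becomes visible once a replacement refers to it.
        System.out.println("ababc".replaceAll("(ab)+c", "$1"));
    }
}
```

Note that the brackets still do grouping work here (the + applies to "ab"),
so this says nothing against () as such, only against needing the capture.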
> >> There are still some TODO's with the (Span)RegexQuery - such as being
> >> wise about the prefix length. Right now it is not wise enough. I've
> >> spent some time looking for a regex parser that could parse a regex
> >> expression into an AST so that it could be used for determining the
> >> last static character to start term enumeration. This would also
> >> come in very handy in being able to rotate a regular expression
> >> string to maximize the static prefix when indexing with an analyzer
> >> that rotates terms. If anyone has suggestions/pointers to how this
> >> could be accomplished, it'd be most appreciated!
> >
> > I think I'll simply treat each term as a potential regex and
> > use alphanumeric characters for the prefix. I'll try and leave
> > parsing of the regex to the java regex package as much as possible.
>
> Simply using alphanumeric to determine the prefix doesn't work in all
> cases though. Here's an example I just added, commented out, to
> TestRegexQuery:
>
> assertEquals(1, regexQueryNrHits("r?over"));
>
> The 'r' is optional, yet it is chosen as a prefix and thus no
> documents match even though "over" does match that regex. The quick
You're right, I missed that ? is a suffix operator that makes the
preceding character optional.
> fix for this is, like FuzzyQuery, to enumerate all terms blindly.
> Another option is to allow the user of (Span)RegexQuery to provide
> the prefix, at least pushing the burden back a bit.
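
A rough sketch of a more careful prefix guess for the r?over case, short of
a real regex parser. SafePrefix is a hypothetical helper, not something in
(Span)RegexQuery, and it assumes the regex must match the whole term. It
takes the leading letters and digits, drops the last one when a ?, * or {
quantifier follows, and gives up entirely when the regex contains | :

```java
// Hypothetical helper for illustration; not part of Lucene's RegexQuery.
class SafePrefix {
    /** A leading run of letters/digits that every matching term must start with. */
    static String of(String regex) {
        // Any alternation may offer a branch that starts differently, so be conservative.
        if (regex.indexOf('|') >= 0) {
            return "";
        }
        int i = 0;
        while (i < regex.length() && Character.isLetterOrDigit(regex.charAt(i))) {
            i++;
        }
        if (i > 0 && i < regex.length()) {
            char next = regex.charAt(i);
            // These quantifiers allow zero occurrences of the preceding character.
            if (next == '?' || next == '*' || next == '{') {
                i--;
            }
        }
        return regex.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println("[" + of("r?over") + "]");   // empty: enumerate all terms
        System.out.println("[" + of("ro?ver") + "]");   // only the r is certain
        System.out.println("[" + of("rover.*") + "]");  // the whole literal run
    }
}
```

This is deliberately conservative: it can return a shorter prefix than a
parse tree would allow, but it should never exclude a matching term under
the stated assumption.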
>
> I still think a regex parser would be mighty useful for this issue,
> as well as the next...
>
> > Rotating from the suffix should also be straightforward for
> > alphanumerical chars.
>
> It's not as trivial as this, if you want the rotated prefixes to be
> maximized. Take, for example, the regex expression
>
> ThisContainsTwo[abc]RegexPieces.*Total
>
> Using rotation to maximize the prefix and speed up term
> enumeration, a calculation of the maximum non-regex piece is
> needed, including a calculation on whether the head and tail combined
> make a larger prefix. For example, using '$' to denote the end of
> the string, the rotated version of this should be:
>
> Total$ThisContainsTwo[abc]RegexPieces.*
>
> With a regex parse tree, it should be possible to be wise about what
> is a static prefix and to compute the size of all the static pieces
> allowing for clever rotation to make regex queries as efficient as
> possible. Now where is that regex parser grammar?! :/
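
For the simple alphanumeric case, the rotation in the example above could
be sketched as follows. RegexRotation and its methods are hypothetical
names, '$' is used as the end-of-term marker only because the example does,
and the code assumes that marker never occurs in an indexed term. It
refuses to rotate when the trailing run might belong to an escape, or when
the regex contains | or a (? construct:

```java
// Hypothetical sketch; a real solution would work on a regex parse tree.
class RegexRotation {
    // End-of-term marker taken from the example; assumed absent from indexed terms.
    static final char END = '$';

    /** Trailing run of letters/digits, or "" when it cannot be trusted as literal text. */
    static String staticTail(String regex) {
        if (regex.indexOf('|') >= 0 || regex.contains("(?")) {
            return "";  // alternation or flags/lookaround: stay conservative
        }
        int start = regex.length();
        while (start > 0 && Character.isLetterOrDigit(regex.charAt(start - 1))) {
            start--;
        }
        if (start > 0 && regex.charAt(start - 1) == '\\') {
            return "";  // the run may be part of an escape sequence
        }
        return regex.substring(start);
    }

    /** Moves the static tail in front of the rest, separated by the end marker. */
    static String rotate(String regex) {
        String tail = staticTail(regex);
        if (tail.isEmpty()) {
            return regex;
        }
        return tail + END + regex.substring(0, regex.length() - tail.length());
    }

    /** Static prefix of the rotated form: tail, marker, then the static head. */
    static String rotatedPrefix(String regex) {
        String tail = staticTail(regex);
        String head = regex.substring(0, regex.length() - tail.length());
        int i = 0;
        while (i < head.length() && Character.isLetterOrDigit(head.charAt(i))) {
            i++;
        }
        if (i > 0 && i < head.length() && "?*{".indexOf(head.charAt(i)) >= 0) {
            i--;  // a quantifier makes the last head character optional
        }
        String lead = head.substring(0, i);
        return tail.isEmpty() ? lead : tail + END + lead;
    }

    public static void main(String[] args) {
        String regex = "ThisContainsTwo[abc]RegexPieces.*Total";
        System.out.println(rotate(regex));         // Total$ThisContainsTwo[abc]RegexPieces.*
        System.out.println(rotatedPrefix(regex));  // Total$ThisContainsTwo
    }
}
```

This only handles a static head and tail. Finding the largest static piece
in the middle of the expression still needs the parse tree.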
I also missed things like \u2014, which only add to the problem.
There are some older regex implementations in Java, but I
have no idea about the licences and the availability.
Doesn't Apache have one somewhere?
Without a regex parser, a prefix regex operator like
PRE(prefix,regex)
in the surround parser would do just what is needed.
The only problem characters are the comma and the brackets, and a
regular expression can work around these by using \xhh or \0nn .
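
A sketch of what such a PRE(prefix,regex) could do, using a sorted set of
strings in place of the index's term enumeration. PrefixRegex and pre() are
made-up names, the PRE operator does not exist in the surround parser, and
I am assuming here that the regex is matched against the whole term while
the prefix only limits the enumeration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch; a TreeSet stands in for the sorted terms of an index.
class PrefixRegex {
    /** Terms that start with the user-supplied prefix and match the regex as a whole. */
    static List<String> pre(SortedSet<String> terms, String prefix, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so start there.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;  // past the last term with this prefix
            }
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("over");
        terms.add("rover");
        terms.add("roving");
        terms.add("a,b");
        // Empty prefix: every term is enumerated, as FuzzyQuery does.
        System.out.println(pre(terms, "", "r?over"));
        // A comma inside the regex written as \x2c, so it cannot clash with PRE's own comma.
        System.out.println(pre(terms, "a", "a\\x2cb"));
    }
}
```

The same escape trick works for the brackets with \x28 and \x29, or in
octal with \050 and \051.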
Btw. $ also has a special meaning in regexes.
Regards,
Paul Elschot