Hi -
First, a little background. I'm implementing regular expression
search capabilities within Lucene:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/regex/
I've made it pluggable such that any regular expression
implementation can be used. One detail that is very desirable within
Lucene when doing multi-term queries such as wildcard, fuzzy, or
regular expression matching is to narrow the number of terms
enumerated. Because terms (think of these as simply words) are stored
in lexicographical order, picking a good starting point is crucial for
performance. The most naive implementation of such enumeration visits
every term in the index, which can be a real performance killer.
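
To illustrate the idea, here is a minimal sketch that uses a plain java.util.TreeSet as a stand-in for the sorted term dictionary. This is not Lucene's API, and the class and method names are made up for illustration. It seeks to the first term at or after the prefix and stops as soon as a term no longer starts with it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixEnumerationSketch {
    /**
     * Hypothetical helper: collect all terms that start with the given prefix,
     * visiting as few terms as possible.
     */
    public static List<String> termsWithPrefix(SortedSet<String> terms, String prefix) {
        List<String> result = new ArrayList<>();
        // tailSet seeks straight to the first term >= prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can start with the prefix
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("bar", "foo", "foobar", "food", "fop", "zebra"));
        System.out.println(termsWithPrefix(terms, "foo")); // prints [foo, foobar, food]
    }
}
```

Only the terms sharing the prefix (plus one more to detect the end of the range) are visited; the regular expression itself then only needs to be run against those candidates.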
For a regular expression such as "foo.*" it is desirable to have the
prefix "foo" to speed up term enumeration (yes, I know that "foo.*"
can match anywhere in a string, not only at the beginning; I'm
accounting for this in other ways that I can describe if desired).
Jakarta Regexp holds this prefix in a package-private internal field,
REProgram.prefix. I have written a small gateway hack to get at it:

    package org.apache.regexp;

    /**
     * This class exists as a gateway to access useful Jakarta Regexp
     * package-private data.
     */
    public class RegexpTunnel {
        public static char[] getPrefix(RE regexp) {
            REProgram program = regexp.getProgram();
            return program.prefix;
        }
    }

Would it be possible to add a public getter to return this prefix?
I realize that Jakarta Regexp is not very actively maintained, so I'm
also curious whether other regex implementations can provide this
handy prefix, or whether anyone has suggestions along these lines. I
did not see this capability in ORO or java.util.regex.
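
Failing that, one fallback I can imagine is deriving a literal prefix from the pattern text itself. The following is only a rough sketch under several assumptions: the pattern is matched against the whole term (as with matches(), not find()), default flags are used (no case-insensitive or comments mode), and anything that looks like regex syntax ends the prefix. LiteralPrefixSketch and literalPrefix are names made up for illustration; they are not part of any library.

```java
public class LiteralPrefixSketch {
    // Characters treated as regex syntax; everything else is taken literally.
    private static final String META = "\\[](){}.*+?^$|";

    /**
     * Hypothetical helper: returns a literal prefix that every full match of
     * the pattern must start with. Deliberately conservative; it may return a
     * shorter prefix than is strictly possible, or the empty string.
     */
    public static String literalPrefix(String pattern) {
        // Any '|' might be a top-level alternation, so give up on the prefix.
        if (pattern.indexOf('|') >= 0) {
            return "";
        }
        int end = 0;
        while (end < pattern.length() && META.indexOf(pattern.charAt(end)) < 0) {
            end++;
        }
        // A following '*', '?' or '{' can make the last literal optional,
        // so step back one full code point.
        if (end > 0 && end < pattern.length() && "*?{".indexOf(pattern.charAt(end)) >= 0) {
            end = pattern.offsetByCodePoints(end, -1);
        }
        return pattern.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(literalPrefix("foo.*"));   // prints foo
        System.out.println(literalPrefix("foo?bar")); // prints fo
        System.out.println(literalPrefix("a|foo").isEmpty()); // prints true
    }
}
```

This is obviously much cruder than what a compiled program can report, which is why a real accessor on the regex implementation would be preferable.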
Thanks,
Erik