Hi -

First a little background. I'm implementing regular expression search capabilities within Lucene:

        http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/regex/

I've made it pluggable so that any regular expression implementation can be used. When running multi-term queries in Lucene, such as wildcard, fuzzy, or regular expression matching, it is very desirable to narrow the number of terms enumerated. Because terms (think of these as simply words) are kept in lexicographical order, picking the best starting point is crucial to performance. The most naive implementations enumerate every term in the index, which can be a real performance killer.
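To illustrate why the starting point matters, here is a minimal sketch. It is not Lucene code: a TreeSet stands in for the sorted term dictionary, and termsWithPrefix is a hypothetical helper name I made up for the example. With a known prefix, enumeration can seek straight to the first candidate term and stop as soon as a term no longer starts with the prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixEnumerationSketch {

    // Hypothetical helper (not part of Lucene): collects the terms that start
    // with the given prefix from a lexicographically sorted set of terms.
    static List<String> termsWithPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet(prefix) skips every term that sorts before the prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                // Terms are sorted, so no later term can carry the prefix either.
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumption: a TreeSet is only a stand-in for the index's term dictionary.
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("bar", "foo", "foobar", "food", "fop", "zebra"));
        System.out.println(termsWithPrefix(terms, "foo")); // prints [foo, foobar, food]
    }
}
```

Only the terms that survive this narrowing would then need to be tested against the full regular expression.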

For a regular expression such as "foo.*" it is desirable to have the prefix "foo" to speed up term enumeration (yes, I know that "foo.*" can match "foo" anywhere in a string, not necessarily at the beginning; I'm accounting for this in other ways that I can describe if desired). Jakarta Regexp holds this prefix in a package-private field, REProgram.prefix. I have written a little hack gateway to get at it:

  package org.apache.regexp;

  /**
   * This class exists as a gateway to access useful Jakarta Regexp package protected data.
   */
  public class RegexpTunnel {
    public static char[] getPrefix(RE regexp) {
      REProgram program = regexp.getProgram();
      return program.prefix;
    }
  }


Would it be possible to add a public getter to return this prefix?

I realize that Jakarta Regexp is not very actively maintained, so I'm curious whether other regex implementations can also provide this handy prefix, or whether anyone has suggestions along these lines. I did not see this capability in ORO or java.util.regex.
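For an implementation that does not expose a prefix, one fallback would be to derive a conservative literal prefix from the pattern text itself. The sketch below is only a heuristic under stated assumptions: literalPrefix is a hypothetical helper, it assumes java.util.regex-style syntax with no flags such as case-insensitive or comments mode, and it gives up (returns an empty prefix) whenever it is unsure. It is not how Jakarta Regexp computes REProgram.prefix.

```java
public class LiteralPrefixSketch {

    // Characters treated as regex metacharacters; anything else is a literal.
    private static final String META = "\\[](){}.*+?^$|";

    // Hypothetical helper: returns the leading literal characters that every
    // match of the pattern must start with, or "" when none can be guaranteed.
    static String literalPrefix(String pattern) {
        // With alternation, no single prefix is guaranteed; give up conservatively
        // (this also gives up on an escaped '|', which is safe but pessimistic).
        if (pattern.indexOf('|') >= 0) {
            return "";
        }
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (META.indexOf(c) >= 0) {
                // '*', '?' and '{' may allow zero repetitions of the previous
                // literal, so that literal cannot be part of the prefix.
                if ((c == '*' || c == '?' || c == '{') && prefix.length() > 0) {
                    prefix.setLength(prefix.length() - 1);
                }
                break;
            }
            prefix.append(c);
        }
        return prefix.toString();
    }

    public static void main(String[] args) {
        System.out.println(literalPrefix("foo.*"));  // prints foo
        System.out.println(literalPrefix("fo*bar")); // prints f
    }
}
```

An empty result simply means falling back to enumerating all terms, so being overly cautious costs speed but never correctness.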

Thanks,
        Erik

