Re: [cp-patches] RFC: gnu.regexp: fixed bugs in stingy match of RETokenRepeated

2006-01-29 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Sun, 29 Jan 2006 01:15:27 +0900 (JST)

 The patch proposed on Mon, 23 Jan 2006 has been cancelled,
 and here is another patch.

Hopefully improved.

TODO:

This patch cannot make the following test pass.

/^(b+?|a){1,2}?c/
bbac
 0: bbac
 1: a

/b+?/ matches only b, and bb is not tried. 

ChangeLog
2006-01-29  Ito Kazumitsu  [EMAIL PROTECTED]

Fixes bug #24876
* gnu/regexp/gnu/regexp/RE.java(REG_TRY_ENTIRE_MATCH):
New execution flag.
(getMatchImpl): if REG_TRY_ENTIRE_MATCH is set, add an
implicit RETokenEnd at the end of the regexp chain.
Do not select the longest match, but select the first match.
(match): Do not take care of REMatch.empty.
* gnu/regexp/REMatch.java(empty): To be used only in RETokenRepeated.
* gnu/regexp/RETokenOneOf.java: Corrected a typo in a comment.
* gnu/regexp/RETokenBackRef.java: Do not take care of REMatch.empty.
* gnu/regexp/RETokenRepeated.java (match): Rewrote stingy matching.
Do not take care of REMatch.empty. Set and check REMatch.empty
when trying to match the single token.

Index: classpath/gnu/regexp/RE.java
===
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RE.java,v
retrieving revision 1.12
diff -u -r1.12 RE.java
--- classpath/gnu/regexp/RE.java22 Jan 2006 02:22:21 -  1.12
+++ classpath/gnu/regexp/RE.java29 Jan 2006 15:17:37 -
@@ -217,6 +217,13 @@
*/
   public static final int REG_NO_INTERPOLATE = 128;
 
+  /**
+   * Execution flag.
+   * Try to match the whole input string. An implicit match-end operator
+   * is added to this regexp.
+   */
+  public static final int REG_TRY_ENTIRE_MATCH = 256;
+
   /** Returns a string representing the version of the gnu.regexp package. */
   public static final String version() {
 return VERSION;
@@ -1265,20 +1272,14 @@
   
 /* Implements abstract method REToken.match() */
 boolean match(CharIndexed input, REMatch mymatch) { 
-   int origin = mymatch.index;
-   boolean b;
if (firstToken == null) {
-   b = next(input, mymatch);
-   if (b) mymatch.empty = (mymatch.index == origin);
-   return b;
+   return next(input, mymatch);
}
 
// Note the start of this subexpression
mymatch.start[subIndex] = mymatch.index;
 
-   b = firstToken.match(input, mymatch);
-   if (b) mymatch.empty = (mymatch.index == origin);
-   return b;
+   return firstToken.match(input, mymatch);
 }
   
   /**
@@ -1337,23 +1338,34 @@
   }
 
   REMatch getMatchImpl(CharIndexed input, int anchor, int eflags, StringBuffer 
buffer) {
+  boolean tryEntireMatch = ((eflags  REG_TRY_ENTIRE_MATCH) != 0);
+  RE re = (tryEntireMatch ? (RE) this.clone() : this);
+  if (tryEntireMatch) {
+ re.chain(new RETokenEnd(0, null));
+  }
   // Create a new REMatch to hold results
   REMatch mymatch = new REMatch(numSubs, anchor, eflags);
   do {
  // Optimization: check if anchor + minimumLength  length
  if (minimumLength == 0 || input.charAt(minimumLength-1) != 
CharIndexed.OUT_OF_BOUNDS) {
- if (match(input, mymatch)) {
- // Find longest match of them all to observe leftmost longest
- REMatch longest = mymatch;
+ if (re.match(input, mymatch)) {
+ REMatch best = mymatch;
+ // We assume that the match that coms first is the best.
+ // And the following The longer, the better rule has
+ // been commented out. The longest is not neccesarily
+ // the best. For example, a out of aaa is the best
+ // match for /a+?/.
+ /*
+ // Find best match of them all to observe leftmost longest
  while ((mymatch = mymatch.next) != null) {
- if (mymatch.index  longest.index) {
- longest = mymatch;
+ if (mymatch.index  best.index) {
+   best = mymatch;
  }
  }
- 
- longest.end[0] = longest.index;
- longest.finish(input);
- return longest;
+ */
+ best.end[0] = best.index;
+ best.finish(input);
+ return best;
  }
  }
  mymatch.clear(++anchor);
Index: classpath/gnu/regexp/REMatch.java
===
RCS file: /cvsroot/classpath/classpath/gnu/regexp/REMatch.java,v
retrieving revision 1.3
diff -u -r1.3 REMatch.java
--- classpath/gnu/regexp/REMatch.java   22 Jan 2006 02:22:21 -  1.3
+++ classpath/gnu/regexp/REMatch.java   29 Jan 2006 15:17:37 -
@@ -67,7 +67,8 @@
 int[] start; // 

Re: [cp-patches] RFC: gnu.regexp: fixed bugs in stingy match of RETokenRepeated

2006-01-25 Thread Tom Tromey
 Ito == Ito Kazumitsu [EMAIL PROTECTED] writes:

Ito I am beginning to think that gnu.regexp is doing more than is
Ito expected:
Ito   Find all possible matches and select the best match.
[...]
Ito If we can simplufy the work of getMatchImpl(), we will be
Ito getting the same results as other implementations as well as
Ito improved performance.

I think this would be excellent.

As I recall, the first time jregex came up was to improve performance.
We were also interested in improved correctness, but you already seem
to be addressing that pretty well :-)

Tom

___
Classpath-patches mailing list
Classpath-patches@gnu.org
http://developer.classpath.org/mailman/listinfo/classpath-patches


Re: [cp-patches] RFC: gnu.regexp: fixed bugs in stingy match of RETokenRepeated

2006-01-24 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Tue, 24 Jan 2006 07:25:29 +0900 (JST)


 This is an ad hoc fix and not carefully designed. I can think of problems
 such as
 
   How should we compare an stingy match and a non-stingy match?

I am beginning to think that gnu.regexp is doing more than is
expected:

  Find all possible matches and select the best match.

Jakarta Regexp seems much simpler (I have not seriously studied
the source code, this is only an impression):

  Go forward. If something fails, backtrack and try another option.
  If there is nothing to be done, then we have found a match.
  If we have more to do but we cannot go ahead, the the matching has
  failed.

Our java.util.regex.Matcher#find() and java.util.regex.Matcher#matches()
both uses the same gnu.regexp.RE#getMatchImpl(), which tries to find
all possible matches.

So with the regexp /^[ab]{1,3}(ab*|b)/, our find() can find
aab from aab.  But other implementations
(Sun's JDK, Jakarta Regexp, Perl, Ruby) can find only aabb.


 1 [ab] : a
 2 [ab] : a
 3 [ab] : b
 4 ab*  : FAIL
 5 b: b
   Other implementations stop here, but gnu.regexp goes back to 2
   and tries another option.

 3 ab*  : FAIL
 4 b: b
   Go back to 1 and try another.

 2 ab*  : ab
   aab has been found.


If we can simplufy the work of getMatchImpl(), we will be
getting the same results as other implementations as well as
improved performance.

In that case, when matches() call getMatchImpl(), an implicit
RETokenEnd should be added. Otherwise, aaa cannot match /a+?/.


___
Classpath-patches mailing list
Classpath-patches@gnu.org
http://lists.gnu.org/mailman/listinfo/classpath-patches


Re: [cp-patches] RFC: gnu.regexp: fixed bugs in stingy match of RETokenRepeated

2006-01-24 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Tue, 24 Jan 2006 23:37:52 +0900 (JST)

 If we can simplufy the work of getMatchImpl(), we will be
 getting the same results as other implementations as well as
 improved performance.
 
 In that case, when matches() call getMatchImpl(), an implicit
 RETokenEnd should be added. Otherwise, aaa cannot match /a+?/.

I am rewriting the code using this idea.  Unfortunately,
the following tests fail.

/(([a-c])b*?\2)*/
ababbbcbc
 0: ababb
 1: bb
 2: b

/^(b+?|a){1,2}?c/
bbc
 0: bbc
 1: b

But I think it is better to go this way than to intoduce some
dirty ad hoc fix proposed by this RFC.

___
Classpath-patches mailing list
Classpath-patches@gnu.org
http://developer.classpath.org/mailman/listinfo/classpath-patches


Re: [cp-patches] RFC: gnu.regexp: fixed bugs in stingy match of RETokenRepeated

2006-01-23 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Mon, 23 Jan 2006 23:06:42 +0900 (JST)

 --- classpath/gnu/regexp/REMatch.java 22 Jan 2006 02:22:21 -  1.3
 +++ classpath/gnu/regexp/REMatch.java 23 Jan 2006 13:43:12 -

 +Vector repeats; // number of repeats of each stingy repeated token
 +// Request For Comment: The Vector repeats contains number of repeats
 +// from left to right without regard to what the token is.
 +// I am not quite sure this is reasonable. 

 --- classpath/gnu/regexp/RE.java  22 Jan 2006 02:22:21 -  1.12
 +++ classpath/gnu/regexp/RE.java  23 Jan 2006 13:43:11 -

 +   REMatch best = mymatch;
 +   if (! best.stingy) {

 }
 +   else {
 +   // Find best match of them all to observe
 +   // leftmost least repeated
 +   while ((mymatch = mymatch.next) != null) {
 +   if (compareRepeats(mymatch, best)  0) {
 +   best = mymatch;
 +   }
 +   }
 +   }

This is an ad hoc fix and not carefully designed. I can think of problems
such as

  How should we compare an stingy match and a non-stingy match?

For example, Sun's JDK shows the following result.

/(a+?|a+)/
a
 0: a
 1: a

/(a+|a+?)/
a
 0: a
 1: a

But my patched gnu.regexp shows:

/(a+?|a+)/
a
 0: a
 1: a

/(a+|a+?)/
a
 0: a
 1: a


___
Classpath-patches mailing list
Classpath-patches@gnu.org
http://lists.gnu.org/mailman/listinfo/classpath-patches