Re: [cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated

2006-01-22 Thread Mark Wielaard
Hi Ito,

On Sun, 2006-01-22 at 11:17 +0900, Ito Kazumitsu wrote:
> I am sorry if I am too impatient, but I am checking this in
> because I am fixing another serious bug.

I don't think you are too impatient. You are our authority on regular
expressions now. There are also lots of test cases backing up your work
in Mauve (thanks for updating those too).

I have CCed Wes who wrote the package originally, if he has time maybe
he can go over your recent patches.

Originally we were looking at jregex adoption, but we lost contact with
the author. So I am really happy you are working so hard to fix the bugs
in our regex implementation to make it a fully free and compatible
java.util.regex implementation.

Thanks,

Mark


___
Classpath-patches mailing list
Classpath-patches@gnu.org
http://lists.gnu.org/mailman/listinfo/classpath-patches


Re: [cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated

2006-01-21 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Sat, 21 Jan 2006 01:56:31 +0900 (JST)

> Slightly improved. And this is supposed to fix the bug #25837
>
> ChangeLog
> 2006-01-20  Ito Kazumitsu  [EMAIL PROTECTED]

I am sorry if I am too impatient, but I am checking this in
because I am fixing another serious bug.




Re: [cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated

2006-01-20 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Thu, 19 Jan 2006 23:22:52 +0900 (JST)

> And this is my fix.

Slightly improved. And this is supposed to fix the bug #25837

ChangeLog
2006-01-20  Ito Kazumitsu  [EMAIL PROTECTED]

Fixes bug #25837
* gnu/regexp/REMatch.java(empty): New boolean indicating
an empty string matched.
* gnu/regexp/RE.java(match): Sets empty flag when an empty
string matched.
(initialize): Support back reference \10, \11, and so on.
(parseInt): renamed from getEscapedChar and returns int.
* gnu/regexp/RETokenRepeated.java(match): Sets empty flag
when an empty string matched. Fixed a bug of the case where
an empty string matched. Added special handling of {0}.
* gnu/regexp/RETokenBackRef.java(match): Sets empty flag
when an empty string matched. Fixed the case insensitive matching.
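
As a rough illustration of the "\10, \11, and so on" entry above, the digit scan can be sketched outside the parser. This is only a sketch under assumptions: the class name BackRefNumberSketch and the helper scanBackRef are hypothetical and not part of gnu.regexp; parseInt follows the same arithmetic as the method of that name in the patch.

```java
// Hypothetical sketch, not the gnu.regexp source: class and helper names are illustrative.
final class BackRefNumberSketch {

    /** Accumulates len digits starting at pos, in the given radix. */
    static int parseInt(char[] input, int pos, int len, int radix) {
        int ret = 0;
        for (int i = pos; i < pos + len; i++) {
            ret = ret * radix + Character.digit(input[i], radix);
        }
        return ret;
    }

    /**
     * Assumes pattern[numBegin] is the first digit after a backslash.
     * Returns { group number, index of the first char after the digits }.
     */
    static int[] scanBackRef(char[] pattern, int numBegin) {
        int numEnd = pattern.length;
        for (int i = numBegin + 1; i < pattern.length; i++) {
            if (!Character.isDigit(pattern[i])) {
                numEnd = i;
                break;
            }
        }
        int num = parseInt(pattern, numBegin, numEnd - numBegin, 10);
        return new int[] { num, numEnd };
    }

    public static void main(String[] args) {
        // The FIXME case: every trailing digit is consumed, so \29 is read as group 29.
        char[] p = "(a)(b)\\29x".toCharArray();
        int[] r = scanBackRef(p, 7); // p[6] is the backslash, p[7] is '2'
        System.out.println("group " + r[0] + ", resume at " + r[1]);
    }
}
```

Because the scan takes all consecutive digits, it cannot fall back to \2 followed by a literal 9 when only two groups exist, which is the limitation the FIXME comment in the patch describes.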

Index: classpath/gnu/regexp/RE.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RE.java,v
retrieving revision 1.11
diff -u -r1.11 RE.java
--- classpath/gnu/regexp/RE.java	19 Jan 2006 13:45:51 -0000	1.11
+++ classpath/gnu/regexp/RE.java	20 Jan 2006 16:37:03 -0000
@@ -825,12 +825,31 @@
   }
 
   // BACKREFERENCE OPERATOR
-  //  \1 \2 ... \9
+  //  \1 \2 ... \9 and \10 \11 \12 ...
   // not available if RE_NO_BK_REFS is set
+  // Perl recognizes \10, \11, and so on only if enough number of
+  // parentheses have opened before it, otherwise they are treated
+  // as aliases of \010, \011, ... (octal characters).  In case of
+  // Sun's JDK, octal character expression must always begin with \0.
+  // We will do as JDK does. But FIXME, take a look at (a)(b)\29.
+  // JDK treats \2 as a back reference to the 2nd group because
+  // there are only two groups. But in our poor implementation,
+  // we cannot help but treat \29 as a back reference to the 29th group.
 
   else if (unit.bk && Character.isDigit(unit.ch) && !syntax.get(RESyntax.RE_NO_BK_REFS)) {
    addToken(currentToken);
-   currentToken = new RETokenBackRef(subIndex,Character.digit(unit.ch,10),insens);
+   int numBegin = index - 1;
+   int numEnd = pLength;
+   for (int i = index; i < pLength; i++) {
+   if (! Character.isDigit(pattern[i])) {
+   numEnd = i;
+   break;
+   }
+   }
+   int num = parseInt(pattern, numBegin, numEnd-numBegin, 10);
+
+   currentToken = new RETokenBackRef(subIndex,num,insens);
+   index = numEnd;
   }
 
   // START OF STRING OPERATOR
@@ -999,12 +1018,12 @@
 return index;
   }
 
-  private static char getEscapedChar(char[] input, int pos, int len, int radix) {
+  private static int parseInt(char[] input, int pos, int len, int radix) {
 int ret = 0;
 for (int i = pos; i < pos + len; i++) {
ret = ret * radix + Character.digit(input[i], radix);
 }
-return (char)ret;
+return ret;
   }
 
   /**
@@ -1059,7 +1078,7 @@
l++;
   }
   if (l != expectedLength) return null;
-  ce.ch = getEscapedChar(input, pos + 2, l, 16);
+  ce.ch = (char)(parseInt(input, pos + 2, l, 16));
  ce.len = l + 2;
 }
 else {
@@ -1077,7 +1096,7 @@
   }
   if (l == 3 && input[pos + 2] > '3') l--;
   if (l <= 0) return null;
-  ce.ch = getEscapedChar(input, pos + 2, l, 8);
+  ce.ch = (char)(parseInt(input, pos + 2, l, 8));
   ce.len = l + 2;
 }
 else {
@@ -1246,12 +1265,20 @@
   
 /* Implements abstract method REToken.match() */
 boolean match(CharIndexed input, REMatch mymatch) { 
-   if (firstToken == null) return next(input, mymatch);
+   int origin = mymatch.index;
+   boolean b;
+   if (firstToken == null) {
+   b = next(input, mymatch);
+   if (b) mymatch.empty = (mymatch.index == origin);
+   return b;
+   }
 
// Note the start of this subexpression
mymatch.start[subIndex] = mymatch.index;
 
-   return firstToken.match(input, mymatch);
+   b = firstToken.match(input, mymatch);
+   if (b) mymatch.empty = (mymatch.index == origin);
+   return b;
 }
   
   /**
Index: classpath/gnu/regexp/REMatch.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/REMatch.java,v
retrieving revision 1.2
diff -u -r1.2 REMatch.java
--- classpath/gnu/regexp/REMatch.java	2 Jul 2005 20:32:15 -0000	1.2
+++ classpath/gnu/regexp/REMatch.java	20 Jan 2006 16:37:03 -0000
@@ -67,6 +67,7 @@
 int[] start; // start positions (relative to offset) for each (sub)exp.
 int[] end;   // end positions for the same
 REMatch next; // other possibility (to avoid having to use arrays)
+boolean empty; // empty string matched
 
 public Object 

[cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated

2006-01-19 Thread Ito Kazumitsu
From: Ito Kazumitsu [EMAIL PROTECTED]
Subject: Re: [cp-patches] RFC: gnu.regexp fix to avoid unwanted PatternSyntaxException
Date: Thu, 19 Jan 2006 01:39:52 +0900 (JST)

> From: Ito Kazumitsu [EMAIL PROTECTED]
> Date: Thu, 05 Jan 2006 23:47:03 +0900 (JST)
>
>> +   // doables.index == lastIndex means an empty string
>> +   // was the longest that matched this token.
>> +   // We break here, otherwise we will fall into an endless loop.
>
> Studying various cases, I have found that this comment is not
> always true.

And this is my fix.

ChangeLog
2006-01-19  Ito Kazumitsu  [EMAIL PROTECTED]

* gnu/regexp/REMatch.java(empty): New boolean indicating
an empty string matched.
* gnu/regexp/RE.java(match): Sets empty flag when an empty
string matched.
* gnu/regexp/RETokenRepeated.java(match): Sets empty flag
when an empty string matched. Fixed a bug of the case where
an empty string matched.
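
To show why a repeated token needs to know about empty matches at all, here is a minimal sketch of a greedy repeat loop. It is not the RETokenRepeated code: Token, Result, A_STAR and repeat are hypothetical names, and the guard shown is only the simple "break on an empty iteration" rule, not the finer case analysis that the patch below introduces. The flag is computed the same way as in the patch, by comparing the final index with the index at entry.

```java
// Hypothetical sketch: Token, Result and repeat are illustrative names,
// not classes from gnu.regexp.
final class EmptyRepeatSketch {

    /** A sub-pattern: returns the index after the match, or -1 on failure. */
    interface Token {
        int match(String input, int index);
    }

    /** Outcome of a repeat; 'empty' plays the role of the new REMatch.empty flag. */
    static final class Result {
        final int index;
        final boolean empty;

        Result(int index, boolean empty) {
            this.index = index;
            this.empty = empty;
        }
    }

    /** Sub-pattern a*: consumes zero or more 'a' characters and never fails. */
    static final Token A_STAR = (input, index) -> {
        int i = index;
        while (i < input.length() && input.charAt(i) == 'a') {
            i++;
        }
        return i;
    };

    /** Greedy (token)* with a guard against iterations that consume nothing. */
    static Result repeat(Token token, String input, int origin) {
        int index = origin;
        while (true) {
            int next = token.match(input, index);
            if (next < 0) {
                break; // token failed: stop repeating
            }
            if (next == index) {
                break; // empty iteration: repeating again cannot advance
            }
            index = next;
        }
        // Same test the patch uses: nothing was consumed since the origin.
        return new Result(index, index == origin);
    }

    public static void main(String[] args) {
        Result r = repeat(A_STAR, "aab", 0);
        System.out.println(r.index + " " + r.empty);
    }
}
```

Without the `next == index` check, (a*)* would loop forever once a* stops consuming input, since a* always succeeds.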

Index: classpath/gnu/regexp/RE.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RE.java,v
retrieving revision 1.11
diff -u -r1.11 RE.java
--- classpath/gnu/regexp/RE.java	19 Jan 2006 13:45:51 -0000	1.11
+++ classpath/gnu/regexp/RE.java	19 Jan 2006 14:20:20 -0000
@@ -1012,7 +1012,7 @@
* a  : 'a' itself.
* \0123  : Octal char 0123
* \x1b   : Hex char 0x1b
-   * \u1234 : Unicode char \u1234
+   * \u1234  : Unicode char \u1234
*/
   private static class CharExpression {
 /** character represented by this expression */
@@ -1246,12 +1246,20 @@
   
 /* Implements abstract method REToken.match() */
 boolean match(CharIndexed input, REMatch mymatch) { 
-   if (firstToken == null) return next(input, mymatch);
+   int origin = mymatch.index;
+   boolean b;
+   if (firstToken == null) {
+   b = next(input, mymatch);
+   if (b) mymatch.empty = (mymatch.index == origin);
+   return b;
+   }
 
// Note the start of this subexpression
mymatch.start[subIndex] = mymatch.index;
 
-   return firstToken.match(input, mymatch);
+   b = firstToken.match(input, mymatch);
+   if (b) mymatch.empty = (mymatch.index == origin);
+   return b;
 }
   
   /**
Index: classpath/gnu/regexp/REMatch.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/REMatch.java,v
retrieving revision 1.2
diff -u -r1.2 REMatch.java
--- classpath/gnu/regexp/REMatch.java	2 Jul 2005 20:32:15 -0000	1.2
+++ classpath/gnu/regexp/REMatch.java	19 Jan 2006 14:20:20 -0000
@@ -67,6 +67,7 @@
 int[] start; // start positions (relative to offset) for each (sub)exp.
 int[] end;   // end positions for the same
 REMatch next; // other possibility (to avoid having to use arrays)
+boolean empty; // empty string matched
 
 public Object clone() {
try {
@@ -88,6 +89,7 @@
index = other.index;
// need to deep clone?
next = other.next;
+   empty = other.empty;
 }
 
 REMatch(int subs, int anchor, int eflags) {
@@ -124,6 +126,7 @@
start[i] = end[i] = -1;
}
next = null; // cut off alternates
+   empty = false;
 }
 
 /**
Index: classpath/gnu/regexp/RETokenRepeated.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RETokenRepeated.java,v
retrieving revision 1.5
diff -u -r1.5 RETokenRepeated.java
--- classpath/gnu/regexp/RETokenRepeated.java	8 Jan 2006 23:06:43 -0000	1.5
+++ classpath/gnu/regexp/RETokenRepeated.java	19 Jan 2006 14:20:20 -0000
@@ -91,6 +91,7 @@
 // the subexpression back-reference operator allow that?
 
 boolean match(CharIndexed input, REMatch mymatch) {
+   int origin = mymatch.index;
// number of times we've matched so far
int numRepeats = 0; 

@@ -116,6 +117,7 @@
REMatch result = matchRest(input, newMatch);
if (result != null) {
mymatch.assignFrom(result);
+   mymatch.empty = (mymatch.index == origin);
return true;
}
}
@@ -153,12 +155,43 @@

positions.addElement(newMatch);
 
-   // doables.index == lastIndex means an empty string
-   // was the longest that matched this token.
-   // We break here, otherwise we will fall into an endless loop.
+   // doables.index == lastIndex occurs either
+   //   (1) when an empty string was the longest
+   //   that matched this token.
+   //   And this case occurs either
+   // (1-1) when this token is always empty,
+   //   for example () or (()).
+   // (1-2) when this token is not always empty
+   //   but can