Re: [cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated
Hi Ito,

On Sun, 2006-01-22 at 11:17 +0900, Ito Kazumitsu wrote:
> I am sorry if I am too impatient, but I am checking this in
> because I am fixing another serious bug.

I don't think you are too impatient. You are our authority on regular
expressions now. There are also lots of test cases backing up your work
in Mauve (thanks for updating those too). I have CCed Wes who wrote the
package originally, if he has time maybe he can go over your recent
patches.

Originally we were looking at jregex adoption, but we lost contact with
the author. So I am really happy you are working so hard to fix the bugs
in our regex implementation to make it a fully free and compatible
java.util.regex implementation.

Thanks,

Mark

_______________________________________________
Classpath-patches mailing list
Classpath-patches@gnu.org
http://lists.gnu.org/mailman/listinfo/classpath-patches
Re: [cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Sat, 21 Jan 2006 01:56:31 +0900 (JST)

> Slightly improved. And this is supposed to fix the bug #25837
>
> ChangeLog
> 2006-01-20  Ito Kazumitsu  [EMAIL PROTECTED]

I am sorry if I am too impatient, but I am checking this in
because I am fixing another serious bug.
Re: [cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated
From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Thu, 19 Jan 2006 23:22:52 +0900 (JST)

> And this is my fix.

Slightly improved. And this is supposed to fix the bug #25837

ChangeLog
2006-01-20  Ito Kazumitsu  [EMAIL PROTECTED]

        Fixes bug #25837
        * gnu/regexp/REMatch.java(empty): New boolean indicating
        an empty string matched.
        * gnu/regexp/RE.java(match): Sets empty flag when an empty
        string matched.
        (initialize): Support back reference \10, \11, and so on.
        (parseInt): renamed from getEscapedChar and returns int.
        * gnu/regexp/RETokenRepeated.java(match): Sets empty flag
        when an empty string matched. Fixed a bug of the case where
        an empty string matched. Added special handling of {0}.
        * gnu/regexp/RETokenBackRef.java(match): Sets empty flag
        when an empty string matched. Fixed the case insensitive matching.

Index: classpath/gnu/regexp/RE.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RE.java,v
retrieving revision 1.11
diff -u -r1.11 RE.java
--- classpath/gnu/regexp/RE.java        19 Jan 2006 13:45:51 -0000      1.11
+++ classpath/gnu/regexp/RE.java        20 Jan 2006 16:37:03 -0000
@@ -825,12 +825,31 @@
       }
 
       // BACKREFERENCE OPERATOR
-      //  \1 \2 ... \9
+      //  \1 \2 ... \9 and \10 \11 \12 ...
       // not available if RE_NO_BK_REFS is set
+      // Perl recognizes \10, \11, and so on only if enough number of
+      // parentheses have opened before it, otherwise they are treated
+      // as aliases of \010, \011, ... (octal characters).  In case of
+      // Sun's JDK, octal character expression must always begin with \0.
+      // We will do as JDK does. But FIXME, take a look at (a)(b)\29.
+      // JDK treats \2 as a back reference to the 2nd group because
+      // there are only two groups. But in our poor implementation,
+      // we cannot help but treat \29 as a back reference to the 29th group.
 
       else if (unit.bk && Character.isDigit(unit.ch) && !syntax.get(RESyntax.RE_NO_BK_REFS)) {
         addToken(currentToken);
-        currentToken = new RETokenBackRef(subIndex,Character.digit(unit.ch,10),insens);
+        int numBegin = index - 1;
+        int numEnd = pLength;
+        for (int i = index; i < pLength; i++) {
+          if (! Character.isDigit(pattern[i])) {
+            numEnd = i;
+            break;
+          }
+        }
+        int num = parseInt(pattern, numBegin, numEnd-numBegin, 10);
+
+        currentToken = new RETokenBackRef(subIndex,num,insens);
+        index = numEnd;
       }
 
       // START OF STRING OPERATOR
@@ -999,12 +1018,12 @@
     return index;
   }
 
-  private static char getEscapedChar(char[] input, int pos, int len, int radix) {
+  private static int parseInt(char[] input, int pos, int len, int radix) {
     int ret = 0;
     for (int i = pos; i < pos + len; i++) {
       ret = ret * radix + Character.digit(input[i], radix);
     }
-    return (char)ret;
+    return ret;
   }
 
   /**
@@ -1059,7 +1078,7 @@
           l++;
         }
         if (l != expectedLength) return null;
-        ce.ch = getEscapedChar(input, pos + 2, l, 16);
+        ce.ch = (char)(parseInt(input, pos + 2, l, 16));
         ce.len = l + 2;
       }
       else {
@@ -1077,7 +1096,7 @@
         }
         if (l == 3 && input[pos + 2] > '3') l--;
         if (l <= 0) return null;
-        ce.ch = getEscapedChar(input, pos + 2, l, 8);
+        ce.ch = (char)(parseInt(input, pos + 2, l, 8));
         ce.len = l + 2;
       }
       else {
@@ -1246,12 +1265,20 @@
 
   /* Implements abstract method REToken.match() */
     boolean match(CharIndexed input, REMatch mymatch) {
-        if (firstToken == null) return next(input, mymatch);
+        int origin = mymatch.index;
+        boolean b;
+        if (firstToken == null) {
+            b = next(input, mymatch);
+            if (b) mymatch.empty = (mymatch.index == origin);
+            return b;
+        }
 
         // Note the start of this subexpression
         mymatch.start[subIndex] = mymatch.index;
 
-        return firstToken.match(input, mymatch);
+        b = firstToken.match(input, mymatch);
+        if (b) mymatch.empty = (mymatch.index == origin);
+        return b;
     }
 
   /**
Index: classpath/gnu/regexp/REMatch.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/REMatch.java,v
retrieving revision 1.2
diff -u -r1.2 REMatch.java
--- classpath/gnu/regexp/REMatch.java   2 Jul 2005 20:32:15 -0000       1.2
+++ classpath/gnu/regexp/REMatch.java   20 Jan 2006 16:37:03 -0000
@@ -67,6 +67,7 @@
     int[] start; // start positions (relative to offset) for each (sub)exp.
     int[] end;   // end positions for the same
     REMatch next; // other possibility (to avoid having to use arrays)
+    boolean empty; // empty string matched
 
     public Object
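
The FIXME in the RE.java hunk above contrasts two ways of reading the digits after a backslash. The following is a minimal sketch of that difference, not the gnu.regexp code itself: the class `BackRefNumber` and the methods `digitsEnd`, `greedy` and `jdkLike` are hypothetical names introduced for illustration, and `jdkLike` only approximates the documented java.util.regex rule (drop trailing digits until the number fits the existing group count or a single digit is left). `parseInt` mirrors the helper renamed in the patch.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; none of these names exist in gnu.regexp. */
final class BackRefNumber {

    /** Parses len digits of input starting at pos, as the patched parseInt does. */
    static int parseInt(char[] input, int pos, int len, int radix) {
        int ret = 0;
        for (int i = pos; i < pos + len; i++) {
            ret = ret * radix + Character.digit(input[i], radix);
        }
        return ret;
    }

    /** Returns the index just past the run of digits that starts at pos. */
    static int digitsEnd(char[] pattern, int pos) {
        int end = pos;
        while (end < pattern.length && Character.isDigit(pattern[end])) {
            end++;
        }
        return end;
    }

    /**
     * Greedy reading, as in the patch: every digit belongs to the group number.
     * Returns { groupNumber, indexOfNextUnreadChar }.
     */
    static int[] greedy(char[] pattern, int pos) {
        int end = digitsEnd(pattern, pos);
        return new int[] { parseInt(pattern, pos, end - pos, 10), end };
    }

    /**
     * Assumed JDK-like reading: drop trailing digits while the number exceeds
     * groupCount and more than one digit remains.
     */
    static int[] jdkLike(char[] pattern, int pos, int groupCount) {
        int end = digitsEnd(pattern, pos);
        while (end - pos > 1 && parseInt(pattern, pos, end - pos, 10) > groupCount) {
            end--;
        }
        return new int[] { parseInt(pattern, pos, end - pos, 10), end };
    }

    public static void main(String[] args) {
        char[] p = "(a)(b)\\29".toCharArray();
        int pos = 7; // index of '2', the first digit after the backslash

        int[] g = greedy(p, pos);
        int[] j = jdkLike(p, pos, 2);
        System.out.println("greedy:   group " + g[0] + ", next index " + g[1]);
        System.out.println("JDK-like: group " + j[0] + ", next index " + j[1]);

        // java.util.regex reads \29 here as \2 followed by a literal '9'.
        System.out.println(Pattern.matches("(a)(b)\\29", "abb9"));
    }
}
```

With the greedy reading the pattern refers to a 29th group that does not exist, whereas the JDK-like reading leaves the `9` in the pattern as a literal character. Adopting the second rule would require the parser to know how many groups have been opened at that point, which is the limitation the FIXME describes.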
[cp-patches] RFC: gnu.regexp: fixed bugs in RETokenRepeated
From: Ito Kazumitsu [EMAIL PROTECTED]
Subject: Re: [cp-patches] RFC: gnu.regexp fix to avoid unwanted PatternSyntaxException
Date: Thu, 19 Jan 2006 01:39:52 +0900 (JST)

From: Ito Kazumitsu [EMAIL PROTECTED]
Date: Thu, 05 Jan 2006 23:47:03 +0900 (JST)

> +           // doables.index == lastIndex means an empty string
> +           // was the longest that matched this token.
> +           // We break here, otherwise we will fall into an endless loop.

Studying various cases, I have found that this comment is not always true.
And this is my fix.

ChangeLog
2006-01-19  Ito Kazumitsu  [EMAIL PROTECTED]

        * gnu/regexp/REMatch.java(empty): New boolean indicating
        an empty string matched.
        * gnu/regexp/RE.java(match): Sets empty flag when an empty
        string matched.
        * gnu/regexp/RETokenRepeated.java(match): Sets empty flag
        when an empty string matched. Fixed a bug of the case where
        an empty string matched.

Index: classpath/gnu/regexp/RE.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RE.java,v
retrieving revision 1.11
diff -u -r1.11 RE.java
--- classpath/gnu/regexp/RE.java        19 Jan 2006 13:45:51 -0000      1.11
+++ classpath/gnu/regexp/RE.java        19 Jan 2006 14:20:20 -0000
@@ -1012,7 +1012,7 @@
    * a : 'a' itself.
    * \0123 : Octal char 0123
    * \x1b : Hex char 0x1b
-   * \u1234 : Unicode char \u1234
+   * \u1234 : Unicode char \u1234
    */
   private static class CharExpression {
     /** character represented by this expression */
@@ -1246,12 +1246,20 @@
 
   /* Implements abstract method REToken.match() */
     boolean match(CharIndexed input, REMatch mymatch) {
-        if (firstToken == null) return next(input, mymatch);
+        int origin = mymatch.index;
+        boolean b;
+        if (firstToken == null) {
+            b = next(input, mymatch);
+            if (b) mymatch.empty = (mymatch.index == origin);
+            return b;
+        }
 
         // Note the start of this subexpression
         mymatch.start[subIndex] = mymatch.index;
 
-        return firstToken.match(input, mymatch);
+        b = firstToken.match(input, mymatch);
+        if (b) mymatch.empty = (mymatch.index == origin);
+        return b;
     }
 
   /**
Index: classpath/gnu/regexp/REMatch.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/REMatch.java,v
retrieving revision 1.2
diff -u -r1.2 REMatch.java
--- classpath/gnu/regexp/REMatch.java   2 Jul 2005 20:32:15 -0000       1.2
+++ classpath/gnu/regexp/REMatch.java   19 Jan 2006 14:20:20 -0000
@@ -67,6 +67,7 @@
     int[] start; // start positions (relative to offset) for each (sub)exp.
     int[] end;   // end positions for the same
     REMatch next; // other possibility (to avoid having to use arrays)
+    boolean empty; // empty string matched
 
     public Object clone() {
         try {
@@ -88,6 +89,7 @@
         index = other.index;
         // need to deep clone?
         next = other.next;
+        empty = other.empty;
     }
 
     REMatch(int subs, int anchor, int eflags) {
@@ -124,6 +126,7 @@
             start[i] = end[i] = -1;
         }
         next = null; // cut off alternates
+        empty = false;
     }
 
     /**
Index: classpath/gnu/regexp/RETokenRepeated.java
===================================================================
RCS file: /cvsroot/classpath/classpath/gnu/regexp/RETokenRepeated.java,v
retrieving revision 1.5
diff -u -r1.5 RETokenRepeated.java
--- classpath/gnu/regexp/RETokenRepeated.java   8 Jan 2006 23:06:43 -0000       1.5
+++ classpath/gnu/regexp/RETokenRepeated.java   19 Jan 2006 14:20:20 -0000
@@ -91,6 +91,7 @@
     // the subexpression back-reference operator allow that?
 
     boolean match(CharIndexed input, REMatch mymatch) {
+        int origin = mymatch.index;
         // number of times we've matched so far
         int numRepeats = 0;
 
@@ -116,6 +117,7 @@
                 REMatch result = matchRest(input, newMatch);
                 if (result != null) {
                     mymatch.assignFrom(result);
+                    mymatch.empty = (mymatch.index == origin);
                     return true;
                 }
             }
@@ -153,12 +155,43 @@
 
             positions.addElement(newMatch);
 
-            // doables.index == lastIndex means an empty string
-            // was the longest that matched this token.
-            // We break here, otherwise we will fall into an endless loop.
+            // doables.index == lastIndex occurs either
+            //   (1) when an empty string was the longest
+            //       that matched this token.
+            //       And this case occurs either
+            //         (1-1) when this token is always empty,
+            //               for example () or (()).
+            //         (1-2) when this token is not always empty
+            //               but can