Hi all,

I have started using the regexp package because it is nice and lightweight,
but found I could not use clustering, i.e. non-capturing (?:...) groups. I
think it might be a Perl extension to regular expressions, but it turned out
to be easy to implement in this package.

This allows matching of the following style (pattern first, then the input).

(?:\w+(?:\s\w+)+)
Mary had a little lamb

This will match. Since the pattern contains no capturing parens, the only
paren set is paren 0, which returns the full string.
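To make the "only paren 0" point concrete, here is a minimal sketch. It assumes java.util.regex as a stand-in (it is not the regexp package this patch targets), and the class and method names are made up for illustration. The pattern is the cluster pattern written with balanced parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in demo: uses java.util.regex, NOT the regexp package's RE class.
// ClusterDemo, wholeMatch and captureCount are hypothetical names.
class ClusterDemo {
    // One word followed by one or more whitespace-separated words,
    // built only from clusters, so nothing is captured.
    static final Pattern WORDS = Pattern.compile("(?:\\w+(?:\\s\\w+)+)");

    // Returns group 0 (the whole match), or null if nothing matches.
    static String wholeMatch(String input) {
        Matcher m = WORDS.matcher(input);
        return m.find() ? m.group(0) : null;
    }

    // Number of capturing groups in the pattern; clusters do not count.
    static int captureCount() {
        return WORDS.matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("Mary had a little lamb"));
        System.out.println(captureCount());
    }
}
```

Group 0 here plays the role of paren 0 in the text above.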

A better example is domain names (simplified here; I am not sure whether it
complies with the relevant RFC)...

([a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*)
www.test.com
jakarta.apache.org

Both will match, with paren 0 holding the full string.
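As a rough illustration of what the cluster changes for the domain pattern, the sketch below compares it with a variant whose inner group is an ordinary capturing group. It assumes java.util.regex as a stand-in for the regexp package, and DomainGroups and groups are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in demo: uses java.util.regex, NOT the regexp package's RE class.
class DomainGroups {
    // Inner group captures: an extra paren holds the last ".label" matched.
    static final Pattern CAPTURING =
        Pattern.compile("([a-zA-Z0-9]+(\\.[a-zA-Z0-9]+)*)");

    // Inner group is a cluster: it groups for the '*' but captures nothing.
    static final Pattern CLUSTERED =
        Pattern.compile("([a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*)");

    // Returns groups 0..n for a whole-input match, or null if there is none.
    static String[] groups(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", groups(CAPTURING, "jakarta.apache.org")));
        System.out.println(String.join(" | ", groups(CLUSTERED, "jakarta.apache.org")));
    }
}
```

With the capturing variant the caller gets an extra group containing only the last dotted label; the clustered variant leaves just the whole match and the outer group.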

Now take the above expression and add the protocol...

(?:\w+://)?([a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*)
http://www.test.com
Paren 0 = http://www.test.com
Paren 1 = www.test.com
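
The paren numbering above can be checked with a small sketch. It assumes java.util.regex as a stand-in for the regexp package, with the protocol group written as a cluster, and UrlHost and host are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in demo: uses java.util.regex, NOT the regexp package's RE class.
class UrlHost {
    // Optional protocol as a cluster, so the host is the first capturing group.
    static final Pattern URL =
        Pattern.compile("(?:\\w+://)?([a-zA-Z0-9]+(?:\\.[a-zA-Z0-9]+)*)");

    // Returns the host (group 1), or null when the whole input does not match.
    static String host(String input) {
        Matcher m = URL.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.test.com"));
        System.out.println(host("www.test.com"));
    }
}
```

Because the protocol part never captures, the host stays in group 1 whether or not a protocol is present.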

Anyway, the patch adds 14 tests (#149 to #162) to RETest.txt that demonstrate
this.

regards,
Michael


p.s. I think I stripped a bunch of trailing spaces from the ends of lines, so
there are a bunch of extra lines in the patch. I am not very familiar with
using diff :)
? bin
? Clustering.patch
? Clustering2.patch
? build/run-tests.sh
Index: docs/RETest.txt
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/docs/RETest.txt,v
retrieving revision 1.1
diff -r1.1 RETest.txt
886a887,980
> 
> #149
> (?:a)
> a
> YES
> a
> 
> #150
> (?:a)
> aa
> YES
> a
> 
> #151
> (?:\w)
> abc
> YES
> a
> 
> #152
> (?:\w\s\w)+
> a b c
> YES
> a b
> 
> #153
> (a\w)(?:,(a\w))+
> ab,ac,ad
> YES
> ab,ac,ad
> ab
> ad
> 
> #154
> z(\w\s+(?:\w\s+\w)+)z
> za   b bc   cd     dz
> YES
> za   b bc   cd     dz
> a   b bc   cd     d
> 
> #155
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> http://www.test.com
> YES
> http://www.test.com
> http://
> http
> .com
> 
> #156
> ((?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> ftp://www.test.com
> YES
> ftp://www.test.com
> ftp://
> .com
> 
> #157
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*
> htTp://www.test.com
> YES
> htTp://www.test.com
> htTp://
> htTp
> 
> #158
> (?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> FTP://www.test.com
> YES
> FTP://www.test.com
> FTP
> .com
> 
> #159
> ^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*$
> http://.www.test.com
> NO
> 
> #160
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtP://www.test.com
> YES
> FtP://www.test.com
> 
> #161
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtTP://www.test.com
> NO
> 
> #162
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> www.test.com
> YES
> www.test.com
Index: src/java/org/apache/regexp/RE.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RE.java,v
retrieving revision 1.6
diff -r1.6 RE.java
176,186c176,186
<  *    [:alnum:]            Alphanumeric characters. 
<  *    [:alpha:]            Alphabetic characters. 
<  *    [:blank:]            Space and tab characters. 
<  *    [:cntrl:]            Control characters. 
<  *    [:digit:]            Numeric characters. 
<  *    [:graph:]            Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.) 
<  *    [:lower:]            Lower-case alphabetic characters. 
<  *    [:print:]            Printable characters (characters that are not control characters.) 
<  *    [:punct:]            Punctuation characters (characters that are not letter, digits, control characters, or space characters). 
<  *    [:space:]            Space characters (such as space, tab, and formfeed, to name a few). 
<  *    [:upper:]            Upper-case alphabetic characters. 
---
>  *    [:alnum:]            Alphanumeric characters.
>  *    [:alpha:]            Alphabetic characters.
>  *    [:blank:]            Space and tab characters.
>  *    [:cntrl:]            Control characters.
>  *    [:digit:]            Numeric characters.
>  *    [:graph:]            Characters that are printable and are also visible. (A space is printable, but not visible, while an `a' is both.)
>  *    [:lower:]            Lower-case alphabetic characters.
>  *    [:print:]            Printable characters (characters that are not control characters.)
>  *    [:punct:]            Punctuation characters (characters that are not letter, digits, control characters, or space characters).
>  *    [:space:]            Space characters (such as space, tab, and formfeed, to name a few).
>  *    [:upper:]            Upper-case alphabetic characters.
188c188
<  *         
---
>  *
199c199
<  *         
---
>  *
254a255
>  *   (?:A)                 Used for subexpression clustering (just like grouping but no backrefs)
399a401
>     static final char OP_OPEN_CLUSTER     = '<';  //                 opening cluster
400a403
>     static final char OP_CLOSE_CLUSTER    = '>';  //                 closing cluster
421c424
<     static final char POSIX_CLASS_ALPHA   = 'a';  // Alphabetics 
---
>     static final char POSIX_CLASS_ALPHA   = 'a';  // Alphabetics
947a951,955
> 
>                 case OP_OPEN_CLUSTER:
>                 case OP_CLOSE_CLUSTER:
>                     // starting or ending the matching of a subexpression which has no backref.
>                     return matchNodes( next, maxNode, idx );
Index: src/java/org/apache/regexp/RECompiler.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RECompiler.java,v
retrieving revision 1.2
diff -r1.2 RECompiler.java
1191c1191
<         boolean paren = false;
---
>         int paren = -1;
1196,1198c1196,1208
<             idx++;
<             paren = true;
<             ret = node(RE.OP_OPEN, parens++);
---
>             // if its a cluster ( rather than a proper subexpression ie with backrefs )
>             if ( idx + 2 < len && pattern.charAt( idx + 1 ) == '?' && pattern.charAt( idx + 2 ) == ':' )
>             {
>                 paren = 2;
>                 idx += 3;
>                 ret = node( RE.OP_OPEN_CLUSTER, 0 );
>             }
>             else
>             {
>                 paren = 1;
>                 idx++;
>                 ret = node(RE.OP_OPEN, parens++);
>             }
1223c1233
<         if (paren)
---
>         if ( paren > 0 )
1233c1243,1250
<             end = node(RE.OP_CLOSE, closeParens);
---
>             if ( paren == 1 )
>             {
>                 end = node(RE.OP_CLOSE, closeParens);
>             }
>             else
>             {
>                 end = node( RE.OP_CLOSE_CLUSTER, 0 );
>             }
Index: src/java/org/apache/regexp/RETest.java
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/src/java/org/apache/regexp/RETest.java,v
retrieving revision 1.2
diff -r1.2 RETest.java
58c58
<  */ 
---
>  */
89,90c89,90
<             //new RETest(arg);
<             test();
---
>             new RETest(arg);
>             //test();
Index: xdocs/RETest.txt
===================================================================
RCS file: /home/cvspublic/jakarta-regexp/xdocs/RETest.txt,v
retrieving revision 1.1
diff -r1.1 RETest.txt
886a887,980
> 
> #149
> (?:a)
> a
> YES
> a
> 
> #150
> (?:a)
> aa
> YES
> a
> 
> #151
> (?:\w)
> abc
> YES
> a
> 
> #152
> (?:\w\s\w)+
> a b c
> YES
> a b
> 
> #153
> (a\w)(?:,(a\w))+
> ab,ac,ad
> YES
> ab,ac,ad
> ab
> ad
> 
> #154
> z(\w\s+(?:\w\s+\w)+)z
> za   b bc   cd     dz
> YES
> za   b bc   cd     dz
> a   b bc   cd     d
> 
> #155
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> http://www.test.com
> YES
> http://www.test.com
> http://
> http
> .com
> 
> #156
> ((?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> ftp://www.test.com
> YES
> ftp://www.test.com
> ftp://
> .com
> 
> #157
> (([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*
> htTp://www.test.com
> YES
> htTp://www.test.com
> htTp://
> htTp
> 
> #158
> (?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*
> FTP://www.test.com
> YES
> FTP://www.test.com
> FTP
> .com
> 
> #159
> ^(?:([hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(\.[a-zA-Z0-9\-]+)*$
> http://.www.test.com
> NO
> 
> #160
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtP://www.test.com
> YES
> FtP://www.test.com
> 
> #161
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> FtTP://www.test.com
> NO
> 
> #162
> ^(?:(?:[hH][tT]{2}[pP]|[fF][tT][pP]):\/\/)?[a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*$
> www.test.com
> YES
> www.test.com