Re: RegularExpression 'X' option oddity

Michael Glavassevich Wed, 14 Aug 2013 08:32:55 -0700

Hi Jernej,

Jernej Tuljak <jernej.tul...@gmail.com> wrote on 08/14/2013 03:41:17 AM:


> Hi,
> 
> we're abusing org.apache.xerces.impl.xpath.regex.RegularExpression 

Yep. :-)

> to validate XSD-flavor regular expression strings and later match
> test strings against them. It seemingly worked, until someone tried 
> to use a very specific regex.
> 
> Here's the code:
> 
>     import org.apache.xerces.impl.xpath.regex.RegularExpression;
> 
>     public class XercesRegexTest {
>         
>         public static void main(String[] args) {
>             String regexString = "([a-zA-Z][^ ]*)";
>             RegularExpression regex = new RegularExpression(regexString, "x");
>             System.out.println(regex.toString());
>         }
>         
>     }
> 
> The `x` option is supposed to make the regex engine conform to XSD 
> regular expressions.

Only 'X' does that. That is the only option which Xerces uses internally.

> But if you run this code, you'll end up with 
> 
>     Exception in thread "main" org.apache.xerces.impl.xpath.regex.ParseException: Unexpected end of the pattern in a character class.
>         at org.apache.xerces.impl.xpath.regex.RegexParser.ex(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseCharacterClass(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.processParen(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseAtom(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseFactor(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseTerm(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parseRegex(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegexParser.parse(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegularExpression.setPattern(Unknown Source)
>         at org.apache.xerces.impl.xpath.regex.RegularExpression.<init>(Unknown Source)
>         at com.mgsoft.testing.regex.XercesRegexTest.main(XercesRegexTest.java:9)
>     Java Result: 1
> 
> It first looked like a bug in Xerces' regular expression parser, but
> after re-reading the documentation
> (http://xerces.apache.org/xerces-j/apiDocs/org/apache/xerces/utils/regex/RegularExpression.html)
> of this class, I found out that the `x` option should actually be `X`
> (upper case).

The docs for that class probably haven't changed much over the years, but 
it's worth pointing out that those are the Xerces-J 1.x docs, not the 
Xerces-J 2.x ones.

> Thing is... it worked for countless other regular 
> expressions. In fact, it is the space that causes problems; any 
> other character works fine. Also, removing the option and using the
> single-string constructor of `RegularExpression` works fine.

If you're not specifying 'X' then you're using a mode that isn't XSD and 
that we never use.
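
As an aside, here is a minimal sketch of checking strings against an XSD
pattern without touching the internal RegularExpression class at all. It is
a swapped-in technique, not something from this thread: it assumes the
standard JAXP validation API in the JDK is acceptable for your use case, and
it wraps the pattern in a tiny schema with an xs:pattern facet. The class
name XsdPatternCheck, the methods compile/matches/escape, and the element
name v are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of Xerces or the JDK.
final class XsdPatternCheck {

    // Escapes characters that are unsafe in XML text or quoted attribute values.
    // Tab/CR/LF become character references so attribute normalization keeps them.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;")
                .replace("\t", "&#9;").replace("\n", "&#10;").replace("\r", "&#13;");
    }

    // Wraps the pattern in a one-element schema. The schema processor parses the
    // pattern as an XSD regular expression, so an invalid pattern is rejected here.
    static Schema compile(String xsdPattern) {
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='v'><xs:simpleType><xs:restriction base='xs:string'>"
          + "<xs:pattern value='" + escape(xsdPattern) + "'/>"
          + "</xs:restriction></xs:simpleType></xs:element></xs:schema>";
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Not a valid XSD pattern: " + xsdPattern, e);
        }
    }

    // Validates <v>value</v> against the schema; true means the whole value matched.
    static boolean matches(Schema schema, String value) {
        try {
            schema.newValidator().validate(
                new StreamSource(new StringReader("<v>" + escape(value) + "</v>")));
            return true;
        } catch (SAXException e) {
            // Pattern mismatch, or a value that cannot be represented in XML 1.0.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = compile("([a-zA-Z][^ ]*)");
        System.out.println(matches(schema, "abc"));
        System.out.println(matches(schema, "a b"));
    }
}
```

This is heavier than calling a regex class directly, and values containing
characters that XML 1.0 cannot carry are simply reported as non-matching, so
treat it as a starting point rather than a drop-in replacement.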

> Does anyone know why this is happening? I realize that this class is
> probably not intended for such usage, but since the spec we're 
> implementing uses XSD regular expressions, we tried to avoid 
> reinventing the wheel through reuse.

Works for me with the current code in SVN.

> We are using xercesImpl.jar that is distributed with xalan-j 2.7.1.

Whatever you got out of Xalan-J 2.7.1 would be very old now. Have you 
tried Xerces-J 2.11.0?

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org


