Re: j.u.regex: Negated Character Classes

2011-06-08 Thread Tim Ellison
Hi Sherman, ok so I'll admit to reading through to the end of your note
and finding it interesting ;-)

Some comments in-lined.

On 03/Jun/2011 22:55, Xueming Shen wrote:
 I'm sure everybody understands what negated character classes [^...]
 in j.u.regex means.
 You would never have doubt about
 
 [^c] does NOT match c
 [^0-9] does NOT match 8
 [^a-z] does NOT match b
 [^a-bc-d] does NOT match 'c
 
 But how about
 
 does [^[c]] match c?
 does [^[0-9]] match 8?
 does [^[a-z]] match b?
 does [^a-b[c-d]] match c?
 
 I was wrong on all of them when was asked first time and it took me
 a while to figure out what is going on behind it. Oh, btw, the answer
 is yes for all 4, yes, the
 
 [^[c]] matches c
 [^[0-9]] matches 8
 [^[a-z]] matches b.
 [^a-b[c-d]] matches c  (while [^a-bc-d] does NOT match c)

I would not have known the right answer to this quiz either; it seems
that the use of nested character sets is sufficiently rare that we've
not had to learn how these behave.

 Another interesting sample is
 
 [^a-b[c-d]e-f] matches c but does NOT match e (so the e-f part after
 the nested character class [c-d] does back to normal).
 
 It appears the negation of the outer character class does not
 affect its nested character class,

I think the easiest way to explain the situation is not to consider the
negation separately, but that [^ is the syntax of a negated character
set.  Having a normal character set inside a negated character set then
seems ok to me.

 so [^X] is always opposite from
 [^[X]], X to be any character class.
 
 Same strange thing seems to be true for intersection operation 
 as well, so both [a-dc-f] and [^a-dc-f] do NOT match a.
 
 This does not sound correct, at least for me.

This case is, I agree, a gotcha which is hard to justify through the
syntax rather than the implementation.

 The source code suggests that we are treating the nested/embedded
 [...] character class and the intersection  specially, so
 
 [^[X]  is interpreted as [^] union [X]
 [^X[Y]] is interpreted as [^X] union [Y]
 [^X[Y]Z] is interpreted as [^XZ] union [Y]
 [^XY] is interpreted as [^X]  Y
 
 What I meant treating...specially is that we do NOT do the same
 thing for other embedded character classes, so while [^[a-z]] does
 match c, [^\p{Lower}] actually does NOT match c, which I would
 expect.

 The j.u.regex.Pattern APIs do NOT help. All the samples given for 
 Character classes[1] section are simple negation, no nested
 sample is given. And neither ^ nor [^...] appear in the operator
 precedence table. The behaviors in other regex engines, such as Perl
 and POSIX, don't help, as nested character class is not supported
 there.
 
 I did check with the original author who wrote this part of the code.
 It appears the current implementation is indeed what he intended to
 do back then, so this behavior is NOT an implementation bug but by
 design.
 
 Personally I don't feel this design is not correct.

More mind tricks ;-) ?  You think the design *is* wrong?

 Ideally, I would assume the spec either specifies [^...] as a
 separate group operator to be the complement of [...], or ^ as
 the negation operator with the lowest precedence, such as (from
 lowest to highest)
 
 (1) Negation  ^(only at the beginning of the [...])
 (2) Intersection 
 (3) Range -
 (4) nested class []

or as I suggested above
  (5) negated class [^ ... ]

and the understanding that nested classes do not 'inherit' the negation
property of their parent.

 
 So
 [^X[Y]] would be the complement of [Xunion[Y]]
 
 [^X[Y]Z] would be the complement of  [Xunion[Y]unionZ]
 
 [^XY] would be the complement of [XY]
 
 for example, if I dump the regex internal logic node tree for the sample
 regex
 [^a-b[c-d]e-f], the jdk7 and jdk8 results would look like
 
 /home/sherman/TL/regex$
 /export/sherman/Workspace/jdk7/jdk/build/linux-i586/bin/java RegEx -flag
 1000 [^a-b[c-d]e-f] c
 Pattern=[^a-b[c-d]e-f]
 Input  =c
  1: Difference (7)
  2: Union (0)
  3: Complement (0)
  4: Range [a-b] (0)
  5: Range [c-d] (0)
  6: Range [e-f] (0)
  7: END (0)
 ---
 match:true
 groupCount=0
 
 /home/sherman/TL/regex$
 /export/sherman/Workspace/jdk8/jdk/build/linux-i586/bin/java RegEx -flag
 1000 [^a-b[c-d]e-f] c
 Pattern=[^a-b[c-d]e-f]
 Input  =c
  1: Complement (7)
  2: Union (0)
  3: Union (0)
  4: Range [a-b] (0)
  5: Range [c-d] (0)
  6: Range [e-f] (0)
  7: END (0)
 ---
 match:false
 
 I know, most of people might not be interested, but if you have read 
 this far, means you are interested:-) and might have some opinions.
 It is definitely an incompatible change, given this has been the
 behavior from the very beginning of Java regex and has been there for
 almost a decade, I would appreciate any comment/opinion, especially
 if you agree that the existing behavior is NOT correct and therefor
 we need to fix, OR you think the existing one is just 

Re: j.u.regex: Negated Character Classes

2011-06-08 Thread Xueming Shen

 Hi Tim,

Semantically I don't see too much difference between to consider [^ 
syntax as a negated character set or to
separate the ^ out and treat it as an unary operator with the lowest 
precedence (only at the very beginning,
of course). The issue here is whether or not to consider the nested 
class character (the character classes
grouped by [...]) as a basic element of its enclosing class, 
equivalent to other elements, such as the literals,

the predefined, the range and the union.

While POSIX's bracket expression does not support nested/sub bracket 
expression,  it does define a set of
so called POSIX character classes in form of [:...:] that can only be 
used inside the bracket expression, as
a basic element, equivalent to other elements when nested/embedded in 
[...] or [^...]. So [a-z[:digit:] matches
any of lowercase ascii or 0-9 digits and [^a-z[:digit] matches any 
character that is NOT lowercase ascii AND
0-9 digits, in which the [:digit'] is an equivalence of 0-9, or [0-9] if 
nested bracket expression is supported.


In Java, however the [0-9] (grouping), 0-9 (range) and \p{digit} are NOT 
treated as equivalent when inside [^..],

but the same when inside [...].

My apology for the confusion. Yes, I meant to say The design decision 
IS wrong:-)


I agree that it appears not to be a great problem for most people, it 
might be OK just leave it alone (and
document it somewhere), consider it has been the behavior for decade. 
But on the other side, it also makes
the compatibility issue less severe:-) especially it's hard to 
explain the  case.


-Sherman


On 6/8/2011 8:27 AM, Tim Ellison wrote:

Hi Sherman, ok so I'll admit to reading through to the end of your note
and finding it interesting ;-)

Some comments in-lined.

On 03/Jun/2011 22:55, Xueming Shen wrote:

I'm sure everybody understands what negated character classes [^...]
in j.u.regex means.
You would never have doubt about

[^c] does NOT match c
[^0-9] does NOT match 8
[^a-z] does NOT match b
[^a-bc-d] does NOT match 'c

But how about

does [^[c]] match c?
does [^[0-9]] match 8?
does [^[a-z]] match b?
does [^a-b[c-d]] match c?

I was wrong on all of them when was asked first time and it took me
a while to figure out what is going on behind it. Oh, btw, the answer
is yes for all 4, yes, the

[^[c]] matches c
[^[0-9]] matches 8
[^[a-z]] matches b.
[^a-b[c-d]] matches c  (while [^a-bc-d] does NOT match c)

I would not have known the right answer to this quiz either; it seems
that the use of nested character sets is sufficiently rare that we've
not had to learn how these behave.


Another interesting sample is

[^a-b[c-d]e-f] matches c but does NOT match e (so the e-f part after
the nested character class [c-d] does back to normal).

It appears the negation of the outer character class does not
affect its nested character class,

I think the easiest way to explain the situation is not to consider the
negation separately, but that [^ is the syntax of a negated character
set.  Having a normal character set inside a negated character set then
seems ok to me.


so [^X] is always opposite from
[^[X]], X to be any character class.

Same strange thing seems to be true for intersection operation
as well, so both [a-dc-f] and [^a-dc-f] do NOT match a.

This does not sound correct, at least for me.

This case is, I agree, a gotcha which is hard to justify through the
syntax rather than the implementation.


The source code suggests that we are treating the nested/embedded
[...] character class and the intersection specially, so

[^[X]  is interpreted as [^] union [X]
[^X[Y]] is interpreted as [^X] union [Y]
[^X[Y]Z] is interpreted as [^XZ] union [Y]
[^XY] is interpreted as [^X]  Y

What I meant treating...specially is that we do NOT do the same
thing for other embedded character classes, so while [^[a-z]] does
match c, [^\p{Lower}] actually does NOT match c, which I would
expect.

The j.u.regex.Pattern APIs do NOT help. All the samples given for
Character classes[1] section are simple negation, no nested
sample is given. And neither ^ nor [^...] appear in the operator
precedence table. The behaviors in other regex engines, such as Perl
and POSIX, don't help, as nested character class is not supported
there.

I did check with the original author who wrote this part of the code.
It appears the current implementation is indeed what he intended to
do back then, so this behavior is NOT an implementation bug but by
design.

Personally I don't feel this design is not correct.

More mind tricks ;-) ?  You think the design *is* wrong?


Ideally, I would assume the spec either specifies [^...] as a
separate group operator to be the complement of [...], or ^ as
the negation operator with the lowest precedence, such as (from
lowest to highest)

(1) Negation  ^(only at the beginning of the [...])
(2) Intersection
(3) Range -
(4) nested class []

or as I suggested above
   (5) negated class [^ ... ]

and the understanding that nested classes do not