Hi Sherman, ok so I'll admit to reading through to the end of your note and finding it interesting ;-)
Some comments in-lined. On 03/Jun/2011 22:55, Xueming Shen wrote: > I'm sure everybody understands what "negated character classes" [^...] > in j.u.regex means. > You would never have doubt about > > [^c] does NOT match "c" > [^0-9] does NOT match "8" > [^a-z] does NOT match "b" > [^a-bc-d] does NOT match 'c" > > But how about > > does [^[c]] match "c"? > does [^[0-9]] match "8"? > does [^[a-z]] match "b"? > does [^a-b[c-d]] match "c"? > > I was wrong on all of them when was asked first time and it took me > a while to figure out what is going on behind it. Oh, btw, the answer > is "yes" for all 4, yes, the > > [^[c]] matches "c" > [^[0-9]] matches "8" > [^[a-z]] matches "b". > [^a-b[c-d]] matches "c" (while [^a-bc-d] does NOT match "c") I would not have known the right answer to this quiz either; it seems that the use of nested character sets is sufficiently rare that we've not had to learn how these behave. > Another interesting sample is > > [^a-b[c-d]e-f] matches "c" but does NOT match "e" (so the "e-f" part after > the nested character class [c-d] does back to "normal"). > > It appears the "negation" of the "outer" character class does not > affect its nested character class, I think the easiest way to explain the situation is not to consider the negation separately, but that [^ is the syntax of a negated character set. Having a normal character set inside a negated character set then seems ok to me. > so [^X] is always opposite from > [^[X]], "X" to be any character class. > > Same "strange" thing seems to be true for "intersection operation &&" > as well, so both [a-d&&c-f] and [^a-d&&c-f] do NOT match "a". > > This does not sound correct, at least for me. This case is, I agree, a gotcha which is hard to justify through the syntax rather than the implementation. > The source code suggests that we are treating the nested/embedded > [...] character class and the "intersection &&" specially, so > > [^[X] is interpreted as [^] union [X] > [^X[Y]] is interpreted as [^X] union [Y] > [^X[Y]Z] is interpreted as [^XZ] union [Y] > [^X&&Y] is interpreted as [^X] && Y > > What I meant "treating...specially" is that we do NOT do the same > thing for other "embedded character classes", so while [^[a-z]] does > match "c", [^\p{Lower}] actually does NOT match "c", which I would > expect. > > The j.u.regex.Pattern APIs do NOT help. All the samples given for > "Character classes"[1] section are "simple" negation, no "nested" > sample is given. And neither "^" nor "[^...]" appear in the operator > precedence table. The behaviors in other regex engines, such as Perl > and POSIX, don't help, as "nested" character class is not supported > there. > > I did check with the original author who wrote this part of the code. > It appears the current implementation is indeed what he intended to > do back then, so this behavior is NOT an implementation bug but by > design. > > Personally I don't feel this design is not correct. More mind tricks ;-) ? You think the design *is* wrong? > Ideally, I would assume the spec either specifies [^...] as a > separate "group operator" to be the "complement" of [...], or "^" as > the "negation operator" with the lowest precedence, such as (from > lowest to highest) > > (1) Negation ^ (only at the beginning of the [...]) > (2) Intersection && > (3) Range - > (4) nested class [] or as I suggested above (5) negated class [^ ... ] and the understanding that nested classes do not 'inherit' the negation property of their parent. > > So > [^X[Y]] would be the "complement" of [X<union>[Y]] > > [^X[Y]Z] would be the "complement" of [X<union>[Y]<union>Z] > > [^X&&Y] would be the "complement" of [X&&Y] > > for example, if I dump the regex internal logic node tree for the sample > regex > [^a-b[c-d]e-f], the jdk7 and jdk8 results would look like > > /home/sherman/TL/regex$ > /export/sherman/Workspace/jdk7/jdk/build/linux-i586/bin/java RegEx -flag > "1000" "[^a-b[c-d]e-f]" "c" > Pattern=<[^a-b[c-d]e-f]> > Input =<c> > 1: <Difference> (7) > 2: <Union> (0) > 3: <Complement> (0) > 4: <Range [a-b]> (0) > 5: <Range [c-d]> (0) > 6: <Range [e-f]> (0) > 7: <END> (0) > ------------------------------- > match:true > groupCount=0 > > /home/sherman/TL/regex$ > /export/sherman/Workspace/jdk8/jdk/build/linux-i586/bin/java RegEx -flag > "1000" "[^a-b[c-d]e-f]" "c" > Pattern=<[^a-b[c-d]e-f]> > Input =<c> > 1: <Complement> (7) > 2: <Union> (0) > 3: <Union> (0) > 4: <Range [a-b]> (0) > 5: <Range [c-d]> (0) > 6: <Range [e-f]> (0) > 7: <END> (0) > ------------------------------- > match:false > > I know, most of people might not be interested, but if you have read > this far, means you are interested:-) and might have some opinions. > It is definitely an incompatible change, given this has been the > behavior from the very beginning of Java regex and has been there for > almost a decade, I would appreciate any comment/opinion, especially > if you agree that the existing behavior is NOT "correct" and therefor > we need to fix, OR you think the existing one is just fine (the fact > I only heard one complain the past 5 -6 years:-) so far), OR even the > existing behavior is not "ideal", but given the compatibility > concern, we might just leave it alone. Since it does not seem to be causing a great problem, I would be inclined to document it and leave it alone. Should the future include subtractive character classes then writing, e.g. [^a-z-[safe]] would seem more natural than [^a-z-[^safe]]. Regards, Tim > The fix/change itself is relatively easy, as showed at > > http://cr.openjdk.java.net/~sherman/pattern_cc/ > <http://cr.openjdk.java.net/%7Esherman/pattern_cc/> > > Thanks > -Sherman > > [1] http://download.java.net/jdk7/docs/api/java/util/regex/Pattern.html#cc