The root cause is an off-by-one bug introduced years ago in a change we made 
for Pattern.CANON_EQ.
See https://cr.openjdk.org/~sherman/regexCE/Note.txt for background info.

As described in the writeup above, the basic logic of that change is to:

**generate the permutations, create the alternation and then put it 
appropriately into the character class (logically), we now use a special 
"Node", the NFCCharProperty to do the matching work. The NFCCharProperty tries 
to match a grapheme cluster at a time (nfc greedily, then backtrack) against the 
character class.**

There appears to be an off-by-one bug in the backtrack boundary condition check 
when backtracking reaches the position 'after' the base (main) character (the 
case where the resulting 'nfc' string is not a **single-character** string or 
does not match). In such cases, we still need to match/compare the base 
character against the _predicate_ to find the potential match. 

For example, in the reported scenario, the target string contains the pair 
**u+2764** (emoji) + **u+fe0f** (variation selector/emoji_component). The 
boundary edge j = Grapheme.nextBoundary() starts at **2** (after u+fe0f), then 
backtracks to 1. The current boundary check exits here because 0 + 1 < 1 is 
false, so the base character is never compared against the predicate. 
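
The boundary condition described above can be modelled with a small standalone sketch. This is a simplified illustration, not the actual NFCCharProperty code: the class name, the `matchOne` helper, the `inclusive` flag and the stand-in predicate are all hypothetical, and the grapheme boundary is passed in by hand because `Grapheme.nextBoundary()` is JDK-internal.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

// Simplified model of "nfc greedily, then backtrack"; NOT the JDK implementation.
class BacktrackSketch {

    // Tries to match one logical character starting at index i.
    // clusterEnd is the grapheme cluster boundary (supplied by the caller here).
    // inclusive=false models the strict check (i + n < j); inclusive=true also
    // lets j backtrack onto the position right after the base character.
    // Returns the end index of the matched span, or -1 if nothing matched.
    static int matchOne(String s, int i, int clusterEnd,
                        IntPredicate predicate, boolean inclusive) {
        int n = Character.charCount(s.codePointAt(i)); // length of the base character
        int j = clusterEnd;
        while (inclusive ? i + n <= j : i + n < j) {
            String nfc = Normalizer.normalize(s.substring(i, j), Normalizer.Form.NFC);
            // Only a span that composes to a single code point can be tested.
            if (nfc.codePointCount(0, nfc.length()) == 1
                    && predicate.test(nfc.codePointAt(0))) {
                return j;
            }
            // Backtrack by one code point.
            j -= Character.charCount(s.codePointBefore(j));
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "\u2764\ufe0f";
        // Hypothetical stand-in for the \p{IsEmoji} predicate.
        IntPredicate isHeart = cp -> cp == 0x2764;
        System.out.println(matchOne(s, 0, 2, isHeart, false)); // prints -1
        System.out.println(matchOne(s, 0, 2, isHeart, true));  // prints 1
    }
}
```

With the strict check the loop stops as soon as j reaches 1, so the base character alone is never tested; with the inclusive check it is tested and matches.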

This emoji pair should match correctly, as shown below:


jshell> var p = Pattern.compile("\\p{IsEmoji}\\p{IsEmoji_Component}", 
Pattern.CANON_EQ);
p ==> \p{IsEmoji}\p{IsEmoji_Component}

jshell> p.matcher("\u2764\ufe0f").matches();
$53 ==> true


or


jshell> var p = Pattern.compile("\\p{IsEmoji}", Pattern.CANON_EQ);
p ==> \p{IsEmoji}

jshell> p.matcher("\u2764\ufe0f").find();
$55 ==> true


This bug is not limited to emoji + variation selector pairs (which don't 
'nfc' into a single character, even though they are treated as a single 
grapheme cluster). It also impacts cases involving dangling or unmatched 
combining character(s). For example, the following should match/find, even in 
Pattern.CANON_EQ mode.



jshell> p = Pattern.compile("\\p{IsGreek}\\p{IsAlphabetic}", Pattern.CANON_EQ);
p ==> \p{IsGreek}\p{IsAlphabetic}

jshell> p.matcher("\u1f80\u0345").matches();
$57 ==> true

jshell> p = Pattern.compile("[\\p{IsAlphabetic}]*", Pattern.CANON_EQ);
p ==> [\p{IsAlphabetic}]*

jshell> p.matcher("\u1f80\u0345").matches();
$59 ==> true


**note:** the grapheme boundary is not necessarily the same as the resulting 
nfc boundary.
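
To make the note concrete, here is a minimal sketch using only the public `java.text.Normalizer` API. The class and method names are made up for illustration. It counts how many code points remain after NFC normalization: a base letter plus a combining accent composes to one code point, while the emoji + variation selector pair stays at two even though it forms a single grapheme cluster.

```java
import java.text.Normalizer;

// Illustration only: the class and method names are hypothetical.
class NfcLength {

    // Number of code points left after normalizing s to NFC.
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // 'a' + combining acute accent composes to a single code point.
        System.out.println(nfcCodePoints("a\u0301"));      // prints 1
        // Heart + variation selector: one grapheme cluster, two code points.
        System.out.println(nfcCodePoints("\u2764\ufe0f")); // prints 2
        // Greek letter followed by an extra combining mark, as in the example above.
        System.out.println(nfcCodePoints("\u1f80\u0345"));
    }
}
```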

-------------

Commit messages:
 - 8354490: Pattern.CANON_EQ causes a pattern to not match a string with a 
UNICODE variation

Changes: https://git.openjdk.org/jdk/pull/25986/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25986&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8354490
  Stats: 19 lines in 2 files changed: 15 ins; 3 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/25986.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/25986/head:pull/25986

PR: https://git.openjdk.org/jdk/pull/25986
