On Mon, 14 Jul 2025 04:53:13 GMT, Xueming Shen <sher...@openjdk.org> wrote:
> Regex class should conform to **_Level 1_** of [Unicode Technical Standard
> #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), plus
> RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>
> This PR primarily addresses conformance with RL1.5: Simple Loose Matches,
> which requires that simple case folding be applied to literals and
> (optionally) to character classes. When applied to character classes, each
> class is expected to be closed under simple case folding. See the standard
> for a detailed explanation of what it means for a class to be “closed.”
>
> To conform with Level 1 of UTS #18, specifically RL1.5: Simple Loose Matches,
> simple case folding must be applied to literals and (optionally) to character
> classes. When applied to character classes, each character class is expected
> to **be closed under simple case folding**. See the standard for the
> detailed explanation and example of "closed".
>
> **RL1.5 states**:
>
> To meet this requirement, an implementation that supports case-sensitive
> matching should
>
>     1. Provide at least the simple, default Unicode case-insensitive
>        matching, and
>     2. Specify which character properties or constructs are closed under the
>        matching.
>
> **In the Pattern implementation**, 5 types of constructs may be affected by
> case sensitivity:
>
>     1. back-refs
>     2. string slices (sequences)
>     3. single character,
>     4. character families (Unicode Properties ...), and
>     5. character class ranges
>
> **Note**: Single characters and families may appear independently or within a
> character class.
>
> For case-insensitive (loose) matching, the implementation already applies
> Character.toUpperCase() and Character.toLowerCase() to **both the pattern and
> the input string** for back-refs, slices, and single characters. This
> effectively makes these constructs closed under case folding.
>
> This has been verified in the newly added test case
> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>
> For example:
>
>     Pattern.compile("(?ui)\u017f").matcher("S").matches()   => true
>     Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
>
> The character properties (families) are not "closed" and should remain
> unchanged. This is acceptable per RL1.5, if the behavior is clearly
> specified (TBD: update javadoc to reflect this).
>
> **Current Non-Conformance: Character Class Ranges**, as reported in the
> original bug report.
>
>     Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
>     vs
>     Pattern.compile("(?ui)[S-S]")....

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 45:

> 43:         var caseFoldingTxt = Paths.get(args[1]);
> 44:         var genSrcFile = Paths.get(args[2]);
> 45:         var supportedTypes = "^.*; [CTS]; .*$";

Do we still need T here given you already have a hardcoded special case?

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 60:

> 58:                 .map(cols -> String.format(" entry(0x%s, 0x%s),", cols[0], cols[2]))
> 59:                 .collect(Collectors.joining("\n"))
> 60:                 .replaceFirst(",$", ""); // remove the last ','

Suggestion:

                .map(cols -> String.format(" entry(0x%s, 0x%s)", cols[0], cols[2]))
                .collect(Collectors.joining(",\n", "", "\n")); // remove the last ','

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 74:

> 72:                 StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
> 73:         } catch (IOException e) {
> 74:             e.printStackTrace();

I recommend removing this catch and adding `throws Throwable` to the signature of `main`.

src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template line 36:

> 34: public final class CaseFolding {
> 35:
> 36:     private static Map<Integer, Integer> expanded_casefolding = Map.ofEntries(

Suggestion:

    private static final Map<Integer, Integer> expanded_casefolding = Map.ofEntries(
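
To illustrate the `Collectors.joining` suggestion for line 60 above: a minimal standalone sketch, not the generator itself. The class name `JoiningSketch`, the helper `entries`, and the two sample rows are made up for illustration; the rows only assume the `code; status; mapping` column order that the quoted code implies with `cols[0]` and `cols[2]`.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the PR.
final class JoiningSketch {

    // Formats each row as one entry(...) line. The ",\n" delimiter is only
    // placed between elements, so there is no trailing comma to strip later.
    static String entries(List<String[]> rows) {
        return rows.stream()
                .map(cols -> String.format(" entry(0x%s, 0x%s)", cols[0], cols[2]))
                .collect(Collectors.joining(",\n", "", "\n"));
    }

    public static void main(String[] args) {
        // Illustrative rows only (assumed column order: code, status, mapping).
        var rows = List.of(
                new String[] {"0041", "C", "0061"},
                new String[] {"017F", "C", "0073"});
        System.out.print(entries(rows));
    }
}
```

With this shape the separator logic lives in one place, and the `replaceFirst(",$", "")` post-processing step is no longer needed.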
src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template line 99:

> 97:      */
> 98:     public static int[] getClassRangeClosingCharacters(int start, int end) {
> 99:         int[] expanded = new int[expanded_casefolding.size()];

Can be `Math.min(expanded_casefolding.size(), end - start)` in case the table grows large, and update the `off < expanded.length` check below too.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203858280
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203854636
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203852720
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203850027
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203851719
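
A rough sketch of the buffer-sizing remark on template line 99, under stated assumptions. The class `RangeClosingSketch`, the method `closingCharacters`, and the two-entry `FOLDING` table are hypothetical stand-ins, not the PR's code. The sketch also assumes `end` is inclusive (hence `end - start + 1`) and that each code point has at most one folding, neither of which the comment above states.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical demo class, not the template under review.
final class RangeClosingSketch {

    // Tiny stand-in for the generated table (assumption: one folding per key).
    private static final Map<Integer, Integer> FOLDING = Map.ofEntries(
            Map.entry(0x017F, 0x0073),   // LATIN SMALL LETTER LONG S -> s
            Map.entry(0x212A, 0x006B));  // KELVIN SIGN -> k

    // Collects the foldings of table entries whose key lies in [start, end].
    static int[] closingCharacters(int start, int end) {
        if (start > end) {
            return new int[0];
        }
        // Neither the table size nor the range length can be exceeded,
        // so the smaller of the two is a sufficient capacity.
        long rangeLength = (long) end - start + 1;
        int[] expanded = new int[(int) Math.min(FOLDING.size(), rangeLength)];
        int off = 0;
        for (var e : FOLDING.entrySet()) {
            int cp = e.getKey();
            if (cp >= start && cp <= end) {
                expanded[off++] = e.getValue();
            }
        }
        // Trim to the number of slots actually filled.
        return Arrays.copyOf(expanded, off);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(closingCharacters(0x017F, 0x017F)));
    }
}
```

Because at most one value is written per matching key, `off` cannot pass the smaller bound here; a real implementation with different lookup logic would need its own bounds check.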