On Mon, 14 Jul 2025 04:53:13 GMT, Xueming Shen <sher...@openjdk.org> wrote:

> Regex class should conform to **_Level 1_** of [Unicode Technical Standard 
> #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), plus 
> RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
> 
> This PR primarily addresses conformance with RL1.5: Simple Loose Matches, 
> which requires that simple case folding be applied to literals and 
> (optionally) to character classes. When applied to character classes, each 
> class is expected to be closed under simple case folding. See the standard 
> for a detailed explanation of what it means for a class to be “closed.”
> 
> **RL1.5 states**: 
> 
> To meet this requirement, an implementation that supports case-insensitive 
> matching should
> 
>     1. Provide at least the simple, default Unicode case-insensitive 
> matching, and
>     2. Specify which character properties or constructs are closed under the 
> matching.
> 
> **In the Pattern implementation**, 5 types of constructs may be affected by 
> case sensitivity:
> 
>     1. back-refs,
>     2. string slices (sequences),
>     3. single characters,
>     4. character families (Unicode Properties ...), and
>     5. character class ranges
> 
> **Note**: Single characters and families may appear independently or within a 
> character class.
> 
> For case-insensitive (loose) matching, the implementation already applies 
> Character.toUpperCase() and Character.toLowerCase() to **both the pattern and 
> the input string** for back-refs, slices, and single characters. This 
> effectively makes these constructs closed under case folding.
> 
> This has been verified in the newly added test case 
> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
> 
> For example:
> 
> Pattern.compile("(?ui)\u017f").matcher("S").matches()      => true
> Pattern.compile("(?ui)[\u017f]").matcher("S").matches()    => true
> 
> The character properties (families) are not "closed" and should remain 
> unchanged. This is acceptable per RL1.5, if the behavior is clearly 
> specified (TBD: update javadoc to reflect this).
> 
> **Current Non-Conformance: Character Class Ranges**, as reported in the 
> original bug report.
> 
> Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches()  => false
> vs
> Pattern.compile("(?ui)[S-S]")....

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 45:

> 43:         var caseFoldingTxt = Paths.get(args[1]);
> 44:         var genSrcFile = Paths.get(args[2]);
> 45:         var supportedTypes = "^.*; [CTS]; .*$";

Do we still need T here given you already have a hardcoded special case?
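
For context, a minimal sketch (not the PR's code) of what dropping `T` from the filter would change. The sample lines follow the `CaseFolding.txt` layout, where the status field is C (common), F (full), S (simple), or T (Turkic special case); the class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

public class StatusFilterSketch {
    // Returns the lines whose status field is one of the given letters,
    // using the same regex shape as the generator ("^.*; [CTS]; .*$").
    static List<String> filter(List<String> lines, String statuses) {
        Pattern p = Pattern.compile("^.*; [" + statuses + "]; .*$");
        return lines.stream()
                .filter(line -> p.matcher(line).matches())
                .toList();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "0041; C; 0061; # LATIN CAPITAL LETTER A",
                "0049; T; 0131; # LATIN CAPITAL LETTER I",
                "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
                "1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S");
        System.out.println(filter(sample, "CTS").size()); // 3: C, T and S lines
        System.out.println(filter(sample, "CS").size());  // 2: the T line is dropped
    }
}
```

If the hardcoded special case already covers the T entries, the narrower `[CS]` filter would keep them out of the generated table.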

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 60:

> 58:                 .map(cols -> String.format("        entry(0x%s, 0x%s),", 
> cols[0], cols[2]))
> 59:                 .collect(Collectors.joining("\n"))
> 60:                 .replaceFirst(",$", "");  // remove the last ','

Suggestion:

                .map(cols -> String.format("        entry(0x%s, 0x%s)", cols[0], cols[2]))
                .collect(Collectors.joining(",\n", "", "\n"));  // no trailing ',' to remove
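
To show the difference, here is a small self-contained comparison of the two approaches. It is only a sketch: the class name, method names, and sample rows are hypothetical, and the indentation inside the format string is omitted for brevity.

```java
import java.util.List;
import java.util.stream.Collectors;

public class JoiningSketch {
    // Current approach: append ',' to every entry, then strip the last one.
    static String original(List<String[]> rows) {
        return rows.stream()
                .map(cols -> String.format("entry(0x%s, 0x%s),", cols[0], cols[2]))
                .collect(Collectors.joining("\n"))
                .replaceFirst(",$", "");
    }

    // Suggested approach: let the delimiter carry the ',' so nothing needs stripping.
    static String suggested(List<String[]> rows) {
        return rows.stream()
                .map(cols -> String.format("entry(0x%s, 0x%s)", cols[0], cols[2]))
                .collect(Collectors.joining(",\n", "", "\n"));
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] {"0041", "C", "0061"},
                new String[] {"1E9E", "S", "00DF"});
        System.out.print(suggested(rows));
        // The two results differ only by the trailing newline suffix.
        System.out.println(suggested(rows).equals(original(rows) + "\n"));
    }
}
```

The only difference in output is the trailing newline that the suffix argument adds.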

make/jdk/src/classes/build/tools/generatecharacter/CaseFolding.java line 74:

> 72:                 StandardOpenOption.CREATE, 
> StandardOpenOption.TRUNCATE_EXISTING);
> 73:         } catch (IOException e) {
> 74:             e.printStackTrace();

I recommend removing this catch and adding `throws Throwable` to the signature 
of `main`, so that a failed write fails the build instead of only printing a 
stack trace.
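
Roughly what I have in mind, as a sketch only; the class name, the `generate` helper, and its parameters are hypothetical stand-ins for the tool's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PropagateSketch {
    // Writes the generated source; any IOException propagates to the caller.
    static void generate(Path genSrcFile, String content) throws IOException {
        Files.writeString(genSrcFile, content,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    // No try/catch: an uncaught exception ends the JVM with a non-zero exit
    // status, so the build step that runs this tool fails visibly.
    public static void main(String[] args) throws Throwable {
        generate(Path.of(args[0]), "// generated content placeholder\n");
    }
}
```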

src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template 
line 36:

> 34: public final class CaseFolding {
> 35: 
> 36:     private static Map<Integer, Integer> expanded_casefolding = 
> Map.ofEntries(

Suggestion:

    private static final Map<Integer, Integer> expanded_casefolding = Map.ofEntries(

src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template 
line 99:

> 97:      */
> 98:     public static int[] getClassRangeClosingCharacters(int start, int 
> end) {
> 99:         int[] expanded = new int[expanded_casefolding.size()];

The array size can be `Math.min(expanded_casefolding.size(), end - start)` in 
case the table grows large; the `off < expanded.length` check below would need 
the same update.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203858280
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203854636
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203852720
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203850027
PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2203851719
