On Thu, 30 Oct 2025 02:59:45 GMT, Xueming Shen <[email protected]> wrote:

>> ### Summary
>> 
>> Case folding is a key operation for case-insensitive matching (e.g., string 
>> equality, regex matching), where the goal is to eliminate case distinctions 
>> without applying locale or language specific conversions.
>> 
>> Currently, the JDK does not expose a direct API for Unicode-compliant case 
>> folding. Developers now rely on methods such as:
>> 
>> **String.equalsIgnoreCase(String)**
>> 
>> - Unicode-aware, locale-independent.
>> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per 
>> code point.
>> - Limited: does not support 1:M mapping defined in Unicode case folding.
>> 
>> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>> 
>> - Locale-independent, single code point only.
>> - No support for 1:M mappings.
>> 
>> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>> 
>> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
>> - Intended primarily for presentation/display, not structural 
>> case-insensitive matching.
>> - Requires full string conversion before comparison, which is less efficient 
>> and not intended for structural matching.
>> 
>> **1:M mapping example, U+00DF (ß)**
>> 
>> - "ß".toUpperCase(Locale.ROOT) → "SS"
>> - Case folding produces "ss", matching Unicode caseless comparison rules.
>> 
>> 
>> jshell> "\u00df".equalsIgnoreCase("ss")
>> $22 ==> false
>> 
>> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
>> $24 ==> true
>> 
>> 
>> ### Motivation & Direction
>> 
>> Add Unicode standard-compliant case-less comparison methods to the String 
>> class, enabling reliable and efficient Unicode-compliant 
>> case-insensitive matching.
>> 
>> - Unicode-compliant **full** case folding.
>> - Simpler, stable and more efficient case-less matching without workarounds.
>> - Brings Java's string comparison handling in line with other programming 
>> languages/libraries.
>> 
>> This PR proposes to introduce the following comparison methods in the 
>> `String` class:
>> 
>> - boolean equalsFoldCase(String anotherString)
>> - int compareToFoldCase(String anotherString)
>> - Comparator<String> UNICODE_CASEFOLD_ORDER
>> 
>> These methods are intended to be the preferred choice when Unicode-compliant 
>> case-less matching is required.
>> 
>> *Note: An early draft also proposed a String.toCaseFold() method returning a 
>> new case-folded string.
>> However, during review this was considered error-prone, as the resulting 
>> string could easily be mistaken for a general transformation like 
>> toLowerCase() and then pass...
>
> Xueming Shen has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   update to use value long for folding

Looking good.
I'll look at the javadoc again when the CSR comments are addressed.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 53:

> 51:     public static boolean isDefined(int cp) {
> 52:          return getDefined(cp) != -1;
> 53:      }

Extra space.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 107:

> 105:     * family may appears independently or within a class.
> 106:     * <p>
> 107:     * For loose/case-insensitive matching, the back-refs, slices and singles apply {code toUpperCase} and

Missing at-sign in markup:
Suggestion:

    * For loose/case-insensitive matching, the back-refs, slices and singles apply {@code toUpperCase} and

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 136:

> 134:     *
> 135:     * <p>
> 136:     * @spec https://www.unicode.org/reports/tr18/#Simple_Loose_Matches

I'd put @spec after @return.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 152:

> 150:                 }
> 151:             }
> 152:         }

If expanded_case_cps were sorted, Arrays.binarySearch could be used to find 
the index of the first character in the range.
The loop could then break as soon as cp passes the end of the range.
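
A minimal sketch of that idea, for illustration only. The array contents, the class name and the method name are hypothetical stand-ins, not the code in the PR. It assumes the code points are kept sorted in ascending order with no duplicates.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

final class SortedRangeScan {
    // Hypothetical stand-in for expanded_case_cps; must be sorted ascending.
    static final int[] SORTED_CPS = {0x4B, 0x53, 0x6B, 0x73, 0xDF, 0x17F, 0x212A};

    /** Passes every code point of SORTED_CPS within [start, end] to the action. */
    static void forEachInRange(int start, int end, IntConsumer action) {
        int i = Arrays.binarySearch(SORTED_CPS, start);
        if (i < 0) {
            // Not found: binarySearch returns -(insertionPoint) - 1, and the
            // insertion point is the index of the first element greater than start.
            i = -i - 1;
        }
        for (; i < SORTED_CPS.length; i++) {
            int cp = SORTED_CPS[i];
            if (cp > end) {
                break; // sorted input: nothing further can be in range
            }
            action.accept(cp);
        }
    }
}
```

With a sorted array, the scan touches only the entries that fall inside the range plus one, instead of walking the whole array for every range.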

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 163:

> 161:       .stream()
> 162:       .mapToInt(Integer::intValue)
> 163:       .toArray();

It might be worthwhile to sort these, to enable a quicker break once the 
last one in the range has been seen.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 169:

> 167:     private static final int HASH_NEXT = 2;
> 168: 
> 169:     private static int[][] hashKeys(int[] keys) {

It may be worthwhile to round the hash modulus up to a prime number to reduce 
unfortunate hash collisions.
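
A rough sketch of what I mean. The body of hashKeys is not quoted here, so this assumes the bucket index is computed as key modulo table size; PrimeModulus, nextPrime and bucket are hypothetical names.

```java
final class PrimeModulus {
    /** Trial-division primality check; adequate for table-sized values. */
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        if (n % 2 == 0) {
            return n == 2;
        }
        for (int d = 3; (long) d * d <= n; d += 2) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    /** Returns the smallest prime greater than or equal to n. */
    static int nextPrime(int n) {
        int p = Math.max(n, 2);
        while (!isPrime(p)) {
            p++;
        }
        return p;
    }

    /** Bucket index under the assumed "key mod table size" scheme. */
    static int bucket(int key, int modulus) {
        return Math.floorMod(key, modulus);
    }
}
```

Code points that share a stride with the table size are the case this guards against: 0x100, 0x110 and 0x120 all land in bucket 0 when the modulus is 16, but in three different buckets when it is rounded up to nextPrime(16), which is 17.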

test/jdk/java/lang/String/UnicodeCaseFoldingTest.java line 31:

> 29:  * @compile --add-exports java.base/jdk.internal.lang=ALL-UNNAMED
> 30:  * UnicodeCaseFoldingTest.java
> 31:  * @run junit/othervm --add-exports java.base/jdk.internal.lang=ALL-UNNAMED

The @modules directive can replace the explicit --add-exports, and the 
explicit @compile may then be unnecessary.

* @modules java.base/jdk.internal.lang:+open

-------------

PR Review: https://git.openjdk.org/jdk/pull/27628#pullrequestreview-3436511645
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505610221
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505623056
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505629880
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505705459
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505699712
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505714395
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2505728277
