Re: RFR: 8365675: Add String Unicode Case-Folding Support [v2]

Chen Liang Fri, 17 Oct 2025 23:12:06 -0700

On Wed, 8 Oct 2025 00:33:20 GMT, Xueming Shen <[email protected]> wrote:


>> ### Summary
>> 
>> Case folding is a key operation for case-insensitive matching (e.g., string 
>> equality, regex matching), where the goal is to eliminate case distinctions 
>> without applying locale or language specific conversions.
>> 
>> Currently, the JDK does not expose a direct API for Unicode-compliant case 
>> folding. Developers currently rely on methods such as:
>> 
>> **String.equalsIgnoreCase(String)**
>> 
>> - Unicode-aware, locale-independent.
>> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per 
>> code point.
>> - Limited: does not support 1:M mapping defined in Unicode case folding.
>> 
>> **Character.toLowerCase(int) / Character.toUpperCase(int)**
>> 
>> - Locale-independent, single code point only.
>> - No support for 1:M mappings.
>> 
>> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
>> 
>> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
>> - Intended primarily for presentation/display, not structural 
>> case-insensitive matching.
>> - Requires converting both full strings before comparison, which is less 
>> efficient than comparing code points in place.
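
To make the difference between the per-code-point mappings and the full-string mappings above concrete, here is a minimal sketch using only existing JDK methods. The class name `CaseMappingDemo` and the helper `sameSimpleCase` are hypothetical names chosen for illustration; the helper only mirrors the `toLowerCase(toUpperCase(cp))` comparison described in the summary and is not the actual `equalsIgnoreCase` implementation.

```java
import java.util.Locale;

class CaseMappingDemo {
    // Hypothetical helper: per-code-point comparison in the style the
    // summary attributes to equalsIgnoreCase (simple 1:1 mappings only).
    static boolean sameSimpleCase(int cp1, int cp2) {
        return Character.toLowerCase(Character.toUpperCase(cp1))
                == Character.toLowerCase(Character.toUpperCase(cp2));
    }

    public static void main(String[] args) {
        int sharpS = 0x00DF; // U+00DF LATIN SMALL LETTER SHARP S

        // Simple (1:1) mapping: U+00DF has no single-code-point uppercase.
        System.out.println(Character.toUpperCase(sharpS) == sharpS);

        // Full (1:M) mapping: the String method expands it to two chars.
        System.out.println("\u00df".toUpperCase(Locale.ROOT).length());

        // 1:1 mappings still handle cases such as U+212A KELVIN SIGN vs 'K'.
        System.out.println(sameSimpleCase('K', 0x212A));
    }
}
```

Because the per-code-point path never changes the length of the input, it cannot match U+00DF against "ss", which is the gap the 1:M discussion refers to.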
>> 
>> **1:M mapping example, U+00DF (ß)**
>> 
>> - "ß".toUpperCase(Locale.ROOT) → "SS"
>> - Case folding produces "ss", matching Unicode caseless comparison rules.
>> 
>> 
>> jshell> "\u00df".equalsIgnoreCase("ss")
>> $22 ==> false
>> 
>> jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
>> $24 ==> true
>> 
>> 
>> ### Motivation & Direction
>> 
>> Add Unicode standard-compliant case-less comparison methods to the String 
>> class, enabling reliable and efficient Unicode-compliant case-insensitive 
>> matching.
>> 
>> - Unicode-compliant **full** case folding.
>> - Simpler, stable and more efficient case-less matching without workarounds.
>> - Brings Java's string comparison handling in line with other programming 
>> languages/libraries.
>> 
>> This PR proposes to introduce the following comparison methods in the 
>> `String` class:
>> 
>> - boolean equalsFoldCase(String anotherString)
>> - int compareToFoldCase(String anotherString)
>> - Comparator<String> UNICODE_CASEFOLD_ORDER
>> 
>> These methods are intended to be the preferred choice when Unicode-compliant 
>> case-less matching is required.
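
For readers on a JDK without this patch, the shape of the proposed API can be approximated with existing methods. The sketch below is an assumption-laden stand-in, not the proposed implementation: `FoldOrderSketch`, `approxFold`, `APPROX_CASEFOLD_ORDER`, and `approxEqualsFoldCase` are hypothetical names, and the upper-then-lower round trip under `Locale.ROOT` is the workaround from the jshell example, which is not guaranteed to agree with Unicode full case folding for every code point.

```java
import java.util.Comparator;
import java.util.Locale;

class FoldOrderSketch {
    // Hypothetical approximation of a case-folded key: NOT real Unicode
    // case folding, just the toUpperCase/toLowerCase round trip.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for the proposed UNICODE_CASEFOLD_ORDER.
    static final Comparator<String> APPROX_CASEFOLD_ORDER =
            Comparator.comparing(FoldOrderSketch::approxFold);

    // Hypothetical stand-in for the proposed equalsFoldCase.
    static boolean approxEqualsFoldCase(String a, String b) {
        return APPROX_CASEFOLD_ORDER.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(approxEqualsFoldCase("Stra\u00dfe", "STRASSE"));
        System.out.println(APPROX_CASEFOLD_ORDER.compare("apple", "Banana") < 0);
    }
}
```

Note that this allocates two new strings per operand on every comparison, which is the inefficiency the dedicated comparison methods are meant to avoid.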
>> 
>> *Note: An early draft also proposed a String.toCaseFold() method returning a 
>> new case-folded string.
>> However, during review this was considered error-prone, as the resulting 
>> string could easily be mistaken for a general transformation like 
>> toLowerCase() and then pass...
>
> Xueming Shen has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   minor api doc updates

Given that this patch leaves many performance optimization opportunities open, 
I recommend handling those in subsequent RFEs so that we can review this PR 
purely from a specification point of view.

make/modules/java.base/gensrc/GensrcCharacterData.gmk line 76:

> 74: 
> 75: 
> 76: GENSRC_STRINGCASEFOLDING := $(SUPPORT_OUTPUTDIR)/gensrc/java.base/jdk/internal/java/lang/CaseFolding.java

Can we target the package `jdk.internal.lang` instead of 
`jdk.internal.java.lang`? I think the former is the convention established by 
stable values.

-------------

PR Review: https://git.openjdk.org/jdk/pull/27628#pullrequestreview-3314954963
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2413957017
