Re: RFR: 8365675: Add String Unicode Case-Folding Support

Roger Riggs Tue, 07 Oct 2025 15:25:40 -0700

On Fri, 3 Oct 2025 19:56:22 GMT, Xueming Shen <[email protected]> wrote:


> ### Summary
> 
> Case folding is a key operation for case-insensitive matching (e.g., string 
> equality, regex matching), where the goal is to eliminate case distinctions 
> without applying locale or language specific conversions.
> 
> The JDK does not currently expose a direct API for Unicode-compliant case 
> folding. Developers instead rely on methods such as:
> 
> **String.equalsIgnoreCase(String)**
> 
> - Unicode-aware, locale-independent.
> - Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per 
> code point.
> - Limited: does not support 1:M mapping defined in Unicode case folding.
> 
> **Character.toLowerCase(int) / Character.toUpperCase(int)**
> 
> - Locale-independent, single code point only.
> - No support for 1:M mappings.
> 
> **String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)**
> 
> - Based on Unicode SpecialCasing.txt, supports 1:M mappings.
> - Intended primarily for presentation/display, not structural 
> case-insensitive matching.
> - Requires converting the entire string before comparison, which is less 
> efficient.
> 
> **1:M mapping example, U+00DF (ß)**
> 
> - "ß".toUpperCase(Locale.ROOT) → "SS"
> - Case folding produces "ss", matching Unicode caseless comparison rules.
> 
> 
> jshell> "\u00df".equalsIgnoreCase("ss")
> $22 ==> false
> 
> jshell> 
> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
> $24 ==> true
> 
> 
> ### Motivation & Direction
> 
> Add Unicode standard-compliant caseless comparison methods to the String 
> class, enabling reliable and efficient Unicode-compliant case-insensitive 
> matching.
> 
> - Unicode-compliant **full** case folding.
> - Simpler, stable and more efficient case-less matching without workarounds.
> - Brings Java's string comparison handling in line with other programming 
> languages/libraries.
> 
> This PR proposes to introduce the following comparison methods in the 
> `String` class:
> 
> - boolean equalsFoldCase(String anotherString)
> - int compareToFoldCase(String anotherString)
> - Comparator<String> UNICODE_CASEFOLD_ORDER
> 
> These methods are intended to be the preferred choice when Unicode-compliant 
> case-less matching is required.
> 
> *Note: An early draft also proposed a String.toCaseFold() method returning a 
> new case-folded string.
> However, during review this was considered error-prone, as the resulting 
> string could easily be mistaken for a general transformation like 
> toLowerCase() and then passed into APIs where case-folding semantics are not 
> appropriate.*
> 
> ### The New API
> 
> 
>    /**
>      * Compares thi...

The API looks good.

Is the performance comparable to equalsIgnoreCase?

src/java.base/share/classes/java/lang/StringLatin1.java line 194:

> 192:         char[] folded2 = null;
> 193:         int k1 = 0, k2 = 0, fk1 = 0, fk2 = 0;
> 194:         while ((k1 < len1 || folded1 != null && fk1 < folded1.length) &&

Several optimizations to the algorithm come to mind here.
For example, many strings will have identical prefixes. Using Arrays.mismatch 
could quickly skip over the identical prefix.
Consider using code points (or a long, packing 4 chars) for the folded 
replacements, to avoid having to step through chars in char arrays.
CaseFolding.foldIfDefined could return the full expansion as a long.
It may also be profitable to use Arrays.mismatch again once the expanded 
characters have been determined to be equal.
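
A minimal sketch of the Arrays.mismatch idea, assuming Latin-1 byte arrays and a simple 1:1 fold as a stand-in for the real lookup. The class and method names are hypothetical and are not taken from the PR; with 1:M foldings the equal-length shortcut below would no longer hold.

```java
import java.util.Arrays;

final class MismatchSketch {
    // Stand-in for the per-character fold: simple 1:1 mapping only,
    // NOT Unicode full case folding.
    static int simpleFold(int c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Compares two Latin-1 byte arrays, letting Arrays.mismatch skip every
    // run of identical bytes and folding only at the positions that differ.
    static boolean equalsSimpleFold(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false; // valid only because the stand-in fold is 1:1
        }
        int from = 0;
        while (from < a.length) {
            // Index of the first differing byte, relative to 'from', or -1.
            int rel = Arrays.mismatch(a, from, a.length, b, from, b.length);
            if (rel < 0) {
                return true; // the remaining bytes are identical
            }
            int i = from + rel;
            if (simpleFold(a[i] & 0xff) != simpleFold(b[i] & 0xff)) {
                return false;
            }
            from = i + 1; // resume scanning after the folded-equal position
        }
        return true;
    }
}
```

Only the positions where the raw bytes differ go through the fold; everything else is handled by the ranged Arrays.mismatch call.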

Take another look at the data structure that stores the foldings and backs the 
foldIfDefined lookup, with a view to improving lookup performance.

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 
230:

> 228:      private static class CaseFoldingEntry {
> 229:         final int cp;
> 230:         final char[] folding;

Consider storing the folding as an int or long directly to avoid the overhead 
of small char arrays.
Arrange it so that the whole replacement can be compared with another code 
point, etc.
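
One way this could look, as a rough sketch only: the folding is packed into a long, 16 bits per char, and looked up by binary search over a sorted key array. All names are hypothetical, the table is a tiny hand-picked excerpt of full case foldings, and the layout (first char in the low 16 bits, 0 meaning "no entry") is an assumption for illustration.

```java
import java.util.Arrays;

final class PackedFolding {
    // Packs up to 4 UTF-16 chars into a long, first char in the low 16 bits.
    static long pack(char... chars) {
        if (chars.length > 4) {
            throw new IllegalArgumentException("at most 4 chars");
        }
        long packed = 0;
        for (int i = chars.length - 1; i >= 0; i--) {
            packed = (packed << 16) | chars[i];
        }
        return packed;
    }

    // Number of occupied 16-bit lanes; assumes U+0000 never occurs in a folding.
    static int length(long packed) {
        return (64 - Long.numberOfLeadingZeros(packed) + 15) >>> 4;
    }

    // The i-th char of the packed folding.
    static char charAt(long packed, int i) {
        return (char) (packed >>> (16 * i));
    }

    // Sorted code points with a full (1:M) folding; excerpt only.
    private static final int[] KEYS = {0x00DF, 0x0130, 0xFB00, 0xFB03};
    private static final long[] FOLDS = {
        pack('s', 's'),              // U+00DF -> "ss"
        pack('i', (char) 0x0307),    // U+0130 -> "i" + U+0307
        pack('f', 'f'),              // U+FB00 -> "ff"
        pack('f', 'f', 'i'),         // U+FB03 -> "ffi"
    };

    // Returns the packed folding for cp, or 0 if this table has no entry.
    static long foldPacked(int cp) {
        int idx = Arrays.binarySearch(KEYS, cp);
        return idx >= 0 ? FOLDS[idx] : 0L;
    }
}
```

With this layout, two expansions compare equal with a single `==` on the longs, and no char[] is allocated per entry.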

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template line 
280:

> 278:         }
> 279: 
> 280:         private void add(CaseFoldingEntry entry) {

CDS can map whole objects/data structures into the heap; consider structuring 
this data so that it can be mapped rather than recomputed at each startup.

-------------

PR Review: https://git.openjdk.org/jdk/pull/27628#pullrequestreview-3312084027
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2412043131
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2412060747
PR Review Comment: https://git.openjdk.org/jdk/pull/27628#discussion_r2412062604
