Historically, the methods currently known as "compareToCI" and "regionMatchesCI", and located in "java.lang.StringUTF16", have lacked support for Supplementary Multilingual Plane code-points. (I've seen no associated bug.)
On July 23, 2020 the first fix for the bug was committed. However, it includes two simple bugs of its own. They're not much more than typos, but they break some things nonetheless, as demonstrated by the unit tests comprising part 2 of this contribution. (Those two bugs: In "StringUTF16.compareToCIImpl", change statements "cp1 -= cp1;" and "cp2 -= cp2;" to, respectively, "cp1 = -cp1;" and "cp2 = -cp2;", and those bugs are history.) The fixed "compareToCI" and "regionMatchesCI" methods in this patch are different, however.[1] Notably: 1. Case-insensitive UTF-16 string comparison and equality-testing is 2.2 to 2.9 times faster than the current methods, depending on data set composition. (So, meaningfully faster, but the degree varies with the strings compared.) 2. 100% TestNG unit test coverage. 3. A utility method, "compareCodePointsIgnoringCase", created for these methods increased their performance by 24%, and could, in the future, be applied easily to "StringLatin1" methods "compareToCI*", and "regionMatchesCI*". (My guess is the performance gain there will be smaller, but still useful.) This patch to "StringUTF16" may appear quite large. For better or worse (your call, gentle reader), the vast majority of it is comments. The code itself is minimal by comparison. Thanks in advance for your consideration, ----Chris Chris W. Johnson chriswjohnson....@gmail.com http://www.panojohnson.com/ Footnote: 1. Work on this code began around 4 years ago, when I failed to find a bug report link on the OpenJDK page, and supposed the message to devs might be something like "contribute or go away". Since then, life, death, work, learning to build & test OpenJDK, registering as a contributor, social anxiety, unit test development, benchmarking, experimentation, and being unable to look at my own code without seeing *something* to improve (especially considering it might end-up in the JDK), are among the reasons for the long delay in attempting this contribution. Index: src/java.base/share/classes/java/lang/StringUTF16.java IDEA additional info: Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP <+>US-ASCII =================================================================== diff --git a/src/java.base/share/classes/java/lang/StringUTF16.java b/src/java.base/share/classes/java/lang/StringUTF16.java --- a/src/java.base/share/classes/java/lang/StringUTF16.java (revision 60819:ee1d592a9f5389725a0338a4b5dfcf4fc3fcf20c) +++ b/src/java.base/share/classes/java/lang/StringUTF16.java (revision 60819+:ee1d592a9f53+) @@ -41,7 +41,32 @@ import static java.lang.String.LATIN1; final class StringUTF16 { - + + /** + * Number of low-order bits storing the payload of Unicode surrogate code-units. + * (The number of surrogate header bits is {@link Character#SIZE} minus this value.) + */ + private static final int SURROGATE_PAYLOAD_BIT_COUNT = 10; + + /** + * The high six bits common to all high-surrogate code-units (bits 15-10) right-shifted into bits + * 5-0 of an "{@code int}" primitive. ({@code 0b1101_10**_****_**** >>> 10 == 0b11_0110 == 0x36}) + */ + private static final int SURROGATE_HI_HEADER_RIGHT_SHIFTED_10 = Character.MIN_HIGH_SURROGATE >>> SURROGATE_PAYLOAD_BIT_COUNT; // == 0x36 + + /** + * Produces a Supplementary Multilingual Plane code-point when added to a valid surrogate-pair + * combined as follows: {@code (highSurrogate << SURROGATE_PAYLOAD_BIT_COUNT) + lowSurrogate}. + * The result is undefined if either would-be surrogate is invalid (not a surrogate code-unit, + * or the wrong surrogate type [low instead of high, or vice versa]). + * + * @see Character#toCodePoint(char, char) + */ + private static final int SURROGATE_COMBO_TO_CODE_POINT_DELTA = Character.MIN_SUPPLEMENTARY_CODE_POINT + - (Character.MIN_HIGH_SURROGATE << SURROGATE_PAYLOAD_BIT_COUNT) + - Character.MIN_LOW_SURROGATE; // -56_613_888 + + public static byte[] newBytesFor(int len) { if (len < 0) { throw new NegativeArraySizeException(); @@ -318,93 +343,429 @@ return -StringLatin1.compareToUTF16(other, value, len2, len1); } - public static int compareToCI(byte[] value, byte[] other) { - return compareToCIImpl(value, 0, length(value), other, 0, length(other)); + /** + * Compares case-insensitively UTF-16 sequences in {@code byte} arrays. Performs + * {@linkplain #compareCodePointsIgnoringCase case conversions equivalent to + * <code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before + * comparing unequal code-points. + * + * @param charsA array of byte pairs representing 16-bit ({@code char}) quantities, + * with byte order mandated by {@code StringUTF16.isBigEndian()} + * @param charsB array of byte pairs representing 16-bit ({@code char}) quantities, + * with byte order mandated by {@code StringUTF16.isBigEndian()} + * + * @return negative if {@code charsA} is lexicographically less than {@code charsB}, + * positive if it is greater, or zero if they are equal + */ + static int compareToCI + ( + final byte[] charsA, + final byte[] charsB + ){ + // Consider: If the G1 garbage collector's string-deduplication mode is enabled, performance may + // be improved by first testing "charsA" and "charsB" for reference equality, and + // returning zero in that case. However, the same is equally true for all invocations + // of "String.compareToIgnoreCase" (not just those operating on two UTF-16 strings), + // so the test belongs in that method (ideally with magic causing the array reference + // equality test to occur only when G1 and its string-deduplication mode are active). + // Similarly modifying methods "String.equalsIgnoreCase" and "String.equals" may be + // even more beneficial, because they already include "String" object reference equality + // fast-paths, unlike "String.compareToIgnoreCase". + // Indeed, string-deduplication makes reference equality more likely between backing- + // store arrays than "String" objects, and therefore the more important fast-path test, + // provided real-world use of those methods frequently involves "String" objects whose + // tenure (number of garbage-collections survived) is >= the deduplication threshold (3 + // by default, as of JDK 15, or the "-XX:StringDeduplicationAgeThreshold" value). + + final int lengthSignum = charsA.length - charsB.length; + + // If the array lengths are identical, return the "compareToCI" result; no additional + // considerations apply. + + if (lengthSignum == 0) + return compareToCI(charsA, 0, charsB, 0, length(charsA)); + + // The array lengths differ. Compare the elements present in both "charsA" and "charsB". + // If the comparison result is inequality, return the comparison result. If the result + // is equality, the shorter array is considered less than the longer array. + + final int compareSignum = compareToCI(charsA, 0, charsB, 0, length(lengthSignum <= 0 ? charsA : charsB)); + + // The right-shift (division by two) applied to "lengthSignum" below is necessary only + // if compatibility is required with code expecting specific return values (instead of + // any negative or positive value) from known comparisons. + + return compareSignum != 0 ? compareSignum : lengthSignum >> 1; } - - private static int compareToCIImpl(byte[] value, int toffset, int tlen, - byte[] other, int ooffset, int olen) { - int tlast = toffset + tlen; - int olast = ooffset + olen; - assert toffset >= 0 && ooffset >= 0; - assert tlast <= length(value); - assert olast <= length(other); - - for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; k1++, k2++) { - int cp1 = (int)getChar(value, k1); - int cp2 = (int)getChar(other, k2); - - if (cp1 == cp2 || compareCodePointCI(cp1, cp2) == 0) { + + /** + * Compares case-insensitively UTF-16 sequences in {@code byte} array subsections. + * Performs {@linkplain #compareCodePointsIgnoringCase case conversions equivalent + * to <code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before + * comparing unequal code-points. + * + * @param charsA array of byte pairs representing 16-bit ({@code char}) quantities, + * with byte order mandated by {@code StringUTF16.isBigEndian()} + * @param charsAIndex index, in 16-bit quantities, at which comparison of "{@code charsA}" + * begins. Caller must enforce constraints "{@code 0 <= charsAIndex}" + * and "{@code 0 <= charsAIndex + totalChars <= charsA.length / 2}". + * @param charsB array of byte pairs representing 16-bit ({@code char}) quantities, + * with byte order mandated by {@code StringUTF16.isBigEndian()} + * @param charsBIndex index, in 16-bit quantities, at which comparison of "{@code charsB}" + * begins. Caller must enforce constraints "{@code 0 <= charsBIndex}" + * and "{@code 0 <= charsBIndex + totalChars <= charsB.length / 2}". + * @param totalChars maximum number of {@code char}s to compare. Caller must enforce + * constraint "{@code 0 <= totalChars}". + * + * @return negative if the compared portion of {@code charsA} is lexicographically less + * than that of {@code charsB}, positive if it is greater, or zero if they are equal + * + * @implNote <p>This method relies on the following "<b>Unicode Empirical Properties</b>":</p> + * <ol> + * <li> + * <p>Only code-points encoded using the same UTF-16 high-surrogate may be case-insensitively equal.</p> + * <p>Several facts suggest sufficient proximity of code-point case-variants within the Supplementary + * Multilingual Plane (SMP):</p> + * <ul> + * <li>Related code-points are grouped into a Unicode "block".</li> + * <li>Blocks start at multiples of 256.</li> + * <li>High-surrogates encode 1,024 code-points starting with multiples of 1,024.</li> + * <li>Few SMP code-points are case-sensitive (450, as of Unicode 13.0.0).</li> + * </ul> + * <p>Thus, it is not surprising this property has held for 25 years, as of February, 2021.</p> + * <li><p>For all SMP code-points, both {@code Character.toLowerCase(Character.toUpperCase(int))} and + * {@code Character.toUpperCase(int)} return a SMP code-point, never a Basic Multilingual Plane (BMP) + * code-point.</p> + * <p>Supporting points:</p> + * <ul> + * <li>This is expected due to related code-point grouping.</li> + * <li>If this property were ever invalidated, every Unicode case-insensitive equality test + * which requires equal-length input sequences (measured in {@code byte}s or {@code char}s) + * would break. In other words, this property is already utilized more-or-less universally by + * Unicode case-insensitive equality tests, whether or not their authors realize it. The + * Unicode Consortium has likely been aware of, and sensitive to, this issue for years. + * </li> + * </ul> + * </li> + * <li><p>For all BMP code-points, both {@code Character.toLowerCase(Character.toUpperCase(int))} and + * {@code Character.toUpperCase(int)} return a BMP code-point, never a SMP code-point.</p> + * <p>Supporting points:</p> + * <ul> + * <li>This is expected due to related code-point grouping.</li> + * <li>Most scripts affected by letter case were added early in Unicode's history, and therefore + * placed in the BMP. (As of Unicode 13.0.0, there are 2,349 case-sensitive code-points in the + * BMP, but only 450 in the SMP.) + * </li> + * <li>If this property were ever invalidated, every Unicode case-insensitive equality test + * which requires equal-length input sequences (measured in {@code byte}s or {@code char}s) + * would break. In other words, this property is already utilized more-or-less universally by + * Unicode case-insensitive equality tests, whether or not their authors realize it. The + * Unicode Consortium has likely been aware of, and sensitive to, this issue for years. + * </li> + * </ul> + * </li> + * </ol> + * <p>Validation of each property is performed automatically by unit tests in {@code CompareIC} + * ("test/jdk/java/lang/String/CompareIC.java") during normal, post-build JDK testing. Those unit tests + * methods are, respectively: + * </p> + * <ol> + * <li>{@code validatePremise_AllSMPCodePointCaseVariantsUseTheSameHighSurrogate}</li> + * <li>{@code validatePremise_AllSMPCodePointCaseVariantsAreSMPCodePoints}</li> + * <li>{@code validatePremise_AllBMPCodePointCaseVariantsAreBMPCodePoints}</li> + * </ol> + * <p>Those tests' Javadocs include discussion of the consequences should they ever fail. + * </p> + */ + private static int compareToCI + ( + final byte[] charsA, int charsAIndex, + final byte[] charsB, int charsBIndex, int totalChars + ){ + // Assertions carried-over from the previous "regionMatchesCI" implementation, with additional + // negative value testing. (Are these assertions still necessary or desirable?) + + assert (charsAIndex | charsBIndex | totalChars) >= 0; // Ensure these parameters are non-negative. + assert charsAIndex + totalChars <= length(charsA); + assert charsBIndex + totalChars <= length(charsB); + + // Variable "priorExactlyEqualChar" (PEEC) is used by the following loop to store the "char" value read + // by its prior iteration, if the same "char" value was read from both inputs ("charsA" and "charsB"). + // Otherwise, it is set to zero. PEEC's value is used only during the UTF-16 decoding attempt triggered + // by reading a pair of unequal low-surrogates (see Unicode empirical property no. 1). It removes from + // this algorithm all read-behind/-ahead operations, along with their extra tests & branches, and + // redundant array accesses. + // Note: PEEC can be zero for three reasons: (1) the loop is in its first iteration, (2) the prior + // iteration read code-point zero from both inputs, or (3) the prior iteration read different, but case- + // insensitively equal, Basic Multilingual Plane code-points. In all three cases, zero signifies absence + // of preceding, identical high-surrogates (as does any non-high-surrogate value). Thus, a zero value's + // cause is ambiguous, but its meaning is clear. + + int priorExactlyEqualChar = 0; // AKA "PEEC". + + while (--totalChars >= 0) { + int cA = getChar(charsA, charsAIndex++); + int cB = getChar(charsB, charsBIndex++); + + if (cA == cB) { + + // If the next iteration's "cA" and "cB" are different low-surrogate code-units, it will need this + // iteration's value of "cA"/"cB" (their presumed high-surrogate code-unit), so it is stored in + // "priorExactlyEqualChar" (PEEC). In all other cases, this caching of "cA"/"cB" in PEEC is a small + // waste of time. However, the time saved by eliminating the tests, branches and reads necessary to + // reacquire pairs of high-surrogates should compensate for many unnecessary writes to PEEC. + + priorExactlyEqualChar = cA; continue; } - - // Check for supplementary characters case - cp1 = codePointIncluding(value, cp1, k1, toffset, tlast); - if (cp1 < 0) { - k1++; - cp1 -= cp1; - } - cp2 = codePointIncluding(other, cp2, k2, ooffset, olast); - if (cp2 < 0) { - k2++; - cp2 -= cp2; - } - - int diff = compareCodePointCI(cp1, cp2); - if (diff != 0) { - return diff; - } - } - return tlen - olen; - } - - // Case insensitive comparison of two code points - private static int compareCodePointCI(int cp1, int cp2) { - // try converting both characters to uppercase. - // If the results match, then the comparison scan should - // continue. - cp1 = Character.toUpperCase(cp1); - cp2 = Character.toUpperCase(cp2); - if (cp1 != cp2) { - // Unfortunately, conversion to uppercase does not work properly - // for the Georgian alphabet, which has strange rules about case - // conversion. So we need to make one last check before - // exiting. - cp1 = Character.toLowerCase(cp1); - cp2 = Character.toLowerCase(cp2); - if (cp1 != cp2) { - return cp1 - cp2; + + // KNOWN: "cA" and "cB" are not precisely equal. + + // When "priorExactlyEqualChar" is a high-surrogate, and both "cA" and "cB" are low-surrogates, the + // following "if" block performs UTF-16 decoding, replacing the code-units in "cA" and "cB" with + // code-points. The "if" block then falls-through to the code-point case-insensitive comparison. + // + // If any UTF-16 decoding precondition is unmet, "cA" and "cB" will contain Basic Multilingual Plane + // (BMP) code-points suitable for case-insensitive comparison, unless UTF-16 encoding is corrupt in + // "charsA" and/or "charsB". The following cases are possible: + // * "priorExactlyEqualChar" is not a high-surrogate, so the "if" block is not entered. + // Possible states of "cA" and "cB": + // * "cA" and "cB" are unequal BMP code-points, potentially case-insensitively equal. + // * "cA" and "cB" are unequal high-surrogates. Case-insensitive equality testing inevitably + // fails for those code-units, terminating this method. Low-surrogate retrieval and UTF-16 + // decoding are unnecessary due to Unicode empirical property no. 1. + // * "cA" and "cB" are unequal low-surrogates. UTF-16 encoding is corrupt in both "charsA" + // and "charsB" (low-surrogates not preceded by high-surrogates). Case-insensitive equality + // testing inevitably fails for those code-units, terminating this method. + // * Either "cA" or "cB" is a low-surrogate, while the other is a high-surrogate or BMP code- + // point. UTF-16 encoding is corrupt in the array supplying the low-surrogate (no preceding + // high-surrogate). Case-insensitive equality testing inevitably fails, terminating this + // method. + // * "priorExactlyEqualChar" is a high-surrogate present in both "charsA" and "charsB". + // Possible states of "cA" and "cB": + // * "cA" and "cB" are unequal low-surrogates. The "if" block is entered, converting "cA" + // and "cB" into Supplementary Multilingual Plane (SMP) code-points, potentially case- + // insensitively equal. + // * "cA" and "cB" are unequal high-surrogates. UTF-16 encoding is corrupt in both "charsA" + // and "charsB" (high-surrogates following high-surrogates). Even if the next loop + // iteration would retrieve low-surrogates, Unicode empirical property no. 1 rules-out + // case-insensitive equality based on the unequal high-surrogates alone. + // * "cA" and "cB" are unequal BMP code-points, potentially case-insensitively equal. UTF-16 + // encoding is corrupt in both "charsA" and "charsB" (high-surrogates not followed by low- + // surrogates), but the corruption is ignored because the high-surrogates were identical + // (equality exists thus far, corrupt encoding notwithstanding). + // * Either "cA" or "cB" is a low-surrogate, while the other is a high-surrogate or BMP code- + // point. UTF-16 encoding is corrupt in the array supplying the non-low-surrogate (high- + // surrogate not followed by low-surrogate). UTF-16 decoding using the lone low-surrogate + // is unnecessary, because either: + // 1. The non-low-surrogate is a high-surrogate, ruling-out case-insensitive equality. + // (Two independent rationales: [1] A high-surrogate is not a low-surrogate, by + // definition. Also by definition, [2] an SMP code-point--decoded from high-surrogate + // "priorExactlyEqualChar" and the lone low-surrogate--is not a high-surrogate code- + // unit.) + // 2. The non-low-surrogate is a BMP code-point. Unicode empirical properties 2 and 3 + // rule-out case-insensitive equality between SMP and BMP code-points. + // (Alternately, if those properties did not exist, deciding equality exists based + // on code-points at different positions within the compared string sections--whose + // comparison requires overlooking an inconvenient preceding code-unit in one string-- + // would be logically dubious on multiple grounds. For instance, if one preceding + // orphan surrogate can be ignored to prove equality, is the same true for two or more + // orphans consecutively and/or separately? If so, existing code breaks. Notably, most + // string equality tests would break due to invalidation of their equal-length strings + // precondition. [A precondition also invalidated if the Unicode Consortium ever + // asserts case-insensitive equality between a BMP and SMP code-point. Thus, Unicode + // empirical properties 2 and 3 will likely persist.]) + // + // (Aside: Of the following three tests, only the test of "priorExactlyEqualChar" would be necessary + // if Unicode [non-compact] "String" objects guaranteed valid UTF-16 encoding. It seems certain + // enforcing that guarantee during creation of every "String" [or string-like object] is infeasible + // due to runtime cost, and backwards-incompatibility. Nonetheless, this writer [CWJ] is tantalized + // by the many code simplifications guaranteed-correct "String" objects would permit, as well as the + // clues they could confer on some developers [including this writer, once upon a time].) + + // Branch Reduction + // + // The "if" statement used below has two tests/branches, instead of three, due to combining two + // low-surrogate tests into one. Multiple comparative benchmarks (each using 300 random data sets + // of 200,000 comparisons, repeated 800 times [160 million comparisons per data set; 48 billion + // total comparisons]), run on two different OS & hardware combinations, yielded respective best- + // case improvements of 1.24% and 0.98%, and mean improvements of 1.79% and 0.62%, relative to + // three-branch "if" statement: + // + // if (priorExactlyEqualChar >>> SURROGATE_PAYLOAD_BIT_COUNT == SURROGATE_HI_HEADER_RIGHT_SHIFTED_10 + // && cA >>> SURROGATE_PAYLOAD_BIT_COUNT == SURROGATE_LO_HEADER_RIGHT_SHIFTED_10 + // && cB >>> SURROGATE_PAYLOAD_BIT_COUNT == SURROGATE_LO_HEADER_RIGHT_SHIFTED_10) + // + // Conversely, combining all three tests into one yields worse performance than the three-test + // "if" shown above, because high-surrogate testing of "priorExactlyEqualChar" fails frequently + // (50% minimum failure rate for well-formed UTF-16), and, when it fails, low-surrogate testing + // of "cA" and "cB" wastes time. + + if (priorExactlyEqualChar >>> SURROGATE_PAYLOAD_BIT_COUNT == SURROGATE_HI_HEADER_RIGHT_SHIFTED_10 + && (cA ^ Character.MIN_LOW_SURROGATE | cB ^ Character.MIN_LOW_SURROGATE) >>> SURROGATE_PAYLOAD_BIT_COUNT == 0 + ){ + // KNOWN: "cA" and "cB" are not precisely equal. + // KNOWN: "priorExactlyEqualChar" is a high-surrogate code-unit. + // KNOWN: "cA" and "cB" are low-surrogate code-units. (Bits 15 to 10 of both are 110111.) + + // Perform UTF-16 decoding, accelerated by repurposing "priorExactlyEqualChar" (PEEC) to store + // the result of decoding operations common to both "cA" and "cB" (because they share the high- + // surrogate in PEEC). Near-total commonality of operations enables decoding two code-points + // using only one more addition operation than decoding a single code-point. (PEEC's repurposing + // is brief; it is always set to zero immediately after this "if" block.) + + priorExactlyEqualChar = (priorExactlyEqualChar << SURROGATE_PAYLOAD_BIT_COUNT) + SURROGATE_COMBO_TO_CODE_POINT_DELTA; + + cA += priorExactlyEqualChar; + cB += priorExactlyEqualChar; } + + // "cA" and "cB" are not precisely equal, therefore set "priorExactlyEqualChar" (PEEC) to zero for + // the next iteration's potential benefit. ("Potential benefit," because the next iteration, if any, + // uses PEEC's value only when its "cA" and "cB" are different low-surrogate code-units.) + + priorExactlyEqualChar = 0; + + // Having performed UTF-16 decoding (if necessary and possible), compare case-insensitively code- + // points "cA" and "cB". + // + // Note: Code-units are also possible. Normal terminating conditions include unequal high-surrogates + // (see Unicode empirical property no. 1), and a high-surrogate/code-point combination. An + // abnormal terminating condition is a low-surrogate and a code-point, meaning at least one + // input string's UTF-16 encoding is corrupt (a low-surrogate not preceded by a high-surrogate, + // or matching high-surrogates not followed--in one string--by a low-surrogate). None of those + // conditions undermines the code below, because "compareCodePointsIgnoringCase" evaluates + // "caseless" inputs unchanged, including code-units. + + cA = compareCodePointsIgnoringCase(cA, cB); + if (cA != 0) + return cA; } + return 0; } - - // Returns a code point from the code unit pointed by "index". If it is - // not a surrogate or an unpaired surrogate, then the code unit is - // returned as is. Otherwise, it is combined with the code unit before - // or after, depending on the type of the surrogate at index, to make a - // supplementary code point. The return value will be negated if the code - // unit pointed by index is a high surrogate, and index + 1 is a low surrogate. - private static int codePointIncluding(byte[] ba, int cp, int index, int start, int end) { - // fast check - if (!Character.isSurrogate((char)cp)) { - return cp; - } - if (Character.isLowSurrogate((char)cp)) { - if (index > start) { - char c = getChar(ba, index - 1); - if (Character.isHighSurrogate(c)) { - return Character.toCodePoint(c, (char)cp); - } - } - } else if (index + 1 < end) { // cp == high surrogate - char c = getChar(ba, index + 1); - if (Character.isLowSurrogate(c)) { - // negate the code point - return - Character.toCodePoint((char)cp, c); - } - } - return cp; + + /** + * Gets the case-insensitive, lexicographical relationship between code-points. + * Though not required, passing this method only unequal code-points ("{@code + * codePointA != codePointB}") is most efficient. + * + * @param codePointA first code-point + * @param codePointB second code-point + * + * @return "{@code codePointA - codePointB}" after both undergo case-conversion equivalent + * to, but faster than, {@code Character.toLowerCase(Character.toUpperCase(codePoint))} + * + * @implNote This method uses two, duplicate {@code switch} expressions, because + * benchmarking showed a noticeable performance improvement over a single {@code + * switch} wrapped in the most minimal of two-iteration loops. The duplication + * should create no maintenance problems, because, as detailed by a comment within + * the method, the {@code switch} source code is programmatically generated when a + * "CompareIC" unit test detects a mishandled "triple-case" code-point. Otherwise, + * modifying these expressions is unnecessary. (The unit test does not alter this, + * or any, file; it merely includes updated {@code switch} source in its error + * message, leaving a human responsible for pasting it here, twice.) + */ + static int compareCodePointsIgnoringCase + ( + final int codePointA, + final int codePointB + ){ + /* A code-point is here dubbed "triple-case" if "codePoint != Character.toUpperCase(codePoint) && + codePoint != Character.toLowerCase(Character.toUpperCase(codePoint))". Such code-points are rare + (see "switch" expressions below), but force most code-point comparisons to invoke both "toUpperCase" + and "toLowerCase". If "triple-case" code-points did not exist, invoking only "toLowerCase" would + suffice. + + This method comes close to realizing the "no triple-case code-points" scenario by directly converting + "triple-case" code-points to their final, lowercase forms using "switch" expressions, and invoking + only "toLowerCase" on all other code-points. The effect is a benchmark performance gain of 24.3%. + + Note: The Georgian alphabet includes no "triple-case" code-points, as shown below by the "progression" + columns of the "switch" expressions' comments. (As of Unicode 13.0.0, there are three Georgian + blocks: "Georgian" 10A0-10FF, "Georgian Extended" 1C90-1CBF, and "Georgian Supplement" 2D00-2D2F.) + + + TESTING AND CODE GENERATION + + A unit test in "CompareIC" ("test/jdk/java/lang/String/CompareIC.java") searches all 1,114,112 Unicode + code-points/-units for triple-case code-points, then validates their handling by "String.equalsIgnoreCase" + and "String.compareToIgnoreCase". If errors are detected, those tests fail with a message detailing each + error and supplying freshly generated source code for the "switch" expression used twice below. (If the + new and existing set of "case" clauses differ, pasting the former over the latter will fix the problem. + If "case" clauses are identical, the problem lies elsewhere.) + */ + // Triple-Case Code-Points, as of Java 15.0.2. (Written by "java.lang.CompareIC.generateTripleCaseCodePointsSwitchExpression".) + // + // Code-Point Progression Code-Point Progression by Name + return switch (codePointA) { // -------------------------- ------------------------------------------------------------------------------------------------------------------------------------- + case '\u00B5' -> '\u03BC'; // U+00B5 -> U+039C -> U+03BC MICRO SIGN -> GREEK CAPITAL LETTER MU -> GREEK SMALL LETTER MU + case '\u0131' -> '\u0069'; // U+0131 -> U+0049 -> U+0069 LATIN SMALL LETTER DOTLESS I -> LATIN CAPITAL LETTER I -> LATIN SMALL LETTER I + case '\u017F' -> '\u0073'; // U+017F -> U+0053 -> U+0073 LATIN SMALL LETTER LONG S -> LATIN CAPITAL LETTER S -> LATIN SMALL LETTER S + case '\u01C5' -> '\u01C6'; // U+01C5 -> U+01C4 -> U+01C6 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON -> LATIN CAPITAL LETTER DZ WITH CARON -> LATIN SMALL LETTER DZ WITH CARON + case '\u01C8' -> '\u01C9'; // U+01C8 -> U+01C7 -> U+01C9 LATIN CAPITAL LETTER L WITH SMALL LETTER J -> LATIN CAPITAL LETTER LJ -> LATIN SMALL LETTER LJ + case '\u01CB' -> '\u01CC'; // U+01CB -> U+01CA -> U+01CC LATIN CAPITAL LETTER N WITH SMALL LETTER J -> LATIN CAPITAL LETTER NJ -> LATIN SMALL LETTER NJ + case '\u01F2' -> '\u01F3'; // U+01F2 -> U+01F1 -> U+01F3 LATIN CAPITAL LETTER D WITH SMALL LETTER Z -> LATIN CAPITAL LETTER DZ -> LATIN SMALL LETTER DZ + case '\u0345' -> '\u03B9'; // U+0345 -> U+0399 -> U+03B9 COMBINING GREEK YPOGEGRAMMENI -> GREEK CAPITAL LETTER IOTA -> GREEK SMALL LETTER IOTA + case '\u03C2' -> '\u03C3'; // U+03C2 -> U+03A3 -> U+03C3 GREEK SMALL LETTER FINAL SIGMA -> GREEK CAPITAL LETTER SIGMA -> GREEK SMALL LETTER SIGMA + case '\u03D0' -> '\u03B2'; // U+03D0 -> U+0392 -> U+03B2 GREEK BETA SYMBOL -> GREEK CAPITAL LETTER BETA -> GREEK SMALL LETTER BETA + case '\u03D1' -> '\u03B8'; // U+03D1 -> U+0398 -> U+03B8 GREEK THETA SYMBOL -> GREEK CAPITAL LETTER THETA -> GREEK SMALL LETTER THETA + case '\u03D5' -> '\u03C6'; // U+03D5 -> U+03A6 -> U+03C6 GREEK PHI SYMBOL -> GREEK CAPITAL LETTER PHI -> GREEK SMALL LETTER PHI + case '\u03D6' -> '\u03C0'; // U+03D6 -> U+03A0 -> U+03C0 GREEK PI SYMBOL -> GREEK CAPITAL LETTER PI -> GREEK SMALL LETTER PI + case '\u03F0' -> '\u03BA'; // U+03F0 -> U+039A -> U+03BA GREEK KAPPA SYMBOL -> GREEK CAPITAL LETTER KAPPA -> GREEK SMALL LETTER KAPPA + case '\u03F1' -> '\u03C1'; // U+03F1 -> U+03A1 -> U+03C1 GREEK RHO SYMBOL -> GREEK CAPITAL LETTER RHO -> GREEK SMALL LETTER RHO + case '\u03F5' -> '\u03B5'; // U+03F5 -> U+0395 -> U+03B5 GREEK LUNATE EPSILON SYMBOL -> GREEK CAPITAL LETTER EPSILON -> GREEK SMALL LETTER EPSILON + case '\u1C80' -> '\u0432'; // U+1C80 -> U+0412 -> U+0432 CYRILLIC SMALL LETTER ROUNDED VE -> CYRILLIC CAPITAL LETTER VE -> CYRILLIC SMALL LETTER VE + case '\u1C81' -> '\u0434'; // U+1C81 -> U+0414 -> U+0434 CYRILLIC SMALL LETTER LONG-LEGGED DE -> CYRILLIC CAPITAL LETTER DE -> CYRILLIC SMALL LETTER DE + case '\u1C82' -> '\u043E'; // U+1C82 -> U+041E -> U+043E CYRILLIC SMALL LETTER NARROW O -> CYRILLIC CAPITAL LETTER O -> CYRILLIC SMALL LETTER O + case '\u1C83' -> '\u0441'; // U+1C83 -> U+0421 -> U+0441 CYRILLIC SMALL LETTER WIDE ES -> CYRILLIC CAPITAL LETTER ES -> CYRILLIC SMALL LETTER ES + case '\u1C84' -> '\u0442'; // U+1C84 -> U+0422 -> U+0442 CYRILLIC SMALL LETTER TALL TE -> CYRILLIC CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE + case '\u1C85' -> '\u0442'; // U+1C85 -> U+0422 -> U+0442 CYRILLIC SMALL LETTER THREE-LEGGED TE -> CYRILLIC CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE + case '\u1C86' -> '\u044A'; // U+1C86 -> U+042A -> U+044A CYRILLIC SMALL LETTER TALL HARD SIGN -> CYRILLIC CAPITAL LETTER HARD SIGN -> CYRILLIC SMALL LETTER HARD SIGN + case '\u1C87' -> '\u0463'; // U+1C87 -> U+0462 -> U+0463 CYRILLIC SMALL LETTER TALL YAT -> CYRILLIC CAPITAL LETTER YAT -> CYRILLIC SMALL LETTER YAT + case '\u1C88' -> '\uA64B'; // U+1C88 -> U+A64A -> U+A64B CYRILLIC SMALL LETTER UNBLENDED UK -> CYRILLIC CAPITAL LETTER MONOGRAPH UK -> CYRILLIC SMALL LETTER MONOGRAPH UK + case '\u1E9B' -> '\u1E61'; // U+1E9B -> U+1E60 -> U+1E61 LATIN SMALL LETTER LONG S WITH DOT ABOVE -> LATIN CAPITAL LETTER S WITH DOT ABOVE -> LATIN SMALL LETTER S WITH DOT ABOVE + case '\u1FBE' -> '\u03B9'; // U+1FBE -> U+0399 -> U+03B9 GREEK PROSGEGRAMMENI -> GREEK CAPITAL LETTER IOTA -> GREEK SMALL LETTER IOTA + // -------------------------- ------------------------------------------------------------------------------------------------------------------------------------- + // All other case-sensitive code-points are either uppercase (in which case they are changed + // below), or lowercase already (in which case they are not). Therefore, only "toLowerCase" + // is necessary. Code-units and case-insensitive code-points are unchanged by "toLowerCase". + default -> Character.toLowerCase(codePointA); + + } - switch (codePointB) { // -------------------------- ------------------------------------------------------------------------------------------------------------------------------------- + case '\u00B5' -> '\u03BC'; // U+00B5 -> U+039C -> U+03BC MICRO SIGN -> GREEK CAPITAL LETTER MU -> GREEK SMALL LETTER MU + case '\u0131' -> '\u0069'; // U+0131 -> U+0049 -> U+0069 LATIN SMALL LETTER DOTLESS I -> LATIN CAPITAL LETTER I -> LATIN SMALL LETTER I + case '\u017F' -> '\u0073'; // U+017F -> U+0053 -> U+0073 LATIN SMALL LETTER LONG S -> LATIN CAPITAL LETTER S -> LATIN SMALL LETTER S + case '\u01C5' -> '\u01C6'; // U+01C5 -> U+01C4 -> U+01C6 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON -> LATIN CAPITAL LETTER DZ WITH CARON -> LATIN SMALL LETTER DZ WITH CARON + case '\u01C8' -> '\u01C9'; // U+01C8 -> U+01C7 -> U+01C9 LATIN CAPITAL LETTER L WITH SMALL LETTER J -> LATIN CAPITAL LETTER LJ -> LATIN SMALL LETTER LJ + case '\u01CB' -> '\u01CC'; // U+01CB -> U+01CA -> U+01CC LATIN CAPITAL LETTER N WITH SMALL LETTER J -> LATIN CAPITAL LETTER NJ -> LATIN SMALL LETTER NJ + case '\u01F2' -> '\u01F3'; // U+01F2 -> U+01F1 -> U+01F3 LATIN CAPITAL LETTER D WITH SMALL LETTER Z -> LATIN CAPITAL LETTER DZ -> LATIN SMALL LETTER DZ + case '\u0345' -> '\u03B9'; // U+0345 -> U+0399 -> U+03B9 COMBINING GREEK YPOGEGRAMMENI -> GREEK CAPITAL LETTER IOTA -> GREEK SMALL LETTER IOTA + case '\u03C2' -> '\u03C3'; // U+03C2 -> U+03A3 -> U+03C3 GREEK SMALL LETTER FINAL SIGMA -> GREEK CAPITAL LETTER SIGMA -> GREEK SMALL LETTER SIGMA + case '\u03D0' -> '\u03B2'; // U+03D0 -> U+0392 -> U+03B2 GREEK BETA SYMBOL -> GREEK CAPITAL LETTER BETA -> GREEK SMALL LETTER BETA + case '\u03D1' -> '\u03B8'; // U+03D1 -> U+0398 -> U+03B8 GREEK THETA SYMBOL -> GREEK CAPITAL LETTER THETA -> GREEK SMALL LETTER THETA + case '\u03D5' -> '\u03C6'; // U+03D5 -> U+03A6 -> U+03C6 GREEK PHI SYMBOL -> GREEK CAPITAL LETTER PHI -> GREEK SMALL LETTER PHI + case '\u03D6' -> '\u03C0'; // U+03D6 -> U+03A0 -> U+03C0 GREEK PI SYMBOL -> GREEK CAPITAL LETTER PI -> GREEK SMALL LETTER PI + case '\u03F0' -> '\u03BA'; // U+03F0 -> U+039A -> U+03BA GREEK KAPPA SYMBOL -> GREEK CAPITAL LETTER KAPPA -> GREEK SMALL LETTER KAPPA + case '\u03F1' -> '\u03C1'; // U+03F1 -> U+03A1 -> U+03C1 GREEK RHO SYMBOL -> GREEK CAPITAL LETTER RHO -> GREEK SMALL LETTER RHO + case '\u03F5' -> '\u03B5'; // U+03F5 -> U+0395 -> U+03B5 GREEK LUNATE EPSILON SYMBOL -> GREEK CAPITAL LETTER EPSILON -> GREEK SMALL LETTER EPSILON + case '\u1C80' -> '\u0432'; // U+1C80 -> U+0412 -> U+0432 CYRILLIC SMALL LETTER ROUNDED VE -> CYRILLIC CAPITAL LETTER VE -> CYRILLIC SMALL LETTER VE + case '\u1C81' -> '\u0434'; // U+1C81 -> U+0414 -> U+0434 CYRILLIC SMALL LETTER LONG-LEGGED DE -> CYRILLIC CAPITAL LETTER DE -> CYRILLIC SMALL LETTER DE + case '\u1C82' -> '\u043E'; // U+1C82 -> U+041E -> U+043E CYRILLIC SMALL LETTER NARROW O -> CYRILLIC CAPITAL LETTER O -> CYRILLIC SMALL LETTER O + case '\u1C83' -> '\u0441'; // U+1C83 -> U+0421 -> U+0441 CYRILLIC SMALL LETTER WIDE ES -> CYRILLIC CAPITAL LETTER ES -> CYRILLIC SMALL LETTER ES + case '\u1C84' -> '\u0442'; // U+1C84 -> U+0422 -> U+0442 CYRILLIC SMALL LETTER TALL TE -> CYRILLIC CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE + case '\u1C85' -> '\u0442'; // U+1C85 -> U+0422 -> U+0442 CYRILLIC SMALL LETTER THREE-LEGGED TE -> CYRILLIC CAPITAL LETTER TE -> CYRILLIC SMALL LETTER TE + case '\u1C86' -> '\u044A'; // U+1C86 -> U+042A -> U+044A CYRILLIC SMALL LETTER TALL HARD SIGN -> CYRILLIC CAPITAL LETTER HARD SIGN -> CYRILLIC SMALL LETTER HARD SIGN + case '\u1C87' -> '\u0463'; // U+1C87 -> U+0462 -> U+0463 CYRILLIC SMALL LETTER TALL YAT -> CYRILLIC CAPITAL LETTER YAT -> CYRILLIC SMALL LETTER YAT + case '\u1C88' -> '\uA64B'; // U+1C88 -> U+A64A -> U+A64B CYRILLIC SMALL LETTER UNBLENDED UK -> CYRILLIC CAPITAL LETTER MONOGRAPH UK -> CYRILLIC SMALL LETTER MONOGRAPH UK + case '\u1E9B' -> '\u1E61'; // U+1E9B -> U+1E60 -> U+1E61 LATIN SMALL LETTER LONG S WITH DOT ABOVE -> LATIN CAPITAL LETTER S WITH DOT ABOVE -> LATIN SMALL LETTER S WITH DOT ABOVE + case '\u1FBE' -> '\u03B9'; // U+1FBE -> U+0399 -> U+03B9 GREEK PROSGEGRAMMENI -> GREEK CAPITAL LETTER IOTA -> GREEK SMALL LETTER IOTA + // -------------------------- ------------------------------------------------------------------------------------------------------------------------------------- + // All other case-sensitive code-points are either uppercase (in which case they are changed + // below), or lowercase already (in which case they are not). Therefore, only "toLowerCase" + // is necessary. Code-units and case-insensitive code-points are unchanged by "toLowerCase". + default -> Character.toLowerCase(codePointB); + }; } public static int compareToCI_Latin1(byte[] value, byte[] other) { @@ -780,10 +1141,35 @@ } return new String(result, UTF16); } - - public static boolean regionMatchesCI(byte[] value, int toffset, - byte[] other, int ooffset, int len) { - return compareToCIImpl(value, toffset, len, other, ooffset, len) == 0; + + /** + * Tests case-insensitive equality of UTF-16 sequences in {@code byte} array subsections. + * Performs {@linkplain #compareCodePointsIgnoringCase case conversions equivalent to + * <code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before comparing + * unequal code-points. + * + * @param charsA array of byte pairs representing 16-bit ({@code char}) quantities, + * with byte order mandated by {@code StringUTF16.isBigEndian()} + * @param charsAIndex index, in 16-bit quantities, at which comparison of "{@code charsA}" + * begins. Caller must enforce constraints "{@code 0 <= charsAIndex}" + * and "{@code 0 <= charsAIndex + totalChars <= charsA.length / 2}". + * @param charsB array of byte pairs representing 16-bit ({@code char}) quantities, + * with byte order mandated by {@code StringUTF16.isBigEndian()} + * @param charsBIndex index, in 16-bit quantities, at which comparison of "{@code charsB}" + * begins. Caller must enforce constraints "{@code 0 <= charsBIndex}" + * and "{@code 0 <= charsBIndex + totalChars <= charsB.length / 2}". + * @param totalChars maximum number of {@code char}s to compare. Caller must enforce + * constraint "{@code 0 <= totalChars}". + * + * @return {@code true} if the tested portions of {@code charsA} and {@code charsB} are + * case-insensitively equal, otherwise {@code false} + */ + static boolean regionMatchesCI + ( + final byte[] charsA, final int charsAIndex, + final byte[] charsB, final int charsBIndex, final int totalChars + ){ + return compareToCI(charsA, charsAIndex, charsB, charsBIndex, totalChars) == 0; } public static boolean regionMatchesCI_Latin1(byte[] value, int toffset,