Proposed: StringUTF16 bug fix with optimization - Part 1 of 2, StringUTF16 Patch

Chris Johnson Tue, 30 Mar 2021 16:44:58 -0700

Historically, the methods currently known as "compareToCI" and
"regionMatchesCI", and located in "java.lang.StringUTF16", have lacked
support for Supplementary Multilingual Plane code-points. (I've seen no
associated bug.)


On July 23, 2020 the first fix for the bug was committed. However, it
includes two simple bugs of its own. They're not much more than typos,
but they break some things nonetheless, as demonstrated by the unit
tests comprising part 2 of this contribution.

(Those two bugs: In "StringUTF16.compareToCIImpl", change statements
"cp1 -= cp1;" and "cp2 -= cp2;" to, respectively, "cp1 = -cp1;" and
"cp2 = -cp2;", and those bugs are history.)

The fixed "compareToCI" and "regionMatchesCI" methods in this patch are
different, however.[1] Notably:

  1. Case-insensitive UTF-16 string comparison and equality-testing is
     2.2 to 2.9 times faster than the current methods, depending on data
     set composition. (So, meaningfully faster, but the degree varies
     with the strings compared.)

  2. 100% TestNG unit test coverage.

  3. A utility method, "compareCodePointsIgnoringCase", created for these
     methods increased their performance by 24%, and could, in the future,
     be applied easily to "StringLatin1" methods "compareToCI*", and
     "regionMatchesCI*". (My guess is the performance gain there will be
     smaller, but still useful.)

This patch to "StringUTF16" may appear quite large. For better or worse
(your call, gentle reader), the vast majority of it is comments. The code
itself is minimal by comparison.

Thanks in advance for your consideration,

----Chris

Chris W. Johnson
[email protected]
http://www.panojohnson.com/


Footnote:

1. Work on this code began around 4 years ago, when I failed to find a
   bug report link on the OpenJDK page, and supposed the message to devs
   might be something like "contribute or go away". Since then, life,
   death, work, learning to build & test OpenJDK, registering as a
   contributor, social anxiety, unit test development, benchmarking,
   experimentation, and being unable to look at my own code without
   seeing *something* to improve (especially considering it might end-up
   in the JDK), are among the reasons for the long delay in attempting
   this contribution.






Index: src/java.base/share/classes/java/lang/StringUTF16.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>US-ASCII
===================================================================
diff --git a/src/java.base/share/classes/java/lang/StringUTF16.java
b/src/java.base/share/classes/java/lang/StringUTF16.java
--- a/src/java.base/share/classes/java/lang/StringUTF16.java    (revision
60819:ee1d592a9f5389725a0338a4b5dfcf4fc3fcf20c)
+++ b/src/java.base/share/classes/java/lang/StringUTF16.java    (revision
60819+:ee1d592a9f53+)
@@ -41,7 +41,32 @@
 import static java.lang.String.LATIN1;

 final class StringUTF16 {
-
+
+    /**
+     * Number of low-order bits storing the payload of Unicode surrogate
code-units.
+     * (The number of surrogate header bits is {@link Character#SIZE}
minus this value.)
+     */
+    private static final int    SURROGATE_PAYLOAD_BIT_COUNT = 10;
+
+    /**
+     * The high six bits common to all high-surrogate code-units (bits
15-10) right-shifted into bits
+     * 5-0 of an "{@code int}" primitive. ({@code 0b1101_10**_****_****
>>> 10 == 0b11_0110 == 0x36})
+     */
+    private static final int    SURROGATE_HI_HEADER_RIGHT_SHIFTED_10 =
Character.MIN_HIGH_SURROGATE >>> SURROGATE_PAYLOAD_BIT_COUNT;  // == 0x36
+
+    /**
+     * Produces a Supplementary Multilingual Plane code-point when added
to a valid surrogate-pair
+     * combined as follows: {@code (highSurrogate <<
SURROGATE_PAYLOAD_BIT_COUNT) + lowSurrogate}.
+     * The result is undefined if either would-be surrogate is invalid
(not a surrogate code-unit,
+     * or the wrong surrogate type [low instead of high, or vice versa]).
+     *
+     * @see Character#toCodePoint(char, char)
+     */
+    private static final int    SURROGATE_COMBO_TO_CODE_POINT_DELTA =
 Character.MIN_SUPPLEMENTARY_CODE_POINT
+                                                                    -
(Character.MIN_HIGH_SURROGATE << SURROGATE_PAYLOAD_BIT_COUNT)
+                                                                    -
 Character.MIN_LOW_SURROGATE;  // -56_613_888
+
+
     public static byte[] newBytesFor(int len) {
         if (len < 0) {
             throw new NegativeArraySizeException();
@@ -318,93 +343,429 @@
         return -StringLatin1.compareToUTF16(other, value, len2, len1);
     }

-    public static int compareToCI(byte[] value, byte[] other) {
-        return compareToCIImpl(value, 0, length(value), other, 0,
length(other));
+    /**
+     * Compares case-insensitively UTF-16 sequences in {@code byte}
arrays. Performs
+     * {@linkplain #compareCodePointsIgnoringCase case conversions
equivalent to
+     *
<code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before
+     * comparing unequal code-points.
+     *
+     * @param charsA array of byte pairs representing 16-bit ({@code
char}) quantities,
+     *               with byte order mandated by {@code
StringUTF16.isBigEndian()}
+     * @param charsB array of byte pairs representing 16-bit ({@code
char}) quantities,
+     *               with byte order mandated by {@code
StringUTF16.isBigEndian()}
+     *
+     * @return negative if {@code charsA} is lexicographically less than
{@code charsB},
+     * positive if it is greater, or zero if they are equal
+     */
+    static int compareToCI
+    (
+        final byte[]            charsA,
+        final byte[]            charsB
+    ){
+    //  Consider: If the G1 garbage collector's string-deduplication mode
is enabled, performance may
+    //            be improved by first testing "charsA" and "charsB" for
reference equality, and
+    //            returning zero in that case. However, the same is
equally true for all invocations
+    //            of "String.compareToIgnoreCase" (not just those
operating on two UTF-16 strings),
+    //            so the test belongs in that method (ideally with magic
causing the array reference
+    //            equality test to occur only when G1 and its
string-deduplication mode are active).
+    //                Similarly modifying methods
"String.equalsIgnoreCase" and "String.equals" may be
+    //            even more beneficial, because they already include
"String" object reference equality
+    //            fast-paths, unlike "String.compareToIgnoreCase".
+    //                Indeed, string-deduplication makes reference
equality more likely between backing-
+    //            store arrays than "String" objects, and therefore the
more important fast-path test,
+    //            provided real-world use of those methods frequently
involves "String" objects whose
+    //            tenure (number of garbage-collections survived) is >=
the deduplication threshold (3
+    //            by default, as of JDK 15, or the
"-XX:StringDeduplicationAgeThreshold" value).
+
+        final int               lengthSignum = charsA.length -
charsB.length;
+
+    //  If the array lengths are identical, return the "compareToCI"
result; no additional
+    //  considerations apply.
+
+        if (lengthSignum == 0)
+            return compareToCI(charsA, 0, charsB, 0, length(charsA));
+
+    //  The array lengths differ. Compare the elements present in both
"charsA" and "charsB".
+    //  If the comparison result is inequality, return the comparison
result. If the result
+    //  is equality, the shorter array is considered less than the longer
array.
+
+        final int               compareSignum = compareToCI(charsA, 0,
charsB, 0, length(lengthSignum <= 0 ? charsA : charsB));
+
+    //  The right-shift (division by two) applied to "lengthSignum" below
is necessary only
+    //  if compatibility is required with code expecting specific return
values (instead of
+    //  any negative or positive value) from known comparisons.
+
+        return compareSignum != 0 ? compareSignum : lengthSignum >> 1;
     }
-
-    private static int compareToCIImpl(byte[] value, int toffset, int tlen,
-                                      byte[] other, int ooffset, int olen)
{
-        int tlast = toffset + tlen;
-        int olast = ooffset + olen;
-        assert toffset >= 0 && ooffset >= 0;
-        assert tlast <= length(value);
-        assert olast <= length(other);
-
-        for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast;
k1++, k2++) {
-            int cp1 = (int)getChar(value, k1);
-            int cp2 = (int)getChar(other, k2);
-
-            if (cp1 == cp2 || compareCodePointCI(cp1, cp2) == 0) {
+
+    /**
+     * Compares case-insensitively UTF-16 sequences in {@code byte} array
subsections.
+     * Performs {@linkplain #compareCodePointsIgnoringCase case
conversions equivalent
+     * to
<code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>} before
+     * comparing unequal code-points.
+     *
+     * @param charsA      array of byte pairs representing 16-bit ({@code
char}) quantities,
+     *                    with byte order mandated by {@code
StringUTF16.isBigEndian()}
+     * @param charsAIndex index, in 16-bit quantities, at which comparison
of "{@code charsA}"
+     *                    begins. Caller must enforce constraints "{@code
0 <= charsAIndex}"
+     *                    and "{@code 0 <= charsAIndex + totalChars <=
charsA.length / 2}".
+     * @param charsB      array of byte pairs representing 16-bit ({@code
char}) quantities,
+     *                    with byte order mandated by {@code
StringUTF16.isBigEndian()}
+     * @param charsBIndex index, in 16-bit quantities, at which comparison
of "{@code charsB}"
+     *                    begins. Caller must enforce constraints "{@code
0 <= charsBIndex}"
+     *                    and "{@code 0 <= charsBIndex + totalChars <=
charsB.length / 2}".
+     * @param totalChars  maximum number of {@code char}s to compare.
Caller must enforce
+     *                    constraint "{@code 0 <= totalChars}".
+     *
+     * @return negative if the compared portion of {@code charsA} is
lexicographically less
+     * than that of {@code charsB}, positive if it is greater, or zero if
they are equal
+     *
+     * @implNote <p>This method relies on the following "<b>Unicode
Empirical Properties</b>":</p>
+     * <ol>
+     *     <li>
+     *         <p>Only code-points encoded using the same UTF-16
high-surrogate may be case-insensitively equal.</p>
+     *         <p>Several facts suggest sufficient proximity of code-point
case-variants within the Supplementary
+     *         Multilingual Plane (SMP):</p>
+     *         <ul>
+     *             <li>Related code-points are grouped into a Unicode
"block".</li>
+     *             <li>Blocks start at multiples of 256.</li>
+     *             <li>High-surrogates encode 1,024 code-points starting
with multiples of 1,024.</li>
+     *             <li>Few SMP code-points are case-sensitive (450, as of
Unicode 13.0.0).</li>
+     *         </ul>
+     *         <p>Thus, it is not surprising this property has held for 25
years, as of February, 2021.</p>
+     *     <li><p>For all SMP code-points, both {@code
Character.toLowerCase(Character.toUpperCase(int))} and
+     *         {@code Character.toUpperCase(int)} return a SMP code-point,
never a Basic Multilingual Plane (BMP)
+     *         code-point.</p>
+     *         <p>Supporting points:</p>
+     *         <ul>
+     *             <li>This is expected due to related code-point
grouping.</li>
+     *             <li>If this property were ever invalidated, every
Unicode case-insensitive equality test
+     *             which requires equal-length input sequences (measured
in {@code byte}s or {@code char}s)
+     *             would break. In other words, this property is already
utilized more-or-less universally by
+     *             Unicode case-insensitive equality tests, whether or not
their authors realize it. The
+     *             Unicode Consortium has likely been aware of, and
sensitive to, this issue for years.
+     *             </li>
+     *         </ul>
+     *     </li>
+     *     <li><p>For all BMP code-points, both {@code
Character.toLowerCase(Character.toUpperCase(int))} and
+     *         {@code Character.toUpperCase(int)} return a BMP code-point,
never a SMP code-point.</p>
+     *         <p>Supporting points:</p>
+     *         <ul>
+     *             <li>This is expected due to related code-point
grouping.</li>
+     *             <li>Most scripts affected by letter case were added
early in Unicode's history, and therefore
+     *             placed in the BMP. (As of Unicode 13.0.0, there are
2,349 case-sensitive code-points in the
+     *             BMP, but only 450 in the SMP.)
+     *             </li>
+     *             <li>If this property were ever invalidated, every
Unicode case-insensitive equality test
+     *             which requires equal-length input sequences (measured
in {@code byte}s or {@code char}s)
+     *             would break. In other words, this property is already
utilized more-or-less universally by
+     *             Unicode case-insensitive equality tests, whether or not
their authors realize it. The
+     *             Unicode Consortium has likely been aware of, and
sensitive to, this issue for years.
+     *             </li>
+     *         </ul>
+     *     </li>
+     * </ol>
+     * <p>Validation of each property is performed automatically by unit
tests in {@code CompareIC}
+     * ("test/jdk/java/lang/String/CompareIC.java") during normal,
post-build JDK testing. Those unit tests
+     * methods are, respectively:
+     * </p>
+     * <ol>
+     *     <li>{@code
validatePremise_AllSMPCodePointCaseVariantsUseTheSameHighSurrogate}</li>
+     *     <li>{@code
validatePremise_AllSMPCodePointCaseVariantsAreSMPCodePoints}</li>
+     *     <li>{@code
validatePremise_AllBMPCodePointCaseVariantsAreBMPCodePoints}</li>
+     * </ol>
+     * <p>Those tests' Javadocs include discussion of the consequences
should they ever fail.
+     * </p>
+     */
+    private static int compareToCI
+    (
+        final byte[] charsA, int charsAIndex,
+        final byte[] charsB, int charsBIndex, int totalChars
+    ){
+    //  Assertions carried-over from the previous "regionMatchesCI"
implementation, with additional
+    //  negative value testing. (Are these assertions still necessary or
desirable?)
+
+        assert (charsAIndex | charsBIndex | totalChars) >= 0;  // Ensure
these parameters are non-negative.
+        assert  charsAIndex + totalChars <= length(charsA);
+        assert  charsBIndex + totalChars <= length(charsB);
+
+    //  Variable "priorExactlyEqualChar" (PEEC) is used by the following
loop to store the "char" value read
+    //  by its prior iteration, if the same "char" value was read from
both inputs ("charsA" and "charsB").
+    //  Otherwise, it is set to zero. PEEC's value is used only during the
UTF-16 decoding attempt triggered
+    //  by reading a pair of unequal low-surrogates (see Unicode empirical
property no. 1). It removes from
+    //  this algorithm all read-behind/-ahead operations, along with their
extra tests & branches, and
+    //  redundant array accesses.
+    //      Note: PEEC can be zero for three reasons: (1) the loop is in
its first iteration, (2) the prior
+    //  iteration read code-point zero from both inputs, or (3) the prior
iteration read different, but case-
+    //  insensitively equal, Basic Multilingual Plane code-points. In all
three cases, zero signifies absence
+    //  of preceding, identical high-surrogates (as does any
non-high-surrogate value). Thus, a zero value's
+    //  cause is ambiguous, but its meaning is clear.
+
+        int                     priorExactlyEqualChar = 0;  // AKA "PEEC".
+
+        while (--totalChars >= 0) {
+            int                     cA = getChar(charsA, charsAIndex++);
+            int                     cB = getChar(charsB, charsBIndex++);
+
+            if (cA == cB) {
+
+            //  If the next iteration's "cA" and "cB" are different
low-surrogate code-units, it will need this
+            //  iteration's value of "cA"/"cB" (their presumed
high-surrogate code-unit), so it is stored in
+            //  "priorExactlyEqualChar" (PEEC). In all other cases, this
caching of "cA"/"cB" in PEEC is a small
+            //  waste of time. However, the time saved by eliminating the
tests, branches and reads necessary to
+            //  reacquire pairs of high-surrogates should compensate for
many unnecessary writes to PEEC.
+
+                priorExactlyEqualChar = cA;
                 continue;
             }
-
-            // Check for supplementary characters case
-            cp1 = codePointIncluding(value, cp1, k1, toffset, tlast);
-            if (cp1 < 0) {
-                k1++;
-                cp1 -= cp1;
-            }
-            cp2 = codePointIncluding(other, cp2, k2, ooffset, olast);
-            if (cp2 < 0) {
-                k2++;
-                cp2 -= cp2;
-            }
-
-            int diff = compareCodePointCI(cp1, cp2);
-            if (diff != 0) {
-                return diff;
-            }
-        }
-        return tlen - olen;
-    }
-
-    // Case insensitive comparison of two code points
-    private static int compareCodePointCI(int cp1, int cp2) {
-        // try converting both characters to uppercase.
-        // If the results match, then the comparison scan should
-        // continue.
-        cp1 = Character.toUpperCase(cp1);
-        cp2 = Character.toUpperCase(cp2);
-        if (cp1 != cp2) {
-            // Unfortunately, conversion to uppercase does not work
properly
-            // for the Georgian alphabet, which has strange rules about
case
-            // conversion.  So we need to make one last check before
-            // exiting.
-            cp1 = Character.toLowerCase(cp1);
-            cp2 = Character.toLowerCase(cp2);
-            if (cp1 != cp2) {
-                return cp1 - cp2;
+
+        //  KNOWN: "cA" and "cB" are not precisely equal.
+
+        //  When "priorExactlyEqualChar" is a high-surrogate, and both
"cA" and "cB" are low-surrogates, the
+        //  following "if" block performs UTF-16 decoding, replacing the
code-units in "cA" and "cB" with
+        //  code-points. The "if" block then falls-through to the
code-point case-insensitive comparison.
+        //
+        //  If any UTF-16 decoding precondition is unmet, "cA" and "cB"
will contain Basic Multilingual Plane
+        //  (BMP) code-points suitable for case-insensitive comparison,
unless UTF-16 encoding is corrupt in
+        //  "charsA" and/or "charsB". The following cases are possible:
+        //      * "priorExactlyEqualChar" is not a high-surrogate, so the
"if" block is not entered.
+        //         Possible states of "cA" and "cB":
+        //          * "cA" and "cB" are unequal BMP code-points,
potentially case-insensitively equal.
+        //          * "cA" and "cB" are unequal high-surrogates.
Case-insensitive equality testing inevitably
+        //            fails for those code-units, terminating this method.
Low-surrogate retrieval and UTF-16
+        //            decoding are unnecessary due to Unicode empirical
property no. 1.
+        //          * "cA" and "cB" are unequal low-surrogates. UTF-16
encoding is corrupt in both "charsA"
+        //            and "charsB" (low-surrogates not preceded by
high-surrogates). Case-insensitive equality
+        //            testing inevitably fails for those code-units,
terminating this method.
+        //          * Either "cA" or "cB" is a low-surrogate, while the
other is a high-surrogate or BMP code-
+        //            point. UTF-16 encoding is corrupt in the array
supplying the low-surrogate (no preceding
+        //            high-surrogate). Case-insensitive equality testing
inevitably fails, terminating this
+        //            method.
+        //      * "priorExactlyEqualChar" is a high-surrogate present in
both "charsA" and "charsB".
+        //        Possible states of "cA" and "cB":
+        //          * "cA" and "cB" are unequal low-surrogates. The "if"
block is entered, converting "cA"
+        //            and "cB" into Supplementary Multilingual Plane (SMP)
code-points, potentially case-
+        //            insensitively equal.
+        //          * "cA" and "cB" are unequal high-surrogates. UTF-16
encoding is corrupt in both "charsA"
+        //            and "charsB" (high-surrogates following
high-surrogates). Even if the next loop
+        //            iteration would retrieve low-surrogates, Unicode
empirical property no. 1 rules-out
+        //            case-insensitive equality based on the unequal
high-surrogates alone.
+        //          * "cA" and "cB" are unequal BMP code-points,
potentially case-insensitively equal. UTF-16
+        //            encoding is corrupt in both "charsA" and "charsB"
(high-surrogates not followed by low-
+        //            surrogates), but the corruption is ignored because
the high-surrogates were identical
+        //            (equality exists thus far, corrupt encoding
notwithstanding).
+        //          * Either "cA" or "cB" is a low-surrogate, while the
other is a high-surrogate or BMP code-
+        //            point. UTF-16 encoding is corrupt in the array
supplying the non-low-surrogate (high-
+        //            surrogate not followed by low-surrogate). UTF-16
decoding using the lone low-surrogate
+        //            is unnecessary, because either:
+        //              1. The non-low-surrogate is a high-surrogate,
ruling-out case-insensitive equality.
+        //                 (Two independent rationales: [1] A
high-surrogate is not a low-surrogate, by
+        //                 definition. Also by definition, [2] an SMP
code-point--decoded from high-surrogate
+        //                 "priorExactlyEqualChar" and the lone
low-surrogate--is not a high-surrogate code-
+        //                 unit.)
+        //              2. The non-low-surrogate is a BMP code-point.
Unicode empirical properties 2 and 3
+        //                 rule-out case-insensitive equality between SMP
and BMP code-points.
+        //                     (Alternately, if those properties did not
exist, deciding equality exists based
+        //                 on code-points at different positions within
the compared string sections--whose
+        //                 comparison requires overlooking an inconvenient
preceding code-unit in one string--
+        //                 would be logically dubious on multiple grounds.
For instance, if one preceding
+        //                 orphan surrogate can be ignored to prove
equality, is the same true for two or more
+        //                 orphans consecutively and/or separately? If so,
existing code breaks. Notably, most
+        //                 string equality tests would break due to
invalidation of their equal-length strings
+        //                 precondition. [A precondition also invalidated
if the Unicode Consortium ever
+        //                 asserts case-insensitive equality between a BMP
and SMP code-point. Thus, Unicode
+        //                 empirical properties 2 and 3 will likely
persist.])
+        //
+        //  (Aside: Of the following three tests, only the test of
"priorExactlyEqualChar" would be necessary
+        //  if Unicode [non-compact] "String" objects guaranteed valid
UTF-16 encoding. It seems certain
+        //  enforcing that guarantee during creation of every "String" [or
string-like object] is infeasible
+        //  due to runtime cost, and backwards-incompatibility.
Nonetheless, this writer [CWJ] is tantalized
+        //  by the many code simplifications guaranteed-correct "String"
objects would permit, as well as the
+        //  clues they could confer on some developers [including this
writer, once upon a time].)
+
+        //  Branch Reduction
+        //
+        //  The "if" statement used below has two tests/branches, instead
of three, due to combining two
+        //  low-surrogate tests into one. Multiple comparative benchmarks
(each using 300 random data sets
+        //  of 200,000 comparisons, repeated 800 times [160 million
comparisons per data set; 48 billion
+        //  total comparisons]), run on two different OS & hardware
combinations, yielded respective best-
+        //  case improvements of 1.24% and 0.98%, and mean improvements of
1.79% and 0.62%, relative to
+        //  three-branch "if" statement:
+        //
+        //      if (priorExactlyEqualChar >>> SURROGATE_PAYLOAD_BIT_COUNT
== SURROGATE_HI_HEADER_RIGHT_SHIFTED_10
+        //      &&  cA                    >>> SURROGATE_PAYLOAD_BIT_COUNT
== SURROGATE_LO_HEADER_RIGHT_SHIFTED_10
+        //      &&  cB                    >>> SURROGATE_PAYLOAD_BIT_COUNT
== SURROGATE_LO_HEADER_RIGHT_SHIFTED_10)
+        //
+        //  Conversely, combining all three tests into one yields worse
performance than the three-test
+        //  "if" shown above, because high-surrogate testing of
"priorExactlyEqualChar" fails frequently
+        //  (50% minimum failure rate for well-formed UTF-16), and, when
it fails, low-surrogate testing
+        //  of "cA" and "cB" wastes time.
+
+            if (priorExactlyEqualChar >>> SURROGATE_PAYLOAD_BIT_COUNT ==
SURROGATE_HI_HEADER_RIGHT_SHIFTED_10
+            && (cA ^ Character.MIN_LOW_SURROGATE | cB ^
Character.MIN_LOW_SURROGATE) >>> SURROGATE_PAYLOAD_BIT_COUNT == 0
+            ){
+            //  KNOWN: "cA" and "cB" are not precisely equal.
+            //  KNOWN: "priorExactlyEqualChar" is a high-surrogate
code-unit.
+            //  KNOWN: "cA" and "cB" are low-surrogate code-units. (Bits
15 to 10 of both are 110111.)
+
+            //  Perform UTF-16 decoding, accelerated by repurposing
"priorExactlyEqualChar" (PEEC) to store
+            //  the result of decoding operations common to both "cA" and
"cB" (because they share the high-
+            //  surrogate in PEEC). Near-total commonality of operations
enables decoding two code-points
+            //  using only one more addition operation than decoding a
single code-point. (PEEC's repurposing
+            //  is brief; it is always set to zero immediately after this
"if" block.)
+
+                priorExactlyEqualChar = (priorExactlyEqualChar <<
SURROGATE_PAYLOAD_BIT_COUNT) + SURROGATE_COMBO_TO_CODE_POINT_DELTA;
+
+                cA += priorExactlyEqualChar;
+                cB += priorExactlyEqualChar;
             }
+
+        //  "cA" and "cB" are not precisely equal, therefore set
"priorExactlyEqualChar" (PEEC) to zero for
+        //  the next iteration's potential benefit. ("Potential benefit,"
because the next iteration, if any,
+        //  uses PEEC's value only when its "cA" and "cB" are different
low-surrogate code-units.)
+
+            priorExactlyEqualChar = 0;
+
+        //  Having performed UTF-16 decoding (if necessary and possible),
compare case-insensitively code-
+        //  points "cA" and "cB".
+        //
+        //  Note: Code-units are also possible. Normal terminating
conditions include unequal high-surrogates
+        //        (see Unicode empirical property no. 1), and a
high-surrogate/code-point combination. An
+        //        abnormal terminating condition is a low-surrogate and a
code-point, meaning at least one
+        //        input string's UTF-16 encoding is corrupt (a
low-surrogate not preceded by a high-surrogate,
+        //        or matching high-surrogates not followed--in one
string--by a low-surrogate). None of those
+        //        conditions undermines the code below, because
"compareCodePointsIgnoringCase" evaluates
+        //        "caseless" inputs unchanged, including code-units.
+
+            cA = compareCodePointsIgnoringCase(cA, cB);
+            if (cA != 0)
+                return cA;
         }
+
         return 0;
     }
-
-    // Returns a code point from the code unit pointed by "index". If it is
-    // not a surrogate or an unpaired surrogate, then the code unit is
-    // returned as is. Otherwise, it is combined with the code unit before
-    // or after, depending on the type of the surrogate at index, to make a
-    // supplementary code point. The return value will be negated if the
code
-    // unit pointed by index is a high surrogate, and index + 1 is a low
surrogate.
-    private static int codePointIncluding(byte[] ba, int cp, int index,
int start, int end) {
-        // fast check
-        if (!Character.isSurrogate((char)cp)) {
-            return cp;
-        }
-        if (Character.isLowSurrogate((char)cp)) {
-            if (index > start) {
-                char c = getChar(ba, index - 1);
-                if (Character.isHighSurrogate(c)) {
-                    return Character.toCodePoint(c, (char)cp);
-                }
-            }
-        } else if (index + 1 < end) { // cp == high surrogate
-            char c = getChar(ba, index + 1);
-            if (Character.isLowSurrogate(c)) {
-                // negate the code point
-                return - Character.toCodePoint((char)cp, c);
-            }
-        }
-        return cp;
+
+    /**
+     * Gets the case-insensitive, lexicographical relationship between
code-points.
+     * Though not required, passing this method only unequal code-points
("{@code
+     * codePointA != codePointB}") is most efficient.
+     *
+     * @param codePointA first code-point
+     * @param codePointB second code-point
+     *
+     * @return "{@code codePointA - codePointB}" after both undergo
case-conversion equivalent
+     * to, but faster than, {@code
Character.toLowerCase(Character.toUpperCase(codePoint))}
+     *
+     * @implNote This method uses two, duplicate {@code switch}
expressions, because
+     * benchmarking showed a noticeable performance improvement over a
single {@code
+     * switch} wrapped in the most minimal of two-iteration loops. The
duplication
+     * should create no maintenance problems, because, as detailed by a
comment within
+     * the method, the {@code switch} source code is programmatically
generated when a
+     * "CompareIC" unit test detects a mishandled "triple-case"
code-point. Otherwise,
+     * modifying these expressions is unnecessary. (The unit test does not
alter this,
+     * or any, file; it merely includes updated {@code switch} source in
its error
+     * message, leaving a human responsible for pasting it here, twice.)
+     */
+    static int compareCodePointsIgnoringCase
+    (
+        final int               codePointA,
+        final int               codePointB
+    ){
+    /*  A code-point is here dubbed "triple-case" if "codePoint !=
Character.toUpperCase(codePoint) &&
+        codePoint !=
Character.toLowerCase(Character.toUpperCase(codePoint))". Such code-points
are rare
+        (see "switch" expressions below), but force most code-point
comparisons to invoke both "toUpperCase"
+        and "toLowerCase". If "triple-case" code-points did not exist,
invoking only "toLowerCase" would
+        suffice.
+
+        This method comes close to realizing the "no triple-case
code-points" scenario by directly converting
+        "triple-case" code-points to their final, lowercase forms using
"switch" expressions, and invoking
+        only "toLowerCase" on all other code-points. The effect is a
benchmark performance gain of 24.3%.
+
+        Note: The Georgian alphabet includes no "triple-case" code-points,
as shown below by the "progression"
+              columns of the "switch" expressions' comments. (As of
Unicode 13.0.0, there are three Georgian
+              blocks: "Georgian" 10A0-10FF, "Georgian Extended" 1C90-1CBF,
and "Georgian Supplement" 2D00-2D2F.)
+
+
+        TESTING AND CODE GENERATION
+
+        A unit test in "CompareIC"
("test/jdk/java/lang/String/CompareIC.java") searches all 1,114,112 Unicode
+        code-points/-units for triple-case code-points, then validates
their handling by "String.equalsIgnoreCase"
+        and "String.compareToIgnoreCase". If errors are detected, those
tests fail with a message detailing each
+        error and supplying freshly generated source code for the "switch"
expression used twice below. (If the
+        new and existing set of "case" clauses differ, pasting the former
over the latter will fix the problem.
+        If "case" clauses are identical, the problem lies elsewhere.)
+    */
+                                          //  Triple-Case Code-Points, as
of Java 15.0.2. (Written by
"java.lang.CompareIC.generateTripleCaseCodePointsSwitchExpression".)
+                                          //
+                                          //  Code-Point Progression
   Code-Point Progression by Name
+        return switch (codePointA) {      //  --------------------------

 
-------------------------------------------------------------------------------------------------------------------------------------
+            case '\u00B5' -> '\u03BC';    //  U+00B5 -> U+039C -> U+03BC
   MICRO SIGN                                            -> GREEK CAPITAL
LETTER MU               -> GREEK SMALL LETTER MU
+            case '\u0131' -> '\u0069';    //  U+0131 -> U+0049 -> U+0069
   LATIN SMALL LETTER DOTLESS I                          -> LATIN CAPITAL
LETTER I                -> LATIN SMALL LETTER I
+            case '\u017F' -> '\u0073';    //  U+017F -> U+0053 -> U+0073
   LATIN SMALL LETTER LONG S                             -> LATIN CAPITAL
LETTER S                -> LATIN SMALL LETTER S
+            case '\u01C5' -> '\u01C6';    //  U+01C5 -> U+01C4 -> U+01C6
   LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON -> LATIN CAPITAL
LETTER DZ WITH CARON    -> LATIN SMALL LETTER DZ WITH CARON
+            case '\u01C8' -> '\u01C9';    //  U+01C8 -> U+01C7 -> U+01C9
   LATIN CAPITAL LETTER L WITH SMALL LETTER J            -> LATIN CAPITAL
LETTER LJ               -> LATIN SMALL LETTER LJ
+            case '\u01CB' -> '\u01CC';    //  U+01CB -> U+01CA -> U+01CC
   LATIN CAPITAL LETTER N WITH SMALL LETTER J            -> LATIN CAPITAL
LETTER NJ               -> LATIN SMALL LETTER NJ
+            case '\u01F2' -> '\u01F3';    //  U+01F2 -> U+01F1 -> U+01F3
   LATIN CAPITAL LETTER D WITH SMALL LETTER Z            -> LATIN CAPITAL
LETTER DZ               -> LATIN SMALL LETTER DZ
+            case '\u0345' -> '\u03B9';    //  U+0345 -> U+0399 -> U+03B9
   COMBINING GREEK YPOGEGRAMMENI                         -> GREEK CAPITAL
LETTER IOTA             -> GREEK SMALL LETTER IOTA
+            case '\u03C2' -> '\u03C3';    //  U+03C2 -> U+03A3 -> U+03C3
   GREEK SMALL LETTER FINAL SIGMA                        -> GREEK CAPITAL
LETTER SIGMA            -> GREEK SMALL LETTER SIGMA
+            case '\u03D0' -> '\u03B2';    //  U+03D0 -> U+0392 -> U+03B2
   GREEK BETA SYMBOL                                     -> GREEK CAPITAL
LETTER BETA             -> GREEK SMALL LETTER BETA
+            case '\u03D1' -> '\u03B8';    //  U+03D1 -> U+0398 -> U+03B8
   GREEK THETA SYMBOL                                    -> GREEK CAPITAL
LETTER THETA            -> GREEK SMALL LETTER THETA
+            case '\u03D5' -> '\u03C6';    //  U+03D5 -> U+03A6 -> U+03C6
   GREEK PHI SYMBOL                                      -> GREEK CAPITAL
LETTER PHI              -> GREEK SMALL LETTER PHI
+            case '\u03D6' -> '\u03C0';    //  U+03D6 -> U+03A0 -> U+03C0
   GREEK PI SYMBOL                                       -> GREEK CAPITAL
LETTER PI               -> GREEK SMALL LETTER PI
+            case '\u03F0' -> '\u03BA';    //  U+03F0 -> U+039A -> U+03BA
   GREEK KAPPA SYMBOL                                    -> GREEK CAPITAL
LETTER KAPPA            -> GREEK SMALL LETTER KAPPA
+            case '\u03F1' -> '\u03C1';    //  U+03F1 -> U+03A1 -> U+03C1
   GREEK RHO SYMBOL                                      -> GREEK CAPITAL
LETTER RHO              -> GREEK SMALL LETTER RHO
+            case '\u03F5' -> '\u03B5';    //  U+03F5 -> U+0395 -> U+03B5
   GREEK LUNATE EPSILON SYMBOL                           -> GREEK CAPITAL
LETTER EPSILON          -> GREEK SMALL LETTER EPSILON
+            case '\u1C80' -> '\u0432';    //  U+1C80 -> U+0412 -> U+0432
   CYRILLIC SMALL LETTER ROUNDED VE                      -> CYRILLIC
CAPITAL LETTER VE            -> CYRILLIC SMALL LETTER VE
+            case '\u1C81' -> '\u0434';    //  U+1C81 -> U+0414 -> U+0434
   CYRILLIC SMALL LETTER LONG-LEGGED DE                  -> CYRILLIC
CAPITAL LETTER DE            -> CYRILLIC SMALL LETTER DE
+            case '\u1C82' -> '\u043E';    //  U+1C82 -> U+041E -> U+043E
   CYRILLIC SMALL LETTER NARROW O                        -> CYRILLIC
CAPITAL LETTER O             -> CYRILLIC SMALL LETTER O
+            case '\u1C83' -> '\u0441';    //  U+1C83 -> U+0421 -> U+0441
   CYRILLIC SMALL LETTER WIDE ES                         -> CYRILLIC
CAPITAL LETTER ES            -> CYRILLIC SMALL LETTER ES
+            case '\u1C84' -> '\u0442';    //  U+1C84 -> U+0422 -> U+0442
   CYRILLIC SMALL LETTER TALL TE                         -> CYRILLIC
CAPITAL LETTER TE            -> CYRILLIC SMALL LETTER TE
+            case '\u1C85' -> '\u0442';    //  U+1C85 -> U+0422 -> U+0442
   CYRILLIC SMALL LETTER THREE-LEGGED TE                 -> CYRILLIC
CAPITAL LETTER TE            -> CYRILLIC SMALL LETTER TE
+            case '\u1C86' -> '\u044A';    //  U+1C86 -> U+042A -> U+044A
   CYRILLIC SMALL LETTER TALL HARD SIGN                  -> CYRILLIC
CAPITAL LETTER HARD SIGN     -> CYRILLIC SMALL LETTER HARD SIGN
+            case '\u1C87' -> '\u0463';    //  U+1C87 -> U+0462 -> U+0463
   CYRILLIC SMALL LETTER TALL YAT                        -> CYRILLIC
CAPITAL LETTER YAT           -> CYRILLIC SMALL LETTER YAT
+            case '\u1C88' -> '\uA64B';    //  U+1C88 -> U+A64A -> U+A64B
   CYRILLIC SMALL LETTER UNBLENDED UK                    -> CYRILLIC
CAPITAL LETTER MONOGRAPH UK  -> CYRILLIC SMALL LETTER MONOGRAPH UK
+            case '\u1E9B' -> '\u1E61';    //  U+1E9B -> U+1E60 -> U+1E61
   LATIN SMALL LETTER LONG S WITH DOT ABOVE              -> LATIN CAPITAL
LETTER S WITH DOT ABOVE -> LATIN SMALL LETTER S WITH DOT ABOVE
+            case '\u1FBE' -> '\u03B9';    //  U+1FBE -> U+0399 -> U+03B9
   GREEK PROSGEGRAMMENI                                  -> GREEK CAPITAL
LETTER IOTA             -> GREEK SMALL LETTER IOTA
+                                          //  --------------------------

 
-------------------------------------------------------------------------------------------------------------------------------------
+        //  All other case-sensitive code-points are either uppercase (in
which case they are changed
+        //  below), or lowercase already (in which case they are not).
Therefore, only "toLowerCase"
+        //  is necessary. Code-units and case-insensitive code-points are
unchanged by "toLowerCase".
+            default -> Character.toLowerCase(codePointA);
+
+        } - switch (codePointB) {         //  --------------------------

 
-------------------------------------------------------------------------------------------------------------------------------------
+            case '\u00B5' -> '\u03BC';    //  U+00B5 -> U+039C -> U+03BC
   MICRO SIGN                                            -> GREEK CAPITAL
LETTER MU               -> GREEK SMALL LETTER MU
+            case '\u0131' -> '\u0069';    //  U+0131 -> U+0049 -> U+0069
   LATIN SMALL LETTER DOTLESS I                          -> LATIN CAPITAL
LETTER I                -> LATIN SMALL LETTER I
+            case '\u017F' -> '\u0073';    //  U+017F -> U+0053 -> U+0073
   LATIN SMALL LETTER LONG S                             -> LATIN CAPITAL
LETTER S                -> LATIN SMALL LETTER S
+            case '\u01C5' -> '\u01C6';    //  U+01C5 -> U+01C4 -> U+01C6
   LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON -> LATIN CAPITAL
LETTER DZ WITH CARON    -> LATIN SMALL LETTER DZ WITH CARON
+            case '\u01C8' -> '\u01C9';    //  U+01C8 -> U+01C7 -> U+01C9
   LATIN CAPITAL LETTER L WITH SMALL LETTER J            -> LATIN CAPITAL
LETTER LJ               -> LATIN SMALL LETTER LJ
+            case '\u01CB' -> '\u01CC';    //  U+01CB -> U+01CA -> U+01CC
   LATIN CAPITAL LETTER N WITH SMALL LETTER J            -> LATIN CAPITAL
LETTER NJ               -> LATIN SMALL LETTER NJ
+            case '\u01F2' -> '\u01F3';    //  U+01F2 -> U+01F1 -> U+01F3
   LATIN CAPITAL LETTER D WITH SMALL LETTER Z            -> LATIN CAPITAL
LETTER DZ               -> LATIN SMALL LETTER DZ
+            case '\u0345' -> '\u03B9';    //  U+0345 -> U+0399 -> U+03B9
   COMBINING GREEK YPOGEGRAMMENI                         -> GREEK CAPITAL
LETTER IOTA             -> GREEK SMALL LETTER IOTA
+            case '\u03C2' -> '\u03C3';    //  U+03C2 -> U+03A3 -> U+03C3
   GREEK SMALL LETTER FINAL SIGMA                        -> GREEK CAPITAL
LETTER SIGMA            -> GREEK SMALL LETTER SIGMA
+            case '\u03D0' -> '\u03B2';    //  U+03D0 -> U+0392 -> U+03B2
   GREEK BETA SYMBOL                                     -> GREEK CAPITAL
LETTER BETA             -> GREEK SMALL LETTER BETA
+            case '\u03D1' -> '\u03B8';    //  U+03D1 -> U+0398 -> U+03B8
   GREEK THETA SYMBOL                                    -> GREEK CAPITAL
LETTER THETA            -> GREEK SMALL LETTER THETA
+            case '\u03D5' -> '\u03C6';    //  U+03D5 -> U+03A6 -> U+03C6
   GREEK PHI SYMBOL                                      -> GREEK CAPITAL
LETTER PHI              -> GREEK SMALL LETTER PHI
+            case '\u03D6' -> '\u03C0';    //  U+03D6 -> U+03A0 -> U+03C0
   GREEK PI SYMBOL                                       -> GREEK CAPITAL
LETTER PI               -> GREEK SMALL LETTER PI
+            case '\u03F0' -> '\u03BA';    //  U+03F0 -> U+039A -> U+03BA
   GREEK KAPPA SYMBOL                                    -> GREEK CAPITAL
LETTER KAPPA            -> GREEK SMALL LETTER KAPPA
+            case '\u03F1' -> '\u03C1';    //  U+03F1 -> U+03A1 -> U+03C1
   GREEK RHO SYMBOL                                      -> GREEK CAPITAL
LETTER RHO              -> GREEK SMALL LETTER RHO
+            case '\u03F5' -> '\u03B5';    //  U+03F5 -> U+0395 -> U+03B5
   GREEK LUNATE EPSILON SYMBOL                           -> GREEK CAPITAL
LETTER EPSILON          -> GREEK SMALL LETTER EPSILON
+            case '\u1C80' -> '\u0432';    //  U+1C80 -> U+0412 -> U+0432
   CYRILLIC SMALL LETTER ROUNDED VE                      -> CYRILLIC
CAPITAL LETTER VE            -> CYRILLIC SMALL LETTER VE
+            case '\u1C81' -> '\u0434';    //  U+1C81 -> U+0414 -> U+0434
   CYRILLIC SMALL LETTER LONG-LEGGED DE                  -> CYRILLIC
CAPITAL LETTER DE            -> CYRILLIC SMALL LETTER DE
+            case '\u1C82' -> '\u043E';    //  U+1C82 -> U+041E -> U+043E
   CYRILLIC SMALL LETTER NARROW O                        -> CYRILLIC
CAPITAL LETTER O             -> CYRILLIC SMALL LETTER O
+            case '\u1C83' -> '\u0441';    //  U+1C83 -> U+0421 -> U+0441
   CYRILLIC SMALL LETTER WIDE ES                         -> CYRILLIC
CAPITAL LETTER ES            -> CYRILLIC SMALL LETTER ES
+            case '\u1C84' -> '\u0442';    //  U+1C84 -> U+0422 -> U+0442
   CYRILLIC SMALL LETTER TALL TE                         -> CYRILLIC
CAPITAL LETTER TE            -> CYRILLIC SMALL LETTER TE
+            case '\u1C85' -> '\u0442';    //  U+1C85 -> U+0422 -> U+0442
   CYRILLIC SMALL LETTER THREE-LEGGED TE                 -> CYRILLIC
CAPITAL LETTER TE            -> CYRILLIC SMALL LETTER TE
+            case '\u1C86' -> '\u044A';    //  U+1C86 -> U+042A -> U+044A
   CYRILLIC SMALL LETTER TALL HARD SIGN                  -> CYRILLIC
CAPITAL LETTER HARD SIGN     -> CYRILLIC SMALL LETTER HARD SIGN
+            case '\u1C87' -> '\u0463';    //  U+1C87 -> U+0462 -> U+0463
   CYRILLIC SMALL LETTER TALL YAT                        -> CYRILLIC
CAPITAL LETTER YAT           -> CYRILLIC SMALL LETTER YAT
+            case '\u1C88' -> '\uA64B';    //  U+1C88 -> U+A64A -> U+A64B
   CYRILLIC SMALL LETTER UNBLENDED UK                    -> CYRILLIC
CAPITAL LETTER MONOGRAPH UK  -> CYRILLIC SMALL LETTER MONOGRAPH UK
+            case '\u1E9B' -> '\u1E61';    //  U+1E9B -> U+1E60 -> U+1E61
   LATIN SMALL LETTER LONG S WITH DOT ABOVE              -> LATIN CAPITAL
LETTER S WITH DOT ABOVE -> LATIN SMALL LETTER S WITH DOT ABOVE
+            case '\u1FBE' -> '\u03B9';    //  U+1FBE -> U+0399 -> U+03B9
   GREEK PROSGEGRAMMENI                                  -> GREEK CAPITAL
LETTER IOTA             -> GREEK SMALL LETTER IOTA
+                                          //  --------------------------

 
-------------------------------------------------------------------------------------------------------------------------------------
+        //  All other case-sensitive code-points are either uppercase (in
which case they are changed
+        //  below), or lowercase already (in which case they are not).
Therefore, only "toLowerCase"
+        //  is necessary. Code-units and case-insensitive code-points are
unchanged by "toLowerCase".
+            default -> Character.toLowerCase(codePointB);
+        };
     }

     public static int compareToCI_Latin1(byte[] value, byte[] other) {
@@ -780,10 +1141,35 @@
         }
         return new String(result, UTF16);
     }
-
-    public static boolean regionMatchesCI(byte[] value, int toffset,
-                                          byte[] other, int ooffset, int
len) {
-        return compareToCIImpl(value, toffset, len, other, ooffset, len)
== 0;
+
+    /**
+     * Tests case-insensitive equality of UTF-16 sequences in {@code byte}
array subsections.
+     * Performs {@linkplain #compareCodePointsIgnoringCase case
conversions equivalent to
+     *
<code>Character.toLowerCase(Character.toUpperCase(codePoint))</code>}
before comparing
+     * unequal code-points.
+     *
+     * @param charsA      array of byte pairs representing 16-bit ({@code
char}) quantities,
+     *                    with byte order mandated by {@code
StringUTF16.isBigEndian()}
+     * @param charsAIndex index, in 16-bit quantities, at which comparison
of "{@code charsA}"
+     *                    begins. Caller must enforce constraints "{@code
0 <= charsAIndex}"
+     *                    and "{@code 0 <= charsAIndex + totalChars <=
charsA.length / 2}".
+     * @param charsB      array of byte pairs representing 16-bit ({@code
char}) quantities,
+     *                    with byte order mandated by {@code
StringUTF16.isBigEndian()}
+     * @param charsBIndex index, in 16-bit quantities, at which comparison
of "{@code charsB}"
+     *                    begins. Caller must enforce constraints "{@code
0 <= charsBIndex}"
+     *                    and "{@code 0 <= charsBIndex + totalChars <=
charsB.length / 2}".
+     * @param totalChars  maximum number of {@code char}s to compare.
Caller must enforce
+     *                    constraint "{@code 0 <= totalChars}".
+     *
+     * @return {@code true} if the tested portions of {@code charsA} and
{@code charsB} are
+     * case-insensitively equal, otherwise {@code false}
+     */
+    static boolean regionMatchesCI
+    (
+        final byte[] charsA, final int charsAIndex,
+        final byte[] charsB, final int charsBIndex, final int totalChars
+    ){
+        return compareToCI(charsA, charsAIndex, charsB, charsBIndex,
totalChars) == 0;
     }

     public static boolean regionMatchesCI_Latin1(byte[] value, int toffset,

Proposed: StringUTF16 bug fix with optimization - Part 1 of 2, StringUTF16 Patch

Reply via email to