Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

Joe Wang Wed, 22 Jul 2020 13:23:35 -0700

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method performs toUpperCase unconditionally. That's not a problem for the regular case, where a check on cp1 == cp2 (line 337) is done prior to the method call. But for the sup case (starting at line 341), the method is called unconditionally, while in webrev.04 there was a check "cp1 != cp2". One option to fix it is to include the "cp1 != cp2" check in the method compareCodePointCI; then the cp1 == cp2 check at line 337 can be omitted.
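
A minimal sketch of that option, for illustration only. This is not the code in webrev.05: the class name is hypothetical, and the upper-then-lower mapping order is an assumption modelled on how the existing case-insensitive String methods are documented to compare characters.

```java
// Hypothetical sketch; not the actual StringUTF16 code from the webrev.
final class CodePointCompareSketch {

    // Returns 0 when the two code points are equal ignoring case,
    // otherwise the difference of their case-mapped values.
    static int compareCodePointCI(int cp1, int cp2) {
        // Guard folded into the helper: identical code points skip
        // the case mappings entirely.
        if (cp1 != cp2) {
            cp1 = Character.toUpperCase(cp1);
            cp2 = Character.toUpperCase(cp2);
            if (cp1 != cp2) {
                // Second pass through lower case, assumed here to mirror
                // the documented behaviour of the char-based methods.
                cp1 = Character.toLowerCase(cp1);
                cp2 = Character.toLowerCase(cp2);
                if (cp1 != cp2) {
                    return cp1 - cp2;
                }
            }
        }
        return 0;
    }
}
```

With the guard inside the helper, both the BMP path and the supplementary path get the early exit, and callers no longer need their own cp1 == cp2 test.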


Regards,
Joe

On 7/22/20 10:23 AM, [email protected] wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):
- The short cut now does case insensitive comparison that makes the fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index increment.
- Method name is changed to better reflect what it is doing, with more descriptive comments.

Here are the benchmark results:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  49.960 ± 1.923  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  21.003 ± 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ± 4.529  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  15.417 ± 1.046  ns/op

after:
Benchmark                                Mode  Cnt    Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25   46.857 ± 0.524  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  148.688 ± 6.546  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ± 0.259  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ± 0.338  ns/op

Now non-supplementary operations ("lower" and "upperLower") are on par with the "before" result (I am not quite sure why the "after" results are somewhat faster though). For supplementary test cases, "supLower" is very slow. The reason is twofold: one is that the "before" one exits at the very first character (which I am addressing here) while "after" continues to compare to the last characters; the other is that the test suffers from the change where supplementary cases double the case insensitivity checks (compared to the "after" result just below). Also "supUpperLower" gets slower for the same reason. These are expected results for supplementary comparisons (as we discussed).

Naoto

On 7/17/20 4:36 PM, [email protected] wrote:
Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here are the scores before and after the change:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  53.764 ± 2.811  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  24.211 ± 1.135  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ± 1.344  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ± 1.499  ns/op

after:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  58.354 ± 4.603  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  57.975 ± 5.672  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ± 0.965  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ± 0.272  ns/op

Here, "sup" means all supplementary characters, BMP otherwise. "lower" means each character requires upper->lower casemap. "upperLower" means all characters are the same, except the last character which requires casemap.

I think the result is reasonable, considering surrogate checks are now mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but it did not seem to benefit here. In fact, the performance degraded, partly because I implemented the short cut, and possibly because of the overhead of extra checks.
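
For readers without the webrev at hand, the four input shapes described above ("lower", "upperLower", and their "sup" variants) could be built roughly as follows. This is only a sketch: the class name, the length, and the choice of characters (A/a for BMP, the Deseret pair U+10400/U+10428 for supplementary) are assumptions, and the actual benchmark source may differ.

```java
// Hypothetical sketch of the benchmark inputs; not the benchmark in the webrev.
final class BenchmarkInputSketch {
    static final int LEN = 16;               // assumed length, in code points
    static final int SUP_UPPER = 0x10400;    // DESERET CAPITAL LETTER LONG I
    static final int SUP_LOWER = 0x10428;    // DESERET SMALL LETTER LONG I

    // Builds a string of n copies of the given code point.
    static String repeat(int codePoint, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    // "lower": every character needs an upper->lower case mapping.
    static String[] lower() {
        return new String[] { repeat('A', LEN), repeat('a', LEN) };
    }

    // "upperLower": identical except for the last character.
    static String[] upperLower() {
        return new String[] { repeat('a', LEN - 1) + "A", repeat('a', LEN) };
    }

    // "supLower": same as "lower" but with supplementary characters.
    static String[] supLower() {
        return new String[] { repeat(SUP_UPPER, LEN), repeat(SUP_LOWER, LEN) };
    }

    // "supUpperLower": same as "upperLower" but with supplementary characters.
    static String[] supUpperLower() {
        return new String[] {
            repeat(SUP_LOWER, LEN - 1) + repeat(SUP_UPPER, 1),
            repeat(SUP_LOWER, LEN)
        };
    }
}
```

Each pair would then be passed to compareToIgnoreCase() inside a benchmark method.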
Naoto

On 7/15/20 9:00 AM, [email protected] wrote:
Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where case-insensitive date format/parse failed in some of the new locales in JDK15. The root cause was that the case-insensitive String.regionMatches() method did not work with supplementary characters. The problem is that the method's spec does not expect case mappings of supplementary characters, possibly because it was overlooked in the first place, JSR 204 - "Unicode Supplementary Character support". Similar behavior is observed in the other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase().

The fix is straightforward: compare strings on a code point basis, instead of a code unit (16-bit "char") basis. Technically this change will introduce a backward incompatibility, but I believe it is an incompatibility to wrong behavior, not true to the meaning of those methods' expectations.
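
To illustrate the difference between the two approaches, here is a minimal sketch that compares two strings both ways. It is not the proposed changeset: the class and method names are hypothetical, and the Deseret pair U+10400/U+10428 is used only as an example of supplementary characters with a case mapping.

```java
// Hypothetical illustration; not the code in the webrev.
final class CodePointCISketch {

    // Code unit (char) based comparison: each half of a surrogate pair is
    // case-mapped on its own, and surrogates have no case mapping.
    static boolean equalsIgnoreCaseByCodeUnit(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            char c1 = a.charAt(i);
            char c2 = b.charAt(i);
            if (c1 != c2
                    && Character.toUpperCase(c1) != Character.toUpperCase(c2)
                    && Character.toLowerCase(c1) != Character.toLowerCase(c2)) {
                return false;
            }
        }
        return true;
    }

    // Code point based comparison: surrogate pairs are decoded first, so the
    // case mapping sees the whole supplementary character.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cp1 = a.codePointAt(i);
            int cp2 = b.codePointAt(j);
            if (cp1 != cp2
                    && Character.toUpperCase(cp1) != Character.toUpperCase(cp2)
                    && Character.toLowerCase(cp1) != Character.toLowerCase(cp2)) {
                return false;
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp1);
            j += Character.charCount(cp2);
        }
        return i == a.length() && j == b.length();
    }
}
```

For the strings made of U+10400 and U+10428 respectively, the code unit version reports a mismatch on the low surrogate, while the code point version treats them as equal ignoring case.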
Naoto
