Hi Naoto,
The change looks good to me. "supLower" is indeed super slow :-)
The only minor comment I have is that the compareCodePointCI method
performs toUpperCase unconditionally. That's not a problem for the
regular case, where a check on cp1 == cp2 (line 337) is done prior to
the method call. But for the sup case (starting at line 341), the method
is called unconditionally while in webrev.04 there was a check "cp1 !=
cp2". One option to fix it is to include the "cp1 != cp2" check in the
method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.
Regards,
Joe
On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:
Hi,
I revised the fix again, based on further suggestions:
https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/
Changes from v.04 are (all in StringUTF16.java):
- The short cut now does case insensitive comparison that makes the
fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index
increment.
- Method name is changed to better reflect what it is doing, with more
descriptive comments.
Here is the benchmark results:
before:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 49.960 ? 1.923 ns/op
StringCompareToIgnoreCase.supLower avgt 25 21.003 ? 0.354 ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 30.863 ? 4.529 ns/op
StringCompareToIgnoreCase.upperLower avgt 25 15.417 ? 1.046 ns/op
after:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 46.857 ? 0.524
ns/op
StringCompareToIgnoreCase.supLower avgt 25 148.688 ? 6.546
ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 37.160 ? 0.259
ns/op
StringCompareToIgnoreCase.upperLower avgt 25 15.126 ? 0.338
ns/op
Now non-supplementary operations ("lower" and "upperLower") are on par
with the "before" result (I am not quite sure why the "after" results
are somewhat faster though). For supplementary test cases, "supLower"
is very slow. The reason is two fold; one is because "before" one
exits at the very first character (which I am addressing here) while
"after" continues to compare to the last characters, the other reason
is the test suffers from the change where supplementary cases double
the case insensitivity checks (compared to the "after" result just
below). Also "supUpperLower" gets slower for the same reason. These
are expected results for supplementary comparisons (as we discussed).
Naoto
On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:
Hi,
Based on the suggestions, I modified the fix as follows:
https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
Changes from the initial revision are:
- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after
the change:
before:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 53.764 ? 2.811
ns/op
StringCompareToIgnoreCase.supLower avgt 25 24.211 ? 1.135
ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ? 1.344
ns/op
StringCompareToIgnoreCase.upperLower avgt 25 18.859 ? 1.499
ns/op
after:
Benchmark Mode Cnt Score Error Units
StringCompareToIgnoreCase.lower avgt 25 58.354 ? 4.603
ns/op
StringCompareToIgnoreCase.supLower avgt 25 57.975 ? 5.672
ns/op
StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ? 0.965
ns/op
StringCompareToIgnoreCase.upperLower avgt 25 17.744 ? 0.272
ns/op
Here, "sup" means all supplementary characters, BMP otherwise.
"lower" means each character requires upper->lower casemap.
"upperLower" means all characters are the same, except the last
character which requires casemap.
I think the result is reasonable, considering surrogates check are
now mandatory. I have tried Roger's suggestion to use
Arrays.mismatch() but it did not seem to benefit here. In fact, the
performance degraded partly because I implemented the short cut, and
possibly for the overhead of extra checks.
Naoto
On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:
Hello,
Please review the fix to the following issues:
https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434
The proposed changeset and its CSR are located at:
https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664
A bug was filed against SimpleDateFormat (8248434) where
case-insensitive date format/parse failed in some of the new locales
in JDK15. The root cause was that case-insensitive
String.regionMatches() method did not work with supplementary
characters. The problem is that the method's spec does not expect
case mappings of supplementary characters, possibly because it was
overlooked in the first place, JSR 204 - "Unicode Supplementary
Character support". Similar behavior is observed in other two
case-insensitive methods, i.e., compareToIgnoreCase() and
equalsIgnoreCase().
The fix is straightforward to compare strings by code point basis,
instead of code unit (16bit "char") basis. Technically this change
will introduce a backward incompatibility, but I believe it is an
incompatibility to wrong behavior, not true to the meaning of those
methods' expectations.
Naoto