Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

naoto . sato Tue, 21 Jul 2020 15:27:25 -0700

Hi Brent,

On 7/21/20 2:56 PM, Brent Christian wrote:

Hi, Naoto
I have a few comments:

src/java.base/share/classes/java/lang/StringUTF16.java
379 private static int getSupplementaryCodePoint(byte[] ba, int cp,int index, int start, int end)
I think it would be worth a small addition to the comment to reflectthat non-surrogate chars are returned as-is.


Sure, I will add some more comments to the method.

--
I thought about the scenario of an unpaired low or high surrogate at thebeginning or end of the string, respectively:
384         if (Character.isLowSurrogate((char)cp)) {
385             if (index > start) {
                 ...
391         } else if (index + 1 < end) { // cp == high surrogate
392             char c = getChar(ba, index + 1);
                 ...
397         return cp;
It looks like the cp itself would be returned fromgetSupplementaryCodePoint(). And then back in compareToCIImpl(), it'sconverted using Character.to[Upper|Lower]Case(int), which will alsoreturn the cp itself. I imagine that's the best we could do, so seemsfine.


Yes, that is exactly what is intended.

Is there a test case for unmatched surrogates at the beginning and endof the string ? Should there be ?

Interestingly, there has been a test case for supplementary charactersbefore this change, where it intentionally begins from a low surrogate,and ends with a high surrogate, so that it would succeed in the previous*exact match* logic. Line 82 in RegionMatches.java tests:"\uD801\uDC28\uD801\uDC29\uFF41a".regionMatches(true, 1, "\uDC28\uD801",0, 2) == true

And the proposed change is compatible with this test case.

--
I see there are no changes to StringLatin1.regionMatchesCI_UTF16(). Ipresume there are no cases in which toUpperCase(toLowerCase()) of asupplementary character could yield a Latin-1 character, yes?


Yes, that is correct.

Naoto

Also, thanks for adding the benchmark!

-Brent

On 7/20/20 3:20 PM, [email protected] wrote:
Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, [email protected] wrote:
Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:
Hi Naoto,
StringUTF16: line 384 - 388 seem unnecessary since you'd only getthere if 389:isHighSurrogate is not true.
Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider insteadof adding a new method.
If we call codePointAt/Before, it would call an extra getChar().Since we know one codepoint as an input, I would avoid the extra calls.
Comparing to the base benchmark:
StringCompareToIgnoreCase.lower          8.5%
StringCompareToIgnoreCase.supLower      139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower     avgt   -5.9%
"lower" was 8.5% slower, if such test exists in the specJVM, itwould be considered a regression. I would suggest you run thespecJVM. I agree with you on surrogate check being a requirement,thus supLower being 139% slower is ok since it won't otherwise becorrect anyways.
Yes, it would be a regression if SPECjvm produces 8+% degradation,but the test suite is aimed at the entire application performance.But for this one, it is a micro benchmark for relatively rarelyissued methods (I would think normal cases fall into Latin1equivalents), I would consider it is acceptable.
But after introducing additional operations supUpperLower andupperLower ran faster? That may indicate irregularity in the tests.Maybe we should consider running tests with short, long and verylong sample strings to see if we can reduce the noise level and alsosee how it fares for a longer string. I assume the machine you'rerunning the test on was isolated.
This result pretty much depends on the data it is testing for. As Iwrote in the previous email, (sup)UpperLower tests use the data thatare almost identical, but one last character is case insensitivelyequal. So in these cases, the new short cut works really well and notcall toLower/UpperCase() at all for most of the characters. Thus thenew results are faster. Again the test result is very dependent onthe input data, Unless the result showed 100% slower or something(except supLower case), I think it is OK.
Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto
Regards,
Joe

On 7/19/2020 11:05 AM, [email protected] wrote:
Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up tothe point where you hit a high surrogate, then drop into theslower codepoint path. That saves time for the high-runner cases.
On the other hand, if the times are good enough, you might notneed the complication.
The implementation is dealing with bare byte arrays. Only methodsthat it uses from Character class are toLowerCase(int) andtoUpperCase(int) (sans surrogate check, which is needed at eachiteration anyways), and their "char" equivalents are merely casting(char) to the int result. So it might not be so beneficial todifferentiate char and int paths.
Having said that, I found that there was an unnecessary surrogatecheck (always checks high *and* low surrogate on each iteration),so I revised the fix (added line 380-383 in StringUTF16.java):
http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto
Mark
//////
On Fri, Jul 17, 2020 at 4:39 PM <[email protected]<mailto:[email protected]>> wrote:
    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:
- Shared the implementation between compareToCI() andregionMatchesCI()
    - Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores beforeand after
    the change:

    before:
Benchmark Mode Cnt Score Error Units StringCompareToIgnoreCase.lower avgt 25 53.764 ?2.811 ns/op StringCompareToIgnoreCase.supLower avgt 25 24.211 ?1.135 ns/op StringCompareToIgnoreCase.supUpperLower avgt 25 30.595 ?1.344 ns/op StringCompareToIgnoreCase.upperLower avgt 25 18.859 ?1.499 ns/op
    after:
Benchmark Mode Cnt Score Error Units StringCompareToIgnoreCase.lower avgt 25 58.354 ?4.603 ns/op StringCompareToIgnoreCase.supLower avgt 25 57.975 ?5.672 ns/op StringCompareToIgnoreCase.supUpperLower avgt 25 23.912 ?0.965 ns/op StringCompareToIgnoreCase.upperLower avgt 25 17.744 ?0.272 ns/op
Here, "sup" means all supplementary characters, BMP otherwise."lower" means each character requires upper->lower casemap."upperLower" means all characters are the same, except the last character whichrequires
    casemap.
I think the result is reasonable, considering surrogates checkare now mandatory. I have tried Roger's suggestion to useArrays.mismatch() but it did not seem to benefit here. In fact, the performancedegraded
    partly because I implemented the short cut, and possibly for the
    overhead of extra checks.

    Naoto

    On 7/15/20 9:00 AM, [email protected]
    <mailto:[email protected]> wrote:
     > Hello,
     >
     > Please review the fix to the following issues:
     >
     > https://bugs.openjdk.java.net/browse/JDK-8248655
     > https://bugs.openjdk.java.net/browse/JDK-8248434
     >
     > The proposed changeset and its CSR are located at:
     >
     > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
     > https://bugs.openjdk.java.net/browse/JDK-8248664
     >
     > A bug was filed against SimpleDateFormat (8248434) where
     > case-insensitive date format/parse failed in some of the new
    locales in
     > JDK15. The root cause was that case-insensitive
    String.regionMatches()
> method did not work with supplementary characters. Theproblem is
    that
> the method's spec does not expect case mappings ofsupplementary
     > characters, possibly because it was overlooked in the first
    place, JSR
> 204 - "Unicode Supplementary Character support". Similarbehavior is
     > observed in other two case-insensitive methods, i.e.,
     > compareToIgnoreCase() and equalsIgnoreCase().
     >
> The fix is straightforward to compare strings by code pointbasis,
     > instead of code unit (16bit "char") basis. Technically this
    change will
     > introduce a backward incompatibility, but I believe it is an
> incompatibility to wrong behavior, not true to the meaningof those
     > methods' expectations.
     >
     > Naoto

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

Reply via email to