Re: Possible optimization in StringLatin1.regionMatchesCI

Christoph Dreis Tue, 26 May 2020 04:09:01 -0700

Hi Martin,

> Not a review, but:
> Compare with the variant of this code in StringUTF16.
> StringLatin1 only ever needs to support the first 256 chars in Unicode


Does it really? That makes me wonder even more about the additional lowercase 
check.

> which can never change, unlike StringUTF16,

What do you mean by "can never change"?

> Do all the String tests still pass if you simplify the code?

You mean if I remove the additional lowercasing completely!?

--- a/src/java.base/share/classes/java/lang/StringLatin1.java   Tue May 26 
09:25:23 2020 +0200
+++ b/src/java.base/share/classes/java/lang/StringLatin1.java   Tue May 26 
13:01:14 2020 +0200
@@ -396,9 +396,6 @@
             if (u1 == u2) {
                 continue;
             }
-            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
-                continue;
-            }
             return false;
         }
         return true;

Tier1 seems to pass still (apart from some tests that don't seem to like my 
German setup - also without the change)

> Should CharacterDataLatin1 have a method to compare two characters
> case-insensitively?

What do you mean by that? It needs to keep at least one check for the 
regionMatchesCI method.

> Be careful with Latin Small Letter sharp S
That seems to stay 223 regardless of uppercasing or lowercasing it.

I'm afraid your answer raised more question marks than it actually solved. ;-)

Cheers,
Christoph

>On Mon, May 25, 2020 at 2:16 PM Christoph Dreis
><[email protected]> wrote:
>>
>> Hi,
>>
>> I've recently looked through the StringLatin1 code - specifically 
>> regionMatchesCI.
>>
>> I think I have an optimization, but would need someone with more domain 
>> knowledge to verify if I'm missing nothing.
>>
>> Currently, the code does a conversion to uppercase and if that doesn't match 
>> it does an additional comparison of the lowercase characters.
>> What's confusing to me is that there are actually both upper- and lowercase 
>> checks needed, but that might be explained by the comment in the UTF-16 
>> version about the Georgian alphabet.
>>
>> Assuming that the additional lowercase check is needed, I was wondering if 
>> this must be on the uppercase variant. Wouldn't it be faster on the 
>> character itself to avoid potentially converting a lowercase character to an 
>> uppercase character and back?
>>
>> I think code is actually better explaining what I'm suggesting:
>>
>> --- a/src/java.base/share/classes/java/lang/StringLatin1.java   Wed May 13 
>> 16:18:16 2020 +0200
>> +++ b/src/java.base/share/classes/java/lang/StringLatin1.java   Mon May 25 
>> 22:59:13 2020 +0200
>> @@ -396,7 +396,7 @@
>>              if (u1 == u2) {
>>                  continue;
>>              }
>> -            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
>> +            if (Character.toLowerCase(c1) == Character.toLowerCase(c2)) {
>>                  continue;
>>              }
>>              return false;
>>
>>
>> And indeed the newer version seems to be faster if I use the following 
>> benchmark:
>>
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public class MyBenchmark {
>>
>>     @State(Scope.Benchmark)
>>     public static class ThreadState {
>>         private String test1 = "test";
>>         private String test2 = "best";
>>     }
>>
>>     @Benchmark
>>     public boolean test(ThreadState threadState) {
>>         return threadState.test1.equalsIgnoreCase(threadState.test2);
>>     }
>>
>> }
>>
>> Benchmark                                      Mode  Cnt   Score    Error   
>> Units
>> MyBenchmark.testOld                  avgt   10   8,843 ±  0,274   ns/op
>> MyBenchmark.testPatched          avgt   10   7,067 ±  0,177   ns/op
>>
>> Does this make sense? Do I miss something here? I would appreciate if 
>> someone can either explain the shortcomings of the solution above or - in 
>> case there aren't any - can maybe sponsor it.
>>
>> Cheers,
>> Christoph
>>
>>

Re: Possible optimization in StringLatin1.regionMatchesCI

Reply via email to