On Sat, 18 Feb 2023 18:22:49 GMT, Eirik Bjorsnos <d...@openjdk.org> wrote:

> This PR continues the efforts from #12632 to speed up case-insensitive string 
> matching.
> 
> We now tackle case-insensitive comparison of mixed-coder strings, implemented 
> in `StringLatin1.regionMatchesCI_UTF16`
> 
> Key insights:
> 
> - If the UTF16 code point is also in latin1 range, we can leverage the 
> improvements from #12632 directly by calling 
> `CharacterDataLatin1.equalsIgnoreCase`.
> - There are exactly 7 non-latin1 Unicode code points which case fold into the 
> latin1 range. We can special-case our comparison of these code points by 
> adding the method `CharacterDataLatin1.latin1CaseFold`.
> - To avoid checking `a == b` twice, this check is lifted out of 
> `CharacterDataLatin1.equalsIgnoreCase` and the two callers are updated to 
> check that `a != b` before calling the method.
>  
> For completeness, the RegionMatches test is updated to also compare Turkic 
> dotted/dotless 'i's against the uppercase ASCII 'I', not just the lowercase 
> one. This is not strictly related to the purpose of this PR, but it did help 
> catch a regression introduced in an earlier iteration of the PR.
> 
> To guard against regressions caused by future changes to the set of Unicode 
> code points folding into latin1, a new test is added to `EqualsIgnoreCase` 
> which identifies all such code points and verifies they are compared correctly.
> 
> Performance is tested for matching and mismatching cases of selected code 
> point pairs picked from the ASCII letter, ASCII number, latin1 letter and 
> non-latin1 Unicode letter ranges. Results in the first comment below.
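
The key insights in the quoted description can be sketched using public APIs only. This is a minimal illustration, not the PR's code: `MixedRegionMatch`, `fold` and `matches` are hypothetical names, and `fold` assumes the lower(upper(c)) rule documented for `String.regionMatches` in place of the JDK-internal `CharacterDataLatin1.equalsIgnoreCase` and `latin1CaseFold`. Bounds checks are omitted.

```java
import java.nio.charset.StandardCharsets;

final class MixedRegionMatch {
    // Assumed folding rule: lower(upper(c)). A stand-in for the JDK-internal
    // methods named in the PR, which are not callable from user code.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Compare len chars of a latin1-coded byte[] with a UTF-16 char[], ignoring case.
    static boolean regionMatchesCI(byte[] latin1, int off1, char[] utf16, int off2, int len) {
        for (int i = 0; i < len; i++) {
            char a = (char) (latin1[off1 + i] & 0xFF);
            char b = utf16[off2 + i];
            if (a == b) {
                continue; // identity is checked once, here in the caller
            }
            if (fold(a) != fold(b)) {
                return false; // only reached when a != b
            }
        }
        return true;
    }

    // Convenience wrapper; the first argument is assumed to contain only latin1 chars.
    static boolean matches(String latin1Only, String other) {
        if (latin1Only.length() != other.length()) {
            return false;
        }
        byte[] bytes = latin1Only.getBytes(StandardCharsets.ISO_8859_1);
        return regionMatchesCI(bytes, 0, other.toCharArray(), 0, bytes.length);
    }

    public static void main(String[] args) {
        // KELVIN SIGN (U+212A) lowercases to ASCII 'k'.
        System.out.println(matches("kelvin", "\u212AELVIN"));
        // Dotless i (U+0131) uppercases to ASCII 'I'.
        System.out.println(matches("I", "\u0131"));
        System.out.println(matches("abc", "abd"));
    }
}
```

The early `continue` plays the role of the lifted `a == b` check: the folding step runs only for pairs already known to differ.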

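The regression test described in the quote identifies the non-latin1 code points that compare equal to some latin1 character. A rough sketch of such a scan follows, again assuming the lower(upper(c)) folding rule and using the hypothetical names `Latin1FoldScan`, `fold`, `latin1Folds` and `scan`. The set it reports depends on this folding definition and on the Unicode version of the running JDK, so it need not coincide with the set the PR special-cases.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class Latin1FoldScan {
    // Assumed folding rule, as documented for String.regionMatches.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Folded forms of all 256 latin1 chars.
    static BitSet latin1Folds() {
        BitSet folds = new BitSet(0x10000);
        for (int c = 0; c <= 0xFF; c++) {
            folds.set(fold((char) c));
        }
        return folds;
    }

    // All BMP chars above U+00FF whose folded form equals that of some latin1 char.
    static List<Integer> scan() {
        BitSet folds = latin1Folds();
        List<Integer> hits = new ArrayList<>();
        for (int c = 0x100; c <= 0xFFFF; c++) {
            if (folds.get(fold((char) c))) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (int c : scan()) {
            System.out.printf("U+%04X %s%n", c, Character.getName(c));
        }
    }
}
```

Running `main` lists each hit with its Unicode name, which makes it easy to spot when a JDK Unicode update changes the set.
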
Benchmark results:

Baseline:


Benchmark                                 (codePoints)  (size)  Mode  Cnt      Score     Error  Units
RegionMatchesIC.Mixed.regionMatchesIC      ascii-match    1024  avgt   15   1497.391 ±  22.350  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   ascii-mismatch    1024  avgt   15      5.346 ±   0.165  ns/op
RegionMatchesIC.Mixed.regionMatchesIC     number-match    1024  avgt   15    364.034 ±   5.561  ns/op
RegionMatchesIC.Mixed.regionMatchesIC  number-mismatch    1024  avgt   15      4.036 ±   0.171  ns/op
RegionMatchesIC.Mixed.regionMatchesIC       lat1-match    1024  avgt   15   2674.043 ± 174.669  ns/op
RegionMatchesIC.Mixed.regionMatchesIC    lat1-mismatch    1024  avgt   15      6.493 ±   0.230  ns/op
RegionMatchesIC.Mixed.regionMatchesIC      utf16-match    1024  avgt   15  12630.314 ± 212.472  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   utf16-mismatch    1024  avgt   15     14.796 ±   0.359  ns/op



PR:


Benchmark                                 (codePoints)  (size)  Mode  Cnt     Score    Error  Units
RegionMatchesIC.Mixed.regionMatchesIC      ascii-match    1024  avgt   15  1449.499 ± 14.350  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   ascii-mismatch    1024  avgt   15     3.450 ±  0.082  ns/op
RegionMatchesIC.Mixed.regionMatchesIC     number-match    1024  avgt   15   362.582 ±  2.963  ns/op
RegionMatchesIC.Mixed.regionMatchesIC  number-mismatch    1024  avgt   15     3.259 ±  0.021  ns/op
RegionMatchesIC.Mixed.regionMatchesIC       lat1-match    1024  avgt   15  1625.513 ± 14.305  ns/op
RegionMatchesIC.Mixed.regionMatchesIC    lat1-mismatch    1024  avgt   15     3.858 ±  0.027  ns/op
RegionMatchesIC.Mixed.regionMatchesIC      utf16-match    1024  avgt   15  1422.722 ± 85.581  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   utf16-mismatch    1024  avgt   15     3.756 ±  0.089  ns/op

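To read the two tables side by side, the baseline-to-PR ratio per case can be computed as below. This is only a convenience sketch: `MixedSpeedup` and its members are hypothetical names, the scores are copied from the tables above, and the reported error margins are ignored.

```java
final class MixedSpeedup {
    static final String[] CASES = {
        "ascii-match", "ascii-mismatch", "number-match", "number-mismatch",
        "lat1-match", "lat1-mismatch", "utf16-match", "utf16-mismatch"
    };
    // Scores in ns/op, copied from the Baseline and PR tables.
    static final double[] BASELINE = {
        1497.391, 5.346, 364.034, 4.036, 2674.043, 6.493, 12630.314, 14.796
    };
    static final double[] PR = {
        1449.499, 3.450, 362.582, 3.259, 1625.513, 3.858, 1422.722, 3.756
    };

    // Ratio above 1.0 means the PR is faster than baseline for that case.
    static double speedup(int i) {
        return BASELINE[i] / PR[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < CASES.length; i++) {
            System.out.printf("%-16s %5.2fx%n", CASES[i], speedup(i));
        }
    }
}
```

On these numbers the largest gain is the utf16-match case at roughly 8.9x, while ascii-match and number-match stay within a few percent of baseline.
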
-------------

PR: https://git.openjdk.org/jdk/pull/12637
