On Sat, 18 Feb 2023 18:22:49 GMT, Eirik Bjorsnos <d...@openjdk.org> wrote:
> This PR continues the efforts from #12632 to speed up case-insensitive string
> matching.
>
> We now tackle case-insensitive comparison of mixed-coder strings, implemented
> in `StringLatin1.regionMatchesCI_UTF16`.
>
> Key insights:
>
> - If the UTF16 code point is also in latin1 range, we can leverage
>   improvements from #12632 directly by calling
>   `CharacterDataLatin1.equalsIgnoreCase`.
> - There are exactly 7 non-latin1 Unicode code points which case fold into the
>   latin1 range. We can special-case our comparison of these code points by
>   adding the method `CharacterDataLatin1.latin1CaseFold`.
> - To avoid checking `a == b` twice, this check is lifted out of
>   `CharacterDataLatin1.equalsIgnoreCase` and the two callers are updated to
>   check that `a != b` before calling the method.
>
> For completeness, the RegionMatches test is updated to also compare Turkic
> dotted/dotless 'i's against the uppercase ASCII 'I', not just the lowercase
> one. This is not strictly related to the purpose of this PR, but it did help
> catch a regression introduced in an earlier iteration of the PR.
>
> To guard against regressions caused by future changes to the set of Unicode
> code points folding into latin1, a new test is added to `EqualsIgnoreCase`
> which identifies all such code points and verifies they are compared correctly.
>
> Performance is tested for matching and mismatching cases of selected code
> point pairs picked from the ASCII letter, ASCII number, latin1 letter and
> non-latin Unicode letter ranges. Results in the first comment below.
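For readers following along, here is a minimal sketch of the idea in the quoted description, written against the public `Character` API only. It is not the code from the PR: the class `Latin1FoldSketch` and its methods `fold`, `equalsIgnoreCase`, `regionMatchesCI` and `nonLatin1FoldingIntoLatin1` are hypothetical names for illustration, and the fold used is the upper-then-lower rule documented for `String.equalsIgnoreCase`, which I assume is close to, but not necessarily identical with, what `CharacterDataLatin1.latin1CaseFold` covers. The set it prints may therefore differ from the set special-cased in the PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the JDK-internal implementation.
class Latin1FoldSketch {

    // Fold a char the way String.equalsIgnoreCase is documented to:
    // upper-case first, then lower-case the result.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Char-level comparison. Callers check a != b first, mirroring the
    // "lift the a == b check out" point in the description.
    static boolean equalsIgnoreCase(char a, char b) {
        return fold(a) == fold(b);
    }

    // Compare a latin1-encoded byte[] against a UTF-16 String, ignoring case.
    static boolean regionMatchesCI(byte[] latin1, String utf16) {
        if (latin1.length != utf16.length()) {
            return false;
        }
        for (int i = 0; i < latin1.length; i++) {
            char a = (char) (latin1[i] & 0xFF); // latin1 byte -> char
            char b = utf16.charAt(i);
            if (a != b && !equalsIgnoreCase(a, b)) {
                return false;
            }
        }
        return true;
    }

    // Scan the BMP for chars outside latin1 whose fold coincides with the
    // fold of some latin1 char (under this sketch's definition of fold).
    static List<Character> nonLatin1FoldingIntoLatin1() {
        boolean[] reachable = new boolean[Character.MAX_VALUE + 1];
        for (int l = 0; l <= 0xFF; l++) {
            reachable[fold((char) l)] = true;
        }
        List<Character> result = new ArrayList<>();
        for (int c = 0x100; c <= Character.MAX_VALUE; c++) {
            if (reachable[fold((char) c)]) {
                result.add((char) c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (char c : nonLatin1FoldingIntoLatin1()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        byte[] latin1 = "kelvin".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(regionMatchesCI(latin1, "\u212AELVIN"));
    }
}
```

A scan like `nonLatin1FoldingIntoLatin1` is roughly what a regression test could use to notice when a future Unicode update changes the set of such code points.
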
Benchmark results:

Baseline:

Benchmark                                 (codePoints)  (size)  Mode  Cnt      Score     Error  Units
RegionMatchesIC.Mixed.regionMatchesIC      ascii-match    1024  avgt   15   1497.391 ±  22.350  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   ascii-mismatch    1024  avgt   15      5.346 ±   0.165  ns/op
RegionMatchesIC.Mixed.regionMatchesIC     number-match    1024  avgt   15    364.034 ±   5.561  ns/op
RegionMatchesIC.Mixed.regionMatchesIC  number-mismatch    1024  avgt   15      4.036 ±   0.171  ns/op
RegionMatchesIC.Mixed.regionMatchesIC       lat1-match    1024  avgt   15   2674.043 ± 174.669  ns/op
RegionMatchesIC.Mixed.regionMatchesIC    lat1-mismatch    1024  avgt   15      6.493 ±   0.230  ns/op
RegionMatchesIC.Mixed.regionMatchesIC      utf16-match    1024  avgt   15  12630.314 ± 212.472  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   utf16-mismatch    1024  avgt   15     14.796 ±   0.359  ns/op

PR:

Benchmark                                 (codePoints)  (size)  Mode  Cnt      Score     Error  Units
RegionMatchesIC.Mixed.regionMatchesIC      ascii-match    1024  avgt   15   1449.499 ±  14.350  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   ascii-mismatch    1024  avgt   15      3.450 ±   0.082  ns/op
RegionMatchesIC.Mixed.regionMatchesIC     number-match    1024  avgt   15    362.582 ±   2.963  ns/op
RegionMatchesIC.Mixed.regionMatchesIC  number-mismatch    1024  avgt   15      3.259 ±   0.021  ns/op
RegionMatchesIC.Mixed.regionMatchesIC       lat1-match    1024  avgt   15   1625.513 ±  14.305  ns/op
RegionMatchesIC.Mixed.regionMatchesIC    lat1-mismatch    1024  avgt   15      3.858 ±   0.027  ns/op
RegionMatchesIC.Mixed.regionMatchesIC      utf16-match    1024  avgt   15   1422.722 ±  85.581  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   utf16-mismatch    1024  avgt   15      3.756 ±   0.089  ns/op

-------------

PR: https://git.openjdk.org/jdk/pull/12637