Withdrawn: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16
On Sat, 18 Feb 2023 18:22:49 GMT, Eirik Bjorsnos wrote:

> This PR continues the efforts from #12632 to speed up case-insensitive string
> matching.
>
> We now tackle case-insensitive comparison of mixed-coder strings, implemented
> in `StringLatin1.regionMatchesCI_UTF16`.
>
> Key insights:
>
> - If the UTF16 code point is also in latin1 range, we can leverage
>   improvements from #12632 directly by calling
>   `CharacterDataLatin1.equalsIgnoreCase`.
> - There are exactly 7 non-latin1 Unicode code points which case fold into the
>   latin1 range. We can special-case our comparison of these code points by
>   adding the method `CharacterDataLatin1.latin1CaseFold`.
> - To avoid checking `a == b` twice, this check is lifted out of
>   `CharacterDataLatin1.equalsIgnoreCase` and the two callers are updated to
>   check that `a != b` before calling the method.
>
> For completeness, the RegionMatches test is updated to also compare Turkic
> dotted/dotless 'i's against the uppercase ASCII 'I', not just the lowercase
> one. This is not strictly related to the purpose of this PR, but it did help
> catch a regression introduced in an earlier iteration of the PR.
>
> To guard against regressions caused by future changes to the set of Unicode
> code points folding into latin1, a new test is added to `EqualsIgnoreCase`
> which identifies all such code points and verifies they are compared correctly.
>
> Performance is tested for matching and mismatching cases of selected code
> point pairs picked from the ASCII letter, ASCII number, latin1 letter and
> non-latin Unicode letter ranges. Results in the first comment below.

This pull request has been closed without being integrated.

PR: https://git.openjdk.org/jdk/pull/12637
Re: RFR: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16 [v2]
> This PR continues the efforts from #12632 to speed up case-insensitive string
> matching.
> [...]

Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 24 commits:

 - Merge branch 'master' into regionmatches-mixed-speedup
 - Inline local variable
 - latin1CaseFold was moved to CharacterDataLatin1
 - Move latin1CaseFold to CharacterDataLatin1
 - Improve latin1CaseFold javadocs
 - Simplify comments
 - Prefer fast matching by comparing for equality before checking latin1 range
 - Improve Javadocs of latin1CaseFold
 - Be consistent in comments
 - CharacterData.latin1LowerCase was renamed to latin1CaseFold
 - ... and 14 more: https://git.openjdk.org/jdk/compare/6d30bbe6...2340f8b5

Changes: https://git.openjdk.org/jdk/pull/12637/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12637&range=01
  Stats: 169 lines in 5 files changed: 155 ins; 2 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/12637.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12637/head:pull/12637

PR: https://git.openjdk.org/jdk/pull/12637
Re: RFR: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16
On Sat, 18 Feb 2023 18:22:49 GMT, Eirik Bjorsnos wrote:

> This PR continues the efforts from #12632 to speed up case-insensitive string
> matching.
> [...]
Benchmark results:

Baseline:

Benchmark                                 (codePoints)  (size)  Mode  Cnt      Score     Error  Units
RegionMatchesIC.Mixed.regionMatchesIC      ascii-match    1024  avgt   15   1497.391 ±  22.350  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   ascii-mismatch    1024  avgt   15      5.346 ±   0.165  ns/op
RegionMatchesIC.Mixed.regionMatchesIC     number-match    1024  avgt   15    364.034 ±   5.561  ns/op
RegionMatchesIC.Mixed.regionMatchesIC  number-mismatch    1024  avgt   15      4.036 ±   0.171  ns/op
RegionMatchesIC.Mixed.regionMatchesIC       lat1-match    1024  avgt   15   2674.043 ± 174.669  ns/op
RegionMatchesIC.Mixed.regionMatchesIC    lat1-mismatch    1024  avgt   15      6.493 ±   0.230  ns/op
RegionMatchesIC.Mixed.regionMatchesIC      utf16-match    1024  avgt   15  12630.314 ± 212.472  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   utf16-mismatch    1024  avgt   15     14.796 ±   0.359  ns/op

PR:

Benchmark                                 (codePoints)  (size)  Mode  Cnt      Score     Error  Units
RegionMatchesIC.Mixed.regionMatchesIC      ascii-match    1024  avgt   15   1449.499 ±  14.350  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   ascii-mismatch    1024  avgt   15      3.450 ±   0.082  ns/op
RegionMatchesIC.Mixed.regionMatchesIC     number-match    1024  avgt   15    362.582 ±   2.963  ns/op
RegionMatchesIC.Mixed.regionMatchesIC  number-mismatch    1024  avgt   15      3.259 ±   0.021  ns/op
RegionMatchesIC.Mixed.regionMatchesIC       lat1-match    1024  avgt   15   1625.513 ±  14.305  ns/op
RegionMatchesIC.Mixed.regionMatchesIC    lat1-mismatch    1024  avgt   15      3.858 ±   0.027  ns/op
RegionMatchesIC.Mixed.regionMatchesIC      utf16-match    1024  avgt   15   1422.722 ±  85.581  ns/op
RegionMatchesIC.Mixed.regionMatchesIC   utf16-mismatch    1024  avgt   15      3.756 ±   0.089  ns/op

PR: https://git.openjdk.org/jdk/pull/12637
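As a reading aid for the numbers above, the per-case speedup can be taken as the baseline score divided by the PR score. The following is only a small sketch of that arithmetic using the reported average times; the class and method names are made up for illustration, and the error margins are ignored.

```java
public class SpeedupRatios {
    // Speedup factor: how many times faster the PR run is than the baseline.
    static double speedup(double baselineNsPerOp, double prNsPerOp) {
        return baselineNsPerOp / prNsPerOp;
    }

    public static void main(String[] args) {
        // Scores (ns/op) copied from the benchmark results in this message.
        String[] cases = {
            "ascii-match", "ascii-mismatch", "number-match", "number-mismatch",
            "lat1-match", "lat1-mismatch", "utf16-match", "utf16-mismatch"
        };
        double[] baseline = {1497.391, 5.346, 364.034, 4.036, 2674.043, 6.493, 12630.314, 14.796};
        double[] pr = {1449.499, 3.450, 362.582, 3.259, 1625.513, 3.858, 1422.722, 3.756};

        for (int i = 0; i < cases.length; i++) {
            System.out.printf("%-16s %.2fx%n", cases[i], speedup(baseline[i], pr[i]));
        }
    }
}
```

Read this way, the largest change is in the utf16-match case, while the ascii-match and number-match cases stay close to their baseline.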
RFR: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16
This PR continues the efforts from #12632 to speed up case-insensitive string matching.

We now tackle case-insensitive comparison of mixed-coder strings, implemented in `StringLatin1.regionMatchesCI_UTF16`.

Key insights:

- If the UTF16 code point is also in latin1 range, we can leverage improvements from #12632 directly by calling `CharacterDataLatin1.equalsIgnoreCase`.
- There are exactly 7 non-latin1 Unicode code points which case fold into the latin1 range. We can special-case our comparison of these code points by adding the method `CharacterDataLatin1.latin1CaseFold`.
- To avoid checking `a == b` twice, this check is lifted out of `CharacterDataLatin1.equalsIgnoreCase` and the two callers are updated to check that `a != b` before calling the method.

For completeness, the RegionMatches test is updated to also compare Turkic dotted/dotless 'i's against the uppercase ASCII 'I', not just the lowercase one. This is not strictly related to the purpose of this PR, but it did help catch a regression introduced in an earlier iteration of the PR.

To guard against regressions caused by future changes to the set of Unicode code points folding into latin1, a new test is added to `EqualsIgnoreCase` which identifies all such code points and verifies they are compared correctly.

Performance is tested for matching and mismatching cases of selected code point pairs picked from the ASCII letter, ASCII number, latin1 letter and non-latin Unicode letter ranges. Results in the first comment below.

Commit messages:

 - Inline local variable
 - latin1CaseFold was moved to CharacterDataLatin1
 - Move latin1CaseFold to CharacterDataLatin1
 - Improve latin1CaseFold javadocs
 - Simplify comments
 - Prefer fast matching by comparing for equality before checking latin1 range
 - Improve Javadocs of latin1CaseFold
 - Be consistent in comments
 - CharacterData.latin1LowerCase was renamed to latin1CaseFold
 - Hoist equality check out of CharacterDataLatin1.equalsIgnoreCase
 - ... and 13 more: https://git.openjdk.org/jdk/compare/f2b03f9a...92755920

Changes: https://git.openjdk.org/jdk/pull/12637/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12637&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8302872
  Stats: 169 lines in 5 files changed: 155 ins; 2 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/12637.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12637/head:pull/12637

PR: https://git.openjdk.org/jdk/pull/12637
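To illustrate the idea of identifying the non-latin1 code points that fold into the latin1 range, here is a minimal sketch that scans all `char` values using only public `java.lang.Character` methods. It is not the test added by the PR: the class and method names are hypothetical, and it assumes the fold used for case-insensitive region matching is `Character.toLowerCase(Character.toUpperCase(c))`, as described in the `String.regionMatches` javadoc. The exact set found depends on the Unicode version of the running JDK.

```java
import java.util.ArrayList;
import java.util.List;

public class Latin1FoldScan {
    // Assumed fold: upper-case first, then lower-case (per the String.regionMatches javadoc).
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Collects every char outside latin1 (above U+00FF) whose fold lands inside latin1.
    static List<Character> foldingIntoLatin1() {
        List<Character> result = new ArrayList<>();
        for (int c = 0x100; c <= 0xFFFF; c++) {
            if (fold((char) c) <= 0xFF) {
                result.add((char) c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (char c : foldingIntoLatin1()) {
            System.out.printf("U+%04X folds to U+%04X%n", (int) c, (int) fold(c));
        }
    }
}
```

A regression test along these lines could compare each code point found against its latin1 counterpart and fail if a future Unicode update changes the set.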
Re: Speed up StringLatin1.regionMatchesCI_UTF16
RFE filed: https://bugs.openjdk.org/browse/JDK-8302872

/Claes

On 18 Feb 2023 at 19:58, Eirik Bjørsnøs <eir...@gmail.com> wrote:

> Hi,
>
> This PR continues the effort to speed up case-insensitive string comparisons,
> this time tackling comparison of latin1-coded strings with utf16-coded strings:
>
> https://github.com/openjdk/jdk/pull/12637
>
> This builds on top of #12632, it makes sense to review that one first.
>
> Thanks,
> Eirik.
Speed up StringLatin1.regionMatchesCI_UTF16
Hi,

This PR continues the effort to speed up case-insensitive string comparisons, this time tackling comparison of latin1-coded strings with utf16-coded strings:

https://github.com/openjdk/jdk/pull/12637

This builds on top of #12632, so it makes sense to review that one first.

Thanks,
Eirik.
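For readers unfamiliar with the mixed-coder case, the sketch below shows what comparing a latin1-coded string with a utf16-coded string involves. It is an illustration only, not the JDK code: the class and method names are hypothetical, the latin1 side is modelled as a plain `byte[]`, and the per-character fold is assumed to be `Character.toLowerCase(Character.toUpperCase(c))`. The cheap `a == b` check is done before any case folding.

```java
import java.nio.charset.StandardCharsets;

public class MixedCoderCompare {
    // Slow path: compare two chars after upper-casing and then lower-casing both.
    static boolean equalsIgnoreCase(char a, char b) {
        return Character.toLowerCase(Character.toUpperCase(a))
            == Character.toLowerCase(Character.toUpperCase(b));
    }

    // Compares latin1 bytes against a String that may hold non-latin1 chars.
    static boolean matchesIgnoreCase(byte[] latin1, String other) {
        if (latin1.length != other.length()) {
            return false;
        }
        for (int i = 0; i < latin1.length; i++) {
            char a = (char) (latin1[i] & 0xFF); // widen the latin1 byte to a char
            char b = other.charAt(i);
            if (a == b) {
                continue; // fast path: identical chars need no folding
            }
            if (!equalsIgnoreCase(a, b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] latin1 = "kelvin".getBytes(StandardCharsets.ISO_8859_1);
        // U+212A (Kelvin sign) is outside latin1 but lower-cases to 'k'.
        System.out.println(matchesIgnoreCase(latin1, "\u212Aelvin"));
        // The public API exercising the same kind of comparison:
        System.out.println("kelvin".regionMatches(true, 0, "\u212Aelvin", 0, 6));
    }
}
```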