Re: RFR: 8302872: Speed up StringLatin1.regionMatchesCI_UTF16 [v2]

Eirik Bjorsnos Tue, 14 Mar 2023 05:04:39 -0700

> This PR continues the efforts from #12632 to speed up case-insensitive string 
> matching.
> 
> We now tackle case-insensitive comparison of mixed-coder strings, implemented 
> in `StringLatin1.regionMatchesCI_UTF16`
> 
> Key insights:
> 
> - If the UTF16 code point is also in the latin1 range, we can leverage
> improvements from #12632 directly by calling
> `CharacterDataLatin1.equalsIgnoreCase`.
> - There are exactly 7 non-latin1 Unicode code points which case fold into the 
> latin1 range. We can special-case our comparison of these code points by 
> adding the method `CharacterDataLatin1.latin1CaseFold`.
> - To avoid checking `a == b` twice, this check is lifted out of
> `CharacterDataLatin1.equalsIgnoreCase` and the two callers are updated to
> check that `a != b` before calling the method.
>  
> For completeness, the RegionMatches test is updated to also compare Turkic
> dotted/dotless 'i's against the uppercase ASCII 'I', not just the lowercase
> one. This is not strictly related to the purpose of this PR, but it did help
> catch a regression introduced in an earlier iteration of the PR.
> 
> To guard against regressions caused by future changes to the set of Unicode
> code points folding into latin1, a new test is added to `EqualsIgnoreCase`
> which identifies all such code points and verifies they are compared correctly.
> 
> Performance is tested for matching and mismatching cases of selected code
> point pairs picked from the ASCII letter, ASCII digit, latin1 letter and
> non-latin1 Unicode letter ranges. Results are in the first comment below.
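
The key insights quoted above can be illustrated with a minimal sketch that
uses only the public `Character` API. It is not the code from the PR: the
class name, `fold`, `LATIN1_MATCH`, `latin1EqualsIgnoreCase`, `regionMatchesCI`
and `nonLatin1Matches` are hypothetical names, and the lookup table is derived
by enumeration here, whereas the PR describes special-casing a small fixed set
of code points in `CharacterDataLatin1.latin1CaseFold`. The sketch assumes the
char-based rule that two chars match when they are equal after upper-casing
and then lower-casing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MixedCoderRegionMatchSketch {

    /** Assumed reference fold: upper-case first, then lower-case. */
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    /**
     * Hypothetical table: for each char, a latin1 char with the same fold,
     * or -1 if no latin1 char matches it case-insensitively.
     */
    static final int[] LATIN1_MATCH = buildTable();

    static int[] buildTable() {
        int[] byFold = new int[0x10000];
        Arrays.fill(byFold, -1);
        for (char l = 0; l <= 0xFF; l++) {
            byFold[fold(l)] = l;              // remember one latin1 char per fold value
        }
        int[] table = new int[0x10000];
        for (int c = 0; c <= 0xFFFF; c++) {
            table[c] = byFold[fold((char) c)];
        }
        return table;
    }

    /** Stand-in for a latin1-only comparison; callers have already ruled out a == b. */
    static boolean latin1EqualsIgnoreCase(char a, char b) {
        return fold(a) == fold(b);
    }

    /** Compares a latin1-coded region with a UTF16-coded region, ignoring case. */
    static boolean regionMatchesCI(byte[] latin1, char[] utf16) {
        if (latin1.length != utf16.length) {
            return false;
        }
        for (int i = 0; i < latin1.length; i++) {
            char a = (char) (latin1[i] & 0xFF);
            char b = utf16[i];
            if (a == b) {
                continue;                     // equality is checked once, up front
            }
            if (b > 0xFF) {                   // non-latin1: try to map it into latin1
                int m = LATIN1_MATCH[b];
                if (m < 0) {
                    return false;             // cannot match any latin1 char
                }
                if (a == m) {
                    continue;
                }
                b = (char) m;
            }
            if (!latin1EqualsIgnoreCase(a, b)) {
                return false;
            }
        }
        return true;
    }

    /** Lists every non-latin1 char that matches some latin1 char under the assumed rule. */
    static List<Integer> nonLatin1Matches() {
        List<Integer> result = new ArrayList<>();
        for (int c = 0x100; c <= 0xFFFF; c++) {
            if (LATIN1_MATCH[c] >= 0) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int c : nonLatin1Matches()) {
            System.out.printf("U+%04X matches latin1 U+%04X%n", c, LATIN1_MATCH[c]);
        }
    }
}
```

Enumerating the matches, as `nonLatin1Matches` does, mirrors the idea of the
new `EqualsIgnoreCase` test: the set is recomputed from the running JDK's
Unicode data, so a change to that data would show up as a difference rather
than go unnoticed.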


Eirik Bjorsnos has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 24 commits:

 - Merge branch 'master' into regionmatches-mixed-speedup
 - Inline local variable
 - latin1CaseFold was moved to CharacterDataLatin1
 - Move latin1CaseFold to CharacterDataLatin1
 - Improve latin1CaseFold javadocs
 - Simplify comments
 - Prefer fast matching by comparing for equality before checking latin1 range
 - Improve Javadocs of latin1CaseFold
 - Be consistent in comments
 - CharacterData.latin1LowerCase was renamed to latin1CaseFold
 - ... and 14 more: https://git.openjdk.org/jdk/compare/6d30bbe6...2340f8b5

-------------

Changes: https://git.openjdk.org/jdk/pull/12637/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12637&range=01
  Stats: 169 lines in 5 files changed: 155 ins; 2 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/12637.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12637/head:pull/12637

PR: https://git.openjdk.org/jdk/pull/12637
