subject:"Speed up StringLatin1.regionMatchesCI"

Integrated: 8302871: Speed up StringLatin1.regionMatchesCI

2023-02-24 Thread Eirik Bjorsnos

On Sat, 18 Feb 2023 09:21:25 GMT, Eirik Bjorsnos  wrote:

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

This pull request has now been integrated.

Changeset: 17e3769e
Author:Eirik Bjorsnos 
Committer: Alan Bateman 
URL:   
https://git.openjdk.org/jdk/commit/17e3769ed7190c3ba885e6434e1811bca2d66f13
Stats: 166 lines in 4 files changed: 148 ins; 5 del; 13 mod

8302871: Speed up StringLatin1.regionMatchesCI

Reviewed-by: redestad, martin, alanb

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]

2023-02-24 Thread Alan Bateman

On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 21 additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Make the loop variables chars to avoid casting
>  - Use improved case-twiddling comment as suggested by Martin
>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
> with "by removing (setting) a single bit"
>  - Align local variable naming in toLowerCase, toUpperCase with 
> equalsIgnoreCase by using 'lower' and 'upper'
>  - Rename unconventionally named local variable 'U' to 'upper'
>  - Merge remote-tracking branch 'origin/master' into 
> regionmatches-latin1-speedup
>  - Add whitespace between methods
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - ... and 11 more: https://git.openjdk.org/jdk/compare/6e269b24...597b346a

Marked as reviewed by alanb (Reviewer).

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]

2023-02-23 Thread Eirik Bjorsnos

On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 21 additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Make the loop variables chars to avoid casting
>  - Use improved case-twiddling comment as suggested by Martin
>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
> with "by removing (setting) a single bit"
>  - Align local variable naming in toLowerCase, toUpperCase with 
> equalsIgnoreCase by using 'lower' and 'upper'
>  - Rename unconventionally named local variable 'U' to 'upper'
>  - Merge remote-tracking branch 'origin/master' into 
> regionmatches-latin1-speedup
>  - Add whitespace between methods
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - ... and 11 more: https://git.openjdk.org/jdk/compare/bd4b71f6...597b346a

Thanks, Alan 👍 Sponsors, hold your horses! (Not sure how to 'undo' the 
integrate command)

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v12]

2023-02-23 Thread Alan Bateman

On Wed, 22 Feb 2023 18:44:57 GMT, Eirik Bjorsnos  wrote:

> I'll let this linger a bit before integrating in case Alan has comments after 
> the latest updates.

I plan to look at it, been busy with other things.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]

2023-02-22 Thread Eirik Bjorsnos

On Thu, 23 Feb 2023 07:21:08 GMT, Eirik Bjorsnos  wrote:

> Seems compatibility with existing 6-bit devices might have been the primary 
> concern:

This also explains the placement of brackets, braces, bars, tilde etc. They 
would look visually similar on 6-bit devices:

https://user-images.githubusercontent.com/300291/220845393-de12301a-73e2-46fe-a6ce-ce101d0a2b3b.png";>

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]

2023-02-22 Thread Eirik Bjorsnos

On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 21 additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - Make the loop variables chars to avoid casting
>  - Use improved case-twiddling comment as suggested by Martin
>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
> with "by removing (setting) a single bit"
>  - Align local variable naming in toLowerCase, toUpperCase with 
> equalsIgnoreCase by using 'lower' and 'upper'
>  - Rename unconventionally named local variable 'U' to 'upper'
>  - Merge remote-tracking branch 'origin/master' into 
> regionmatches-latin1-speedup
>  - Add whitespace between methods
>  - Merge branch 'master' into regionmatches-latin1-speedup
>  - ... and 11 more: https://git.openjdk.org/jdk/compare/31689be3...597b346a

I found this in Appendix A of the 1973 `Draft Proposed Revision of ASCII`. 
Seems compatibility with existing 6-bit devices might have been the primary 
concern:


A 6.4 It is expected that devices having the capability of
printing only 64 graphic symbols will continue to be important.
It may be desirable to arrange these devices to print one symbol
for the bit pattern of both upper and lower case of a given
alphabetic letter. To facilitate this, there should be a single-
bit difference between the upper and lowercase representations
of any given letter. Combined with the requirement that a given
case of the alphabet be contiguous, this dictated the assignment
of the alphabet, as shown in columns 4 through 7.

https://user-images.githubusercontent.com/300291/220842000-3efa64fe-9154-4069-9e81-e202d5731f6f.png";>

https://ia800606.us.archive.org/17/items/enf-ascii-1972-1975/Image070917152640_text.pdf

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]

2023-02-22 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 21 additional commits 
since the last revision:

 - Merge branch 'master' into regionmatches-latin1-speedup
 - Merge branch 'master' into regionmatches-latin1-speedup
 - Make the loop variables chars to avoid casting
 - Use improved case-twiddling comment as suggested by Martin
 - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
with "by removing (setting) a single bit"
 - Align local variable naming in toLowerCase, toUpperCase with 
equalsIgnoreCase by using 'lower' and 'upper'
 - Rename unconventionally named local variable 'U' to 'upper'
 - Merge remote-tracking branch 'origin/master' into 
regionmatches-latin1-speedup
 - Add whitespace between methods
 - Merge branch 'master' into regionmatches-latin1-speedup
 - ... and 11 more: https://git.openjdk.org/jdk/compare/e7379286...597b346a

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/fe4efe9f..597b346a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=13
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=12-13

  Stats: 9 lines in 5 files changed: 1 ins; 2 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v13]

2023-02-22 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 20 additional commits 
since the last revision:

 - Merge branch 'master' into regionmatches-latin1-speedup
 - Make the loop variables chars to avoid casting
 - Use improved case-twiddling comment as suggested by Martin
 - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
with "by removing (setting) a single bit"
 - Align local variable naming in toLowerCase, toUpperCase with 
equalsIgnoreCase by using 'lower' and 'upper'
 - Rename unconventionally named local variable 'U' to 'upper'
 - Merge remote-tracking branch 'origin/master' into 
regionmatches-latin1-speedup
 - Add whitespace between methods
 - Merge branch 'master' into regionmatches-latin1-speedup
 - Remove whitespace following '('
 - ... and 10 more: https://git.openjdk.org/jdk/compare/ea132c56...fe4efe9f

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/cc185293..fe4efe9f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=12
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=11-12

  Stats: 1397 lines in 120 files changed: 873 ins; 291 del; 233 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v12]

2023-02-22 Thread Eirik Bjorsnos

On Wed, 22 Feb 2023 16:37:45 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Make the loop variables chars to avoid casting

Thanks for reviews Claes and Martin! I'll let this linger a bit before 
integrating in case Alan has comments after the latest updates.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v12]

2023-02-22 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Make the loop variables chars to avoid casting

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/ea2f9fa3..cc185293

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=10-11

  Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]

2023-02-22 Thread Eirik Bjorsnos

On Wed, 22 Feb 2023 16:25:41 GMT, Martin Buchholz  wrote:

>> Eirik Bjorsnos has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
>> with "by removing (setting) a single bit"
>>  - Align local variable naming in toLowerCase, toUpperCase with 
>> equalsIgnoreCase by using 'lower' and 'upper'
>
> test/jdk/java/lang/String/CompactString/EqualsIgnoreCase.java line 89:
> 
>> 87: for (int ab = 0; ab < 256; ab++) {
>> 88: for (int bb = 0; bb < 256; bb++) {
>> 89: char a = (char) ab, b = (char) bb;
> 
> char is an unsigned numeric type, so cleaner is
> 
> for (char a = 0; a < 256; a++) 
> for (char b = 0; b < 256; b++)

Thanks, fixed. Might have been copied over from processing of code points in 
the higher planes. Not needed here.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-22 Thread Eirik Bjorsnos

On Wed, 22 Feb 2023 16:10:42 GMT, Martin Buchholz  wrote:

>> Thanks Martin, David, Alan.  This was instructive (and fun!)
>> 
>> I suggest we condense the comment to something like this:
>> 
>> 
>> // Uppercase b1 by removing a single bit
>> int upper = b1 & 0xDF;
>> if (upper < 'A') {
>> return false;  // Low ASCII
>> }
>> ...
>> 
>> 
>> The similar methods `toLowerCase` `toUpperCase` just above have been updated 
>> to follow the same style. (I also updated local variable names there to 
>> align better with equalsIgnoreCase)
>
> // ASCII and Latin-1 were designed to optimize case-twiddling operations

Thanks! This expresses the higher-level benefit succinctly, without getting 
into the details. I like it!

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]

2023-02-22 Thread Martin Buchholz

On Wed, 22 Feb 2023 07:11:16 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
> with "by removing (setting) a single bit"
>  - Align local variable naming in toLowerCase, toUpperCase with 
> equalsIgnoreCase by using 'lower' and 'upper'

Marked as reviewed by martin (Reviewer).

test/jdk/java/lang/String/CompactString/EqualsIgnoreCase.java line 89:

> 87: for (int ab = 0; ab < 256; ab++) {
> 88: for (int bb = 0; bb < 256; bb++) {
> 89: char a = (char) ab, b = (char) bb;

char is an unsigned numeric type, so cleaner is

for (char a = 0; a < 256; a++) 
for (char b = 0; b < 256; b++)

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v11]

2023-02-22 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Use improved case-twiddling comment as suggested by Martin

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/44d91544..ea2f9fa3

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=09-10

  Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-22 Thread Martin Buchholz

On Wed, 22 Feb 2023 07:12:35 GMT, Eirik Bjorsnos  wrote:

>>> Do you have an opinion on the appropriate level of documentation / comments 
>>> for this kind of 'tricky' code? 
>> 
>> This code is not that tricky!  And the proposed level of documentation is 
>> excessive!  A couple of lines of explanation and perhaps a link to an 
>> external document would be good.
>> 
>> It often happens to me that I will write such exhaustive notes for myself 
>> when learning a new technology.  A year later I pare it all back because 
>> much of it is "obvious in retrospect".
>
> Thanks Martin, David, Alan.  This was instructive (and fun!)
> 
> I suggest we condense the comment to something like this:
> 
> 
> // Uppercase b1 by removing a single bit
> int upper = b1 & 0xDF;
> if (upper < 'A') {
> return false;  // Low ASCII
> }
> ...
> 
> 
> The similar methods `toLowerCase` `toUpperCase` just above have been updated 
> to follow the same style. (I also updated local variable names there to align 
> better with equalsIgnoreCase)

// ASCII and Latin-1 were designed to optimize case-twiddling operations

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]

2023-02-22 Thread Claes Redestad

On Wed, 22 Feb 2023 07:11:16 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
> with "by removing (setting) a single bit"
>  - Align local variable naming in toLowerCase, toUpperCase with 
> equalsIgnoreCase by using 'lower' and 'upper'

Marked as reviewed by redestad (Reviewer).

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Eirik Bjorsnos

On Wed, 22 Feb 2023 00:50:39 GMT, Martin Buchholz  wrote:

>> More history: IIRC I originally used 'ASCII trick' when I was truly only 
>> cared about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight 
>> misnomer to use "ASCII" here.  But Latin1 followed the precedent of ASCII.
>
>> Do you have an opinion on the appropriate level of documentation / comments 
>> for this kind of 'tricky' code? 
> 
> This code is not that tricky!  And the proposed level of documentation is 
> excessive!  A couple of lines of explanation and perhaps a link to an 
> external document would be good.
> 
> It often happens to me that I will write such exhaustive notes for myself 
> when learning a new technology.  A year later I pare it all back because much 
> of it is "obvious in retrospect".

Thanks Martin, David, Alan.  This was instructive (and fun!)

I suggest we condense the comment to something like this:


// Uppercase b1 by removing a single bit
int upper = b1 & 0xDF;
if (upper < 'A') {
return false;  // Low ASCII
}
...


The similar methods `toLowerCase` `toUpperCase` just above have been updated to 
follow the same style. (I also updated local variable names there to align 
better with equalsIgnoreCase)

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]

2023-02-21 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with two additional 
commits since the last revision:

 - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase 
with "by removing (setting) a single bit"
 - Align local variable naming in toLowerCase, toUpperCase with 
equalsIgnoreCase by using 'lower' and 'upper'

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/6588ab0f..44d91544

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=09
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=08-09

  Stats: 11 lines in 1 file changed: 2 ins; 0 del; 9 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Martin Buchholz

On Wed, 22 Feb 2023 00:45:52 GMT, Martin Buchholz  wrote:

>> There are still some books on this :) but from wikipedia:
>>> during May 1963 the CCITT Working Party on the New Telegraph Alphabet 
>>> proposed to assign lowercase characters to 
>>> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14)
>>>  6 and 
>>> 7,[[15]](https://en.wikipedia.org/wiki/ASCII#cite_note-CCITT_1963-16) and 
>>> [International Organization for 
>>> Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization)
>>>  TC 97 SC 2 voted during October to incorporate the change into its draft 
>>> standard.[[16]](https://en.wikipedia.org/wiki/ASCII#cite_note-ISO_1963-17) 
>>> The X3.2.4 task group voted its approval for the change to ASCII at its May 
>>> 1963 meeting.[[17]](https://en.wikipedia.org/wiki/ASCII#cite_note-18) 
>>> **Locating the lowercase letters in 
>>> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14)
>>>  6 and 7 caused the characters to differ in bit pattern
  from the upper case by a single bit, which simplified 
[case-insensitive](https://en.wikipedia.org/wiki/Case-insensitive) character 
matching and the construction of keyboards and printers.**
>> 
>> Hence the simplicity of the shift key :)
>
> More history: IIRC I originally used 'ASCII trick' when I was truly only 
> cared about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight 
> misnomer to use "ASCII" here.  But Latin1 followed the precedent of ASCII.

> Do you have an opinion on the appropriate level of documentation / comments 
> for this kind of 'tricky' code? 

This code is not that tricky!  And the proposed level of documentation is 
excessive!  A couple of lines of explanation and perhaps a link to an external 
document would be good.

It often happens to me that I will write such exhaustive notes for myself when 
learning a new technology.  A year later I pare it all back because much of it 
is "obvious in retrospect".

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Martin Buchholz

On Wed, 22 Feb 2023 00:06:51 GMT, David Holmes  wrote:

>> Thanks Martin, I will from now on envision a stack of dusty punch cards with 
>> carefully scribbled notes on the back, barely held together with a dry and 
>> cracked rubber band.
>> 
>> More to the point: Do you have an opinion on the appropriate level of 
>> documentation / comments for this kind of 'tricky' code? My goal is that 
>> future maintainers should not get scared by the code and help them 
>> understand why it became what it is.
>
> There are still some books on this :) but from wikipedia:
>> during May 1963 the CCITT Working Party on the New Telegraph Alphabet 
>> proposed to assign lowercase characters to 
>> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14)
>>  6 and 7,[[15]](https://en.wikipedia.org/wiki/ASCII#cite_note-CCITT_1963-16) 
>> and [International Organization for 
>> Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization)
>>  TC 97 SC 2 voted during October to incorporate the change into its draft 
>> standard.[[16]](https://en.wikipedia.org/wiki/ASCII#cite_note-ISO_1963-17) 
>> The X3.2.4 task group voted its approval for the change to ASCII at its May 
>> 1963 meeting.[[17]](https://en.wikipedia.org/wiki/ASCII#cite_note-18) 
>> **Locating the lowercase letters in 
>> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14)
>>  6 and 7 caused the characters to differ in bit pattern 
 from the upper case by a single bit, which simplified 
[case-insensitive](https://en.wikipedia.org/wiki/Case-insensitive) character 
matching and the construction of keyboards and printers.**
> 
> Hence the simplicity of the shift key :)

More history: IIRC I originally used 'ASCII trick' when I was truly only cared 
about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight misnomer to 
use "ASCII" here.  But Latin1 followed the precedent of ASCII.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread David Holmes

On Tue, 21 Feb 2023 23:28:52 GMT, Eirik Bjorsnos  wrote:

>> More to the point, ASCII was obviously **designed** to allow you to 
>> uppercase a lower case letter with a single instruction, so "the book" might 
>> have been a draft standard before they scrubbed out the interesting history!
>
> Thanks Martin, I will from now on envision a stack of dusty punch cards with 
> carefully scribbled notes on the back, barely held together with a dry and 
> cracked rubber band.
> 
> More to the point: Do you have an opinion on the appropriate level of 
> documentation / comments for this kind of 'tricky' code? My goal is that 
> future maintainers should not get scared by the code and help them understand 
> why it became what it is.

There are still some books on this :) but from wikipedia:
> during May 1963 the CCITT Working Party on the New Telegraph Alphabet 
> proposed to assign lowercase characters to 
> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14)
>  6 and 7,[[15]](https://en.wikipedia.org/wiki/ASCII#cite_note-CCITT_1963-16) 
> and [International Organization for 
> Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization)
>  TC 97 SC 2 voted during October to incorporate the change into its draft 
> standard.[[16]](https://en.wikipedia.org/wiki/ASCII#cite_note-ISO_1963-17) 
> The X3.2.4 task group voted its approval for the change to ASCII at its May 
> 1963 meeting.[[17]](https://en.wikipedia.org/wiki/ASCII#cite_note-18) 
> **Locating the lowercase letters in 
> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14)
>  6 and 7 caused the characters to differ in bit pattern f
 rom the upper case by a single bit, which simplified 
[case-insensitive](https://en.wikipedia.org/wiki/Case-insensitive) character 
matching and the construction of keyboards and printers.**

Hence the simplicity of the shift key :)

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Eirik Bjorsnos

On Tue, 21 Feb 2023 23:20:03 GMT, Martin Buchholz  wrote:

>> "oldest trick in the book" is a phrase that does not necessarily imply 
>> existence of an actual book!
>> 
>> Let this evoke an image of a **personal** book of tricks that programmers in 
>> the 1960s might have recorded such techniques in.  And the tricks were 
>> passed down across generations of programmers!
>
> More to the point, ASCII was obviously **designed** to allow you to uppercase 
> a lower case letter with a single instruction, so "the book" might have been 
> a draft standard before they scrubbed out the interesting history!

Thanks Martin, I will from now on envision a stack of dusty punch cards with 
carefully scribbled notes on the back, barely held together with a dry and 
cracked rubber band.

More to the point: Do you have an opinion on the appropriate level of 
documentation / comments for this kind of 'tricky' code? My goal is that future 
maintainers should not get scared by the code and help them understand why it 
became what it is.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Martin Buchholz

On Tue, 21 Feb 2023 20:48:04 GMT, Martin Buchholz  wrote:

>> Perhaps @Martin-Buchholz could chime in and also tell us which book he found 
>> his ASCII trick in :)
>
> "oldest trick in the book" is a phrase that does not necessarily imply 
> existence of an actual book!
> 
> Let this evoke an image of a **personal** book of tricks that programmers in 
> the 1960s might have recorded such techniques in.  And the tricks were passed 
> down across generations of programmers!

More to the point, ASCII was obviously **designed** to allow you to uppercase a 
lower case letter with a single instruction, so "the book" might have been a 
draft standard before they scrubbed out the interesting history!

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v9]

2023-02-21 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 15 additional commits 
since the last revision:

 - Rename unconventionally named local variable 'U' to 'upper'
 - Merge remote-tracking branch 'origin/master' into 
regionmatches-latin1-speedup
 - Add whitespace between methods
 - Merge branch 'master' into regionmatches-latin1-speedup
 - Remove whitespace following '('
 - Revert "Spell fix for 'exhaustive' in comments in sun/text/resources"
   
   This reverts commit 5e9927a4b35e157fd3fa72fd2663c8bfbecf32bb.
 - Spell fix for 'exhaustive' in comments in sun/text/resources
 - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871
 - Add @bug tag to EqualsIgnoreCase test for JDK-8302877
 - Add clarifying comments and use more descriptive variable names in the 
latin1 verification EqualsIgnoreCase test
 - ... and 5 more: https://git.openjdk.org/jdk/compare/7025e4c5...6588ab0f

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/718fcead..6588ab0f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=08
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=07-08

  Stats: 20 lines in 6 files changed: 0 ins; 7 del; 13 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Eirik Bjorsnos

On Tue, 21 Feb 2023 14:22:30 GMT, Alan Bateman  wrote:

>> Eirik Bjorsnos has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Remove whitespace following '('
>
> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
> 163:
> 
>> 161: return mapChar;
>> 162: }
>> 163: /**
> 
> I assume you should insert a blank line between the two methods.

This has been fixed now.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v8]

2023-02-21 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains 13 additional commits 
since the last revision:

 - Add whitespace between methods
 - Merge branch 'master' into regionmatches-latin1-speedup
 - Remove whitespace following '('
 - Revert "Spell fix for 'exhaustive' in comments in sun/text/resources"
   
   This reverts commit 5e9927a4b35e157fd3fa72fd2663c8bfbecf32bb.
 - Spell fix for 'exhaustive' in comments in sun/text/resources
 - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871
 - Add @bug tag to EqualsIgnoreCase test for JDK-8302877
 - Add clarifying comments and use more descriptive variable names in the 
latin1 verification EqualsIgnoreCase test
 - Align whitespace to make example strings easier to read
 - Merge branch 'master' into regionmatches-latin1-speedup
 - ... and 3 more: https://git.openjdk.org/jdk/compare/ea62160f...718fcead

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/d7b1c164..718fcead

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=07
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=06-07

  Stats: 4282 lines in 198 files changed: 2630 ins; 1129 del; 523 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Martin Buchholz

On Tue, 21 Feb 2023 20:33:41 GMT, Eirik Bjorsnos  wrote:

>> Hi Alan,
>> 
>> I thought I was clever by encoding the 'uppercaseness' in the variable name, 
>> but yeah I'll find a better name :)
>> 
>> There is some precedent for using the 'ASCII trick' comment in the JDK.  I 
>> found it in ZipFile.isMetaName, which is also where I first learned about 
>> this interesting relationship between ASCII (and also latin1) letters.
>> 
>> The comment was first added by Martin Buchholz back in 2016 as part of 
>> JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating 
>> this code and Lance has some input about clarifying the comment. Martin then 
>> [chimed 
>> in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) 
>> to defend his comment:
>> 
>>> I still like my ancient "ASCII trick" comment.
>> 
>> I think this 'trick', whatever we call it, is sufficiently intricate that it 
>> deserves to be called out somehow and that we should not just casually 
>> bitmask with these magic constants without any discussion at all. 
>> 
>> An earlier iteration of this PR included a small essay in the javadoc of 
>> this method describing the layout and relationship of letters in latin1 and 
>> how we can apply that knowledge of the layout to implement the method.
>> 
>> How would you feel about adding that description back to the Javadocs? This 
>> would then live close to the similarly implemented toUpperCase and 
>> toLowerCase methods currently under review in #12623. 
>> 
>> Here's the updated discussion included in the Javadoc:
>> 
>> 
>> /**
>>  * Compares two latin1 code points, ignoring case considerations.
>>  *
>>  * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
>>  * letters are found in the following code point ranges:
>>  *
>>  * 0x41-0x5A: Uppercase ASCII letters: A-Z
>>  * 0x61-0x7A: Lowercase ASCII letters: a-z
>>  * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
>>  * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
>>  * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
>>  * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
>>  *
>>  * While both ASCII letter ranges are contiguous, the latin1 ranges are 
>> not:
>>  *
>>  * The 'multiplication sign' 0xD7 splits the uppercase range in two.
>>  * The 'division sign' 0xF7 splits the lowercase range in two.
>>  *
>>  * Lowercase letters are found 32 positions (0x20) after their 
>> corresponding uppercase letter.
>>  * The 'division sign' and 'multiplication sign' have the same relative 
>> distance.
>>  *
>>  * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in 
>> the book' to
>>  * lowercase any letter by setting the bit:
>>  *
>>  * ('C' | 0x20) == 'c'
>>  *
>>  * By removing the bit, we can perform the uppercase operation:
>>  *
>>  * ('c' & 0xDF) == 'C'
>>  *
>>  * Applying this knowledge of the latin1 layout, we can test for 
>> equality ignoring case by
>>  * checking that the code points are either equal, or that one of the 
>> code points is a letter
>>  * which uppercases is the same as the uppercase of the other code point.
>>  *
>>  * @param b1 byte representing a latin1 code point
>>  * @param b2 another byte representing a latin1 code point
>>  * @return true if the two bytes are considered equals ignoring case in 
>> latin1
>>  */
>>  static boolean equalsIgnoreCase(byte b1, byte b2) {
>>  if (b1 == b2) {
>>  return true;
>>  }
>>  int upper = b1 & 0xDF;
>>  if (upper < 'A') {
>>  return false;  // Low ASCII
>>  }
>>  return (upper <= 'Z' // In range A-Z
>>  || (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // 
>> ..or A-grave-Thorn, excl. multiplication
>>  && upper == (b2 & 0xDF); // b2 has same uppercase
>> }
>
> Perhaps @Martin-Buchholz could chime in and also tell us which book he found 
> his ASCII trick in :)

"oldest trick in the book" is a phrase that does not necessarily imply 
existence of an actual book!

Let this evoke an image of a **personal** book of tricks that programmers in 
the 1960s might have recorded such techniques in.  And the tricks were passed 
down across generations of programmers!

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Eirik Bjorsnos

On Tue, 21 Feb 2023 20:23:11 GMT, Eirik Bjorsnos  wrote:

>> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
>> 175:
>> 
>>> 173:  }
>>> 174:  // uppercase b1 using 'the oldest ASCII trick in the book'
>>> 175:  int U = b1 & 0xDF;
>> 
>> I'm sure some people reading this comment will wonder which book :-) It 
>> might be better to drop that bit and if possible, find a better name for "U" 
>> as normally variables start with a lower case.
>
> Hi Alan,
> 
> I thought I was clever by encoding the 'uppercaseness' in the variable name, 
> but yeah I'll find a better name :)
> 
> There is some precedent for using the 'ASCII trick' comment in the JDK.  I 
> found it in ZipFile.isMetaName, which is also where I first learned about 
> this interesting relationship between ASCII (and also latin1) letters.
> 
> The comment was first added by Martin Buchholz back in 2016 as part of 
> JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating 
> this code and Lance has some input about clarifying the comment. Martin then 
> [chimed 
> in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) to 
> defend his comment:
> 
>> I still like my ancient "ASCII trick" comment.
> 
> I think this 'trick', whatever we call it, is sufficiently intricate that it 
> deserves to be called out somehow and that we should not just casually 
> bitmask with these magic constants without any discussion at all. 
> 
> An earlier iteration of this PR included a small essay in the javadoc of this 
> method describing the layout and relationship of letters in latin1 and how we 
> can apply that knowledge of the layout to implement the method.
> 
> How would you feel about adding that description back to the Javadocs? This 
> would then live close to the similarly implemented toUpperCase and 
> toLowerCase methods currently under review in #12623. 
> 
> Here's the updated discussion included in the Javadoc:
> 
> 
> /**
>  * Compares two latin1 code points, ignoring case considerations.
>  *
>  * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
>  * letters are found in the following code point ranges:
>  *
>  * 0x41-0x5A: Uppercase ASCII letters: A-Z
>  * 0x61-0x7A: Lowercase ASCII letters: a-z
>  * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
>  * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
>  * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
>  * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
>  *
>  * While both ASCII letter ranges are contiguous, the latin1 ranges are 
> not:
>  *
>  * The 'multiplication sign' 0xD7 splits the uppercase range in two.
>  * The 'division sign' 0xF7 splits the lowercase range in two.
>  *
>  * Lowercase letters are found 32 positions (0x20) after their 
> corresponding uppercase letter.
>  * The 'division sign' and 'multiplication sign' have the same relative 
> distance.
>  *
>  * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in 
> the book' to
>  * lowercase any letter by setting the bit:
>  *
>  * ('C' | 0x20) == 'c'
>  *
>  * By removing the bit, we can perform the uppercase operation:
>  *
>  * ('c' & 0xDF) == 'C'
>  *
>  * Applying this knowledge of the latin1 layout, we can test for equality 
> ignoring case by
>  * checking that the code points are either equal, or that one of the 
> code points is a letter
>  * which uppercases is the same as the uppercase of the other code point.
>  *
>  * @param b1 byte representing a latin1 code point
>  * @param b2 another byte representing a latin1 code point
>  * @return true if the two bytes are considered equals ignoring case in 
> latin1
>  */
>  static boolean equalsIgnoreCase(byte b1, byte b2) {
>  if (b1 == b2) {
>  return true;
>  }
>  int upper = b1 & 0xDF;
>  if (upper < 'A') {
>  return false;  // Low ASCII
>  }
>  return (upper <= 'Z' // In range A-Z
>  || (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // 
> ..or A-grave-Thorn, excl. multiplication
>  && upper == (b2 & 0xDF); // b2 has same uppercase
> }

Perhaps @Martin-Buchholz could chime in and also tell us which book he found 
his ASCII trick in :)

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Eirik Bjorsnos

On Tue, 21 Feb 2023 14:27:03 GMT, Alan Bateman  wrote:

>> Eirik Bjorsnos has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Remove whitespace following '('
>
> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
> 175:
> 
>> 173:  }
>> 174:  // uppercase b1 using 'the oldest ASCII trick in the book'
>> 175:  int U = b1 & 0xDF;
> 
> I'm sure some people reading this comment will wonder which book :-) It might 
> be better to drop that bit and if possible, find a better name for "U" as 
> normally variables start with a lower case.

Hi Alan,

I thought I was clever by encoding the 'uppercaseness' in the variable name, 
but yeah I'll find a better name :)

There is some precedent for using the 'ASCII trick' comment in the JDK.  I 
found it in ZipFile.isMetaName, which is also where I first learned about this 
interesting relationship between ASCII (and also latin1) letters.

The comment was first added by Martin Buchholz back in 2016 as part of 
JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating this 
code and Lance has some input about clarifying the comment. Martin then [chimed 
in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) to 
defend his comment:

> I still like my ancient "ASCII trick" comment.

I think this 'trick', whatever we call it, is sufficiently intricate that it 
deserves to be called out somehow and that we should not just casually bitmask 
with these magic constants without any discussion at all. 

An earlier iteration of this PR included a small essay in the javadoc of this 
method describing the layout and relationship of letters in latin1 and how we 
can apply that knowledge of the layout to implement the method.

How would you feel about adding that description back to the Javadocs? This 
would then live close to the similarly implemented toUpperCase and toLowerCase 
methods currently under review in #12623. 

Here's the updated discussion included in the Javadoc:

/**
 * Compares two latin1 code points, ignoring case considerations.
 *
 * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
 * letters are found in the following code point ranges:
 *
 * 0x41-0x5A: Uppercase ASCII letters: A-Z
 * 0x61-0x7A: Lowercase ASCII letters: a-z
 * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
 * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
 * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
 * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
 *
 * While both ASCII letter ranges are contiguous, the latin1 ranges are not:
 *
 * The 'multiplication sign' 0xD7 splits the uppercase range in two.
 * The 'division sign' 0xF7 splits the lowercase range in two.
 *
 * Lowercase letters are found 32 positions (0x20) after their 
corresponding uppercase letter.
 * The 'division sign' and 'multiplication sign' have the same relative 
distance.
 *
 * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in the 
book' to
 * lowercase any letter by setting the bit:
 *
 * ('C' | 0x20) == 'c'
 *
 * By removing the bit, we can perform the uppercase operation:
 *
 * ('c' & 0xDF) == 'C'
 *
 * Applying this knowledge of the latin1 layout, we can test for equality 
ignoring case by
 * checking that the code points are either equal, or that one of the code 
points is a letter
 * which uppercases is the same as the uppercase of the other code point.
 *
 * @param b1 byte representing a latin1 code point
 * @param b2 another byte representing a latin1 code point
 * @return true if the two bytes are considered equals ignoring case in 
latin1
 */
 static boolean equalsIgnoreCase(byte b1, byte b2) {
 if (b1 == b2) {
 return true;
 }
 int upper = b1 & 0xDF;
 if (upper < 'A') {
 return false;  // Low ASCII
 }
 return (upper <= 'Z' // In range A-Z
 || (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // ..or 
A-grave-Thorn, excl. multiplication
 && upper == (b2 & 0xDF); // b2 has same uppercase
}

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Alan Bateman

On Tue, 21 Feb 2023 11:14:13 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove whitespace following '('

src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
163:

> 161: return mapChar;
> 162: }
> 163: /**

I assume you should insert a blank line between the two methods.

src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
175:

> 173:  }
> 174:  // uppercase b1 using 'the oldest ASCII trick in the book'
> 175:  int U = b1 & 0xDF;

I'm sure some people reading this comment will wonder which book :-) It might 
be better to drop that bit and if possible, find a better name for "U" as 
normally variables start with a lower case.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Claes Redestad

On Tue, 21 Feb 2023 11:14:13 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Remove whitespace following '('

Marked as reviewed by redestad (Reviewer).

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

2023-02-21 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Remove whitespace following '('

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/b8139961..d7b1c164

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=06
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=05-06

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v6]

2023-02-20 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Revert "Spell fix for 'exhaustive' in comments in sun/text/resources"
  
  This reverts commit 5e9927a4b35e157fd3fa72fd2663c8bfbecf32bb.

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/5e9927a4..b8139961

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=04-05

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v5]

2023-02-20 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Spell fix for 'exhaustive' in comments in sun/text/resources

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/03d3e2cb..5e9927a4

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=03-04

  Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]

2023-02-20 Thread Eirik Bjorsnos

On Mon, 20 Feb 2023 16:23:32 GMT, Claes Redestad  wrote:

>> Is it not already in CharacterDataLatin1?
>> 
>> Here is a comparison of relying on improvements in 
>> `CharacterDataLatin1.toUpperCase/toLowerCase` only vs. using 
>> `CharacterDataLatin1.equalsIgnoreCase`:
>> 
>> Character.toUpperCase/toLowerCase only:
>> 
>> 
>> Benchmark  (codePoints)  (size)  Mode  Cnt   
>>   ScoreError  Units
>> RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15  
>> 1310.582 ± 84.777  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15   
>>   4.547 ±  0.545  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
>> 686.947 ± 11.850  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15   
>>   3.836 ±  0.634  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
>> 2107.219 ± 17.662  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15   
>>   4.924 ±  0.829  ns/op
>> 
>> 
>> CharacterDataLatin1.equalsIgnoreCase:
>> 
>> 
>> Benchmark  (codePoints)  (size)  Mode  Cnt   
>>   ScoreError  Units
>> RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15   
>> 742.467 ± 34.490  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15   
>>   3.960 ±  0.046  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
>> 361.158 ± 37.096  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15   
>>   4.039 ±  0.521  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
>> 1158.091 ± 41.617  ns/op
>> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15   
>>   4.358 ±  0.123  ns/op
>
> Oops, I lost context and thought this was in `StringLatin1`.
> 
> Thanks for running the numbers with #12623. Looks like you're getting big 
> enough of an improvement on top.

Yes, seems `equalsIgnoreCase` carries its weight.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]

2023-02-20 Thread Claes Redestad

On Mon, 20 Feb 2023 16:16:45 GMT, Eirik Bjorsnos  wrote:

>> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
>> 170:
>> 
>>> 168:  * @return true if the two bytes are considered equals ignoring 
>>> case in latin1
>>> 169:  */
>>> 170:  static boolean equalsIgnoreCase(byte b1, byte b2) {
>> 
>> Perhaps put this in `CharacterDataLatin1`, keeping it close to 
>> toLowerCase/toUpperCase that you're changing to use similar logic with 
>> #12623 
>> 
>> If you apply #12623 first - how much difference does this make on the micro 
>> you're adding with this PR?
>
> Is it not already in CharacterDataLatin1?
> 
> Here is a comparison of relying on improvements in 
> `CharacterDataLatin1.toUpperCase/toLowerCase` only vs. using 
> `CharacterDataLatin1.equalsIgnoreCase`:
> 
> Character.toUpperCase/toLowerCase only:
> 
> 
> Benchmark  (codePoints)  (size)  Mode  Cnt
>  ScoreError  Units
> RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15  
> 1310.582 ± 84.777  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15
>  4.547 ±  0.545  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
> 686.947 ± 11.850  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15
>  3.836 ±  0.634  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
> 2107.219 ± 17.662  ns/op
> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15
>  4.924 ±  0.829  ns/op
> 
> 
> CharacterDataLatin1.equalsIgnoreCase:
> 
> 
> Benchmark  (codePoints)  (size)  Mode  Cnt
>  ScoreError  Units
> RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15   
> 742.467 ± 34.490  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15
>  3.960 ±  0.046  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
> 361.158 ± 37.096  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15
>  4.039 ±  0.521  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
> 1158.091 ± 41.617  ns/op
> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15
>  4.358 ±  0.123  ns/op

Oops, I lost context and thought this was in `StringLatin1`.

Thanks for running the numbers with #12623. Looks like you're getting big 
enough of an improvement on top.

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]

2023-02-20 Thread Eirik Bjorsnos

On Mon, 20 Feb 2023 15:40:09 GMT, Claes Redestad  wrote:

>> Eirik Bjorsnos has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871
>>  - Add @bug tag to EqualsIgnoreCase test for JDK-8302877
>
> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
> 170:
> 
>> 168:  * @return true if the two bytes are considered equals ignoring 
>> case in latin1
>> 169:  */
>> 170:  static boolean equalsIgnoreCase(byte b1, byte b2) {
> 
> Perhaps put this in `CharacterDataLatin1`, keeping it close to 
> toLowerCase/toUpperCase that you're changing to use similar logic with #12623 
> 
> If you apply #12623 first - how much difference does this make on the micro 
> you're adding with this PR?

Is it not already in CharacterDataLatin1?

Here is a comparison of relying on improvements in 
`CharacterDataLatin1.toUpperCase/toLowerCase` only vs. using 
`CharacterDataLatin1.equalsIgnoreCase`:

Character.toUpperCase/toLowerCase only:


Benchmark  (codePoints)  (size)  Mode  Cnt 
ScoreError  Units
RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15  
1310.582 ± 84.777  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15 
4.547 ±  0.545  ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
686.947 ± 11.850  ns/op
RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15 
3.836 ±  0.634  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
2107.219 ± 17.662  ns/op
RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15 
4.924 ±  0.829  ns/op


CharacterDataLatin1.equalsIgnoreCase:


Benchmark  (codePoints)  (size)  Mode  Cnt 
ScoreError  Units
RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15   
742.467 ± 34.490  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15 
3.960 ±  0.046  ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
361.158 ± 37.096  ns/op
RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15 
4.039 ±  0.521  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
1158.091 ± 41.617  ns/op
RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15 
4.358 ±  0.123  ns/op

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]

2023-02-20 Thread Claes Redestad

On Mon, 20 Feb 2023 14:45:09 GMT, Eirik Bjorsnos  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> Eirik Bjorsnos has updated the pull request incrementally with two additional 
> commits since the last revision:
> 
>  - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871
>  - Add @bug tag to EqualsIgnoreCase test for JDK-8302877

src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
170:

> 168:  * @return true if the two bytes are considered equals ignoring case 
> in latin1
> 169:  */
> 170:  static boolean equalsIgnoreCase(byte b1, byte b2) {

Perhaps put this in `CharacterDataLatin1`, keeping it close to 
toLowerCase/toUpperCase that you're changing to use similar logic with #12623 

If you apply #12623 first - how much difference does this make on the micro 
you're adding with this PR?

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]

2023-02-20 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with two additional 
commits since the last revision:

 - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871
 - Add @bug tag to EqualsIgnoreCase test for JDK-8302877

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/84517102..03d3e2cb

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=03
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=02-03

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v3]

2023-02-20 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Add clarifying comments and use more descriptive variable names in the latin1 
verification EqualsIgnoreCase test

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/59c42298..84517102

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=01-02

  Stats: 16 lines in 1 file changed: 5 ins; 2 del; 9 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v2]

2023-02-20 Thread Eirik Bjorsnos

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Eirik Bjorsnos has updated the pull request with a new target base due to a 
merge or a rebase. The incremental webrev excludes the unrelated changes 
brought in by the merge/rebase. The pull request contains five additional 
commits since the last revision:

 - Align whitespace to make example strings easier to read
 - Merge branch 'master' into regionmatches-latin1-speedup
 - Exhaustive verification needs to cover the case b1 == b2
 - Move multiplication exclusion to the lat1 range branch
 - Speed up StringLatin1.regionMatchesCI by applying the 'oldest ASCII trick in 
the book'

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12632/files
  - new: https://git.openjdk.org/jdk/pull/12632/files/a583bcd6..59c42298

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=00-01

  Stats: 57495 lines in 1316 files changed: 24512 ins; 15682 del; 17301 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI

2023-02-20 Thread David Schlosnagle

On Sat, 18 Feb 2023 19:45:34 GMT, Eirik Bjorsnos  wrote:

>> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
>> 181:
>> 
>>> 179:  return ( U <= 'Z' // In range A-Z
>>> 180:  || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or 
>>> A-grave-Thorn, excl. multiplication
>>> 181:  && U == (b2 & 0xDF); // b2 has same uppercase
>> 
>> I'm curious if the order of comparisons could alter performance to a small 
>> degree. For example, it might be interesting to compare various permutations 
>> like below to short circuit reject unequal uppercased b2
>> 
>> Suggestion:
>> 
>>  // uppercase b1 using 'the oldest ASCII trick in the book'
>>  int U = b1 & 0xDF;
>>  return (U == (b2 & 0xDF))
>>  && ((U >= 'A' && U <= 'Z') // In range A-Z
>>  || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or 
>> A-grave-Thorn, excl. multiplication
>
> Yeah, as you noticed this code is tricky and sensitive to the order of 
> operations. I did some quite extensive exploration before ending on the 
> current structure. This particular one seems to improve rejection somewhat at 
> the cost of matches.
> 
> Since rejection is relatively speaking already very fast, I think we should 
> favour fast matching here.
> 
> Results:
> 
> 
> enchmark  (codePoints)  (size)  Mode  Cnt 
> ScoreError  Units
> RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15   
> 917.796 ± 20.285  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15
>  4.367 ±  0.348  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
> 399.656 ± 10.703  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15
>  4.361 ±  0.664  ns/op
> RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
> 1384.443 ± 22.199  ns/op
> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15
>  4.119 ±  0.451  ns/op

Thanks for confirming

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI

2023-02-20 Thread Eirik Bjorsnos

On Sat, 18 Feb 2023 19:04:24 GMT, David Schlosnagle  wrote:

>> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
>> 'the oldest ASCII trick in the book'.
>> 
>> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
>> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
>> updated to use `equalsIgnoreCase`
>> 
>> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
>> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
>> code point pairs have an `equalsIgnoreCase` consistent with 
>> Character.toUpperCase, Character.toLowerCase.
>> 
>> Performance is tested for matching and mismatching cases of code point pairs 
>> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
>> in the first comment below.
>
> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
> 181:
> 
>> 179:  return ( U <= 'Z' // In range A-Z
>> 180:  || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or 
>> A-grave-Thorn, excl. multiplication
>> 181:  && U == (b2 & 0xDF); // b2 has same uppercase
> 
> I'm curious if the order of comparisons could alter performance to a small 
> degree. For example, it might be interesting to compare various permutations 
> like below to short circuit reject unequal uppercased b2
> 
> Suggestion:
> 
>  // uppercase b1 using 'the oldest ASCII trick in the book'
>  int U = b1 & 0xDF;
>  return (U == (b2 & 0xDF))
>  && ((U >= 'A' && U <= 'Z') // In range A-Z
>  || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or 
> A-grave-Thorn, excl. multiplication

Yeah, as you noticed this code is tricky and sensitive to the order of 
operations. I did some quite extensive exploration before ending on the current 
structure. This particular one seems to improve rejection somewhat at the cost 
of matches.

Since rejection is relatively speaking already very fast, I think we should 
favour fast matching here.

Results:


enchmark  (codePoints)  (size)  Mode  Cnt 
ScoreError  Units
RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15   
917.796 ± 20.285  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15 
4.367 ±  0.348  ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
399.656 ± 10.703  ns/op
RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15 
4.361 ±  0.664  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
1384.443 ± 22.199  ns/op
RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15 
4.119 ±  0.451  ns/op

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI

2023-02-20 Thread David Schlosnagle

On Sat, 18 Feb 2023 09:21:25 GMT, Eirik Bjorsnos  wrote:

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
181:

> 179:  return ( U <= 'Z' // In range A-Z
> 180:  || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or 
> A-grave-Thorn, excl. multiplication
> 181:  && U == (b2 & 0xDF); // b2 has same uppercase

I'm curious if the order of comparisons could alter performance to a small 
degree. For example, it might be interesting to compare various permutations 
like below to short circuit reject unequal uppercased b2

Suggestion:

 // uppercase b1 using 'the oldest ASCII trick in the book'
 int U = b1 & 0xDF;
 return (U == (b2 & 0xDF))
 && ((U >= 'A' && U <= 'Z') // In range A-Z
 || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or 
A-grave-Thorn, excl. multiplication

-

PR: https://git.openjdk.org/jdk/pull/12632

Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI

2023-02-20 Thread Eirik Bjorsnos

On Sat, 18 Feb 2023 09:21:25 GMT, Eirik Bjorsnos  wrote:

> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
> 'the oldest ASCII trick in the book'.
> 
> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
> updated to use `equalsIgnoreCase`
> 
> To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 
> code point pairs have an `equalsIgnoreCase` consistent with 
> Character.toUpperCase, Character.toLowerCase.
> 
> Performance is tested for matching and mismatching cases of code point pairs 
> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results 
> in the first comment below.

Benchmark results:

Baseline:


Benchmark  (codePoints)  (size)  Mode  Cnt 
ScoreError  Units
RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15  
2216.525 ± 79.626  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15 
5.049 ±  0.044  ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
708.977 ± 19.381  ns/op
RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15 
3.726 ±  0.036  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
2134.499 ± 23.064  ns/op
RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15 
4.227 ±  0.070  ns/op


Patch:


Benchmark  (codePoints)  (size)  Mode  Cnt 
ScoreError  Units
RegionMatchesIC.Latin1.regionMatchesIC  ascii-match1024  avgt   15   
809.729 ± 40.257  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   ascii-mismatch1024  avgt   15 
4.334 ±  0.031  ns/op
RegionMatchesIC.Latin1.regionMatchesIC number-match1024  avgt   15   
370.814 ± 39.790  ns/op
RegionMatchesIC.Latin1.regionMatchesIC  number-mismatch1024  avgt   15 
3.766 ±  0.072  ns/op
RegionMatchesIC.Latin1.regionMatchesIC   lat1-match1024  avgt   15  
1247.979 ±  7.826  ns/op
RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024  avgt   15 
4.819 ±  0.026  ns/op

-

PR: https://git.openjdk.org/jdk/pull/12632

RFR: 8302871: Speed up StringLatin1.regionMatchesCI

2023-02-20 Thread Eirik Bjorsnos

This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 
'the oldest ASCII trick in the book'.

The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two 
latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is 
updated to use `equalsIgnoreCase`

To verify the correctness of `equalsIgnoreCase`, a new test is added  to 
`EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 code 
point pairs have an `equalsIgnoreCase` consistent with Character.toUpperCase, 
Character.toLowerCase.

Performance is tested for matching and mismatching cases of code point pairs 
picked from the ASCII letter, ASCII number and latin1 letter ranges. Results in 
the first comment below.

-

Commit messages:
 - Exhaustive verification needs to cover the case b1 == b2
 - Move multiplication exclusion to the lat1 range branch
 - Speed up StringLatin1.regionMatchesCI by applying the 'oldest ASCII trick in 
the book'

Changes: https://git.openjdk.org/jdk/pull/12632/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8302871
  Stats: 158 lines in 4 files changed: 148 ins; 5 del; 5 mod
  Patch: https://git.openjdk.org/jdk/pull/12632.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632

PR: https://git.openjdk.org/jdk/pull/12632

Re: Speed up StringLatin1.regionMatchesCI

2023-02-20 Thread Claes Redestad

RFE filed: https://bugs.openjdk.org/browse/JDK-8302871

/Claes

18 feb. 2023 kl. 10:28 skrev Eirik Bjørsnøs 
mailto:eir...@gmail.com>>:

Hi,

The following PR suggests we can speed up StringLatin1.regionMatchesCI by 
applying 'the oldest ASCII trick in the book':

https://github.com/openjdk/jdk/pull/12632

Thanks,
Eirik.

Speed up StringLatin1.regionMatchesCI

2023-02-18 Thread Eirik Bjørsnøs

Hi,

The following PR suggests we can speed up StringLatin1.regionMatchesCI by
applying 'the oldest ASCII trick in the book':

https://github.com/openjdk/jdk/pull/12632

Thanks,
Eirik.

48 matches

Mail list logo