Integrated: 8302871: Speed up StringLatin1.regionMatchesCI
On Sat, 18 Feb 2023 09:21:25 GMT, Eirik Bjorsnos wrote: > This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. This pull request has now been integrated. Changeset: 17e3769e Author:Eirik Bjorsnos Committer: Alan Bateman URL: https://git.openjdk.org/jdk/commit/17e3769ed7190c3ba885e6434e1811bca2d66f13 Stats: 166 lines in 4 files changed: 148 ins; 5 del; 13 mod 8302871: Speed up StringLatin1.regionMatchesCI Reviewed-by: redestad, martin, alanb - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]
On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 21 additional > commits since the last revision: > > - Merge branch 'master' into regionmatches-latin1-speedup > - Merge branch 'master' into regionmatches-latin1-speedup > - Make the loop variables chars to avoid casting > - Use improved case-twiddling comment as suggested by Martin > - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase > with "by removing (setting) a single bit" > - Align local variable naming in toLowerCase, toUpperCase with > equalsIgnoreCase by using 'lower' and 'upper' > - Rename unconventionally named local variable 'U' to 'upper' > - Merge remote-tracking branch 'origin/master' into > regionmatches-latin1-speedup > - Add whitespace between methods > - Merge branch 'master' into regionmatches-latin1-speedup > - ... and 11 more: https://git.openjdk.org/jdk/compare/6e269b24...597b346a Marked as reviewed by alanb (Reviewer). - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]
On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 21 additional > commits since the last revision: > > - Merge branch 'master' into regionmatches-latin1-speedup > - Merge branch 'master' into regionmatches-latin1-speedup > - Make the loop variables chars to avoid casting > - Use improved case-twiddling comment as suggested by Martin > - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase > with "by removing (setting) a single bit" > - Align local variable naming in toLowerCase, toUpperCase with > equalsIgnoreCase by using 'lower' and 'upper' > - Rename unconventionally named local variable 'U' to 'upper' > - Merge remote-tracking branch 'origin/master' into > regionmatches-latin1-speedup > - Add whitespace between methods > - Merge branch 'master' into regionmatches-latin1-speedup > - ... and 11 more: https://git.openjdk.org/jdk/compare/bd4b71f6...597b346a Thanks, Alan 👍 Sponsors, hold your horses! (Not sure how to 'undo' the integrate command) - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v12]
On Wed, 22 Feb 2023 18:44:57 GMT, Eirik Bjorsnos wrote: > I'll let this linger a bit before integrating in case Alan has comments after > the latest updates. I plan to look at it, been busy with other things. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]
On Thu, 23 Feb 2023 07:21:08 GMT, Eirik Bjorsnos wrote: > Seems compatibility with existing 6-bit devices might have been the primary > concern: This also explains the placement of brackets, braces, bars, tilde etc. They would look visually similar on 6-bit devices: https://user-images.githubusercontent.com/300291/220845393-de12301a-73e2-46fe-a6ce-ce101d0a2b3b.png";> - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]
On Wed, 22 Feb 2023 20:01:52 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 21 additional > commits since the last revision: > > - Merge branch 'master' into regionmatches-latin1-speedup > - Merge branch 'master' into regionmatches-latin1-speedup > - Make the loop variables chars to avoid casting > - Use improved case-twiddling comment as suggested by Martin > - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase > with "by removing (setting) a single bit" > - Align local variable naming in toLowerCase, toUpperCase with > equalsIgnoreCase by using 'lower' and 'upper' > - Rename unconventionally named local variable 'U' to 'upper' > - Merge remote-tracking branch 'origin/master' into > regionmatches-latin1-speedup > - Add whitespace between methods > - Merge branch 'master' into regionmatches-latin1-speedup > - ... and 11 more: https://git.openjdk.org/jdk/compare/31689be3...597b346a I found this in Appendix A of the 1973 `Draft Proposed Revision of ASCII`. Seems compatibility with existing 6-bit devices might have been the primary concern: A 6.4 It is expected that devices having the capability of printing only 64 graphic symbols will continue to be important. It may be desirable to arrange these devices to print one symbol for the bit pattern of both upper and lower case of a given alphabetic letter. To facilitate this, there should be a single- bit difference between the upper and lowercase representations of any given letter. Combined with the requirement that a given case of the alphabet be contiguous, this dictated the assignment of the alphabet, as shown in columns 4 through 7. https://user-images.githubusercontent.com/300291/220842000-3efa64fe-9154-4069-9e81-e202d5731f6f.png";> https://ia800606.us.archive.org/17/items/enf-ascii-1972-1975/Image070917152640_text.pdf - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v14]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 21 additional commits since the last revision: - Merge branch 'master' into regionmatches-latin1-speedup - Merge branch 'master' into regionmatches-latin1-speedup - Make the loop variables chars to avoid casting - Use improved case-twiddling comment as suggested by Martin - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase with "by removing (setting) a single bit" - Align local variable naming in toLowerCase, toUpperCase with equalsIgnoreCase by using 'lower' and 'upper' - Rename unconventionally named local variable 'U' to 'upper' - Merge remote-tracking branch 'origin/master' into regionmatches-latin1-speedup - Add whitespace between methods - Merge branch 'master' into regionmatches-latin1-speedup - ... and 11 more: https://git.openjdk.org/jdk/compare/e7379286...597b346a - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/fe4efe9f..597b346a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=13 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=12-13 Stats: 9 lines in 5 files changed: 1 ins; 2 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v13]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision: - Merge branch 'master' into regionmatches-latin1-speedup - Make the loop variables chars to avoid casting - Use improved case-twiddling comment as suggested by Martin - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase with "by removing (setting) a single bit" - Align local variable naming in toLowerCase, toUpperCase with equalsIgnoreCase by using 'lower' and 'upper' - Rename unconventionally named local variable 'U' to 'upper' - Merge remote-tracking branch 'origin/master' into regionmatches-latin1-speedup - Add whitespace between methods - Merge branch 'master' into regionmatches-latin1-speedup - Remove whitespace following '(' - ... and 10 more: https://git.openjdk.org/jdk/compare/ea132c56...fe4efe9f - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/cc185293..fe4efe9f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=11-12 Stats: 1397 lines in 120 files changed: 873 ins; 291 del; 233 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v12]
On Wed, 22 Feb 2023 16:37:45 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request incrementally with one additional > commit since the last revision: > > Make the loop variables chars to avoid casting Thanks for reviews Claes and Martin! I'll let this linger a bit before integrating in case Alan has comments after the latest updates. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v12]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision: Make the loop variables chars to avoid casting - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/ea2f9fa3..cc185293 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=11 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=10-11 Stats: 3 lines in 1 file changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]
On Wed, 22 Feb 2023 16:25:41 GMT, Martin Buchholz wrote: >> Eirik Bjorsnos has updated the pull request incrementally with two >> additional commits since the last revision: >> >> - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase >> with "by removing (setting) a single bit" >> - Align local variable naming in toLowerCase, toUpperCase with >> equalsIgnoreCase by using 'lower' and 'upper' > > test/jdk/java/lang/String/CompactString/EqualsIgnoreCase.java line 89: > >> 87: for (int ab = 0; ab < 256; ab++) { >> 88: for (int bb = 0; bb < 256; bb++) { >> 89: char a = (char) ab, b = (char) bb; > > char is an unsigned numeric type, so cleaner is > > for (char a = 0; a < 256; a++) > for (char b = 0; b < 256; b++) Thanks, fixed. Might have been copied over from processing of code points in the higher planes. Not needed here. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Wed, 22 Feb 2023 16:10:42 GMT, Martin Buchholz wrote: >> Thanks Martin, David, Alan. This was instructive (and fun!) >> >> I suggest we condense the comment to something like this: >> >> >> // Uppercase b1 by removing a single bit >> int upper = b1 & 0xDF; >> if (upper < 'A') { >> return false; // Low ASCII >> } >> ... >> >> >> The similar methods `toLowerCase` `toUpperCase` just above have been updated >> to follow the same style. (I also updated local variable names there to >> align better with equalsIgnoreCase) > > // ASCII and Latin-1 were designed to optimize case-twiddling operations Thanks! This expresses the higher-level benefit succinctly, without getting into the details. I like it! - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]
On Wed, 22 Feb 2023 07:11:16 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request incrementally with two additional > commits since the last revision: > > - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase > with "by removing (setting) a single bit" > - Align local variable naming in toLowerCase, toUpperCase with > equalsIgnoreCase by using 'lower' and 'upper' Marked as reviewed by martin (Reviewer). test/jdk/java/lang/String/CompactString/EqualsIgnoreCase.java line 89: > 87: for (int ab = 0; ab < 256; ab++) { > 88: for (int bb = 0; bb < 256; bb++) { > 89: char a = (char) ab, b = (char) bb; char is an unsigned numeric type, so cleaner is for (char a = 0; a < 256; a++) for (char b = 0; b < 256; b++) - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v11]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision: Use improved case-twiddling comment as suggested by Martin - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/44d91544..ea2f9fa3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=09-10 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Wed, 22 Feb 2023 07:12:35 GMT, Eirik Bjorsnos wrote: >>> Do you have an opinion on the appropriate level of documentation / comments >>> for this kind of 'tricky' code? >> >> This code is not that tricky! And the proposed level of documentation is >> excessive! A couple of lines of explanation and perhaps a link to an >> external document would be good. >> >> It often happens to me that I will write such exhaustive notes for myself >> when learning a new technology. A year later I pare it all back because >> much of it is "obvious in retrospect". > > Thanks Martin, David, Alan. This was instructive (and fun!) > > I suggest we condense the comment to something like this: > > > // Uppercase b1 by removing a single bit > int upper = b1 & 0xDF; > if (upper < 'A') { > return false; // Low ASCII > } > ... > > > The similar methods `toLowerCase` `toUpperCase` just above have been updated > to follow the same style. (I also updated local variable names there to align > better with equalsIgnoreCase) // ASCII and Latin-1 were designed to optimize case-twiddling operations - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]
On Wed, 22 Feb 2023 07:11:16 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request incrementally with two additional > commits since the last revision: > > - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase > with "by removing (setting) a single bit" > - Align local variable naming in toLowerCase, toUpperCase with > equalsIgnoreCase by using 'lower' and 'upper' Marked as reviewed by redestad (Reviewer). - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Wed, 22 Feb 2023 00:50:39 GMT, Martin Buchholz wrote: >> More history: IIRC I originally used 'ASCII trick' when I was truly only >> cared about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight >> misnomer to use "ASCII" here. But Latin1 followed the precedent of ASCII. > >> Do you have an opinion on the appropriate level of documentation / comments >> for this kind of 'tricky' code? > > This code is not that tricky! And the proposed level of documentation is > excessive! A couple of lines of explanation and perhaps a link to an > external document would be good. > > It often happens to me that I will write such exhaustive notes for myself > when learning a new technology. A year later I pare it all back because much > of it is "obvious in retrospect". Thanks Martin, David, Alan. This was instructive (and fun!) I suggest we condense the comment to something like this: // Uppercase b1 by removing a single bit int upper = b1 & 0xDF; if (upper < 'A') { return false; // Low ASCII } ... The similar methods `toLowerCase` `toUpperCase` just above have been updated to follow the same style. (I also updated local variable names there to align better with equalsIgnoreCase) - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v10]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with two additional commits since the last revision: - Replace 'oldest ASCII trick in the book' use in toUpperCase, toLowerCase with "by removing (setting) a single bit" - Align local variable naming in toLowerCase, toUpperCase with equalsIgnoreCase by using 'lower' and 'upper' - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/6588ab0f..44d91544 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=08-09 Stats: 11 lines in 1 file changed: 2 ins; 0 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Wed, 22 Feb 2023 00:45:52 GMT, Martin Buchholz wrote: >> There are still some books on this :) but from wikipedia: >>> during May 1963 the CCITT Working Party on the New Telegraph Alphabet >>> proposed to assign lowercase characters to >>> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14) >>> 6 and >>> 7,[[15]](https://en.wikipedia.org/wiki/ASCII#cite_note-CCITT_1963-16) and >>> [International Organization for >>> Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization) >>> TC 97 SC 2 voted during October to incorporate the change into its draft >>> standard.[[16]](https://en.wikipedia.org/wiki/ASCII#cite_note-ISO_1963-17) >>> The X3.2.4 task group voted its approval for the change to ASCII at its May >>> 1963 meeting.[[17]](https://en.wikipedia.org/wiki/ASCII#cite_note-18) >>> **Locating the lowercase letters in >>> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14) >>> 6 and 7 caused the characters to differ in bit pattern from the upper case by a single bit, which simplified [case-insensitive](https://en.wikipedia.org/wiki/Case-insensitive) character matching and the construction of keyboards and printers.** >> >> Hence the simplicity of the shift key :) > > More history: IIRC I originally used 'ASCII trick' when I was truly only > cared about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight > misnomer to use "ASCII" here. But Latin1 followed the precedent of ASCII. > Do you have an opinion on the appropriate level of documentation / comments > for this kind of 'tricky' code? This code is not that tricky! And the proposed level of documentation is excessive! A couple of lines of explanation and perhaps a link to an external document would be good. It often happens to me that I will write such exhaustive notes for myself when learning a new technology. A year later I pare it all back because much of it is "obvious in retrospect". - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Wed, 22 Feb 2023 00:06:51 GMT, David Holmes wrote: >> Thanks Martin, I will from now on envision a stack of dusty punch cards with >> carefully scribbled notes on the back, barely held together with a dry and >> cracked rubber band. >> >> More to the point: Do you have an opinion on the appropriate level of >> documentation / comments for this kind of 'tricky' code? My goal is that >> future maintainers should not get scared by the code and help them >> understand why it became what it is. > > There are still some books on this :) but from wikipedia: >> during May 1963 the CCITT Working Party on the New Telegraph Alphabet >> proposed to assign lowercase characters to >> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14) >> 6 and 7,[[15]](https://en.wikipedia.org/wiki/ASCII#cite_note-CCITT_1963-16) >> and [International Organization for >> Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization) >> TC 97 SC 2 voted during October to incorporate the change into its draft >> standard.[[16]](https://en.wikipedia.org/wiki/ASCII#cite_note-ISO_1963-17) >> The X3.2.4 task group voted its approval for the change to ASCII at its May >> 1963 meeting.[[17]](https://en.wikipedia.org/wiki/ASCII#cite_note-18) >> **Locating the lowercase letters in >> sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14) >> 6 and 7 caused the characters to differ in bit pattern from the upper case by a single bit, which simplified [case-insensitive](https://en.wikipedia.org/wiki/Case-insensitive) character matching and the construction of keyboards and printers.** > > Hence the simplicity of the shift key :) More history: IIRC I originally used 'ASCII trick' when I was truly only cared about ASCII, not Latin1 (e.g. ZipFile.isMetaName) and it's a slight misnomer to use "ASCII" here. But Latin1 followed the precedent of ASCII. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 23:28:52 GMT, Eirik Bjorsnos wrote: >> More to the point, ASCII was obviously **designed** to allow you to >> uppercase a lower case letter with a single instruction, so "the book" might >> have been a draft standard before they scrubbed out the interesting history! > > Thanks Martin, I will from now on envision a stack of dusty punch cards with > carefully scribbled notes on the back, barely held together with a dry and > cracked rubber band. > > More to the point: Do you have an opinion on the appropriate level of > documentation / comments for this kind of 'tricky' code? My goal is that > future maintainers should not get scared by the code and help them understand > why it became what it is. There are still some books on this :) but from wikipedia: > during May 1963 the CCITT Working Party on the New Telegraph Alphabet > proposed to assign lowercase characters to > sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14) > 6 and 7,[[15]](https://en.wikipedia.org/wiki/ASCII#cite_note-CCITT_1963-16) > and [International Organization for > Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization) > TC 97 SC 2 voted during October to incorporate the change into its draft > standard.[[16]](https://en.wikipedia.org/wiki/ASCII#cite_note-ISO_1963-17) > The X3.2.4 task group voted its approval for the change to ASCII at its May > 1963 meeting.[[17]](https://en.wikipedia.org/wiki/ASCII#cite_note-18) > **Locating the lowercase letters in > sticks[[a]](https://en.wikipedia.org/wiki/ASCII#cite_note-NB_Stick-15)[[14]](https://en.wikipedia.org/wiki/ASCII#cite_note-Bemer_1980_Inside-14) > 6 and 7 caused the characters to differ in bit pattern f rom the upper case by a single bit, which simplified [case-insensitive](https://en.wikipedia.org/wiki/Case-insensitive) character matching and the construction of keyboards and printers.** Hence the simplicity of the shift key :) - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 23:20:03 GMT, Martin Buchholz wrote: >> "oldest trick in the book" is a phrase that does not necessarily imply >> existence of an actual book! >> >> Let this evoke an image of a **personal** book of tricks that programmers in >> the 1960s might have recorded such techniques in. And the tricks were >> passed down across generations of programmers! > > More to the point, ASCII was obviously **designed** to allow you to uppercase > a lower case letter with a single instruction, so "the book" might have been > a draft standard before they scrubbed out the interesting history! Thanks Martin, I will from now on envision a stack of dusty punch cards with carefully scribbled notes on the back, barely held together with a dry and cracked rubber band. More to the point: Do you have an opinion on the appropriate level of documentation / comments for this kind of 'tricky' code? My goal is that future maintainers should not get scared by the code and help them understand why it became what it is. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 20:48:04 GMT, Martin Buchholz wrote: >> Perhaps @Martin-Buchholz could chime in and also tell us which book he found >> his ASCII trick in :) > > "oldest trick in the book" is a phrase that does not necessarily imply > existence of an actual book! > > Let this evoke an image of a **personal** book of tricks that programmers in > the 1960s might have recorded such techniques in. And the tricks were passed > down across generations of programmers! More to the point, ASCII was obviously **designed** to allow you to uppercase a lower case letter with a single instruction, so "the book" might have been a draft standard before they scrubbed out the interesting history! - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v9]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 15 additional commits since the last revision: - Rename unconventionally named local variable 'U' to 'upper' - Merge remote-tracking branch 'origin/master' into regionmatches-latin1-speedup - Add whitespace between methods - Merge branch 'master' into regionmatches-latin1-speedup - Remove whitespace following '(' - Revert "Spell fix for 'exhaustive' in comments in sun/text/resources" This reverts commit 5e9927a4b35e157fd3fa72fd2663c8bfbecf32bb. - Spell fix for 'exhaustive' in comments in sun/text/resources - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871 - Add @bug tag to EqualsIgnoreCase test for JDK-8302877 - Add clarifying comments and use more descriptive variable names in the latin1 verification EqualsIgnoreCase test - ... and 5 more: https://git.openjdk.org/jdk/compare/7025e4c5...6588ab0f - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/718fcead..6588ab0f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=07-08 Stats: 20 lines in 6 files changed: 0 ins; 7 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 14:22:30 GMT, Alan Bateman wrote: >> Eirik Bjorsnos has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Remove whitespace following '(' > > src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line > 163: > >> 161: return mapChar; >> 162: } >> 163: /** > > I assume you should insert a blank line between the two methods. This has been fixed now. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v8]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision: - Add whitespace between methods - Merge branch 'master' into regionmatches-latin1-speedup - Remove whitespace following '(' - Revert "Spell fix for 'exhaustive' in comments in sun/text/resources" This reverts commit 5e9927a4b35e157fd3fa72fd2663c8bfbecf32bb. - Spell fix for 'exhaustive' in comments in sun/text/resources - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871 - Add @bug tag to EqualsIgnoreCase test for JDK-8302877 - Add clarifying comments and use more descriptive variable names in the latin1 verification EqualsIgnoreCase test - Align whitespace to make example strings easier to read - Merge branch 'master' into regionmatches-latin1-speedup - ... and 3 more: https://git.openjdk.org/jdk/compare/ea62160f...718fcead - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/d7b1c164..718fcead Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=06-07 Stats: 4282 lines in 198 files changed: 2630 ins; 1129 del; 523 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 20:33:41 GMT, Eirik Bjorsnos wrote: >> Hi Alan, >> >> I thought I was clever by encoding the 'uppercaseness' in the variable name, >> but yeah I'll find a better name :) >> >> There is some precedent for using the 'ASCII trick' comment in the JDK. I >> found it in ZipFile.isMetaName, which is also where I first learned about >> this interesting relationship between ASCII (and also latin1) letters. >> >> The comment was first added by Martin Buchholz back in 2016 as part of >> JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating >> this code and Lance has some input about clarifying the comment. Martin then >> [chimed >> in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) >> to defend his comment: >> >>> I still like my ancient "ASCII trick" comment. >> >> I think this 'trick', whatever we call it, is sufficiently intricate that it >> deserves to be called out somehow and that we should not just casually >> bitmask with these magic constants without any discussion at all. >> >> An earlier iteration of this PR included a small essay in the javadoc of >> this method describing the layout and relationship of letters in latin1 and >> how we can apply that knowledge of the layout to implement the method. >> >> How would you feel about adding that description back to the Javadocs? This >> would then live close to the similarly implemented toUpperCase and >> toLowerCase methods currently under review in #12623. >> >> Here's the updated discussion included in the Javadoc: >> >> >> /** >> * Compares two latin1 code points, ignoring case considerations. >> * >> * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase >> * letters are found in the following code point ranges: >> * >> * 0x41-0x5A: Uppercase ASCII letters: A-Z >> * 0x61-0x7A: Lowercase ASCII letters: a-z >> * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis >> * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn >> * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis >> * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn >> * >> * While both ASCII letter ranges are contiguous, the latin1 ranges are >> not: >> * >> * The 'multiplication sign' 0xD7 splits the uppercase range in two. >> * The 'division sign' 0xF7 splits the lowercase range in two. >> * >> * Lowercase letters are found 32 positions (0x20) after their >> corresponding uppercase letter. >> * The 'division sign' and 'multiplication sign' have the same relative >> distance. >> * >> * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in >> the book' to >> * lowercase any letter by setting the bit: >> * >> * ('C' | 0x20) == 'c' >> * >> * By removing the bit, we can perform the uppercase operation: >> * >> * ('c' & 0xDF) == 'C' >> * >> * Applying this knowledge of the latin1 layout, we can test for >> equality ignoring case by >> * checking that the code points are either equal, or that one of the >> code points is a letter >> * which uppercases is the same as the uppercase of the other code point. >> * >> * @param b1 byte representing a latin1 code point >> * @param b2 another byte representing a latin1 code point >> * @return true if the two bytes are considered equals ignoring case in >> latin1 >> */ >> static boolean equalsIgnoreCase(byte b1, byte b2) { >> if (b1 == b2) { >> return true; >> } >> int upper = b1 & 0xDF; >> if (upper < 'A') { >> return false; // Low ASCII >> } >> return (upper <= 'Z' // In range A-Z >> || (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // >> ..or A-grave-Thorn, excl. multiplication >> && upper == (b2 & 0xDF); // b2 has same uppercase >> } > > Perhaps @Martin-Buchholz could chime in and also tell us which book he found > his ASCII trick in :) "oldest trick in the book" is a phrase that does not necessarily imply existence of an actual book! Let this evoke an image of a **personal** book of tricks that programmers in the 1960s might have recorded such techniques in. And the tricks were passed down across generations of programmers! - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 20:23:11 GMT, Eirik Bjorsnos wrote: >> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line >> 175: >> >>> 173: } >>> 174: // uppercase b1 using 'the oldest ASCII trick in the book' >>> 175: int U = b1 & 0xDF; >> >> I'm sure some people reading this comment will wonder which book :-) It >> might be better to drop that bit and if possible, find a better name for "U" >> as normally variables start with a lower case. > > Hi Alan, > > I thought I was clever by encoding the 'uppercaseness' in the variable name, > but yeah I'll find a better name :) > > There is some precedent for using the 'ASCII trick' comment in the JDK. I > found it in ZipFile.isMetaName, which is also where I first learned about > this interesting relationship between ASCII (and also latin1) letters. > > The comment was first added by Martin Buchholz back in 2016 as part of > JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating > this code and Lance has some input about clarifying the comment. Martin then > [chimed > in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) to > defend his comment: > >> I still like my ancient "ASCII trick" comment. > > I think this 'trick', whatever we call it, is sufficiently intricate that it > deserves to be called out somehow and that we should not just casually > bitmask with these magic constants without any discussion at all. > > An earlier iteration of this PR included a small essay in the javadoc of this > method describing the layout and relationship of letters in latin1 and how we > can apply that knowledge of the layout to implement the method. > > How would you feel about adding that description back to the Javadocs? This > would then live close to the similarly implemented toUpperCase and > toLowerCase methods currently under review in #12623. > > Here's the updated discussion included in the Javadoc: > > > /** > * Compares two latin1 code points, ignoring case considerations. > * > * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase > * letters are found in the following code point ranges: > * > * 0x41-0x5A: Uppercase ASCII letters: A-Z > * 0x61-0x7A: Lowercase ASCII letters: a-z > * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis > * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn > * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis > * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn > * > * While both ASCII letter ranges are contiguous, the latin1 ranges are > not: > * > * The 'multiplication sign' 0xD7 splits the uppercase range in two. > * The 'division sign' 0xF7 splits the lowercase range in two. > * > * Lowercase letters are found 32 positions (0x20) after their > corresponding uppercase letter. > * The 'division sign' and 'multiplication sign' have the same relative > distance. > * > * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in > the book' to > * lowercase any letter by setting the bit: > * > * ('C' | 0x20) == 'c' > * > * By removing the bit, we can perform the uppercase operation: > * > * ('c' & 0xDF) == 'C' > * > * Applying this knowledge of the latin1 layout, we can test for equality > ignoring case by > * checking that the code points are either equal, or that one of the > code points is a letter > * which uppercases is the same as the uppercase of the other code point. > * > * @param b1 byte representing a latin1 code point > * @param b2 another byte representing a latin1 code point > * @return true if the two bytes are considered equals ignoring case in > latin1 > */ > static boolean equalsIgnoreCase(byte b1, byte b2) { > if (b1 == b2) { > return true; > } > int upper = b1 & 0xDF; > if (upper < 'A') { > return false; // Low ASCII > } > return (upper <= 'Z' // In range A-Z > || (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // > ..or A-grave-Thorn, excl. multiplication > && upper == (b2 & 0xDF); // b2 has same uppercase > } Perhaps @Martin-Buchholz could chime in and also tell us which book he found his ASCII trick in :) - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 14:27:03 GMT, Alan Bateman wrote: >> Eirik Bjorsnos has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Remove whitespace following '(' > > src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line > 175: > >> 173: } >> 174: // uppercase b1 using 'the oldest ASCII trick in the book' >> 175: int U = b1 & 0xDF; > > I'm sure some people reading this comment will wonder which book :-) It might > be better to drop that bit and if possible, find a better name for "U" as > normally variables start with a lower case. Hi Alan, I thought I was clever by encoding the 'uppercaseness' in the variable name, but yeah I'll find a better name :) There is some precedent for using the 'ASCII trick' comment in the JDK. I found it in ZipFile.isMetaName, which is also where I first learned about this interesting relationship between ASCII (and also latin1) letters. The comment was first added by Martin Buchholz back in 2016 as part of JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating this code and Lance has some input about clarifying the comment. Martin then [chimed in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) to defend his comment: > I still like my ancient "ASCII trick" comment. I think this 'trick', whatever we call it, is sufficiently intricate that it deserves to be called out somehow and that we should not just casually bitmask with these magic constants without any discussion at all. An earlier iteration of this PR included a small essay in the javadoc of this method describing the layout and relationship of letters in latin1 and how we can apply that knowledge of the layout to implement the method. How would you feel about adding that description back to the Javadocs? This would then live close to the similarly implemented toUpperCase and toLowerCase methods currently under review in #12623. Here's the updated discussion included in the Javadoc: /** * Compares two latin1 code points, ignoring case considerations. * * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase * letters are found in the following code point ranges: * * 0x41-0x5A: Uppercase ASCII letters: A-Z * 0x61-0x7A: Lowercase ASCII letters: a-z * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn * * While both ASCII letter ranges are contiguous, the latin1 ranges are not: * * The 'multiplication sign' 0xD7 splits the uppercase range in two. * The 'division sign' 0xF7 splits the lowercase range in two. * * Lowercase letters are found 32 positions (0x20) after their corresponding uppercase letter. * The 'division sign' and 'multiplication sign' have the same relative distance. * * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in the book' to * lowercase any letter by setting the bit: * * ('C' | 0x20) == 'c' * * By removing the bit, we can perform the uppercase operation: * * ('c' & 0xDF) == 'C' * * Applying this knowledge of the latin1 layout, we can test for equality ignoring case by * checking that the code points are either equal, or that one of the code points is a letter * which uppercases is the same as the uppercase of the other code point. * * @param b1 byte representing a latin1 code point * @param b2 another byte representing a latin1 code point * @return true if the two bytes are considered equals ignoring case in latin1 */ static boolean equalsIgnoreCase(byte b1, byte b2) { if (b1 == b2) { return true; } int upper = b1 & 0xDF; if (upper < 'A') { return false; // Low ASCII } return (upper <= 'Z' // In range A-Z || (upper >= 0xC0 && upper <= 0XDE && upper != 0xD7)) // ..or A-grave-Thorn, excl. multiplication && upper == (b2 & 0xDF); // b2 has same uppercase } - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 11:14:13 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request incrementally with one additional > commit since the last revision: > > Remove whitespace following '(' src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 163: > 161: return mapChar; > 162: } > 163: /** I assume you should insert a blank line between the two methods. src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 175: > 173: } > 174: // uppercase b1 using 'the oldest ASCII trick in the book' > 175: int U = b1 & 0xDF; I'm sure some people reading this comment will wonder which book :-) It might be better to drop that bit and if possible, find a better name for "U" as normally variables start with a lower case. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
On Tue, 21 Feb 2023 11:14:13 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request incrementally with one additional > commit since the last revision: > > Remove whitespace following '(' Marked as reviewed by redestad (Reviewer). - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision: Remove whitespace following '(' - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/b8139961..d7b1c164 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=05-06 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v6]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision: Revert "Spell fix for 'exhaustive' in comments in sun/text/resources" This reverts commit 5e9927a4b35e157fd3fa72fd2663c8bfbecf32bb. - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/5e9927a4..b8139961 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=04-05 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v5]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision: Spell fix for 'exhaustive' in comments in sun/text/resources - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/03d3e2cb..5e9927a4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=03-04 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]
On Mon, 20 Feb 2023 16:23:32 GMT, Claes Redestad wrote: >> Is it not already in CharacterDataLatin1? >> >> Here is a comparison of relying on improvements in >> `CharacterDataLatin1.toUpperCase/toLowerCase` only vs. using >> `CharacterDataLatin1.equalsIgnoreCase`: >> >> Character.toUpperCase/toLowerCase only: >> >> >> Benchmark (codePoints) (size) Mode Cnt >> ScoreError Units >> RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 >> 1310.582 ± 84.777 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 >> 4.547 ± 0.545 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 >> 686.947 ± 11.850 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 >> 3.836 ± 0.634 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 >> 2107.219 ± 17.662 ns/op >> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 >> 4.924 ± 0.829 ns/op >> >> >> CharacterDataLatin1.equalsIgnoreCase: >> >> >> Benchmark (codePoints) (size) Mode Cnt >> ScoreError Units >> RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 >> 742.467 ± 34.490 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 >> 3.960 ± 0.046 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 >> 361.158 ± 37.096 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 >> 4.039 ± 0.521 ns/op >> RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 >> 1158.091 ± 41.617 ns/op >> RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 >> 4.358 ± 0.123 ns/op > > Oops, I lost context and thought this was in `StringLatin1`. > > Thanks for running the numbers with #12623. Looks like you're getting big > enough of an improvement on top. Yes, seems `equalsIgnoreCase` carries its weight. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]
On Mon, 20 Feb 2023 16:16:45 GMT, Eirik Bjorsnos wrote: >> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line >> 170: >> >>> 168: * @return true if the two bytes are considered equals ignoring >>> case in latin1 >>> 169: */ >>> 170: static boolean equalsIgnoreCase(byte b1, byte b2) { >> >> Perhaps put this in `CharacterDataLatin1`, keeping it close to >> toLowerCase/toUpperCase that you're changing to use similar logic with >> #12623 >> >> If you apply #12623 first - how much difference does this make on the micro >> you're adding with this PR? > > Is it not already in CharacterDataLatin1? > > Here is a comparison of relying on improvements in > `CharacterDataLatin1.toUpperCase/toLowerCase` only vs. using > `CharacterDataLatin1.equalsIgnoreCase`: > > Character.toUpperCase/toLowerCase only: > > > Benchmark (codePoints) (size) Mode Cnt > ScoreError Units > RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 > 1310.582 ± 84.777 ns/op > RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 > 4.547 ± 0.545 ns/op > RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 > 686.947 ± 11.850 ns/op > RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 > 3.836 ± 0.634 ns/op > RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 > 2107.219 ± 17.662 ns/op > RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 > 4.924 ± 0.829 ns/op > > > CharacterDataLatin1.equalsIgnoreCase: > > > Benchmark (codePoints) (size) Mode Cnt > ScoreError Units > RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 > 742.467 ± 34.490 ns/op > RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 > 3.960 ± 0.046 ns/op > RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 > 361.158 ± 37.096 ns/op > RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 > 4.039 ± 0.521 ns/op > RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 > 1158.091 ± 41.617 ns/op > RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 > 4.358 ± 0.123 ns/op Oops, I lost context and thought this was in `StringLatin1`. Thanks for running the numbers with #12623. Looks like you're getting big enough of an improvement on top. - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]
On Mon, 20 Feb 2023 15:40:09 GMT, Claes Redestad wrote: >> Eirik Bjorsnos has updated the pull request incrementally with two >> additional commits since the last revision: >> >> - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871 >> - Add @bug tag to EqualsIgnoreCase test for JDK-8302877 > > src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line > 170: > >> 168: * @return true if the two bytes are considered equals ignoring >> case in latin1 >> 169: */ >> 170: static boolean equalsIgnoreCase(byte b1, byte b2) { > > Perhaps put this in `CharacterDataLatin1`, keeping it close to > toLowerCase/toUpperCase that you're changing to use similar logic with #12623 > > If you apply #12623 first - how much difference does this make on the micro > you're adding with this PR? Is it not already in CharacterDataLatin1? Here is a comparison of relying on improvements in `CharacterDataLatin1.toUpperCase/toLowerCase` only vs. using `CharacterDataLatin1.equalsIgnoreCase`: Character.toUpperCase/toLowerCase only: Benchmark (codePoints) (size) Mode Cnt ScoreError Units RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 1310.582 ± 84.777 ns/op RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 4.547 ± 0.545 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 686.947 ± 11.850 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 3.836 ± 0.634 ns/op RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 2107.219 ± 17.662 ns/op RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 4.924 ± 0.829 ns/op CharacterDataLatin1.equalsIgnoreCase: Benchmark (codePoints) (size) Mode Cnt ScoreError Units RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 742.467 ± 34.490 ns/op RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 3.960 ± 0.046 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 361.158 ± 37.096 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 4.039 ± 0.521 ns/op RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 1158.091 ± 41.617 ns/op RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 4.358 ± 0.123 ns/op - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]
On Mon, 20 Feb 2023 14:45:09 GMT, Eirik Bjorsnos wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > Eirik Bjorsnos has updated the pull request incrementally with two additional > commits since the last revision: > > - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871 > - Add @bug tag to EqualsIgnoreCase test for JDK-8302877 src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 170: > 168: * @return true if the two bytes are considered equals ignoring case > in latin1 > 169: */ > 170: static boolean equalsIgnoreCase(byte b1, byte b2) { Perhaps put this in `CharacterDataLatin1`, keeping it close to toLowerCase/toUpperCase that you're changing to use similar logic with #12623 If you apply #12623 first - how much difference does this make on the micro you're adding with this PR? - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v4]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with two additional commits since the last revision: - Add @bug tag to EqualsIgnoreCase test for correct issue JDK-8302871 - Add @bug tag to EqualsIgnoreCase test for JDK-8302877 - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/84517102..03d3e2cb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v3]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request incrementally with one additional commit since the last revision: Add clarifying comments and use more descriptive variable names in the latin1 verification EqualsIgnoreCase test - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/59c42298..84517102 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=01-02 Stats: 16 lines in 1 file changed: 5 ins; 2 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v2]
> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Eirik Bjorsnos has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: - Align whitespace to make example strings easier to read - Merge branch 'master' into regionmatches-latin1-speedup - Exhaustive verification needs to cover the case b1 == b2 - Move multiplication exclusion to the lat1 range branch - Speed up StringLatin1.regionMatchesCI by applying the 'oldest ASCII trick in the book' - Changes: - all: https://git.openjdk.org/jdk/pull/12632/files - new: https://git.openjdk.org/jdk/pull/12632/files/a583bcd6..59c42298 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=00-01 Stats: 57495 lines in 1316 files changed: 24512 ins; 15682 del; 17301 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI
On Sat, 18 Feb 2023 19:45:34 GMT, Eirik Bjorsnos wrote: >> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line >> 181: >> >>> 179: return ( U <= 'Z' // In range A-Z >>> 180: || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or >>> A-grave-Thorn, excl. multiplication >>> 181: && U == (b2 & 0xDF); // b2 has same uppercase >> >> I'm curious if the order of comparisons could alter performance to a small >> degree. For example, it might be interesting to compare various permutations >> like below to short circuit reject unequal uppercased b2 >> >> Suggestion: >> >> // uppercase b1 using 'the oldest ASCII trick in the book' >> int U = b1 & 0xDF; >> return (U == (b2 & 0xDF)) >> && ((U >= 'A' && U <= 'Z') // In range A-Z >> || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or >> A-grave-Thorn, excl. multiplication > > Yeah, as you noticed this code is tricky and sensitive to the order of > operations. I did some quite extensive exploration before ending on the > current structure. This particular one seems to improve rejection somewhat at > the cost of matches. > > Since rejection is relatively speaking already very fast, I think we should > favour fast matching here. > > Results: > > > enchmark (codePoints) (size) Mode Cnt > ScoreError Units > RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 > 917.796 ± 20.285 ns/op > RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 > 4.367 ± 0.348 ns/op > RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 > 399.656 ± 10.703 ns/op > RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 > 4.361 ± 0.664 ns/op > RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 > 1384.443 ± 22.199 ns/op > RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 > 4.119 ± 0.451 ns/op Thanks for confirming - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI
On Sat, 18 Feb 2023 19:04:24 GMT, David Schlosnagle wrote: >> This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying >> 'the oldest ASCII trick in the book'. >> >> The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two >> latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is >> updated to use `equalsIgnoreCase` >> >> To verify the correctness of `equalsIgnoreCase`, a new test is added to >> `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 >> code point pairs have an `equalsIgnoreCase` consistent with >> Character.toUpperCase, Character.toLowerCase. >> >> Performance is tested for matching and mismatching cases of code point pairs >> picked from the ASCII letter, ASCII number and latin1 letter ranges. Results >> in the first comment below. > > src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line > 181: > >> 179: return ( U <= 'Z' // In range A-Z >> 180: || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or >> A-grave-Thorn, excl. multiplication >> 181: && U == (b2 & 0xDF); // b2 has same uppercase > > I'm curious if the order of comparisons could alter performance to a small > degree. For example, it might be interesting to compare various permutations > like below to short circuit reject unequal uppercased b2 > > Suggestion: > > // uppercase b1 using 'the oldest ASCII trick in the book' > int U = b1 & 0xDF; > return (U == (b2 & 0xDF)) > && ((U >= 'A' && U <= 'Z') // In range A-Z > || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or > A-grave-Thorn, excl. multiplication Yeah, as you noticed this code is tricky and sensitive to the order of operations. I did some quite extensive exploration before ending on the current structure. This particular one seems to improve rejection somewhat at the cost of matches. Since rejection is relatively speaking already very fast, I think we should favour fast matching here. Results: enchmark (codePoints) (size) Mode Cnt ScoreError Units RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 917.796 ± 20.285 ns/op RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 4.367 ± 0.348 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 399.656 ± 10.703 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 4.361 ± 0.664 ns/op RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 1384.443 ± 22.199 ns/op RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 4.119 ± 0.451 ns/op - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI
On Sat, 18 Feb 2023 09:21:25 GMT, Eirik Bjorsnos wrote: > This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 181: > 179: return ( U <= 'Z' // In range A-Z > 180: || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or > A-grave-Thorn, excl. multiplication > 181: && U == (b2 & 0xDF); // b2 has same uppercase I'm curious if the order of comparisons could alter performance to a small degree. For example, it might be interesting to compare various permutations like below to short circuit reject unequal uppercased b2 Suggestion: // uppercase b1 using 'the oldest ASCII trick in the book' int U = b1 & 0xDF; return (U == (b2 & 0xDF)) && ((U >= 'A' && U <= 'Z') // In range A-Z || (U >= 0xC0 && U <= 0XDE && U != 0xD7)) // ..or A-grave-Thorn, excl. multiplication - PR: https://git.openjdk.org/jdk/pull/12632
Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI
On Sat, 18 Feb 2023 09:21:25 GMT, Eirik Bjorsnos wrote: > This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying > 'the oldest ASCII trick in the book'. > > The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two > latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is > updated to use `equalsIgnoreCase` > > To verify the correctness of `equalsIgnoreCase`, a new test is added to > `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 > code point pairs have an `equalsIgnoreCase` consistent with > Character.toUpperCase, Character.toLowerCase. > > Performance is tested for matching and mismatching cases of code point pairs > picked from the ASCII letter, ASCII number and latin1 letter ranges. Results > in the first comment below. Benchmark results: Baseline: Benchmark (codePoints) (size) Mode Cnt ScoreError Units RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 2216.525 ± 79.626 ns/op RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 5.049 ± 0.044 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 708.977 ± 19.381 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 3.726 ± 0.036 ns/op RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 2134.499 ± 23.064 ns/op RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 4.227 ± 0.070 ns/op Patch: Benchmark (codePoints) (size) Mode Cnt ScoreError Units RegionMatchesIC.Latin1.regionMatchesIC ascii-match1024 avgt 15 809.729 ± 40.257 ns/op RegionMatchesIC.Latin1.regionMatchesIC ascii-mismatch1024 avgt 15 4.334 ± 0.031 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-match1024 avgt 15 370.814 ± 39.790 ns/op RegionMatchesIC.Latin1.regionMatchesIC number-mismatch1024 avgt 15 3.766 ± 0.072 ns/op RegionMatchesIC.Latin1.regionMatchesIC lat1-match1024 avgt 15 1247.979 ± 7.826 ns/op RegionMatchesIC.Latin1.regionMatchesIClat1-mismatch1024 avgt 15 4.819 ± 0.026 ns/op - PR: https://git.openjdk.org/jdk/pull/12632
RFR: 8302871: Speed up StringLatin1.regionMatchesCI
This PR suggests we can speed up `StringLatin1.regionMatchesCI` by applying 'the oldest ASCII trick in the book'. The new static method `CharacterDataLatin1.equalsIgnoreCase` compares two latin1 bytes for equality ignoring case. `StringLatin1.regionMatchesCI` is updated to use `equalsIgnoreCase` To verify the correctness of `equalsIgnoreCase`, a new test is added to `EqualsIgnoreCase` with an exhaustive verification that all 256x256 latin1 code point pairs have an `equalsIgnoreCase` consistent with Character.toUpperCase, Character.toLowerCase. Performance is tested for matching and mismatching cases of code point pairs picked from the ASCII letter, ASCII number and latin1 letter ranges. Results in the first comment below. - Commit messages: - Exhaustive verification needs to cover the case b1 == b2 - Move multiplication exclusion to the lat1 range branch - Speed up StringLatin1.regionMatchesCI by applying the 'oldest ASCII trick in the book' Changes: https://git.openjdk.org/jdk/pull/12632/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=12632&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8302871 Stats: 158 lines in 4 files changed: 148 ins; 5 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/12632.diff Fetch: git fetch https://git.openjdk.org/jdk pull/12632/head:pull/12632 PR: https://git.openjdk.org/jdk/pull/12632
Re: Speed up StringLatin1.regionMatchesCI
RFE filed: https://bugs.openjdk.org/browse/JDK-8302871 /Claes 18 feb. 2023 kl. 10:28 skrev Eirik Bjørsnøs mailto:eir...@gmail.com>>: Hi, The following PR suggests we can speed up StringLatin1.regionMatchesCI by applying 'the oldest ASCII trick in the book': https://github.com/openjdk/jdk/pull/12632 Thanks, Eirik.
Speed up StringLatin1.regionMatchesCI
Hi, The following PR suggests we can speed up StringLatin1.regionMatchesCI by applying 'the oldest ASCII trick in the book': https://github.com/openjdk/jdk/pull/12632 Thanks, Eirik.