Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Claes Redestad
On Thu, 9 Feb 2023 11:46:42 GMT, Eirik Bjorsnos  wrote:

> > In addition to - or instead of - an `equals` shortcut then if coders are 
> > the same we could use `ArraysSupport.mismatch` which should get similar 
> > speed and help more generally.
> 
> ..and if String had (an optimized) mismatch method, then I bet all or most of 
> the comparison methods (equals, compareTo, endsWith, startsWith, 
> regionMatches) could delegate to that :-)

A private mismatch method might make sense. A public method would require 
making a stronger case, I think, e.g. showing use cases a mismatcher would 
solve (elegantly, performantly) that can't be expressed with existing methods.

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Eirik Bjorsnos
On Thu, 9 Feb 2023 11:39:18 GMT, Claes Redestad  wrote:

> Yes, I'll file a PR to see if we can make `startsWith` a bit sharper

Thanks! I pushed the change to ZipCoder.compare.

> In addition to - or instead of - an `equals` shortcut then if coders are the 
> same we could use `ArraysSupport.mismatch` which should get similar speed and 
> help more generally.

..and if String had (an optimized) mismatch method, then I bet all or most of 
the comparison methods (equals, compareTo, endsWith, startsWith, regionMatches) 
could delegate to that  :-)

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Eirik Bjorsnos
On Thu, 9 Feb 2023 11:17:54 GMT, Eirik Bjorsnos  wrote:

> It helps... a lot!

I guess an update to String.startsWith should be handled separately from this 
PR.

If startsWith will be optimized, then I'm happy to update this PR to use 
startsWith as you suggested. Then ZipCoder will speed up for free with any 
later startsWith optimization.

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Eirik Bjorsnos
On Thu, 9 Feb 2023 11:07:17 GMT, Claes Redestad  wrote:

> Could be interesting to see if special-casing `startsWith` to call `equals` 
> when possible would help:

It helps... a lot!


Benchmark (size)  Mode  CntScore   Error  Units
ZipFileGetEntry.getEntryHit  512  avgt   15   72.800 ± 3.426  ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15   73.434 ± 0.671  ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  107.538 ± 1.309  ns/op
ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  110.775 ± 2.252  ns/op
ZipFileGetEntry.getEntryMiss 512  avgt   15   22.022 ± 0.170  ns/op
ZipFileGetEntry.getEntryMiss1024  avgt   15   23.242 ± 1.569  ns/op
ZipFileGetEntry.getEntryMissUncached 512  avgt   15   50.946 ± 0.458  ns/op
ZipFileGetEntry.getEntryMissUncached1024  avgt   15   57.856 ± 3.021  ns/op

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Claes Redestad
On Wed, 8 Feb 2023 16:36:16 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update the sameHashAndLengthDirLookup to use ASCII example colliding paths, 
> allowing the test to run over all charsets

Oh -- fun! Perhaps `startsWith` doesn't take advantage of the optimized 
intrinsic for `equals`. 

Could be interesting to see if special-casing `startsWith` to call `equals` 
when possible would help:

diff --git 

Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Eirik Bjorsnos
On Thu, 9 Feb 2023 10:05:06 GMT, Claes Redestad  wrote:

> Should be a win and avoids the need for a new mismatch method here.

Interestingly, this is not a win:

PR:


Benchmark (size)  Mode  CntScore   Error  Units
ZipFileGetEntry.getEntryHit  512  avgt   15   73.633 ± 4.349  ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15   74.477 ± 1.667  ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  108.352 ± 1.598  ns/op
ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  110.425 ± 1.867  ns/op
ZipFileGetEntry.getEntryMiss 512  avgt   15   21.921 ± 0.251  ns/op
ZipFileGetEntry.getEntryMiss1024  avgt   15   22.836 ± 0.279  ns/op
ZipFileGetEntry.getEntryMissUncached 512  avgt   15   51.890 ± 2.289  ns/op
ZipFileGetEntry.getEntryMissUncached1024  avgt   15   56.722 ± 0.701  ns/op


Claes:


BBenchmark (size)  Mode  CntScore   Error  Units
ZipFileGetEntry.getEntryHit  512  avgt   15   84.518 ± 4.440  ns/op
ZipFileGetEntry.getEntryHit 1024  avgt   15   85.359 ± 1.484  ns/op
ZipFileGetEntry.getEntryHitUncached  512  avgt   15  117.003 ± 1.481  ns/op
ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  119.026 ± 1.073  ns/op
ZipFileGetEntry.getEntryMiss 512  avgt   15   22.027 ± 0.188  ns/op
ZipFileGetEntry.getEntryMiss1024  avgt   15   22.929 ± 0.282  ns/op
ZipFileGetEntry.getEntryMissUncached 512  avgt   15   52.513 ± 1.859  ns/op
ZipFileGetEntry.getEntryMissUncached1024  avgt   15   56.425 ± 0.699  ns/op


(I forced ZipCoder.compare here by simply removing UTFZipCoder.compare)

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Claes Redestad
On Wed, 8 Feb 2023 16:36:16 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update the sameHashAndLengthDirLookup to use ASCII example colliding paths, 
> allowing the test to run over all charsets

You could skip the `equals`:

if (decoded.startsWith(str)) {
if (decoded.length() == str.length()) {
return Comparison.EXACT_MATCH;
} else if (addSlash
&& 

Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-09 Thread Eirik Bjorsnos
On Wed, 8 Feb 2023 16:36:16 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update the sameHashAndLengthDirLookup to use ASCII example colliding paths, 
> allowing the test to run over all charsets

A somewhat unrelated observation:

If String had a mismatch method, this last regression would probably not have 
been introduced, since the logic would align better with UTF8ZipCoder which 
used Arrays.mismatch.


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-08 Thread Eirik Bjorsnos
On Wed, 8 Feb 2023 17:11:19 GMT, Claes Redestad  wrote:

> Great! Extending coverage to provoke the issue on most charsets is good, and 
> it should guard UTF-8 from regressing too - no?

It also guards UTFZipCoder from this particular regression, yes.

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-08 Thread Claes Redestad
On Wed, 8 Feb 2023 16:36:16 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update the sameHashAndLengthDirLookup to use ASCII example colliding paths, 
> allowing the test to run over all charsets

Marked as reviewed by redestad (Reviewer).

Great! Extending coverage to provoke the issue on most charsets is good, and it 
should guard UTF-8 from regressing too - no?

-

PR: 

Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v7]

2023-02-08 Thread Eirik Bjorsnos
> After finding a hash match, getEntryPos needs to compare the lookup name up 
> to the encoded entry name in the CEN. This comparison is done by decoding the 
> entry name into a String. The names can then be compared using the String 
> API. This decoding step adds a significat cost to this method.
> 
> This PR suggest to update the string comparison such that in the common case 
> where  both the lookup name and the entry name are encoded in 
> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
> instead be compared direcly. 
> 
> ZipCoder is updated with a new method to compare a string with an encoded 
> byte array range.  The default implementation decodes to string (like the 
> current code), while the UTF-8 implementation uses 
> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
> Arrays.mismatch for comparison with or without matching trailing slashes. 
> 
> Additionally, this PR suggest to make the following updates to getEntryPos:
> 
> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
> already checks this and will throws IAE for invalid UTF-8). This seems to 
> give a 3-4% speedup on micros)
> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
> found no existing test coverage for this)
> - The recursion when looking for "name/" matches is replaced with iteration. 
> We keep track of any "name/" match and return it at the end of the search. (I 
> feel this is easier to follow and it also gives a ~30% speedup for addSlash 
> lookups with no regression on regular lookups)
> 
> (My though is that including these additional updates in this PR might reduce 
> reviewer overhead given that it touches the exact same code. I might be wrong 
> on this, please advise :)
> 
> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
> to use xalan.jar):
> 
> Baseline:
> 
> Benchmark (size)  Mode  CntScore   Error  
> Units
> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
> ns/op
> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
> ns/op
> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
> ns/op
> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
> ns/op
> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
> ns/op
> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
> ns/op
> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
> ns/op
> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
> ns/op
> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
> ns/op
> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
> ns/op
> 
> 
> 
> PR:
> 
> 
> Benchmark (size)  Mode  CntScore   Error  
> Units
> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
> ns/op
> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
> ns/op
> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
> ns/op
> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
> ns/op
> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
> ns/op
> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
> ns/op
> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
> ns/op
> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
> ns/op
> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
> ns/op
> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
> ns/op
> 
> 
> To assess the impact on startup/warmup, I made a main method class which 
> measures the total time of calling ZipFile.getEntry for all entries in the 
> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
> (time in micros):
> 
> 
> Percentile Baseline Patch
> 50 %  23155 21149
> 75 %  23598 21454
> 90 %  23989 21691
> 95 %  24238 21973
> 99 %  25270 22446
> STDEV   792   549
> Count   500   500

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Update the sameHashAndLengthDirLookup to use ASCII example colliding paths, 
allowing the test to run over all charsets

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12290/files
  - new: https://git.openjdk.org/jdk/pull/12290/files/ac7c4c7a..3cb77dba

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=12290=06
 - incr: https://webrevs.openjdk.org/?repo=jdk=12290=05-06

  Stats: 13 lines in 1 file changed: 8 ins; 1 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/12290.diff