Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v6]

2023-02-08 Thread Eirik Bjorsnos
On Wed, 8 Feb 2023 15:31:15 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Align entry name whitespace

I found the following ASCII, same length, hash code colliding dir names which 
we can use instead:


_1637461950/
_-408231241/


This allows the test to run with all charsets. Fails for all charsets expect 
UTF-8.

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v6]

2023-02-08 Thread Eirik Bjorsnos
On Wed, 8 Feb 2023 15:31:15 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Align entry name whitespace

It takes unicode-charsets as a DataProvider input parameter, so that should 
already be covered? Or did you mean something else?

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v6]

2023-02-08 Thread Claes Redestad
On Wed, 8 Feb 2023 15:31:15 GMT, Eirik Bjorsnos  wrote:

>> After finding a hash match, getEntryPos needs to compare the lookup name up 
>> to the encoded entry name in the CEN. This comparison is done by decoding 
>> the entry name into a String. The names can then be compared using the 
>> String API. This decoding step adds a significat cost to this method.
>> 
>> This PR suggest to update the string comparison such that in the common case 
>> where  both the lookup name and the entry name are encoded in 
>> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
>> instead be compared direcly. 
>> 
>> ZipCoder is updated with a new method to compare a string with an encoded 
>> byte array range.  The default implementation decodes to string (like the 
>> current code), while the UTF-8 implementation uses 
>> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
>> Arrays.mismatch for comparison with or without matching trailing slashes. 
>> 
>> Additionally, this PR suggest to make the following updates to getEntryPos:
>> 
>> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
>> already checks this and will throws IAE for invalid UTF-8). This seems to 
>> give a 3-4% speedup on micros)
>> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
>> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
>> found no existing test coverage for this)
>> - The recursion when looking for "name/" matches is replaced with iteration. 
>> We keep track of any "name/" match and return it at the end of the search. 
>> (I feel this is easier to follow and it also gives a ~30% speedup for 
>> addSlash lookups with no regression on regular lookups)
>> 
>> (My though is that including these additional updates in this PR might 
>> reduce reviewer overhead given that it touches the exact same code. I might 
>> be wrong on this, please advise :)
>> 
>> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
>> to use xalan.jar):
>> 
>> Baseline:
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
>> ns/op
>> 
>> 
>> 
>> PR:
>> 
>> 
>> Benchmark (size)  Mode  CntScore   Error  
>> Units
>> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
>> ns/op
>> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
>> ns/op
>> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
>> ns/op
>> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
>> ns/op
>> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
>> ns/op
>> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
>> ns/op
>> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
>> ns/op
>> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
>> ns/op
>> 
>> 
>> To assess the impact on startup/warmup, I made a main method class which 
>> measures the total time of calling ZipFile.getEntry for all entries in the 
>> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
>> (time in micros):
>> 
>> 
>> Percentile Baseline Patch
>> 50 %  23155 21149
>> 75 %  23598 21454
>> 90 %  23989 21691
>> 95 %  24238 21973
>> 99 %  25270 22446
>> STDEV   792   549
>> Count   500   500
>
> Eirik Bjorsnos has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Align entry name whitespace

Good! Would it be too much to ask for you to try and construct such a test for 
UTF-8 zip files, too?

-

PR: https://git.openjdk.org/jdk/pull/12290


Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v6]

2023-02-08 Thread Eirik Bjorsnos
> After finding a hash match, getEntryPos needs to compare the lookup name up 
> to the encoded entry name in the CEN. This comparison is done by decoding the 
> entry name into a String. The names can then be compared using the String 
> API. This decoding step adds a significat cost to this method.
> 
> This PR suggest to update the string comparison such that in the common case 
> where  both the lookup name and the entry name are encoded in 
> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
> instead be compared direcly. 
> 
> ZipCoder is updated with a new method to compare a string with an encoded 
> byte array range.  The default implementation decodes to string (like the 
> current code), while the UTF-8 implementation uses 
> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
> Arrays.mismatch for comparison with or without matching trailing slashes. 
> 
> Additionally, this PR suggest to make the following updates to getEntryPos:
> 
> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
> already checks this and will throws IAE for invalid UTF-8). This seems to 
> give a 3-4% speedup on micros)
> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
> found no existing test coverage for this)
> - The recursion when looking for "name/" matches is replaced with iteration. 
> We keep track of any "name/" match and return it at the end of the search. (I 
> feel this is easier to follow and it also gives a ~30% speedup for addSlash 
> lookups with no regression on regular lookups)
> 
> (My though is that including these additional updates in this PR might reduce 
> reviewer overhead given that it touches the exact same code. I might be wrong 
> on this, please advise :)
> 
> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
> to use xalan.jar):
> 
> Baseline:
> 
> Benchmark (size)  Mode  CntScore   Error  
> Units
> ZipFileGetEntry.getEntryHit  512  avgt   15   74.941 ± 1.004  
> ns/op
> ZipFileGetEntry.getEntryHit 1024  avgt   15   84.943 ± 1.320  
> ns/op
> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  120.371 ± 2.386  
> ns/op
> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  126.128 ± 1.075  
> ns/op
> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.818 ± 0.838  
> ns/op
> ZipFileGetEntry.getEntryMiss1024  avgt   15   29.762 ± 5.998  
> ns/op
> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   59.405 ± 0.545  
> ns/op
> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   71.840 ± 2.455  
> ns/op
> ZipFileGetEntry.getEntrySlash512  avgt   15  135.621 ± 4.341  
> ns/op
> ZipFileGetEntry.getEntrySlash   1024  avgt   15  134.190 ± 2.141  
> ns/op
> 
> 
> 
> PR:
> 
> 
> Benchmark (size)  Mode  CntScore   Error  
> Units
> ZipFileGetEntry.getEntryHit  512  avgt   15   62.267 ± 1.329  
> ns/op
> ZipFileGetEntry.getEntryHit 1024  avgt   15   72.916 ± 2.428  
> ns/op
> ZipFileGetEntry.getEntryHitUncached  512  avgt   15  101.630 ± 1.154  
> ns/op
> ZipFileGetEntry.getEntryHitUncached 1024  avgt   15  113.161 ± 0.502  
> ns/op
> ZipFileGetEntry.getEntryMiss 512  avgt   15   23.003 ± 1.191  
> ns/op
> ZipFileGetEntry.getEntryMiss1024  avgt   15   23.236 ± 1.114  
> ns/op
> ZipFileGetEntry.getEntryMissUncached 512  avgt   15   56.781 ± 1.505  
> ns/op
> ZipFileGetEntry.getEntryMissUncached1024  avgt   15   67.767 ± 1.963  
> ns/op
> ZipFileGetEntry.getEntrySlash512  avgt   15   73.745 ± 2.717  
> ns/op
> ZipFileGetEntry.getEntrySlash   1024  avgt   15   75.784 ± 1.051  
> ns/op
> 
> 
> To assess the impact on startup/warmup, I made a main method class which 
> measures the total time of calling ZipFile.getEntry for all entries in the 
> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
> (time in micros):
> 
> 
> Percentile Baseline Patch
> 50 %  23155 21149
> 75 %  23598 21454
> 90 %  23989 21691
> 95 %  24238 21973
> 99 %  25270 22446
> STDEV   792   549
> Count   500   500

Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Align entry name whitespace

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/12290/files
  - new: https://git.openjdk.org/jdk/pull/12290/files/f9285261..ac7c4c7a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk=12290=05
 - incr: https://webrevs.openjdk.org/?repo=jdk=12290=04-05

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/12290.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12290/head:pull/12290

PR: