On Fri, 27 Jan 2023 18:31:50 GMT, Scott Gibbons <d...@openjdk.org> wrote:

>> Added code for Base64 acceleration (encode and decode) which will accelerate 
>> ~4x for AVX2 platforms.
>> 
>> Encode performance:
>> **Old:**
>> 
>> Benchmark                      (maxNumBytes)   Mode  Cnt     Score   Error   
>> Units
>> Base64Encode.testBase64Encode           1024  thrpt    3  4309.439 ± 2.632  
>> ops/ms
>> 
>> 
>> **New:**
>> 
>> Benchmark                      (maxNumBytes)   Mode  Cnt      Score     
>> Error   Units
>> Base64Encode.testBase64Encode           1024  thrpt    3  24211.397 ± 
>> 102.026  ops/ms
>> 
>> 
>> Decode performance:
>> **Old:**
>> 
>> Benchmark                      (errorIndex)  (lineSize)  (maxNumBytes)   
>> Mode  Cnt     Score    Error   Units
>> Base64Decode.testBase64Decode           144           4           1024  
>> thrpt    3  3961.768 ± 93.409  ops/ms
>> 
>> **New:**
>> Benchmark                      (errorIndex)  (lineSize)  (maxNumBytes)   
>> Mode  Cnt      Score    Error   Units
>> Base64Decode.testBase64Decode           144           4           1024  
>> thrpt    3  14738.051 ± 24.383  ops/ms
>
> Scott Gibbons has updated the pull request with a new target base due to a 
> merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains 13 additional 
> commits since the last revision:
> 
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Merge branch 'Base64-AVX2' of https://github.com/asgibbons/jdk into 
> Base64-AVX2
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Address review comment
>  - Remove whitespace
>  - Fix wrong register usage
>  - Working version of Base64 decode with AVX2 (4x perf improvement). No URL 
> support
>  - Merge branch 'Base64-AVX2' of https://github.com/asgibbons/jdk into 
> Base64-AVX2
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/50dc1196...3e66f7be

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2159:

> 2157: }
> 2158: 
> 2159: address StubGenerator::base64_AVX2_tables_addr() {

A casual reader might wonder why there's 3 other avx2-tables and then this, so 
for readability it'd be nice to add "decode" to the name of this new table.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2643:

> 2641:     // Handle isURL / MIME?!?!  r12 is used for length calculation 
> (from out)
> 2642:     //
> 2643:     // rbx is out, r12 is saved out, rdx is size, rsi is src

It seems that on windows `r12` is in use, see line 2323. GHA seem to be having 
some trouble finishing Windows testing on time - could there be some issue here?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2647:

> 2645:     ////////////////////////////////////////////
> 2646: 
> 2647:     // *************** NEEDS TO BE FIXED ************

While it's fine to leave implementation of `getMimeDecoder` and `getUrlDecoder` 
for a follow-up, I think these comments needs to be cleaned up. E.g. a 
follow-up RFE filed and referenced.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2651:

> 2649:     __ jcc(Assembler::notZero, L_tailProc);
> 2650: 
> 2651:     __ cmpl(length, 44);

Perform `length` checks first to avoid unnecessary branches on small inputs?

Ideal might be to move this length check up just before the `_cmpl(length, 
128)` in the AVX-512 block, so that if `AVX=3` short inputs branch directly to 
the scalar tail procedure without jumping around. This might also apply to the 
encode stub, though that's pre-existing.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2673:

> 2671:     __ vmovdqu(xmm13, Address(r13, 0xc8));
> 2672:     __ vmovdqu(xmm12, Address(r13, 0x08));
> 2673:     __ jmp(L_enterLoop);

This got me curious (sorry!) and maybe there's something going on here that I'm 
not getting... But why have you split the loop apart and jump into the middle 
of it? It'd seem not doing this would be more straightforward, with no 
difference semantically and one less jump? (align32 should just translate to a 
narrow nop instruction, if anything)

-------------

PR: https://git.openjdk.org/jdk/pull/12126

Reply via email to