On Fri, 27 Jan 2023 18:31:50 GMT, Scott Gibbons <d...@openjdk.org> wrote:
>> Added code for Base64 acceleration (encode and decode) which will accelerate >> ~4x for AVX2 platforms. >> >> Encode performance: >> **Old:** >> >> Benchmark (maxNumBytes) Mode Cnt Score Error >> Units >> Base64Encode.testBase64Encode 1024 thrpt 3 4309.439 ± 2.632 >> ops/ms >> >> >> **New:** >> >> Benchmark (maxNumBytes) Mode Cnt Score >> Error Units >> Base64Encode.testBase64Encode 1024 thrpt 3 24211.397 ± >> 102.026 ops/ms >> >> >> Decode performance: >> **Old:** >> >> Benchmark (errorIndex) (lineSize) (maxNumBytes) >> Mode Cnt Score Error Units >> Base64Decode.testBase64Decode 144 4 1024 >> thrpt 3 3961.768 ± 93.409 ops/ms >> >> **New:** >> Benchmark (errorIndex) (lineSize) (maxNumBytes) >> Mode Cnt Score Error Units >> Base64Decode.testBase64Decode 144 4 1024 >> thrpt 3 14738.051 ± 24.383 ops/ms > > Scott Gibbons has updated the pull request with a new target base due to a > merge or a rebase. The incremental webrev excludes the unrelated changes > brought in by the merge/rebase. The pull request contains 13 additional > commits since the last revision: > > - Merge branch 'openjdk:master' into Base64-AVX2 > - Merge branch 'openjdk:master' into Base64-AVX2 > - Merge branch 'openjdk:master' into Base64-AVX2 > - Merge branch 'Base64-AVX2' of https://github.com/asgibbons/jdk into > Base64-AVX2 > - Merge branch 'openjdk:master' into Base64-AVX2 > - Address review comment > - Remove whitespace > - Fix wrong register usage > - Working version of Base64 decode with AVX2 (4x perf improvement). No URL > support > - Merge branch 'Base64-AVX2' of https://github.com/asgibbons/jdk into > Base64-AVX2 > - ... and 3 more: https://git.openjdk.org/jdk/compare/50dc1196...3e66f7be src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2159: > 2157: } > 2158: > 2159: address StubGenerator::base64_AVX2_tables_addr() { A casual reader might wonder why there's 3 other avx2-tables and then this, so for readability it'd be nice to add "decode" to the name of this new table. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2643: > 2641: // Handle isURL / MIME?!?! r12 is used for length calculation > (from out) > 2642: // > 2643: // rbx is out, r12 is saved out, rdx is size, rsi is src It seems that on windows `r12` is in use, see line 2323. GHA seem to be having some trouble finishing Windows testing on time - could there be some issue here? src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2647: > 2645: //////////////////////////////////////////// > 2646: > 2647: // *************** NEEDS TO BE FIXED ************ While it's fine to leave implementation of `getMimeDecoder` and `getUrlDecoder` for a follow-up, I think these comments needs to be cleaned up. E.g. a follow-up RFE filed and referenced. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2651: > 2649: __ jcc(Assembler::notZero, L_tailProc); > 2650: > 2651: __ cmpl(length, 44); Perform `length` checks first to avoid unnecessary branches on small inputs? Ideal might be to move this length check up just before the `_cmpl(length, 128)` in the AVX-512 block, so that if `AVX=3` short inputs branch directly to the scalar tail procedure without jumping around. This might also apply to the encode stub, though that's pre-existing. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2673: > 2671: __ vmovdqu(xmm13, Address(r13, 0xc8)); > 2672: __ vmovdqu(xmm12, Address(r13, 0x08)); > 2673: __ jmp(L_enterLoop); This got me curious (sorry!) and maybe there's something going on here that I'm not getting... But why have you split the loop apart and jump into the middle of it? It'd seem not doing this would be more straightforward, with no difference semantically and one less jump? (align32 should just translate to a narrow nop instruction, if anything) ------------- PR: https://git.openjdk.org/jdk/pull/12126