Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]
On Thu, 8 Apr 2021 06:33:43 GMT, Dong Bo wrote:

>> In JDK-8248188, IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic
>> idea can be found at
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and
>> `compiler/intrinsics/base64/TestBase64.java` were also run specifically to
>> check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is
>> MIME encoded.
>> There would be no benefit in using SIMD for this case, so the stub now uses
>> non-SIMD instructions for MIME encoded data.
>>
>> A JMH micro, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the
>> JMH micro),
>> we see ~2.5x improvements with long inputs and no regression with short
>> inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on
>> Kunpeng916.
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score      Error  Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±    0.609  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±    1.650  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±    0.931  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±    1.687  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±    9.217  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±    1.667  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±    1.751  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±    1.178  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±    0.584  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±    1.050  ns/op
>> Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±   45.172  ns/op
>> Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±  192.070  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±    0.759  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
>> Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±  161.758  ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±    0.571  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±    0.505  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±    1.452  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±    1.243  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±    1.188  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±    0.572  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±    0.177  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±    0.572  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±    1.559  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]
On Thu, 8 Apr 2021 09:05:43 GMT, Dong Bo wrote:

> Hi @nick-arm, are you also ok with the newest commit?

It looks ok to me but I'm not a Reviewer.

PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]
On Thu, 8 Apr 2021 08:28:53 GMT, Andrew Haley wrote:

>> Dong Bo has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> reduce unnecessary memory write traffic in non-SIMD code
>
> Marked as reviewed by aph (Reviewer).

@theRealAph Thanks for the review.

Hi @nick-arm, are you also ok with the newest commit?

PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]
On Wed, 7 Apr 2021 09:53:36 GMT, Andrew Haley wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:
>>
>>> 5827:     __ strb(r14, __ post(dst, 1));
>>> 5828:     __ strb(r15, __ post(dst, 1));
>>> 5829:     __ strb(r13, __ post(dst, 1));
>>
>> I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we
>> can do, I think.
>
> Sorry, that's not quite right, but you get the idea: let's not generate
> unnecessary memory traffic.

Okay, implemented as:

    __ lslw(r14, r10, 10);
    __ bfiw(r14, r11, 4, 6);
    __ bfmw(r14, r12, 2, 5);
    __ rev16w(r14, r14);
    __ bfiw(r13, r12, 6, 2);
    __ strh(r14, __ post(dst, 2));
    __ strb(r13, __ post(dst, 1));

PR: https://git.openjdk.java.net/jdk/pull/3228
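For readers following the bit manipulation, here is a plain-Java model of the sequence above. It is only a sketch: it assumes that r10..r13 hold the four decoded 6-bit values of one 4-character group (inferred from the snippet, not stated in the thread), and the class and method names are hypothetical. Each statement mirrors one instruction, so the result can be compared with `java.util.Base64`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical model of the register sequence; not code from the patch.
class Base64PackSketch {
    // Assumption: b0..b3 play the role of r10..r13, each a 6-bit value.
    static byte[] pack(int b0, int b1, int b2, int b3) {
        int r14 = b0 << 10;                                // lslw   r14, r10, 10
        r14 = (r14 & ~(0x3f << 4)) | ((b1 & 0x3f) << 4);   // bfiw   r14, r11, 4, 6
        r14 = (r14 & ~0xf) | ((b2 >>> 2) & 0xf);           // bfmw   r14, r12, 2, 5 (bits 5..2 into bits 3..0)
        r14 = ((r14 & 0xff) << 8) | ((r14 >>> 8) & 0xff);  // rev16w r14, r14 (swap the two low bytes)
        int r13 = (b3 & ~(0x3 << 6)) | ((b2 & 0x3) << 6);  // bfiw   r13, r12, 6, 2
        // strh writes the low halfword in little-endian order, strb writes one byte.
        return new byte[] { (byte) r14, (byte) (r14 >>> 8), (byte) r13 };
    }

    public static void main(String[] args) {
        // 'T', 'W', 'F', 'u' are 19, 22, 5, 46 in the Base64 alphabet.
        byte[] packed = pack(19, 22, 5, 46);
        byte[] expected = Base64.getDecoder().decode("TWFu");
        System.out.println(new String(packed, StandardCharsets.US_ASCII)
                + " matches=" + Arrays.equals(packed, expected));
    }
}
```

In this model the three output bytes are produced with two stores (one halfword, one byte) instead of three byte stores, which is the memory-traffic reduction requested in the review.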
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]
> In JDK-8248188, IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea
> can be found at
> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>
> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
> The tests in `test/jdk/java/util/Base64/` and
> `compiler/intrinsics/base64/TestBase64.java` were also run specifically to
> check the correctness of the implementation.
>
> There can be illegal characters at the start of the input if the data is MIME
> encoded.
> There would be no benefit in using SIMD for this case, so the stub now uses
> non-SIMD instructions for MIME encoded data.
>
> A JMH micro, Base64Decode.java, is added for performance testing.
> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the
> JMH micro),
> we see ~2.5x improvements with long inputs and no regression with short
> inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on
> Kunpeng916.
>
> The Base64Decode.java JMH micro-benchmark results:
>
> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score      Error  Units
>
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±    0.609  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±    1.650  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±    0.931  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±    1.687  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±    9.217  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±    1.667  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±    1.751  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±    1.178  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±    0.584  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±    1.050  ns/op
> Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±   45.172  ns/op
> Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±  192.070  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±    0.759  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
> Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ± 7951.521  ns/op
> Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±  161.758  ns/op
>
> # Kunpeng916, default
> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±    0.571  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±    0.505  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±    1.452  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±    1.243  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±    1.188  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±    0.572  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±    0.177  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±    0.572  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±    1.559  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±    0.813  ns/op
> Base64Decode.testBase64Decode               4              2  avgt    5   19751.477 ±   24.669  ns/op
>
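As a scalar point of reference for what the description says the stub speeds up, the following Java sketch decodes whole 4-character groups through a 256-entry lookup table and stops at the first group containing an illegal byte. It is an illustration only: the class, the table and the `decodeGroups` method are hypothetical names, padding (`=`) is deliberately treated as illegal to keep the sketch short, and nothing here is taken from the patch or from the JDK sources.

```java
import java.util.Arrays;

// Hypothetical scalar model of table-driven Base64 decoding; not the stub.
class Base64LookupSketch {
    static final byte[] TABLE = new byte[256];
    static {
        Arrays.fill(TABLE, (byte) -1); // -1 marks an illegal input byte
        String alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        for (int i = 0; i < alphabet.length(); i++) {
            TABLE[alphabet.charAt(i)] = (byte) i;
        }
    }

    // Decodes src[sp..sl) group by group and returns the number of bytes
    // written to dst. Decoding stops early at the first illegal byte.
    static int decodeGroups(byte[] src, int sp, int sl, byte[] dst, int dp) {
        int start = dp;
        while (sl - sp >= 4) {
            int b0 = TABLE[src[sp] & 0xff];
            int b1 = TABLE[src[sp + 1] & 0xff];
            int b2 = TABLE[src[sp + 2] & 0xff];
            int b3 = TABLE[src[sp + 3] & 0xff];
            if ((b0 | b1 | b2 | b3) < 0) {
                break; // at least one lookup returned -1
            }
            int bits = (b0 << 18) | (b1 << 12) | (b2 << 6) | b3;
            dst[dp++] = (byte) (bits >> 16);
            dst[dp++] = (byte) (bits >> 8);
            dst[dp++] = (byte) bits;
            sp += 4;
        }
        return dp - start;
    }
}
```

The early exit mirrors the behaviour discussed later in the thread, where the intrinsic returns as soon as it meets an illegal character and leaves the rest to the Java code.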
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]
On Wed, 7 Apr 2021 05:51:02 GMT, Dong Bo wrote:

>> In JDK-8248188, IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic
>> idea can be found at
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and
>> `compiler/intrinsics/base64/TestBase64.java` were also run specifically to
>> check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is
>> MIME encoded.
>> There would be no benefit in using SIMD for this case, so the stub now uses
>> non-SIMD instructions for MIME encoded data.
>>
>> A JMH micro, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the
>> JMH micro),
>> we see ~2.5x improvements with long inputs and no regression with short
>> inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on
>> Kunpeng916.
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]
On Wed, 7 Apr 2021 09:50:45 GMT, Andrew Haley wrote:

>> Dong Bo has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> fix misleading annotations
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:
>
>> 5827:     __ strb(r14, __ post(dst, 1));
>> 5828:     __ strb(r15, __ post(dst, 1));
>> 5829:     __ strb(r13, __ post(dst, 1));
>
> I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we
> can do, I think.

Sorry, that's not quite right, but you get the idea: let's not generate unnecessary memory traffic.

PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]
On Tue, 6 Apr 2021 14:04:07 GMT, Andrew Haley wrote:

>> Dong Bo has updated the pull request with a new target base due to a merge
>> or a rebase. The pull request now contains 10 commits:
>>
>> - conflicts resolved
>> - Merge branch 'master' of https://git.openjdk.java.net/jdk into
>>   aarch64.base64.decode
>> - resovling conflicts
>> - load data with one ldrw, add JMH tests for error inputs
>> - Merge branch 'master' into aarch64.base64.decode
>> - copyright
>> - trivial fixes
>> - Handling error in SIMD case with loops, combining two non-SIMD cases into
>>   one code blob, addressing other comments
>> - Merge branch 'master' into aarch64.base64.decode
>> - 8256245: AArch64: Implement Base64 decoding intrinsic
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5800:
>
>> 5798:     __ br(Assembler::LT, Process4B);
>> 5799:
>> 5800:     // The 1st character of the input can be illegal if the data is MIME encoded.
>
> Why is this sentence here? It is very misleading.

This sentence was meant to describe the frequently observed worst case, so that readers can more easily understand why the non-SIMD pre-processing code is necessary.
I apologize for being unclear and misleading. The annotations have been modified as suggested.

PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]
On Tue, 6 Apr 2021 07:25:57 GMT, Dong Bo wrote:

>> In JDK-8248188, IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic
>> idea can be found at
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and
>> `compiler/intrinsics/base64/TestBase64.java` were also run specifically to
>> check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is
>> MIME encoded.
>> There would be no benefit in using SIMD for this case, so the stub now uses
>> non-SIMD instructions for MIME encoded data.
>>
>> A JMH micro, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the
>> JMH micro),
>> we see ~2.5x improvements with long inputs and no regression with short
>> inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on
>> Kunpeng916.
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]
On Fri, 2 Apr 2021 10:01:27 GMT, Andrew Haley wrote:

>> Dong Bo has updated the pull request with a new target base due to a merge
>> or a rebase. The pull request now contains six commits:
>>
>> - Merge branch 'master' into aarch64.base64.decode
>> - copyright
>> - trivial fixes
>> - Handling error in SIMD case with loops, combining two non-SIMD cases into
>>   one code blob, addressing other comments
>> - Merge branch 'master' into aarch64.base64.decode
>> - 8256245: AArch64: Implement Base64 decoding intrinsic
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5802:
>
>> 5800:     // The 1st character of the input can be illegal if the data is MIME encoded.
>> 5801:     // We cannot benefits from SIMD for this case. The max line size of MIME
>> 5802:     // encoding is 76, with the PreProcess80B blob, we actually use no-simd
>
> "cannot benefit"

OK, so I now understand what is actually going on here, and it has nothing to do with illegal first characters. The problem is that the maximum block length the decode will be supplied with is 76 bytes, and there isn't enough time for the SIMD to be worthwhile. So the comment should be

"In the MIME case, the line length cannot be more than 76 bytes (see RFC 2045.) This is too short a block for SIMD to be worthwhile, so we use non-SIMD here."

PR: https://git.openjdk.java.net/jdk/pull/3228
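The 76-character limit mentioned above can be observed from Java with the standard MIME encoder. The sketch below is only an illustration (the class and method names are hypothetical): it encodes a zero-filled payload with `Base64.getMimeEncoder()` and reports the longest output line.

```java
import java.util.Base64;

// Hypothetical helper; illustrates the MIME line length only.
class MimeLineLengthSketch {
    // Returns the length of the longest line the default MIME encoder
    // produces for a payload of the given size.
    static int longestLine(int payloadBytes) {
        byte[] payload = new byte[payloadBytes];
        String encoded = Base64.getMimeEncoder().encodeToString(payload);
        int longest = 0;
        for (String line : encoded.split("\r\n")) {
            longest = Math.max(longest, line.length());
        }
        return longest;
    }

    public static void main(String[] args) {
        System.out.println(longestLine(200)); // several full lines plus a remainder
        System.out.println(longestLine(10));  // a single short line
    }
}
```

Since every line is followed by `\r\n`, which is not part of the Base64 alphabet, a decoder that stops at illegal characters never sees more than one line of valid input at a time.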
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Tue, 6 Apr 2021 09:44:28 GMT, Andrew Haley wrote:

> > It would be no benefits to use SIMD for this case, so the stub use no-simd
> > instructions for MIME encoded data now.
>
> What is the reasoning here? Sure, there can be illegal characters at the
> start, but what if there are not? The generic logic uses decodeBlock() even
> in the MIME case, because we don't know that there certainly will be illegal
> characters.

This code block processes only the first 80B of the input. If no illegal characters are found, the stub uses the SIMD instructions to process the rest of the input if the data length is large enough, i.e. >= 64B, to form at least one SIMD round.

PR: https://git.openjdk.java.net/jdk/pull/3228
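One way to picture the path selection described above is the small Java model below. It is a sketch under several assumptions that the thread does not confirm: the input contains no illegal characters, its length is a multiple of 4, the 64B threshold applies to what remains after the 80B prefix, and any leftover after the SIMD rounds is handled by scalar code. All names are hypothetical.

```java
// Hypothetical model of how input bytes might be split between code paths.
class DecodePathSketch {
    static final int PRE_PROCESS_BYTES = 80; // scalar prefix, from the reply above
    static final int SIMD_ROUND_BYTES = 64;  // one SIMD round, from the reply above

    // Returns {scalarPrefix, simdBytes, scalarTail} for a clean input of
    // the given length (assumed to be a multiple of 4).
    static int[] split(int length) {
        int prefix = Math.min(length, PRE_PROCESS_BYTES);
        int rest = length - prefix;
        int simd = (rest / SIMD_ROUND_BYTES) * SIMD_ROUND_BYTES;
        int tail = rest - simd; // assumption: leftover goes to scalar code
        return new int[] { prefix, simd, tail };
    }

    public static void main(String[] args) {
        for (int len : new int[] { 76, 100, 220 }) {
            int[] parts = split(len);
            System.out.println(len + " -> prefix=" + parts[0]
                    + " simd=" + parts[1] + " tail=" + parts[2]);
        }
    }
}
```

Under these assumptions a 76-byte MIME line never reaches the SIMD loop at all, which matches the reviewer's reading that such blocks are too short for SIMD to pay off.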
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo wrote:

> There can be illegal characters at the start of the input if the data is MIME
> encoded.
> It would be no benefits to use SIMD for this case, so the stub use no-simd
> instructions for MIME encoded data now.

What is the reasoning here? Sure, there can be illegal characters at the start, but what if there are not? The generic logic uses decodeBlock() even in the MIME case, because we don't know that there certainly will be illegal characters.

PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Fri, 2 Apr 2021 10:17:57 GMT, Andrew Haley wrote:

>> PING... Any suggestions on the updated commit?
>
> Once you reply to the comments, sure.
>
> Are there any existing test cases for failing inputs?

I added one: the error character is injected at a parameterized index of the encoded data.
There is no big difference for small error indices; it seems most of the time is taken by exception handling.
For the largest error index we see the expected ~2x performance improvement.
The JMH tests:

### Kunpeng 916, intrinsic, tested with `-jar benchmarks.jar testBase64WithErrorInputsDecode -p errorIndex=3,64,144,208,272,1000,2 -p maxNumBytes=1`
Base64Decode.testBase64WithErrorInputsDecode     3  4  1  avgt  10   3696.151 ± 202.783  ns/op
Base64Decode.testBase64WithErrorInputsDecode    64  4  1  avgt  10   3899.269 ± 178.289  ns/op
Base64Decode.testBase64WithErrorInputsDecode   144  4  1  avgt  10   3902.022 ± 163.611  ns/op
Base64Decode.testBase64WithErrorInputsDecode   208  4  1  avgt  10   3982.423 ± 256.638  ns/op
Base64Decode.testBase64WithErrorInputsDecode   272  4  1  avgt  10   3984.545 ± 144.282  ns/op
Base64Decode.testBase64WithErrorInputsDecode  1000  4  1  avgt  10   4532.959 ± 310.068  ns/op
Base64Decode.testBase64WithErrorInputsDecode     2  4  1  avgt  10  17578.148 ± 631.600  ns/op

### Kunpeng 916, default, tested with `-XX:-UseBASE64Intrinsics -jar benchmarks.jar testBase64WithErrorInputsDecode -p errorIndex=3,64,144,208,272,1000,2 -p maxNumBytes=1`
Base64Decode.testBase64WithErrorInputsDecode     3  4  1  avgt  10   3760.330 ± 261.672  ns/op
Base64Decode.testBase64WithErrorInputsDecode    64  4  1  avgt  10   3900.326 ± 121.632  ns/op
Base64Decode.testBase64WithErrorInputsDecode   144  4  1  avgt  10   4041.428 ± 174.435  ns/op
Base64Decode.testBase64WithErrorInputsDecode   208  4  1  avgt  10   4177.670 ± 214.433  ns/op
Base64Decode.testBase64WithErrorInputsDecode   272  4  1  avgt  10   4324.020 ± 106.826  ns/op
Base64Decode.testBase64WithErrorInputsDecode  1000  4  1  avgt  10   5476.469 ± 171.647  ns/op
Base64Decode.testBase64WithErrorInputsDecode     2  4  1  avgt  10  34163.743 ± 162.263  ns/op

> Your test results suggest that it isn't useful for that, surely?

The results suggest that the non-SIMD code provides a ~11.9% improvement for MIME decoding.
Furthermore, according to local tests, we may see a performance regression of about 30% for MIME decoding without the non-SIMD code.

In the worst case, a MIME line has only 4 base64 encoded characters followed by a newline string consisting of illegal inputs, e.g. `\r\n`.
When the intrinsic encounters an illegal character (`\r`), it has to exit.
The Java code then passes the next illegal source byte (`\n`) to the intrinsic.
With only SIMD code, it would execute too many wasted instructions before it could detect the error.
With the non-SIMD code, the intrinsic executes only one non-SIMD round for this error input.

> Four loads and four post increments rather than one load and a few BFMs? Why?

Nice suggestion. Done, thanks.

PR: https://git.openjdk.java.net/jdk/pull/3228
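The worst case described above (4 encoded characters per line, each followed by `\r\n`) is easy to reproduce with the public `java.util.Base64` API. The sketch below is an illustration only, with hypothetical class and method names; it is not the JMH benchmark from the patch. It builds such an input, shows that the MIME decoder accepts it, and shows that the basic decoder rejects it at the first `\r`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo of the short-line MIME input; not the JMH micro.
class MimeWorstCaseSketch {
    // Encodes with a line length of 4 characters and "\r\n" separators.
    static String encodeWithShortLines(byte[] data) {
        return Base64.getMimeEncoder(4, new byte[] { '\r', '\n' }).encodeToString(data);
    }

    // Returns true if the basic (non-MIME) decoder rejects the input.
    static boolean basicDecoderRejects(String encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true; // '\r' is not in the Base64 alphabet
        }
    }

    public static void main(String[] args) {
        String encoded = encodeWithShortLines("Man is".getBytes(StandardCharsets.US_ASCII));
        byte[] decoded = Base64.getMimeDecoder().decode(encoded);
        System.out.println(encoded.replace("\r\n", "\\r\\n"));
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
        System.out.println("basic decoder rejects: " + basicDecoderRejects(encoded));
    }
}
```

With input shaped like this, every fifth and sixth byte is illegal for the block decoder, so the cost of detecting an illegal byte quickly dominates the MIME numbers.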
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v4]
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo wrote:
>> In JDK-8248188, IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea can be found at
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were run specifically to check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is MIME encoded.
>> There would be no benefit in using SIMD for this case, so the stub now uses non-SIMD instructions for MIME encoded data.
>>
>> A JMH micro, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro), we see ~2.5x improvement with long inputs and no regression with short inputs for raw base64 decoding, and a minor improvement (~10.95%) for MIME on Kunpeng916.
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score      Error  Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±    0.609  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±    1.650  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±    0.931  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±    1.687  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±    9.217  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±    1.667  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±    1.751  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±    1.178  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±    0.584  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±    1.050  ns/op
>> Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±   45.172  ns/op
>> Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±  192.070  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±    0.759  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
>> Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±  161.758  ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±    0.571  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±    0.505  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±    1.452  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±    1.243  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±    1.188  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±    0.572  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±    0.177  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±    0.572  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±    1.559  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Fri, 2 Apr 2021 07:05:26 GMT, Dong Bo wrote: > PING... Any suggestions on the updated commit? Once you reply to the comments, sure. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo wrote: > > Please consider losing the non-SIMD case. It doesn't result in any > > significant gain. > > The non-SIMD case is useful for MIME decoding performance. Your test results suggest that it isn't useful for that, surely? - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Tue, 30 Mar 2021 03:24:16 GMT, Dong Bo wrote:
>>> I think I can rewrite this part as loops.
>>> With an initial implementation, we can have almost half of the code size reduced (1312B -> 748B). Sounds OK to you?
>>
>> Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, then perhaps the balance of risk and reward is justified.
>
> Hi, @theRealAph @nick-arm
>
> The code is updated. The error handling in the SIMD case was rewritten as loops.
> Also combined the two non-SIMD code blocks into one.
> Since we have only one non-SIMD loop now, it is moved into `generate_base64_decodeBlock`.
> The size of the stub is 692 bytes; the non-SIMD loop takes about 92 bytes if my calculation is right.
>
> Verified with tests `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java`.
> Compared with the previous implementation, the performance changes are negligible.
>
> Other comments are addressed too. Thanks.

PING... Any suggestions on the updated commit?

- PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 08:38:59 GMT, Andrew Haley wrote:
> > With an initial implementation, we can have almost half of the code size reduced (1312B -> 748B). Sounds OK to you?
>
> Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, then perhaps the balance of risk and reward is justified.

Hi, @theRealAph @nick-arm

The code is updated. The error handling in the SIMD case was rewritten as loops.
Also combined the two non-SIMD code blocks into one.
Since we have only one non-SIMD loop now, it is moved into `generate_base64_decodeBlock`.
The size of the stub is 692 bytes; the non-SIMD loop takes about 92 bytes if my calculation is right.

Verified with tests `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java`.
Compared with the previous implementation, the performance changes are negligible.

Other comments are addressed too. Thanks.

- PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v2]
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo wrote:
> I think I can rewrite this part as loops.
> With an initial implementation, we can have almost half of the code size reduced (1312B -> 748B). Sounds OK to you?

Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, then perhaps the balance of risk and reward is justified.

- PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson wrote: > > There's a lot of unrolling, particularly in the non-SIMD case. Please > > consider taking out some of the unrolling; I suspect it'd not increase time > > by very much but would greatly reduce the code cache pollution. It's very > > tempting to unroll everything to make a benchmark run quickly, but we have > > to take a balanced approach. > > But there's only ever one of these generated at startup, right? It's not like > the string intrinsics that are expanded at every call site. I'm talking about icache pollution. This stuff could be quite small. - PR: https://git.openjdk.java.net/jdk/pull/3228
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson wrote:
>> Firstly, I wonder how important this is for most applications. I don't actually know, but let's put that to one side.
>>
>> There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like the string intrinsics that are expanded at every call site.

@nick-arm Thank you for looking at this.

> That probably ought to go around the whole routine in generate_base64_decodeBlock rather than here?

There are two non-SIMD blocks in this intrinsic. The first is at the beginning, mainly to route MIME decoding to non-SIMD processing because of the performance issue I described before. The second is at the end, to handle trailing inputs. So I guess we need generate_base64_decode_nosimdround here.

> "illegal inputs". Are there existing jtreg tests that cover these cases?

Yes, they are covered by `test/hotspot/jtreg/compiler/intrinsics/base64/TestBase64.java`.

> This table and the one below seem to be identical to first half of the NoSIMD tables. Can't you just use one set of 256-entry tables for both SIMD and non-SIMD algorithms?

They are not identical: `*ForSIMD[64]==0`, `*forNoSIMD[64]=255`. In the SIMD case, `*ForSIMD[64]` acts as a pivot that tells us, when performing the 2nd lookup, that we already got the decoded data from the 1st lookup.
- PR: https://git.openjdk.java.net/jdk/pull/3228
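The relationship between the two tables discussed above can be sketched in Java. This is an illustration only: it assumes the SIMD table is the first 128 entries of the 256-entry non-SIMD table with entry 64 changed from 255 to 0, which is what the exchange above implies, and it covers only the standard alphabet. The class and method names are hypothetical and do not come from the patch.

```java
import java.util.Arrays;

// Hypothetical sketch of the two lookup tables; not the stub's actual data.
class DecodeTables {
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    static final int ILLEGAL = 255; // marker for bytes outside the alphabet

    // 256 entries: the 6-bit value for legal characters, 255 for everything else.
    static int[] buildNoSimdTable() {
        int[] table = new int[256];
        Arrays.fill(table, ILLEGAL);
        for (int i = 0; i < ALPHABET.length(); i++) {
            table[ALPHABET.charAt(i)] = i;
        }
        return table;
    }

    // Assumption: same as the first half of the non-SIMD table, except entry 64.
    static int[] buildSimdTable(int[] noSimdTable) {
        int[] table = Arrays.copyOf(noSimdTable, 128);
        table[64] = 0; // the pivot entry mentioned in the reply above
        return table;
    }

    public static void main(String[] args) {
        int[] noSimd = buildNoSimdTable();
        int[] simd = buildSimdTable(noSimd);
        System.out.println("noSimd[64] = " + noSimd[64] + ", simd[64] = " + simd[64]);
    }
}
```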
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson wrote:

>> Firstly, I wonder how important this is for most applications. I don't
>> actually know, but let's put that to one side.
>>
>> There's a lot of unrolling, particularly in the non-SIMD case. Please
>> consider taking out some of the unrolling; I suspect it'd not increase time
>> by very much but would greatly reduce the code cache pollution. It's very
>> tempting to unroll everything to make a benchmark run quickly, but we have
>> to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like
> the string intrinsics that are expanded at every call site.

Thanks for the comments.

> Firstly, I wonder how important this is for most applications. I don't
> actually know, but let's put that to one side.

As stated in JEP 135, Base64 is frequently used to encode binary/octet sequences that are transmitted as textual data. It is commonly used by applications using Multipurpose Internet Mail Extensions (MIME), for encoding passwords in HTTP headers, for message digests, etc.

> There's a lot of unrolling, particularly in the non-SIMD case. Please
> consider taking out some of the unrolling; I suspect it'd not increase time
> by very much but would greatly reduce the code cache pollution. It's very
> tempting to unroll everything to make a benchmark run quickly, but we have to
> take a balanced approach.

There is no code unrolling in the non-SIMD case. The instructions just load, process, and store data within loops.
About half of the code size is the error handling in the SIMD case:

    // handle illegal input
    if (size == 16) {
      Label ErrorInLowerHalf;
      __ umov(rscratch1, in2, __ D, 0);
      __ cbnz(rscratch1, ErrorInLowerHalf);

      // illegal input is in higher half, store the lower half now.
      __ st3(out0, out1, out2, __ T8B, __ post(dst, 24));

      for (int i = 8; i < 15; i++) {
        __ umov(rscratch2, in2, __ B, (u1) i);
        __ cbnz(rscratch2, Exit);
        __ umov(r10, out0, __ B, (u1) i);
        __ umov(r11, out1, __ B, (u1) i);
        __ umov(r12, out2, __ B, (u1) i);
        __ strb(r10, __ post(dst, 1));
        __ strb(r11, __ post(dst, 1));
        __ strb(r12, __ post(dst, 1));
      }
      __ b(Exit);

I think I can rewrite this part as loops. With an initial implementation, the code size is reduced by almost half (1312B -> 748B). Does that sound OK to you?

> Please consider losing the non-SIMD case. It doesn't result in any
> significant gain.

The non-SIMD case is useful for MIME decoding performance. MIME Base64-encoded data is arranged in lines (the line size can be set by the user, with a maximum of 76B). Newline characters, e.g. `\r\n`, are illegal Base64 characters but are ignored by MIME decoding. The SIMD case works as `load data -> two vector table lookups -> combining -> error detection -> store data`. When SIMD is used for MIME decoding, the first byte of the input is possibly a newline character. The SIMD case then executes a lot of wasted code before it can detect the error and exit, whereas the non-SIMD case needs only a few ldrs, orrs, and strs for error detection.

- PR: https://git.openjdk.java.net/jdk/pull/3228
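The newline behaviour described above can be observed from the Java API side. The following is a minimal sketch using only `java.util.Base64`; the class and helper names are made up for illustration, and it shows the library-level behaviour rather than anything about the stub itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class MimeNewlineSketch {
    // "hello" encoded as Base64, with a CRLF inserted after the first 4 characters.
    static final String WRAPPED = "aGVs\r\nbG8=";
    // The same data with the line separator as the very first bytes of the input.
    static final String LEADING = "\r\naGVsbG8=";

    // The MIME decoder skips characters outside the Base64 alphabet.
    static String decodeMime(String s) {
        return new String(Base64.getMimeDecoder().decode(s), StandardCharsets.US_ASCII);
    }

    // The basic decoder treats the same characters as an error.
    static boolean basicDecoderRejects(String s) {
        try {
            Base64.getDecoder().decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeMime(WRAPPED));           // prints "hello"
        System.out.println(decodeMime(LEADING));           // prints "hello"
        System.out.println(basicDecoderRejects(WRAPPED));  // prints "true"
    }
}
```

The `LEADING` input corresponds to the case mentioned above where the first byte handed to the decoder is already a newline character.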
Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley wrote:

> There's a lot of unrolling, particularly in the non-SIMD case. Please
> consider taking out some of the unrolling; I suspect it'd not increase time
> by very much but would greatly reduce the code cache pollution. It's very
> tempting to unroll everything to make a benchmark run quickly, but we have to
> take a balanced approach.

But there's only ever one of these generated at startup, right? It's not like the string intrinsics that are expanded at every call site.

- PR: https://git.openjdk.java.net/jdk/pull/3228
RFR: 8256245: AArch64: Implement Base64 decoding intrinsic
In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea can be found at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.

The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were also run specifically to check the correctness of the implementation.

There can be illegal characters at the start of the input if the data is MIME encoded.
There would be no benefit in using SIMD for this case, so the stub currently uses non-SIMD instructions for MIME-encoded data.

A JMH micro, Base64Decode.java, is added for performance testing.
With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro), we see ~2.5x improvements with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.

The Base64Decode.java JMH micro-benchmark results:

Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score      Error  Units

# Kunpeng916, intrinsic
Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±    0.609  ns/op
Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±    1.650  ns/op
Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±    0.931  ns/op
Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±    1.687  ns/op
Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±    9.217  ns/op
Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±    1.667  ns/op
Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±    1.751  ns/op
Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±    1.178  ns/op
Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±    0.584  ns/op
Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±    1.050  ns/op
Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±   45.172  ns/op
Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±  192.070  ns/op
Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±    0.759  ns/op
Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ± 7951.521  ns/op
Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±  161.758  ns/op

# Kunpeng916, default
Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±    0.571  ns/op
Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±    0.505  ns/op
Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±    1.452  ns/op
Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±    1.243  ns/op
Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±    1.188  ns/op
Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±    0.572  ns/op
Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±    0.177  ns/op
Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±    0.572  ns/op
Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±    1.559  ns/op
Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±    0.813  ns/op
Base64Decode.testBase64Decode               4              2  avgt    5   19751.477 ±   24.669  ns/op
Base64Decode.testBase64Decode               4              5  avgt    5   50046.586 ±  523.155  ns/op
Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±    0.238  ns/op
Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±    0.205  ns/op