Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]

2021-04-08 Thread Nick Gasson
On Thu, 8 Apr 2021 06:33:43 GMT, Dong Bo  wrote:

>> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic 
>> idea can be found at 
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> Tests in `test/jdk/java/util/Base64/` and 
>> `compiler/intrinsics/base64/TestBase64.java` were run specifically to check the 
>> correctness of the implementation.
>> 
>> There can be illegal characters at the start of the input if the data is 
>> MIME encoded.
>> There would be no benefit in using SIMD for this case, so the stub now uses 
>> non-SIMD instructions for MIME encoded data.
>> 
>> A JMH micro, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the 
>> JMH micro),
>> we see ~2.5x improvements with long inputs and no regression with short 
>> inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on 
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
>> Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±    45.172  ns/op
>> Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±     0.336  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±     0.848  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±     0.608  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±     6.236  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±     4.670  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±     4.324  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±     6.393  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±     0.759  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±     3.422  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±     9.128  ns/op
>> Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ±  7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±   161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±     0.571  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±     0.505  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±     1.452  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±     1.243  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±     1.188  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±     0.572  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±     0.177  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±     0.572  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±     1.559  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]

2021-04-08 Thread Nick Gasson
On Thu, 8 Apr 2021 09:05:43 GMT, Dong Bo  wrote:

> Hi @nick-arm, are you also ok with the newest commit?

It looks ok to me but I'm not a Reviewer.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]

2021-04-08 Thread Dong Bo
On Thu, 8 Apr 2021 08:28:53 GMT, Andrew Haley  wrote:

>> Dong Bo has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   reduce unnecessary memory write traffic in non-SIMD code
>
> Marked as reviewed by aph (Reviewer).

@theRealAph Thanks for the review.

Hi @nick-arm, are you also ok with the newest commit?

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]

2021-04-08 Thread Andrew Haley
On Thu, 8 Apr 2021 06:33:43 GMT, Dong Bo  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

2021-04-08 Thread Dong Bo
On Wed, 7 Apr 2021 09:53:36 GMT, Andrew Haley  wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:
>> 
>>> 5827: __ strb(r14, __ post(dst, 1));
>>> 5828: __ strb(r15, __ post(dst, 1));
>>> 5829: __ strb(r13, __ post(dst, 1));
>> 
>> I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we 
>> can do, I think.
>
> Sorry, that's not quite right, but you get the idea: let's not generate 
> unnecessary memory traffic.

Okay, implemented as:
__ lslw(r14, r10, 10);
__ bfiw(r14, r11, 4, 6);
__ bfmw(r14, r12, 2, 5);
__ rev16w(r14, r14);
__ bfiw(r13, r12, 6, 2);
__ strh(r14, __ post(dst, 2));
__ strb(r13, __ post(dst, 1));
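
For readers less familiar with the bit-field instructions, the sequence above appears to assemble three output bytes from four decoded 6-bit values before storing them with one halfword store and one byte store. A minimal scalar sketch of that packing in Java follows; the class and parameter names are hypothetical, and the mapping of `a`..`d` to `r10`..`r13` is an assumption based on reading the quoted code, not something stated in the thread.

```java
import java.nio.charset.StandardCharsets;

public class SextetPacking {
    // Packs four decoded 6-bit values into three output bytes.
    // Assumed mapping to the quoted stub: a = r10, b = r11, c = r12, d = r13.
    static byte[] pack(int a, int b, int c, int d) {
        // First two bytes: aaaaaabb bbbbcccc
        // (shift a left by 10, insert b at bit 4, insert the top 4 bits of c).
        int hi = (a << 10) | (b << 4) | (c >> 2);
        // Last byte: ccdddddd (low 2 bits of c inserted at bit 6 above d).
        int lo = ((c & 0x3) << 6) | d;
        return new byte[] { (byte) (hi >> 8), (byte) hi, (byte) lo };
    }

    public static void main(String[] args) {
        // "TWFu" maps to the Base64 values 19, 22, 5, 46.
        byte[] out = pack(19, 22, 5, 46);
        System.out.println(new String(out, StandardCharsets.US_ASCII));
    }
}
```

In this reading, the `rev16w` is only needed because a little-endian `strh` would otherwise write the two bytes in the wrong order.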

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v7]

2021-04-08 Thread Dong Bo
> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea 
> can be found at 
> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
> 
> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
> Tests in `test/jdk/java/util/Base64/` and 
> `compiler/intrinsics/base64/TestBase64.java` were run specifically to check the 
> correctness of the implementation.
> 
> There can be illegal characters at the start of the input if the data is MIME 
> encoded.
> There would be no benefit in using SIMD for this case, so the stub now uses 
> non-SIMD instructions for MIME encoded data.
> 
> A JMH micro, Base64Decode.java, is added for performance testing.
> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the 
> JMH micro),
> we see ~2.5x improvements with long inputs and no regression with short 
> inputs for raw base64 decoding, and minor improvements (~10.95%) for MIME on 
> Kunpeng916.
> 
> The Base64Decode.java JMH micro-benchmark results:
> 
> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
> 
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
> Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±    45.172  ns/op
> Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±   192.070  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±     0.336  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±     0.848  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±     0.608  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±     6.236  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±     4.670  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±     4.324  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±     6.393  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±     0.759  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±     3.422  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±     9.128  ns/op
> Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ±  7951.521  ns/op
> Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±   161.758  ns/op
> 
> # Kunpeng916, default
> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±     0.571  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±     0.505  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±     1.452  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±     1.243  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±     1.188  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±     0.572  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±     0.177  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±     0.572  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±     1.559  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±     0.813  ns/op
> Base64Decode.testBase64Decode               4              2  avgt    5   19751.477 ±    24.669  ns/op
> 

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

2021-04-07 Thread Andrew Haley
On Wed, 7 Apr 2021 05:51:02 GMT, Dong Bo  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

2021-04-07 Thread Andrew Haley
On Wed, 7 Apr 2021 09:50:45 GMT, Andrew Haley  wrote:

>> Dong Bo has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   fix misleading annotations
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:
> 
>> 5827: __ strb(r14, __ post(dst, 1));
>> 5828: __ strb(r15, __ post(dst, 1));
>> 5829: __ strb(r13, __ post(dst, 1));
> 
> I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we 
> can do, I think.

Sorry, that's not quite right, but you get the idea: let's not generate 
unnecessary memory traffic.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]

2021-04-07 Thread Dong Bo
On Tue, 6 Apr 2021 14:04:07 GMT, Andrew Haley  wrote:

>> Dong Bo has updated the pull request with a new target base due to a merge 
>> or a rebase. The pull request now contains 10 commits:
>> 
>>  - conflicts resolved
>>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into 
>> aarch64.base64.decode
>>  - resovling conflicts
>>  - load data with one ldrw, add JMH tests for error inputs
>>  - Merge branch 'master' into aarch64.base64.decode
>>  - copyright
>>  - trivial fixes
>>  - Handling error in SIMD case with loops, combining two non-SIMD cases into 
>> one code blob, addressing other comments
>>  - Merge branch 'master' into aarch64.base64.decode
>>  - 8256245: AArch64: Implement Base64 decoding intrinsic
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5800:
> 
>> 5798: __ br(Assembler::LT, Process4B);
>> 5799: 
>> 5800: // The 1st character of the input can be illegal if the data is 
>> MIME encoded.
> 
> Why is this sentence here? It is very misleading.

This sentence was used to describe the worst case observed frequently so that 
readers can understand more easily why the pre-processing non-SIMD code is 
necessary.
I apologize for being unclear and misleading. The annotations have been 
modified as suggested.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

2021-04-06 Thread Dong Bo

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]

2021-04-06 Thread Andrew Haley
On Tue, 6 Apr 2021 07:25:57 GMT, Dong Bo  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-06 Thread Andrew Haley
On Fri, 2 Apr 2021 10:01:27 GMT, Andrew Haley  wrote:

>> Dong Bo has updated the pull request with a new target base due to a merge 
>> or a rebase. The pull request now contains six commits:
>> 
>>  - Merge branch 'master' into aarch64.base64.decode
>>  - copyright
>>  - trivial fixes
>>  - Handling error in SIMD case with loops, combining two non-SIMD cases into 
>> one code blob, addressing other comments
>>  - Merge branch 'master' into aarch64.base64.decode
>>  - 8256245: AArch64: Implement Base64 decoding intrinsic
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5802:
> 
>> 5800: // The 1st character of the input can be illegal if the data is 
>> MIME encoded.
>> 5801: // We cannot benefits from SIMD for this case. The max line size 
>> of MIME
>> 5802: // encoding is 76, with the PreProcess80B blob, we actually use 
>> no-simd
> 
> "cannot benefit"

OK, so I now understand what is actually going on here, and it has nothing to 
do with illegal first characters.
The problem is that the maximum block length the decode will be supplied with 
is 76 bytes, and there isn't enough time for the SIMD to be worthwhile. So the 
comment should be "In the MIME case, the line length cannot be more than 76 
bytes (see RFC 2045.) This is too short a block for SIMD to be worthwhile, so 
we use non-SIMD here."
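
The 76-character limit can be observed directly with the standard `java.util.Base64` MIME encoder, which wraps its output at 76 characters and separates lines with `\r\n`. The following is a small illustrative sketch; the class and method names are hypothetical and not part of the patch.

```java
import java.util.Base64;

public class MimeLineLength {
    // Returns the length of the longest line in the MIME-encoded form of data.
    static int longestLine(byte[] data) {
        String encoded = Base64.getMimeEncoder().encodeToString(data);
        int max = 0;
        for (String line : encoded.split("\r\n")) {
            max = Math.max(max, line.length());
        }
        return max;
    }

    public static void main(String[] args) {
        // 1000 input bytes encode to many lines; none is longer than 76 characters.
        System.out.println(longestLine(new byte[1000]));
    }
}
```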

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-06 Thread Dong Bo
On Tue, 6 Apr 2021 09:44:28 GMT, Andrew Haley  wrote:

> > There would be no benefit in using SIMD for this case, so the stub now uses 
> > non-SIMD instructions for MIME encoded data.
> 
> What is the reasoning here? Sure, there can be illegal characters at the 
> start, but what if there are not? The generic logic uses decodeBlock() even 
> in the MIME case, because we don't know that there certainly will be illegal 
> characters.

This code block only processes 80B of the input.
If no illegal characters are found, the stub uses the SIMD instructions to 
process the rest of the input, provided the remaining data is long enough 
(>= 64B) to make up at least one SIMD round.
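
As a rough model of the split described here, the sketch below computes how many bytes would go through the scalar pre-processing and how many 64B SIMD rounds would follow. It is not the stub's code: the class and method names are hypothetical, it assumes no illegal character is encountered, and it says nothing about how any remaining tail is handled.

```java
public class DecodePlan {
    static final int PREFIX_BYTES = 80;      // non-SIMD pre-processing block
    static final int SIMD_ROUND_BYTES = 64;  // input consumed per SIMD round

    // Returns {bytes handled by non-SIMD pre-processing, number of SIMD rounds},
    // assuming every input character is legal.
    static int[] plan(int length) {
        if (length < PREFIX_BYTES) {
            // Too short to reach the SIMD path at all.
            return new int[] { length, 0 };
        }
        int rounds = (length - PREFIX_BYTES) / SIMD_ROUND_BYTES;
        return new int[] { PREFIX_BYTES, rounds };
    }

    public static void main(String[] args) {
        int[] p = plan(1000);
        System.out.println(p[0] + " scalar bytes, " + p[1] + " SIMD rounds");
    }
}
```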

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-06 Thread Andrew Haley
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:

> There can be illegal characters at the start of the input if the data is MIME 
> encoded.
> There would be no benefit in using SIMD for this case, so the stub now uses 
> non-SIMD instructions for MIME encoded data.

What is the reasoning here? Sure, there can be illegal characters at the start, 
but what if there are not? The generic logic uses decodeBlock() even in the 
MIME case, because we don't know that there certainly will be illegal 
characters.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-06 Thread Dong Bo
On Fri, 2 Apr 2021 10:17:57 GMT, Andrew Haley  wrote:

>> PING... Any suggestions on the updated commit?
>
>> PING... Any suggestions on the updated commit?
> 
> Once you reply to the comments, sure.

>
> Are there any existing test cases for failing inputs?
>
I added one; the error character is injected at a parameterized index of the 
encoded data.
There are no big differences for small error-injection indices; it seems most 
of the time is taken by exception handling.
A ~2x performance improvement was observed, as expected. The JMH tests:
### Kunpeng 916, intrinsic, tested with `-jar benchmarks.jar testBase64WithErrorInputsDecode -p errorIndex=3,64,144,208,272,1000,2 -p maxNumBytes=1`
Benchmark                                     (errorIndex)  (lineSize)  (maxNumBytes)  Mode  Cnt      Score     Error  Units
Base64Decode.testBase64WithErrorInputsDecode             3           4              1  avgt   10   3696.151 ± 202.783  ns/op
Base64Decode.testBase64WithErrorInputsDecode            64           4              1  avgt   10   3899.269 ± 178.289  ns/op
Base64Decode.testBase64WithErrorInputsDecode           144           4              1  avgt   10   3902.022 ± 163.611  ns/op
Base64Decode.testBase64WithErrorInputsDecode           208           4              1  avgt   10   3982.423 ± 256.638  ns/op
Base64Decode.testBase64WithErrorInputsDecode           272           4              1  avgt   10   3984.545 ± 144.282  ns/op
Base64Decode.testBase64WithErrorInputsDecode          1000           4              1  avgt   10   4532.959 ± 310.068  ns/op
Base64Decode.testBase64WithErrorInputsDecode             2           4              1  avgt   10  17578.148 ± 631.600  ns/op
### Kunpeng 916, default, tested with `-XX:-UseBASE64Intrinsics -jar benchmarks.jar testBase64WithErrorInputsDecode -p errorIndex=3,64,144,208,272,1000,2 -p maxNumBytes=1`
Benchmark                                     (errorIndex)  (lineSize)  (maxNumBytes)  Mode  Cnt      Score     Error  Units
Base64Decode.testBase64WithErrorInputsDecode             3           4              1  avgt   10   3760.330 ± 261.672  ns/op
Base64Decode.testBase64WithErrorInputsDecode            64           4              1  avgt   10   3900.326 ± 121.632  ns/op
Base64Decode.testBase64WithErrorInputsDecode           144           4              1  avgt   10   4041.428 ± 174.435  ns/op
Base64Decode.testBase64WithErrorInputsDecode           208           4              1  avgt   10   4177.670 ± 214.433  ns/op
Base64Decode.testBase64WithErrorInputsDecode           272           4              1  avgt   10   4324.020 ± 106.826  ns/op
Base64Decode.testBase64WithErrorInputsDecode          1000           4              1  avgt   10   5476.469 ± 171.647  ns/op
Base64Decode.testBase64WithErrorInputsDecode             2           4              1  avgt   10  34163.743 ± 162.263  ns/op
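
For readers who want to reproduce the failing-input case outside JMH, here is a minimal sketch of the idea. It is not the benchmark code from the patch: the class name `ErrorInjectionSketch`, its helper methods, and the choice of `'!'` as the illegal character are assumptions made for illustration. Only `java.util.Base64` is real API.

```java
import java.util.Base64;

final class ErrorInjectionSketch {
    // Encode `src`, then overwrite the encoded byte at `errorIndex` with '!',
    // a character outside the Base64 alphabet (hypothetical choice).
    static byte[] encodeWithError(byte[] src, int errorIndex) {
        byte[] encoded = Base64.getEncoder().encode(src);
        encoded[errorIndex] = (byte) '!';
        return encoded;
    }

    // Returns true if the basic decoder rejects the input with an exception.
    static boolean decodeFails(byte[] encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[300]; // 300 source bytes give 400 encoded characters
        for (int errorIndex : new int[] {3, 64, 144, 208, 272}) {
            boolean failed = decodeFails(encodeWithError(src, errorIndex));
            System.out.println("errorIndex=" + errorIndex + " rejected=" + failed);
        }
    }
}
```

Every call ends in an `IllegalArgumentException`, so the cost of creating and throwing the exception is paid regardless of where the error sits.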

>
> Your test results suggest that it isn't useful for that, surely?
>
The results suggest the non-SIMD code provides a ~11.9% improvement for MIME 
decoding.
Furthermore, according to local tests, MIME decoding may regress by about 30% 
without the non-SIMD code.

In the worst case, a MIME line has only 4 Base64-encoded characters followed by 
a line separator made up of illegal characters, e.g. `\r\n`.
When the intrinsic encounters an illegal character (`\r`), it has to exit.
The Java code then passes the next illegal source byte (`\n`) to the 
intrinsic.
With only SIMD code, the intrinsic executes many wasted instructions before it 
can detect the error.
With the non-SIMD code, it executes only one non-SIMD round for this 
erroneous input.
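
To make the worst-case input concrete, the following sketch builds MIME data with 4 encoded characters per line and a `\r\n` separator, then decodes it again. It only uses the public `java.util.Base64` API. The class name `MimeWorstCaseSketch` and its helpers are hypothetical, and the sketch says nothing about how the stub itself is written.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class MimeWorstCaseSketch {
    private static final byte[] CRLF = {'\r', '\n'};

    // Encode with 4 Base64 characters per line, lines separated by "\r\n".
    static String encodeShortLines(byte[] src) {
        byte[] encoded = Base64.getMimeEncoder(4, CRLF).encode(src);
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    // The MIME decoder ignores bytes outside the Base64 alphabet,
    // which includes the '\r' and '\n' separator bytes.
    static byte[] decodeMime(String encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        byte[] src = "hello world".getBytes(StandardCharsets.US_ASCII);
        String encoded = encodeShortLines(src);
        // Show the separators visibly instead of as real line breaks.
        System.out.println(encoded.replace("\r\n", "\\r\\n"));
        System.out.println(Arrays.equals(src, decodeMime(encoded)));
    }
}
```

With this input every fifth and sixth byte is illegal for the block decoder, so each call can make progress on at most 4 characters before it has to return.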

>
> Four loads and four post increments rather than one load and a few BFMs? Why?
>
Nice suggestion. Done, thanks.
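
A Java analogue of the two approaches, for illustration only: the stub itself is AArch64 assembly, and the class and method names below are made up. The second method performs one 32-bit load and then pulls out each byte with a shift and mask, which is the kind of work a bitfield-extract instruction from the BFM family does in a single step.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BitfieldExtractSketch {
    // Four separate byte loads, one per input character.
    static int[] fourLoads(byte[] src, int off) {
        return new int[] {
            src[off] & 0xff,
            src[off + 1] & 0xff,
            src[off + 2] & 0xff,
            src[off + 3] & 0xff
        };
    }

    // One little-endian 32-bit load, then each byte is extracted by shift and mask.
    static int[] oneLoad(byte[] src, int off) {
        int word = ByteBuffer.wrap(src, off, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return new int[] {
            word & 0xff,
            (word >>> 8) & 0xff,
            (word >>> 16) & 0xff,
            (word >>> 24) & 0xff
        };
    }

    public static void main(String[] args) {
        byte[] src = {'T', 'W', 'F', 'u'};
        // Both variants yield the same four character codes.
        System.out.println(java.util.Arrays.equals(fourLoads(src, 0), oneLoad(src, 0)));
    }
}
```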

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]

2021-04-06 Thread Dong Bo

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v4]

2021-04-06 Thread Dong Bo

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-02 Thread Andrew Haley
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-02 Thread Andrew Haley
On Fri, 2 Apr 2021 07:05:26 GMT, Dong Bo  wrote:

> PING... Any suggestions on the updated commit?

Once you reply to the comments, sure.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-02 Thread Andrew Haley
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo  wrote:

> > Please consider losing the non-SIMD case. It doesn't result in any 
> > significant gain.
> 
> The non-SIMD case is useful for MIME decoding performance.

Your test results suggest that it isn't useful for that, surely?

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-04-02 Thread Dong Bo
On Tue, 30 Mar 2021 03:24:16 GMT, Dong Bo  wrote:

>>> I think I can rewrite this part as loops.
>>> With an initial implementation, we can reduce the code size by almost 
>>> half (1312B -> 748B). Sounds OK to you?
>> 
>> Sounds great, but I'm still somewhat concerned that the non-SIMD case only 
>> offers 3-12% performance gain. Make it just 748 bytes, and therefore not 
>> icache-hostile, then perhaps the balance of risk and reward is justified.
> 
> Hi, @theRealAph @nick-arm 
> 
> The code is updated. The error handling in the SIMD case was rewritten as loops.
> 
> I also combined the two non-SIMD code blocks into one.
> Since we now have only one non-SIMD loop, it has been moved into 
> `generate_base64_decodeBlock`.
> The size of the stub is 692 bytes; the non-SIMD loop takes about 92 bytes if 
> my calculation is right.
> 
> Verified with tests `test/jdk/java/util/Base64/` and 
> `compiler/intrinsics/base64/TestBase64.java`.
> Compared with the previous implementation, the performance changes are negligible.
> 
> Other comments are addressed too. Thanks.

PING... Any suggestions on the updated commit?

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

2021-04-01 Thread Dong Bo

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-29 Thread Dong Bo
On Mon, 29 Mar 2021 08:38:59 GMT, Andrew Haley  wrote:

> > With an initial implementation, we can reduce the code size by almost 
> > half (1312B -> 748B). Sounds OK to you?
> 
> Sounds great, but I'm still somewhat concerned that the non-SIMD case only 
> offers 3-12% performance gain. Make it just 748 bytes, and therefore not 
> icache-hostile, then perhaps the balance of risk and reward is justified.

Hi, @theRealAph @nick-arm 

The code is updated. The error handling in the SIMD case was rewritten as loops.

I also combined the two non-SIMD code blocks into one.
Since we now have only one non-SIMD loop, it has been moved into 
`generate_base64_decodeBlock`.
The size of the stub is 692 bytes; the non-SIMD loop takes about 92 bytes if my 
calculation is right.

Verified with tests `test/jdk/java/util/Base64/` and 
`compiler/intrinsics/base64/TestBase64.java`.
Compared with the previous implementation, the performance changes are negligible.

Other comments are addressed too. Thanks.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v2]

2021-03-29 Thread Dong Bo

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-29 Thread Andrew Haley
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo  wrote:

> I think I can rewrite this part as loops.
> With an initial implementation, the code size is reduced by almost half 
> (1312B -> 748B). Sounds OK to you?

Sounds great, but I'm still somewhat concerned that the non-SIMD case only 
offers 3-12% performance gain. Make it just 748 bytes, and therefore not 
icache-hostile, then perhaps the balance of risk and reward is justified.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-29 Thread Andrew Haley
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson  wrote:

> > There's a lot of unrolling, particularly in the non-SIMD case. Please 
> > consider taking out some of the unrolling; I suspect it'd not increase time 
> > by very much but would greatly reduce the code cache pollution. It's very 
> > tempting to unroll everything to make a benchmark run quickly, but we have 
> > to take a balanced approach.
> 
> But there's only ever one of these generated at startup, right? It's not like 
> the string intrinsics that are expanded at every call site.

I'm talking about icache pollution. This stuff could be quite small.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-28 Thread Dong Bo
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson  wrote:

>> Firstly, I wonder how important this is for most applications. I don't 
>> actually know, but let's put that to one side. 
>> 
>> There's a lot of unrolling, particularly in the non-SIMD case. Please 
>> consider taking out some of the unrolling; I suspect it'd not increase time 
>> by very much but would greatly reduce the code cache pollution. It's very 
>> tempting to unroll everything to make a benchmark run quickly, but we have 
>> to take a balanced approach.
> 
> But there's only ever one of these generated at startup, right? It's not like 
> the string intrinsics that are expanded at every call site.

@nick-arm Thank you for watching this.

> That probably ought to go around the whole routine in 
> generate_base64_decodeBlock rather than here?
>

There are two non-SIMD blocks in this intrinsic.
The first is at the beginning, mainly to route MIME decoding to non-SIMD processing 
because of the performance issue I described earlier.
The second is at the end, to handle trailing input. So I think we need 
generate_base64_decode_nosimdround here.

>  "illegal inputs". Are there existing jtreg tests that cover these cases?
>

Yes, they are covered by 
`test/hotspot/jtreg/compiler/intrinsics/base64/TestBase64.java`.

> This table and the one below seem to be identical to first half of the NoSIMD 
> tables. Can't you just use one set of 256-entry tables for both SIMD and 
> non-SIMD algorithms?
>
They are not identical: `*ForSIMD[64] == 0`, whereas `*ForNoSIMD[64] == 255`.
In the SIMD case, `*ForSIMD[64]` acts as a pivot: during the second lookup, it tells 
us that the first lookup has already produced the decoded data.
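
A scalar Java model may make the pivot idea easier to follow. This is only a sketch, not the stub's code: the class and method names are hypothetical, and the arrangement of the second lookup (a saturating subtraction of 63 before indexing the upper half of the table) is an assumption made for illustration. It relies on two architectural facts: `tbl` yields 0 for an out-of-range index, and `tbx` leaves the destination unchanged for an out-of-range index.

```java
import java.util.Arrays;

class TwoStageLookupSketch {
    static final int ILLEGAL = 255;
    // 128-entry table: entries 0..63 serve the first lookup,
    // entries 64..127 serve the second lookup.
    static final int[] TABLE = buildTable();

    static int[] buildTable() {
        int[] t = new int[128];
        Arrays.fill(t, ILLEGAL);
        // Lower half, indexed directly by characters 0..63.
        t['+'] = 62;
        t['/'] = 63;
        for (int c = '0'; c <= '9'; c++) t[c] = 52 + (c - '0');
        // Pivot: every character below 64 lands here in the second lookup,
        // so the second lookup contributes nothing to the result.
        t[64] = 0;
        // Upper half, indexed by 64 + (c - 63) (assumed layout).
        for (int c = 'A'; c <= 'Z'; c++) t[64 + (c - 63)] = c - 'A';
        for (int c = 'a'; c <= 'z'; c++) t[64 + (c - 63)] = 26 + (c - 'a');
        return t;
    }

    // Models the two lookups for one input byte x in 0..255.
    static int decode(int x) {
        int lo = (x < 64) ? TABLE[x] : 0;            // tbl: out of range gives 0
        int idx = Math.max(x - 63, 0);               // saturating subtract
        int hi = (idx < 64) ? TABLE[64 + idx] : idx; // tbx: out of range keeps idx
        return lo | hi;                              // combine both lookups
    }

    // A decoded value above 63 does not fit in 6 bits, so the input was illegal.
    static boolean isLegal(int x) {
        return decode(x) <= 63;
    }

    public static void main(String[] args) {
        System.out.println(decode('A') + " " + decode('/') + " " + isLegal('\r'));
    }
}
```

With a table whose entry 64 is 255 instead, every character below 64 would be flagged as illegal after the two results are combined, which is why the two kinds of table cannot simply be shared in this model.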

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-28 Thread Dong Bo
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson  wrote:

>> Firstly, I wonder how important this is for most applications. I don't 
>> actually know, but let's put that to one side. 
>> 
>> There's a lot of unrolling, particularly in the non-SIMD case. Please 
>> consider taking out some of the unrolling; I suspect it'd not increase time 
>> by very much but would greatly reduce the code cache pollution. It's very 
>> tempting to unroll everything to make a benchmark run quickly, but we have 
>> to take a balanced approach.
> 
> But there's only ever one of these generated at startup, right? It's not like 
> the string intrinsics that are expanded at every call site.

Thanks for the comments.

> Firstly, I wonder how important this is for most applications. I don't 
> actually know, but let's put that to one side.
>

As stated in JEP 135, Base64 is frequently used to encode binary/octet 
sequences that are transmitted as textual data.
It is commonly used by applications using Multipurpose Internet Mail Extensions 
(MIME), encoding passwords for HTTP headers, message digests, etc.
 
> There's a lot of unrolling, particularly in the non-SIMD case. Please 
> consider taking out some of the unrolling; I suspect it'd not increase time 
> by very much but would greatly reduce the code cache pollution. It's very 
> tempting to unroll everything to make a benchmark run quickly, but we have to 
> take a balanced approach.
>

There is no code unrolling in the non-SIMD case. The instructions are just 
loading, processing, storing data within loops.
About half of the code size is the error handling in the SIMD case:

// handle illegal input
if (size == 16) {
  Label ErrorInLowerHalf;
  __ umov(rscratch1, in2, __ D, 0);
  __ cbnz(rscratch1, ErrorInLowerHalf);

  // illegal input is in higher half, store the lower half now.
  __ st3(out0, out1, out2, __ T8B, __ post(dst, 24));

  for (int i = 8; i < 15; i++) {
    __ umov(rscratch2, in2, __ B, (u1) i);
    __ cbnz(rscratch2, Exit);
    __ umov(r10, out0, __ B, (u1) i);
    __ umov(r11, out1, __ B, (u1) i);
    __ umov(r12, out2, __ B, (u1) i);
    __ strb(r10, __ post(dst, 1));
    __ strb(r11, __ post(dst, 1));
    __ strb(r12, __ post(dst, 1));
  }
  __ b(Exit);

I think I can rewrite this part as loops.
With an initial implementation, the code size is reduced by almost half 
(1312B -> 748B). Sounds OK to you?

> Please consider losing the non-SIMD case. It doesn't result in any 
> significant gain.
>

The non-SIMD case is useful for MIME decoding performance.
MIME Base64-encoded data is arranged in lines (the line size can be set by the 
user, with a maximum of 76B).
Newline characters, e.g. `\r\n`, are illegal but can be ignored by MIME 
decoding.
The SIMD case works as `load data -> two vector table lookups -> 
combining -> error detection -> store data`.
When using SIMD for MIME decoding, the first byte of the input is possibly a 
newline character.
The SIMD case would execute too much wasted code before it can detect the error 
and exit; with the non-SIMD case, only a few ldrs, orrs, and strs are needed for 
error detection.
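
To illustrate the MIME behaviour described above at the Java API level, here is a minimal sketch (not part of the patch; the class and helper names are made up for the example). It uses only `java.util.Base64`: the basic decoder rejects line separators, while the MIME decoder skips them, which is why a MIME input may present an "illegal" byte to the stub at the very start of a block.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class MimeDecodeDemo {
    // Returns true if the basic (non-MIME) decoder rejects the input.
    static boolean basicDecoderRejects(String encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    // Decodes with the MIME decoder, which ignores characters
    // outside the Base64 alphabet, such as line separators.
    static String mimeDecode(String encoded) {
        byte[] raw = Base64.getMimeDecoder().decode(encoded);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "hello" encoded as "aGVsbG8=", with line separators before,
        // inside, and after the encoded text.
        String wrapped = "\r\naGVs\r\nbG8=\r\n";
        System.out.println(basicDecoderRejects(wrapped)); // prints true
        System.out.println(mimeDecode(wrapped));          // prints hello
    }
}
```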

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-28 Thread Nick Gasson
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley  wrote:

> 
> There's a lot of unrolling, particularly in the non-SIMD case. Please 
> consider taking out some of the unrolling; I suspect it'd not increase time 
> by very much but would greatly reduce the code cache pollution. It's very 
> tempting to unroll everything to make a benchmark run quickly, but we have to 
> take a balanced approach.

But there's only ever one of these generated at startup, right? It's not like 
the string intrinsics that are expanded at every call site.

-

PR: https://git.openjdk.java.net/jdk/pull/3228


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-28 Thread Nick Gasson
On Sat, 27 Mar 2021 09:54:57 GMT, Andrew Haley  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-28 Thread Nick Gasson
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Andrew Haley
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Andrew Haley
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Andrew Haley
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo  wrote:


RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

2021-03-27 Thread Dong Bo
In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea 
can be found at 
http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
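
For readers unfamiliar with the data layout, the following scalar Java sketch shows the per-quad arithmetic that a vectorized decoder has to perform. It is an illustration only, not the stub: the class and method names are hypothetical, and padding is not handled. In the SIMD version, `ld4` de-interleaves the input so that the four characters of each quad (`a`, `b`, `c`, `d` below) land in four separate registers, and `st3` interleaves the three output byte streams back into memory.

```java
import java.nio.charset.StandardCharsets;

class Base64QuadSketch {
    // Maps one Base64 alphabet character to its 6-bit value, or -1 if illegal.
    static int sextet(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
    }

    // Decodes unpadded input whose length is a multiple of 4:
    // every 4 characters (4 x 6 bits) become 3 bytes (3 x 8 bits).
    static byte[] decodeQuads(String in) {
        if (in.length() % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        byte[] out = new byte[in.length() / 4 * 3];
        for (int i = 0, o = 0; i < in.length(); i += 4) {
            int a = sextet(in.charAt(i));
            int b = sextet(in.charAt(i + 1));
            int c = sextet(in.charAt(i + 2));
            int d = sextet(in.charAt(i + 3));
            // Any -1 makes the OR negative, signalling an illegal character.
            if ((a | b | c | d) < 0) {
                throw new IllegalArgumentException("illegal character in quad at " + i);
            }
            out[o++] = (byte) ((a << 2) | (b >> 4));
            out[o++] = (byte) ((b << 4) | (c >> 2));
            out[o++] = (byte) ((c << 6) | d);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = decodeQuads("aGVsbG8h");
        System.out.println(new String(raw, StandardCharsets.US_ASCII)); // prints hello!
    }
}
```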

The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
Tests in `test/jdk/java/util/Base64/` and 
`compiler/intrinsics/base64/TestBase64.java` were run specifically to check the 
correctness of the implementation.

There can be illegal characters at the start of the input if the data is MIME 
encoded.
There would be no benefit in using SIMD for this case, so the stub now uses 
non-SIMD instructions for MIME-encoded data.

A JMH micro, Base64Decode.java, is added for performance testing.
With different input lengths (upper-bounded by the parameter `maxNumBytes` in the 
JMH micro),
we see ~2.5x improvement with long inputs and no regression with short 
inputs for raw Base64 decoding, and a minor improvement (~10.95%) for MIME on 
Kunpeng916.

The Base64Decode.java JMH micro-benchmark results:

Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score      Error  Units

# Kunpeng916, intrinsic
Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±    0.609  ns/op
Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±    1.650  ns/op
Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±    0.931  ns/op
Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±    1.687  ns/op
Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±    9.217  ns/op
Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±    1.667  ns/op
Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±    1.751  ns/op
Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±    1.178  ns/op
Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±    0.584  ns/op
Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±    1.050  ns/op
Base64Decode.testBase64Decode               4              2  avgt    5    9534.136 ±   45.172  ns/op
Base64Decode.testBase64Decode               4              5  avgt    5   22718.726 ±  192.070  ns/op
Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±    0.759  ns/op
Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
Base64Decode.testBase64MIMEDecode           4              2  avgt   10  128670.660 ± 7951.521  ns/op
Base64Decode.testBase64MIMEDecode           4              5  avgt   10  317113.667 ±  161.758  ns/op

# Kunpeng916, default
Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±    0.571  ns/op
Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±    0.505  ns/op
Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±    1.452  ns/op
Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±    1.243  ns/op
Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±    1.188  ns/op
Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±    0.572  ns/op
Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±    0.177  ns/op
Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±    0.572  ns/op
Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±    1.559  ns/op
Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±    0.813  ns/op
Base64Decode.testBase64Decode               4              2  avgt    5   19751.477 ±   24.669  ns/op
Base64Decode.testBase64Decode               4              5  avgt    5   50046.586 ±  523.155  ns/op
Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±    0.238  ns/op
Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±    0.205  ns/op