On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley <a...@openjdk.org> wrote:

>> JDK-8248188 added an IntrinsicCandidate and API for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic
>> idea is described at
>> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>> 
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and
>> `compiler/intrinsics/base64/TestBase64.java` were also run specifically to
>> check the correctness of the implementation.
>> 
>> If the data is MIME encoded, there can be illegal characters at the start of
>> the input.
>> SIMD brings no benefit in that case, so the stub currently uses non-SIMD
>> instructions for MIME-encoded data.
>> 
>> A JMH micro-benchmark, Base64Decode.java, is added for performance testing.
>> Across different input lengths (upper-bounded by the parameter `maxNumBytes`
>> in the JMH micro-benchmark),
>> we see ~2.5x improvement with long inputs and no regression with short
>> inputs for raw Base64 decoding, and a minor improvement (~10.95%) for MIME, on
>> Kunpeng916.
>> 
>> The Base64Decode.java JMH micro-benchmark results:
>> 
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score         Error  Units
>> 
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±    0.609  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±    1.650  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±    0.931  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±    1.687  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±    9.217  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±    1.667  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±    1.751  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±    1.178  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±    0.584  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±    1.050  ns/op
>> Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±   45.172  ns/op
>> Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±  192.070  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±    0.759  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
>> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±  161.758  ns/op
>> 
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±    0.571  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±    0.505  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±    1.452  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±    1.243  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±    1.188  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±    0.572  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±    0.177  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±    0.572  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±    1.559  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±    0.813  ns/op
>> Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±   24.669  ns/op
>> Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ±  523.155  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±    0.238  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±    0.205  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±    0.610  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±    4.732  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±    0.177  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±    4.572  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±    0.301  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±    0.389  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±    2.519  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±    9.281  ns/op
>> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ±  157.920  ns/op
>> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ±  216.542  ns/op
>
> Firstly, I wonder how important this is for most applications. I don't 
> actually know, but let's put that to one side. 
> 
> There's a lot of unrolling, particularly in the non-SIMD case. Please 
> consider taking out some of the unrolling; I suspect it'd not increase time 
> by very much but would greatly reduce the code cache pollution. It's very 
> tempting to unroll everything to make a benchmark run quickly, but we have to 
> take a balanced approach.

Please consider losing the non-SIMD case. It doesn't result in any significant 
gain.
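
As a rough check, assuming the MIME rows are the ones served by the non-SIMD
path (as the description says), the ratio of default to intrinsic time can be
computed straight from the quoted results. This is only a sketch: `Row`,
`kMimeRows`, `speedup` and `printSpeedups` are hypothetical names, and the
numbers are copied from three rows of the table above.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark row: input size bound and the two measured times in ns/op.
struct Row {
  int maxNumBytes;
  double defaultNs;    // "Kunpeng916, default"
  double intrinsicNs;  // "Kunpeng916, intrinsic"
};

// Three testBase64MIMEDecode rows copied from the quoted results.
static const Row kMimeRows[] = {
  {1, 64.130, 63.558},
  {1000, 7674.994, 7017.784},
  {50000, 355698.369, 317113.667},
};

// Ratio of baseline time to intrinsic time; values above 1.0 mean the
// intrinsic is faster.
double speedup(const Row& r) {
  assert(r.intrinsicNs > 0.0);
  return r.defaultNs / r.intrinsicNs;
}

// Prints one line per row, e.g. for eyeballing the gain by input size.
void printSpeedups() {
  for (const Row& r : kMimeRows) {
    std::printf("maxNumBytes=%d speedup=%.3fx\n", r.maxNumBytes, speedup(r));
  }
}
```

For the 50000-byte MIME row this gives about 1.12x, against about 2.2x for
the same row of the raw decode benchmark.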

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5728:
> 
>> 5726: 
>> 5727:     static const uint8_t fromBase64ForNoSIMD[256] = {
>> 5728:       255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u,
> 
> There seems to be no documentation of these magic tables of constants.

We're either going to need a proper description of the algorithm here or a 
permalink to one.
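
For illustration only: a table of this kind is usually just the inverse of the
Base64 alphabet, where entry i holds the 6-bit value of byte i, or 255 if the
byte is not in the alphabet. The sketch below assumes the standard RFC 4648
alphabet and treats padding as illegal. `kStdAlphabet`, `kIllegal`,
`buildDecodeTable` and `decodeQuad` are hypothetical names, not taken from the
patch, and the patch's tables may well be laid out differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Standard RFC 4648 alphabet (assumption: the patch may also handle the
// URL-safe variant with a second table).
static const char kStdAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Marker for bytes that are not part of the alphabet.
static const std::uint8_t kIllegal = 255u;

// Builds the 256-entry inverse table: byte value -> 6-bit value or kIllegal.
std::array<std::uint8_t, 256> buildDecodeTable(const char* alphabet) {
  assert(std::strlen(alphabet) == 64);
  std::array<std::uint8_t, 256> table;
  table.fill(kIllegal);
  for (std::size_t i = 0; i < 64; ++i) {
    table[static_cast<unsigned char>(alphabet[i])] =
        static_cast<std::uint8_t>(i);
  }
  return table;
}

// Decodes four Base64 characters into three bytes using scalar code.
// Returns false if any input character is illegal.
bool decodeQuad(const std::array<std::uint8_t, 256>& table,
                const char in[4], std::uint8_t out[3]) {
  const std::uint8_t v0 = table[static_cast<unsigned char>(in[0])];
  const std::uint8_t v1 = table[static_cast<unsigned char>(in[1])];
  const std::uint8_t v2 = table[static_cast<unsigned char>(in[2])];
  const std::uint8_t v3 = table[static_cast<unsigned char>(in[3])];
  // Legal values fit in 6 bits, so any bit above bit 5 flags an illegal byte.
  if ((v0 | v1 | v2 | v3) > 63) {
    return false;
  }
  out[0] = static_cast<std::uint8_t>((v0 << 2) | (v1 >> 4));
  out[1] = static_cast<std::uint8_t>(((v1 & 0x0F) << 4) | (v2 >> 2));
  out[2] = static_cast<std::uint8_t>(((v2 & 0x03) << 6) | v3);
  return true;
}
```

A comment of roughly this shape next to each table, saying which alphabet it
inverts and what 255 means, would already answer most of the question.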

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228
