On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo <don...@openjdk.org> wrote:

> JDK-8248188 added an IntrinsicCandidate and API for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3, a basic idea 
> can be found at 
> http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
> 
> Patch passed jtreg tier1-3 tests with linux-aarch64-server-fastdebug build.
> Tests in `test/jdk/java/util/Base64/` and
> `compiler/intrinsics/base64/TestBase64.java` were also run specifically to
> check the correctness of the implementation.
> 
> There can be illegal characters at the start of the input if the data is MIME 
> encoded.
> SIMD would bring no benefit in this case, so the stub currently uses non-SIMD
> instructions for MIME-encoded data.
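
To illustrate the point about MIME input, here is a minimal Java sketch of how the two public decoders treat characters outside the Base64 alphabet. It only uses the standard `java.util.Base64` API; the class name, helper names and sample string are made up for illustration and are not taken from the patch or its tests.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class, not part of the patch.
class MimeDecodeSketch {
    // Basic decoder: rejects any character outside the Base64 alphabet.
    // Returns null when the input is rejected.
    static String decodeBasic(String s) {
        try {
            return new String(Base64.getDecoder().decode(s), StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // MIME decoder: skips line separators and other characters
    // that are not in the Base64 alphabet.
    static String decodeMime(String s) {
        return new String(Base64.getMimeDecoder().decode(s), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "hello" encoded as "aGVsbG8=", with CRLF at the start and in the middle.
        String encoded = "\r\naGVs\r\nbG8=";
        System.out.println("basic: " + decodeBasic(encoded));
        System.out.println("mime:  " + decodeMime(encoded));
    }
}
```

Because the MIME decoder may have to skip a character at any position, including the very first one, a block-oriented SIMD loop cannot assume that a run of input bytes is all valid Base64.
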
> 
> A JMH micro, Base64Decode.java, is added for performance testing.
> With different input lengths (upper-bounded by parameter `maxNumBytes` in the
> JMH micro), we see ~2.5x improvement with long inputs and no regression with
> short inputs for raw base64 decoding, and a minor improvement (~10.95%) for
> MIME on Kunpeng916.
> 
> The Base64Decode.java JMH micro-benchmark results:
> 
> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
> 
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
> Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±    45.172  ns/op
> Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±   192.070  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±     0.336  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±     0.848  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±     0.608  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±     6.236  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±     4.670  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±     4.324  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±     6.393  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±     0.759  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±     3.422  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±     9.128  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ±  7951.521  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±   161.758  ns/op
> 
> # Kunpeng916, default
> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±     0.571  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±     0.505  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±     1.452  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±     1.243  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±     1.188  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±     0.572  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±     0.177  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±     0.572  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±     1.559  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±     0.813  ns/op
> Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±    24.669  ns/op
> Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ±   523.155  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±     0.238  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±     0.205  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±     0.610  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±     4.732  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±     0.177  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±     4.572  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±     0.301  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±     0.389  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±     2.519  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±     9.281  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ±   157.920  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ±   216.542  ns/op
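
One rough way to read the two result blocks is to divide the default score by the intrinsic score for the same row. The sketch below does that for the two `maxNumBytes` = 50000 rows only, with the scores copied from the tables; the class and method names are hypothetical, and the ratios for other rows or other machines will differ.

```java
// Hypothetical helper for reading the JMH tables; not part of the patch.
class SpeedupSketch {
    // Ratio of baseline time to intrinsic time (both in ns/op); >1 means faster.
    static double speedup(double defaultNsPerOp, double intrinsicNsPerOp) {
        return defaultNsPerOp / intrinsicNsPerOp;
    }

    public static void main(String[] args) {
        // Scores for maxNumBytes = 50000, taken from the tables above.
        double raw = speedup(50046.586, 22718.726);     // testBase64Decode
        double mime = speedup(355698.369, 317113.667);  // testBase64MIMEDecode
        System.out.printf("raw decode:  %.2fx%n", raw);
        System.out.printf("MIME decode: %.2fx%n", mime);
    }
}
```
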

Firstly, I wonder how important this is for most applications. I don't actually 
know, but let's put that to one side. 

There's a lot of unrolling, particularly in the non-SIMD case. Please consider 
taking out some of the unrolling; I suspect it'd not increase time by very much 
but would greatly reduce the code cache pollution.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228
