Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Sandhya Viswanathan
On Thu, 24 Jun 2021 14:50:01 GMT, Vladimir Kozlov wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing Windows build warnings
>
> The rest of testing hs-tier1-4 and xcomp is finished and clean.
> So this is the only failure. I attached hs_err file to RFE.

Thanks a lot @vnkozlov for the review and test.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v8]

2021-06-24 Thread Vladimir Kozlov
On Thu, 24 Jun 2021 17:02:03 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the Java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixed Windows register stomping.
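
The block structure mentioned in the quoted description (a 256-byte main loop, then 64-byte chunks, then end fixup) can be pictured with plain arithmetic. The sketch below is only an illustration: the class `ChunkPlan` and its fields are hypothetical names, it assumes the sizes refer to input bytes, and it ignores any minimum-length thresholds or padding handling the real intrinsic may apply.

```java
// Hypothetical illustration, not code from the PR.
class ChunkPlan {
    final int blocks256;  // iterations of the 256-byte main loop
    final int chunks64;   // 64-byte chunks handled after the main loop
    final int tailBytes;  // bytes left over for the end fixup path

    ChunkPlan(int inputLength) {
        if (inputLength < 0) {
            throw new IllegalArgumentException("negative length");
        }
        blocks256 = inputLength / 256;
        int rest = inputLength % 256;
        chunks64 = rest / 64;
        tailBytes = rest % 64;
    }

    public static void main(String[] args) {
        ChunkPlan p = new ChunkPlan(1000);
        // prints "3 x 256, 3 x 64, 40 tail"
        System.out.println(p.blocks256 + " x 256, " + p.chunks64
                + " x 64, " + p.tailBytes + " tail");
    }
}
```

Under these assumptions, inputs shorter than 64 bytes never reach either vector loop, which would be consistent with the small gains reported for the smallest sizes in the table.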

The latest update fixed the TestBase64.java test issue.

-

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Scott Gibbons
On Thu, 24 Jun 2021 14:50:01 GMT, Vladimir Kozlov wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing Windows build warnings
>
> The rest of testing hs-tier1-4 and xcomp is finished and clean.
> So this is the only failure. I attached hs_err file to RFE.

Hi, @vnkozlov.  I just pushed a change that fixes a register overwrite.  Can 
you please start the tests again?

Thanks

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v8]

2021-06-24 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the Java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
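
As a quick check on how the Gain column is derived, the sketch below recomputes it for three rows copied from the table. It assumes the scores are times per operation (lower is better), which the description does not state but which is consistent with Gain being Base / Optimized. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, only for re-deriving the Gain column.
class GainSketch {
    // Gain as defined in the description: base score divided by optimized score.
    static double gain(double base, double optimized) {
        return base / optimized;
    }

    // Two decimal places, matching the table's formatting.
    static String format(double g) {
        return String.format(Locale.ROOT, "%.2f", g);
    }

    public static void main(String[] args) {
        double[][] rows = {
            {15.36, 15.32},      // testBase64Decode size 1
            {288.81, 32.04},     // testBase64Decode size 512
            {1814.96, 1671.28},  // testBase64MIMEDecode size 512
        };
        for (double[] r : rows) {
            System.out.println(format(gain(r[0], r[1])));  // 1.00, 9.01, 1.09
        }
    }
}
```

Averaging the 36 values in the Gain column this way gives roughly 2.6, which matches the average quoted in the description.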

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Fixed Windows register stomping.
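
As background for the isMIME parameter in the description above: MIME-encoded data contains line separators that are not Base64 alphabet characters, so a decoder meets "invalid" characters at regular intervals. The sketch below shows this with the public `java.util.Base64` API only. It does not exercise or represent the intrinsic itself, and the class `MimeDecodeSketch` and its helper methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; uses only the public java.util.Base64 API.
class MimeDecodeSketch {
    // 100 bytes encode to 136 Base64 characters, enough to force one line break.
    static byte[] payload() {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        return data;
    }

    // The MIME encoder emits lines of at most 76 characters separated by "\r\n".
    static String mimeEncoded() {
        return Base64.getMimeEncoder().encodeToString(payload());
    }

    // The basic decoder treats the line separator as an illegal character.
    static boolean basicDecoderRejects(String s) {
        try {
            Base64.getDecoder().decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String mime = mimeEncoded();
        System.out.println(mime.indexOf("\r\n"));  // 76
        System.out.println(Arrays.equals(payload(),
                Base64.getMimeDecoder().decode(mime)));  // true
        System.out.println(basicDecoderRejects(mime));  // true
    }
}
```

The MIME decoder skips characters outside the Base64 alphabet, while the basic decoder rejects them, which is presumably why the two cases are treated differently in the description.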

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/58461b80..1729232c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=07
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=06-07

  Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Vladimir Kozlov
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

The rest of the testing (hs-tier1-4 and xcomp) has finished and is clean.
So this is the only failure. I attached the hs_err file to the RFE.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Vladimir Kozlov
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

I hit a strange failure in the compiler/intrinsics/base64/TestBase64.java test on 
a Windows machine which has an Intel 8167M CPU (AVX512).

#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ff92bcbd99e, pid=24628, 
tid=6804
#
# Problematic frame:
# V  [jvm.dll+0xabd99e]  ObjectMonitor::object_peek+0xe
#

Current thread (0x0000016c923de2c0):  JavaThread "MainThread" [_thread_in_Java, 
id=6804, stack(0x00000060df600000,0x00000060df700000)]

Stack: [0x00000060df600000,0x00000060df700000],  sp=0x00000060df6fcb50,  free 
space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0xabd99e]  ObjectMonitor::object_peek+0xe  (objectMonitor.cpp:304)
V  [jvm.dll+0xc48d5b]  ObjectSynchronizer::quick_enter+0x9b  (synchronizer.cpp:331)
V  [jvm.dll+0xb9b6f6]  SharedRuntime::monitor_enter_helper+0x36  (sharedRuntime.cpp:2112)
V  [jvm.dll+0x389894]  Runtime1::monitorenter+0x94  (c1_Runtime1.cpp:748)
C  0x016c99c4a757

Java frames: (J=compiled Java code, 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-23 Thread Vladimir Kozlov
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

I will run our internal testing before approving this.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


RE: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-23 Thread Gibbons, Scott
Hi, David.  I don't have permissions to run tests in this repo.  I have tested 
on several x86 platforms (ICX, SKL) with several options.  I'll be running more 
tests today.

Thanks,
--Scott

-Original Message-
From: hotspot-dev  On Behalf Of David Holmes
Sent: Tuesday, June 22, 2021 7:21 PM
To: build-dev@openjdk.java.net; core-libs-...@openjdk.java.net; 
hotspot-...@openjdk.java.net; hotspot-compiler-...@openjdk.java.net
Subject: Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 
[v7]

On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the Java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

What testing has been done for this change? I do not see that the GitHub 
Actions have been run for this PR. Has this been tested on a range of x86 
systems with differing AVX capabilities?

Thanks,
David

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-22 Thread David Holmes
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

What testing has been done for this change? I do not see that the GitHub 
Actions have been run for this PR. Has this been tested on a range of x86 
systems with differing AVX capabilities?

Thanks,
David

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-22 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
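
For readers following along: the remark above about invalid Base64 characters appearing every 76 characters or so refers to the line separators in MIME-encoded text. The sketch below shows this using only the public java.util.Base64 API. It does not exercise the intrinsic or the isMIME parameter, and the class and method names (MimeDecodeSketch, basicDecoderRejects) are made up for illustration.

```java
import java.util.Base64;

final class MimeDecodeSketch {
    /** Returns true if the basic (non-MIME) decoder rejects the given text. */
    static boolean basicDecoderRejects(String text) {
        try {
            Base64.getDecoder().decode(text);
            return false;
        } catch (IllegalArgumentException e) {
            // CR and LF are outside the Base64 alphabet, so the basic decoder throws.
            return true;
        }
    }

    public static void main(String[] args) {
        // 100 bytes encode to 136 Base64 characters, which the MIME encoder
        // splits into lines of at most 76 characters separated by "\r\n".
        byte[] payload = new byte[100];
        String mime = Base64.getMimeEncoder().encodeToString(payload);

        System.out.println("first separator at index " + mime.indexOf("\r\n"));
        System.out.println("basic decoder rejects it: " + basicDecoderRejects(mime));
        System.out.println("MIME decoder yields " + Base64.getMimeDecoder().decode(mime).length + " bytes");
    }
}
```

The MIME decoder skips characters outside the alphabet, which is presumably why the description treats MIME input as a separate fast path rather than an error case.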

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Fixing Windows build warnings
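
As a rough illustration of the block structure described in the PR text (256-byte blocks, then 64-byte chunks, then end fixup), the arithmetic could look like the sketch below. This is an assumption about how a length might be split, counted in input bytes. The actual stub may apply minimum-length thresholds or split differently, and ChunkPlanSketch.plan is a hypothetical helper, not code from the patch.

```java
final class ChunkPlanSketch {
    /**
     * Hypothetical helper: splits an input length into
     * { number of 256-byte blocks, number of 64-byte chunks, leftover tail bytes }.
     */
    static int[] plan(int length) {
        int blocks256 = length / 256;   // handled by the main loop
        int rest = length % 256;
        int chunks64 = rest / 64;       // handled one 64-byte chunk at a time
        int tail = rest % 64;           // left for the end fixup
        return new int[] { blocks256, chunks64, tail };
    }

    public static void main(String[] args) {
        int[] p = plan(1000);
        System.out.println(p[0] + " x 256, " + p[1] + " x 64, tail " + p[2]);
    }
}
```

For a 1000-byte input this gives 3 blocks of 256, 3 chunks of 64, and a 40-byte tail.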

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/e1b4af9e..58461b80

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=06
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=05-06

  Stats: 24 lines in 1 file changed: 8 ins; 0 del; 16 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v6]

2021-06-22 Thread Sandhya Viswanathan
On Tue, 22 Jun 2021 20:47:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Addressing review comments.
>   
>   1. Changed errorvec handling
>   2. Removed unnecessary register copies and aliasing
>   3. Streamlined mask generation

@asgibbons The patch looks good to me.

@vnkozlov We need one more review for this patch. Could you please help?

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v6]

2021-06-22 Thread Sandhya Viswanathan
On Tue, 22 Jun 2021 20:47:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Addressing review comments.
>   
>   1. Changed errorvec handling
>   2. Removed unnecessary register copies and aliasing
>   3. Streamlined mask generation

Marked as reviewed by sviswanathan (Reviewer).

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v6]

2021-06-22 Thread Scott Gibbons
Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Addressing review comments.
  
  1. Changed errorvec handling
  2. Removed unnecessary register copies and aliasing
  3. Streamlined mask generation

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/bb73df6c..e1b4af9e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=05
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=04-05

  Stats: 55 lines in 1 file changed: 0 ins; 29 del; 26 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v5]

2021-06-19 Thread Sandhya Viswanathan
On Fri, 18 Jun 2021 22:12:11 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added comments.  Streamlined flow for decode.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6155:

> 6153:   __ subl(output_size, length);
> 6154:   __ movq(rax, -1);
> 6155:   __ shrxq(rax, rax, output_size);// Input mask in rax

I think this could also be implemented as:
__ movq(rax, -1);
__ bzhiq(rax, rax, length);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6173:

> 6171:   __ movq(rax, 64);
> 6172:   __ subq(rax, output_size);
> 6173:   __ shrxq(output_mask, output_mask, rax);

The output mask can also be computed using bzhiq:
__ movq(output_mask, -1);
__ bzhiq(output_mask, output_mask, output_size);
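
To compare the two ways of building a low-bits mask discussed in these comments, here is a small Java model of both. It assumes the shift count in the quoted code works out to 64 minus the number of valid bytes, which the snippets suggest but do not show in full. MaskSketch and its method names are hypothetical, and the Java expressions only model the instructions' documented behaviour: SHRX uses the low 6 bits of the count for 64-bit operands, and BZHI uses the low 8 bits of the index and leaves the source unchanged when the index exceeds 63.

```java
final class MaskSketch {
    /**
     * Models "movq(reg, -1); shrxq(reg, reg, 64 - n)".
     * Java's >>> on a long also uses only the low 6 bits of the shift count.
     */
    static long maskByShift(int n) {
        return -1L >>> (64 - n);
    }

    /**
     * Models "movq(reg, -1); bzhiq(reg, reg, n)".
     * Bits at positions >= n are cleared; an index above 63 keeps all bits.
     */
    static long maskByBzhi(int n) {
        int index = n & 0xFF;           // BZHI reads only the low 8 bits of the index
        return index >= 64 ? -1L : (1L << index) - 1;
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 1, 5, 63, 64}) {
            System.out.printf("%2d %016x %016x%n", n, maskByShift(n), maskByBzhi(n));
        }
    }
}
```

In this model the two forms agree for n from 1 to 64. They differ at n = 0, where the shift form yields all ones (the count wraps to 0) and the bzhi form yields 0, so the substitution is only equivalent if a zero length cannot reach this code.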

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6243:

> 6241: 
> 6242:   __ BIND(L_padding);
> 6243:   __ decrementq(r13, 1);

It would be good to use output_size here instead of r13.
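
The two decrements in the quoted padding code appear to adjust the output size once per trailing pad character. Assuming that reading is right (the role of r13 as the output size is inferred from this comment, not shown in the snippet), the same adjustment can be sketched in Java as follows. PaddingSketch.decodedLength is a hypothetical helper for padded input whose length is a multiple of 4.

```java
final class PaddingSketch {
    /**
     * Hypothetical helper: number of bytes a padded Base64 string decodes to.
     * Assumes encoded.length() is a multiple of 4 and padding, if any, is at the end.
     */
    static int decodedLength(String encoded) {
        int end = encoded.length();
        int outputSize = (end / 4) * 3;
        if (end > 0 && encoded.charAt(end - 1) == '=') {
            outputSize--;               // one pad character: the last group holds 2 bytes
            if (end > 1 && encoded.charAt(end - 2) == '=') {
                outputSize--;           // two pad characters: the last group holds 1 byte
            }
        }
        return outputSize;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"TWFu", "TWE=", "TQ=="}) {
            System.out.println(s + " -> " + decodedLength(s) + " bytes");
        }
    }
}
```

The result can be checked against java.util.Base64.getDecoder().decode(s).length for any validly padded input.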

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6249:

> 6247:   __ jcc(Assembler::notEqual, L_donePadding);
> 6248: 
> 6249:   __ decrementq(r13, 1);

It will be good to 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v5]

2021-06-18 Thread Sandhya Viswanathan
On Fri, 18 Jun 2021 22:12:11 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Added comments.  Streamlined flow for decode.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6004:

> 6002:   __ BIND(L_continue);
> 6003: 
> 6004:   __ vpxor(errorvec, errorvec, errorvec, Assembler::AVX_512bit);

Why is clearing errorvec needed here?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6023:

> 6021:   __ evmovdquq(tmp16_op3, pack16_op, Assembler::AVX_512bit);
> 6022:   __ evmovdquq(tmp16_op2, pack16_op, Assembler::AVX_512bit);
> 6023:   __ evmovdquq(tmp16_op1, pack16_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack16_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6026:

> 6024:   __ evmovdquq(tmp32_op3, pack32_op, Assembler::AVX_512bit);
> 6025:   __ evmovdquq(tmp32_op2, pack32_op, Assembler::AVX_512bit);
> 6026:   __ evmovdquq(tmp32_op1, pack32_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack32_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6051:

> 6049:   __ vpternlogd(t0, 0xfe, input1, input2, 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v5]

2021-06-18 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Added comments.  Streamlined flow for decode.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/247f2245..bb73df6c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=04
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=03-04

  Stats: 44 lines in 1 file changed: 18 ins; 10 del; 16 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v4]

2021-06-10 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
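
To make the Gain column concrete: the description defines Gain as Base Score divided by Optimized Score, so a value above 1.00 means the optimized score is the smaller number. The following is a minimal sketch that recomputes three rows of the table; the class and method names are hypothetical and only the row values come from the table above.

```java
import java.util.Locale;

public class Gain {
    // Gain as defined in the description: base score divided by optimized score.
    static double gain(double base, double optimized) {
        return base / optimized;
    }

    public static void main(String[] args) {
        // {base, optimized} pairs copied from three testBase64Decode rows.
        double[][] rows = {
            {15.36, 15.32},
            {288.81, 32.04},
            {9530.28, 483.37},
        };
        for (double[] r : rows) {
            // Locale.ROOT keeps the decimal separator a dot on every machine.
            System.out.println(String.format(Locale.ROOT, "%.2f", gain(r[0], r[1])));
        }
    }
}
```

Rounded to two decimals, the results should line up with the Gain values listed for those rows.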

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Addressing review comments.
  
  1. Modified evpmaddubsw.  Assert for avx512bw, renamed to vpmaddubsw.
  2. Added base64 to StubCodeMark names and associated variables.
  3. Added avx512bw check at top of vbmi loop. No need for avx512dq.
  4. Fixed all length references (addq=>addl, addq=>addptr, etc.).
  5. Converted to Address(base, offset) where appropriate.
  
  Compiles, and smoke-tested.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/d66e32e3..247f2245

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=02-03

  Stats: 104 lines in 5 files changed: 4 ins; 0 del; 100 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-10 Thread Scott Gibbons
On Tue, 8 Jun 2021 23:42:13 GMT, Sandhya Viswanathan wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing review comments.  Adding notes about isMIME parameter for other 
>> architectures; clarifying decodeBlock comments.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 4555:
> 
>> 4553: void Assembler::evpmaddubsw(XMMRegister dst, XMMRegister src1, 
>> XMMRegister src2, int vector_len) {
>> 4554:   assert(VM_Version::supports_avx512bw(), "");
>> 4555:   InstructionAttr attributes(vector_len, /* rex_w */ false, /* 
>> legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);
> 
> This instruction is also supported on AVX platforms. The assert check could 
> be as follows:
>   assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
>          vector_len == AVX_256bit ? VM_Version::supports_avx2() :
>          vector_len == AVX_512bit ? VM_Version::supports_avx512bw() : 0, "");
> Accordingly the instruction could be named as vpmaddubsw.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5688:
> 
>> 5686:   address base64_vbmi_lookup_lo_addr() {
>> 5687: __ align(64, (unsigned long) __ pc());
>> 5688: StubCodeMark mark(this, "StubRoutines", "lookup_lo");
> 
> It will be good to add base64 to the StubCodeMark name for this and all the 
> tables.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5983:
> 
>> 5981: // calculate length from offsets
>> 5982: __ movq(length, end_offset);
>> 5983: __ subq(length, start_offset);
> 
> These are 32-bit, so use movl, subl instead of movq, subq. Similarly for all 
> length-related instructions below.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5987:
> 
>> 5985: 
>> 5986: // If AVX512 VBMI not supported, just compile non-AVX code
>> 5987: if(VM_Version::supports_avx512_vbmi()) {
> 
> Need to also check for VM_Version::supports_avx512bw() support.
> Could you please check if VM_Version::supports_avx512dq is needed as well?

Done. No need for avx512dq.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6134:
> 
>> 6132:   __ subq(length, 64);
>> 6133:   __ addq(source, 64);
>> 6134:   __ addq(dest, 48);
> 
> All address related instructions here and below could use addptr, subptr etc.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6273:
> 
>> 6271: 
>> 6272: __ shrq(length, 2);// Multiple of 4 bytes only - length is # 
>> 4-byte chunks
>> 6273: __ cmpq(length, 0);
> 
> Should these be shrl, cmpl?

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6278:
> 
>> 6276: // Set up src and dst pointers properly
>> 6277: __ addq(source, start_offset); // Initial offset
>> 6278: __ addq(dest, dp);
> 
> The convention is to use addptr for pointers.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6284:
> 
>> 6282: __ shll(isURL, 8);// index into decode table based on isURL
>> 6283: __ lea(decode_table, 
>> ExternalAddress(StubRoutines::x86::base64_decoding_table_addr()));
>> 6284: __ addq(decode_table, isURL);
> 
> addptr here.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6297:
> 
>> 6295: __ orl(byte1, byte4);
>> 6296: 
>> 6297: __ incrementq(source, 4);
> 
> addptr here.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6317:
> 
>> 6315: __ load_signed_byte(byte4, Address(source, RegisterOrConstant(), 
>> Address::times_1, 3));
>> 6316: __ load_signed_byte(byte3, Address(decode_table, byte3, 
>> Address::times_1, 0));
>> 6317: __ load_signed_byte(byte4, Address(decode_table, byte4, 
>> Address::times_1, 0));
> 
> You could use Address(base, offset) form directly here and other places: e.g. 
> Address (source, 1) instead of Address(source, RegisterOrConstant(), 
> Address::times_1, 1).

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6329:
> 
>> 6327: __ subq(dest, rax);  // Number of bytes converted
>> 6328: __ movq(rax, dest);
>> 6329: __ pop(rbx);
> 
> subptr, movptr here.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7627:
> 
>> 7625:   StubRoutines::x86::_right_shift_mask = 
>> base64_right_shift_mask_addr();
>> 7626:   StubRoutines::_base64_encodeBlock = 
>> generate_base64_encodeBlock();
>> 7627:   if (VM_Version::supports_avx512_vbmi()) {
> 
> Need to add avx512bw check here also.

Done.

> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7628:
> 
>> 7626:   StubRoutines::_base64_encodeBlock = 
>> generate_base64_encodeBlock();
>> 7627:   if (VM_Version::supports_avx512_vbmi()) {
>> 7628: StubRoutines::x86::_lookup_lo = base64_vbmi_lookup_lo_addr();
> 
> It would be good to add base64 to these names.

Done.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-08 Thread Sandhya Viswanathan
On Tue, 8 Jun 2021 00:30:38 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing review comments.  Adding notes about isMIME parameter for other 
> architectures; clarifying decodeBlock comments.

@asgibbons Thanks a lot for contributing this. The performance gain is 
impressive. I have some minor comments. Please take a look.

src/hotspot/cpu/x86/assembler_x86.cpp line 4555:

> 4553: void Assembler::evpmaddubsw(XMMRegister dst, XMMRegister src1, 
> XMMRegister src2, int vector_len) {
> 4554:   assert(VM_Version::supports_avx512bw(), "");
> 4555:   InstructionAttr attributes(vector_len, /* rex_w */ false, /* 
> legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

This instruction is also supported on AVX platforms. The assert check could be 
as follows:
  assert(vector_len == AVX_128bit ? VM_Version::supports_avx() :
         vector_len == AVX_256bit ? VM_Version::supports_avx2() :
         vector_len == AVX_512bit ? VM_Version::supports_avx512bw() : 0, "");
Accordingly the instruction could be named as vpmaddubsw.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-08 Thread Scott Gibbons
On Tue, 8 Jun 2021 14:13:53 GMT, Jatin Bhateja wrote:

>> I must be missing something.  How is the brute force loop aligned if not by 
>> this directive?  I don't see an alignment anywhere else that could force it. 
>>  After the entry(), there are pushes and length comparisons followed by the 
>> conditional on VBMI.  The only thing I can guess would be that the jmp 
>> aligns, but I see no indication that that occurs.
>> 
>> Perhaps what you missed was that L_forceLoop is aligned (line 6288).  This 
>> is not the same label as L_bruteForce, which is a jump target from within 
>> the VBMI conditional (which should be aligned)?  Otherwise, I don't see how 
>> L_bruteForce could possibly already be aligned.
>
> Yes, I meant that the force loop already has alignment, so the earlier one can be removed.

Sorry - still confused.  These are two different labels, bound to two different 
locations.  I believe the alignments for both are justified.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-08 Thread Jatin Bhateja
On Tue, 8 Jun 2021 13:25:00 GMT, Scott Gibbons wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6239:
>> 
>>> 6237: 
>>> 6238:   __ align(32);
>>> 6239:   __ BIND(L_bruteForce);
>> 
>> Is this alignment needed, given that the brute force loop is already aligned?
>
> I must be missing something.  How is the brute force loop aligned if not by 
> this directive?  I don't see an alignment anywhere else that could force it.  
> After the entry(), there are pushes and length comparisons followed by the 
> conditional on VBMI.  The only thing I can guess would be that the jmp 
> aligns, but I see no indication that that occurs.
> 
> Perhaps what you missed was that L_forceLoop is aligned (line 6288).  This is 
> not the same label as L_bruteForce, which is a jump target from within the 
> VBMI conditional (which should be aligned)?  Otherwise, I don't see how 
> L_bruteForce could possibly already be aligned.

Yes, I meant that the force loop already has alignment, so the earlier one can be removed.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-08 Thread Scott Gibbons
On Tue, 8 Jun 2021 01:56:42 GMT, Jatin Bhateja wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing review comments.  Adding notes about isMIME parameter for other 
>> architectures; clarifying decodeBlock comments.
>
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6239:
> 
>> 6237: 
>> 6238:   __ align(32);
>> 6239:   __ BIND(L_bruteForce);
> 
> Is this alignment needed, given that the brute force loop is already aligned?

I must be missing something.  How is the brute force loop aligned if not by 
this directive?  I don't see an alignment anywhere else that could force it.  
After the entry(), there are pushes and length comparisons followed by the 
conditional on VBMI.  The only thing I can guess would be that the jmp aligns, 
but I see no indication that that occurs.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-07 Thread Jatin Bhateja
On Tue, 8 Jun 2021 00:30:38 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing review comments.  Adding notes about isMIME parameter for other 
> architectures; clarifying decodeBlock comments.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6239:

> 6237: 
> 6238:   __ align(32);
> 6239:   __ BIND(L_bruteForce);

Is this alignment needed, given that the brute force loop is already aligned?

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

2021-06-07 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
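
As a rough illustration of the loop structure the description mentions (256-byte blocks per iteration, then 64-byte chunks, then end fixup), the sketch below just does the arithmetic for a given input length. It is not the stub code: the class and method names are hypothetical, and any extra thresholds or validity checks the real intrinsic applies are not modeled.

```java
public class ChunkPlan {
    // Hypothetical helper: splits an input length into the three phases
    // named in the description. Returns {blocks of 256, chunks of 64, tail bytes}.
    static int[] plan(int length) {
        int blocks256 = length / 256;   // main loop iterations
        int rest = length % 256;        // what the main loop leaves behind
        int chunks64 = rest / 64;       // 64-byte chunk iterations
        int tail = rest % 64;           // left for end fixup decoding
        return new int[] {blocks256, chunks64, tail};
    }

    public static void main(String[] args) {
        int[] p = plan(1000);
        System.out.println(p[0] + " x 256, " + p[1] + " x 64, tail " + p[2]);
    }
}
```

For an input of 1000 bytes this gives three 256-byte blocks, three 64-byte chunks, and a 40-byte tail.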

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Fixing review comments.  Adding notes about isMIME parameter for other 
architectures; clarifying decodeBlock comments.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/00fd5621..d66e32e3

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=01-02

  Stats: 19 lines in 3 files changed: 8 ins; 4 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v2]

2021-06-07 Thread Corey Ashford
On Tue, 8 Jun 2021 00:11:42 GMT, Scott Gibbons wrote:

>> src/java.base/share/classes/java/util/Base64.java line 813:
>> 
>>> 811: while (sp < sl) {
>>> 812: if (shiftto == 18 && sp < sl - 4) {   // fast path
>>> 813: int dl = decodeBlock(src, sp, sl, dst, dp, isURL, 
>>> isMIME);
>> 
>> This new param is passed all the way down to the intrinsic.  I think 
>> existing intrinsics can safely ignore this parameter if it doesn't help the 
>> implementation (for example PPC64-LE has 16-byte vector registers, so isn't 
>> quite as seriously impacted by MIME).  However, in the code for the PPC64-LE 
>> intrinsic, this new parameter isn't mentioned.  I think if you're going to 
>> add a new parameter, it should be mentioned in the existing intrinsics as 
>> being present, but unused.
>
> Are you suggesting that I change *all* intrinsic implementations (aarch64, 
> ppc, etc.)?  I have no problem doing that - just checking if this is what's 
> desired.

Yes. I didn't realize that there's a decodeBlock intrinsic for aarch64 already, 
but yeah it should only be a couple of lines of comments for each.
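
For background on why MIME input is awkward for wide vector registers, here is a minimal sketch using only the public java.util.Base64 API, not the intrinsic. The MIME encoder breaks its output into 76-character lines separated by "\r\n"; those separator bytes are outside the Base64 alphabet, so the basic decoder rejects them while the MIME decoder skips them. The class and helper names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

public class MimeLineBreaks {
    // Encodes with the MIME encoder, which inserts "\r\n" after every 76 characters.
    static String mimeEncode(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // Returns true if the basic (non-MIME) decoder refuses the input.
    static boolean basicDecoderRejects(String s) {
        try {
            Base64.getDecoder().decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        Arrays.fill(data, (byte) 'x');

        String mime = mimeEncode(data);
        System.out.println("first line break at index " + mime.indexOf("\r\n"));
        System.out.println("basic decoder rejects: " + basicDecoderRejects(mime));
        System.out.println("MIME round trip ok: "
                + Arrays.equals(data, Base64.getMimeDecoder().decode(mime)));
    }
}
```

A 64-byte vector load over such input runs into a non-alphabet byte roughly once per line, which is the situation the isMIME flag lets an implementation plan for.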

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v2]

2021-06-07 Thread Scott Gibbons
On Mon, 7 Jun 2021 22:34:33 GMT, Corey Ashford wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Update full name
>
> src/java.base/share/classes/java/util/Base64.java line 813:
> 
>> 811: while (sp < sl) {
>> 812: if (shiftto == 18 && sp < sl - 4) {   // fast path
>> 813: int dl = decodeBlock(src, sp, sl, dst, dp, isURL, 
>> isMIME);
> 
> This new param is passed all the way down to the intrinsic.  I think existing 
> intrinsics can safely ignore this parameter if it doesn't help the 
> implementation (for example PPC64-LE has 16-byte vector registers, so isn't 
> quite as seriously impacted by MIME).  However, in the code for the PPC64-LE 
> intrinsic, this new parameter isn't mentioned.  I think if you're going to 
> add a new parameter, it should be mentioned in the existing intrinsics as 
> being present, but unused.

Are you suggesting that I change *all* intrinsic implementations (aarch64, ppc, 
etc.)?  I have no problem doing that - just checking if this is what's desired.

> src/java.base/share/classes/java/util/Base64.java line 818:
> 
>> 816:  * bytes of data were returned.
>> 817:  */
>> 818: int chars_decoded = ((dl + 2) / 3) * 4;
> 
> In the PR comments, you say, "A change was also made here removing the 
> restriction that the intrinsic must return an even multiple of 3 bytes 
> decoded.", however there's still a comment in the code above that says:
> 
>  * If the intrinsic function does not process all of the bytes in
>  * src, it must process a multiple of four of them, making the
>  * returned destination length a multiple of three.
> 
> So this comment needs to be changed or removed to reflect your commit.

I will change the comment, and add verbiage regarding the new parameter.  Thank 
you.
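
To spell out the arithmetic in the quoted `chars_decoded` line: every 4 Base64 characters decode to 3 bytes, so when the returned length `dl` is not a multiple of 3, the expression rounds up to the next whole 4-character group. The sketch below is a standalone illustration with hypothetical class and method names; it is not the Base64.java code itself.

```java
public class CharsDecoded {
    // Same expression as the line under review: source characters
    // accounted for when dl bytes were written to the destination.
    static int charsDecoded(int dl) {
        return ((dl + 2) / 3) * 4;
    }

    public static void main(String[] args) {
        for (int dl : new int[] {3, 4, 5, 6}) {
            System.out.println(dl + " bytes <- " + charsDecoded(dl) + " chars");
        }
    }
}
```

With integer division, 3 bytes map to 4 characters, while 4, 5, and 6 bytes all map to 8 characters.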

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v2]

2021-06-07 Thread Corey Ashford
On Mon, 7 Jun 2021 13:20:20 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Update full name

Thanks for making this interesting update, which improves the flexibility of 
intrinsics to make use of isMIME.

src/java.base/share/classes/java/util/Base64.java line 813:

> 811: while (sp < sl) {
> 812: if (shiftto == 18 && sp < sl - 4) {   // fast path
> 813: int dl = decodeBlock(src, sp, sl, dst, dp, isURL, isMIME);

This new param is passed all the way down to the intrinsic.  I think existing 
intrinsics can safely ignore this parameter if it doesn't help the 
implementation (for example PPC64-LE has 16-byte vector registers, so isn't 
quite as seriously impacted by MIME).  However, in the code for the PPC64-LE 
intrinsic, this new parameter isn't mentioned.  I think if you're going to add 
a new parameter, it should be mentioned in the existing intrinsics as being 
present, but unused.

src/java.base/share/classes/java/util/Base64.java line 818:

> 816:  * bytes of data were returned.
> 817:  */
> 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v2]

2021-06-07 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Update full name

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/e527557a..00fd5621

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=00-01

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512

2021-06-07 Thread Erik Joelsson
On Fri, 4 Jun 2021 20:55:51 GMT, Scott Gibbons wrote:

> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26

The gitignore change looks ok, but should maybe be a separate change.

-

Marked as reviewed by erikj (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/4368


RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512

2021-06-05 Thread Scott Gibbons
Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
Also allows for performance improvement for non-AVX-512 enabled platforms. Due 
to the nature of MIME-encoded inputs, modify the intrinsic signature to accept 
an additional parameter (isMIME) for fast-path MIME decoding.

A change was made to the signature of DecodeBlock in Base64.java to provide the 
intrinsic information as to whether MIME decoding was being done.  This allows 
for the intrinsic to bypass the expensive setup of zmm registers from AVX 
tables, knowing there may be invalid Base64 characters every 76 characters or 
so.  A change was also made here removing the restriction that the intrinsic 
must return an even multiple of 3 bytes decoded.  This implementation handles 
the pad characters at the end of the string and will return the actual number 
of characters decoded.
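
The "invalid Base64 characters every 76 characters or so" are presumably the 
line separators that MIME encoding inserts.  A small sketch using only the 
public `java.util.Base64` API shows why a MIME input differs from a basic one; 
the class name `MimeSeparatorDemo` and its helper methods are made up for this 
illustration and are not part of the change.

```java
import java.util.Arrays;
import java.util.Base64;

public class MimeSeparatorDemo {
    // Deterministic sample payload; the contents are arbitrary.
    static byte[] sample(int n) {
        byte[] raw = new byte[n];
        for (int i = 0; i < n; i++) {
            raw[i] = (byte) i;
        }
        return raw;
    }

    // True if the basic (non-MIME) decoder refuses the input.
    static boolean basicDecoderRejects(byte[] encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] raw = sample(100);
        // The MIME encoder breaks its output into 76-character lines with CRLF.
        byte[] mime = Base64.getMimeEncoder().encode(raw);
        System.out.println("CR at index 76: " + (mime[76] == '\r'));
        // The MIME decoder skips characters outside the Base64 alphabet.
        byte[] back = Base64.getMimeDecoder().decode(mime);
        System.out.println("round trip ok: " + Arrays.equals(raw, back));
        // The basic decoder treats the same separators as errors.
        System.out.println("basic decoder rejects: " + basicDecoderRejects(mime));
    }
}
```

A vectorized loop that stops at the first non-alphabet character would 
therefore be interrupted roughly once per line on such input, which matches the 
motivation given for passing isMIME down to the intrinsic.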

The AVX portion of this code will decode in blocks of 256 bytes per loop 
iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
non-AVX code is an assembly-optimized version of the java DecodeBlock and 
behaves identically.
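
The tiering described above can be pictured with plain arithmetic.  This is a 
sketch under stated assumptions, not the stub's actual control flow: the real 
code is assembly and may apply minimum lengths or other conditions not modeled 
here, and the names `ChunkPlan` and `plan` are hypothetical.

```java
public class ChunkPlan {
    // Hypothetical helper: splits an input length into
    // {256-byte blocks, 64-byte chunks, leftover bytes for the end fixup}.
    static int[] plan(int length) {
        int blocks256 = length / 256;   // main loop iterations
        int rest = length % 256;
        int chunks64 = rest / 64;       // single-vector iterations
        int tail = rest % 64;           // handled by the end fixup
        return new int[] {blocks256, chunks64, tail};
    }

    public static void main(String[] args) {
        int[] p = plan(1000);
        System.out.println(p[0] + " x 256, " + p[1] + " x 64, tail " + p[2]);
    }
}
```

For a 1000-byte input this gives three 256-byte blocks, three 64-byte chunks, 
and 40 bytes left for the fixup path.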

Running the Base64Decode benchmark, this change increases decode performance by 
an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers are 
given in the table below.

**Base Score** is without intrinsic support, **Optimized Score** is using this 
intrinsic, and **Gain** is **Base** / **Optimized**.
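
The Gain column in the table can be reproduced from the two score columns.  The 
sketch below assumes the published figure is the ratio rounded to two decimal 
places; the class and method names are made up for illustration.  Since a gain 
above 1 is reported as an improvement, the scores are presumably times, with 
lower being better.

```java
public class GainCheck {
    // Gain = Base / Optimized, rounded to two decimals (rounding is an assumption).
    static double gain(double base, double optimized) {
        return Math.round(base / optimized * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // Score pairs copied from the table.
        System.out.println(gain(9530.28, 483.37));
        System.out.println(gain(288.81, 32.04));
        System.out.println(gain(15.36, 15.32));
    }
}
```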


Benchmark Name | Base Score | Optimized Score | Gain
-- | -- | -- | --
testBase64Decode size 1 | 15.36 | 15.32 | 1.00
testBase64Decode size 3 | 17.00 | 16.72 | 1.02
testBase64Decode size 7 | 20.60 | 18.82 | 1.09
testBase64Decode size 32 | 34.21 | 26.77 | 1.28
testBase64Decode size 64 | 54.43 | 38.35 | 1.42
testBase64Decode size 80 | 66.40 | 48.34 | 1.37
testBase64Decode size 96 | 73.16 | 52.90 | 1.38
testBase64Decode size 112 | 84.93 | 51.82 | 1.64
testBase64Decode size 512 | 288.81 | 32.04 | 9.01
testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
testBase64WithErrorInputsDecode size 20000 | 1398.44 | 1138.17 | 1.23
testBase64WithErrorInputsDecode size 50000 | 1409.41 | 1114.16 | 1.26

-

Commit messages:
 - Merge remote-tracking branch 'origin/base64_length_restrict' into 
base64_decode
 - Condition decode intrinsic within generator instead of outside to allow 
non-AVX acceleration
 - Adding MIME to signature.
 - Adding MIME to signature.
 - Adding MIME to signature.
 - Initialize vector before loop
 - Initialize vector before loop
 - Wrong register lengths.
 - Wrong register lengths.
 - writing in wrong order
 - ... and 418 more: 
https://git.openjdk.java.net/jdk/compare/48dc72b7...e527557a

Changes: https://git.openjdk.java.net/jdk/pull/4368/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=4368&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8268276
  Stats: 743 lines in 10 files changed: 736 ins; 0 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368