Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

Sandhya Viswanathan Tue, 08 Jun 2021 18:01:16 -0700

On Tue, 8 Jun 2021 00:30:38 GMT, Scott Gibbons 
<github.com+6704669+asgibb...@openjdk.org> wrote:


>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing review comments.  Adding notes about isMIME parameter for other 
> architectures; clarifying decodeBlock comments.

@asgibbons Thanks a lot for contributing this. The performance gain is 
impressive. I have some minor comments. Please take a look.

src/hotspot/cpu/x86/assembler_x86.cpp line 4555:

> 4553: void Assembler::evpmaddubsw(XMMRegister dst, XMMRegister src1, 
> XMMRegister src2, int vector_len) {
> 4554:   assert(VM_Version::supports_avx512bw(), "");
> 4555:   InstructionAttr attributes(vector_len, /* rex_w */ false, /* 
> legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

This instruction is also supported on AVX platforms. The assert check could be 
as follows:
  assert(vector_len == AVX_128bit? VM_Version::supports_avx() :
             vector_len == AVX_256bit? VM_Version::supports_avx2() :
             vector_len == AVX_512bit? VM_Version::supports_avx512bw() : 0, "");
Accordingly the instruction could be named as vpmaddubsw.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5688:

> 5686:   address base64_vbmi_lookup_lo_addr() {
> 5687:     __ align(64, (unsigned long) __ pc());
> 5688:     StubCodeMark mark(this, "StubRoutines", "lookup_lo");

It will be good to add base64 to the StubCodeMark name for this and all the 
tables.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5983:

> 5981:     // calculate length from offsets
> 5982:     __ movq(length, end_offset);
> 5983:     __ subq(length, start_offset);

These are 32bit, so movl, subl instead of movq, subq. Similar for all length 
relates instructions below.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5987:

> 5985: 
> 5986:     // If AVX512 VBMI not supported, just compile non-AVX code
> 5987:     if(VM_Version::supports_avx512_vbmi()) {

Need to also check for VM_Version::supports_avx512bw() support.
Could you please check if VM_Version::supports_avx512dq is needed as well?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6134:

> 6132:       __ subq(length, 64);
> 6133:       __ addq(source, 64);
> 6134:       __ addq(dest, 48);

All address related instructions here and below could use addptr, subptr etc.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6273:

> 6271: 
> 6272:     __ shrq(length, 2);    // Multiple of 4 bytes only - length is # 
> 4-byte chunks
> 6273:     __ cmpq(length, 0);

Should these be shrl, cmpl?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6278:

> 6276:     // Set up src and dst pointers properly
> 6277:     __ addq(source, start_offset);     // Initial offset
> 6278:     __ addq(dest, dp);

The convention is to use addptr for pointers.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6284:

> 6282:     __ shll(isURL, 8);    // index into decode table based on isURL
> 6283:     __ lea(decode_table, 
> ExternalAddress(StubRoutines::x86::base64_decoding_table_addr()));
> 6284:     __ addq(decode_table, isURL);

addptr here.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6297:

> 6295:     __ orl(byte1, byte4);
> 6296: 
> 6297:     __ incrementq(source, 4);

addptr here.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6317:

> 6315:     __ load_signed_byte(byte4, Address(source, RegisterOrConstant(), 
> Address::times_1, 3));
> 6316:     __ load_signed_byte(byte3, Address(decode_table, byte3, 
> Address::times_1, 0));
> 6317:     __ load_signed_byte(byte4, Address(decode_table, byte4, 
> Address::times_1, 0));

You could use Address(base, offset) form directly here and other places: e.g. 
Address (source, 1) instead of Address(source, RegisterOrConstant(), 
Address::times_1, 1).

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6329:

> 6327:     __ subq(dest, rax);      // Number of bytes converted
> 6328:     __ movq(rax, dest);
> 6329:     __ pop(rbx);

subptr, movptr here.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7627:

> 7625:       StubRoutines::x86::_right_shift_mask = 
> base64_right_shift_mask_addr();
> 7626:       StubRoutines::_base64_encodeBlock = generate_base64_encodeBlock();
> 7627:       if (VM_Version::supports_avx512_vbmi()) {

Need to add avx512bw check here also.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7628:

> 7626:       StubRoutines::_base64_encodeBlock = generate_base64_encodeBlock();
> 7627:       if (VM_Version::supports_avx512_vbmi()) {
> 7628:         StubRoutines::x86::_lookup_lo = base64_vbmi_lookup_lo_addr();

It would be good to add base64 to these names.

-------------

PR: https://git.openjdk.java.net/jdk/pull/4368

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

Reply via email to