On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson <ngas...@openjdk.org> wrote:

>> Firstly, I wonder how important this is for most applications. I don't 
>> actually know, but let's put that to one side. 
>> 
>> There's a lot of unrolling, particularly in the non-SIMD case. Please 
>> consider taking out some of the unrolling; I suspect it'd not increase time 
>> by very much but would greatly reduce the code cache pollution. It's very 
>> tempting to unroll everything to make a benchmark run quickly, but we have 
>> to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like 
> the string intrinsics that are expanded at every call site.

Thanks for the comments.

> Firstly, I wonder how important this is for most applications. I don't 
> actually know, but let's put that to one side.
>

As noted in JEP 135, Base64 is frequently used to encode binary/octet 
sequences that are transmitted as textual data.
It is commonly used by applications for Multipurpose Internet Mail Extensions 
(MIME), encoding passwords for HTTP headers, message digests, etc.
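As a concrete illustration of these uses, the JDK API in question exposes a basic encoder (plain RFC 4648 output, e.g. for HTTP headers) and a MIME encoder that wraps lines at 76 characters. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Usage {
    public static void main(String[] args) {
        // Basic encoder: RFC 4648 alphabet, no line separators.
        String s = Base64.getEncoder()
                .encodeToString("Hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(s); // SGVsbG8=

        // MIME encoder: inserts "\r\n" after every 76 output characters.
        byte[] data = new byte[60]; // 60 bytes -> 80 base64 chars -> wrapped
        String mime = Base64.getMimeEncoder().encodeToString(data);
        System.out.println(mime.indexOf("\r\n")); // 76
    }
}
```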
 
> There's a lot of unrolling, particularly in the non-SIMD case. Please 
> consider taking out some of the unrolling; I suspect it'd not increase time 
> by very much but would greatly reduce the code cache pollution. It's very 
> tempting to unroll everything to make a benchmark run quickly, but we have to 
> take a balanced approach.
>

There is no unrolling in the non-SIMD case; the instructions simply load, 
process, and store data within loops.
About half of the code size is the error handling in the SIMD case:
    // handle illegal input
    if (size == 16) {
      Label ErrorInLowerHalf;
      __ umov(rscratch1, in2, __ D, 0);
      __ cbnz(rscratch1, ErrorInLowerHalf);

      // illegal input is in higher half, store the lower half now.
      __ st3(out0, out1, out2, __ T8B, __ post(dst, 24));

      for (int i = 8; i < 15; i++) {
        __ umov(rscratch2, in2, __ B, (u1) i);
        __ cbnz(rscratch2, Exit);
        __ umov(r10, out0, __ B, (u1) i);
        __ umov(r11, out1, __ B, (u1) i);
        __ umov(r12, out2, __ B, (u1) i);
        __ strb(r10, __ post(dst, 1));
        __ strb(r11, __ post(dst, 1));
        __ strb(r12, __ post(dst, 1));
      }
      __ b(Exit);
I think I can rewrite this part as loops.
With an initial implementation, the code size is reduced by almost half 
(1312B -> 748B). Does that sound OK to you?
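For readers less familiar with the assembly above, the partial-store path can be mirrored in scalar code: `err[i]` is nonzero where lane `i` saw an illegal character, and `out0`/`out1`/`out2` hold the three decoded bytes per lane. The names and signature here are hypothetical, purely illustrative of what the stub computes:

```java
public class StoreSketch {
    // Hypothetical scalar mirror of the stub's error-handling loop:
    // store 3 decoded bytes per lane until the first illegal input lane.
    static int storeUntilError(byte[] err, byte[] out0, byte[] out1,
                               byte[] out2, byte[] dst, int dstPos) {
        for (int i = 0; i < err.length; i++) {
            if (err[i] != 0) {
                break;        // illegal character in this lane: stop storing
            }
            dst[dstPos++] = out0[i];
            dst[dstPos++] = out1[i];
            dst[dstPos++] = out2[i];
        }
        return dstPos;        // position after the last valid byte written
    }

    public static void main(String[] args) {
        byte[] dst = new byte[9];
        int n = storeUntilError(new byte[]{0, 0, 1},
                                new byte[]{1, 2, 3},
                                new byte[]{4, 5, 6},
                                new byte[]{7, 8, 9}, dst, 0);
        System.out.println(n); // 6: two valid lanes of 3 bytes each
    }
}
```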

> Please consider losing the non-SIMD case. It doesn't result in any 
> significant gain.
>

The non-SIMD case is useful for MIME decoding performance.
MIME base64-encoded data is arranged in lines (the line length can be set by 
the user, with a maximum of 76 bytes).
Newline characters, e.g. `\r\n`, are illegal in the base64 alphabet but are 
ignored by MIME decoding.
The SIMD case works as `load data -> two vector table lookups -> 
combining -> error detection -> store data`.
When using SIMD for MIME decoding, the first byte of the input may be a 
newline character.
The SIMD case would then execute a lot of wasted code before it can detect 
the illegal character and exit, whereas the non-SIMD case needs only a few 
ldr, orr, and str instructions for error detection.
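The decoder behavior described above can be seen directly at the Java level: the MIME decoder skips characters outside the base64 alphabet (such as `\r\n`), while the basic decoder rejects them. A small sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class MimeNewlines {
    public static void main(String[] args) {
        String wrapped = "SGVs\r\nbG8="; // "Hello" with an embedded newline

        // MIME decoder: silently ignores non-alphabet characters.
        byte[] ok = Base64.getMimeDecoder().decode(wrapped);
        System.out.println(new String(ok, StandardCharsets.UTF_8)); // Hello

        // Basic decoder: the same input is illegal.
        try {
            Base64.getDecoder().decode(wrapped);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // rejected
        }
    }
}
```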

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228
