On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson <ngas...@openjdk.org> wrote:
>> Firstly, I wonder how important this is for most applications. I don't
>> actually know, but let's put that to one side.
>>
>> There's a lot of unrolling, particularly in the non-SIMD case. Please
>> consider taking out some of the unrolling; I suspect it'd not increase time
>> by very much but would greatly reduce the code cache pollution. It's very
>> tempting to unroll everything to make a benchmark run quickly, but we have
>> to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like
> the string intrinsics that are expanded at every call site.

Thanks for the comments.

> Firstly, I wonder how important this is for most applications. I don't
> actually know, but let's put that to one side.

As claimed in JEP 135, Base64 is frequently used to encode binary/octet sequences that are transmitted as textual data. It is commonly used by applications using Multipurpose Internet Mail Extensions (MIME), encoding passwords for HTTP headers, message digests, etc.

> There's a lot of unrolling, particularly in the non-SIMD case. Please
> consider taking out some of the unrolling; I suspect it'd not increase time
> by very much but would greatly reduce the code cache pollution. It's very
> tempting to unroll everything to make a benchmark run quickly, but we have to
> take a balanced approach.

There is no loop unrolling in the non-SIMD case. The instructions simply load, process, and store data within loops.
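For context, the MIME-decoding behavior discussed in this thread can be seen from the Java API side: `java.util.Base64.getMimeDecoder()` skips line separators that the basic decoder rejects, which is exactly the non-alphabet input the decode stub has to handle. A minimal sketch (not part of this patch):

```java
import java.util.Base64;

public class MimeDecodeDemo {
    public static void main(String[] args) {
        // "hello" encodes to "aGVsbG8="; insert a MIME line break mid-stream.
        String withNewline = "aGVs\r\nbG8=";

        // The MIME decoder ignores characters outside the base64 alphabet,
        // such as \r and \n, so the embedded line break is skipped.
        byte[] decoded = Base64.getMimeDecoder().decode(withNewline);
        System.out.println(new String(decoded)); // prints "hello"

        // The basic decoder treats the same bytes as illegal input.
        try {
            Base64.getDecoder().decode(withNewline);
        } catch (IllegalArgumentException e) {
            System.out.println("basic decoder rejects it: " + e.getMessage());
        }
    }
}
```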
About half of the code size is the error handling in the SIMD case:

```
    // handle illegal input
    if (size == 16) {
      Label ErrorInLowerHalf;
      __ umov(rscratch1, in2, __ D, 0);
      __ cbnz(rscratch1, ErrorInLowerHalf);

      // illegal input is in higher half, store the lower half now.
      __ st3(out0, out1, out2, __ T8B, __ post(dst, 24));

      for (int i = 8; i < 15; i++) {
        __ umov(rscratch2, in2, __ B, (u1) i);
        __ cbnz(rscratch2, Exit);
        __ umov(r10, out0, __ B, (u1) i);
        __ umov(r11, out1, __ B, (u1) i);
        __ umov(r12, out2, __ B, (u1) i);
        __ strb(r10, __ post(dst, 1));
        __ strb(r11, __ post(dst, 1));
        __ strb(r12, __ post(dst, 1));
      }
      __ b(Exit);
```

I think I can rewrite this part as loops. With an initial implementation, the code size is reduced by almost half (1312B -> 748B). Does that sound OK to you?

> Please consider losing the non-SIMD case. It doesn't result in any
> significant gain.

The non-SIMD case is useful for MIME decoding performance. MIME base64-encoded data is arranged in lines (the line length can be set by the user, up to a maximum of 76 bytes). Newline characters, e.g. `\r\n`, are illegal base64 input but are ignored by MIME decoding. The SIMD case works as `load data -> two vector table lookups -> combining -> error detection -> store data`. When using SIMD for MIME decoding, the first byte of the input may be a newline character, so the SIMD code executes a lot of wasted work before it can detect the error and exit; with the non-SIMD case, only a few `ldr`s, `orr`s, and `str`s are needed for error detection.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228