On Mon, 23 Jan 2023 18:14:16 GMT, Scott Gibbons <d...@openjdk.org> wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2661:
>> 
>>> 2659:     __ vpbroadcastq(xmm4, Address(r13, 0), Assembler::AVX_256bit);
>>> 2660:     __ vmovdqu(xmm11, Address(r13, 0x28));
>>> 2661:     __ vpbroadcastb(xmm10, Address(r13, 0), Assembler::AVX_256bit);
>> 
>> Sorry in advance since I'm probably reading this wrong: the data that `r13` 
>> is pointing to appears to be a repeated byte pattern (`0x2f2f2f...`), does 
>> this mean this `vpbroadcastb` and the `vpbroadcastq` above end up filling up 
>> their respective registers with the exact same bits? If so, and since 
>> neither of them is mutated in the code below, then perhaps this can be 
>> simplified a bit.
>
> You're reading it correctly - this is redundant and could be handled 
> differently, as the same value is being loaded into ymm4 and ymm10.  I don't 
> think there will be any significant performance gain either way.  This was 
> done in this manner to allow easier transition to URL acceleration when it is 
> implemented, as URLs require handling '-' and '_' instead of '+' and '/' ('/' 
> = 0x2f).

I was mainly curious if there was some obscure detail or difference that eluded 
me. It wouldn't be the first time!
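
To make the question concrete, here is a small Java model of the two 
broadcasts. It is only a sketch, not the stub code: the class and method names 
are made up for illustration, and it assumes the eight bytes at `r13 + 0` are 
all `0x2f`, as they appear to be.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical model of the two broadcasts; not part of the HotSpot sources.
public class BroadcastEquivalence {
    // Models vpbroadcastb: copy the first source byte into all 32 bytes
    // of a 256-bit register.
    static byte[] broadcastByte(byte[] src) {
        byte[] ymm = new byte[32];
        Arrays.fill(ymm, src[0]);
        return ymm;
    }

    // Models vpbroadcastq: copy the first 8 source bytes into each of the
    // four 64-bit lanes of a 256-bit register.
    static byte[] broadcastQuad(byte[] src) {
        long q = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
        ByteBuffer ymm = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
        for (int lane = 0; lane < 4; lane++) {
            ymm.putLong(q);
        }
        return ymm.array();
    }

    // True when both broadcasts leave identical bits in the register.
    static boolean sameBits(byte[] src) {
        return Arrays.equals(broadcastByte(src), broadcastQuad(src));
    }

    public static void main(String[] args) {
        // Assumption: the memory at r13 + 0 is a run of 0x2f bytes.
        byte[] uniform = new byte[8];
        Arrays.fill(uniform, (byte) 0x2f);
        System.out.println("uniform pattern: " + sameBits(uniform));

        // A pattern with differing bytes no longer broadcasts identically.
        byte[] mixed = {0x2b, 0x2f, 0, 0, 0, 0, 0, 0};
        System.out.println("mixed pattern:   " + sameBits(mixed));
    }
}
```

With a uniform source the two results match byte for byte; they diverge as 
soon as any of the first eight bytes differ, which would be the case to watch 
for if the table layout ever changes.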

Since it's outside the loop, you're probably right that it matters little to 
overall performance, though we should be mindful of the setup overhead of 
transitioning into AVX code, especially since inputs are likely skewed 
towards the smallest applicable lengths. I think it would be prudent to run 
the microbenchmark with values near the AVX2 threshold, such as 
`maxNumBytes = 48`.
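
As a rough stand-in for that check, something like the sketch below could be 
used for a quick local look at small inputs. It is not the JMH microbenchmark 
and its timings are crude; the class name, the iteration counts, and the range 
of sizes around 48 are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

// Hypothetical quick check; a proper measurement should use the JMH benchmark.
public class SmallDecodeCheck {
    // Builds numBytes of pseudo-random data and returns its Base64 encoding.
    static byte[] encodedInput(int numBytes) {
        byte[] src = new byte[numBytes];
        new Random(42).nextBytes(src);
        return Base64.getEncoder().encode(src);
    }

    // Confirms that decoding the encoded form restores the original bytes.
    static boolean roundTrips(int numBytes) {
        byte[] src = new byte[numBytes];
        new Random(42).nextBytes(src);
        byte[] decoded = Base64.getDecoder().decode(Base64.getEncoder().encode(src));
        return Arrays.equals(src, decoded);
    }

    // Returns the average wall-clock nanoseconds per decode call.
    static double nanosPerDecode(int numBytes, int iterations) {
        byte[] encoded = encodedInput(numBytes);
        Base64.Decoder decoder = Base64.getDecoder();
        long sink = 0; // keeps the decode results live
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += decoder.decode(encoded).length;
        }
        long elapsed = System.nanoTime() - start;
        if (sink != (long) numBytes * iterations) {
            throw new IllegalStateException("unexpected decoded length");
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        // Assumed sizes of interest: a few source lengths on either side of 48.
        int[] sizes = {40, 44, 47, 48, 49, 52, 56};
        for (int size : sizes) {
            if (!roundTrips(size)) {
                throw new IllegalStateException("round trip failed for " + size);
            }
            nanosPerDecode(size, 200_000); // warm-up so the decode path gets compiled
        }
        for (int size : sizes) {
            System.out.printf("%d bytes: %.1f ns/decode%n", size, nanosPerDecode(size, 2_000_000));
        }
    }
}
```

Comparing the numbers with and without the intrinsic enabled would show 
whether the setup cost is visible at these sizes.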

If you choose to keep the code as-is, would you mind documenting the rationale 
behind the redundancy? (Is there WIP on more generalized URL acceleration that 
could be referenced?)

-------------

PR: https://git.openjdk.org/jdk/pull/12126
