On Thu, 27 Mar 2025 21:42:08 GMT, Volodymyr Paprotski <[email protected]>
wrote:
>> Ferenc Rakoczi has updated the pull request incrementally with two
>> additional commits since the last revision:
>>
>> - Further readability improvements.
>> - Added asserts for array sizes
>
> src/hotspot/cpu/x86/stubGenerator_x86_64_sha3.cpp line 342:
>
>> 340: // Performs two keccak() computations in parallel. The steps of the
>> 341: // two computations are executed interleaved.
>> 342: static address generate_double_keccak(StubGenerator *stubgen,
>> MacroAssembler *_masm) {
>
> This function seems OK. I didn't do as exact a line-by-line review as for the
> NTT intrinsics; I just put the new version into a diff next to the original
> function. It seems like a reasonably clean 'refactor' (hardcode the block size,
> add new input registers 10-14; that makes it really easy to spot vs. the
> original registers 0-4).
>
> I didn't realize before that the 'top 3 limbs' are wasted. I guess it doesn't
> matter; there are plenty of registers to spare, and it makes the entire
> algorithm cleaner and easier to follow.
>
> I did also stare at the algorithm with the 'what about AVX2?' question. It
> looks like this function would pretty much need to be rewritten :/
>
> Last two questions:
> - How much performance is gained from doubling this function up?
> - If that's worth it, what if the input were quadrupled instead? (I scanned
> the Java code; it looked like NR was already parametrized to 2.) It looks
> like there are almost enough registers here to go to 4 (I think 3 would need
> to be freed up somehow; alternatively, the upper 3 limbs are empty in all
> operations, so perhaps they could be used instead, at the expense of readability).
Well, the algorithm (keccak()) does the same things on 5 array elements: it
works on essentially a 5x5 matrix, doing row and column operations, so putting
5 array entries in a vector register was the "natural" thing to do.
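For readers less familiar with the layout, here is a minimal scalar sketch (my own illustration, not code from the PR) of why 5 lanes per register is natural, using the theta step as the example:

```cpp
#include <cstdint>

// Scalar model of the Keccak state: a 5x5 matrix of 64-bit lanes. Each
// 5-lane row maps onto one AVX-512 zmm register, using 5 of its 8 64-bit
// limbs and leaving the top 3 idle (the "wasted limbs" mentioned above).
using KeccakState = uint64_t[5][5];

static uint64_t rotl64(uint64_t v, unsigned r) {
  return (v << r) | (v >> (64 - r));
}

// The theta step: column parities are combined into every row lane-wise,
// i.e. the same operation is applied to all 5 elements of a row at once,
// which is exactly what one vector instruction does when a row occupies
// a single register.
static void theta(KeccakState a) {
  uint64_t c[5], d[5];
  for (int x = 0; x < 5; x++)
    c[x] = a[0][x] ^ a[1][x] ^ a[2][x] ^ a[3][x] ^ a[4][x];
  for (int x = 0; x < 5; x++)
    d[x] = c[(x + 4) % 5] ^ rotl64(c[(x + 1) % 5], 1);
  for (int y = 0; y < 5; y++)
    for (int x = 0; x < 5; x++)
      a[y][x] ^= d[x];
}
```

Every loop body above applies one operation uniformly across the five x indices, which is why keeping one row per vector register makes the whole round function fall out of a handful of vector instructions.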
This function can only be used under very special circumstances, which occur
during the generation of the "A matrix" in ML-KEM and ML-DSA; the speed of that
matrix generation has almost doubled (I don't have exact numbers).
We are using 7 registers per state and 15 for the constants, so we have only 3
to spare. We could perhaps juggle the constants, keeping just the ones that
will be needed next in registers and reloading the rest "just in time", but
that might slow things down a bit: more load instructions executed, plus maybe
some load delay. On the other hand, more parallelism. I might try it out.
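As a back-of-the-envelope check of that budget (the 7-per-state and 15-constant figures are the ones stated above; the total of 32 is the AVX-512 zmm register count):

```cpp
// Rough register-budget arithmetic for the double-keccak stub. These
// constants come from the discussion above, not from inspecting the
// generated code.
constexpr int kZmmRegs      = 32; // zmm0..zmm31 with AVX-512
constexpr int kRegsPerState = 7;  // vector registers per Keccak state
constexpr int kConstantRegs = 15; // registers pinned to round constants

// Registers left over after holding `states` interleaved computations
// plus all the constants (negative means we'd have to spill/reload).
constexpr int spareRegs(int states) {
  return kZmmRegs - (states * kRegsPerState + kConstantRegs);
}

static_assert(spareRegs(2) == 3,   "double keccak: 3 registers to spare");
static_assert(spareRegs(4) == -11, "quad keccak: 11 registers short");
```

So going to 4 states with all constants resident is well out of reach; a 4-way variant would have to reload constants "just in time" (or pack state into the idle upper limbs), which is the trade-off weighed above.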
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/23860#discussion_r2024317665