On Mon, Nov 30, 2020 at 12:37 PM Niels Möller <ni...@lysator.liu.se> wrote:

> Niels Möller <ni...@lysator.liu.se> writes:
> 1. Does the save and restore of registers look correct? I checked the
>    abi spec, and the intention is to use the part of the 288 byte
>    "Protected zone" below the stack pointer.


Certain requirements apply when the stack pointer register is modified.
Here are the relevant rules from
https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html

- The stack pointer shall maintain quadword alignment.
- The stack pointer shall point to the first word of the lowest allocated
stack frame, the "back chain" word. The stack shall grow downward, that is,
toward lower addresses. The first word of the stack frame shall always
point to the previously allocated stack frame (toward higher addresses),
except for the first stack frame, which shall have a back chain of 0 (NULL).
- The stack pointer shall be decremented and the back chain updated
atomically using one of the "Store Double Word with Update" instructions,
so that the stack pointer always points to the beginning of a linked list
of stack frames.

So to modify r1, you have to allocate an additional 8 bytes on the stack to
store the old value of r1. The register store sequence will look like:

        li      r6, 0x10        C set up some...
        li      r7, 0x20        C ...useful...
        li      r8, 0x30        C ...offsets
        li      r9, 0x40        C ...offsets

        stdu    r1, -0x50(r1)   C Save callee-save registers
        stvx    v20, r6, r1
        stvx    v21, r7, r1
        stvx    v22, r8, r1
        stvx    v23, r9, r1

Note that the allocated size (8 bytes for the back chain plus 4 x 16 bytes
for the vector registers, 72 bytes in total) is rounded up to a multiple of
16 bytes (0x50), so that quadword stack alignment is maintained.
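The rounding can be checked with a small C sketch. The helper names
align_up and frame_size are hypothetical, introduced only for illustration;
the layout assumed is the one used above (back chain at offset 0, first
vector slot at 0x10).

```c
#include <stddef.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: size of a frame that holds the 8-byte back chain
   followed by nvregs 16-byte vector register save slots. */
static size_t frame_size(size_t nvregs)
{
    /* The back chain occupies 8 bytes; the first vector slot must be
       16-byte aligned for stvx/lvx, so it starts at offset 0x10. */
    size_t first_slot = align_up(8, 16);

    /* The total is rounded up so the stack pointer stays quadword aligned. */
    return align_up(first_slot + 16 * nvregs, 16);
}
```

With four vector registers, frame_size(4) gives 0x50, matching the
displacement used in the stdu instruction.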

and the register restore sequence will look like:

        lvx     v20, r6, r1
        lvx     v21, r7, r1
        lvx     v22, r8, r1
        lvx     v23, r9, r1
        addi    r1, r1, 0x50
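As a rough illustration of what the stdu/addi pair does to the back chain,
here is a toy C model. The names stack_mem, model_stdu_sp and back_chain are
made up for this sketch, and stack addresses are simply byte offsets into a
small array; it is not how the hardware is implemented.

```c
#include <stdint.h>
#include <string.h>

/* Toy stack memory; "addresses" are byte offsets into this array. */
static uint8_t stack_mem[256];

/* Model of "stdu r1, d(r1)": store the old stack pointer at sp + d,
   then update the stack pointer to sp + d, as one step. */
static uint64_t model_stdu_sp(uint64_t sp, int64_t d)
{
    uint64_t new_sp = sp + (uint64_t)d;
    memcpy(&stack_mem[new_sp], &sp, sizeof sp);   /* back chain word */
    return new_sp;
}

/* Read the back chain word that the stack pointer points to. */
static uint64_t back_chain(uint64_t sp)
{
    uint64_t prev;
    memcpy(&prev, &stack_mem[sp], sizeof prev);
    return prev;
}
```

Starting from sp = 0xA0, model_stdu_sp(sp, -0x50) returns 0x50, the word at
the new stack pointer holds 0xA0, and adding 0x50 again (the addi in the
restore sequence) gets back to the original value.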

BTW, since no function is called while the stack pointer register is
modified, I think it's fine not to follow these rules and to keep the store
and restore sequences as they are, without any modification.

> 2. The use of the QR macro means that there's no careful
>    instruction-level interleaving of independent instructions. Do you
>    think it's beneficial to do manual interleaving (like in
>    chacha_2core.asm), or can it be left to the out-of-order execution
>    logic run sort it out and execute instructions in parallel?
>

You'll get a performance benefit from interleaving the independent
instructions in this case; I'd estimate a performance increase of around
20%-30%.


> 3. Is there any clever way to construct the vector {0,1,2,3} in a
>    register, instead of loading it from memory?
>

I can think of this method:

        li      r10, 0
        lvsl    T0, 0, r10      C 0x000102030405060708090A0B0C0D0E0F
        vupkhsb T0, T0          C 0x00000001000200030004000500060007
        vupkhsh T0, T0          C 0x00000000000000010000000200000003
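To double-check the byte patterns in the comments, the three instructions
can be modeled in plain C. This is only a sketch: the model_* functions and
build_counter_vector are names invented here, and arrays are indexed in the
big-endian element order used by the register contents shown above.

```c
#include <stdint.h>

/* lvsl: byte i of the result is (ea & 0xf) + i. */
static void model_lvsl(uint8_t out[16], uint64_t ea)
{
    unsigned sh = (unsigned)(ea & 0xf);
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(sh + i);
}

/* vupkhsb: sign-extend the 8 high-order bytes to halfwords. */
static void model_vupkhsb(int16_t out[8], const uint8_t in[16])
{
    for (int i = 0; i < 8; i++)
        out[i] = (int8_t)in[i];
}

/* vupkhsh: sign-extend the 4 high-order halfwords to words. */
static void model_vupkhsh(int32_t out[4], const int16_t in[8])
{
    for (int i = 0; i < 4; i++)
        out[i] = in[i];
}

/* The full sequence: li r10,0; lvsl; vupkhsb; vupkhsh. */
static void build_counter_vector(int32_t out[4])
{
    uint8_t bytes[16];
    int16_t halves[8];

    model_lvsl(bytes, 0);
    model_vupkhsb(halves, bytes);
    model_vupkhsh(out, halves);
}
```

Calling build_counter_vector fills the output with the words 0, 1, 2, 3.
Sign extension is harmless here because every byte value is below 0x80.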

regards,
Mamone