On Thu, Jun 5, 2008 at 8:15 AM, Jan Hubicka <[EMAIL PROTECTED]> wrote:
>>
>> 1. Extend the register save area to put upper 128bit at the end.
>>   Pros:
>>     Aligned access.
>>     Save stack space if 256bit registers are used.
>>   Cons:
>>     Split access. Requires more split access beyond 256bit.
>>
>> 2. Extend the register save area to put full 256bit YMMs at the end.
>>   The first DWORD after the register save area has the offset of
>>   the extended array for YMM registers. The next DWORD has the
>>   element size of the extended array. Unaligned access will be used.
>>   Pros:
>>     No split access.
>>     Easily extendable beyond 256bit.
>>     Limited unaligned access penalty if stack is aligned at 32byte.
>>   Cons:
>>     May require storing both the lower 128bit and full 256bit register
>>     content. We may avoid saving the lower 128bit if correct type
>>     is required when accessing variable argument list, similar to int
>>     vs. double.
>>     Waste 272 byte on stack when 256bit registers are used.
>>     Unaligned load and store.
>>
>> We should agree on one approach to ensure compatibility between
>> different compilers.
>
> This is something that definitely should be handled by an ABI update.
>
> We probably also need to somehow update the way to specify what to save
> in the varargs prologue. Otherwise, if you had a YMM-aware printf

Yes, but I believe that is compiler specific. Different compilers may
have different approaches for the varargs prologue, as long as they
follow the psABI.

> running on non-AVX hardware, we would end up with invalid instructions.

That is nothing new. The same applies to SSE on ia32. Basically, you
shouldn't call a YMM-aware printf on non-AVX hardware. You can have
/lib64/avx/libc.so.6 if necessary.

>
> At the moment, eax is required to specify number of XMM registers, we
> probably can extend it to have number of XMM registers in AL and YMM in
> AH.

ymm0 and xmm0 are the same register. xmm0 is the lower 128bit of ymm0.
I am not sure if we need to count XMM registers separately from YMM
registers.

>
> I personally don't have much preferences over 1. or 2.. 1. seems
> relatively easy to implement too, or is packaging two 128bit values to
> single 256bit difficult in va_arg expansion?
>

Access to a 256bit register as lower and upper 128bits needs 2
instructions. For store

    vmovaps %xmm7, -143(%rax)
    vextractf128 $1, %ymm7, -15(%rax)

For load

    vmovaps -143(%rax), %xmm7
    vinsertf128 $1, -15(%rax), %ymm7, %ymm7

If we go beyond 256bit, we need more instructions to access the full
register. For 512bit, it will be split into lower 128bit, middle 128bit
and upper 256bit. 1024bit will have 4 parts.

For #2, only one instruction will be needed for 256bit and beyond.

Thanks.

--
H.J.
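
To make the two layouts concrete, here is a rough C sketch of where each
approach would place a saved register. It only computes byte offsets and
touches no vector registers. The 176-byte save area (6 GPRs, then 8 XMMs)
comes from the existing x86-64 psABI. Everything else is an assumption made
for illustration: the function names, the struct ext_header type, packing the
upper halves immediately after the save area for #1, and measuring the #2
offset from the start of the save area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Existing x86-64 psABI register save area:
   6 GPRs (8 bytes each) followed by 8 XMMs (16 bytes each). */
enum {
    GPR_BYTES  = 6 * 8,
    XMM_BYTES  = 8 * 16,
    SAVE_BYTES = GPR_BYTES + XMM_BYTES   /* 176 */
};

/* Approach 1: the lower 128 bits stay in the existing XMM slot. */
static size_t opt1_lower_offset(unsigned reg)
{
    assert(reg < 8);
    return GPR_BYTES + (size_t)reg * 16;
}

/* Approach 1: the upper 128 bits go at the end. Assumption: they are
   packed directly after the save area, 16 bytes per register. */
static size_t opt1_upper_offset(unsigned reg)
{
    assert(reg < 8);
    return SAVE_BYTES + (size_t)reg * 16;
}

/* Approach 2: two DWORDs right after the save area (hypothetical type). */
struct ext_header {
    uint32_t ext_offset;  /* assumed: offset of extended array from save area start */
    uint32_t elem_size;   /* bytes per element, 32 for YMM */
};

/* Store the header in the two DWORDs following the save area. */
static void write_ext_header(unsigned char *save_area,
                             uint32_t ext_offset, uint32_t elem_size)
{
    struct ext_header h = { ext_offset, elem_size };
    memcpy(save_area + SAVE_BYTES, &h, sizeof h);
}

/* Approach 2: one slot per register, located through the header, so the
   same lookup keeps working when elem_size grows beyond 32. */
static size_t opt2_slot_offset(const unsigned char *save_area, unsigned reg)
{
    struct ext_header h;
    assert(reg < 8);
    memcpy(&h, save_area + SAVE_BYTES, sizeof h);
    return (size_t)h.ext_offset + (size_t)reg * h.elem_size;
}
```

With approach 1 the prologue and va_arg each need two offsets per register
(160 and 288 for register 7 in this sketch). With approach 2 they need one,
read through the header: with an illustrative ext_offset of 192 and elem_size
of 32, register 7 lands at offset 416.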