On Thu, Mar 28, 2024 at 10:03:04PM +0000, Amonson, Paul D wrote:
>> * I think we need to verify there isn't a huge performance regression for
>> smaller arrays. IIUC those will still require an AVX512 instruction or
>> two as well as a function call, which might add some noticeable overhead.
>
> Not considering your changes, I had already tested small buffers. At less
> than 512 bytes there was no measurable regression (there was one extra
> condition check) and for 512+ bytes it moved from no regression to some
> gains between 512 and 4096 bytes. Assuming you introduced no extra
> function calls, it should be the same.

Cool. I think we should run the benchmarks again to be safe, though.

>> I forgot to mention that I also want to understand whether we can
>> actually assume availability of XGETBV when CPUID says we support
>> AVX512:
>
> You cannot assume as there are edge cases where AVX-512 was found on
> system one during compile but it's not actually available in a kernel on
> a second system at runtime despite the CPU actually having the hardware
> feature.

Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is. Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check
and/or CPUID check. It looks like MSVC has had support for the _xgetbv
intrinsic for quite a while, but I'm still researching the other cases.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
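
For reference, here is a minimal sketch of the kind of runtime check being
discussed. It is not code from the patch. It checks CPUID for OSXSAVE before
executing XGETBV at all, then confirms that the OS has enabled the AVX-512
register state in XCR0, and only then looks at the AVX512F feature bit. All
helper names (sketch_cpuid, sketch_xgetbv0, xcr0_has_avx512_state,
avx512f_usable) are hypothetical. It assumes that GCC and Clang provide
<cpuid.h> with __get_cpuid_count and accept the xgetbv mnemonic in inline
assembly, and that MSVC provides __cpuidex and _xgetbv. Whether those
assumptions hold on every supported toolchain is exactly the open question.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64) || defined(_M_AMD64)
#define SKETCH_X86_64 1
#if defined(_MSC_VER)
#include <intrin.h>     /* __cpuid, __cpuidex */
#include <immintrin.h>  /* _xgetbv */
#else
#include <cpuid.h>      /* __get_cpuid_count (assumed available: GCC, Clang) */
#endif
#endif

/*
 * XCR0 bits the OS must enable before AVX-512 registers are usable:
 * bit 1 = SSE (XMM), bit 2 = AVX (YMM), bit 5 = opmask,
 * bit 6 = ZMM_Hi256, bit 7 = Hi16_ZMM.
 */
#define XCR0_AVX512_STATE UINT64_C(0xE6)

/* Pure helper: does this XCR0 value enable all AVX-512 state components? */
static bool
xcr0_has_avx512_state(uint64_t xcr0)
{
    return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE;
}

#ifdef SKETCH_X86_64
/* Hypothetical wrapper: fills regs with EAX, EBX, ECX, EDX for a leaf. */
static bool
sketch_cpuid(uint32_t leaf, uint32_t subleaf, uint32_t regs[4])
{
#if defined(_MSC_VER)
    int info[4];

    __cpuid(info, 0);               /* EAX = highest supported basic leaf */
    if ((uint32_t) info[0] < leaf)
        return false;
    __cpuidex(info, (int) leaf, (int) subleaf);
    for (int i = 0; i < 4; i++)
        regs[i] = (uint32_t) info[i];
    return true;
#else
    /* Returns 0 when the requested leaf is not supported. */
    return __get_cpuid_count(leaf, subleaf,
                             &regs[0], &regs[1], &regs[2], &regs[3]) != 0;
#endif
}

/* Hypothetical wrapper: reads XCR0.  Call only when OSXSAVE is set. */
static uint64_t
sketch_xgetbv0(void)
{
#if defined(_MSC_VER)
    return _xgetbv(0);
#else
    uint32_t lo, hi;

    /* Inline asm avoids needing -mxsave for the _xgetbv intrinsic. */
    __asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t) hi << 32) | lo;
#endif
}
#endif                          /* SKETCH_X86_64 */

/* True only if the CPU has AVX512F and the OS saves the ZMM/opmask state. */
static bool
avx512f_usable(void)
{
#ifdef SKETCH_X86_64
    uint32_t r[4];

    /* CPUID leaf 1, ECX bit 27 (OSXSAVE): XGETBV is enabled by the OS. */
    if (!sketch_cpuid(1, 0, r) || (r[2] & (UINT32_C(1) << 27)) == 0)
        return false;
    if (!xcr0_has_avx512_state(sketch_xgetbv0()))
        return false;
    /* CPUID leaf 7, subleaf 0, EBX bit 16: AVX512F. */
    if (!sketch_cpuid(7, 0, r))
        return false;
    return (r[1] & (UINT32_C(1) << 16)) != 0;
#else
    return false;               /* not x86_64: nothing to detect */
#endif
}
```

A caller would test avx512f_usable() once and cache the result before
choosing between the AVX-512 and fallback code paths. Gating XGETBV on the
OSXSAVE bit is what keeps the check from faulting on systems where the
instruction is not enabled, so the remaining portability risk is at compile
time (the assembler or compiler recognizing the instruction), which a
configure check could cover.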