[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

peter at cordes dot ca via Gcc-bugs Mon, 28 Nov 2022 11:03:14 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688


--- Comment #25 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Alexander Monakov from comment #24)
> 
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> memory (and then CPU vendor cannot guarantee that 128b access won't tear
> because it might depend on downstream devices).


Even atomic_int doesn't work properly if you deref a pointer to WC memory.  WC
doesn't have the same ordering guarantees, so it would break acquire/release
semantics.  So we already don't support WC for this.

We do at least de-facto support atomics on UC memory because the ordering
guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned
load/store is guaranteed even for non-cacheable memory types since P5 Pentium
(and on AMD).  (And lock cmpxchg16b is always atomic even on UC memory.)
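
To illustrate the lock cmpxchg16b point, here is a minimal sketch of a 16-byte
atomic load built on that instruction with GNU C inline asm.  The type pair128
and the function load16_via_cmpxchg16b are hypothetical names for this example,
not anything from GCC or libatomic.  It assumes an x86-64 target, a 16-byte
aligned pointer, and a writable mapping (the instruction always performs a
locked read-modify-write, even when the compare fails).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two 64-bit halves of a 16-byte object. */
typedef struct { uint64_t lo, hi; } pair128;

/* Atomically read 16 bytes at p using lock cmpxchg16b.
 * Expected (RDX:RAX) and desired (RCX:RBX) are both zero:
 *  - if memory holds 0, it is rewritten with 0 and RDX:RAX stays 0;
 *  - otherwise the compare fails and memory is loaded into RDX:RAX.
 * Either way RDX:RAX ends up holding the current value. */
static pair128 load16_via_cmpxchg16b(void *p)
{
    assert(((uintptr_t)p & 15) == 0);   /* misaligned operand would fault */
    pair128 v = {0, 0};
#if defined(__x86_64__)
    __asm__ volatile("lock cmpxchg16b %0"
                     : "+m"(*(pair128 *)p), "+a"(v.lo), "+d"(v.hi)
                     : "b"((uint64_t)0), "c"((uint64_t)0)
                     : "cc", "memory");
#else
    /* Placeholder so the sketch compiles elsewhere: NOT atomic. */
    v = *(pair128 *)p;
#endif
    return v;
}
```

Because this is a locked RMW, it cannot be used on read-only pages and it
contends with other readers, which is why a plain 16-byte vector load is
attractive where the CPU vendor guarantees its atomicity.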

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores
would be atomic on UC memory.  So this change could break that very unwise
corner case on AMD, which only guarantees 16-byte atomicity for cacheable
loads/stores, and on Zhaoxin, which only guarantees it for WB memory.

But was anyone previously using 16-byte atomics on UC device memory?  Do we
actually care about supporting that?  I'd guess no and no, so it's just a
matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I *hope*
people weren't using __atomic_store_n on device memory.  The underlying
implementation was never guaranteed.
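
For anyone wanting to check what their compiler reports, a small sketch using
the __atomic_always_lock_free builtin follows.  The two wrapper function names
are made up for this example.  The size argument has to be a compile-time
constant, hence one wrapper per size.  No particular result for the 16-byte
case is assumed here, since it depends on the compiler, its version and the
target options.

```c
#include <stdbool.h>

/* Hypothetical wrappers around the GCC/Clang builtin.  The second
 * argument 0 means "assume typical alignment for an object of this size". */

/* Compile-time answer for int-sized atomics. */
static bool int_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(int), 0);
}

/* Compile-time answer for 16-byte atomics; compiler and target dependent. */
static bool sixteen_byte_always_lock_free(void)
{
    return __atomic_always_lock_free(16, 0);
}
```

The runtime query __atomic_is_lock_free(16, ptr) is the other half of the
picture; for sizes the compiler can't answer at compile time it typically
becomes a library call, so it may need -latomic at link time.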
