[Bug c++/110619] Dangling pointer returned from constexpr function converts in nullptr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #7 from Peter Cordes --- (In reply to Andrew Pinski from comment #2)
> > but it is not nullptr.
> Or is it just undefined so it could be considered a nullptr ...

Implementation-defined behaviour, according to answers on https://stackoverflow.com/questions/76843246/why-does-the-address-of-an-out-of-scope-variable-equal-zero-with-constexpr

https://eel.is/c++draft/basic.compound#def:value,invalid_pointer
https://eel.is/c++draft/basic.stc.general#4
> Indirection through an invalid pointer value and passing an invalid pointer
> value to a deallocation function have undefined behavior.
> **Any other use of an invalid pointer value has implementation-defined
> behavior.**

So this wasn't a bug, but the new behaviour is also allowed. This commit could be reverted or kept, depending on maintainability and/or quality-of-life for users of GCC. Having GCC pick the other implementation-defined behaviour from clang (which kept GCC's previous behaviour) is maybe a *good* thing: it helps programmers catch dependence on an invalid pointer being either null or non-null if they try their code with both compilers.
[Bug middle-end/108441] [12 Regression] Maybe missed optimization: loading an 16-bit integer value from .rodata instead of an immediate store
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441 --- Comment #4 from Peter Cordes --- This is already fixed in current trunk; sorry I forgot to check that before recommending to report this store-coalescing bug.

# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3 (tune=generic) and GCC11
store:
        movl    $16, %eax
        movw    %ax, ldap(%rip)
        ret

In case anyone's wondering why GCC doesn't do movw $16, foo(%rip): it's avoiding LCP stalls on Intel P6-family CPUs from the 16-bit immediate. For MOV specifically, that only happens on P6-family (Nehalem and earlier), not Sandybridge-family, so it's getting close to time to drop it from -mtune=generic. (-mtune=bdver* or znver* don't do it, so there is a tuning setting controlling it.)

GCC *only* seems to know about MOV, so ironically with -march=skylake for example, we avoid a non-existent LCP stall for mov to memory, but GCC compiles x += 1234 into code that will LCP stall: addw $1234, x(%rip). -march=alderlake disables this tuning workaround, using movw $imm, mem. (The Silvermont-family E-cores in Alder Lake don't have this problem either, so that's correct. Agner Fog's guide didn't mention any changes in LCP stalls for Alder Lake.)

Avoiding LCP stalls is somewhat less important on CPUs with a uop cache, since the stall only happens on legacy decode. Although various things can cause code to run only from legacy decode even inside a loop, such as Skylake's JCC-erratum microcode mitigation if users don't assemble with the option to have GAS work around it, which GCC doesn't pass by default with -march=skylake.

If there isn't already a bug open about tuning choices mismatching hardware, I can repost this as a new bug if you'd like. Related: https://stackoverflow.com/questions/75154687/is-this-a-missed-optimization-in-gcc-loading-an-16-bit-integer-value-from-roda and https://stackoverflow.com/questions/70719114/why-does-the-short-16-bit-variable-mov-a-value-to-a-register-and-store-that-u
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #27 from Peter Cordes --- (In reply to Alexander Monakov from comment #26) > Sure, the right course of action seems to be to simply document that atomic > types and built-ins are meant to be used on "common" (writeback) memory Agreed. Where in the manual should this go? Maybe a new subsection of the chapter about __atomic builtins where we document per-ISA requirements for them to actually work? e.g. x86 memory-type stuff, and that ARM assumes all cores are in the same inner-shareable cache-coherency domain, thus barriers are dmb ish not dmb sy and so on. I guess we might want to avoid documenting the actual asm implementation strategies in the main manual, because that would imply it's supported to make assumptions based on that. Putting it near the __atomic docs might make it easier for readers to notice that the list of requirements exists, vs. scattering them into different pages for different ISAs. And we don't currently have any section in the manual about per-ISA quirks or requirements, just about command-line options, builtins, and attributes that are per-ISA, so there's no existing page where this could get tacked on. This would also be a place where we can document that __atomic ops are address-free when they're lock-free, and thus usable on shared memory between processes. ISO C++ says that *should* be the case for std::atomic, but doesn't standardize the existence of multiple processes. To avoid undue worry, documentation about this should probably start by saying that normal programs (running under mainstream OSes) don't have to worry about it or do anything special.
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #25 from Peter Cordes --- (In reply to Alexander Monakov from comment #24) > > I think it's possible to get UC/WC mappings via a graphics/compute API (e.g. > OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device > memory (and then CPU vendor cannot guarantee that 128b access won't tear > because it might depend on downstream devices). Even atomic_int doesn't work properly if you deref a pointer to WC memory. WC doesn't have the same ordering guarantees, so it would break acquire/release semantics. So we already don't support WC for this. We do at least de-facto support atomics on UC memory because the ordering guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned load/store is guaranteed even for non-cacheable memory types since P5 Pentium (and on AMD). (And lock cmpxchg16b is always atomic even on UC memory.) But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores would be atomic on UC memory. So this change could break that very unwise corner-case on AMD which only guarantees that for cacheable loads/stores, and Zhaoxin only for WB. But was anyone previously using 16-byte atomics on UC device memory? Do we actually care about supporting that? I'd guess no and no, so it's just a matter of documenting that somewhere. Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I *hope* people weren't using __atomic_store_n on device memory. The underlying implementation was never guaranteed.
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #23 from Peter Cordes --- (In reply to Xi Ruoyao from comment #20) > "On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed > memory is Write Back, but it's not guaranteed for other memory types." VMOVDQA is still fine, I think WB is the only memory type that's relevant for atomics, at least on the mainstream OSes we compile for. It's not normally possible for user-space to allocate memory of other types. Kernels normally use WB memory for their shared data, too. You're correct that WT and WP are the other two cacheable memory types, and Zhaoxin's statement doesn't explicitly guarantee atomicity for those, unlike Intel and AMD. But at least on Linux, I don't think there's a way for user-space to even ask for a page of WT or WP memory (or UC or WC). Only WB memory is easily available without hacking the kernel. As far as I know, this is true on other existing OSes. WT = write-through: read caching, no write-allocate. Write hits update the line and memory. WP = write-protect: read caching, no write-allocate. Writes go around the cache, evicting even on hit. (https://stackoverflow.com/questions/65953033/whats-the-usecase-of-write-protected-pat-memory-type quotes the Intel definitions.) Until recently, the main work on formalizing the x86 TSO memory model had only looked at WB memory. A 2022 paper looked at WT, UC, and WC memory types: https://dl.acm.org/doi/pdf/10.1145/3498683 - Extending Intel-x86 Consistency and Persistency Formalising the Semantics of Intel-x86 Memory Types and Non-temporal Stores (The intro part describing memory types is quite readable, in plain English not full of formal symbols. They only mention WP once, but tested some litmus tests with readers and writers using any combination of the other memory types.) 
Some commenters on my answer on when WT is ever used or useful confirmed that mainstream OSes don't give easy access to it. https://stackoverflow.com/questions/61129142/when-use-write-through-cache-policy-for-pages/61130838#61130838 * Linux has never merged a patch to let user-space allocate WT pages. * The Windows kernel reportedly doesn't have a mechanism to keep track of pages that should be WT or WP, so you won't find any. I don't know about *BSD making it plausible for user-space to point an _Atomic int * at a page of WT or WP memory. I'd guess not. I don't know if there's anywhere we can document that _Atomic objects need to be in memory that's allocated in a "normal" way. Probably hard to word without accidentally disallowing something that's fine.
[Bug tree-optimization/106138] Inefficient code generation: logical AND of disjoint booleans from equal and bitwise AND not optimized to constant false
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3 from Peter Cordes --- Ideally, bitwise & of booleans should also be handled, not just &&. A testcase (https://godbolt.org/z/qvosv8q7c) makes it easy to check both.

//#define LOGIC_AND
_Bool f2(char x) {
    _Bool b1 = x == 2;
    _Bool b2 = x & 1;
#ifdef LOGIC_AND
    return b1 && b2;
#else
    return b1 & b2;
#endif
}

(Clang optimized it to return false for the && version, but not bitwise. GCC currently doesn't optimize either way.) This was originally posted on Stack Overflow (https://stackoverflow.com/q/72802469/224132), BTW.
[Bug target/105929] New: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929 Bug ID: 105929 Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: arm64-*-*

void foo(unsigned long *p) { *p = 0xdeadbeefdeadbeef; }

// compiles nicely: https://godbolt.org/z/8zf8ns14K
        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But even with -Os -march=armv8.4-a the following doesn't:

void foo_atomic(_Atomic unsigned long *p) {
    __atomic_store_n(p, 0xdeadbeefdeadbeef, __ATOMIC_RELAXED);
}

        mov     x1, 48879
        movk    x1, 0xdead, lsl 16
        movk    x1, 0xbeef, lsl 32
        movk    x1, 0xdead, lsl 48
        stlr    x1, [x0]
        ret

ARMv8.4-a and later guarantees atomicity for aligned ldp/stp, according to ARM's architecture reference manual (ARM DDI 0487H.a - ID020222), so we could use the same asm as the non-atomic version.

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access
> fewer than 16 bytes are single-copy atomic when all of the following
> conditions are true:
> • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory

(FEAT_LSE2 is the same CPU feature that gives 128-bit atomicity for aligned ldp/stp x,x,mem)

Prior to that, apparently it wasn't guaranteed that stp of 32-bit halves merged into a single 64-bit store. So without -march=armv8.4-a it wasn't a missed optimization to construct the constant in a single register for _Atomic or volatile. But with ARMv8.4, we should use MOV/MOVK + STP.

Since there doesn't seem to be a release-store version of STP, 64-bit release and seq_cst stores should still generate the full constant in a register, instead of using STP + barriers.
(Without ARMv8.4-a, or with a memory-order other than relaxed, see PR105928 for generating 64-bit constants in 3 instructions instead of 4, at least for -Os, with add x0, x0, x0, lsl 32)
[Bug target/105928] New: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928 Bug ID: 105928 Summary: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least) Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: arm64-*-*

void foo(unsigned long *p) { *p = 0xdeadbeefdeadbeef; }

cleverly compiles to https://godbolt.org/z/b3oqao5Kz

        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But producing the value in a register uses more than 3 instructions:

unsigned long constant() { return 0xdeadbeefdeadbeef; }

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        movk    x0, 0xbeef, lsl 32
        movk    x0, 0xdead, lsl 48
        ret

At least with -Os, and maybe at -O2 or -O3 if it's efficient, we could be doing a shifted ADD or ORR to broadcast a zero-extended 32-bit value to 64-bit:

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        add     x0, x0, x0, lsl 32

Some CPUs may fuse sequences of movk, and shifted operands for ALU ops may take extra time on some CPUs, so this might not actually be optimal for performance, but it is smaller for -Os and -Oz. We should also be using that trick for stores to _Atomic or volatile long*, where we currently do MOV + 3x MOVK, then an STR, with ARMv8.4-a which guarantees atomicity.

---

ARMv8.4-a and later guarantees atomicity for ldp/stp within an aligned 16-byte chunk, so we should use MOV/MOVK / STP there even for volatile or __ATOMIC_RELAXED. But presumably that's a different part of GCC's internals, so I'll report that separately.
[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904 Bug ID: 105904 Summary: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<

// using the libstdc++ header
unsigned roundup(unsigned x){
    return std::bit_ceil(x);
}

https://godbolt.org/z/Px1fvWaex GCC's version is somewhat clunky, including MOV r0, #1 in either "side":

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1        @@ here
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1        @@ here
        bx      lr

Even without spotting the other optimizations that clang finds, we can combine to a single unconditional MOV r0, #1. But only if we avoid setting flags, so it requires a 4-byte encoding, not MOVS. Still, it's one fewer instruction to execute. This is not totally trivial: it requires seeing that we can move it across the conditional LSL. So it's really a matter of folding the 1s between 1<
[Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/aoG55T5Yq gcc -O3 -m32 has the same problem with unsigned long long total and unsigned i. Pretty much identical instruction sequences in the loop for all 3 versions, doing add/adc to increment i, for example. (Plus a bit of spilling). fact_gcc_handhold still compiles without the unnecessary widening. Perhaps should retitle to widen to a "2-register type". IDK how easily this occurs in real-world loops with 64 and 32-bit integers on 32-bit machines, but that's probably more of a concern for wasting more clock cycles worldwide.
[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596 Bug ID: 105596 Summary: Loop counter widened to 128-bit unnecessarily Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by widening i and doing a full 128x128 => 128-bit multiply, and having to do a 128-bit increment and compare. uint64_t i to make it a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}   // 0! = 0 isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3 (gcc trunk 13.0.0 20220508 (experimental) and clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10       # total = n zext into R11:R10
        cmpl    $2, %edi
        jbe     .L7             # if n<=2 return r11:r10
        movl    $2, %esi        # i = 2 in RDI:RSI
        xorl    %edi, %edi
.L9:                            # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8        # copy original n to destroy later
        imulq   %r10, %rdx      # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11      # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11      # last partial product
        addq    $1, %rsi        # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8       # r8 = n^i
        movq    %rdi, %rcx      # useless copy, we're already destroying r8
        orq     %r8, %rcx       # hi(i^n) | lo(i^n)
        jne     .L9             # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, it's not even doing it very efficiently, with multiple unnecessary mov instructions. This doesn't seem to be x86-64 specific. It also compiles similarly for AArch64 and MIPS64. For some ISAs, I'm not sure if potentially-infinite loops are making a difference, e.g. PowerPC is hard for me to read.
RV64 has three multiply instructions in both versions. I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.

This anti-optimization goes back to GCC4.6. With GCC4.5 and earlier, the above C compiles to a tight loop with the expected mul reg + imul reg,reg and 1 register loop counter: https://godbolt.org/z/6KheaqTx4 (using __uint128_t, since unsigned __int128 wasn't supported on GCC4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly xor-zeroed register. It's still just incrementing and comparing a 32-bit loop counter, but widening it for a 128x128-bit multiply recipe. GCC4.4 optimizes away the parts that are useless for the high 64 bits of (u128)i being zero.

- A different version compiles efficiently with GCC6 and earlier, only becoming slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;   // 0! = 0 isn't mathematically correct
}

- When the loop condition is possibly always-true, GCC can't prove the loop is non-infinite, and as usual can't widen the loop counter. In this case, that's a good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;   // loop does do final n
    for (unsigned i=2 ; i <= n ; i++)   // potentially infinite loop defeats this pessimization
        total *= i;
    return total;   // fun fact: 0! = 1 is mathematically correct
}

fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx        # i = 2 in ECX
        movl    $1, %eax        # total = 1 in RDX:RAX
        xorl    %edx, %edx
.L3:                            # do{
        movl    %ecx, %esi      # copy i instead of just incrementing it later :/
        movq    %rdx, %r8       # save high half of total
        addl    $1, %ecx        # i++
        imulq   %rsi, %r8       # lo x hi cross product
        mulq    %rsi            # lo x lo widening
        addq    %r8, %rdx       # 128x64-bit multiply
        cmpl    %ecx, %edi
        jnb     .L3             # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for mulq anyway.
[Bug target/65146] alignment of _Atomic structure member is not correct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 --- Comment #25 from Peter Cordes --- (In reply to CVS Commits from comment #24) > The master branch has been updated by Jakub Jelinek : > > https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6 > > commit r11-2909-g04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6 > Author: Jakub Jelinek > Date: Thu Aug 27 18:44:40 2020 +0200 > > ia32: Fix alignment of _Atomic fields [PR65146] > > For _Atomic fields, lowering the alignment of long long or double etc. > fields on ia32 is undesirable, because then one really can't perform > atomic > operations on those using cmpxchg8b. Just for the record, the description of this bugfix incorrectly mentioned cmpxchg8b being a problem. lock cmpxchg8b is *always* atomic, even if that means the CPU has to take a bus lock (disastrously expensive affecting all cores system-wide) instead of just delaying MESI response for one line exclusively owned in this core's private cache (aka cache lock). The correctness problem is __atomic_load_n / __atomic_store_n compiling to actual 8-byte pure loads / pure stores using SSE2 movq, SSE1 movlps, or x87 fild/fistp (bouncing through the stack), such as movq %xmm0, (%eax) That's where correctness depends on Intel and AMD's atomicity guarantees which are conditional on alignment. (And if AVX is supported, same deal for 16-byte load/store. Although we can and should use movaps for that, which bakes alignment checking into the instruction. Intel did recently document that CPUs with AVX guarantee atomicity of 16-byte aligned loads/stores, retroactive to all CPUs with AVX. It's about time, but yay.) > Not sure about iamcu_alignment change, I know next to nothing about IA > MCU, > but unless it doesn't have cmpxchg8b instruction, it would surprise me > if we > don't want to do it as well. I had to google iamcu. Apparently it's Pentium-like, but only has soft-FP (so I assume no MMX or SSE as well as no x87). 
If that leaves it no way to do 8-byte load/store except (lock) cmpxchg8b, that may mean there's no need for alignment, unless cache-line-split lock is still a performance issue. If it's guaranteed unicore as well, we can even omit the lock prefix and cmpxchg8b will still be an atomic RMW (or load or store) wrt. interrupts. (And being unicore would likely mean much less system-wide overhead for a split lock.)
[Bug target/82261] x86: missing peephole for SHLD / SHRD
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261 --- Comment #4 from Peter Cordes --- GCC will emit SHLD / SHRD as part of shifting an integer that's two registers wide. Hironori Bono proposed the following functions as a workaround for this missed optimization (https://stackoverflow.com/a/71805063/224132):

#include <stdint.h>

#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low)
                     << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low)
                    >> (count & 63));
}
#endif

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low)
                     << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

---

The uint64_t functions (using __int128) compile cleanly in 64-bit mode (https://godbolt.org/z/1j94Gcb4o) using 64-bit operand-size shld/shrd, but the uint32_t functions compile to a total mess in 32-bit mode (GCC11.2 -O3 -m32 -mregparm=3) before eventually using shld, including a totally insane or dh, 0. GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm it's a slightly different mess.

Ironically, the uint32_t functions compile to quite a few instructions in 64-bit mode, actually doing the operations as written with shifts and ORs, and having to manually mask the shift count to &31 because it uses a 64-bit operand-size shift which masks with &63. 32-bit operand-size SHLD would be a win here, at least for -mtune=intel or a specific Intel uarch.

I haven't looked at whether they still compile ok after inlining into surrounding code, or whether operations would tend to combine with other things in preference to becoming an SHLD.
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066 --- Comment #5 from Peter Cordes ---

> pextrw requires sse4.1 for mem operands.

You're right! I didn't double-check the asm manual for PEXTRW when writing up the initial report, and had never realized that PINSRW wasn't symmetric with it. I was really surprised to see that in https://www.felixcloutier.com/x86/pextrw

So we do need to care about tuning for _mm_storeu_si16(p, v) without SSE4.1 (without the option of PEXTRW to memory). PEXTRW to an integer register is obviously bad; we should be doing

        movd    %xmm0, %eax
        mov     %ax, (%rdi)

instead of an inefficient pextrw $0, %xmm0, %eax ; movw-store.

Reported as PR105079, since the cause of the load missed-opt was GCC thinking the instruction wasn't available, rather than a wrong tuning choice like this is.
[Bug target/105079] New: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079 Bug ID: 105079 Summary: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1) Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

With PR105066 fixed, we do _mm_loadu_si16 with pinsrw from memory, because that's available with just SSE2. (And the cause wasn't tuning choices, it was a typo in what insns GCC thought were available.) Related: PR105072 re: folding such 16-bit loads into memory source operands for PMOVZX/SXBQ.

But the famously non-orthogonal SSE2 only includes pextrw $imm, %xmm, reg. Not reg/mem until SSE4.1 (with a longer opcode for no apparent reason, instead of just allowing mem addressing modes for the existing one. But same mnemonic so the assembler takes care of it. https://www.felixcloutier.com/x86/pextrw)

So we do need to care about tuning for _mm_storeu_si16(p, v) without the option of PEXTRW to memory. Currently we do this, which is obviously bad:

        pextrw  $0, %xmm0, %eax    # 2 uops
        movw    %ax, (%rdi)

we should be doing this:

        movd    %xmm0, %eax        # 1 uop
        mov     %ax, (%rdi)

https://godbolt.org/z/Ee3Ez174M

This is especially true if we don't need the integer value zero-extended into EAX. If we *did* also want the value zero-extended in an integer register, the extra uop in PEXTRW (in addition to the port 0 uop like MOVD) is a port-5 shuffle to extract an arbitrary 16-bit element, vs. a separate integer movzwl %cx, %eax which could run on any integer ALU port. (Including port 6 on HSW/SKL, which doesn't compete with any vector ALUs.) Mov-elimination for movzwl doesn't work on any current CPUs, only movzbl on Intel, and movl / movq on both Intel and AMD.
So currently there's no benefit to picking a different register like %ecx, instead of just using movzwl %ax, %eax.

When we both store and use the integer value:

int store16_and_use(void *p, __m128i v){
    _mm_storeu_si16( p, v );
    return 123 + *(unsigned short*)p;
}

https://godbolt.org/z/zq6TMo1oE current trunk GCC does this, which is not bad:

# -O3 with or without -msse4.1
        pextrw  $0, %xmm0, %eax
        movw    %ax, (%rdi)
        addl    $123, %eax
        ret

Clang13 uses MOVD + MOVZX like I was suggesting, even though it costs more code size. That's not necessarily better:

        movd    %xmm0, %eax
        movw    %ax, (%rdi)
        movzwl  %ax, %eax
        addl    $123, %eax
        retq

In this case it's not obviously wrong to use PEXTRW to an integer reg, but it's also fine to do it clang's way. So however that corner case shakes out in the process of fixing the main bug (using movd / movw without SSE4.1 when we don't reload) is fine.

If SSE4.1 is available, the no-reload case should probably use PEXTRW to memory instead of movd + movw. On some CPUs, the ALU op that's part of PEXTRW has more choice of ALU port than xmm->gp_int operations.
[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 --- Comment #17 from Peter Cordes --- (In reply to Andrew Pinski from comment #16)
> > According to Intel ( https://software.intel.com/sites/landingpage/IntrinsicsGuide ), there are no alignment requirements for _mm_load_sd, _mm_store_sd and _mm_loaddup_pd. For example, from _mm_load_sd:
> I disagree with saying there is no alignment requirement.
> The alignment requirement comes from the type of the argument (double const*). [...]
> Pointers themselves have an alignment requirement not just at the time of the load/store of them.

The intrinsics are badly designed to take pointer args with types other than void*, despite how they're expected to work. This is something we just need to accept. Starting with AVX-512, any new intrinsics take void*, but they haven't redefined the old ones.

_mm_loadu_si128 takes a __m128i*, same as _mm_load_si128. alignof(__m128i) == 16, so _mm_loadu_si128 must not simply dereference it; that's what _mm_load_si128 does. Intel's intrinsics API requires you to do unaligned 16-byte loads by creating a misaligned pointer and passing it to a loadu intrinsic. (This in turn requires that implementations supporting these intrinsics define the behaviour of creating such a pointer without deref; in ISO C that alone would be UB.)

This additional unaligned-pointer behaviour that implementations must define (at least for __m128i* and float/double*) is something I wrote about in an SO answer: https://stackoverflow.com/questions/52112605/is-reinterpret-casting-between-hardware-simd-vector-pointer-and-the-correspond

_mm_loadu_ps (like _mm_load_ps) takes a float*, but its entire purpose is to not require alignment. _mm512_loadu_ps takes a void* arg, so we can infer that earlier FP load intrinsics really are intended to work on data with any alignment, not just with the alignment of a float.
They're unlike a normal deref of a float* in aliasing rules too, although that's separate from creating a misaligned float* in code outside the intrinsic. A hypothetical low-performance portable emulation of intrinsics that ended up dereferencing that float* arg directly would be broken for strict-aliasing as well. The requirement to define the behaviour of having a misaligned float* can be blamed on Intel in 1995 (when SSE1 was new). Later extensions like AVX _mm256_loadu_ps just followed the same pattern of taking float* until they finally used void* for intrinsics introduced with or after AVX-512. The introduction of _mm_loadu_si32 and si16 is another step in the right direction, recognizing that _mm_cvtsi32_si128( *int_ptr ) isn't strict-aliasing safe. When those were new, it might have been around the time Intel started exploring replacing ICC with the LLVM-based ICX.

Anyway, the requirement to support misaligned vector and float/double pointers implies that _mm_load_ss/sd taking float*/double* doesn't imply alignof(float) or alignof(double).

> So either the intrinsics definition needs to be changed to be
> correct or GCC is correct.

That's an option; I'd love it if all the load/store intrinsics were changed across all compilers to take void*. It's ugly and a pain to type _mm_loadu_si128( (const __m128i*)ptr ), as well as creating cognitive dissonance because alignof(__m128i) == 16. I'm not sure if it could break anything to change the intrinsics to take void* even for older ones; possibly only C++ overload resolution for insane code that defines a _mm_loadu_ps( other_type * ) and relies on float* args picking the intrinsic.

If we changed just GCC, without getting buy-in from other compilers, taking void* would let people's code compile on GCC without casts from stuff like int*, when it wouldn't compile on other compilers.
That could be considered a bad thing if people test their code with GCC and are surprised to get reports of failure from people using compilers that follow Intel's documentation for the intrinsic function arg types. (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html). It would basically be a case of being overly permissive for the feature / API that people are trying to write portable code against.
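To make the required contract concrete, here's a minimal sketch (my own example, not from the bug) of the usage pattern Intel's API forces: cast to a possibly-misaligned __m128i* and hand it to the loadu intrinsic, which must tolerate any alignment.

```c
#include <emmintrin.h>  /* SSE2 */

/* Load 16 bytes from an arbitrarily-aligned address.  The cast creates a
 * (possibly) misaligned __m128i*; implementations must allow that as long
 * as it's only passed to a loadu intrinsic, never dereferenced directly. */
static __m128i load16_any_align(const void *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}
```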
[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #14 from Peter Cordes --- This bug is mis-categorized; it's not a sanitizer bug, it's a bug in the implementation of _mm_load_ss / sd. It currently derefs the `float const*` arg directly, which is not strict-aliasing or alignment safe. alignof(float) is 4, but Intel's documentation for this API still says "mem_addr does not need to be aligned on any particular boundary."

_mm_load_ss (float const *__P)
{
  return _mm_set_ss (*__P);
}

As discussed on PR99754, _mm_loadu_si32(const void*) *is* strict-aliasing and alignment safe. But it has only existed recently, and GCC11's implementation of it is buggy (shuffling the element to the wrong place). Before that, one safe way to do a 32-bit SIMD load is with _mm_load_ss and _mm_castps_si128. Or it was supposed to be safe, but isn't!!

Clang uses a packed may_alias struct containing a float to get a safe load done. Another way would be casting the pointer to

typedef float aliasing_unaligned_f32 __attribute__((aligned(1),may_alias));

This is similar to what we do with __m32_u for use in aliasing-safe integer load/store, except we define that as int with vector_size(4),may_alias,aligned(1) for some reason. Perhaps influenced by __m64_u which is a vector of 2 ints.

MSVC is like gcc -fno-strict-aliasing, so however it handles intrinsics, they're always aliasing-safe. I'm not 100% sure about what ICC formally guarantees, but in practice it doesn't move aliasing short* stores across a _mm_load_ss( (float*)pshort ) load. https://godbolt.org/z/6s76v71xz I didn't test with _mm_store_ss aliasing with short loads, only vice versa.

So GCC is the odd one out, out of the major 4 compilers that support Intel's intrinsics API. All our narrow load/store intrinsics should be strict-aliasing and alignment safe, regardless of what pointer type they accept.
Intel's early design of taking float* and double* instead of void* could be considered poor design. Their naming with just load/store instead of _mm_loadu_ss / storeu is also poor, clearly motivated by the asm differences rather than an actual intrinsic API difference. In x86 asm, loads/stores narrower than 16 bytes never require alignment (unless the AC bit is set in EFLAGS).

Assuming Intel modeled their intrinsics API after their asm, it makes sense to have load and loadu for ps and si128, but only load/store with an implied lack of alignment for intrinsics that wrap instructions like movlps / movhps / movss / movsd, and movd / movq, which do narrower memory accesses.

That of course *doesn't* make sense in C terms, where it's always potentially a problem to dereference misaligned pointers to narrow objects, even when compiling for x86-64: https://stackoverflow.com/questions/47510783/why-does-unaligned-access-to-mmaped-memory-sometimes-segfault-on-amd64 has an example and links some others, showing that compilers *don't* define the behaviour of deref of misaligned pointers.

I'm pretty certain that Intel always intended their narrow load/store intrinsics to not have any alignment requirements, like the asm instructions that wrap them, but weren't thinking in C terms when naming them. And they were sloppy in their choices of which ones to provide until decades later, since it seems they thought that _mm_cvtsi32_si128(*x) was sufficient for a movd load. (Only the case on a compiler without strict-aliasing or alignment, since the deref happens on the user's plain int*.)

Anyway, hopefully this refutes the argument that _mm_load_sd should be aligned because of the name, and clarifies what Intel might have been thinking when naming these.
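As a sketch of the typedef approach mentioned in comment #14 (the helper name is mine; GNU C attributes, so GCC/clang only), a load through an aligned(1),may_alias float makes an _mm_load_ss-style load safe on any bytes at any address:

```c
#include <xmmintrin.h>  /* SSE1 */

/* A float type with no alignment requirement that may alias anything. */
typedef float aliasing_unaligned_f32 __attribute__((aligned(1), may_alias));

/* Aliasing- and alignment-safe version of what _mm_load_ss is documented
 * to do: load 4 bytes into the low element, zero the rest. */
static __m128 load_ss_safe(const void *p)
{
    return _mm_set_ss(*(const aliasing_unaligned_f32 *)p);
}
```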
[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #6 from Peter Cordes --- Looks good to me, thanks for taking care of this quickly; hopefully we can get this backported to the GCC11 series to limit the damage for people using these newish intrinsics. I'd love to recommend them for general use, except for this GCC problem where some distros have already shipped GCC versions that compile without error but in a 100% broken way.

Portable ways to do narrow alignment/aliasing-safe SIMD loads were sorely lacking; there aren't good effective workarounds for this, especially for 16-bit loads. (I still don't know how to portably / safely write code that will compile to a memory-source PMOVZXBQ across all compilers; Intel's intrinsics API is rather lacking in some areas and relies on compilers folding loads into memory source operands.)

> So, isn't that a bug in the intrinsic guide instead?

Yes, __m128i _mm_loadu_si16 only really makes sense with SSE2 for PINSRW. Even movzx into an integer reg and then MOVD xmm, eax requires SSE2. With only SSE1 you'd have to movzx / dword store to stack / MOVSS reload.

SSE1 makes *some* sense for _mm_loadu_si32 since it can be implemented with a single MOVSS if MOVD isn't available. But we already have SSE1 __m128 _mm_load_ss(const float *) for that. Except GCC's implementation of _mm_load_ss isn't alignment and strict-aliasing safe; it derefs the actual float *__P as _mm_set_ss (*__P). Which I think is a bug, although I'm not clear what semantics Intel intended for that intrinsic.

Clang implements it as alignment/aliasing safe with a packed may_alias struct containing a float. MSVC always behaves like -fno-strict-aliasing, and I *think* ICC does, too. Perhaps best to follow the crowd and make all narrow load/store intrinsics alignment and aliasing safe, unless that causes code-gen regressions; users can _mm_set_ss( *ptr ) themselves if they want to tell the compiler that it's a normal C float object.
Was going to report this, but PR84508 is still open and already covers the relevant ss and sd intrinsics. That points out that Intel specifically documents it as not requiring alignment, not mentioning aliasing. Speaking of bouncing through a GP-integer reg, GCC unfortunately does that; it seems to incorrectly think PINSRW xmm, mem, 0 requires -msse4.1, unlike with a GP register source. Reported as PR105066 along with related missed optimizations about folding into a memory source operand for pmovzx/sx. But that's unrelated to correctness; this bug can be closed unless we're keeping it open until it's fixed in the GCC11 current stable series.
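For completeness, one workaround that is portable across compilers (my sketch, not from the bug) is to go through memcpy, which is always aliasing/alignment safe in ISO C and which mainstream compilers fold into a single 32-bit load feeding a movd:

```c
#include <emmintrin.h>  /* SSE2 */
#include <string.h>

/* Portable stand-in for _mm_loadu_si32: value in lane 0, upper lanes zero.
 * Safe on any bytes at any alignment. */
static __m128i loadu_si32_portable(const void *p)
{
    int tmp;
    memcpy(&tmp, p, sizeof(tmp));   /* compiles to a plain 32-bit load */
    return _mm_cvtsi32_si128(tmp);  /* movd */
}
```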
[Bug target/105066] New: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066 Bug ID: 105066 Summary: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

PR99754 fixed the wrong-code for _mm_loadu_si16, but the resulting asm is not efficient without -msse4.1 (as part of -march= most things). It seems GCC thinks that pinsrw / pextrw with a memory operand requires SSE4.1, like pinsr/extr for b/d/q operand-size. But actually 16-bit insr/extr only needs SSE2. (We're also not efficiently folding it into a memory source operand for PMOVZXBQ, see below.)

https://godbolt.org/z/dYchb6hec shows GCC trunk 12.0.1 20220321

__m128i load16(void *p){ return _mm_loadu_si16( p ); }

load16(void*):                 # no options, or -march=core2 or -mssse3
        movzwl  (%rdi), %eax
        pxor    %xmm1, %xmm1
        pinsrw  $0, %eax, %xmm1   # should be MOVD %eax, or PINSRW mem
        movdqa  %xmm1, %xmm0
        ret

vs.

load16(void*):                 # -msse4.1
        pxor    %xmm1, %xmm1
        pinsrw  $0, (%rdi), %xmm1
        movdqa  %xmm1, %xmm0
        ret

The second version is actually 100% fine with SSE2: https://www.felixcloutier.com/x86/pinsrw shows that there's only a single opcode for PINSRW xmm, r32/m16, imm8 and it requires SSE2; reg vs. mem source is just a matter of the modr/m byte.

The same problem exists for _mm_storeu_si16 not using pextrw to memory (which is also SSE2), instead bouncing through EAX. (Insanely still PEXTRW instead of MOVD.)

There is a choice of strategy here, but pinsrw/pextrw between eax and xmm0 is clearly sub-optimal everywhere. Once we factor out the dumb register allocation that wastes a movdqa, the interesting options are:

        movzwl  (%rdi), %eax   # 1 uop on everything
        movd    %eax, %xmm0    # 1 uop on everything

vs.
        pxor    %xmm0, %xmm0      # 1 uop for the front-end, eliminated on Intel
        pinsrw  $0, (%rdi), %xmm0 # 2 uops (load + shuffle/merge)

Similarly for extract,

        pextrw  $0, %xmm0, (%rdi) # 2 uops on most

vs.

        movd    %xmm0, %eax       # 1 uop, only 1/clock even on Ice Lake
        movw    %ax, (%rdi)       # 1 uop

On Bulldozer-family, bouncing through an integer reg adds a lot of latency vs. loading straight into the SIMD unit. (2 integer cores share a SIMD/FP unit, so movd between XMM and GP-integer is higher latency than most.) So that would definitely favour pinsrw/pextrw with memory.

On Ice Lake, pextrw to mem is 2/clock throughput: the SIMD shuffle can run on p1/p5. But MOVD r,v is still p0 only, and MOVD v,r is still p5 only. So that also favours pinsrw/pextrw with memory, despite the extra front-end uop for pxor-zeroing the destination on load.

Of course, if _mm_storeu_si16 is used on a temporary that's later reloaded, being able to optimize to a movd (and optionally movzx) is very good. Similar for _mm_loadu_si16 on a value we have in an integer reg; especially if we know it's already zero-extended to 32-bit for just a movd, we'd like to be able to do that.

---

It's also essential that these loads fold efficiently into memory source operands for PMOVZX; pmovzxbq is one of the major use-cases for a 16-bit load. That may be a separate bug, IDK.

https://godbolt.org/z/3a9T55n3q shows _mm_cvtepu8_epi32(_mm_loadu_si32(p)) does fold a 32-bit memory source operand nicely to pmovzxbd (%rdi), %xmm0 which can micro-fuse into a single uop on Intel CPUs (for the 128-bit destination version, not YMM), but disaster with 16-bit loads:

__m128i pmovzxbq(void *p){ return _mm_cvtepu8_epi64(_mm_loadu_si16(p)); }

pmovzxbq(void*):                   # -O3 -msse4.1 -mtune=haswell
        pxor     %xmm0, %xmm0      # 1 uop
        pinsrw   $0, (%rdi), %xmm0 # 2 uops, one for shuffle port
        pmovzxbq %xmm0, %xmm0      # 1 uop for the same shuffle port
        ret

(_mm_cvtepu8_epi64 requires SSE4.1 so there's no interaction with the -mno-sse4.1 implementation of the load.)
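In C terms, the load being discussed is just the following (my portable sketch of an _mm_loadu_si16 equivalent); whether the compiler picks movzx+movd or pxor+pinsrw for it is exactly the tuning question above:

```c
#include <emmintrin.h>  /* SSE2 */
#include <string.h>

/* 16-bit load zero-extended into lane 0 of a vector, safe on any alignment.
 * The upper 112 bits of the result are zero. */
static __m128i loadu_si16_portable(const void *p)
{
    unsigned short w;
    memcpy(&w, p, sizeof(w));     /* plain 16-bit load, always legal */
    return _mm_cvtsi32_si128(w);  /* zero-extended into the vector */
}
```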
[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #3 from Peter Cordes --- Wait a minute, the current implementation of _mm_loadu_si32 isn't strict-aliasing or alignment safe!!! That defeats the purpose for its existence as something to use instead of _mm_cvtsi32_si128( *(int*)p ); The current code contains a deref of a plain (int*). It should be using something like

typedef int unaligned_aliasing_int __attribute__((aligned(1),may_alias));
[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2 from Peter Cordes --- Can we get this patch applied soon? There aren't any other strict-aliasing-safe movd load intrinsics, but this one won't be portably usable while there are buggy GCC versions around. Until then, code should probably use something like inline __m128i movd(void *p){ return _mm_castps_si128(_mm_load_ss((const float*)p)); } (Which believe it or not is strict-aliasing safe even on integer data. At least it should be; last I tested it was across compilers, except maybe on ICC. Would have to double-check there.)
[Bug target/104773] New: compare with 1 not merged with subtract 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773 Bug ID: 104773 Summary: compare with 1 not merged with subtract 1 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*, arm-*-*

std::bit_ceil(x) involves if(x == 0 || x == 1) return 1; and 1u << (32-clz(x-1)). The compare of course compiles to an unsigned <= 1, which can be done with a sub instead of cmp, producing the value we need as an input for the leading-zero count. But GCC does *not* do this. (Neither does clang for x86-64.)

I trimmed down the libstdc++ code into something I could compile even when Godbolt doesn't have working headers for some ISAs: https://godbolt.org/z/3EE7W5bna

// cut down from libstdc++ for normal integer cases; compiles the same
template<typename _Tp>
constexpr _Tp bit_ceil(_Tp __x) noexcept
{
  constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
  if (__x == 0 || __x == 1)
    return 1;
  auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
  // using __promoted_type = decltype(__x << 1); ...  // removed check for x<
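For reference, the scalar logic boils down to this (my plain-C sketch of the trimmed snippet above, for 32-bit only); the point of the report is that the x <= 1 test and the x - 1 feeding clz could share one sub instruction:

```c
/* bit_ceil for 32-bit: smallest power of 2 >= x (x = 0 and x = 1 both give 1). */
static unsigned bit_ceil_u32(unsigned x)
{
    if (x == 0 || x == 1)       /* compiles to an unsigned x <= 1 compare;      */
        return 1;               /* sub $1 would set the same flags AND produce  */
                                /* the x-1 that the clz below needs             */
    return 1u << (32 - __builtin_clz(x - 1u));
}
```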
[Bug libstdc++/97759] Could std::has_single_bit be faster?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #14 from Peter Cordes --- Agreed with the idea of doing the popcount(x) == 1 peephole replacement in the compiler, not forcing the header to figure out whether we have efficient popcount or not.

If we have BMI1, there's BLSR (Reset Lowest Set Bit): it sets CF=1 if the input was zero, and ZF=1 if the output is zero. Unfortunately none of the standard jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first. (If Intel had designed it to produce ZF and CF inverted, it would be non-intuitive for all other uses but would have allowed blsr / jcc to implement if(has_single_bit(x)).)

CF=1, ZF=0   impossible: input was zero, output was non-zero
CF=1, ZF=1   input was zero
CF=0, ZF=0   input had multiple bits set
CF=0, ZF=1   input had a single bit set.

If we're going to branch on it anyway after inlining, a branchy strategy is probably good:

singlebit_bmi2_branchy:
        xor     %eax, %eax
        blsr    %edi, %edi    # ZF=1 means we cleared the last bit, or the input was zero
        jc      .Linput_zero  # input was zero, return 0 regardless of ZF
        setz    %al
.Linput_zero:
        ret

And when we want a boolean in a register, a combination of setz and cmovc can materialize one. With clever choice of registers, we can even avoid giving setcc a false dependency on a register that isn't already part of its dep chain:

singlebit_bmi2_cmov:
        blsr    %edi, %eax
        setz    %al          # false dep, but it's ready if FLAGS are ready because we wrote it with BLSR
        cmovc   %edi, %eax   # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input 0)
        ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit, instead of here where only the low 8 bits are actually 0 / 1. (Which is fine for returning a bool in all the mainstream calling conventions.)

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can conditionally update flags.)
An evil variation on this uses setnz / dec to invert ZF without affecting CF, allowing JA:

        blsr    %edi, %eax
        setnz   %al             # AL = !ZF
        dec     %al             # 1-1 -> ZF=1, 0-1 -> ZF=0.  ZF=!ZF without affecting CF
        # seta  %al             # set on CF=0 and ZF=0
        ja      was_single_bit  # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it doesn't cost any extra partial-FLAGS merging uops, because JA simply has both parts of FLAGS as separate inputs. (This is why CMOVA / CMOVBE are still 2 uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts of FLAGS, while others need either CF or SPAZO, not both. Interestingly, Zen1/2 have that effect but not Zen3.) I don't know how AMD handles dec / ja partial-flags shenanigans. Intel Haswell would, I think, have a flags-merging uop; older Intel doesn't support BMI1 so P6-family is irrelevant. https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing a boolean vs. creating a FLAGS condition to branch on, being branchless itself), so any single benchmark would make one of them look good. If your data almost never (or always) has an all-zero input, the JC in the first version will predict well. After inlining, if the caller branches on the bool result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful. Unlike branching separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common while single-bit inputs are rare. But blsr/setz/cmovc + test/jnz is only 4 uops, same as this on Skylake. (test can macro-fuse with jnz.) The uops are all dependent on each other, so it also has the same latency (to detect a branch miss) as popcnt / macro-fused cmp/je, which is 2 uops. The only thing this has going for it is avoiding a port-1-only uop, I think.
It's also possible to blsr / lahf / and ah, (1<<6) | (1<<0) / cmp ah, 1<<6 to directly check that ZF=1 and CF=0. I doubt that's useful. Or hmm, can we branch directly on PF after AND with that 2-bit mask? CF=1 ZF=0 is impossible, so the only other odd-parity case is CF=0 ZF=1. AMD and Intel can macro-fuse test/jp.

        blsr    %edi, %eax
        lahf
        test    $(1<<6) | (1<<0), %ah   # check ZF and CF.
        jpo     was_single_bit          # ZF != CF means CF=0, ZF=1 because the other way is impossible.

Also possible of course is the straightforward 2x setcc and AND to materialize a boolean in the bottom byte of EAX. Good ILP, only 3 cycle latency from input to result on Intel, but that's the same as setz/cmovc, which is fewer uops and can avoid false dependencies.
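All of the FLAGS gymnastics above implement the same predicate, which in plain C is either of these (my reference versions; x & (x-1) is exactly the operation BLSR performs, and CF/ZF encode the two tests):

```c
/* has_single_bit: true iff exactly one bit of x is set. */
static int single_bit_popcnt(unsigned x)
{
    return __builtin_popcount(x) == 1;    /* popcnt / cmp / sete with -mpopcnt */
}

static int single_bit_blsr(unsigned x)
{
    /* x & (x - 1) clears the lowest set bit (what BLSR computes).
     * Result zero with nonzero input  <=>  exactly one bit was set. */
    return x != 0 && (x & (x - 1)) == 0;
}
```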
[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 --- Comment #11 from Peter Cordes --- Also, horizontal byte sums are generally best done with VPSADBW against a zero vector, even if that means some fiddling to flip to unsigned first and then undo the bias.

simde_vaddlv_s8:
        vpxor   xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80)  flip to unsigned 0..255 range
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
        vmovd   eax, xmm0              # we only wanted the low half anyway
        sub     eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
        ret

This is so much shorter we'd still be ahead if we generated the vector constant on the fly instead of loading it. (3 instructions: vpcmpeqd same,same / vpabsb / vpslld by 7. Or pcmpeqd / psllw 8 / packsswb same,same to saturate to -128.)

If we had wanted a 128-bit (16 byte) vector sum, we'd need

        ...
        vpsadbw ...
        vpshufd xmm1, xmm0, 0xfe       # shuffle upper 64 bits to the bottom
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        sub     eax, 16 * 128

Works efficiently with only SSE2. Actually with AVX2, we should unpack the top half with VUNPCKHQDQ to save a byte (no immediate operand), since we don't need PSHUFD copy-and-shuffle. Or movd / pextrw / scalar add, but that's more uops: pextrw is 2 on its own.
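A sketch of the 16-byte signed-byte sum in intrinsics (my example following the asm recipe above): flip the sign bits so bytes map to 0..255, PSADBW against zero, combine the two 64-bit halves, then subtract the 16*128 bias:

```c
#include <emmintrin.h>  /* SSE2 */

/* Horizontal sum of 16 signed bytes, via PSADBW against zero. */
static int hsum_epi8(__m128i v)
{
    /* XOR with 0x80 per byte maps signed b to unsigned b + 128. */
    __m128i unsigned_v = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));
    /* psadbw: sum of the 8 unsigned bytes in each 64-bit half. */
    __m128i sad = _mm_sad_epu8(unsigned_v, _mm_setzero_si128());
    /* Bring the upper 64-bit sum down and add it to the lower one. */
    __m128i hi = _mm_shuffle_epi32(sad, 0xfe);
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi)) - 16 * 128;  /* undo bias */
}
```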
[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #10 from Peter Cordes --- Current trunk with -fopenmp is still not good: https://godbolt.org/z/b3jjhcvTa Still doing two separate sign extensions and two stores / wider reload (store forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
        push    rbp
        vpmovsxbw       xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw       xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
        ... then asm using byte-shifts

Including stuff like

        movdqa  xmm1, xmm0
        psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored. And ARM64 goes scalar.

Current trunk *without* -fopenmp produces decent asm: https://godbolt.org/z/h1KEKPTW9 For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3):

simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret

x86-64 gcc -O3 -march=skylake:

simde_vaddlv_s8:
        vpmovsxbw       xmm1, xmm0
        vpsrlq  xmm0, xmm0, 32
        vpmovsxbw       xmm0, xmm0
        vpaddw  xmm0, xmm1, xmm0
        vpsrlq  xmm1, xmm0, 32
        vpaddw  xmm0, xmm0, xmm1
        vpsrlq  xmm1, xmm0, 16
        vpaddw  xmm0, xmm0, xmm1
        vpextrw eax, xmm0, 0
        ret

That's pretty good, but VMOVD eax, xmm0 would be more efficient than VPEXTRW when we don't need to avoid high garbage (because it's a return value in this case). VPEXTRW zero-extends into RAX, so it's not directly helpful if we need to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx. Or with BMI2, go scalar before the last shift / VPADDW step, e.g.

        ...
        vmovd   eax, xmm0
        rorx    edx, eax, 16
        add     eax, edx
[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570 --- Comment #3 from Peter Cordes --- (In reply to Andrew Pinski from comment #2)
> Even on aarch64:
>
> .L2:
>         ldr     q0, [x1], 16
>         sxtl    v1.2d, v0.2s
>         sxtl2   v0.2d, v0.4s
>         scvtf   v1.2d, v1.2d
>         scvtf   v0.2d, v0.2d
>         stp     q1, q0, [x0]
>
> But the above is decent really.

More than decent, that's what we *should* be doing, I think. AArch64 has versions of most instructions that read the top of a vector, unlike x86-64 where VPMOVZX / SX can only read from the bottom half. That's the key difference, and what makes this strategy good on ARM, bad on x86-64.

(On 32-bit ARM, you load a q register, then read the two halves separately as 64-bit d<0..31> registers. AArch64 changed that so there are 32x 128-bit vector regs, and no partial regs aliasing the high half. But they provide OP, OP2 versions of some instructions that widen or things like that, with the "2" version accessing a high half. Presumably part of the motivation is to make it easier to port ARM NEON code that depended on accessing halves of a 128-bit q vector using its d regs. But it's a generally reasonable design and could also be motivated by seeing how inconvenient things get in SSE and AVX for pmovsx/zx.)

Anyway, AArch64 SIMD is specifically designed to make it fully efficient to do wide loads and then unpack both halves, like is possible in ARM, but not x86-64.

It's also using a store (of a pair of regs) that's twice the width of the load. But even if it was using a max-width load of a pair of 128-bit vectors (and having to store two pairs) that would be good, just effectively unrolling. But GCC sees it as one load and two separate stores, that it just happens to be able to combine as a pair.
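The loop being compiled is just a widening int -> double conversion (my reconstruction of the test case from the PR title; names are mine):

```c
#include <stddef.h>

/* Auto-vectorizes to the ldr / sxtl / sxtl2 / scvtf sequence above on
 * AArch64; on x86-64 the report asks for half-width memory operands
 * feeding cvtdq2pd instead of a full-width load plus extract shuffles. */
void cvt_i32_f64(double *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];  /* each int widens exactly to double */
}
```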
[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #9 from Peter Cordes --- Thanks for implementing my idea :)

(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}

TL:DR: I think we should still use vextracti* / vextractf* when that can get the job done in a single instruction, especially when the VEX-encoded vextracti/f128 can save a byte of code size for v[4]. Extracts are simpler shuffles that might have better throughput on some future CPUs, especially the upcoming Zen4, so even without code-size savings we should use them when possible.

Tiger Lake has a 256-bit shuffle unit on port 1 that supports some common shuffles (like vpshufb); a future Intel might add 256->128-bit extracts to that. It might also save a tiny bit of power, allowing on-average higher turbo clocks.

---

On current CPUs with AVX-512, valignd is about equal to a single vextract, and better than multiple instructions. It doesn't really have downsides on current Intel, since I think Intel has continued to not have int/FP bypass delays for shuffles.

We don't know yet what AMD's Zen4 implementation of AVX-512 will look like. If it handles AVX-512 the way Zen1 handled AVX2 (i.e. if it decodes 512-bit instructions other than insert/extract into at least 2x 256-bit uops), a lane-crossing shuffle like valignd probably costs more than 2 uops. (vpermq is more than 2 uops on Piledriver/Zen1.) But a 128-bit extract will probably cost just one uop. (And especially an extract of the high 256 might be very cheap and low latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for v[8].)

So this change is good, but using a vextracti64x2 or vextracti64x4 could be a useful peephole optimization when byte_offset % 16 == 0. Or of course vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible with an EVEX-encoded instruction).
vextractf-whatever allows an FP shuffle on FP data in case some future CPU cares about that for shuffles. An extract is a simpler shuffle that might have better throughput on some future CPU even with full-width execution units. Some future Intel CPU might add support for vextract uops to the extra shuffle unit on port 1. (Which is available when no 512-bit uops are in flight.) Currently (Ice Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm, not including any vextract or valign. Of course port 1 vector ALUs are shut down when 512-bit uops are in flight, but this could be relevant for __m256 vectors on these hypothetical future CPUs.

When we can get the job done with a single vextract-something, we should use that instead of valignd. Otherwise use valignd. We already check the index for low-128 special cases to use vunpckhqdq vs. vpshufd (or vpsrldq) or similar FP shuffles.

- On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be zero), an extract that only writes a 128-bit register will keep them clean (even if it reads a ZMM), not needing a VZEROUPPER. VZEROUPPER is only needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function like

float foo(float *p) {
  some vector stuff that can use high zmm regs;
  return scalar that happens to be from the middle of a vector;
}

could vextract into XMM0, but would need vzeroupper if it used valignd into ZMM0. (Also related: https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc re: reading a ZMM at all and turbo clock.)

---

Having known zeros outside the low 128 bits (from writing an xmm instead of rotating a zmm) is unlikely to matter, although for FP stuff copying fewer elements that might be subnormal could happen to be an advantage, maybe saving an FP assist for denormal. We're unlikely to be able to take advantage of it to save instructions/uops (like OR instead of blend).
But it's not worse to use a single extract instruction instead of a single valignd.
[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 --- Comment #37 from Peter Cordes --- Correction, PR82666 is that the cmov on the critical path happens even at -O2 (with GCC7 and later). Not just with -O3 -fno-tree-vectorize. Anyway, that's related, but probably separate from choosing to do if-conversion or not after inlining.
[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #36 from Peter Cordes --- Related: a similar case of cmov being a worse choice, for a threshold condition with an array input that happens to already be sorted: https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 GCC with -fprofile-generate / -fprofile-use does correctly decide to use branches. GCC7 and later (including current trunk) with -O3 -fno-tree-vectorize de-optimizes by putting the CMOV on the critical path, instead of as part of creating a zero/non-zero input for the ADD. PR82666. If you do allow full -O3, then vectorization is effective, though.
[Bug target/15533] Missed move to partial register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #5 from Peter Cordes --- The new asm is less bad, but still not good. PR53133 is closed, but this code-gen is a new instance of partial-register writing with xor al,al. Also related: PR82940 re: identifying bitfield insert patterns in the middle-end; hopefully Andrew Pinski's planned set of patches to improve that can help back-ends do a better job?

If we're going to read a 32-bit reg after writing an 8-bit reg (causing a partial-register stall on Nehalem and earlier), we should be doing

        mov     a, %al      # merge into the low byte of RAX
        ret

Haswell and newer Intel don't rename the low byte partial register separately from the full register, so they behave like AMD and other non-P6 / non-Sandybridge CPUs: dependency on the full register. That's good for this code; in this case the merging is necessary and we don't want the CPU to guess that it won't be needed later. The load+ALU-merge uops can micro-fuse into a single uop for the front end.

xor %al,%al still has a false dependency on the old value of RAX because it's not a zeroing idiom; IIRC in my testing it's at least as good to do mov $0, %al. Both instructions are 2 bytes long.

* https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers - survey of the ways partial regs are handled on Intel P6 family vs. Intel Sandybridge vs. Haswell and later vs. non-Intel and Intel Silvermont etc.
* https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to - details of my testing on Haswell / Skylake.

*If* we still care about -mtune=nehalem and other increasingly less relevant CPUs, we should be avoiding a partial register stall for those tuning options with something like

        movzbl  a, %edx
        and     $-256, %eax
        or      %edx, %eax

i.e.
what we're already doing, but spend a 5-byte AND-immediate instead of a 2-byte xor %al,%al or mov $0, %al (That's what clang always does, so it's missing the code-size optimization. https://godbolt.org/z/jsE57EKcb shows a similar case of return (a&0xFF00u) | (b&0xFFu); with two register args) - The penalty on Pentium-M through Nehalem is to stall for 2-3 cycles while a merging uop is inserted. The penalty on earlier P6 (PPro / Pentium III) is to stall for 5-6 cycles until the partial-register write retires. The penalty on Sandybridge (and maybe Ivy Bridge if it renames AL) is no stall, just insert a merging uop. On later Intel, and AMD, and Silvermont-family Intel, writing AL has a dependency on the old RAX; it's a merge on the spot. BTW, modern Intel does still rename AH separately, and merging does require the front-end to issue a merging uop in a cycle by itself. So writing AH instead of AL would be different.
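For reference, the source pattern being compiled is a byte merge like this (my sketch; the asm variants above are just different ways of materializing the same result):

```c
/* Hypothetical global standing in for the 'a' the asm loads. */
unsigned char a;

/* Replace the low byte of x with the global 'a', keeping the upper 24 bits;
 * "mov a, %al" does this in one merging load on x86. */
unsigned merge_low_byte(unsigned x)
{
    return (x & ~0xFFu) | a;
}
```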
[Bug middle-end/82940] Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #6 from Peter Cordes --- For a simpler test case, GCC 4.8.5 did redundantly mask before using bitfield-insert, but GCC 9.2.1 doesn't.

unsigned merge2(unsigned a, unsigned b){ return (a&0xFF00u) | (b&0xFFu); }

https://godbolt.org/z/froExaPxe

# PowerPC (32-bit) GCC 4.8.5
        rlwinm 4,4,0,0xff    # b &= 0xFF is totally redundant
        rlwimi 3,4,0,24,31
        blr

# power64 GCC 9.2.1 (ATI13.0)
        rlwimi 3,4,0,255     # bit-blend according to mask, rotate count=0
        rldicl 3,3,0,32      # Is this zero-extension to 64-bit redundant?
        blr

But ppc64 GCC does zero-extension of the result from 32 to 64-bit, which is probably not needed unless the calling convention has different requirements for return values than for incoming args. (I don't know PPC well enough.) So for at least some cases, modern GCC does ok.

Also, when the blend isn't split at a byte boundary, even GCC 4.8.5 manages to avoid redundant masking before the bitfield-insert.

unsigned merge2(unsigned a, unsigned b){ return (a & 0xFF80u) | (b & 0x7Fu); }

        rlwimi 3,4,0,25,31   # GCC 4.8.5, 32-bit so no zero-extension
        blr
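Both merge2 variants are instances of the generic masked-blend idiom, which is what a single rlwimi implements for contiguous masks (my generic sketch, not from the report):

```c
/* Select bits of b where mask is 1, bits of a where mask is 0. */
static unsigned blend_by_mask(unsigned a, unsigned b, unsigned mask)
{
    return (a & ~mask) | (b & mask);  /* equivalent xor form: a ^ ((a ^ b) & mask) */
}
```

merge2(a, b) above is blend_by_mask(a, b, 0xFFu) with a pre-masked to its byte; the redundant rlwinm GCC 4.8.5 emitted corresponds to masking b even though the blend already ignores b's other bits.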
[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

--- Comment #2 from Peter Cordes ---
Possibly also related: with different surrounding code, this loop can compile
to asm which has two useless movz / mov register copies in the loop at -O2
(https://godbolt.org/z/PTcqzM6q7).  (They set up for entry into the next loop
in over-complicated ways, and doing this inside the loop is unnecessary.)

    while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
        str++;
    }

.L19:
        movzbl  1(%rdi), %edx
        addq    $1, %rdi
        movzbl  %dl, %ecx
        movl    %edx, %eax
        cmpb    $0, -120(%rsp,%rcx)
        je      .L19

from source

void remove_chars(char *restrict str, const char *restrict remove) {
    char lut[256] = {0};
    do {
        lut[(unsigned char)*remove] = -1;
    } while (*remove++);

    /*** Over-complicated asm in this loop */
    while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
        str++;
    }
    // str points at first char to *not* keep (or the terminating 0)

    const char *in = str;
    char *out = str;
    while (*in) {
        char mask = lut[(unsigned char)*in];
        unsigned char cin = *in, cout = *out;
        *out = mask ? cout : cin;
        out += mask + 1;
        in++;
    }
    *out = *in;
}
[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

            Bug ID: 100922
           Summary: CSE leads to fully redundant (back to back)
                    zero-extending loads of the same thing in a loop, or a
                    register copy
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 50948
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit
redundant_zero_extend.c

It's rarely a good idea to load the same thing twice; it's generally better to
copy a register, or to read the same register twice when a copy isn't needed.
So the following asm should never happen, but it does with current trunk, and
similar with GCC as old as 4.5:

        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx   # no branch target between these instructions

or

        ldrb    w4, [x2]
        ldrb    w3, [x2], 1    # post-indexed *x2++

(Happens at -O3.  With -O2 we have a redundant register copy, so either way
still a wasted instruction.  And there are other differences earlier in the
function with -O2 vs. -O3.)

https://godbolt.org/z/jT7WaWeK8 - minimal test case.  x86-64 and AArch64 trunk
show basically identical code structure.  x86-64 gcc (Compiler-Explorer-Build)
12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524.

void remove_chars_inplace(char *str, const unsigned char keep_lut[256]) {
    while (keep_lut[(unsigned char)*str]) {  // can be an if() and still repro
        str++;                               // keep_lut[0] is false
    }
    char *out = str;
    unsigned char c;  /* must be unsigned char for correctness. */
    do {
        c = *str++;
        unsigned char inc = keep_lut[c];  // unsigned long doesn't help
        *out = c;
        out += inc;  // inc=0 or 1 to let next char overwrite or not
    } while (c);
}

x86-64 asm:

remove_chars_inplace:
        jmp     .L8
.L3:    # top of search loop for first char to remove
        addq    $1, %rdi
.L8:    # loop entry point
        movzbl  (%rdi), %eax
        cmpb    $0, (%rsi,%rax)  # un-laminates and doesn't macro-fuse ...
        jne     .L3
        cmpb    $0, (%rdi)      # 2nd loop body can be skipped if *str == 0
                                # should be test %al,%al - this char was already loaded.
        leaq    1(%rdi), %rax   # even -march=znver2 fails to move this earlier or
                                # later to allow cmp/je fusion.  (Intel won't
                                # macro-fuse cmp imm,mem / jcc)
        je      .L1
.L5:    # TOP OF 2ND LOOP
        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx      # redundant load of *str
        addq    $1, %rax
        movzbl  (%rsi,%rdx), %edx # inc = lut[c]
        movb    %cl, (%rdi)
        addq    %rdx, %rdi        # out += inc
        testb   %cl, %cl
        jne     .L5               # }while(c != 0)
.L1:
        ret

IDK if it's interesting or not that the cmpb $0, (%rdi) is also a redundant
load.  The first loop left *str, i.e. (%rdi), in EAX.  Putting the LEA between
cmp and je (even with -march=znver2) is a separate missed optimization
(unless that's working around Intel's JCC erratum).

With only -O2 instead of -O3, we get better asm in that part: it takes
advantage of having the char in AL, and jumps into the middle of the next loop
after xor-zeroing the `inc` variable.

Replacing c = *str++; with c = *str; str++; results in a wasted register copy
with trunk, instead of a 2nd load (on x86-64 and arm64).  Still a missed opt,
but less bad.  GCC 7 and earlier still do an extra load with either way of
writing that.

Removing the first loop, or making its loop condition something like
*str && keep_lut[*str], removes the problem entirely.  The CSE possibility is
gone.  (Same even if we use lut[*(unsigned char*)str] - type-punning the
pointer to unsigned char instead of casting the signed char value to unsigned
char, on x86 where char is signed, but not on arm64 where char is unsigned.)

---

I didn't find any clear duplicates; the following are barely worth mentioning:
* pr94442 looks like extra spilling, not just redundant loading.
* pr97366 is due to vectors of different types, probably.
* pr64319 needs runtime aliasing detection to avoid, unlike this.
The AArch64 version of this does seem to demo pr71942 (a useless and x4, x2, 255 on an LDRB result) when you get it to copy a register instead of doing a 2nd load.
[Bug rtl-optimization/88770] Redundant load opt. or CSE pessimizes code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770

Peter Cordes changed:
           What      |Removed |Added
           CC        |        |peter at cordes dot ca

--- Comment #2 from Peter Cordes ---
Note that mov r64, imm64 is a 10-byte instruction, and can be slow to read
from the uop cache on Sandybridge-family.  The crap involving OR is clearly
sub-optimal, but *if* you already have two spare call-preserved registers
across this call, the following is actually smaller code-size:

        movabs  rdi, 21474836483
        mov     rbp, rdi
        movabs  rsi, 39743127552
        mov     rbx, rsi
        call    test
        mov     rdi, rbp
        mov     rsi, rbx
        call    test

This is more total uops for the back-end though (movabs is still single-uop,
but takes 2 entries in the uop cache on Sandybridge-family;
https://agner.org/optimize/).  So saving x86 machine-code size this way does
limit the ability of out-of-order exec to see farther, if the front-end isn't
the bottleneck.  And it's highly unlikely to be worth saving/restoring two
regs to enable this.  (Or to push rdi / push rsi before the calls, then pop
after!)

Setting up the wrong value and then fixing it twice with OR is obviously
terrible and never has any advantage, but the general idea of CSEing large
constants isn't totally crazy.  (But it's profitable only in such limited
cases that it might not be worth looking for, especially if it's only helpful
at -Os.)
[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

Peter Cordes changed:
           What      |Removed |Added
           Status    |NEW     |RESOLVED
           Resolution|---     |FIXED

--- Comment #4 from Peter Cordes ---
This seems to be fixed for ZMM vectors in GCC 8.
https://gcc.godbolt.org/z/7351be1v4

It seems to have never been a problem for __m256, at least not for

__m256 zero256(){ return _mm256_setzero_ps(); }

IDK what I was looking at when I originally reported; maybe just clang, which
*did* used to prefer YMM-zeroing.

Some later comments suggested movdqa vs. pxor zeroing choices (and mov vs. xor
for integer), but the bug title is just about AVX / AVX-512 xor-zeroing, and
that seems to be fixed.  So I think this should be closed.
[Bug tree-optimization/42587] bswap not recognized for memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587

Peter Cordes changed:
           What      |Removed |Added
           CC        |        |peter at cordes dot ca

--- Comment #12 from Peter Cordes ---
(In reply to Andi Kleen from comment #11)
> Only when the first test case is fixed too

https://godbolt.org/z/7M8cx3vT1  GCC 8.1 -O3 for x86-64:

        push    rbx
        mov     ebx, edi
        call    acpi_ut_track_stack_ptr
        mov     eax, ebx
        pop     rbx
        bswap   eax
        ret

The code in the initial report optimizes to bswap with GCC 8.1 and later.  Is
that the test case you meant?  GCC 8.1 was released on May 2, 2018, well
before your Nov comment, so maybe you meant something else.
[Bug middle-end/98801] Request for a conditional move built-in function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801

Peter Cordes changed:
           What      |Removed |Added
           CC        |        |peter at cordes dot ca

--- Comment #5 from Peter Cordes ---
(In reply to Richard Biener from comment #4)
> Slight complication arises because people will want to have cmoves with a
> memory destination.

Do we even want to provide this?  Most ISAs can't branchlessly conditionally
store, except via an RMW (which wouldn't be thread-safe for the no-store case
if not atomic) or something really clunky (like x86 rep stos with count=0 or
1).  ARM predicated instructions allow a branchless load or store that doesn't
disturb the memory operand (and won't even fault on a bad address).

I guess another option to emulate it would be to make a dummy local and cmov
to select a store address: the dummy or the real destination.  But that's
something users can build in the source using a non-memory conditional-select
builtin that exposes the much more widely available ALU conditional-select
functionality, like x86 CMOV, AArch64 CSEL, MIPS MOVN, etc.

> That won't solve the eventual request to have cmov _from_ memory ... (if we
> leave all of the memory combining to RTL people will again complain that
> it's subject to compilers discretion).

It might be sufficient for most use-cases, like defending against timing
side-channels, to not really try to allow conditional loads (from
maybe-invalid pointers).

I'm not sure if the motivation for this includes trying to make code without
data-dependent branching, to defend against timing side-channels.  But if we
do provide something like this, people are going to want to use it that way.
That's one case where best-effort behaviour at the mercy of the optimizer for
a ternary (or having to manually check the asm) is not great.  Stack Overflow
has gotten a few Q&As from people looking for guaranteed CMOV for reasons like
that.  So I think we should be wary of exposing functionality that most ISAs
don't have.
OTOH, failing to provide a way to take advantage of functionality that some
ISAs *do* have is not great; e.g. ISO C failing to provide popcnt and bit-scan
(clz / ctz) has been a problem for C for a long time.  But for something like
__builtin_clz, emulating on machines that don't have hardware support still
works.

If we're trying to support a guarantee of no data-dependent branching, that
limits the emulation possibilities or makes them clunkier.  Especially if we
want to support ARM's ability to not fault / not access memory if the
condition is false.  The ALU-select part can be emulated with AND/OR, so
that's something we can provide on any target.

Folding memory operands into a predicated load on ARM could actually introduce
data-dependent cache access, vs. an unconditional load and a predicated
reg-reg MOV.  So this becomes somewhat thorny, and some design work to figure
out what documented guarantees to provide will be necessary.  Performance
use-cases would certainly rather just have a conditional load in one
instruction.
[Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291

            Bug ID: 98291
           Summary: multiple scalar FP accumulators auto-vectorize worse
                    than scalar, including vector load + merge instead of
                    scalar + high-half insert
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess,
instead of effectively mapping each scalar to an element of one vector
accumulator.  (Unless we use -ffast-math; then that happens.  clang gets it
right even without -ffast-math.)

double dotprod(const double *a, const double *b, unsigned long long n)
{
    double d1 = 0.0;
    double d2 = 0.0;

    for (unsigned long long i = 0; i < n; i += 2) {
        d1 += a[i] * b[i];
        d2 += a[i + 1] * b[i + 1];
    }

    return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math, the nice sane loop we expect:

.L3:
        movupd  (%rsi,%rax), %xmm0
        movupd  (%rdi,%rax), %xmm3
        addq    $1, %rdx
        addq    $16, %rax
        mulpd   %xmm3, %xmm0
        addpd   %xmm0, %xmm1
        cmpq    %rcx, %rdx
        jb      .L3

Without:

 ... main loop
.L4:
        movupd  (%rcx,%rax), %xmm1    # 16-byte load
        movupd  (%rsi,%rax), %xmm3
        movhpd  16(%rcx,%rax), %xmm1  # overwrite the high half of it!!
        movhpd  16(%rsi,%rax), %xmm3
        mulpd   %xmm3, %xmm1
        movupd  16(%rsi,%rax), %xmm3
        movlpd  8(%rsi,%rax), %xmm3
        addsd   %xmm1, %xmm2
        unpckhpd %xmm1, %xmm1
        addsd   %xmm1, %xmm2
        movupd  16(%rcx,%rax), %xmm1
        movlpd  8(%rcx,%rax), %xmm1
        addq    $32, %rax
        mulpd   %xmm3, %xmm1
        addsd   %xmm1, %xmm0
        unpckhpd %xmm1, %xmm1
        addsd   %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L4

The overall strategy is insane, but even some of the details are insane, e.g.
a 16-byte load into XMM1, and then overwriting the high half of that with a
different double before reading it.
That's bad enough, but you'd expect movsd / movhpd to manually gather 2
doubles, without introducing the possibility of a cache-line-split load for
zero benefit.  Similarly, movupd / movlpd should have just loaded in the other
order.  (Or, since they're contiguous, movupd 8(%rsi,%rax), %xmm3 / shufpd.)

So beyond the bad overall strategy (which is likely worse than unrolled
scalar), it might be worth checking for some of this kind of smaller-scale
insanity somewhere later, to make it less bad if some other inputs can trigger
similar behaviour.  (This small-scale detecting of movupd / movhpd that should
be movsd / movhpd could be a separate bug, but if it's just a symptom of
something that should never happen in the first place then it's not really its
own bug at all.)
[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366 --- Comment #1 from Peter Cordes --- Forgot to include https://godbolt.org/z/q44r13
[Bug target/97366] New: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

            Bug ID: 97366
           Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX
                    vector intrinsics
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

When you use the same _mm_load_si128 or _mm256_load_si256 result twice,
sometimes GCC loads it *and* uses it as a memory source operand.  I'm not
certain this is specific to x86 back-ends; please check bug tags if it happens
elsewhere.  (But it probably doesn't on 3-operand load/store RISC machines; it
looks like one operation chooses to load and then operate, and the other
chooses to use the original source as a memory operand.)

#include <immintrin.h>

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i = 0; i < 1024; i += 16) {
        __m128i in = _mm_load_si128((__m128i*)&input[i]);
        __m128i high = _mm_srli_epi32(in, 4);
        _mm_store_si128((__m128i*)&out[i], _mm_or_si128(in, high));
    }
}

gcc 8 and later, -O3 -mavx2, including 11.0.0 20200920:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L6:
        vmovdqa (%rsi,%rax), %xmm1          # load
        vpsrld  $4, %xmm1, %xmm0
        vpor    (%rsi,%rax), %xmm0, %xmm0   # reload as a memory operand
        vmovdqa %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L6
        ret

GCC 7.5 and earlier use vpor %xmm1, %xmm0, %xmm0 to use the copy of the
original that was already loaded.

`-march=haswell` happens to fix this for GCC trunk, for this 128-bit version
but not for a __m256i version.  restrict doesn't make a difference, and
there's no overlapping anyway.  The two redundant loads both happen between
any other stores.

Using a memory source operand for vpsrld wasn't an option: the form with a
memory source takes the *count* from memory, not the data.
https://www.felixcloutier.com/x86/psllw:pslld:psllq

Note that *without* AVX, the redundant load is a possible win, for code
running on Haswell and later Intel (and AMD) CPUs.  Possibly some heuristic is
saving instructions for the legacy-SSE case (in a way that's probably worse
overall) and hurting the AVX case.

GCC 7.5, -O3 without any -m options:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L2:
        movdqa  (%rsi,%rax), %xmm0
        movdqa  %xmm0, %xmm1   # this instruction avoided
        psrld   $4, %xmm1
        por     %xmm1, %xmm0   # with a memory-source reload, in GCC 8 and later
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2
        rep ret

Using a memory-source POR saves 1 front-end uop by avoiding a register copy,
as long as the indexed addressing mode can stay micro-fused on Intel.
(Requires Haswell or later for that to happen, or any AMD.)  But in practice
it's probably worse: load-port pressure, space in the out-of-order scheduler,
and code size are problems for using an extra memory-source operand in the
SSE version, with the upside being saving 1 uop for the front-end (and thus in
the ROB).  mov-elimination on modern CPUs (Ivy Bridge, bdver1, and later)
means the movdqa register copy costs no back-end execution resources.

I don't know if GCC trunk is using por (%rsi,%rax), %xmm0 on purpose for that
reason, or if it's just a coincidence.  I don't think it's a good idea on most
CPUs, even if alignment is guaranteed.

This is of course 100% a loss with AVX: we have to do a `vmovdqa/u` load for
the shift anyway, and it can leave the original value in a register, so we're
not saving a vmovdqu.  And it's a bigger loss because indexed memory-source
operands un-laminate from 3-operand instructions even on Haswell/Skylake:
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695
so it hurts the front-end as well as wasting cycles on load ports, and taking
up space in the RS (scheduler).
The fact that -mtune=haswell fixes this for 128-bit vectors is interesting,
but it's clearly still a loss in the AVX version for all AVX CPUs: 2 memory
ops per cycle on Zen could become a bottleneck, and it's larger code size.
And -mtune=haswell *doesn't* fix it for the -mavx2 __m256i version.

There is a possible real advantage in the SSE case, but it's very minor and
outweighed by the disadvantages, especially for older CPUs like Nehalem that
can only do 1 load / 1 store per clock.  (Although this loop has so many uops
that it barely bottlenecks on that.)