[Bug c++/110619] Dangling pointer returned from constexpr function converts to nullptr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619

Peter Cordes changed:
           What    |Removed     |Added
                 CC|            |peter at cordes dot ca

--- Comment #7 from Peter Cordes ---
(In reply to Andrew Pinski from comment #2)
> > but it is not nullptr.
>
> Or is it just undefined so it could be considered a nullptr ...

Implementation-defined behaviour, according to answers on
https://stackoverflow.com/questions/76843246/why-does-the-address-of-an-out-of-scope-variable-equal-zero-with-constexpr

https://eel.is/c++draft/basic.compound#def:value,invalid_pointer
https://eel.is/c++draft/basic.stc.general#4
> Indirection through an invalid pointer value and passing an invalid pointer
> value to a deallocation function have undefined behavior.
> **Any other use of an invalid pointer value has implementation-defined
> behavior.**

So this wasn't a bug, but the new behaviour is also allowed. This commit could
be reverted or kept, depending on maintainability and/or quality-of-life for
users of GCC. Having GCC pick the opposite implementation-defined behaviour
from clang (GCC's previous behaviour) is maybe a *good* thing: it helps
programmers catch dependence on an invalid pointer being either null or
non-null, if they try their code with both compilers.
[Bug middle-end/108441] [12 Regression] Maybe missed optimization: loading a 16-bit integer value from .rodata instead of an immediate store
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441

--- Comment #4 from Peter Cordes ---
This is already fixed in current trunk; sorry I forgot to check that before
recommending to report this store-coalescing bug.

# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3 (tune=generic) and GCC11
store:
        movl    $16, %eax
        movw    %ax, ldap(%rip)
        ret

In case anyone's wondering why GCC doesn't movw $16, foo(%rip): it's avoiding
LCP stalls on Intel P6-family CPUs from the 16-bit immediate. For MOV
specifically, that only happens on P6-family (Nehalem and earlier), not
Sandybridge-family, so it's getting close to time to drop it from
-mtune=generic. (-mtune=bdver* or znver* don't do it, so there is a tuning
setting controlling it.)

GCC *only* seems to know about MOV, so ironically with -march=skylake for
example, we avoid a non-existent LCP stall for mov to memory, but GCC compiles
x += 1234 into code that will LCP stall: addw $1234, x(%rip).

-march=alderlake disables this tuning workaround, using movw $imm, mem. (The
Silvermont-family E-cores in Alder Lake don't have this problem either, so
that's correct. Agner Fog's guide didn't mention any changes in LCP stalls for
Alder Lake.)

Avoiding LCP stalls is somewhat less important on CPUs with a uop cache, since
the stall only happens during legacy decode. Although various things can cause
code to run only from legacy decode even inside a loop, such as Skylake's JCC
erratum microcode mitigation if users don't assemble with the option to have
GAS work around it, which GCC doesn't pass by default with -march=skylake.

If there isn't already a bug open about tuning choices mismatching hardware, I
can repost this as a new bug if you'd like.

Related:
https://stackoverflow.com/questions/75154687/is-this-a-missed-optimization-in-gcc-loading-an-16-bit-integer-value-from-roda
and
https://stackoverflow.com/questions/70719114/why-does-the-short-16-bit-variable-mov-a-value-to-a-register-and-store-that-u
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #27 from Peter Cordes ---
(In reply to Alexander Monakov from comment #26)
> Sure, the right course of action seems to be to simply document that atomic
> types and built-ins are meant to be used on "common" (writeback) memory

Agreed. Where in the manual should this go? Maybe a new subsection of the
chapter about __atomic builtins where we document per-ISA requirements for
them to actually work? e.g. x86 memory-type stuff, and that ARM assumes all
cores are in the same inner-shareable cache-coherency domain, thus barriers
are dmb ish not dmb sy, and so on.

I guess we might want to avoid documenting the actual asm implementation
strategies in the main manual, because that would imply it's supported to make
assumptions based on that.

Putting it near the __atomic docs might make it easier for readers to notice
that the list of requirements exists, vs. scattering them into different pages
for different ISAs. And we don't currently have any section in the manual
about per-ISA quirks or requirements, just about command-line options,
builtins, and attributes that are per-ISA, so there's no existing page where
this could get tacked on.

This would also be a place where we can document that __atomic ops are
address-free when they're lock-free, and thus usable on shared memory between
processes. ISO C++ says that *should* be the case for std::atomic, but doesn't
standardize the existence of multiple processes.

To avoid undue worry, documentation about this should probably start by saying
that normal programs (running under mainstream OSes) don't have to worry about
it or do anything special.
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #25 from Peter Cordes ---
(In reply to Alexander Monakov from comment #24)
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> memory (and then CPU vendor cannot guarantee that 128b access won't tear
> because it might depend on downstream devices).

Even atomic_int doesn't work properly if you deref a pointer to WC memory. WC
doesn't have the same ordering guarantees, so it would break acquire/release
semantics. So we already don't support WC for this.

We do at least de-facto support atomics on UC memory, because the ordering
guarantees are a superset of cacheable memory, and 8-byte atomicity for
aligned load/store is guaranteed even for non-cacheable memory types since P5
Pentium (and on AMD). (And lock cmpxchg16b is always atomic, even on UC
memory.)

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores
would be atomic on UC memory. So this change could break that very unwise
corner case on AMD, which only guarantees that for cacheable loads/stores, and
Zhaoxin, which only guarantees it for WB.

But was anyone previously using 16-byte atomics on UC device memory? Do we
actually care about supporting that? I'd guess no and no, so it's just a
matter of documenting that somewhere. Since GCC7 we've reported 16-byte
atomics as being non-lock-free, so I *hope* people weren't using
__atomic_store_n on device memory. The underlying implementation was never
guaranteed.
[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Peter Cordes changed:
           What    |Removed     |Added
                 CC|            |peter at cordes dot ca

--- Comment #23 from Peter Cordes ---
(In reply to Xi Ruoyao from comment #20)
> "On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed
> memory is Write Back, but it's not guaranteed for other memory types."

VMOVDQA is still fine, I think. WB is the only memory type that's relevant for
atomics, at least on the mainstream OSes we compile for. It's not normally
possible for user-space to allocate memory of other types. Kernels normally
use WB memory for their shared data, too.

You're correct that WT and WP are the other two cacheable memory types, and
Zhaoxin's statement doesn't explicitly guarantee atomicity for those, unlike
Intel and AMD. But at least on Linux, I don't think there's a way for
user-space to even ask for a page of WT or WP memory (or UC or WC). Only WB
memory is easily available without hacking the kernel. As far as I know, this
is true on other existing OSes.

WT = write-through: read caching, no write-allocate. Write hits update the
line and memory.
WP = write-protect: read caching, no write-allocate. Writes go around the
cache, evicting even on hit.
(https://stackoverflow.com/questions/65953033/whats-the-usecase-of-write-protected-pat-memory-type
quotes the Intel definitions.)

Until recently, the main work on formalizing the x86 TSO memory model had only
looked at WB memory. A 2022 paper looked at WT, UC, and WC memory types:
https://dl.acm.org/doi/pdf/10.1145/3498683 - Extending Intel-x86 Consistency
and Persistency: Formalising the Semantics of Intel-x86 Memory Types and
Non-temporal Stores
(The intro part describing memory types is quite readable, in plain English,
not full of formal symbols. They only mention WP once, but tested some litmus
tests with readers and writers using any combination of the other memory
types.)
Some commenters on my answer about when WT is ever used or useful confirmed
that mainstream OSes don't give easy access to it.
https://stackoverflow.com/questions/61129142/when-use-write-through-cache-policy-for-pages/61130838#61130838

* Linux has never merged a patch to let user-space allocate WT pages.
* The Windows kernel reportedly doesn't have a mechanism to keep track of
pages that should be WT or WP, so you won't find any.

I don't know whether any *BSD makes it plausible for user-space to point an
_Atomic int * at a page of WT or WP memory. I'd guess not.

I don't know if there's anywhere we can document that _Atomic objects need to
be in memory that's allocated in a "normal" way. Probably hard to word without
accidentally disallowing something that's fine.
[Bug tree-optimization/106138] Inefficient code generation: logical AND of disjoint booleans from equal and bitwise AND not optimized to constant false
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138

Peter Cordes changed:
           What    |Removed     |Added
                 CC|            |peter at cordes dot ca

--- Comment #3 from Peter Cordes ---
Ideally, bitwise & of booleans should also be handled, not just &&. A testcase
(https://godbolt.org/z/qvosv8q7c) makes it easy to check both.

//#define LOGIC_AND
_Bool f2(char x) {
    _Bool b1 = x == 2;
    _Bool b2 = x & 1;
#ifdef LOGIC_AND
    return b1 && b2;
#else
    return b1 & b2;
#endif
}

(Clang optimized it to return false for the && version, but not bitwise. GCC
currently doesn't optimize either way.)

This was originally posted on Stack Overflow
(https://stackoverflow.com/q/72802469/224132), BTW.
[Bug target/105929] New: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929

Bug ID: 105929
Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use
2 32-bit halves with _Atomic or volatile
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

// compiles nicely: https://godbolt.org/z/8zf8ns14K
        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But even with -Os -march=armv8.4-a the following doesn't:

void foo_atomic(_Atomic unsigned long *p) {
    __atomic_store_n(p, 0xdeadbeefdeadbeef, __ATOMIC_RELAXED);
}

        mov     x1, 48879
        movk    x1, 0xdead, lsl 16
        movk    x1, 0xbeef, lsl 32
        movk    x1, 0xdead, lsl 48
        stlr    x1, [x0]
        ret

ARMv8.4-a and later guarantees atomicity for aligned ldp/stp, according to
ARM's architecture reference manual (ARM DDI 0487H.a - ID020222), so we could
use the same asm as the non-atomic version.

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access
> fewer than 16 bytes are single-copy atomic when all of the following
> conditions are true:
> • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory

(FEAT_LSE2 is the same CPU feature that gives 128-bit atomicity for aligned
ldp/stp x,x,mem)

Prior to that, apparently it wasn't guaranteed that stp of 32-bit halves
merged into a single 64-bit store. So without -march=armv8.4-a it wasn't a
missed optimization to construct the constant in a single register for _Atomic
or volatile. But with ARMv8.4, we should use MOV/MOVK + STP.

Since there doesn't seem to be a release-store version of STP, 64-bit release
and seq_cst stores should still generate the full constant in a register,
instead of using STP + barriers.
(Without ARMv8.4-a, or with a memory-order other than relaxed, see PR105928 for generating 64-bit constants in 3 instructions instead of 4, at least for -Os, with add x0, x0, x0, lsl 32)
[Bug target/105928] New: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928

Bug ID: 105928
Summary: [AArch64] 64-bit constants with same high/low halves can use
ADD lsl 32 (-Os at least)
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

cleverly compiles to (https://godbolt.org/z/b3oqao5Kz)

        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But producing the value in a register uses more than 3 instructions:

unsigned long constant(){ return 0xdeadbeefdeadbeef; }

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        movk    x0, 0xbeef, lsl 32
        movk    x0, 0xdead, lsl 48
        ret

At least with -Os, and maybe at -O2 or -O3 if it's efficient, we could be
doing a shifted ADD or ORR to broadcast a zero-extended 32-bit value to
64-bit:

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        add     x0, x0, x0, lsl 32

Some CPUs may fuse sequences of movk, and shifted operands for ALU ops may
take extra time on some CPUs, so this might not actually be optimal for
performance, but it is smaller for -Os and -Oz.

We should also be using that trick for stores to _Atomic or volatile long*,
where we currently do MOV + 3x MOVK, then an STR, with ARMv8.4-a which
guarantees atomicity.

---

ARMv8.4-a and later guarantees atomicity for ldp/stp within an aligned
16-byte chunk, so we should use MOV/MOVK / STP there even for volatile or
__ATOMIC_RELAXED. But presumably that's a different part of GCC's internals,
so I'll report that separately.
[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904

Bug ID: 105904
Summary: Predicated mov r0, #1 with opposite conditions could be hoisted,
between 1 and 1<

// using the libstdc++ <bit> header
unsigned roundup(unsigned x){ return std::bit_ceil(x); }

https://godbolt.org/z/Px1fvWaex

GCC's version is somewhat clunky, including MOV r0, #1 in either "side":

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1        @@ here
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1        @@ here
        bx      lr

Even without spotting the other optimizations that clang finds, we can combine
those into a single unconditional MOV r0, #1. But only if we avoid setting
flags, so it requires a 4-byte encoding, not MOVS. Still, it's one fewer
instruction to execute.

This is not totally trivial: it requires seeing that we can move it across the
conditional LSL. So it's really a matter of folding the 1s between 1<
[Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

--- Comment #1 from Peter Cordes ---
https://godbolt.org/z/aoG55T5Yq

gcc -O3 -m32 has the same problem with unsigned long long total and unsigned
i. Pretty much identical instruction sequences in the loop for all 3 versions,
doing add/adc to increment i, for example. (Plus a bit of spilling.)
fact_gcc_handhold still compiles without the unnecessary widening.

Perhaps this should be retitled to say widened to a "2-register type". IDK how
easily this occurs in real-world loops with 64-bit and 32-bit integers on
32-bit machines, but that's probably where the bigger worldwide waste of clock
cycles is.
[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

Bug ID: 105596
Summary: Loop counter widened to 128-bit unnecessarily
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by
widening i and doing a full 128x128 => 128-bit multiply, and having to do a
128-bit increment and compare. uint64_t i to make it a full register width
doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}  // 0! = 0 isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3 (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10       # total = n zext into R11:R10
        cmpl    $2, %edi
        jbe     .L7             # if n<=2 return r11:r10
        movl    $2, %esi        # i = 2 in RDI:RSI
        xorl    %edi, %edi
.L9:                            # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8        # copy original n to destroy later
        imulq   %r10, %rdx      # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11      # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11      # last partial product
        addq    $1, %rsi        # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8       # r8 = n^i
        movq    %rdi, %rcx      # useless copy, we're already destroying r8
        orq     %r8, %rcx       # hi(i^n) | lo(i^n)
        jne     .L9             # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, it's not even doing it very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific. It also compiles similarly for
AArch64 and MIPS64. For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read.
RV64 has three multiply instructions in both versions. I haven't tested a
32-bit equivalent with uint64_t total and uint32_t i.

This anti-optimization goes back to GCC4.6. With GCC4.5 and earlier, the above
C compiles to a tight loop with the expected mul reg + imul reg,reg and a
1-register loop counter: https://godbolt.org/z/6KheaqTx4 (using __uint128_t,
since unsigned __int128 wasn't supported on GCC4.4 or 4.1)

GCC4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register. It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe. GCC4.4 optimizes
away the parts that are useless because the high 64 bits of (u128)i are zero.

---

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;   // 0! = 0 isn't mathematically correct
}

---

When the loop condition is possibly always-true, GCC can't prove the loop is
non-infinite, and as usual can't widen the loop counter. In this case, that's
a good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;       // loop does do final n
    for (unsigned i=2 ; i <= n ; i++)  // potentially infinite loop defeats this pessimization
        total *= i;
    return total;   // fun fact: 0! = 1 is mathematically correct
}

fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx        # i = 2 in ECX
        movl    $1, %eax        # total = 1 in RDX:RAX
        xorl    %edx, %edx
.L3:                            # do{
        movl    %ecx, %esi      # copy i instead of just incrementing it later :/
        movq    %rdx, %r8       # save high half of total
        addl    $1, %ecx        # i++
        imulq   %rsi, %r8       # lo x hi cross product
        mulq    %rsi            # lo x lo widening
        addq    %r8, %rdx       # 128x64-bit multiply
        cmpl    %ecx, %edi
        jnb     .L3             # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.
[Bug target/65146] alignment of _Atomic structure member is not correct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

--- Comment #25 from Peter Cordes ---
(In reply to CVS Commits from comment #24)
> The master branch has been updated by Jakub Jelinek :
>
> https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
>
> commit r11-2909-g04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
> Author: Jakub Jelinek
> Date: Thu Aug 27 18:44:40 2020 +0200
>
> ia32: Fix alignment of _Atomic fields [PR65146]
>
> For _Atomic fields, lowering the alignment of long long or double etc.
> fields on ia32 is undesirable, because then one really can't perform atomic
> operations on those using cmpxchg8b.

Just for the record, the description of this bugfix incorrectly mentioned
cmpxchg8b being a problem. lock cmpxchg8b is *always* atomic, even if that
means the CPU has to take a bus lock (disastrously expensive, affecting all
cores system-wide) instead of just delaying MESI response for one line
exclusively owned in this core's private cache (aka a cache lock).

The correctness problem is __atomic_load_n / __atomic_store_n compiling to
actual 8-byte pure loads / pure stores using SSE2 movq, SSE1 movlps, or x87
fild/fistp (bouncing through the stack), such as

        movq    %xmm0, (%eax)

That's where correctness depends on Intel's and AMD's atomicity guarantees,
which are conditional on alignment.

(And if AVX is supported, same deal for 16-byte load/store. Although we can
and should use movaps for that, which bakes alignment checking into the
instruction. Intel did recently document that CPUs with AVX guarantee
atomicity of 16-byte aligned loads/stores, retroactive to all CPUs with AVX.
It's about time, but yay.)

> Not sure about iamcu_alignment change, I know next to nothing about IA MCU,
> but unless it doesn't have cmpxchg8b instruction, it would surprise me if we
> don't want to do it as well.

I had to google iamcu. Apparently it's Pentium-like, but only has soft-FP (so
I assume no MMX or SSE as well as no x87).
If that leaves no way to do an 8-byte load/store except (lock) cmpxchg8b, that
may mean there's no need for the alignment, unless a cache-line-split lock is
still a performance issue. If it's guaranteed unicore as well, we can even
omit the lock prefix and cmpxchg8b will still be an atomic RMW (or load or
store) wrt. interrupts. (And being unicore would likely mean much less
system-wide overhead for a split lock.)
[Bug target/82261] x86: missing peephole for SHLD / SHRD
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

--- Comment #4 from Peter Cordes ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers
wide. Hironori Bono proposed the following functions as a workaround for this
missed optimization (https://stackoverflow.com/a/71805063/224132)

#include <stdint.h>

#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low)
                     << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low)
                    >> (count & 63));
}
#endif

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low)
                     << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

---

The uint64_t functions (using __int128) compile cleanly in 64-bit mode
(https://godbolt.org/z/1j94Gcb4o) using 64-bit operand-size shld/shrd, but the
uint32_t functions compile to a total mess in 32-bit mode (GCC11.2 -O3 -m32
-mregparm=3) before eventually using shld, including a totally insane
or dh, 0

GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm it's
a slightly different mess.

Ironically, the uint32_t functions compile to quite a few instructions in
64-bit mode, actually doing the operations as written with shifts and ORs, and
having to manually mask the shift count to &31 because a 64-bit operand-size
shift masks with &63. 32-bit operand-size SHLD would be a win here, at least
for -mtune=intel or a specific Intel uarch.

I haven't looked at whether they still compile ok after inlining into
surrounding code, or whether the operations would tend to combine with other
things in preference to becoming an SHLD.
[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #5 from Peter Cordes ---
> pextrw requires sse4.1 for mem operands.

You're right! I didn't double-check the asm manual for PEXTRW when writing up
the initial report, and had never realized that PINSRW wasn't symmetric with
it. I was really surprised to see that in
https://www.felixcloutier.com/x86/pextrw

So we do need to care about tuning for _mm_storeu_si16(p, v) without SSE4.1
(without the option of PEXTRW to memory). PEXTRW to an integer register is
obviously bad; we should be doing

        movd    %xmm0, %eax
        movw    %ax, (%rdi)

instead of an inefficient pextrw $0, %xmm0, %eax ; movw-store.

Reported as PR105079, since the cause of the load missed-opt was GCC thinking
the instruction wasn't available, rather than a wrong tuning choice like this
is.
[Bug target/105079] New: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079

Bug ID: 105079
Summary: _mm_storeu_si16 inefficiently uses pextrw to an integer reg
(without SSE4.1)
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

With PR105066 fixed, we do _mm_loadu_si16 with pinsrw from memory, because
that's available with just SSE2. (And the cause wasn't tuning choices, it was
a typo in what insns GCC thought were available.) Related: PR105072 re:
folding such 16-bit loads into memory source operands for PMOVZX/SXBQ.

But the famously non-orthogonal SSE2 only includes pextrw $imm, %xmm, reg.
Not reg/mem until SSE4.1 (with a longer opcode for no apparent reason, instead
of just allowing mem addressing modes for the existing one. But it's the same
mnemonic, so the assembler takes care of it.
https://www.felixcloutier.com/x86/pextrw)

So we do need to care about tuning for _mm_storeu_si16(p, v) without the
option of PEXTRW to memory. Currently we do this, which is obviously bad:

        pextrw  $0, %xmm0, %eax    # 2 uops
        movw    %ax, (%rdi)

we should be doing this:

        movd    %xmm0, %eax        # 1 uop
        movw    %ax, (%rdi)

https://godbolt.org/z/Ee3Ez174M

This is especially true if we don't need the integer value zero-extended into
EAX. If we *did* also want the value zero-extended in an integer register, the
extra uop in PEXTRW (in addition to the port-0 uop like MOVD) is a port-5
shuffle to extract an arbitrary 16-bit element, vs. a separate integer
movzwl %cx, %eax which could run on any integer ALU port. (Including port 6 on
HSW/SKL, which doesn't compete with any vector ALUs.) Mov-elimination for
movzwl doesn't work on any current CPUs, only movzbl on Intel, and movl / movq
on both Intel and AMD.
So currently there's no benefit to picking a different register like %ecx
instead of just using movzwl %ax, %eax.

When we both store and use the integer value:

int store16_and_use(void *p, __m128i v){
    _mm_storeu_si16( p, v );
    return 123 + *(unsigned short*)p;
}

https://godbolt.org/z/zq6TMo1oE

current trunk GCC does this, which is not bad:

# -O3 with or without -msse4.1
        pextrw  $0, %xmm0, %eax
        movw    %ax, (%rdi)
        addl    $123, %eax
        ret

Clang13 uses MOVD + MOVZX like I was suggesting, even though it costs more
code size. That's not necessarily better:

        movd    %xmm0, %eax
        movw    %ax, (%rdi)
        movzwl  %ax, %eax
        addl    $123, %eax
        retq

In this case it's not obviously wrong to use PEXTRW to an integer reg, but
it's also fine to do it clang's way. So however that corner case shakes out in
the process of fixing the main bug (using movd / movw without SSE4.1 when we
don't reload) is fine.

If SSE4.1 is available, the no-reload case should probably use PEXTRW to
memory instead of movd + movw. On some CPUs, the ALU op that's part of PEXTRW
has more choice of ALU port than xmm->gp_int operations.
[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

--- Comment #17 from Peter Cordes ---
(In reply to Andrew Pinski from comment #16)
> > According to Intel (
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide), there are no
> > alignment requirements for _mm_load_sd, _mm_store_sd and _mm_loaddup_pd. For
> > example, from _mm_load_sd:
>
> I disagree with saying there is no alignment requirement.
>
> The alignment requirement comes from the type of the argument (double
> const*). [...]
> Pointers themselves have an alignment requirement not just at the time of
> the load/store of them.

The intrinsics are badly designed to take pointer args with types other than
void*, despite how they're expected to work. This is something we just need to
accept. Starting with AVX-512, any new intrinsics take void*, but they haven't
redefined the old ones.

_mm_loadu_si128 takes a __m128i*, same as _mm_load_si128. alignof(__m128i) ==
16, so _mm_loadu_si128 must not simply dereference it; that's what
_mm_load_si128 does. Intel's intrinsics API requires you to do unaligned
16-byte loads by creating a misaligned pointer and passing it to a loadu
intrinsic. (This in turn requires that implementations supporting these
intrinsics define the behaviour of creating such a pointer without deref; in
ISO C that alone would be UB.)

This additional unaligned-pointer behaviour that implementations must define
(at least for __m128i* and float/double*) is something I wrote about in an SO
answer:
https://stackoverflow.com/questions/52112605/is-reinterpret-casting-between-hardware-simd-vector-pointer-and-the-correspond

_mm_loadu_ps (like _mm_load_ps) takes a float*, but its entire purpose is to
not require alignment. _mm512_loadu_ps takes a void* arg, so we can infer that
the earlier FP load intrinsics really are intended to work on data with any
alignment, not just with the alignment of a float.
They're unlike a normal deref of a float* in aliasing rules, too, although
that's separate from creating a misaligned float* in code outside the
intrinsic. A hypothetical low-performance portable emulation of intrinsics
that ended up dereferencing that float* arg directly would be broken for
strict-aliasing as well.

The requirement to define the behaviour of having a misaligned float* can be
blamed on Intel in 1999 (when SSE1 was new). Later extensions like AVX
_mm256_loadu_ps just followed the same pattern of taking float*, until they
finally used void* for intrinsics introduced with or after AVX-512.

The introduction of _mm_loadu_si32 and si16 is another step in the right
direction, recognizing that _mm_cvtsi32_si128( *int_ptr ) isn't
strict-aliasing safe. When those were new, it might have been around the time
Intel started exploring replacing ICC with the LLVM-based ICX.

Anyway, the requirement to support misaligned vector and float/double pointers
implies that _mm_load_ss/sd taking float*/double* doesn't imply alignof(float)
or alignof(double).

> So either the intrinsics definition needs to be changed to be
> correct or GCC is correct.

That's an option; I'd love it if all the load/store intrinsics were changed
across all compilers to take void*. It's ugly and a pain to type
_mm_loadu_si128( (const __m128i*)ptr ), as well as creating cognitive
dissonance because alignof(__m128i) == 16.

I'm not sure if it could break anything to change the intrinsics to take void*
even for the older ones; possibly only C++ overload resolution for insane code
that defines its own _mm_loadu_ps( other_type * ) and relies on float* args
picking the intrinsic.

If we changed just GCC, without getting buy-in from other compilers, taking
void* would let people's code compile on GCC without casts from stuff like
int*, when it wouldn't compile on other compilers.
That could be considered a bad thing if people test their code with GCC and are surprised to get reports of failure from people using compilers that follow Intel's documentation for the intrinsic function arg types. (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html). It would basically be a case of being overly permissive for the feature / API that people are trying to write portable code against.
[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #14 from Peter Cordes ---

This bug is mis-categorized; it's not a sanitizer bug, it's a bug in the implementation of _mm_load_ss / sd. It currently derefs the `float const*` arg directly, which is not strict-aliasing or alignment safe. alignof(float) is 4, but Intel's documentation for this API still says "mem_addr does not need to be aligned on any particular boundary."

_mm_load_ss (float const *__P) { return _mm_set_ss (*__P); }

As discussed on PR99754, _mm_loadu_si32(const void*) *is* strict-aliasing and alignment safe. But it was only added recently, and GCC11's implementation of it is buggy (shuffling the element to the wrong place). Before that, one safe way to do a 32-bit SIMD load is with _mm_load_ss and _mm_castps_si128. Or it was supposed to be safe, but isn't!!

Clang uses a packed may_alias struct containing a float to get a safe load done. Another way would be casting the pointer to

typedef float aliasing_unaligned_f32 __attribute__((aligned(1),may_alias));

This is similar to what we do with __m32_u for use in aliasing-safe integer load/store, except we define that as int with vector_size(4),may_alias,aligned(1) for some reason. Perhaps influenced by __m64_u which is a vector of 2 ints.

MSVC is like gcc -fno-strict-aliasing, so however it handles intrinsics, they're always aliasing-safe. I'm not 100% sure about what ICC formally guarantees, but in practice it doesn't move aliasing short* stores across a _mm_load_ss( (float*)pshort ) load. https://godbolt.org/z/6s76v71xz I didn't test with _mm_store_ss aliasing with short loads, only vice versa.

So GCC is the odd one out, out of the 4 major compilers that support Intel's intrinsics API. All our narrow load/store intrinsics should be strict-aliasing and alignment safe, regardless of what pointer type they accept.
Intel's early design of taking float* and double* instead of void* could be considered a poor choice. Their naming with just load/store instead of _mm_loadu_ss / storeu is also poor design, clearly motivated by the asm differences rather than an actual intrinsic API difference. In x86 asm, loads/stores narrower than 16 bytes never require alignment (unless the AC bit is set in EFLAGS).

Assuming Intel modeled their intrinsics API after their asm, it makes sense to have load and loadu for ps and si128, but only load/store with an implied lack of alignment for intrinsics that wrap instructions like movlps / movhps / movss / movsd, and movd / movq, which do narrower memory accesses.

That of course *doesn't* make sense in C terms, where it's always potentially a problem to dereference misaligned pointers to narrow objects, even when compiling for x86-64: https://stackoverflow.com/questions/47510783/why-does-unaligned-access-to-mmaped-memory-sometimes-segfault-on-amd64 has an example and links some others, showing that compilers *don't* define the behaviour of deref of misaligned pointers.

I'm pretty certain that Intel always intended their narrow load/store intrinsics to not have any alignment requirements, like the asm instructions that wrap them, but they weren't thinking in C terms when naming them. And they were sloppy in their choices of which ones to provide until decades later, since it seems they thought that _mm_cvtsi32_si128(*x) was sufficient for a movd load. (Only the case on a compiler without strict-aliasing or alignment requirements, since the deref happens on the user's plain int*.)

Anyway, hopefully this refutes the argument that _mm_load_sd should be aligned because of the name, and clarifies what Intel might have been thinking when naming these.
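As a concrete sketch of the aligned(1),may_alias typedef approach suggested in comment #14 (the names and wrapper function here are mine, not GCC's actual <xmmintrin.h> code):

```c
#include <immintrin.h>
#include <string.h>
#include <assert.h>

// Hypothetical aliasing/alignment-safe scalar-float type, as suggested above.
typedef float aliasing_unaligned_f32 __attribute__((aligned(1), may_alias));

// Sketch of what a fixed _mm_load_ss could look like: the attributed typedef
// makes the deref legal on misaligned and differently-typed storage.
static inline __m128 load_ss_safe(const void *p)
{
    return _mm_set_ss(*(const aliasing_unaligned_f32 *)p);
}
```

This compiles to a single MOVSS on GCC and clang; the attributes only change what the optimizer is allowed to assume, not the generated load.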
[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #6 from Peter Cordes ---

Looks good to me, thanks for taking care of this quickly; hopefully we can get this backported to the GCC11 series to limit the damage for people using these newish intrinsics. I'd love to recommend them for general use, except for this GCC problem where some distros have already shipped GCC versions that compile them without error but in a 100% broken way.

Portable ways to do narrow alignment/aliasing-safe SIMD loads were sorely lacking; there aren't good effective workarounds for this, especially for 16-bit loads. (I still don't know how to portably / safely write code that will compile to a memory-source PMOVZXBQ across all compilers; Intel's intrinsics API is rather lacking in some areas and relies on compilers folding loads into memory source operands.)

> So, isn't that a bug in the intrinsic guide instead?

Yes, __m128i _mm_loadu_si16 only really makes sense with SSE2 for PINSRW. Even movzx into an integer reg and then MOVD xmm, eax requires SSE2. With only SSE1 you'd have to movzx / dword store to stack / MOVSS reload.

SSE1 makes *some* sense for _mm_loadu_si32 since it can be implemented with a single MOVSS if MOVD isn't available. But we already have SSE1 __m128 _mm_load_ss(const float *) for that. Except GCC's implementation of _mm_load_ss isn't alignment and strict-aliasing safe; it derefs the actual float *__P as _mm_set_ss (*__P). Which I think is a bug, although I'm not clear what semantics Intel intended for that intrinsic.

Clang implements it as alignment/aliasing safe with a packed may_alias struct containing a float. MSVC always behaves like -fno-strict-aliasing, and I *think* ICC does, too. Perhaps best to follow the crowd and make all narrow load/store intrinsics alignment and aliasing safe, unless that causes code-gen regressions; users can _mm_set_ss( *ptr ) themselves if they want to tell the compiler that it's a normal C float object.
I was going to report this, but PR84508 is still open and already covers the relevant ss and sd intrinsics. That points out that Intel specifically documents them as not requiring alignment, without mentioning aliasing.

Speaking of bouncing through a GP-integer reg, GCC unfortunately does that; it seems to incorrectly think PINSRW xmm, mem, 0 requires -msse4.1, unlike with a GP register source. Reported as PR105066 along with related missed optimizations about folding into a memory source operand for pmovzx/sx. But that's unrelated to correctness; this bug can be closed unless we're keeping it open until it's fixed in the current GCC11 stable series.
[Bug target/105066] New: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066 Bug ID: 105066 Summary: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

PR99754 fixed the wrong-code for _mm_loadu_si16, but the resulting asm is not efficient without -msse4.1 (part of -march= for most things). It seems GCC thinks that pinsrw / pextrw with a memory operand requires SSE4.1, like pinsr/extr for b/d/q operand-size. But actually 16-bit insr/extr only needs SSE2. (We're also not efficiently folding it into a memory source operand for PMOVZXBQ, see below.)

https://godbolt.org/z/dYchb6hec shows GCC trunk 12.0.1 20220321

__m128i load16(void *p){ return _mm_loadu_si16( p ); }

load16(void*):   # no options, or -march=core2 or -mssse3
        movzwl  (%rdi), %eax
        pxor    %xmm1, %xmm1
        pinsrw  $0, %eax, %xmm1   # should be MOVD %eax, or PINSRW mem
        movdqa  %xmm1, %xmm0
        ret

vs.

load16(void*):   # -msse4.1
        pxor    %xmm1, %xmm1
        pinsrw  $0, (%rdi), %xmm1
        movdqa  %xmm1, %xmm0
        ret

The second version is actually 100% fine with SSE2: https://www.felixcloutier.com/x86/pinsrw shows that there's only a single opcode for PINSRW xmm, r32/m16, imm8 and it requires SSE2; reg vs. mem source is just a matter of the modr/m byte.

The same problem exists for _mm_storeu_si16 not using pextrw to memory (which is also SSE2), instead bouncing through EAX. (Insanely still PEXTRW instead of MOVD.)

There is a choice of strategy here, but pinsrw/extrw between eax and xmm0 is clearly sub-optimal everywhere. Once we factor out the dumb register allocation that wastes a movdqa, the interesting options are:

        movzwl  (%rdi), %eax   # 1 uop on everything
        movd    %eax, %xmm0    # 1 uop on everything

vs.
        pxor    %xmm0, %xmm0       # 1 uop for the front-end, eliminated on Intel
        pinsrw  $0, (%rdi), %xmm0  # 2 uops (load + shuffle/merge)

Similarly for extract,

        pextrw  $0, %xmm0, (%rdi)  # 2 uops on most

vs.

        movd    %xmm0, %eax        # 1 uop, only 1/clock even on Ice Lake
        movw    %ax, (%rdi)        # 1 uop

On Bulldozer-family, bouncing through an integer reg adds a lot of latency vs. loading straight into the SIMD unit. (2 integer cores share a SIMD/FP unit, so movd between XMM and GP-integer is higher latency than most.) So that would definitely favour pinsrw/pextrw with memory.

On Ice Lake, pextrw to mem is 2/clock throughput: the SIMD shuffle can run on p1/p5. But MOVD r,v is still p0 only, and MOVD v,r is still p5 only. So that also favours pinsrw/pextrw with memory, despite the extra front-end uop for pxor-zeroing the destination on load.

Of course, if _mm_storeu_si16 is used on a temporary that's later reloaded, being able to optimize to a movd (and optionally movzx) is very good. Similar for _mm_loadu_si16 on a value we have in an integer reg, especially if we know it's already zero-extended to 32-bit for just a movd; we'd like to be able to do that.

---

It's also essential that these loads fold efficiently into memory source operands for PMOVZX; pmovzxbq is one of the major use-cases for a 16-bit load. That may be a separate bug, IDK.

https://godbolt.org/z/3a9T55n3q shows _mm_cvtepu8_epi32(_mm_loadu_si32(p)) does fold a 32-bit memory source operand nicely to

        pmovzxbd  (%rdi), %xmm0

which can micro-fuse into a single uop on Intel CPUs (for the 128-bit destination version, not YMM), but disaster with 16-bit loads:

__m128i pmovzxbq(void *p){ return _mm_cvtepu8_epi64(_mm_loadu_si16(p)); }

pmovzxbq(void*):   # -O3 -msse4.1 -mtune=haswell
        pxor      %xmm0, %xmm0       # 1 uop
        pinsrw    $0, (%rdi), %xmm0  # 2 uops, one for shuffle port
        pmovzxbq  %xmm0, %xmm0       # 1 uop for the same shuffle port
        ret

(_mm_cvtepu8_epi64 requires SSE4.1 so there's no interaction with the -mno-sse4.1 implementation of the load.)
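Separate from the PINSRW instruction-selection question above, a common portable workaround (my sketch, not from this report) for an aliasing- and alignment-safe 16-bit SIMD load is memcpy into a temporary, which compilers optimize to a single load:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

// Aliasing/alignment-safe 16-bit SIMD load: memcpy is defined on any object
// representation, so this avoids the UB of *(uint16_t*)p on misaligned or
// differently-typed data.  GCC/clang compile it to movzx + movd.
static inline __m128i load_u16_sketch(const void *p)
{
    uint16_t tmp;
    memcpy(&tmp, p, sizeof tmp);
    return _mm_cvtsi32_si128(tmp);  // zero-extended into the low element
}
```

As the comment notes, this bounces through an integer reg; whether the compiler can instead fold it into a PINSRW or PMOVZX memory operand is exactly the missed optimization being reported.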
[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #3 from Peter Cordes ---

Wait a minute, the current implementation of _mm_loadu_si32 isn't strict-aliasing or alignment safe!!! That defeats the purpose for its existence as something to use instead of _mm_cvtsi32_si128( *(int*)p ); The current code contains a deref of a plain (int*). It should be using something like

typedef int unaligned_aliasing_int __attribute__((aligned(1),may_alias));
[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2 from Peter Cordes --- Can we get this patch applied soon? There aren't any other strict-aliasing-safe movd load intrinsics, but this one won't be portably usable while there are buggy GCC versions around. Until then, code should probably use something like inline __m128i movd(void *p){ return _mm_castps_si128(_mm_load_ss((const float*)p)); } (Which believe it or not is strict-aliasing safe even on integer data. At least it should be; last I tested it was across compilers, except maybe on ICC. Would have to double-check there.)
[Bug target/104773] New: compare with 1 not merged with subtract 1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773 Bug ID: 104773 Summary: compare with 1 not merged with subtract 1 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*, arm-*-*

std::bit_ceil(x) involves if(x == 0 || x == 1) return 1; and 1u << (32-clz(x-1)). The compare of course compiles to an unsigned <= 1, which can be done with a sub instead of cmp, producing the value we need as an input for the leading-zero count. But GCC does *not* do this. (Neither does clang for x86-64.)

I trimmed down the libstdc++ code into something I could compile even when Godbolt doesn't have working headers for some ISAs: https://godbolt.org/z/3EE7W5bna

// cut down from libstdc++ for normal integer cases; compiles the same
template
constexpr _Tp bit_ceil(_Tp __x) noexcept
{
    constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
    if (__x == 0 || __x == 1)
        return 1;
    auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
    // using __promoted_type = decltype(__x << 1); ... // removed check for x<
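In plain C the operation boils down to this (my sketch of the same logic, for 32-bit unsigned): note that x - 1 is both the input to the leading-zero count and the value an unsigned x <= 1 compare could be derived from, which is why a single SUB could serve both purposes.

```c
#include <assert.h>

// bit_ceil for unsigned: smallest power of 2 >= x (assumes x <= 1u << 31).
// The x <= 1 test is the unsigned form of "x == 0 || x == 1"; the same
// x - 1 then feeds __builtin_clz, so SUB could replace CMP and produce it.
static unsigned bit_ceil_u32(unsigned x)
{
    if (x <= 1)
        return 1;
    return 1u << (32 - __builtin_clz(x - 1));
}
```

(__builtin_clz is the GCC/clang builtin used by the libstdc++ code above; its argument must be non-zero, which the x <= 1 early-out guarantees.)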
[Bug libstdc++/97759] Could std::has_single_bit be faster?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #14 from Peter Cordes ---

Agreed with the idea of doing the popcount(x) == 1 peephole replacement in the compiler, not forcing the header to figure out whether we have efficient popcount or not.

If we have BMI1, there's BLSR (Bit Lowest-Set Reset), which sets CF=1 if the input was zero, and ZF=1 if the output is zero. Unfortunately none of the standard jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first. (If Intel had designed it to produce ZF and CF inverted, it would be non-intuitive for all other uses but would have allowed blsr / jcc to implement if(has_single_bit(x)).)

CF=1, ZF=0   impossible: input was zero, output was non-zero
CF=1, ZF=1   input was zero
CF=0, ZF=0   input had multiple bits set
CF=0, ZF=1   input had a single bit set.

If we're going to branch on it anyway after inlining, a branchy strategy is probably good:

singlebit_bmi2_branchy:
        xor   %eax, %eax
        blsr  %edi, %edi    # ZF=1 means we cleared the last bit, or the input was zero
        jc    .Linput_zero  # input was zero, return 0 regardless of ZF
        setz  %al
.Linput_zero:
        ret

And when we want a boolean in a register, a combination of setz and cmovc can materialize one. With clever choice of registers, we can even avoid giving setcc a false dependency on a register that isn't already part of its dep chain:

singlebit_bmi2_cmov:
        blsr   %edi, %eax
        setz   %al          # false dep, but it's ready if FLAGS are ready because we wrote it with BLSR
        cmovc  %edi, %eax   # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input 0)
        ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit, instead of here where only the low 8 bits are actually 0 / 1. (Which is fine for returning a bool in all the mainstream calling conventions.)

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can conditionally update flags.)
An evil variation on this uses setnz / dec to invert ZF without affecting CF, allowing JA:

        blsr   %edi, %eax
        setnz  %al            # AL = !ZF
        dec    %al            # 1-1 -> ZF=1, 0-1 -> ZF=0.  ZF=!ZF without affecting CF
        # seta %al            # set on CF=0 and ZF=0
        ja     was_single_bit # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it doesn't cost any extra partial-FLAGS merging uops, because JA simply has both parts of FLAGS as separate inputs. (This is why CMOVA / CMOVBE are still 2 uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts of FLAGS, while others need either CF or SPAZO, not both. Interestingly, Zen1/2 have that effect but not Zen3.) I don't know how AMD handles dec / ja partial-flags shenanigans. Intel Haswell would I think have a flags-merging uop; older Intel doesn't support BMI1 so P6-family is irrelevant. https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing a boolean vs. creating a FLAGS condition to branch on, being branchless itself), so any single benchmark would make one of them look good. If your data almost never (or always) has an all-zero input, the JC in the first version will predict well.

After inlining, if the caller branches on the bool result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful. Unlike branching separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common while single-bit inputs are rare. But blsr/setz/cmovc + test/jnz is only 4 uops, same as this on Skylake (test can macro-fuse with jnz). The uops are all dependent on each other, so it also has the same latency (to detect a branch miss) as popcnt / macro-fused cmp/je, which is 2 uops. The only thing this has going for it is avoiding a port-1-only uop, I think.
It's also possible to blsr / lahf / and ah, (1<<6) | (1<<0) / cmp ah, 1<<6 to directly check that ZF=1 and CF=0. I doubt that's useful.

Or hmm, can we branch directly on PF after AND with that 2-bit mask? CF=1 ZF=0 is impossible, so the only other odd-parity case is CF=0 ZF=1. AMD and Intel can macro-fuse test/jp.

        blsr  %edi, %eax
        lahf
        test  $(1<<6) | (1<<0), %ah  # check ZF and CF.
        jpo   was_single_bit         # ZF != CF means CF=0, ZF=1 because the other way is impossible.

Also possible of course is the straightforward 2x setcc and AND to materialize a boolean in the bottom byte of EAX. Good ILP, only 3 cycle latency.
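At the C level, the two formulations being compared are these (my sketch): `x & (x - 1)` clears the lowest set bit, which is exactly the value BLSR computes, along with the CF/ZF results discussed above.

```c
#include <stdbool.h>
#include <assert.h>

// popcount form: one POPCNT + compare when hardware popcount exists.
static bool single_bit_popcnt(unsigned x)
{
    return __builtin_popcount(x) == 1;
}

// BLSR-style form: x & (x-1) clears the lowest set bit; a value with
// exactly one bit set is the only non-zero input that becomes zero.
static bool single_bit_blsr(unsigned x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```

With -mbmi the second form is what a compiler could lower to blsr plus the flag gymnastics described above; the peephole being requested is recognizing the first form and picking whichever is cheapest for the target.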
[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 --- Comment #11 from Peter Cordes ---

Also, horizontal byte sums are generally best done with VPSADBW against a zero vector, even if that means some fiddling to flip to unsigned first and then undo the bias.

simde_vaddlv_s8:
        vpxor    xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80)  flip to unsigned 0..255 range
        vpxor    xmm1, xmm1, xmm1
        vpsadbw  xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
        vmovd    eax, xmm0              # we only wanted the low half anyway
        sub      eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
        ret

This is so much shorter we'd still be ahead if we generated the vector constant on the fly instead of loading it. (3 instructions: vpcmpeqd same,same / vpabsb / vpslld by 7. Or pcmpeqd / psllw 8 / packsswb same,same to saturate to -128.)

If we had wanted a 128-bit (16 byte) vector sum, we'd need

        ...
        vpsadbw  ...
        vpshufd  xmm1, xmm0, 0xfe  # shuffle upper 64 bits to the bottom
        vpaddd   xmm0, xmm0, xmm1
        vmovd    eax, xmm0
        sub      eax, 16 * 128

Works efficiently with only SSE2. Actually with AVX2, we should unpack the top half with VUNPCKHQDQ to save a byte (no immediate operand), since we don't need PSHUFD copy-and-shuffle. Or movd / pextrw / scalar add, but that's more uops: pextrw is 2 on its own.
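The intrinsics version of that bias trick looks like this (my sketch, not SIMDe's code): XOR with set1(0x80) adds 128 to each byte mod 256, PSADBW then sums the now-unsigned bytes within each 64-bit half, and subtracting 8 * 128 undoes the bias.

```c
#include <immintrin.h>
#include <stdint.h>
#include <assert.h>

// Horizontal sum of the 8 signed bytes in the low half of v, via PSADBW.
static int hsum_low8_s8(__m128i v)
{
    __m128i biased = _mm_xor_si128(v, _mm_set1_epi8((char)0x80)); // b + 128, as unsigned
    __m128i sums = _mm_sad_epu8(biased, _mm_setzero_si128());     // byte sums per 64-bit half
    return _mm_cvtsi128_si32(sums) - 8 * 128;                     // undo the +128-per-byte bias
}
```

Only SSE2 is required, matching the comment's point that this beats a shift/add reduction tree of word additions.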
[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #10 from Peter Cordes ---

Current trunk with -fopenmp is still not good https://godbolt.org/z/b3jjhcvTa

Still doing two separate sign extensions and two stores / wider reload (store forwarding stall): -O3 -march=skylake -fopenmp

simde_vaddlv_s8:
        push      rbp
        vpmovsxbw xmm2, xmm0
        vpsrlq    xmm0, xmm0, 32
        mov       rbp, rsp
        vpmovsxbw xmm3, xmm0
        and       rsp, -32
        vmovq     QWORD PTR [rsp-16], xmm2
        vmovq     QWORD PTR [rsp-8], xmm3
        vmovdqa   xmm4, XMMWORD PTR [rsp-16]
        ... then asm using byte-shifts

Including stuff like

        movdqa  xmm1, xmm0
        psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored. And ARM64 goes scalar.

Current trunk *without* -fopenmp produces decent asm https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3):

simde_vaddlv_s8:
        sxtl  v0.8h, v0.8b
        addv  h0, v0.8h
        umov  w0, v0.h[0]
        ret

x86-64 gcc -O3 -march=skylake:

simde_vaddlv_s8:
        vpmovsxbw xmm1, xmm0
        vpsrlq    xmm0, xmm0, 32
        vpmovsxbw xmm0, xmm0
        vpaddw    xmm0, xmm1, xmm0
        vpsrlq    xmm1, xmm0, 32
        vpaddw    xmm0, xmm0, xmm1
        vpsrlq    xmm1, xmm0, 16
        vpaddw    xmm0, xmm0, xmm1
        vpextrw   eax, xmm0, 0
        ret

That's pretty good, but VMOVD eax, xmm0 would be more efficient than VPEXTRW when we don't need to avoid high garbage (because it's a return value in this case). VPEXTRW zero-extends into RAX, so it's not directly helpful if we need to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx. Or with BMI2, go scalar before the last shift / VPADDW step, e.g.

        ...
        vmovd  eax, xmm0
        rorx   edx, eax, 16
        add    eax, edx
[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570 --- Comment #3 from Peter Cordes --- (In reply to Andrew Pinski from comment #2)

> Even on aarch64:
>
> .L2:
>         ldr    q0, [x1], 16
>         sxtl   v1.2d, v0.2s
>         sxtl2  v0.2d, v0.4s
>         scvtf  v1.2d, v1.2d
>         scvtf  v0.2d, v0.2d
>         stp    q1, q0, [x0]
>
> But the above is decent really.

More than decent, that's what we *should* be doing, I think. AArch64 has versions of most instructions that read the top of a vector, unlike x86-64 where VPMOVZX / SX can only read from the bottom half. That's the key difference, and what makes this strategy good on ARM, bad on x86-64.

(On 32-bit ARM, you load a q register, then read the two halves separately as 64-bit d<0..31> registers. AArch64 changed that so there are 32x 128-bit vector regs, and no partial regs aliasing the high half. But they provide OP, OP2 versions of some instructions that widen or things like that, with the "2" version accessing a high half. Presumably part of the motivation is to make it easier to port ARM NEON code that depended on accessing halves of a 128-bit q vector using its d regs. But it's a generally reasonable design and could also be motivated by seeing how inconvenient things get in SSE and AVX for pmovsx/zx.)

Anyway, AArch64 SIMD is specifically designed to make it fully efficient to do wide loads and then unpack both halves, like is possible on ARM, but not x86-64.

It's also using a store (of a pair of regs) that's twice the width of the load. But even if it was using a max-width load of a pair of 128-bit vectors (and having to store two pairs) that would be good, just effectively unrolling. But GCC sees it as one load and two separate stores, that it just happens to be able to combine as a pair.
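For reference, the kind of loop being auto-vectorized here has this shape (a minimal sketch, not the exact testcase from the bug):

```c
#include <stddef.h>
#include <assert.h>

// int -> double conversion loop.  The code-gen question is whether the
// vectorizer uses half-width memory operands for the widening step, or a
// full-width load plus extract shuffles: fine on AArch64 thanks to
// sxtl/sxtl2 reading either half, bad on x86-64 where pmovsx/zx can only
// read the bottom half of a register.
static void int_to_double(double *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```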
[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #9 from Peter Cordes ---

Thanks for implementing my idea :)

(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}

TL:DR: I think we should still use vextracti* / vextractf* when that can get the job done in a single instruction, especially when the VEX-encoded vextracti/f128 can save a byte of code size for v[4]. Extracts are simpler shuffles that might have better throughput on some future CPUs, especially the upcoming Zen4, so even without code-size savings we should use them when possible. Tiger Lake has a 256-bit shuffle unit on port 1 that supports some common shuffles (like vpshufb); a future Intel might add 256->128-bit extracts to that. It might also save a tiny bit of power, allowing on-average higher turbo clocks.

---

On current CPUs with AVX-512, valignd is about equal to a single vextract, and better than multiple instructions. It doesn't really have downsides on current Intel, since I think Intel has continued to not have int/FP bypass delays for shuffles.

We don't know yet what AMD's Zen4 implementation of AVX-512 will look like. If it's like Zen1's handling of AVX2 (i.e. if it decodes 512-bit instructions other than insert/extract into at least 2x 256-bit uops), a lane-crossing shuffle like valignd probably costs more than 2 uops. (vpermq is more than 2 uops on Piledriver/Zen1.) But a 128-bit extract will probably cost just one uop. (And especially an extract of the high 256 might be very cheap and low latency, like vextracti128 on Zen1, so we might prefer vextracti64x4 for v[8].)

So this change is good, but using a vextracti64x2 or vextracti64x4 could be a useful peephole optimization when byte_offset % 16 == 0. Or of course vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible with an EVEX-encoded instruction).
vextractf-whatever allows an FP shuffle on FP data, in case some future CPU cares about that for shuffles. An extract is a simpler shuffle that might have better throughput on some future CPU even with full-width execution units. Some future Intel CPU might add support for vextract uops to the extra shuffle unit on port 1. (Which is available when no 512-bit uops are in flight.) Currently (Ice Lake / Tiger Lake) it can only run some common shuffles like vpshufb ymm, not including any vextract or valign. Of course port 1 vector ALUs are shut down when 512-bit uops are in flight, but this could be relevant for __m256 vectors on these hypothetical future CPUs.

When we can get the job done with a single vextract-something, we should use that instead of valignd. Otherwise use valignd. We already check the index for low-128 special cases to use vunpckhqdq vs. vpshufd (or vpsrldq) or similar FP shuffles.

- On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be zero), an extract that only writes a 128-bit register will keep them clean (even if it reads a ZMM), not needing a VZEROUPPER. VZEROUPPER is only needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function like

float foo(float *p) {
    some vector stuff that can use high zmm regs;
    return scalar that happens to be from the middle of a vector;
}

could vextract into XMM0, but would need vzeroupper if it used valignd into ZMM0.

(Also related: https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc re: reading a ZMM at all and turbo clock.)

---

Having known zeros outside the low 128 bits (from writing an xmm instead of rotating a zmm) is unlikely to matter, although for FP stuff copying fewer elements that might be subnormal could happen to be an advantage, maybe saving an FP assist for denormal. We're unlikely to be able to take advantage of it to save instructions/uops (like OR instead of blend).
But it's not worse to use a single extract instruction instead of a single valignd.
[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 --- Comment #37 from Peter Cordes --- Correction, PR82666 is that the cmov on the critical path happens even at -O2 (with GCC7 and later). Not just with -O3 -fno-tree-vectorize. Anyway, that's related, but probably separate from choosing to do if-conversion or not after inlining.
[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #36 from Peter Cordes --- Related: a similar case of cmov being a worse choice, for a threshold condition with an array input that happens to already be sorted: https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 GCC with -fprofile-generate / -fprofile-use does correctly decide to use branches. GCC7 and later (including current trunk) with -O3 -fno-tree-vectorize de-optimizes by putting the CMOV on the critical path, instead of as part of creating a zero/non-zero input for the ADD. PR82666. If you do allow full -O3, then vectorization is effective, though.
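The loop from that SO question has roughly this shape (my sketch, not the exact benchmark): on sorted input the branch predicts almost perfectly, so the branchy version beats if-conversion, which puts a cmov on the loop-carried dependency chain through sum.

```c
#include <stddef.h>
#include <assert.h>

// Sum the elements >= 128.  With sorted data the branch mispredicts at most
// once, so converting it to cmov (lengthening sum's dep chain every
// iteration) is the de-optimization discussed above.
static long sum_ge_threshold(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= 128)
            sum += data[i];
    return sum;
}
```

As the comment notes, profile feedback (or full -O3 vectorization, which sidesteps the branch/cmov question entirely) gets this right.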
[Bug target/15533] Missed move to partial register
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #5 from Peter Cordes ---

The new asm is less bad, but still not good. PR53133 is closed, but this code-gen is a new instance of partial-register writing with xor al,al. Also related: PR82940 re: identifying bitfield insert patterns in the middle-end; hopefully Andrew Pinski's planned set of patches to improve that can help back-ends do a better job?

If we're going to read a 32-bit reg after writing an 8-bit reg (causing a partial-register stall on Nehalem and earlier), we should be doing

        mov  a, %al   # merge into the low byte of RAX
        ret

Haswell and newer Intel don't rename the low byte partial register separately from the full register, so they behave like AMD and other non-P6 / non-Sandybridge CPUs: a dependency on the full register. That's good for this code; in this case the merging is necessary and we don't want the CPU to guess that it won't be needed later. The load+ALU-merge uops can micro-fuse into a single uop for the front end.

xor %al,%al still has a false dependency on the old value of RAX because it's not a zeroing idiom; IIRC in my testing it's at least as good to do mov $0, %al. Both instructions are 2 bytes long.

* https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers survey of the ways partial regs are handled on Intel P6 family vs. Intel Sandybridge vs. Haswell and later vs. non-Intel and Intel Silvermont etc.
* https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to - details of my testing on Haswell / Skylake.

*If* we still care about -mtune=nehalem and other increasingly less relevant CPUs, we should be avoiding a partial register stall for those tuning options with something like

        movzbl  a, %edx
        and     $-256, %eax
        or      %edx, %eax

i.e. what we're already doing, but spending a 5-byte AND-immediate instead of a 2-byte xor %al,%al or mov $0, %al. (That's what clang always does, so it's missing the code-size optimization. https://godbolt.org/z/jsE57EKcb shows a similar case of return (a&0xFF00u) | (b&0xFFu); with two register args.)

The penalty on Pentium-M through Nehalem is to stall for 2-3 cycles while a merging uop is inserted. The penalty on earlier P6 (PPro / Pentium III) is to stall for 5-6 cycles until the partial-register write retires. The penalty on Sandybridge (and maybe Ivy Bridge if it renames AL) is no stall, just inserting a merging uop. On later Intel, and AMD, and Silvermont-family Intel, writing AL has a dependency on the old RAX; it's a merge on the spot.

BTW, modern Intel does still rename AH separately, and merging does require the front-end to issue a merging uop in a cycle by itself. So writing AH instead of AL would be different.
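The C shape of this merge pattern is roughly the following (my sketch of the kind of testcase involved, not the exact PR15533 source; the global name is hypothetical):

```c
#include <assert.h>

unsigned char a_byte;  // stand-in for the global `a` merged by `mov a, %al`

// Replace the low byte of x with a_byte.  The discussion above is about
// whether GCC should emit a single byte-merge (`mov a_byte, %al`) or keep
// masking with and $-256 / or (or the partial-reg-stall-prone xor %al,%al).
static unsigned merge_low_byte(unsigned x)
{
    return (x & ~0xFFu) | a_byte;
}
```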
[Bug middle-end/82940] Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #6 from Peter Cordes ---

For a simpler test case, GCC 4.8.5 did redundantly mask before using bitfield-insert, but GCC 9.2.1 doesn't.

unsigned merge2(unsigned a, unsigned b){
    return (a&0xFF00u) | (b&0xFFu);
}

https://godbolt.org/z/froExaPxe

# PowerPC (32-bit) GCC 4.8.5
        rlwinm 4,4,0,0xff    # b &= 0xFF is totally redundant
        rlwimi 3,4,0,24,31
        blr

# power64 GCC 9.2.1 (ATI13.0)
        rlwimi 3,4,0,255     # bit-blend according to mask, rotate count=0
        rldicl 3,3,0,32      # Is this zero-extension to 64-bit redundant?
        blr

But ppc64 GCC does zero-extension of the result from 32 to 64-bit, which is probably not needed unless the calling convention has different requirements for return values than for incoming args. (I don't know PPC well enough.) So for at least some cases, modern GCC does ok.

Also, when the blend isn't split at a byte boundary, even GCC 4.8.5 manages to avoid redundant masking before the bitfield-insert.

unsigned merge2(unsigned a, unsigned b){
    return (a & 0xFF80u) | (b & 0x7Fu);
}

        rlwimi 3,4,0,25,31   # GCC 4.8.5, 32-bit so no zero-extension
        blr
[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

--- Comment #2 from Peter Cordes ---
Possibly also related: With different surrounding code, this loop can compile to asm which has two useless movz / mov register copies in the loop at -O2 (https://godbolt.org/z/PTcqzM6q7). (To set up for entry into the next loop in over-complicated ways; doing this in the loop is unnecessary.)

    while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
        str++;
    }

.L19:
        movzbl  1(%rdi), %edx
        addq    $1, %rdi
        movzbl  %dl, %ecx
        movl    %edx, %eax
        cmpb    $0, -120(%rsp,%rcx)
        je      .L19

from source

void remove_chars(char *restrict str, const char *restrict remove)
{
    char lut[256] = {0};
    do {
        lut[(unsigned char)*remove] = -1;
    } while(*remove++);

    /*** Over complicated asm in this loop */
    while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
        str++;
    }
    // str points at first char to *not* keep (or the terminating 0)

    const char *in = str;
    char *out = str;
    while (*in) {
        char mask = lut[(unsigned char)*in];
        unsigned char cin = *in, cout = *out;
        *out = mask ? cout : cin;
        out += mask + 1;
        in++;
    }
    *out = *in;
}
[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

Bug ID: 100922
Summary: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

Created attachment 50948 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit redundant_zero_extend.c

It's rarely a good idea to load the same thing twice; generally better to copy a register. Or to read the same register twice when a copy isn't needed. So the following asm should never happen, but it does with current trunk, and similar with GCC as old as 4.5:

        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx    # no branch target between these instructions

or

        ldrb    w4, [x2]
        ldrb    w3, [x2], 1     # post-indexed *x2++

(Happens at -O3. With -O2 we have a redundant register copy, so either way still a wasted instruction. And there are other differences earlier in the function with -O2 vs. -O3.)

https://godbolt.org/z/jT7WaWeK8 - minimal test case. x86-64 and AArch64 trunk show basically identical code structure. x86-64 gcc (Compiler-Explorer-Build) 12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524

void remove_chars_inplace(char *str, const unsigned char keep_lut[256])
{
    while(keep_lut[(unsigned char)*str]){  // can be an if() and still repro
        str++;                             // keep_lut[0] is false
    }

    char *out = str;
    unsigned char c;   /* must be unsigned char for correctness. */
    do {
        c = *str++;
        unsigned char inc = keep_lut[c];   // unsigned long doesn't help
        *out = c;
        out += inc;    // inc=0 or 1 to let next char overwrite or not
    } while(c);
}

x86-64 asm:

remove_chars_inplace:
        jmp     .L8
.L3:                            # top of search loop for first char to remove
        addq    $1, %rdi
.L8:                            # loop entry point
        movzbl  (%rdi), %eax
        cmpb    $0, (%rsi,%rax) # un-laminates and doesn't macro-fuse ...
        jne     .L3

        cmpb    $0, (%rdi)      # 2nd loop body can be skipped if *str == 0
                                # should be test %al,%al - this char was already loaded.
        leaq    1(%rdi), %rax   # even -march=znver2 fails to move this earlier or later to allow cmp/je fusion. (Intel won't macro-fuse cmp imm,mem / jcc)
        je      .L1
.L5:                            # TOP OF 2ND LOOP
        movzbl  (%rax), %edx
        movzbl  (%rax), %ecx    # redundant load of *str
        addq    $1, %rax
        movzbl  (%rsi,%rdx), %edx   # inc = lut[c]
        movb    %cl, (%rdi)
        addq    %rdx, %rdi      # out += inc
        testb   %cl, %cl
        jne     .L5             # }while(c != 0)
.L1:
        ret

IDK if it's interesting or not that the cmpb $0, (%rdi) is also a redundant load. The first loop left *str, i.e. (%rdi), in EAX. Putting the LEA between cmp and je (even with -march=znver2) is a separate missed optimization (unless that's working around Intel's JCC erratum). With only -O2 instead of -O3, we get better asm in that part: it takes advantage of having the char in AL, and jumps into the middle of the next loop after xor-zeroing the `inc` variable.

Replacing c = *str++; with c = *str; str++; results in a wasted register copy with trunk, instead of a 2nd load (on x86-64 and arm64). Still a missed opt, but less bad. GCC7 and earlier still do an extra load with either way of writing that.

Removing the first loop, or making its loop condition something like *str && keep_lut[*str], removes the problem entirely. The CSE possibility is gone. (Same even if we use lut[*(unsigned char*)str] - type-pun the pointer to unsigned char instead of casting the signed char value to unsigned char, on x86 where char is signed, but not on arm64 where char is unsigned.)

---

I didn't find any clear duplicates; the following are barely worth mentioning:

* pr94442 looks like extra spilling, not just redundant loading.
* pr97366 is due to vectors of different types, probably.
* pr64319 needs runtime aliasing detection to avoid, unlike this.
The AArch64 version of this does seem to demo pr71942 (a useless and x4, x2, 255 on an LDRB result) when you get it to copy a register instead of doing a 2nd load.
[Bug rtl-optimization/88770] Redundant load opt. or CSE pessimizes code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770

Peter Cordes changed:
           What    |Removed |Added
           CC      |        |peter at cordes dot ca

--- Comment #2 from Peter Cordes ---
Note that mov r64, imm64 is a 10-byte instruction, and can be slow to read from the uop cache on Sandybridge-family. The crap involving OR is clearly sub-optimal, but *if* you already have two spare call-preserved registers across this call, the following is actually smaller code-size:

        movabs  rdi, 21474836483
        mov     rbp, rdi
        movabs  rsi, 39743127552
        mov     rbx, rsi
        call    test
        mov     rdi, rbp
        mov     rsi, rbx
        call    test

This is more total uops for the back-end though (movabs is still single-uop, but takes 2 entries in the uop cache on Sandybridge-family; https://agner.org/optimize/). So saving x86 machine-code size this way does limit the ability of out-of-order exec to see farther, if the front-end isn't the bottleneck. And it's highly unlikely to be worth saving/restoring two regs to enable this. (Or to push rdi / push rsi before call, then pop after!)

Setting up the wrong value and then fixing it twice with OR is obviously terrible and never has any advantage, but the general idea to CSE large constants isn't totally crazy. (But it's profitable only in such limited cases that it might not be worth looking for, especially if it's only helpful at -Os.)
[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636 Peter Cordes changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #4 from Peter Cordes --- This seems to be fixed for ZMM vectors in GCC8. https://gcc.godbolt.org/z/7351be1v4 Seems to have never been a problem for __m256, at least not for __m256 zero256(){ return _mm256_setzero_ps(); } IDK what I was looking at when I originally reported; maybe just clang which *did* used to prefer YMM-zeroing. Some later comments suggested movdqa vs. pxor zeroing choices (and mov vs. xor for integer), but the bug title is just AVX / AVX-512 xor-zeroing, and that seems to be fixed. So I think this should be closed.
[Bug tree-optimization/42587] bswap not recognized for memory
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587

Peter Cordes changed:
           What    |Removed |Added
           CC      |        |peter at cordes dot ca

--- Comment #12 from Peter Cordes ---
(In reply to Andi Kleen from comment #11)
> Only when the first test case is fixed too

https://godbolt.org/z/7M8cx3vT1

GCC8.1 -O3 for x86-64:

        push    rbx
        mov     ebx, edi
        call    acpi_ut_track_stack_ptr
        mov     eax, ebx
        pop     rbx
        bswap   eax
        ret

The code in the initial report optimizes to bswap with GCC8.1 and later. Is that the test case you meant? GCC8.1 was released on May 2, 2018, well before your Nov comment, so maybe you meant something else.
[Bug middle-end/98801] Request for a conditional move built-in function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801

Peter Cordes changed:
           What    |Removed |Added
           CC      |        |peter at cordes dot ca

--- Comment #5 from Peter Cordes ---
(In reply to Richard Biener from comment #4)
> Slight complication arises because people will want to have cmoves with a
> memory destination.

Do we even want to provide this? Most ISAs can't branchlessly do a conditional store, except via an RMW (which wouldn't be thread-safe for the no-store case if not atomic) or something really clunky (like x86 rep stos with count=0 or 1). ARM predicated instructions allow a branchless load or store that doesn't disturb the memory operand (and won't even fault on a bad address).

I guess another option to emulate it could be to make a dummy local and cmov to select a store address = dummy : real. But that's something users can build in the source using a non-memory conditional-select builtin that exposes the much more widely available ALU conditional-select functionality like x86 CMOV, AArch64 CSEL, MIPS MOVN, etc.

> That won't solve the eventual request to have cmov _from_ memory ... (if we
> leave all of the memory combining to RTL people will again complain that
> it's subject to compilers discretion).

It might be sufficient for most use-cases like defending against timing side-channels to not really try to allow conditional loads (from maybe-invalid pointers).

I'm not sure if the motivation for this includes trying to make code without data-dependent branching, to defend against timing side-channels. But if we do provide something like this, people are going to want to use it that way. That's one case where best-effort behaviour at the mercy of the optimizer for a ternary (or having to manually check the asm) is not great. Stack Overflow has gotten a few Q&As from people looking for guaranteed CMOV for reasons like that. So I think we should be wary of exposing functionality that most ISAs don't have.
OTOH, failing to provide a way to take advantage of functionality that some ISAs *do* have is not great, e.g. ISO C failing to provide popcnt and bit-scan (clz / ctz) has been a problem for C for a long time. But for something like __builtin_clz, emulating on machines that don't have hardware support still works. If we're trying to support a guarantee of no data-dependent branching, that limits the emulation possibilities or makes them clunkier. Especially if we want to support ARM's ability to not fault / not access memory if the condition is false. The ALU-select part can be emulated with AND/OR, so that's something we can provide on any target. Folding memory operands into a predicated load on ARM could actually introduce data-dependent cache access, vs. an unconditional load and a predicated reg-reg MOV. So this becomes somewhat thorny, and some design work to figure out what documented guarantees to provide will be necessary. Performance use-cases would certainly rather just have a conditional load in one instruction.
[Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291

Bug ID: 98291
Summary: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess, instead of effectively mapping each scalar to an element of one vector accumulator. (Unless we use -ffast-math; then that happens. clang gets it right even without -ffast-math.)

double dotprod(const double *a, const double *b, unsigned long long n)
{
    double d1 = 0.0;
    double d2 = 0.0;

    for (unsigned long long i = 0; i < n; i += 2) {
        d1 += a[i] * b[i];
        d2 += a[i + 1] * b[i + 1];
    }

    return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math, the nice sane loop we expect:

.L3:
        movupd  (%rsi,%rax), %xmm0
        movupd  (%rdi,%rax), %xmm3
        addq    $1, %rdx
        addq    $16, %rax
        mulpd   %xmm3, %xmm0
        addpd   %xmm0, %xmm1
        cmpq    %rcx, %rdx
        jb      .L3

Without:

        ... main loop
.L4:
        movupd  (%rcx,%rax), %xmm1      # 16-byte load
        movupd  (%rsi,%rax), %xmm3
        movhpd  16(%rcx,%rax), %xmm1    # overwrite the high half of it!!
        movhpd  16(%rsi,%rax), %xmm3
        mulpd   %xmm3, %xmm1
        movupd  16(%rsi,%rax), %xmm3
        movlpd  8(%rsi,%rax), %xmm3
        addsd   %xmm1, %xmm2
        unpckhpd %xmm1, %xmm1
        addsd   %xmm1, %xmm2
        movupd  16(%rcx,%rax), %xmm1
        movlpd  8(%rcx,%rax), %xmm1
        addq    $32, %rax
        mulpd   %xmm3, %xmm1
        addsd   %xmm1, %xmm0
        unpckhpd %xmm1, %xmm1
        addsd   %xmm1, %xmm0
        cmpq    %rdx, %rax
        jne     .L4

The overall strategy is insane, but even some of the details are insane. e.g. a 16-byte load into XMM1, and then overwriting the high half of that with a different double before reading it.
That's bad enough, but you'd expect movsd / movhpd to manually gather 2 doubles, without introducing the possibility of a cache-line split load for zero benefit. Similarly, movupd / movlpd should have just loaded in the other order. (Or since they're contiguous, movupd 8(%rsi,%rax), %xmm3 / shufpd.) So beyond the bad overall strategy (which is likely worse than unrolled scalar), it might be worth checking for some of this kind of smaller-scale insanity somewhere later to make it less bad if some other inputs can trigger similar behaviour. (This small-scale detecting of movupd / movhpd and using movsd / movhpd could be a separate bug, but if it's just a symptom of something that should never happen in the first place then it's not really its own bug at all.)
[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366 --- Comment #1 from Peter Cordes --- Forgot to include https://godbolt.org/z/q44r13
[Bug target/97366] New: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Bug ID: 97366
Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

When you use the same _mm_load_si128 or _mm256_load_si256 result twice, sometimes GCC loads it *and* uses it as a memory source operand. I'm not certain this is specific to x86 back-ends; please check bug tags if it happens elsewhere. (But it probably doesn't on 3-operand load/store RISC machines; it looks like one operation chooses to load and then operate, the other chooses to use the original source as a memory operand.)

#include <immintrin.h>

void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i=0 ; i<1024 ; i+=16){
        __m128i in = _mm_load_si128((__m128i*)&input[i]);
        __m128i high = _mm_srli_epi32(in, 4);
        _mm_store_si128((__m128i*)&out[i], _mm_or_si128(in,high));
    }
}

gcc 8 and later, -O3 -mavx2, including 11.0.0 20200920:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L6:
        vmovdqa (%rsi,%rax), %xmm1      # load
        vpsrld  $4, %xmm1, %xmm0
        vpor    (%rsi,%rax), %xmm0, %xmm0   # reload as a memory operand
        vmovdqa %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L6
        ret

GCC7.5 and earlier use vpor %xmm1, %xmm0, %xmm0 to use the copy of the original that was already loaded. `-march=haswell` happens to fix this for GCC trunk, for this 128-bit version but not for a __m256i version. restrict doesn't make a difference, and there's no overlapping anyway. The two redundant loads both happen between any other stores.

Using a memory source operand for vpsrld wasn't an option: the form with a memory source takes the *count* from memory, not the data.
https://www.felixcloutier.com/x86/psllw:pslld:psllq

Note that *without* AVX, the redundant load is a possible win, for code running on Haswell and later Intel (and AMD) CPUs. Possibly some heuristic is saving instructions for the legacy-SSE case (in a way that's probably worse overall) and hurting the AVX case.

GCC 7.5, -O3 without any -m options:

gcc_double_load_128(signed char*, signed char const*):
        xorl    %eax, %eax
.L2:
        movdqa  (%rsi,%rax), %xmm0
        movdqa  %xmm0, %xmm1    # this instruction avoided
        psrld   $4, %xmm1
        por     %xmm1, %xmm0    # with a memory source reload, in GCC8 and later
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L2
        rep ret

Using a memory-source POR saves 1 front-end uop by avoiding a register copy, as long as the indexed addressing mode can stay micro-fused on Intel. (Requires Haswell or later for that to happen, or any AMD.) But in practice it's probably worse. Load-port pressure, and space in the out-of-order scheduler, as well as code-size, is a problem for using an extra memory-source operand in the SSE version, with the upside being saving 1 uop for the front-end (and thus in the ROB). mov-elimination on modern CPUs (Ivy Bridge and bdver1 onward) means the movdqa register copy costs no back-end resources.

I don't know if GCC trunk is using por (%rsi,%rax), %xmm0 on purpose for that reason, or if it's just a coincidence. I don't think it's a good idea on most CPUs, even if alignment is guaranteed.

This is of course 100% a loss with AVX; we have to `vmovdqa/u` load for the shift, and it can leave the original value in a register so we're not saving a vmovdqu. And it's a bigger loss because indexed memory-source operands un-laminate from 3-operand instructions even on Haswell/Skylake: https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695 so it hurts the front-end as well as wasting cycles on load ports, and taking up space in the RS (scheduler).
The fact that -mtune=haswell fixes this for 128-bit vectors is interesting, but it's clearly still a loss in the AVX version for all AVX CPUs. 2 memory ops / cycle on Zen could become a bottleneck, and it's larger code size. And -mtune=haswell *doesn't* fix it for the -mavx2 _m256i version. There is a possible real advantage in the SSE case, but it's very minor and outweighed by disadvantages. Especially for older CPUs like Nehalem that can only do 1 load / 1 store per clock. (Although this has so many uops in the loop that it barely bottlenecks on that.)
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes changed:
           What    |Removed |Added
           CC      |        |peter at cordes dot ca

--- Comment #53 from Peter Cordes ---
I think we can close this as fixed at some point. The last activity on this bug was some patches that sound like they were supposed to fix it, and the MCVEs from comments that I tested no longer have a problem. GCC9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align the top of the loop, not between leave and ret or between cmp/jcc.

void wait_for_enter()
{
    volatile int foo = 0;   // to get a LEAVE instruction emitted at all
    int u = getchar();
    while (!u)
        u = getchar()-13;
}

https://godbolt.org/z/RvxzZv

(Note that Godbolt normally filters .p2align, so you have to either compile to binary or not filter directives in the asm source. Otherwise you'll never see NOPs except in the unusual case where GCC actually emits a nop mnemonic.)
[Bug target/93141] Missed optimization : Use of adc when checking overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141

Peter Cordes changed:
           What    |Removed |Added
           CC      |        |peter at cordes dot ca

--- Comment #2 from Peter Cordes ---
gcc doesn't actually *branch* unless you use an if(), it just uses cmp/sbb to do a 128-bit compare. CMP is like a SUB that only sets flags. The CF result of SBB is used as an input for ADC.

https://godbolt.org/z/64C4R- has a testcase. GCC also wastes a varying number of MOV instructions beyond the minimum one to make cmp/sbb work, depending on BMI2 MULX or not, and how the sum is written.

u128 prod = a[i] * (unsigned __int128) b[i];
#if 1
    sum += prod;
    //if(sum
[Bug target/40838] gcc shouldn't assume that the stack is aligned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #91 from Peter Cordes --- This bug should be closed as "resolved fixed". The "fix" was to change the ABI doc and break existing hand-written asm, and old binaries. This was intentional and resulted in some pain, but at this point it's a done deal. My attempt at a summary of the current state of affairs for 32-bit x86 calling conventions (on Linux and elsewhere): Yes, the version of the i386 System V ABI used on Linux really did change between gcc2.8 and gcc8. Those compilers are not ABI-compatible with each other. This is a known fact. Hand-written asm that makes function calls with misaligned stack pointers is violating the (updated) ABI, and was also knowingly broken by this change. (Perhaps unintentionally at first, with stack alignment intended to just provide a performance benefit, not a correctness issue. But the resolution ended up being to standardize on 16-byte alignment matching x86-64 System V. Instead of reverting to the old ABI and breaking compat with new binaries that had started to rely on 16-byte incoming alignment, or to add significant overhead to every function that didn't know how both its caller and callee were compiled, i.e. most functions. Using MOVUPS instead of MOVAPS everywhere wouldn't work well because it would mean no folding of memory operands into ALU instructions: without AVX's VEX encoding, paddd xmm0, [mem] requires aligned mem. And existing binaries that rely on incoming 16-byte alignment weren't doing that.) An earlier comment also mentioned common arrays: the ABI also requires arrays larger than 16 bytes to have 16-byte alignment. Perhaps unnecessary pain for little real benefit: i386 on Linux has been mostly obsolete for a long time, and the inefficient stack-args calling convention was never changed. 
It's ironic that Linux broke ABI compat for i386 in the name of more efficient SSE-usage despite not caring to introduce anything like Windows fastcall or vectorcall (efficient register-args calling conventions). (GCC does have ABI-changing -mregparm=3 and -msseregparm to pass integers in regs, and pass/return FP values in XMM registers (instead of passing on the stack / returning in x87 st0). But no distros have switched over to using that calling convention for i386 binaries, AFAIK. The Linux kernel does use regparm for 32-bit kernel builds.) Even more ironic, probably a lot of 32-bit code is compiled without -msse2 (because one of the main reasons for using 32-bit code is CPUs too old for x86-64, which is about the same vintage as SSE2). SSE usage can still happen with runtime dispatching in binaries that are compatible with old machines while still being able to take advantage of new ones. But in most cases, if you want performance you use x86-64 kernel + user-space, or maybe x32 user-space (ILP32 in 64-bit mode) to get modern calling conventions and the benefit of twice as many registers. x86-64 System V has mandated 16-byte stack alignment from the start. (I don't know the history, but perhaps i386 code-gen started assuming / depending on it for correctness, not just performance, by accident because of devs being used to x86-64?) The 32-bit ABI on some other OSes, including i386 *BSD and 32-bit Windows, has *not* changed; presumably gcc there doesn't rely on incoming stack alignment. (It might try to propagate 16-byte alignment for performance benefits, though.) My understanding is that i386 MacOS still uses a version of i386 System V that doesn't include the 16-byte stack alignment update, like other *BSDs. 
(In reply to Harald van Dijk from comment #90)
> compile
>
> void exit(int);
> int main(void) { exit(0); }
>
> with GCC 2.8, compile current glibc with GCC 8, and there will be a segfault
> in glibc's __run_exit_handlers because GCC 2.8 never kept the stack
> 16-byte-aligned, but GCC 8 does now generate code which assumes it.
>
> For the moment, I've rebuilt glibc with -mincoming-stack-boundary=2 to handle
> the problem well enough for my current needs, but it's not a complete
> solution.

Yes, you need workarounds like this to change modern GCC's ABI back to the legacy 4-byte alignment. Note that you might break atomicity of C11 _Atomic 8-byte objects even outside structs by doing this, if they split across a cache-line (Intel) or possibly narrower (AMD) boundary. But only if they were stack-allocated.
[Bug target/89346] Unnecessary EVEX encoding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346

Peter Cordes changed:
           What    |Removed |Added
           CC      |        |peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
Still present in pre-10.0.0 trunk 20191022. We pessimize vmovdqu/a in AVX2 intrinsics and autovectorization with -march=skylake-avx512 (and -march=native on such machines).

It seems only VMOVDQU/A load/store/register-copy instructions are affected; we get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and for non-zeroing XOR. (And most other instructions have the same mnemonic for VEX and EVEX, like vpaddd. This includes FP moves like VMOVUPS/PD.) (https://godbolt.org/z/TEvWiU for example)

The good options are:

* Use VEX whenever possible instead of AVX512VL, to save code-size (2- or 3-byte prefix instead of 4-byte EVEX).

* Avoid the need for vzeroupper by using only x/y/zmm16..31. (Still has a max-turbo penalty, so -mprefer-vector-width=256 is still appropriate for code that doesn't spend a lot of time in vectorized loops.) This might be appropriate for very simple functions / blocks that only have a few SIMD instructions before the next vzeroupper would be needed (e.g. copying or zeroing some memory); it could be competitive on code-size as well as saving the 4-uop instruction. VEX instructions can't access x/y/zmm16..31, so this forces an EVEX encoding for everything involving the vector (and rules out using AVX2 and earlier instructions, which may be a problem for KNL without AVX512VL unless we narrow to 128-bit in an XMM reg).

(Citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly: https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc - it's even safe to do

        vpxor    xmm0, xmm0, xmm0
        vpcmpeqb k0, zmm0, [rdi]

without vzeroupper, although that will reduce max turbo *temporarily* because it's a 512-bit uop. Or, more frequently useful: zeroing some memory with vpxor xmm-zeroing and YMM stores.)
[Bug target/82459] AVX512BW instruction costs: vpmovwb is 2 uops on Skylake and not always worth using vs. vpack + vpermq lane-crossing fixup
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 Peter Cordes changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=89346 Summary|AVX512F instruction costs: |AVX512BW instruction costs: |vmovdqu8 stores may be an |vpmovwb is 2 uops on |extra uop, and vpmovwb is 2 |Skylake and not always |uops on Skylake and not |worth using vs. vpack + |always worth using |vpermq lane-crossing fixup --- Comment #5 from Peter Cordes --- Turns out vmovdqu8 with no masking doesn't cost an extra uop. IACA was wrong, and Agner Fog's results were *only* for the masked case. The only downside of that is the code-size cost of using EVEX load/store instructions instead of AVX2 VEX. That's bug 89346 https://www.uops.info/table.html confirms that SKX non-masked vmovdqu8 load and store are both single uop. (Or the usual micro-fused store-address + store-data). https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html And between registers it can be eliminated if there's no masking. But *with* masking, as a load it's a micro-fused load+ALU uop, and as a masked store it's just a normal store uop for xmm and ymm. But zmm masked store is 5 uops (micro-fused to 4 front-end uops)! (Unlike vmovdqu16 or 32 masked stores which are efficient even for zmm). https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html uops.info's table also shows us that IACA3.0 is wrong about vmovdqu8 as an *unmasked* ZMM store: IACA thinks that's also 5 uops. Retitling this bug report since that part was based on Intel's bogus data, not real testing. vpmovwb is still 2 uops, and current trunk gcc still uses 2x vpmovwb + vinserti64x4 for ZMM auto-vec. -mprefer-vector-width=512 is not the default, but people may enable it in code that heavily uses 512-bit vectors. YMM auto-vec is unchanged since previous comments: we do get vpackusbw + vpermq, but an indexed addressing mode defeats micro-fusion. 
And we have redundant VPAND after shifting.

---

For icelake-client/server (AVX512VBMI), GCC is using vpermt2b, but it doesn't fold the shifts into the 2-source byte shuffle. (vpermt2b has 5c latency and 2c throughput on ICL, so its uop count is probably the same as uops.info measured for CannonLake: 1*p05 + 2*p5. Possibly 2x 1-uop vpermb, with merge-masking to blend the 2nd result into the first, would work better.)

IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so probably still 2 uops.
[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #4 from Peter Cordes --- (In reply to Andrew Pinski from comment #3) > (In reply to Peter Cordes from comment #1) > > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the > > loop. And an indexed addressing mode. That was an overstatement, the generic tuning I showed isn't using 2 separate pointers or indices like we get on x86. Your thunderx2t99 output is like that, but write-back addressing modes mean it doesn't cost extra instructions. > I am not shocked that IV-OPTS can chose these widly differences. > I have not looked at the cost differences to understand why > -mcpu=thunderx2t99 chose what close might be the best (we could use one less > IV by replacing the first ldr by using the same IV as the last str). I don't know ARM tuning; the x86 version is clearly worse with an extra uop inside the loop. And an extra instruction to copy the register before the loop, wasting code-size if nothing else. On Skylake for example, the loop is 10 uops and bottlenecks on front-end throughput (4 uops / clock) if the back-end can keep up with a bit less than 1 store per clock. (Easy if pointers are aligned and data is hot in L1d). Reducing it to 9 uops should help in practice. Getting it down to 8 uops would be really nice, but we can't do that unless we could use a shuffle that micro-fuses with a load. (For int elements, AVX2 VPERMD can micro-fuse a memory source, so can SSE2 PSHUFD. pshufb's xmm/memory operand is the control vector which doesn't help us. AVX512 vpermb can't micro-fuse)
[Bug target/92246] Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246

--- Comment #1 from Peter Cordes ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This problem only applies to char and short. Possibly because AVX2 includes vpermd ymm.

Apparently CannonLake has 1-uop vpermb but 2-uop vpermw, according to real testing on real hardware by https://uops.info/. Their automated test methods are generally reliable. That seems to be true for Ice Lake, too, so when AVX512VBMI is available we should be using vpermb any time we might have used vpermw with a compile-time-constant control vector. (vpermw requires AVX512BW, e.g. SKX and Cascade Lake. vpermb requires AVX512VBMI, only Ice Lake and the mostly-aborted CannonLake.)

InstLat provides some confirmation: https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a chain of 2 uops, 1c and 3c being the standard latencies that exist in recent Intel CPUs). InstLat doesn't document which input the dep chain goes through, so it's not 100% confirmation of only 1 uop. But it's likely that ICL has 1-uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes letting us figure out that e.g. one of the inputs isn't needed until the 2nd uop. Seems to be the case for CannonLake vpermw: latency from one of the inputs is only 3 cycles, the other is 4. https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html
[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 Bug ID: 92246 Summary: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512) Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

g++ -O3 -march=skylake-avx512 (Compiler-Explorer-Build) 10.0.0 20191022 (experimental) https://godbolt.org/z/LS34w9
...
.L4:
        vmovdqu16   (%rdx), %ymm1
        vmovdqu16   (%rax), %ymm0
        vmovdqa64   %ymm1, %ymm3        # useless copy
        vpermt2w    %ymm1, %ymm2, %ymm3
        vmovdqu16   %ymm3, (%rax)
        vpermt2w    %ymm0, %ymm2, %ymm0
        addq        $32, %rax
        vmovdqu16   %ymm0, (%rcx)
        subq        $32, %rdx
        subq        $32, %rcx           # two tail pointers, PR 92244 is unrelated to this
        cmpq        %rsi, %rax
        jne         .L4

vpermt2w ymm is 3 uops on SKX and CannonLake: 2p5 + p015 (https://www.uops.info/table.html)

Obviously better would be vpermw (%rax), %ymm2, %ymm0. vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus a load if we use a memory source. SKX still bottlenecks on 2p5 for vpermw, losing only the p015 uop, but in general fewer uops is better. But on CannonLake it runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet, but vpermw throughput on IceLake is 1/clock vs. 1 / 2 clocks for vpermt2w, so this could double throughput on CNL and ICL.

We have exactly the same problem with AVX512VBMI vpermt2b over vpermb with ICL g++ -O3 -march=icelake-client -mprefer-vector-width=512
[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 Peter Cordes changed: What|Removed |Added Summary|extra sub inside vectorized |vectorized loop updating 2 |loop instead of calculating |copies of the same pointer |end-pointer |(for in-place reversal ||cross in the middle) --- Comment #2 from Peter Cordes --- Forgot to update title after looking more carefully at the asm.
[Bug tree-optimization/92244] extra sub inside vectorized loop instead of calculating end-pointer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #1 from Peter Cordes --- On AArch64 (with gcc8.2), we see a similar effect, more instructions in the loop. And an indexed addressing mode. https://godbolt.org/z/6ZVWY_

# strrev_explicit -O3 -mcpu=cortex-a53
...
.L4:
        ldr     q1, [x4, x2]              # tail
        ldr     q0, [x3]                  # head
        tbl     v1.16b, {v1.16b}, v2.16b  # byte shuffle
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x3], 16              # post-increment store to head
        cmp     x3, x1
        str     q0, [x4, x2]
        sub     x2, x2, #16               # doesn't update flags, not SUBS
        bne     .L4                       # }while( head != end_head )

# strrev_implicit -O3 -mcpu=cortex-a53
...
.L19:
        ldr     q1, [x3]
        ldr     q0, [x2]
        tbl     v1.16b, {v1.16b}, v2.16b
        tbl     v0.16b, {v0.16b}, v2.16b
        str     q1, [x2], 16              # post-increment addressing mode
        cmp     x2, x4
        str     q0, [x3], -16             # post-decrement addressing mode
        bne     .L19                      # }while( head != end_head )
[Bug tree-optimization/92244] New: extra sub inside vectorized loop instead of calculating end-pointer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 Bug ID: 92244 Summary: extra sub inside vectorized loop instead of calculating end-pointer Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

We get a redundant instruction inside the vectorized loop here. But it's not a separate *counter*, it's a duplicate of the tail pointer. It goes away if we find tail with while(*tail++); instead of calculating it from head+length. Only happens with vectorization, not pure scalar (bug 92243 is about the fact that -O3 fails to use bswap as a GP-integer shuffle to auto-vectorize without x86 SSSE3).

typedef char swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

https://godbolt.org/z/wdGv4S compiled with g++ -O3 -march=sandybridge gives us a main loop of

...
        movq    %rcx, %rsi          # RSI = RCX before entering the loop
        addq    %rdi, %r8
.L4:
        vmovdqu (%rcx), %xmm3       # tail load from RCX
        addq    $16, %rax           # head
        subq    $16, %rcx           # tail
        subq    $16, %rsi           # 2nd tail?
        vmovdqu -16(%rax), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rax)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rsi)     # tail store to RSI
        cmpq    %r8, %rax           # } while(head != end_head)
        jne     .L4

RSI = RCX before and after the loop. This is obviously pointless. head uses the same register for loads and stores.

Then we have bloated fully-unrolled scalar cleanup, instead of using the shuffle control for 8-byte vectors -> movhps. Or scalar bswap. Ideally we'd do something clever at the overlap like one load + shuffle + store, but we might have to load the next vector before storing the current to make this work at the overlap. That would presumably require more special-casing this kind of meet-in-the-middle loop.
The implicit-length version doesn't have this extra sub in the main loop.

void strrev_implicit(swapt *head)
{
  swapt *tail = head;
  while(*tail) ++tail;    // find the 0 terminator, like head+strlen
  --tail;                 // tail points to the last real char
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

.L22:
        vmovdqu (%rcx), %xmm3
        addq    $16, %rdx           # head
        subq    $16, %rcx           # tail
        vmovdqu -16(%rdx), %xmm0
        vpshufb %xmm2, %xmm3, %xmm1
        vmovups %xmm1, -16(%rdx)
        vpshufb %xmm2, %xmm0, %xmm0
        vmovups %xmm0, 16(%rcx)
        cmpq    %rsi, %rdx          # } while(head != end_head)
        jne     .L22
[Bug tree-optimization/92243] Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243 --- Comment #1 from Peter Cordes --- Forgot to mention, this probably applies to other ISAs with GP-integer byte-reverse instructions and efficient unaligned loads.
[Bug tree-optimization/92243] New: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243 Bug ID: 92243 Summary: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

We could use integer bswap to speed up an in-place byte-reverse loop by a factor of probably 8, the same way we use SIMD shuffles. Consider this loop which reverses an explicit-length char array: https://godbolt.org/z/ujXq_J

typedef char swapt; // int can auto-vectorize with just SSE2
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

gcc -O3 (including current trunk) targeting x86-64 makes naive scalar byte-at-a-time code, even though bswap r64 is available to byte-reverse a uint64 in 1 or 2 uops (AMD and Intel, respectively). With -mssse3, we do see auto-vectorization using SIMD pshufb (after checking lengths and calculating how many 16-byte chunks can be done before bloated fully-unrolled cleanup). Doing the same thing with 64-bit integer registers would be very much worth it (for code where a loop like this was a bottleneck).

With `swapt = short`, vectorizing with SSE2 pshuflw / pshufhw / pshufd is probably worth it, but GCC chooses not to do that either. Or working in 8-byte chunks just using movq + pshuflw, so we only have 1 shuffle per 8-byte load/store instead of 3 per 16-byte store. That's a good balance for modern Intel (Haswell, Skylake, and I think IceLake), although some AMD and earlier Intel with more integer shuffle throughput (e.g. Sandybridge) might do better with 3x shuffles per 16-byte load/store.
[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 --- Comment #5 from Peter Cordes --- Reported bug 92080 for the missed CSE
[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080 Bug ID: 92080 Summary: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c) Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

As a workaround for PR 82887 some code (e.g. a memset) uses

    __m512i zmm = _mm512_set1_epi8((char)c);
    __m256i ymm = _mm256_set1_epi8((char)c);

instead of ymm = _mm512_castsi512_si256(zmm); (found in the persistent-memory library https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193 )

Obviously we'd like to CSE that instead of actually broadcasting twice. MVCE:

#include <immintrin.h>
__m512i sinkz;
__m256i sinky;
void foo(char c) {
    sinkz = _mm512_set1_epi8(c);
    sinky = _mm256_set1_epi8(c);
}

https://godbolt.org/z/CeXhi8 g++ (Compiler-Explorer-Build) 10.0.0 20191012

# g++ -O3 -march=skylake-avx512  (AVX512BW + AVX512VL are the relevant ones)
foo(char):
        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vpbroadcastb    %edi, %ymm0          # wasted insn
        vmovdqa64       %ymm0, sinky(%rip)   # wasted EVEX prefix
        vzeroupper
        ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb xmm,ymm), even though AVX512BW vpbroadcastb zmm does set the YMM register. (There are no CPUs with AVX512BW but not AVX512VL; if people compile that way it's their own fault. But this might be relevant for set1_epi32() on KNL.)

Clang finds this optimization, and uses a shorter vmovdqa for the YMM store saving another 2 bytes of code size:

        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vmovdqa         %ymm0, sinky(%rip)
        vzeroupper
        ret
[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4 from Peter Cordes --- Since some code is apparently still avoiding this because of old broken GCC (e.g. https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193 ) Perhaps a workaround of _mm512_castsi512_si256 would be useful? Or does that ICE as well? I can't repro the bug on Godbolt so IDK. Doing _mm512_set1_epi8(c) and a separate _mm256_set1_epi8(c) doesn't CSE with GCC, only clang. https://godbolt.org/z/uZ4lv- And if you leave out -march=skylake-avx512 you get even worse asm from GCC.
[Bug middle-end/91515] missed optimization: no tailcall for types of class MEMORY
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1 from Peter Cordes --- The real missed optimization is that GCC is returning its own incoming arg instead of returning the copy of it that create() will return in RAX. This is what blocks tailcall optimization; it doesn't "trust" the callee to return what it's passing as RDI. See https://stackoverflow.com/a/57597039/224132 for my analysis (the OP asked the same thing on SO before reporting this, but forgot to link it in the bug report.) The RAX return value tends to rarely be used, but probably it should be; it's less likely to have just been reloaded recently. RAX is more likely to be ready sooner than R12 for out-of-order exec. Either reloaded earlier (still in the callee somewhere if it's complex and/or non-leaf) or never spilled/reloaded. So we're not even gaining a benefit from saving/restoring R12 to hold our incoming RDI. Thus it's not worth the extra cost (in code-size and instructions executed), IMO. Trust the callee to return the pointer in RAX.
[Bug c/91398] Possible missed optimization: Can a pointer be passed as hidden pointer in x86-64 System V ABI
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4 from Peter Cordes --- EAD neglected to link previous discussion about this in the initial bug report. https://stackoverflow.com/a/57377890/224132 points out that the SysV ABI wording is > If the type has class MEMORY, then **the caller provides space** for the > return value and passes the address of this storage in %rdi We can argue semantics, but in my answer on the same question, I argued that the implication is that that space won't alias any other space. (Because the return-value object exists in the C abstract machine, so the default assumption should be that it exists for real in the calling convention.) Whether it's practical to look for this optimization or not, I'm still curious about the point that @M.M made about the semantics of restrict https://stackoverflow.com/questions/57377314/what-prevents-the-usage-of-a-function-argument-as-hidden-pointer/57436765#comment101288442_57403379 Does the callee do_something() reading a global count as happening inside the block scope of use(Vec3 *restrict out) { ... }? The ISO C standard wording talks about reaching the end of a block, which hasn't happened even though `out` is not in scope inside the other function. If so, then calling use(&global) creates UB when *out = do_something(); executes because it writes the pointed-to memory via a restrict-pointer in the same block where it reads it from a pointer that's not derived from out. If so, restrict would make this optimization safe if we can prove that do_something is "noexcept" and doesn't longjmp.
[Bug tree-optimization/91026] switch expansion produces a jump table with trivial entries
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3 from Peter Cordes --- (In reply to Martin Liška from comment #2)
> Switch conversion bails out because it knowns that a jump table (or a bit
> test can) be used for this snippet. Then we prefer to use a jump table then
> a bit test. With -fno-jump-tables we generate the same code.
> That said, I confirm it's a small limitation.

This regression appeared in GCC9 for this test-case, and is present in GCC9.1 on Godbolt: https://godbolt.org/z/fDjTxN

bool is_vowel(char c) {
    switch (c) {
      case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return 1;
      default:
        return 0;
    }
}

But simplifying it to just the 3 cases case 'a': case 'e': case 'i': gets gcc9 and trunk to use an immediate bitmap. With gcc8 and earlier, the x86-64 asm for the 2 versions is identical except for the immediate used with TEST EAX, imm32.

(And BTW, there's a missed optimization here of using mask & (1<<n) instead of (mask>>n) & 1. Or better, looking for that conversion in user source code / logic, because people often write tests that way, requiring the creation of an actual 1 in a register. Or for ISAs with flags, have the mask already right-shifted by 1 so the bit shifted out is the one we want. Then CF = result with no extra test. Also an x86 missed optimization: BT reg,reg is very efficient (single uop) on Intel and Ryzen, and avoids needing a 3-uop-on-Intel shift-by-CL or a mov reg,1. I'll report these ideas separately if/when I get around to it.)
[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #4 from Peter Cordes --- We should not put any stock in what ICC does for GNU C native vector indexing. I think it doesn't know how to optimize that because it *always* spills/reloads even for `vec[0]` which could be a no-op. And it's always a full-width spill (ZMM), not just the low XMM/YMM part that contains the desired element. I mainly mentioned ICC in my initial post to suggest the store/reload strategy in general as an *option*. ICC also doesn't optimize intriniscs: it pretty much always faithfully transliterates them to asm. e.g. v = _mm_add_epi32(v, _mm_set1_epi32(1)); twice compiles to two separate paddd instructions, instead of one with a constant of set1(2). If we want to see ICC's strided-store strategy, we'd need to write some pure C that auto-vectorizes. That said, store/reload is certainly a valid option when we want all the elements, and gets *more* attractive with wider vectors, where the one extra store amortizes over more elements. Strided stores will typically bottleneck on cache/memory bandwidth unless the destination lines are already hot in L1d. But if there's other work in the loop, we care about OoO exec of that work with the stores, so uop throughput could be a factor. If we're tuning for Intel Haswell/Skylake with 1 per clock shuffles but 2 loads + 1 store per clock throughput (if we avoid indexed addressing modes for stores), then it's very attractive and unlikely to be a bottleneck. There's typically spare load execution-unit cycles in a loop that's also doing stores + other work. You need every other uop to be (or include) a load to bottleneck on that at 4 uops per clock, unless you have indexed stores (which can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking a cycle from a load). Cache-split loads do get replayed to grab the 2nd half, so it costs extra execution-unit pressure as well as extra cache-read cycles. 
Intel says Ice will have 2 load + 2 store pipes, and a 2nd shuffle unit. A mixed strategy there might be interesting: extract the high 256 bits to memory with vextractf32x8 and reload it, but shuffle the low 128/256 bits. That strategy might be good on earlier CPUs, too. At least with movss + extractps stores from the low XMM where we can do that directly. AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one of which can be a store. It's definitely worth considering extracting the high 128-bit half of a YMM and using movss then shuffles like vextractps: 2 uops on Ryzen or AMD. - If the stride is small enough (so more than 1 element fits in a vector), we should consider shuffle + vmaskmovps masked stores, or with AVX512 then AVX512 masked stores. But for larger strides, AVX512 scatter may get better in the future. It's currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to surrounding code. It sustains one per 17 clock throughput, slightly worse than 1 element stored per clock cycle. Same throughput on KNL, but only 4 uops so it can overlap much better with surrounding code. For qword elements, we have efficient stores of the high or low half of an XMM. A MOVHPS store doesn't need a shuffle uop on most Intel CPUs. So we only need 1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an XMM register. Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store. But Ryzen has shuffle EUs on multiple ports.
[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 Bug ID: 91103 Summary: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

GCC9.1 and current trunk aren't good at extracting high elements, whether it's with GNU C native vector syntax, or when auto-vectorizing something that ends with the result in the high element.

Using VALIGND we can get any element with one immediate instruction, but it's better to use AVX2 VPERMPD(immediate) when possible. Or inside loops, VPERMPS(vector), or VPERMT2PS(vector). Or of course vextractf32x4 if possible (element at the bottom of a 128-bit lane). Or with only AVX2 available, VPERMPD(immediate) for high elements in __m256 and __m256d vectors is still a big win.

#include <immintrin.h>
float elem12(__m512 v) { return v[12]; }
float elem15(__m512 v) { return v[15]; }

gcc -Ofast -march=skylake-avx512 https://godbolt.org/z/241r8p

elem15:
        vextractf32x8   ymm0, zmm0, 0x1
        vextractf128    xmm0, ymm0, 0x1       # elem12 ends here, after these 2 insns
        vshufps         xmm0, xmm0, xmm0, 255
        # no vzeroupper, I guess because the caller must have __m512 vars too; recent optimization
        ret

But AVX512F has vextractf32x4 to extract a 128-bit lane, which would preclude the need for AVX2 vextractf128. That's what clang does.

Obviously inside a loop it would be *much* better to use a single lane-crossing VPERMPS to also avoid the shufps. Intel Skylake easily bottlenecks on shuffle throughput. We'd need a 15 in an XMM register as a control vector, but loading it would be off the latency critical path. (If we needed the scalar zero-extended instead of garbage in high elements, we could VPERMI2PS or VPERMT2PS with a zeroed vector and a shuffle-control.)
---

If the element we want is an even element in the low 256 bits, we can get it with a VPERMPD-immediate. GCC does this:

elem6(float __vector(16)):   # GCC 10 trunk
        vextractf128    xmm0, ymm0, 0x1
        vunpckhps       xmm0, xmm0, xmm0
        ret

Instead it should be AVX2

        vpermpd ymm0, ymm0, 3

This bug also applies to __m256, not just __m512. https://www.felixcloutier.com/x86/vpermpd

VPERMPD is a 64-bit granularity lane-crossing shuffle. The AVX512F immediate version reuses the immediate for another 256-bit wide shuffle in the upper half; only the vector-control version can bring an element from the top half of a ZMM down to the bottom. But if we're going to use a vector control, we might as well use VPERMPS. For the integer version of this bug, use VPERMQ.

--

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP data. There unfortunately isn't an FP flavour of VALIGND, just integer. AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles between FP math instructions, i.e. the shuffle unit is connected to both FP and integer forwarding networks. Intel's optimization manual for Skylake (client) has a bypass-latency table that shows 0 extra latency cycles for SHUF/5/1,3 reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq It's a 4 or 8-byte granularity version of palignr, except that it's lane-crossing so the 256 and 512-bit versions are actually useful. The immediate shift count can thus bring *any* element down to the bottom. (Using the same input twice makes it a rotate.) VALIGND is good on Knight's Landing, too: unlike most 2-input shuffles, it has 1 per clock throughput.

For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
        valignd zmm0, zmm0, zmm0, 15   # I think this is right.
        ret

The only downside I'm aware of is that some future AVX512 CPU might not run VALIGND as efficiently as SKX and KNL.
For vector elements narrower than 32 bits, we may need 2 shuffles even if we consider using a shuffle-control vector. On Skylake-AVX512, AVX512BW vpermw will get the job done, but costs 2 shuffle uops. On CannonLake (and presumably other future Intel), it and AVX512VBMI vpermb are only 1 uop, so it's definitely worth creating a shuffle-control vector if it can be reused. Also worth considering instead of 2 shuffles: *unaligned* spill / reload like ICC does for GNU C native vector indexing. Store-forwarding latency is only 6 or 7 cycles I think, and it avoids any port 5 pressure. Not generally a good choice IMO when we can get the job done in one shuffle, but worth considering if we need multiple elements. If the function doe
[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582 Bug ID: 90582 Summary: AArch64 stack-protector wastes an instruction on address-generation Product: gcc Version: 8.2.1 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

void protect_me() {
    volatile int buf[2];
    buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard   ### this instruction
        mov     x29, sp      # frame pointer even though -fomit-frame-pointer is part of -O3.
                             # Goes away with explicit -fomit-frame-pointer
        ldr     x1, [x0]     # copy the cookie
        str     x1, [sp, 24]
        mov     x1, 0        # and destroy the reg
        mov     w1, 3        # right before it's already destroyed
        str     w1, [sp, 20] # buf[1] = 3
        ldr     x1, [sp, 24] # canary
        ldr     x0, [x0]     # key; destroys the key pointer
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32   # FP and LR save/restore (for some reason?)
        ret
.L5:    # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into the load instruction. AArch64 instructions are fixed-width, so there's no reason (AFAIK) not to do this.

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret

I'm not an AArch64 performance expert; it's plausible that zero displacements are worth spending an extra instruction on for addresses that are used twice, but unlikely. So we should be doing

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]   # in prologue to copy cookie
        ...
        ldr     x0, [x0, #:lo12:__stack_chk_guard]   # in epilogue to check cookie

This also avoids leaving an exact pointer right to __stack_chk_guard in a register, in case a vulnerable callee or code in the function body can be tricked into dereferencing it and leaking the cookie.
(In non-leaf functions, we generate the pointer in a call-preserved register like x19, so yes it will be floating around in a register for callees). I'd hate to suggest destroying the pointer when copying to the stack, because that would require another adrp later. Finding a gadget that has exactly the right offset (the low 12 bits of __stack_chk_guard's address) is a lot less likely than finding an ldr from [x0]. Of course this will introduce a lot of LDR instructions with an #:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful gadgets because they lead to writing the stack, or to EOR/CBNZ to __stack_chk_fail I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR 90568. I assume EOR / CBNZ is as at least as efficient as SUBS / BNE on all/most AArch64 microarchitectures, but someone should check. -O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions protected with -fstack-protector-strong still get a frame pointer in x29 (costing a MOV x29, sp instruction, and save/restore with STP/LDP along with x30.) However, explicitly using -fomit-frame-pointer stops that from happening. Is that a separate bug, or am I missing something? Without stack-protector, the function is vastly simpler protect_me: sub sp, sp, #16 mov w0, 3 str w0, [sp, 12] add sp, sp, 16 ret Does stack-protector really need to spill/reload x29/x30 (FP and LR)? Bouncing the return address through memory seems inefficient, even though branch prediction does hide that latency. Is that just so __stack_chk_fail can backtrace? Can we move the store of the link register into the __stack_chk_fail branch, off the fast path? Or if we do unconditionally store x30 (the link register), at least don't bother reloading it in a leaf function if register allocation didn't need to clobber it. Unlike x86-64, the return address can't be attacked with buffer overflows if it stays safe in a register the whole function. 
Obviously my test-case with a volatile array and no inputs at all is making -fstack-protector-strong look dumb by protecting a perfectly safe function. IDK how common it is to have leaf functions with arrays or structs that just use them for some computation on function args or globals and then return, maybe after copying the array b
[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #5 from Peter Cordes --- And BTW, this only helps if the SUB and JNE are consecutive, which GCC (correctly) doesn't currently optimize for with XOR. If this sub/jne is different from a normal sub/branch and won't already get optimized for macro-fusion, we may get even more benefit from this change by teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

        xorq    %fs:40, %rdx
        movl    %ebx, %eax
        jne     .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u

#include <random>
unsigned int get_random_seed() {
    std::random_device rd;
    return rd();
}

Even with -O3 -march=skylake. That's not wrong because XOR can't macro-fuse, but the point of switching to SUB is that it *can* macro-fuse into a single sub-and-branch uop on Sandybridge-family. So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by keeping the sub and jne back to back. Preferably with tune=generic (because Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but definitely with -mtune=intel or -mtune=sandybridge or later. Nehalem and earlier can only macro-fuse test/cmp.

The potential downside of putting it adjacent instead of 1 or 2 insns earlier for uarches that can't macro-fuse SUB/JNE should be about zero on average. These branches should predict very well, and there are no in-order x86 CPUs still being sold. So it's mostly just going to be variations in fetch/decode that help sometimes, hurt sometimes, like any code alignment change.
[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #3 from Peter Cordes --- (In reply to Jakub Jelinek from comment #2) > The xor there is intentional, for security reasons we do not want the stack > canary to stay in the register afterwards, because then it could be later > spilled or accessible to some exploit in another way. Ok, so we can't use CMP, therefore we should use SUB, which as I showed does help on Sandybridge-family vs. XOR. x - x = 0 just like x ^ x = 0 Otherwise SUB wouldn't set ZF. SUB is not worse than XOR on any other CPUs; there are no CPUs with better XOR throughput than ADD/SUB. In the canary mismatch case, leaving attacker_value - key in a register seems no worse than leaving attacker_value ^ key in a register. Either value trivially reveals the canary value to an attacker that knows what they overwrote the stack with, if it does somehow leak. We jump to __stack_chk_fail in that case, not relying on the return value on the stack, so a ROP attack wouldn't be sufficient to leak that value anywhere.
[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/hHCVTc Forgot to mention, stack-protector also disables use of the red-zone for no apparent reason, so that's another missed optimization. (Perhaps rarely relevant; probably most functions that get stack protection are big enough that they need more stack, or non-leaf. I sidestepped that with volatile.)
[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 Bug ID: 90568 Summary: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86 Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

cmp/jne is always at least as efficient as xor/jne, and more efficient on CPUs that support macro-fusion of compare and branch. Most support cmp/jne fusion (including all mainstream Intel and AMD, not low-power), but none support xor/jne fusion.

    void foo() {
        volatile int buf[4];
        buf[1] = 2;
    }

gcc trunk on Godbolt, but same code-gen all the way back to gcc4.9:

    foo:
        subq    $40, %rsp
        movq    %fs:40, %rax
        movq    %rax, 24(%rsp)
        xorl    %eax, %eax
        movl    $2, 4(%rsp)
        movq    24(%rsp), %rax
        xorq    %fs:40, %rax    ## This insn should be CMP
        jne     .L5
        addq    $40, %rsp
        ret
    .L5:
        call    __stack_chk_fail

As far as I can tell, the actual XOR result value in RAX is not an input to __stack_chk_fail because gcc sometimes uses a different register. Therefore we don't need it, and can use any other way to check for equality. If we need to avoid "leaking" the canary value in a register, we can use SUB, otherwise CMP is even better and can macro-fuse on more CPUs. Only Sandybridge-family can fuse SUB/JCC. (And yes, it can fuse even with a memory-source and a segment override prefix. SUB %fs:40(%rsp), %rax / JNE is a single uop on Skylake; I checked this with perf counters in an asm loop.) AMD can fuse any TEST or CMP/JCC, but only those instructions (so SUB is as bad as XOR for AMD). See Agner Fog's microarch PDF.
Linux test program (NASM) that runs sub (mem), %reg with an FS prefix to prove that it does macro-fuse and stays micro-fused as a single uop:

    default rel
    %use smartalign
    alignmode p6, 64

    global _start
    _start:
    cookie equ 12345
        mov     eax, 158        ; __NR_arch_prctl
        mov     edi, 0x1002     ; ARCH_SET_FS
        lea     rsi, [buf]
        syscall                 ; wrfsbase rsi  ; not enabled by the kernel

        mov     qword [fs: 0x28], cookie
        mov     ebp, 1000000000
    align 64
    .loop:
        mov     eax, cookie
        sub     rax, [fs: 0x28]
        jne     _start
        and     ecx, edx
        dec     ebp
        jnz     .loop
    .end:
        xor     edi, edi
        mov     eax, 231        ; __NR_exit_group
        syscall                 ; sys_exit_group(0)

    section .bss
    align 4096
    buf:    resb 4096

nasm -felf64 branch-fuse-mem.asm && ld -o branch-fuse-mem branch-fuse-mem.o   to make a static executable.

taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u -r2 ./branch-fuse-mem

On my i7-6700k:

 Performance counter stats for './branch-fuse-mem' (2 runs):

            240.78 msec task-clock:u            #    0.999 CPUs utilized    ( +-  0.23% )
                 2      context-switches        #    0.010 K/sec            ( +- 20.00% )
                 0      cpu-migrations          #    0.000 K/sec
                 3      page-faults             #    0.012 K/sec
     1,000,764,258      cycles:u                #    4.156 GHz              ( +-  0.00% )
     2,000,000,076      branches:u              # 8306.384 M/sec            ( +-  0.00% )
     6,000,000,088      instructions:u          #    6.00  insn per cycle   ( +-  0.00% )
     4,000,109,615      uops_issued.any:u       # 16613.222 M/sec           ( +-  0.00% )
     5,000,098,334      uops_executed.thread:u  # 20766.367 M/sec           ( +-  0.00% )

          0.240935 +- 0.000546 seconds time elapsed  ( +-  0.23% )

Note 1.0 billion cycles (1 per iteration), and 4B fused-domain uops_issued.any, i.e. 4 uops per loop iteration. (5 uops *executed* is because one of those front-end uops has a micro-fused load.) Changing SUB to CMP has no effect. With SUB changed to XOR, the loop takes 1.25 cycles per iteration, and the front-end issues 5 uops per iteration. Other counters are the same. Skylake's pipeline is 4-wide, like all Intel since Core2, so an extra uop for the front-end creates a bottleneck.
-- On Intel pre-Haswell, the decoders will only make at most 1 fusion per decode group, so you may need to make the loop larger to still get fusion. Or use this as the loop branch, e.g. with a 1 in memory:

    sub rax, [fs: 0x28]
    jnz .loop

or with a 0 in memory, sub or cmp or xor will all set flags according to the register being non-zero. But sub or xor will introduce an extra cycle of latency on the critical path for the loop counter.
[Bug target/88809] do not use rep-scasb for inline strlen/memchr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4 from Peter Cordes --- Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not very good either. With 16-byte alignment (which we have from calloc on x86-64 System V), we can inline a *much* better SSE2 loop. See https://stackoverflow.com/a/55589634/224132 for more details and microbenchmarks. On Skylake it's about 4 to 5x faster than the current 4-byte loop for large strings, 3x faster for short strings. For short strings (strlen=33), it's about 1.5x faster than calling strlen. For very large strings (too big for L2 cache), it's ~1.7x slower than glibc's AVX2 strlen. The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc emit them all with VEX if AVX is enabled.

    # at this point gcc has `s` in RDX, `i` in ECX
        pxor     %xmm0, %xmm0       # zeroed vector to compare against
        .p2align 4
    .Lstrlen16:                     # do {
    #ifdef __AVX__
        vpcmpeqb (%rdx), %xmm0, %xmm1
    #else
        movdqa   (%rdx), %xmm1
        pcmpeqb  %xmm0, %xmm1       # xmm1 = -1 where there was a 0 in memory
    #endif
        add      $16, %rdx          # ptr++
        pmovmskb %xmm1, %eax        # extract high bit of each byte to a 16-bit mask
        test     %eax, %eax
        jz       .Lstrlen16         # } while(mask == 0);
    # RDX points at the 16-byte chunk *after* the one containing the terminator
    # EAX = bit-mask of the 0 bytes, and is known to be non-zero
        bsf      %eax, %eax         # EAX = bit-index of the lowest set bit
    # terminator is at rdx+rax - 16
    #   movb $'A', -16(%rdx, %rax)  # for a microbench that used s[strlen(s)]='A'
        sub      %rbp, %rdx             # p -= start
        lea      -16(%rdx, %rax), %rax  # p += byte_within_vector - 16

We should actually use REP BSF because that's faster on AMD (tzcnt), and same speed on Intel. Also an inline-asm implementation of it with a microbenchmark adapted from the SO question. (Compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed length repeatedly.) https://godbolt.org/z/9tuVE5 It uses clock() for timing, which I didn't bother updating.
I made it possible to run it for lots of iterations for consistent timing. (And so the real work portion dominates the runtime so we can use perf stat to measure it.) If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7 to either overlap that 4B again or not when we start 8B chunks. But probably it's good to get to 16-byte alignment and do whole SSE2 vectors, because repeating an aligned 16-byte test that overlaps an 8-byte test costs the same as doing another 8-byte test. (Except on CPUs like Bobcat that split 128-bit vectors into 64-bit halves). The extra AND to round down to an alignment boundary is all it takes, plus the code-size cost of peeling 1 iteration each of 4B and 8B before a 16-byte loop. We can use 4B / 8B with movd / movq instead of movdqa. For pmovmskb, we can ignore the compare-true results for the upper 8 bytes by testing the result with `test %al,%al`, or in general with `test $0x0F, %al` to check only the low 4 bits of EAX for the 4-byte case. The scalar bithack version can use BSF instead of CMOV binary search for the byte with a set high bit. That should be a win if we ever wanted to do scalar on some x86 target especially with 8-byte registers, or on AArch64. AArch64 can rbit / clz to emulate bsf and find the position of the first set bit. (Without efficient SIMD compare result -> integer_mask, or efficient SIMD -> integer at all on some ARM / AArch64 chips, SIMD compares for search loops aren't always (ever?) a win. IIRC, glibc strlen and memchr don't use vectors on ARM / AArch64, just scalar bithacks.)
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #22 from Peter Cordes --- Nice, that's exactly the kind of thing I suggested in bug 80571. If this covers

    * vsqrtss/sd  (mem), %merge_into, %xmm
    * vpcmpeqd    %same, %same, %dest   # false dep on KNL / Silvermont
    * vcmptrueps  %same, %same, %ymm    # splat -1 without AVX2; false dep on all known uarches

as well as int->FP conversions, then we could probably close that as fixed by this as well. bug 80571 does suggest that we could look for any cold reg, like a non-zero constant, instead of requiring an xor-zeroed vector, so it might go slightly beyond what this patch does. And looking for known-to-be-ready dead regs from earlier in the same dep chain could certainly be useful for non-AVX code-gen, allowing us to copy-and-sqrt without introducing a dependency on anything that's not already ready. (In reply to h...@gcc.gnu.org from comment #21) > Author: hjl > Date: Fri Feb 22 15:54:08 2019 > New Revision: 269119
[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 --- Comment #2 from Peter Cordes --- I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for AVX. If register pressure is an issue, using a reg holding an arbitrary constant (instead of xor-zeroed) is a valid option, as this bug points out. So I'm not sure we should close this as a duplicate of those fixed bugs.
[Bug target/38959] Additional switches to disallow processor supplementary instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 --- Comment #4 from Peter Cordes --- The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous comment is already reported and fixed (in gcc9 only): bug 87550 It was present since at least gcc 5.0 https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214
[Bug target/38959] Additional switches to disallow processor supplementary instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3 from Peter Cordes --- We can maybe close this as fixed (if -march=i386 didn't exist/work at the time) or invalid. Or maybe we want to add some CPU-level awareness to code-gen for __builtin_ia32_rdtsc / rdpmc / rdtscp. The cmov / fcmov / fcomi proposed switches are already supported as part of -march=pentium -mtune=generic or lower, e.g. -march=i386. (The 32-bit default is something like arch=i686 and tune=generic, with it being possible to configure gcc so SSE2 is on by default in 32-bit code.) Those are the important ones, because they're emitted automatically by the compiler's back-end. The other options would just be trying to save you from yourself, e.g. rejecting source that contains __rdtsc() / __builtin_ia32_rdtsc(). I'm not sure what the situation is with long NOPs. GCC doesn't (normally?) emit them, just using .p2align directives for the assembler. In 32-bit mode, GAS appears to avoid long NOPs, using either 2-byte xchg ax,ax or pseudo-NOPs like LEA esi,[esi+eiz*1+0x0] that add a cycle of latency to the dep chain involving ESI. Even with -march=haswell, gcc+gas fail to use more efficient long NOPs for padding between functions. --- I'm not sure if CPUID is ever emitted by gcc's back-end directly, only from inline asm. i386/cpuid.h uses inline asm. But __get_cpuid_max() checks if CPUID is even supported in a 386-compatible way, checking if a bit in EFLAGS is sticky or not. If your source code is written safely, you won't have a problem unless possibly __builtin_cpu_init runs CPUID without checking, in programs that use __builtin_cpu_supports() or _is(). __builtin_ia32_rdpmc() and __rdtsc() do *not* check -march= before emitting rdpmc and rdtsc. Neither does __rdtscp(), which is interesting because that instruction is new enough that some still-relevant CPUs don't support it.
__rdpmc() isn't "volatile", though, so stop-start optimizes to 0. (I found this bug looking for existing reports of that issue.) Test cases: https://godbolt.org/z/hqPdza FCMOV and CMOV are also handled correctly, but I didn't write functions for them.

    int fcomi(double x, double y) { return x < y; }

> Proposed switches:
>
> --nocpuid   This option causes the compiler to not generate cpuid opcodes
> --nocmov    This option causes the compiler to not generate cmov opcodes
> --nofcmov   This option causes the compiler to not generate fcmov opcodes
> --nofcomi   This option causes the compiler to not generate fcomi opcodes
> --nonopl    This option causes the compiler to not generate nopl opcodes
> --nordpmc   This option causes the compiler to not generate rdpmc opcodes
> --nordtsc   This option causes the compiler to not generate rdtsc opcodes
>
> Possibly a general switch that is equivalent to all of the above
>
> --nosupplementaryinstructions
>
> Rationale
>
> It is possible that a developer still wants to compile for a particular
> architecture (for example the i486), but does not wish to generate code with
> supplementary instructions (such as cpuid), that may be present on that
> architecture.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #15 from Peter Cordes --- (In reply to Uroš Bizjak from comment #13) > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we > currently don't emit XOR clear in front of these instructions, when they > operate with memory input. They *do* have an output dependency. It might or might not actually be a problem worth clogging the front-end with extra uops to avoid, depending on surrounding code. >.< e.g. ROUNDSD: DEST[127:64] remains unchanged. Thanks, Intel. You'd think by SSE4.1 they would have learned that false dependencies suck, and that it's extremely rare to actually take advantage of this merge behaviour, but no. For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS, which write the full destination register and have identical performance on all CPUs that support them. (Except Silvermont, where roundps/pd have 5c latency vs. 4c for roundss/sd. Goldmont makes them equal.) KNL has faster (V)ROUNDPS/D than ROUNDSS/SD, maybe only because of the SSE encoding? Agner Fog isn't clear, and doesn't have an entry that would match vroundss/sd. Copy-and-round is good for avoiding extra MOVAPS instructions, which can make SSE code front-end bound and reduce the effective size of the out-of-order window. Preserving FP exception semantics for packed instead of scalar register-source:

* if the upper element(s) of the source is/are known 0, we can always do this with sqrt and round, and convert: they won't produce any FP exceptions, not even inexact. (But not rsqrt / rcpps, of course.) This will be the case after a scalar load, so if we need the original value in memory *and* the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions, but not Invalid, which is always raised by SRC = SNaN.
If we can rule out SNaN in the upper elements of the input, we can use ROUNDPS / ROUNDPD. roundps/d can't produce a denormal output. I don't think denormal inputs slow it down on any CPUs, but it's worth checking for cases where we don't care about preserving exception semantics and want to use it with potentially-arbitrary garbage in high elements. rsqrtps can't produce a denormal output because sqrt makes the output closer to 1.0 (reducing the magnitude of the exponent). (And thus neither can sqrtps.) SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that crack 128-bit ops into 64-bit halves are slower: Pentium III, Pentium M, and Bobcat. And Jaguar for sqrt. Also Silvermont is *MUCH* slower for SQRTPD/PS than SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than scalar. But RCPPS can produce a denormal: (double)1.0/FLT_MAX = 2.938736e-39, which is smaller than FLT_MIN = 1.175494e-38. So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that supports them.

* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and Bobcat, Nano 3000, and P4 Prescott/Nocona. By about a factor of 2, enough that we should probably care about it for -mtune=generic. For ss/ps only (not double), also K10 and Jaguar have slower sqrtps than ss. Also in 32-bit mode, P4, Pentium M and earlier Intel, and Atom, are much slower for packed than scalar sqrt. SQRTPD is *faster* than SQRTSD on KNL. (But hopefully we're never tuning for KNL without AVX available.)

* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop, so a big decode stall). Somewhat slower on Goldmont Plus (1 uop but half throughput). Also slower on Nano3000, and slightly slower on Pentium 4 (before and after Prescott/Nocona), and KNL. (But hopefully KNL can always use VRSQRT28PS/PD or scalar.) Pentium M and older again decode as at least 2 uops for packed, same as Bobcat and K8. Same performance for packed vs.
scalar on Jaguar, K10, bdver1-4, Ryzen, Core 2 and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS: packed is slower on K8, bdver1-4 (scalar avoids the shuffle uop), Nano3000, KNL. On Silvermont by just 1 cycle latency (so even a MOVAPS on the critical path would make it equal). Similar on Atom. Slower on CPUs that do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and older. Packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c throughput), Prescott, P4. Much faster on Jaguar (1c vs. 8c throughput, and 1 uop vs. 2). Same speed (but without the false dep) for SnB-family (mostly), Core 2, Ryzen.

Odd stuff Agner reports:
Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c (I guess just zero-padding the significand, no rounding required). pd2ps and sd2ss are equal at 2 uops / 4c latency.
SnB: cvtpd2ps is 1c higher latency than sd2ss.
IvB: ps2pd on IvB is 1c vs. 2c for ss2sd.
On HSW and later things have settled down to e
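The denormal claim in the RCPPS note above is pure arithmetic and easy to verify in C: 1/FLT_MAX is positive but below FLT_MIN, so the float result an approximate reciprocal would produce is subnormal.

```c
#include <assert.h>
#include <float.h>

/* 1/FLT_MAX computed in double (exact enough for this comparison):
 * it is positive but smaller than FLT_MIN, i.e. only representable
 * as a float subnormal -- the case RCPPS can hit. */
static double recip_flt_max(void) {
    return 1.0 / (double)FLT_MAX;
}
```

(rsqrtps avoids this because taking a square root halves the exponent's magnitude, pulling results toward 1.0, exactly as the comment says.)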
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #6 from Peter Cordes --- Oops, these were SD not SS. Getting sleepy >.<. Still, my optimization suggestion for doing both compares in one masked SUB of +-PBCx applies equally. And I think my testing with VBLENDVPS should apply equally to VBLENDVPD. Since this is `double`, if we're going branchless we should definitely be vectorizing for a pair of doubles, like doing xij = X0(1,i) - X0(1,j) and yij = X0(2,i) - X0(2,j) together with a vmovupd, and a vector of PBCx, PBCy. Even if we later need both x and y separately (if those FMAs in the asm are multiplying components of one vector), we might still come out ahead from doing the expensive input processing with PD, then it's only one `vunpckhpd` to get the Y element ready, and that can run in parallel with any x * z stuff Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get {X0,Y0} {Z0,X1} {Y1,Z1}. We get twice the work for a cost of only 3 extra unpacks, doing 2 i and j values at once. If this was 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we need to avoid going off the end), unless we again unroll by 3 = LCM(vec_len, width)
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #5 from Peter Cordes ---

    IF ( xij.GT.+HALf ) xij = xij - PBCx
    IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be true*, we can implement it more efficiently, I think, by checking the magnitude of xij to see if a SUB is needed, and if so figuring out the sign to apply to PBCx.

    if (abs(xij) > HALF) { xij -= PBCx XOR sign_bit(xij); }

    # xij in xmm0
    # PBCx in xmm7
    # HALF in xmm6
    # set1( -0.0f ) in xmm5   (i.e. 1U<<31, a sign-bit mask)
        vandnps   %xmm5, %xmm0, %xmm1   # abs(xij)
        vcmpltps  %xmm1, %xmm6, %xmm1   # HALF < abs(xij)
        vandps    %xmm5, %xmm0, %xmm2   # signbit(xij)
        vxorps    %xmm7, %xmm2, %xmm2   # PBCx (xij>=0) or -PBCx (xij<0)
        vandps    %xmm2, %xmm1, %xmm1   # +-PBCx, or 0.0 if abs(xij) is between -+HALF
        vsubps    %xmm1, %xmm0, %xmm0   # xij -= +PBCx, -PBCx, or 0.0

There's a good amount of ILP here, but the critical path is ANDPS + CMPPS + ANDPS + SUBPS = 10 cycles on Skylake. We might want to use VPAND for some of this on Haswell, to avoid a port 5 bottleneck, at least on the critical path. (Skylake runs FP booleans on any port. BDW and earlier restrict them to port 5, where they can't compete with FMA, and where bypass latency is always optimal. On SKL they can introduce extra bypass latency if they pick p0 or p1.)

    vandps    %xmm5, %xmm0, %xmm2   # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2   # PBCx (xij>=0) or -PBCx (xij<0)

could be replaced with a (v)blendvps using the original xij to select between PBCx and -PBCx. With the SSE encoding, that saves a uop and a cycle of latency (but only off the critical path). And I think it would cost us a vmovaps to set up for it. --- I think this is better than if-conversion of both IFs separately, but I haven't really looked. It should be much better for *latency*. But it's only equivalent if subtracting PBCx can't possibly make xij negative and the next IF condition also true.
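A C intrinsics model of the sequence above, using the double versions to match comment #6's correction that the data is actually double (a sketch with illustrative names; HALF and PBCx become broadcast vectors):

```c
#include <assert.h>
#include <emmintrin.h>
#include <math.h>

/* Branchless periodic-boundary fixup on a pair of doubles:
 * if abs(xij) > HALF, subtract PBCx with xij's sign; else unchanged. */
static __m128d wrap_pbc_pd(__m128d xij, __m128d half, __m128d pbcx)
{
    const __m128d signmask = _mm_set1_pd(-0.0);             /* sign-bit mask  */
    __m128d absx  = _mm_andnot_pd(signmask, xij);           /* abs(xij)       */
    __m128d need  = _mm_cmpgt_pd(absx, half);               /* abs(xij)>HALF  */
    __m128d sgn   = _mm_and_pd(signmask, xij);              /* signbit(xij)   */
    __m128d spbc  = _mm_xor_pd(pbcx, sgn);                  /* +-PBCx         */
    __m128d delta = _mm_and_pd(spbc, need);                 /* +-PBCx or 0.0  */
    return _mm_sub_pd(xij, delta);                          /* xij -= delta   */
}
```

The data flow is identical to the asm: andnot for abs, compare, and+xor to apply the sign to PBCx, mask, subtract.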
--- I was looking at a similar case of applying a fixup if the abs value of an input is outside a range in https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840. I don't think I came up with anything there that's not already obvious or covered by the example above, though. Except if we had needed to square xij at some point, we could have checked xij * xij < HALF*HALF as the bound condition to save the ANDNPS. But then the mulps latency is part of the input to cmpps.
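The same fixup can be modeled in scalar C with copysign, which makes the "figure out the sign to apply to PBCx" step explicit (illustrative name `wrap_pbc`; assumes at most one of the two original IF conditions can fire):

```c
#include <assert.h>
#include <math.h>

/* One magnitude test, then subtract PBCx carrying xij's sign --
 * the scalar equivalent of `xij -= PBCx XOR sign_bit(xij)`. */
static double wrap_pbc(double xij, double half, double pbcx)
{
    if (fabs(xij) > half)
        xij -= copysign(pbcx, xij);   /* -= +PBCx or -PBCx */
    return xij;
}
```

Compilers can if-convert this form, since the only conditional work is selecting between `delta` and 0.0.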
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #4 from Peter Cordes --- I suspect dep-chains are the problem, and branching to skip work is a Good Thing when it's predictable. (In reply to Richard Biener from comment #2) > On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better. > On Bulldozer it also isn't that bad (comparable to Skylake I guess). SKL: AVX VBLENDVPS x,x,x,x is 2 uops, 2c latency, ~1c throughput. (Same for ymm.) SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop, 1c latency, ~0.36c throughput in my testing, or maybe 0.333c with breaking dep chains. (IDK how Agner got 1c. Maybe that was an editing mistake, and he copied the 1c from the VEX version.) [V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume because they only have 3 register operands (including implicit XMM0). But the VEX encoding has 4 operands: 1 output and 3 inputs. I think this is too many for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on Skylake. (The blend-control register is encoded by an imm8 in the VEX version instead of implicit xmm0, but I don't think that's what stops the decoders from making it 1 uop. I think it's simply having 4 total operands.) On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of p015 (instead of only p5 on BDW and earlier), but the 2-uop VEX version is still 2 cycle latency. The VEX version has a bias towards port 5, but less than half the total uops run on p5, so it's not p015 + p5. The SSE version seems equally distributed to all of p015. On SKL, the optimal choice might be to use the SSE encoding, if we can deal with a destructive destination and having the blend control in xmm0. The SSE/AVX penalty on SKL is output dependencies for write-only SSE instructions (like movaps or cvtps2dq) writing to an XMM register that has a dirty upper 128. It's a per-register thing, not like Haswell where it triggers a slow state change.
(https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake) --- Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back, even though it's only 2 uops that can run on any of 3 ports. So why isn't it 0.66c throughput? VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some front-end effect. In an unrolled loop with 20 vblendvps (with no loop-carried dependencies), there are a negligible number of cycles where the front-end delivered the full 4 uops. Most cycles only 2 are issued. This is not a general problem for 2-uop instructions or anything: 9x bextr + dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the only cycle to not issue 4 uops being (I think) the group of 3 including the loop branch. Playing around with other 2-uop instructions, I didn't see front-end bottlenecks. I saw some back-end bottlenecks because other 2-uop instructions aren't so nicely distributed over ports, but perf counts for idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles. (It counts when either the FE delivers 4 uops, or the back end was stalled and thus it's not the front-end's fault.) A 1-uop instruction following a vblendvps can issue with it in the same cycle, so this effect is probably not horrible for normal cases where we're using vblendvps mixed with normal instructions. I haven't investigated further, whether this is a front-end effect (uop-cache fetch problem?) or an allocation bottleneck. Possibly being a 4-operand instruction has something to do with it, although each individual uop can't have that many operands, I don't think.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #10 from Peter Cordes --- (In reply to Uroš Bizjak from comment #9) > There was similar patch for sqrt [1], I think that the approach is > straightforward, and could be applied to other reg->reg scalar insns as > well, independently of PR87007 patch. > > [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html Yeah, that looks good. So I think it's just vcvtss2sd and sd2ss, and VROUNDSS/SD, that aren't done yet. That patch covers VSQRTSS/SD, VRCPSS, and VRSQRTSS. It also bizarrely uses it for VMOVSS, which gcc should only emit if it actually wants to merge (right?). *If* this part of the patch isn't a bug:

    -    return "vmovss\t{%1, %0, %0|%0, %0, %1}";
    +    return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from mov-elimination, and doesn't need a port-5-only ALU uop). Same for vmovsd of course.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #8 from Peter Cordes --- Created attachment 45544 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit testloop-cvtss2sd.asm (In reply to H.J. Lu from comment #7) > I fixed assembly codes and run it on different AVX machines. > I got similar results: > > ./test > sse : 28346518 > sse_clear: 28046302 > avx : 28214775 > avx2 : 28251195 > avx_clear: 28092687 > > avx_clear: > vxorps %xmm0, %xmm0, %xmm0 > vcvtsd2ss %xmm1, %xmm0, %xmm0 > ret > > is slightly faster. I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect where adding any extra uop makes a difference. Or just chance of code alignment for the uop-cache (32-byte or maybe 64-byte boundaries). You're still testing with the caller compiled without optimization. The loop is a mess of sign-extension and reloads, of course, but most importantly keeping the loop counter in memory creates a dependency chain involving store-forwarding latency. Attempting a load later can make it succeed more quickly in store-forwarding cases, on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is reducing the average latency of the store/reloads for the loop counter (which is probably the real bottleneck.) https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization Loads are weird in general: the scheduler anticipates their latency and dispatches uops that will consume their results in the cycle when it expects a load will put the result on the forwarding network. But if the load *isn't* ready when expected, it may have to replay the uops that wanted that input. See https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop for a detailed analysis of this effect on IvyBridge. (Skylake doesn't have the same restrictions on stores next to loads, but other effects can cause replays.) 
https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810 is an interesting case for pointer-chasing where the load port speculates that it can use the base pointer for TLB lookups, instead of the base+offset. https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming shows load replays on cache misses. So there's a huge number of complicating factors from using a calling loop that keeps its loop counter in memory, because SnB-family doesn't have a simple fixed latency for store forwarding. If I put the tests in a different order, I sometimes get results like:

    ./test
    sse      : 26882815
    sse_clear: 26207589
    avx_clear: 25968108
    avx      : 25920897
    avx2     : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than avx_clear or avx2, but there's a ton of noise. Adding vxorps %xmm2, %xmm2, %xmm2 to avx.S also seems to have sped it up; now it's the same speed as the others, even though I'm *not* breaking the dependency chain anymore. XMM2 is unrelated; nothing touches it. This basically proves that your benchmark is sensitive to extra instructions, whether they interact with vcvtsd2ss or not. We know that in the general case, throwing in extra NOPs or xor-zeroing instructions on unused registers does not make code faster, so we should definitely distrust the result of this microbenchmark. I've attached my NASM loop. It has various commented-out loop bodies, and notes in comments on results I found with performance counters. I don't know if it will be useful (because it's a bit messy), but it's what I use for testing snippets of asm in a static binary with near-zero startup overhead. I just run perf stat on the whole executable and look at cycles / uops.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #6 from Peter Cordes --- (In reply to Peter Cordes from comment #5) > But whatever the effect is, it's totally unrelated to what you were *trying* > to test. :/ After adding a `ret` to each AVX function, all 5 are basically the same speed (compiling the C with `-O2` or -O2 -march=native), with just noise making it hard to see anything clearly. sse_clear tends to be faster than sse in a group of runs, but if there are differences it's more likely due to weird front-end effects and all the loads of inputs + store/reload of the return address by call/ret. I ran the test in a shell loop to factor out CPU clock-speed ramp-up and maybe some cache-warmup stuff, but it's still noisy from run to run. Making printf/write system calls between tests will cause TLB / branch-prediction effects because of kernel Spectre mitigation, so I guess every test is in the same boat, running right after a system call. Adding loads and stores into the mix makes microbenchmarking a lot harder. Also notice that since `xmm0` and `xmm1` pointers are global, those pointers are reloaded every time through the loop even with optimization. I guess you're not trying to minimize the amount of work outside of the asm functions, to measure them as part of a messy loop. So for the versions that have a false dependency, you're making that dependency on the result of this:

    mov     rax, QWORD PTR [rip+0x2ebd]     # reload the xmm1 pointer
    vmovapd xmm1, XMMWORD PTR [rax+rbx*1]   # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than VPXOR+convert, except in cases where adding an extra uop actually helps, or where code-alignment effects matter.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #5 from Peter Cordes --- (In reply to H.J. Lu from comment #4) > (In reply to Peter Cordes from comment #2) > > Can you show some > > asm where this performs better? > > Please try cvtsd2ss branch at: > > https://github.com/hjl-tools/microbenchmark/ > > On Intel Core i7-6700K, I got I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I. -c -o test.o test.c
> gcc -g -c -o sse.o sse.S
> gcc -g -c -o sse-clear.o sse-clear.S
> gcc -g -c -o avx.o avx.S
> gcc -g -c -o avx2.o avx2.S
> gcc -g -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse      : 24533145
> sse_clear: 24286462
> avx      : 64117779
> avx2     : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); the AVX functions fall through into each other, then into __libc_csu_init before jumping around and eventually returning. That's why they're much slower. Single-step through the loop in GDB...

 │0x5660                      vcvtsd2ss xmm0,xmm0,xmm1
>│0x5664                      nop    WORD PTR cs:[rax+rax*1+0x0]
 │0x566e                      xchg   ax,ax
 │0x5670                      vcvtsd2ss xmm0,xmm1,xmm1
 │0x5674                      nop    WORD PTR cs:[rax+rax*1+0x0]
 │0x567e                      xchg   ax,ax
 │0x5680                      vxorps xmm0,xmm0,xmm0
 │0x5684                      vcvtsd2ss xmm0,xmm0,xmm1
 │0x5688                      nop    DWORD PTR [rax+rax*1+0x0]
 │0x5690 <__libc_csu_init>    endbr64
 │0x5694 <__libc_csu_init+4>  push   r15
 │0x5696 <__libc_csu_init+6>  mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop bottlenecks on the store/reload latency of keeping a loop counter in memory (because you compiled the C without optimization). Plus, the C caller loads write-only into XMM0 and XMM1 every iteration, breaking any loop-carried dependency the false dep would create. I'm not sure why it makes a measurable difference to run the extra NOPs, and 3x vcvtsd2ss instead of 1 for avx() vs. 
avx_clear(), because the C caller should still be breaking dependencies for the AVX-128 instructions. But whatever the effect is, it's totally unrelated to what you were *trying* to test. :/
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #3 from Peter Cordes --- (In reply to H.J. Lu from comment #1) > I have a patch for PR 87007: > > https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html > > which inserts a vxorps at the last possible position. vxorps > will be executed only once in a function. That's talking about the mem,reg case, which like I said is different. I reported Bug 80571 a while ago about the mem,reg case (or gp-reg for si2ss/d), so it's great that you have a fix for that, doing one xor-zeroing and reusing that register as a merge target for a whole function / loop. But this bug is about the reg,reg case, where I'm pretty sure there's nothing to be gained from xor-zeroing anything. We can fully avoid any false dep just by choosing both source registers = src, making the destination properly write-only. If you *have* an xor-zeroed register, there's no apparent harm in using it as the merge-target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's no benefit either vs. just setting both source registers the same. So whichever is easier to implement, but ideally we want to avoid introducing a vxorps into functions / blocks that don't need it at all.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #2 from Peter Cordes --- (In reply to H.J. Lu from comment #1) > But
>
>   vxorps    %xmm0, %xmm0, %xmm0
>   vcvtsd2ss %xmm1, %xmm0, %xmm0
>
> are faster than both.

On Skylake-client (i7-6700K), I can't reproduce this result in a hand-written asm loop. (I was using NASM to make a static executable that runs a 100M-iteration loop so I could measure with perf.) Can you show some asm where this performs better? vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination being an xor-zeroed register. (Either zeroed outside the loop, or inside, or once per 4 converts with an unrolled loop.) I can't construct a case where vcvtsd2ss %xmm1, %xmm1, %xmm0 is worse in any way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst = middle source. I wasn't mixing it with other instructions other than VXORPS, but I don't think anything is going to get rid of its 2nd uop, and choosing both inputs = the same source removes any benefit from dep-breaking the output. If adding a VXORPS helped, it's probably due to some other side-effect. Could the effect you saw have been due to code-gen changes for memory sources, maybe vxorps + vcvtsd2ss (mem), %xmm0, %xmm0 vs. vmovsd + vcvtsd2ss %xmm1, %xmm1, %xmm0? (Those should be about equal, but memory-source SS2SD is cheaper: no port5 uop.) BTW, the false-dependency effect is much more obvious with SS2SD, where the latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS. Even without dependency-breaking, repeated vcvtsd2ss %xmm1, %xmm0, %xmm0 can run at 1 per clock (same as with dep breaking), because the port-5 uop that merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd. So latency from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle. With dep-breaking, they both still bottleneck on the port5 uop if you're doing nothing else.
[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586 Peter Cordes changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #1 from Peter Cordes --- Fixed for vsqrtss/sd somewhere in 9.0, but not 8.2. https://godbolt.org/z/0Gxf05. The general case of one-input scalar xmm,xmm instructions like vcvtss2sd is still all over the place, with false deps or wasted xor-zeroing. Reported that as bug 89071. It seems only VSQRTsd/ss itself was fixed for this; sorry I didn't think of checking for other one-input instructions when I reported this.
[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 Bug ID: 89071 Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double Product: gcc Version: 9.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

float cvt(double unused, double xmm1) { return xmm1; }

g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental):
    vxorps    %xmm0, %xmm0, %xmm0
    vcvtsd2ss %xmm1, %xmm0, %xmm0    # merge into XMM0

clang 7.0:
    vcvtsd2ss %xmm1, %xmm1, %xmm0    # both sources are from XMM1, no false dep

gcc already uses this trick for SQRTSS/SD, but not for float<->double conversion. I haven't checked all the other scalar instructions, but roundss for floor() does neither and has a false dependency. (i.e. it chooses the output register as the merge-target, not the actual input.)

    return floorf(x);  ->  vroundss $9, %xmm1, %xmm0, %xmm0

Some testcases: https://godbolt.org/z/-rqUVZ --- In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output dependency because of Intel's short-sighted ISA design optimizing for Pentium-III's 64-bit SIMD: zero-extending to fill the destination XMM register would have cost an extra uop to write the upper half of the destination. For consistency(?), SSE2 scalar instructions (new with Pentium 4, which had 128-bit SIMD execution units / register file) have the same behaviour of merging into the low 64 bits of the destination, even conversion between double and float between two xmm registers, which didn't exist before SSE2. (Previously conversion instructions were only between float in XMM and integers in scalar or MMX regs, or packed-integer <-> ps, which filled the whole XMM reg and thus avoided a false dependency.) (Fortunately this isn't a problem for 2-input instructions like ADDSS: the operation already depends on both registers.) 
--- The VEX encoding makes the merge-target separate from the actual destination, so we can finally avoid false dependencies without wasting an instruction breaking it. (When the source is already in an XMM register). For instructions where the source isn't an XMM register (e.g. memory or integer reg for int->FP conversions), one zeroed register can be used as a read-only merge target by any number of scalar AVX instructions, including in a loop. That's bug 80571. (It's unfortunate that Intel didn't take the opportunity to give the AVX versions subtly different semantics, and zero-extend into the target register. That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on Sandybridge-family. IDK if they didn't think of that, or if they wanted strict consistency with the semantics of the SSE version, or if they thought decoding / internals would be easier if they didn't have to omit the merge-into-destination part of the scalar operation. At least they made the extra dependency an explicit input, so we can choose a register other than the destination, but it's so rarely useful to actually merge into the low 64 or 32 of another reg that it's just long-term harmful to gimp the ISA with an extra dependency for these instructions, especially integer->FP.) (I suspect that most of the dep-breaking gcc does isn't gaining any speed, but the trick is figuring out when we can omit it while being sure that we don't couple things into one big loop-carried chain, or serialize some things that OoO exec could otherwise benefit from hiding. Within one function with no calls, we might be able to prove that a false dep isn't serializing anything important (e.g. if there's already enough ILP and something else breaks a dep on that register between loop iterations), but in general it's hard if we can't pick a register that was already part of the dep chain that led to the input for this operation, and thus is harmless to introduce a dep on.) 
Relevant instructions that can exist in scalar xmm,xmm form:

VROUNDSS/SD (gcc leaves a false dep, clang gets it right)
VSQRTSS/SD (gcc already gets this right)
VRCPSS, VRSQRTSS: haven't checked
[V]CVTSS2SD xmm,xmm (Skylake: SRC1/output dependency is a separate 1c latency 32-bit merge uop). The memory-source version is still 2 uops.
[V]CVTSD2SS xmm,xmm (Skylake: SRC1/output dependency is the main 4c conversion uop; the extra uop is first, maybe extracting 32 bits from the src?). The memory-source version of [V]CVTSD2SS is only 1 uop!

So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg version is a bad idea for CVTSD2SS. It's actually much better to PXOR and then CVTSD2SS (me
[Bug target/89063] [x86] lack of support for BEXTR from BMI extension
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1 from Peter Cordes --- Unfortunately Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput. On AMD Excavator/Ryzen, it's 1 uop with 1c latency. On Steamroller and earlier, it's 2 uops but 1c latency. (I assume that's latency from the non-control input to the output. So maybe one of the uops pre-processes the control input, otherwise you'd expect 2c latency from either operand.) Ryzen dropped support for AMD TBM, so only Excavator (bdver4) has 1-uop bextr imm16 which would avoid the need for mov reg,imm32 with the control operand. But mov-imm + bextr can still be a win on Ryzen, lower latency than RORX+AND BMI2 RORX is single-uop on all CPUs that support it. If we already need a 2nd uop to mask anyway, we can use RORX+AND-immediate to duplicate the functionality and performance of BEXTR-immediate, with the smaller code-size if the AND-mask fits in an imm8. (5+5 vs. 6+3 or 6+4 if the AND needs a REX) Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage mov-immediate+bextr has (on Intel) over mov-reg+shift+and is that can deal with wide bitfields using a count instead of an immediate AND mask. (Especially if it doesn't fit in 32 bits). If you can reuse the same control-register in a loop, BEXTR is good-ish for copy-and-extract. PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2. But PEXT is extremely slow on Ryzen (7 uops, 18c lat and tput). So for 32-bit constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than BEXTR on Intel. movabs imm64 is too big and can cause front-end problems (slower to read from the uop cache, if that effect from Sandybridge is still present on Haswell/Skylake), and has no advantage vs. 
RORX + AND unless the bitfield you're extracting is wider than 32 bits. PEXT has 3 cycle latency, though, and can only run on port 1 on SnB-family. (All integer uops with latency > 1 are p1-only.) It's potentially good for throughput, but worse than RORX+AND for latency. Unfortunately x86 bitfield instructions are pretty weak compared to ARM / AArch64 ubfx or PowerPC rlwinm and friends, where the bit-positions are simply specified as immediates. Only AMD's immediate version of BEXTR (1 uop on Excavator) matched them. Having a bunch of different control operands for BEXTR or PEXT in registers might be usable in a loop, but is a lot less often useful than immediate controls.

   0: c4 e3 fb f0 c7 2a    rorx   $0x2a,%rdi,%rax    # $(64-22)
   6: c4 e3 fb f0 d7 35    rorx   $0x35,%rdi,%rdx    # $(64-11)
   c: 83 e7 3f             and    $0x3f,%edi
   f: 83 e0 3f             and    $0x3f,%eax
  12: 83 e2 3f             and    $0x3f,%edx
  15: 01 f8                add    %edi,%eax    # 32-bit operand-size because we can prove it can't overflow
  17: 01 d0                add    %edx,%eax    # missed optimization in both gcc's versions.
  19: c3                   retq

Not counting the ret, this is 7 uops for Skylake and Ryzen. **I'm pretty sure this is our best bet for -march=skylake, and for tune=generic -mbmi2** The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes larger. (Not counting the savings from avoiding a REX prefix on the ADD instructions; that missed optimization applies equally to both.) OTOH, the critical-path latency for BEXTR on Ryzen is better by 1 cycle, so we could still consider it for -march=znver1, or for tune=generic -mbmi without BMI2. The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax` instruction; it *should* be 9 uops for all normal CPUs. --- With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR version. It's not worse than the mov+shr+and version on SnB-family or bd/zn, and it's better on some AMD. And it's probably smaller code-size. 
And in future if Intel designs CPUs that can handle BEXTR as a single uop with 1c latency, mov+bextr will become good-ish everywhere. For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte, because AND $imm32, %r32 is 6 bytes long instead of 3.
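For reference, here is a C source that plausibly matches the RORX+AND disassembly shown earlier. The field layout is a guess (three 6-bit fields, at bit offsets 0, 42 and 53, which is what rorx $0x2a / rorx $0x35 followed by and $0x3f would extract), not the actual testcase:

```c
#include <stdint.h>

/* Guessed C source for the disassembly above: extract three 6-bit
 * bitfields of a uint64_t and sum them.  Each of the two shifted extracts
 * can compile to RORX (to bring the field down to bit 0) + AND imm8. */
static unsigned sum_fields(uint64_t x) {
    unsigned a = (unsigned)( x        & 0x3f);
    unsigned b = (unsigned)((x >> 42) & 0x3f);  /* rorx $42 = $(64-22) */
    unsigned c = (unsigned)((x >> 53) & 0x3f);  /* rorx $53 = $(64-11) */
    return a + b + c;   /* 32-bit adds: the sum provably can't overflow */
}
```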
[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #4 from Peter Cordes --- The VPAND instructions in the 256-bit version are a missed-optimization. I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make it a win in more cases.) https://godbolt.org/g/2dfkNV Loads are folding into the shifts now, unlike with gcc7.3. (But they can't micro-fuse because of the indexed addressing mode. A pointer increment might save 1 front-end uop even in the non-unrolled loop) The separate integer loop counter is gone, replaced with a compare against an end-index. But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. Fewer instructions and (more importantly) 1/3 the shuffle uops. GCC knows how to do this for the 256-bit version, so it's apparently a failure of the cost-model that it doesn't for the 512-bit version. (Maybe requiring a shuffle-control vector instead of immediate puts it off? Or maybe it's counting the cost of the useless vpand instructions for the pack / permq option, even though they're not part of the shuffle-throughput bottleneck?) We do use vpackuswb + vpermq for 256-bit, but we have redundant AND instructions with set1_epi16(0x00FF) after a right shift already leaves the high byte zero. --- Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu. GCC should be using the VEX encoding of an instruction whenever it does exactly the same thing. At least we didn't use vpandd or vpandq EVEX instructions. (I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a store with no masking. Hopefully it's efficient.)
[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #3 from Peter Cordes --- I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make it a win in more cases.) https://godbolt.org/g/2dfkNV Loads are folding into the shifts now, unlike with gcc7.3. (But they can't micro-fuse because of the indexed addressing mode. A pointer increment might save 1 front-end uop even in the non-unrolled loop) The separate integer loop counter is gone, replaced with a compare against an end-index. But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. Fewer instructions and (more importantly) 1/3 the shuffle uops. GCC knows how to do this for the 256-bit version, so it's apparently a failure of the cost-model that it doesn't for the 512-bit version. (Maybe requiring a shuffle-control vector instead of immediate puts it off? Or maybe it's counting the cost of the useless vpand instructions for the pack / permq option, even though they're not part of the shuffle-throughput bottleneck?) We do use vpackuswb + vpermq for 256-bit, but we have redundant AND instructions with set1_epi16(0x00FF) after a right shift already leaves the high byte zero. --- Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu. GCC should be using the VEX encoding of an instruction whenever it does exactly the same thing. At least we didn't use vpandd or vpandq EVEX instructions. (I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a store with no masking. Hopefully it's efficient.)
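For reference, the scalar loop being vectorized here looks something like the following (the shift count of 8 is an assumption, chosen because it is what makes the masking redundant):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar version of the loop under discussion (assumed shift count of 8):
 * after >> 8 the high byte of each uint16_t element is already zero, so
 * the set1_epi16(0x00FF) AND that GCC emits before vpackuswb is redundant. */
static void narrow_shift(uint8_t *dst, const uint16_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] >> 8);
}
```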
[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352 Bug ID: 86352 Summary: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos Product: gcc Version: 9.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* The wrong-code bug 86314 also revealed some very weird code-gen decisions, which the fix didn't improve. (I think the lock bts peephole is seen pretty late, and that's one necessary factor for this problem. But even without it, an unnecessary data dependency between the lock bts loop and clearing memory is silly.) This ended up being about 5 separate bugs, but IDK which belong together or are already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower and creating a data dependency
* Doing that inside the loop instead of after
* Not adjusting register allocation to allow xor / set-flags / setc
* rep stos vs. vector stores as a zeroing strategy vs. any other repeated value.

The reproducer test-case for bug 86314 loops until it finds and claims a zero bit in a uint64_t, then returns a Bucket() object (with a constructor that zero-initializes it) with no data dependency on anything. But gcc decides to introduce a flag -> integer 0/1 inside the acquire() loop instead of just using xor eax,eax before rep stosq. The loop can only exit when CF = 0, so RAX = 0, so it's not a correctness problem. The loop is branching on CF as set by BTS, so there's no need to have the 0/1 in a register at all inside the loop, and setc/movzx from a known-zero CF is more expensive than xor-zeroing. (Plus it gives the STOSQ a data dependency on the LOCK BTS flag result which it wouldn't have otherwise. The stores can't commit until after the lock memory barrier, but they can execute.) 
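In portable terms, the loop being compiled is something like this C11 sketch (names invented; atomic_fetch_or standing in for lock bts, with the CF test becoming a check of the returned old value):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hedged sketch of the acquire() loop (invented names, not the reproducer):
 * keep trying to claim a zero bit at position 0 or 1 of a shared word.
 * atomic_fetch_or is the portable stand-in for lock bts; a clear bit in the
 * returned old value corresponds to "CF clear, we claimed it".
 * Note the function's return value has no data dependency on anything:
 * that is why a zero for the later rep stosq needs no setc/movzx. */
static unsigned acquire_low_bit(_Atomic uint64_t *word) {
    for (;;) {
        uint64_t idx = atomic_load(word) & 1;   /* only positions 0 or 1 */
        uint64_t old = atomic_fetch_or(word, UINT64_C(1) << idx);
        if (!(old & (UINT64_C(1) << idx)))      /* bit was zero: claimed */
            return (unsigned)idx;
    }
}
```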
This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627 https://godbolt.org/g/XGF5tR

BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx         # useless, lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax           # source is simplified to only check positions 0 or 1
        lock btsq %rax, (%rcx)     # Why not (%rsi)?
        setc    %al
        movzbl  %al, %eax          # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 because the loop can only exit when CF=0
        # should use xor %eax,%eax here instead
        movq    %rdx, %rdi         # Useless, RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax         # can't be done before rep stosq: RAX needs to be 0
        ret

With -m32, where 64-bit lock bts isn't available, we have lock cmpxchg8b ending with an OR. So there is a zero in an integer register from that, but it's not in EAX, so the code gen includes an extra `mov %esi, %eax`, which is not cheaper than xor %eax,%eax, especially with -march=haswell. Sandybridge-family has xor-zeroing as cheap as a NOP, but mov-elimination isn't always perfect and SnB itself doesn't have it. And of course mov still has a data dependency on the source of the zero, so it defeats the effect of branch prediction + speculatively breaking (control) dependencies. This last applies on any out-of-order x86. I guess the lock bts peephole is seen too late to notice that it can't recycle the 0 from the loop condition anymore, and ends up generating code to materialize it. But why inside the loop? -- Even if we *did* need an integer 0/1 in a register inside the loop, we could still use the xor / set-flags / setcc optimization: simply use a register other than RAX for the load / AND $1 / bts source. And you can hoist the xor-zeroing out of the loop.

        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq %rcx, (%rsi)
        setc    %al                # use %rax
        jc      .L2

--- Separately: If the initializer is non-zero, it uses SSE or AVX stores. That makes no sense either: if rep stosq is optimal, use mov eax, 1 for the all-ones case. 
(See the ifdef in the Godbolt link to try it) If it's not optimal, use xorps xmm0,xmm0 to create an all-zero vector. I guess gcc is checking for all-zeros as a common special case, but doesn't check for repeats of any other value, except for repeated bytes recognized as memset. So it makes sense that gcc uses a different strategy, but I think for only 16x 8 bytes (128 bytes) that vector stores beat rep stos on current CPUs. (That may change when IceLake introduces fast
[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 --- Comment #5 from Peter Cordes --- AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop replacement for vpinsrq/d, which is 2 uops on Intel/AMD. See my answer on https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs. I don't have access to real hardware, but according to reported uop counts, this should be very good: 1 uop per instruction on Skylake-avx512 or KNL

    vmovq        xmm0, rax                     # 1 uop p5, 2c latency
    vpbroadcastq xmm0{k1}, rdx   ; k1 = 0b0010 # 1 uop p5, 3c latency
    vpbroadcastq ymm0{k2}, rdi   ; k2 = 0b0100 # 1 uop p5, 3c latency
    vpbroadcastq ymm0{k3}, rsi   ; k3 = 0b1000 # 1 uop p5, 3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64 (For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine 256-bit halves. Also means only 3 k registers instead of 7)

    vpbroadcastq zmm0{k4}, rcx   ; k4 = 0b10000  # 3c latency
    ... filling up the ZMM reg

Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 0b0010       ; = 2
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2
    # KSHIFTLW k4, k1, 3
    ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer registers would just cost extra instructions to set up integer regs first. It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks. We could start with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs. mov eax,2 / kmovw k1, eax being p0156 + p5. Loading k registers from memory is not helpful: according to IACA, it costs 3 uops. (But that includes p237, and a store-AGU uop makes no sense, so it might be wrong.)
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #14 from Peter Cordes --- I happened to look at this old bug again recently. re: extracting the low two 32-bit elements: (In reply to Uroš Bizjak from comment #11)

> > Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other
> > CPUs where an FP shuffle has bypass delay between integer ops)
> >
> > movd %xmm0, %eax
> > movshdup %xmm0, %xmm0 # saves 1B of code-size vs. psrldq, I think.
> > movd %xmm0, %edx
> >
> > Or without SSE3,
> >
> > movd %xmm0, %eax
> > psrldq $4, %xmm0 # 1 m-op cheaper than pshufd on K8
> > movd %xmm0, %edx
>
> The above two proposals are not suitable for generic moves. We should not
> clobber input value, and we are not allowed to use temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit elements, so

    movshdup %xmm0, %xmm1
    movd     %xmm1, %eax

saves a byte of code vs. pshufd / movd, and saves a uop on Merom and avoids a flt->int. (According to Agner Fog's tables, pshufd is flt->int domain, i.e. it wants input in the float domain. While movshdup ironically is only an integer shuffle.) Probably not worth looking for that optimization, though, because it's not worth using universally (Nehalem has worse latency for float shuffles between int instructions). With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom / K8 (slow-shuffle CPUs where PSHUFD is multiple uops). It's not slower on any current CPUs. I could imagine some future CPU having better throughput for 32-bit element-size shuffles than 16-bit, though. That's already the case for wider lane-crossing shuffles (VPERMW YMM is multiple uops on Skylake-AVX512). This would be a definite win for tune=core2 or k8, and Pentium M, but those are so old it's probably not worth adding extra code to look for it. I think it's pretty future-proof, though, unless Intel or AMD add an extra shuffle unit for element sizes of 32-bit or wider on another port.
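The operation being code-genned in all of these sequences corresponds to this C (the -m32 case where an int64_t lives in an XMM register):

```c
#include <stdint.h>

/* The operation under discussion: split an int64_t (which with -m32 -msse2
 * sits in an XMM register) into its low and high 32-bit halves.  The low
 * half is one movd; the high half needs a shuffle (movshdup / psrldq /
 * pshufd) before its movd. */
static void split64(int64_t v, uint32_t *lo, uint32_t *hi) {
    *lo = (uint32_t)v;
    *hi = (uint32_t)((uint64_t)v >> 32);
}
```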
[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 --- Comment #5 from Peter Cordes --- Update: https://godbolt.org/g/ZQDY1G gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.

void rangecheck_var(int64_t x, int64_t lim2) {
    //lim2 >>= 60;
    lim2 &= 0xf;  // let the compiler figure out the limited range of limit
    if (x>=0 && x<lim2) ext();
}

void rangecheck_const(int64_t x) {
    if (x>=0 && x<=(INT_MAX-1)) ext();
}
// clang and gcc use 2 branches
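The transformation gcc7/8 performs rests on a standard identity, sketched here in isolation:

```c
#include <stdint.h>

/* The unsigned-compare trick gcc7/8 applies: when lim is known to be
 * non-negative, (x >= 0 && x < lim) collapses to one unsigned comparison,
 * because a negative x reinterpreted as unsigned becomes huge and fails
 * the < lim test on its own. */
static int in_range(int64_t x, int64_t lim) {
    return (uint64_t)x < (uint64_t)lim;
}
```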
[Bug tree-optimization/84011] Optimize switch table with run-time relocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #13 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling. > Why should gcc duplicate that? Because gcc would benefit from knowing if merging makes the total block of strings for a switch() table short enough to use a uint8_t offset[] instead of uint16_t. If we don't know at compile time, we'd have to be conservative and potentially use a wider offset table. (Although as Joseph points out https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585#c2, without more linker support for this we could end up missing out on literal merging across compilation units. So perhaps a first step in applying this idea would be to use 32-bit offsets from the start of the .rodata.str1.1 section, so we can still let the linker merge strings and end up with them non-contiguous without having to force the one that gets kept to be the one that's part of our block of strings.)
[Bug tree-optimization/84011] Optimize switch table with run-time relocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #12 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > (In reply to Peter Cordes from comment #9) > > gcc already totally misses optimizations here where one string is a suffix > > of another. "mii" could just be a pointer to the 3rd byte of "sgmii", but > > we instead duplicate all the characters. That's where major savings are > > possible for this function. > > ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling. > Why should gcc duplicate that? Oops, right, I was only looking at gcc's asm output, didn't check an actual linked binary. Will the linker currently catch a case like this?

.LC_base:
.LC2:   .string "mii"
.LC3:   .string "gmii"
table:  .byte .LC2 - .LC_base, .LC3 - .LC_base

and drop .string "mii" entirely + rewrite the table to

        .byte .LC3+1 - .LC_base, .LC3 - .LC_base

(This discussion should probably be happening on bug 85585.) Sorry, I don't know the actual mechanism by which gcc signals to the linker that it can / can't merge. I guess only in some sections? Because gcc couldn't allow it if it was emitting an array like this, where dropping a string would change the offsets for later data and break offset calculations:

const struct { char str[11]; } table[] = { {"mii"}, {"gmii"} };
[Bug tree-optimization/84011] Optimize switch table with run-time relocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #9 from Peter Cordes --- (In reply to rguent...@suse.de from comment #4) > An optimization would be to > add an indirection by, say, only recording the constant offset > into an "array of strings" in the table, thus effectively > > "case1\0case2\0..."[CSWITCH[i]] > > which would require only a relocation to access the single string > constant. But it would prohibit cases of string merging within > those strings unless we implement that as well for this optimization. gcc already totally misses optimizations here where one string is a suffix of another. "mii" could just be a pointer to the 3rd byte of "sgmii", but we instead duplicate all the characters. That's where major savings are possible for this function. > Note this might be profitable unconditionally, not just with -fpie/pic > as the CSWITCH table would be smaller (dependent on the total > size of the merged string). Indeed, I wrote up bug 85585 with ideas for optimizing this. A table of byte or uint16_t offsets into a static buffer of packed strings looks good for PIC and for position-dependent. To avoid any runtime relocations, all you need is the ability to get a static address into a register (e.g. RIP-relative LEA) and do an indexed load relative to it, just like using a normal static char[]. Then add the load result to that address. Runtime relocation is nice to avoid even if you don't *need* to avoid it. Also possible is padding each string out to a constant length and calculating an index into that, removing a level of indirection. (Good when strings are similar length and/or all short, and there aren't many strings that are duplicates or suffixes of others.) Again you just need to get a static address into a register, and add it to 11*enum_value. This is all ADD + LEA (with one of them being RIP-relative).
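The byte-offset idea sketched in C, with a hypothetical 3-entry table (not the real phy_modes data) that also exploits the suffix sharing described above — "gmii" and "mii" live inside "sgmii":

```c
#include <stdint.h>
#include <string.h>

/* Byte offsets into one packed string blob instead of a table of pointers:
 * no runtime relocations, and suffix strings cost zero extra bytes.
 * Hypothetical 3-entry example; names invented. */
static const char pool[] = "sgmii";         /* "gmii" = pool+1, "mii" = pool+2 */
static const uint8_t off[] = { 2, 1, 0 };   /* mii, gmii, sgmii */

static const char *mode_name(unsigned i) {
    return pool + off[i];   /* load a byte, add the (RIP-relative) base */
}
```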
[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

--- Comment #1 from Peter Cordes ---
By comparison, the no-PIE table of pointers only needs one instruction:

    movq    CSWTCH.4(,%rdi,8), %rax

So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a massive savings in data size. clang -fPIE compiles the plain switch to the obvious / sane 2-instruction sequence, which should be our baseline for normal cases:

    # clang 6.0 -fPIE -O3 (switch compilers on the Godbolt link)
    leaq    .Lswitch.table.phy_modes(%rip), %rcx
    movq    (%rcx,%rax,8), %rax

Clang is willing to make a table that needs relocations for the entries. (My suggestions all avoid that because they're based on offsets, not a table of pointers. Avoiding rodata relocations that dirty a page and prevent sharing has some non-zero value, although it's low on many architectures where memory is cheap.)
[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

            Bug ID: 85585
           Summary: switch to select a string based on an enum can
                    profitably optimize away the table of pointers/offsets
                    into fixed-length char[] blocks. Or use byte offsets
                    into a string table
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 84011 shows some really silly code-gen for PIC code, and discussion there suggested using a table of offsets instead of a table of actual pointers, so you only need one base address.

A further optimization is possible when the strings are all similar length, and/or the longest one isn't much longer than a pointer: pad all strings to the same length with trailing 0 bytes, and calculate a pointer instead of loading it from an array. This removes the possibility of multiple entries sharing the same suffix (a missed optimization gcc wasn't doing anyway), but avoids needing any space for storing pointers in memory at all.

In the case discussed in bug 84011 (Linux's phy.h, const char *phy_modes(phy_interface_t interface)), the longest strings are 11 bytes (including the \0), and there are 23 of them. So it takes 253 bytes of char data to store everything (not counting the "unknown" for the default: special case) with all strings padded to 11 bytes.

The current strings + pointer-table implementation doesn't merge string literals where one string is a suffix of another; this is another missed optimization that would save many bytes here. (e.g. instead of .string "mii" and .string "gmii", just have .LC4: .byte 's'; .LC3: .byte 'g'; .LC2: .string "mii".) That optimization plus byte or 16-bit offsets into the table would be nice and compact, and most CPUs have efficient zero-extending narrow loads. So for cases where the other optimization I'm suggesting isn't good, that would probably be best.
The current packed string data takes 158 bytes, so with 4-byte offsets it takes 158 + 23*4 = 250 bytes. Or with 8-byte pointers/offsets, it takes 158 + 23*8 = 342 bytes. Or with 1-byte offsets, 158 + 23*1 = 181 bytes: load with movzbl. (If you can't use the offset directly as an 8-byte memory source operand for ADD to a pointer, there's no point making it 32 bits instead of 8.)

The code for *using* such a table is quite simple. This C source compiles to what I'm suggesting: https://godbolt.org/g/E8J3iS

    struct foo { char str[11]; } const table[23] = {};

    const char *lookup(unsigned long idx)
    {
        if (idx > 23) {
            return "unknown";   // idx = 23;
        }
        return table[idx].str;
    }

Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a RIP-relative LEA we end up with 4 ALU instructions total to get a string address, after checking the if condition:

    # gcc 7.3 -march=haswell -O3 -fPIE output: https://godbolt.org/g/qMzaY8
        leaq    .LC0(%rip), %rax        # "unknown"
        cmpq    $23, %rdi
        ja      .L4                     # branchless is also an option
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rip), %rdx       # RIP-relative table base address
        leaq    (%rdi,%rax,2), %rax
        addq    %rdx, %rax              # table + 11*idx
    .L4:
        ret

This is even better in no-PIE mode where a static address is usable as a signed 32-bit immediate:

    lookup(unsigned long):
        movl    $.LC0, %eax
        cmpq    $23, %rdi
        ja      .L4
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rdi,%rax,2), %rax  # 3-cycle latency for 3-component LEA on SnB-family
    .L4:
        ret

So this has extremely low code-size cost on x86-64, for the benefit of removing a table load from the dependency chain from enum to string data. It does cost significant data size vs. a byte-offset table with suffix merging, but it's better than what gcc is doing now in non-PIE (table of qword pointers), and *much* better in PIE (insane jump table).
The byte-index version is equivalent to transforming the C source like this:

    const char packedstrings[158] = {};
    const unsigned char offsets[23] = {};

    const char *lookup_byteidx(unsigned long idx)
    {
        if (idx > 23)
            return "unknown";
        return &packedstrings[offsets[idx]];
    }

        leaq    .LC0(%rip), %rax        # "unknown"
        cmpq    $23, %rdi
        ja      .L9
        leaq    offsets(%rip), %rax
        leaq    packedstrings(%rip), %rdx
        movzbl  (%rax,%rdi), %eax
        addq    %rdx, %rax
    .L9:
        ret

We can save an instruction here by making the relativ
[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

--- Comment #2 from Peter Cordes ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least for this testcase. GCC's stack-alignment strategy seems to be improved overall (not copying the return address when not needed), so it's probably really fixed. It's still present in 7.3.
[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560

Peter Cordes changed:
           What    |Removed |Added
                 CC|        |peter at cordes dot ca

--- Comment #23 from Peter Cordes ---
Just to recap the current situation (gcc/g++ 8.0.1 20180425): I ported David Marillat's testcase to work as C or C++: https://godbolt.org/g/QdG2V6 (and changed it to set global variables instead of calling printf, so you can see the results from looking at the asm output instead of running it).

C++11 alignof() now agrees with C11 alignof() (which didn't change) that alignof(int64_t) is 4 when targeting the i386 System V ABI. Previously G++'s alignof() reported 8, while gcc's C11 alignof (stdalign.h) reported 4. That was the only change: struct-member alignof results are unchanged, and already matched between C11 and C++11.

4 is the minimum alignment that *any* int64_t, or pointer to int64_t, is assumed to have when generating code for i386 SysV. gcc / g++ are allowed to generate code that breaks if passed a pointer to int64_t that isn't 4-byte aligned. (Auto-vectorization is one case where that can happen on x86: https://stackoverflow.com/q/47510783/224132). They're *not* allowed to assume that it's 8-byte aligned unless they can see the definition and know that a particular int64_t object is over-aligned, e.g. to its natural alignment of 8, like gcc chooses to do whenever possible (i.e. outside structs).

So in both C++ and C (and in g++/gcc after this patch), alignof(int64_t) is the minimum that any allocator must give an int64_t for correctness (in this funky 32-bit ABI), not the recommended alignment that gcc and g++ both already used whenever ABI struct-packing rules didn't constrain them. It's also the guaranteed minimum that code can *assume*: e.g. a manually-vectorized library function might check alignof(T) == sizeof(T) before assuming that 16-byte-aligned loads/stores can line up with element boundaries. (An array inside a struct { int foo; int64_t arr[10]; } would violate this for i386 SysV.)
Anyway, I think use-cases like these are why the standard is worded the way it is, and why it makes sense for alignof() to report the guaranteed/required minimum. The recommended or actual alignment is useful too, for other cases, so it's nice that GNU __alignof() is also available to report that.

Semi-related: gcc depends on 8-byte alignment for C11 _Atomic int64_t but still fails to provide it inside structs on the i386 SysV ABI (bug 65146), using the same alignment rules as regular int64_t. C++11 std::atomic is fine, getting the required natural alignment even on i386 SysV, so SSE2 movq is atomic and lock add is efficient.

This change to what alignof() reports in C++ had no effect on C at all, or on any alignment choices made by the compiler in either C or C++. I only mention it as another interesting case where i386 SysV's under-alignment of 64-bit types requires special care, but that one will require an ABI change of some sort to fix.
[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Peter Cordes changed:
           What    |Removed |Added
                 CC|        |peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
This LEA stuff is part of what gcc does to align the stack by 32 for spilling AVX locals. GCC's stack-align sequence is over-complicated and ties up an extra register for the whole function (add volatile to the local and see the -O3 code). Or at least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP but references 32-byte-aligned locals relative to the aligned RSP instead of the unaligned RBP. It used to copy the return address to make a full copy of ret-addr / saved-RBP for the aligned stack frame, which was super weird: https://godbolt.org/g/RLJNtd. (With an alloca or something, gcc8 does the same crazy stack-frame stuff as gcc7; otherwise it's much cleaner, like clang.)

The actual bug here is that the setup isn't fully optimized away when it turns out that no 32-byte spills / reloads of locals are left in the function. gcc for x86-64 sometimes has a few leftover instructions like that in more complex functions using __m256; this is not exclusively an i386 problem, but it happens more easily for 32-bit, it seems.
[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366

            Bug ID: 85366
           Summary: Failure to use both div and mod results of one IDIV in
                    a prime-factor loop while(n%i==0) { n/=i; }
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

From https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801, simplified to use a pointer instead of returning std::vector. Interestingly, the version with std::vector can be more easily coaxed to use both results of one idiv; see the Godbolt link.

    void find_prime_factors_ptr(int n, int *p)
    {
        // inefficient to test even numbers > 2, but that's a separate missed optimization.
        for (int i = 2; i <= n; i++) {
            while (n % i == 0) {
                *p++ = i;
                n /= i;   // reordering the loop body doesn't help
            }
        }
    }

https://godbolt.org/g/ogyZW8  g++ 8.0.1 20180411 -O3 -march=haswell gives us this inner loop:

        ...                         # outer loop
        movl    %edi, %eax          # idiv to test if inner loop should even run once, leaving n/i in eax
    .L4:
        movl    %edi, %eax          # but instead we discard it
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        cltd
        idivl   %ecx
        cltd                        # then modulo that division result to see if the next iteration should run
        movl    %eax, %edi
        idivl   %ecx                # leaves n/i in eax, ready for next iteration...
        testl   %edx, %edx
        je      .L4
        ...

So both ways to get to .L4 (falling in or looping) have n/i in EAX from an idiv already! The loop doesn't need to be restructured to take advantage; gcc just needs to keep track of what it's doing.
## Hand-optimized version of the whole function:

        cmpl    $1, %edi
        jle     .L9
        movl    $2, %ecx
    .L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx            # eax = tmp = n/i
        testl   %edx, %edx
        jne     .L3
    .L4:
        movl    %ecx, (%rsi)
        addq    $4, %rsi        # tuning for Haswell: no register-read stalls, so increment after reading and save a byte in the addressing mode
        movl    %eax, %edi      # n = tmp
        cltd
        idivl   %ecx            # eax = tmp = n/i
        testl   %edx, %edx
        je      .L4
    .L3:
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L5
    .L9:
        ret

I didn't make *any* changes to the code outside the inner loop; I ended up just removing movl %edi, %eax / cltd / idivl %ecx.

Changing the inner loop to

    int tmp;
    while (tmp = n/i, n % i == 0) {
        *p++ = i;
        n = tmp;
    }

gives us asm almost that good (an extra mov inside the loop), but we get a jmp into the loop instead of peeling the while condition from before the first iteration:

    # gcc 8.0.1 -O3 -march=haswell output, commented but unmodified
    find_prime_factors_ptr_opt(int, int*):
        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        jmp     .L19
    .L16:                       # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi      # extra mov puts this and the next mov on the critical path
    .L19:                       # inner loop entry point
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16            # bottom of inner
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19            # bottom of outer
    .L18:
        ret

Saving code-size here with the dependent chain of movl %eax, %edi / movl %edi, %eax is pretty minor even on CPUs like original Sandybridge, or Bulldozer, without mov-elimination, because idiv's latency dominates.
But it could easily be taken out of the inner loop by duplicating it outside the outer loop, then moving it to the outer-only part of the loop body, like this:

        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        movl    %edi, %eax      # eax = n, added here
        jmp     .L19
    .L16:                       # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi      # n = tmp, still here
    .L19:                       # inner loop entry point
    #   movl    %edi, %eax      # eax = n, removed from here in the inner/outer loop
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16            # bottom of inner
        movl    %edi, %eax      # eax = n, also added here, in the outer-only part
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19
    #
[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

--- Comment #1 from Peter Cordes ---
Correction for AArch64: it supports addressing modes with a 64-bit base register + 32-bit index register, with zero or sign extension of the 32-bit index. But not 32-bit base registers.

As a hack that's better than nothing, AArch64 could use a 32-bit pointer as the index with a UXTW mode, using a zeroed register as the base (unless indexed modes have any perf downside on real AArch64 chips). But unfortunately, the architectural zero register isn't usable as the base: that encoding means the stack pointer for this instruction. ldr w1, [xzr, w2, uxtw] doesn't assemble; only x0-x30 or SP.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801b/BABBGCAC.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/LDR_reg_gen.html
describes LDR Wt, [Xn|SP, Rm{, extend {amount}}] where Rm can be an X or W register, and "extend" can be SXTW or UXTW for word regs, or LSL for X regs. (SXTX is a synonym for LSL.) Any of the modes can use a left-shift amount, applied *after* extension to 64-bit. See https://community.arm.com/processors/b/blog/posts/a64-shift-and-extend-operations-operand-modifiers for details on operand modifiers.

gcc 6.3 doesn't take advantage with -mabi=ilp32, and Godbolt doesn't have later AArch64 gcc. So gcc will need to know about zero-extended pointers, and the signedness of 32-bit values, to take advantage of AArch64's addressing modes for the common case of a 32-bit index. Teaching gcc to track signed/unsigned in RTL would benefit x32 and AArch64 ILP32, if I understand the situation correctly.