https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Uroš Bizjak from comment #1)
> The "xorl %eax, %eax; movw %ax, (%rsi)" pair is just an optimized way to
> implement "movw $0, (%rsi);".

That's questionable on Skylake.  Sandybridge and later Intel CPUs don't have
LCP decode stalls on mov with a 16-bit immediate, only on ALU instructions.
movw $0, (%rsi) has no displacement and a 16-bit immediate, so it only takes
1 entry in the uop cache (Agner Fog's microarch pdf, table 9.1).  I don't see
any other downside for a mov-imm16 to memory on Skylake.

> It just happens that peephole pass found unused %eax as an empty temporary
> reg when splitting direct move of immediate to memory.

Then it's a missed optimization to keep re-zeroing inside the loop, instead
of picking %ecx and hoisting the xor so the rest of the loop can keep
clobbering %eax (sketched further down in this comment).  Although really,
doing what Christoph suggested is even better if you do want the xor-zeroing
for something else.

If a peephole pass is introducing xor-zeroing instructions, then it's a
missed optimization if other instructions can't take advantage of them.
Having an xor-zeroing inside the loop *instead* of a movzx is pretty good if
the xor-zero is needed for some other reason.

> popcntl has a false dependency on its output in certain situations,

Yes, always on Intel Sandybridge-family, including Skylake.

> where popcntw doesn't have this limitation.  So, gcc chose this approach
> for a reason.

Intel Haswell/Skylake (and I think IvyBridge) don't rename low16 separately
from the full register
(https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to).
*Any* write of a 16-bit register has a false dependency on the old value.  So
popcntw isn't a special case of false dependency; it behaves like every other
16-bit instruction.  Is that what you're thinking of?

The only CPUs where it's even theoretically possible for popcntw not to have
an output dependency are Nehalem and Sandybridge.  All other CPUs either
don't rename low16 separately from the full register or are too old to have
popcnt.

Do you have a source for the claim that popcntw doesn't have a dependency on
the 16-bit destination register on Sandybridge?  It's probably true on
Nehalem, because it renames %ax and doesn't have false dependencies for
popcntl/q, but more likely Sandybridge treats %ax as an input dependency.

If you don't have an xor-zeroed register you can clobber, a movzx-load +
popcntl is pretty clearly better than popcntw mem, %ax; movzx %ax, %eax on
all CPUs, or at least not worse.  It saves a code byte (no operand-size
prefix), guarantees no false dependency on %eax, and avoids a 16-bit popcnt
(slow on Silvermont/KNL).

It's also fewer total uops for the out-of-order scheduler / execution units:
popcnt (mem), reg is a micro-fused load + ALU uop, and movzwl r16, r32 is
another ALU uop.  But a movzx load decodes to just a load uop on Intel CPUs,
with no ALU uop needed.
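To make the xor-hoisting idea concrete, here's a minimal hand-written sketch.
This is a hypothetical loop, not the testcase from this bug; the register
choice and the rest of the loop body are made up purely for illustration:

        xorl    %ecx, %ecx        # hoisted: %cx stays zero across iterations
.Lloop:
        movw    %cx, (%rsi)       # the 16-bit store of zero, no re-zeroing here
        addq    $2, %rsi          # (stand-in for the rest of the loop body,
        decl    %edx              #  which is now free to clobber %eax)
        jne     .Lloop

The point is just that one xor-zeroed register outside the loop serves every
iteration, instead of re-materializing the zero each time around.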
Back to the movzx-load + popcntl point: for example, gcc -m32 -O3
-mtune=generic compiles this

int pc_ushort(unsigned short a) { return __builtin_popcount(a); }
// https://godbolt.org/g/nnNYLU

        popcntw 4(%esp), %ax      # false dep on most CPUs
        movzwl  %ax, %eax         # extra ALU uop on the critical path
        ret

Much better:

        movzwl  4(%esp), %eax     # just as fast as a mov load
        popcntl %eax, %eax        # popcntl false dep taken care of by same,same

Or another -m32 example:

int pcl(unsigned long a) { return __builtin_popcountll(a); }
        # -O3 -mpopcnt -m32
        xorl    %eax, %eax        # correctly omitted with -mtune=znver1
        popcntl 4(%esp), %eax

The xor is an extra non-eliminated ALU uop for the execution units on Nehalem
and AMD, vs.

        movl    4(%esp), %eax
        popcntl %eax, %eax

Also, the load might issue 1 cycle earlier this way, or can even be hoisted
ahead of other surrounding code.  Splitting a micro-fused load+ALU into a
separate mov-load is always at least as good as xor-zeroing for dep-breaking
when the source is in memory.  (A reg-reg mov is not always better: putting a
mov on the critical path is always worse on older CPUs, vs. an xor off the
critical path.)

Here are some more examples of current gcc 8.0.0 20170920 getting things
wrong for various tunings:

int pcll_uint(unsigned int a) { return __builtin_popcountll(a); }
        # -O3 -march=bdver2
        movl    %edi, %eax        # zero-extend to 64-bit
        popcntq %rax, %rax        # 64-bit popcntq has 4c throughput, vs. 2c for
                                  # 16/32-bit (Agner Fog's table for Piledriver)
        ret

The missed optimization to popcntl %edi, %eax (without the mov) applies to
znver1 / knl / others as well, but it's especially bad on Bulldozer-family
where popcntq is actually slower than popcntl.  (__builtin_popcountll with
-m32 manages to avoid doing two 32-bit popcount instructions, though.)  Even
on Intel SnB-family, it wastes a REX prefix.

int pc(unsigned short a) { return __builtin_popcount(a); }
        # -march=knl, or -mpopcnt -mtune=generic
        popcntw %di, %ax          # Silvermont/KNL: 16-bit popcnt is 2 uops, and thus stalls decoders
        movzwl  %ax, %eax         # Intel SnB-family: mov-elimination fails: same reg
        ret

The good options would be:

        # -mtune=IvyBridge or later, or -mtune=silvermont/KNL
        movzwl  %di, %eax         # mov-elimination on Intel IvB and later
        popcntl %eax, %eax        # most efficient version of popcnt on every CPU
        ret

Or, same code size but removing the movzx from the critical path:

        xorl    %eax, %eax
        popcntw %di, %ax          # for -mtune other than Silvermont/KNL

Good on Sandybridge or earlier, and I guess AMD.  (AMD and Nehalem still need
an execution unit for xor-zeroing, but movzx has non-zero latency and also
needs an execution unit, so taking it off the critical path is a win.)

int pcll_signext(signed short a) { return __builtin_popcountll(a); }
        # -O3 -march=bdver2 -m32
        pushl   %ebx              # huh?
        movswl  8(%esp), %eax
        movl    %eax, %ebx
        popcntl %eax, %eax
        sarl    $31, %ebx
        popcntl %ebx, %edx
        addl    %edx, %eax
        popl    %ebx
        ret

That's obviously ridiculous compared to what gcc should emit:

        movswl  8(%esp), %eax
        cltd                      # aka cdq.  edx = broadcast of the sign bit of eax
        popcntl %eax, %eax
        popcntl %edx, %edx
        addl    %edx, %eax
        ret

It's unlikely that a special-case optimization for popcnt of a sign-extension
result is useful, but it's possible: the popcnt of the high half is either 0
or 32.  A RORX + AND to move the sign bit and clear the rest could replace
CLTD + POPCNT.  Or, if popcnt false deps aren't a problem, popcnt into a
different register and then SHR + AND.  Since this probably only happens in
sloppy code that never has negative numbers but uses popcountll on a signed
int, a test+jl might be worth it to skip the high-half count + add.
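A hand-written sketch of that branchy idea (not proposed gcc output; it uses
jns rather than jl, which is the same test inverted, and assumes the same
-m32 stack-arg convention as the pcll_signext example above):

        movswl  8(%esp), %edx     # sign-extend the halfword arg
        xorl    %eax, %eax        # dep-breaking for popcnt's false output dependency
        popcntl %edx, %eax        # count the set bits in the low 32
        testl   %edx, %edx
        jns     .Ldone            # non-negative: high 32 bits are all zero
        addl    $32, %eax         # negative: high 32 bits are all ones -> 32 more set bits
.Ldone:
        ret

For the expected non-negative inputs the branch predicts well and the add is
skipped entirely, which is the whole point of the test+branch version.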