https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602

Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #2 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Uroš Bizjak from comment #1)
> The "xorl %eax, %eax; movw %ax, (%rsi)" pair is just optimized way to 
> implement "movw $0, (%rsi);".

That's questionable on Skylake.  Sandybridge and later Intel CPUs don't have
LCP decode stalls on mov with a 16-bit immediate, only on ALU instructions. 
movw $0, (%rsi)  has no displacement and a 16-bit immediate, so it only takes 1
entry in the uop cache (Agner Fog's microarch pdf, table 9.1).  I don't see any
other downsides for a mov-imm16 to memory on Skylake.
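
To put the two candidates side by side (just restating the above as a sketch of
the costs I'd expect on SnB-family):

        movw    $0, (%rsi)        # imm16, but mov doesn't LCP-stall on SnB and later;
                                  # 1 uop-cache entry (table 9.1), no scratch reg needed
vs. the split form gcc emits:
        xorl    %eax, %eax        # needs a free register
        movw    %ax, (%rsi)       # two instructions instead of one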

> It just happens that peephole pass found unused %eax as an empty temporary 
> reg when splitting direct move of immediate to memory.

Then it's a missed-optimization to keep re-zeroing inside the loop, instead of
picking %ecx and hoisting the xor so the rest of the loop can keep clobbering
%eax.

Although really doing what Christoph suggested is even better if you do want
the xor-zeroing for something else.  If a peephole pass is introducing
xor-zeroing instructions, then it's a missed optimization if other instructions
can't take advantage.  Having an xor-zeroing inside the loop *instead* of a
movzx is pretty good if the xor-zero is needed for some other reason.
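
Something like this hypothetical loop (not the exact testcase from this bug,
just to illustrate the shape I mean) is where that matters:

// hypothetical: a 16-bit popcount plus a 16-bit store of zero in the same loop
unsigned count_and_clear(unsigned short *buf, int n)
{
    unsigned total = 0;
    for (int i = 0; i < n; i++) {
        total += __builtin_popcount(buf[i]);  // popcount of a 16-bit element
        buf[i] = 0;   // the store of zero that could get split into xor + movw;
    }                 // the xor could be hoisted, or pick a reg popcnt doesn't need
    return total;
}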

> popcntl has a false dependency on its output in certain situations,

Yes, always on Intel Sandybridge-family, including Skylake.

> where popcntw doesn have this limitation. So, gcc choose this approach for a
> reason.

Intel Haswell/Skylake (and I think IvyBridge) don't rename low16 separately
from the full register
(https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to).
 *Any* write of a 16-bit register has a false dependency on the old value.  So
popcntw isn't a special case of false dependency, it's like other 16-bit
instructions.  Is that what you're thinking of?

The only CPUs where it's even theoretically possible for popcntw to not have an
output dependency are Nehalem and Sandybridge.  All other CPUs don't rename
low16 separately from the full register or are too old to have popcnt.

Do you have a source for your claim that popcntw doesn't have a dependency on
the 16-bit destination register, on Sandybridge?  It's probably true on
Nehalem, because they rename %ax and don't have false dependencies for
popcntl/q, but more likely Sandybridge will treat %ax as an input dependency.

If you don't have an xor-zeroed register you can clobber, movzx-load + popcntl
is pretty clearly better than popcntw mem,%ax;  movzx %ax, %eax on all CPUs, or
at least not worse.  It saves a code byte (no operand-size prefix), guarantees
no false dependency on %eax, and avoids a 16-bit popcnt (slow on
Silvermont/KNL).

It's also fewer total uops for the out-of-order scheduler / execution units:
popcnt (mem), reg  is a micro-fused load + ALU, and movzwl r16,r32 is another
ALU uop.  But a movzx load decodes to just a load uop on Intel CPUs, no ALU uop
needed.


For example, gcc -m32 -O3 -mtune=generic compiles this
int pc_ushort(unsigned short a) { return __builtin_popcount(a); }
// https://godbolt.org/g/nnNYLU
        popcntw 4(%esp), %ax        # false dep on most CPUs
        movzwl  %ax, %eax           # extra ALU uop on the critical path
        ret

Much better:

        movzwl   4(%esp), %eax      # just as fast as a mov load
        popcntl  %eax, %eax         # popcntl false dep taken care of by same,same


Or another -m32 example:
int pcl(unsigned long a) { return __builtin_popcountll(a); }
        # -O3 -mpopcnt -m32
        xorl    %eax, %eax         # correctly omitted with -mtune=znver1
        popcntl 4(%esp), %eax

That's an extra non-eliminated ALU uop for the execution units on Nehalem and AMD, vs.

        movl    4(%esp), %eax
        popcntl %eax, %eax

Also, the load might issue 1 cycle earlier this way, or can even be hoisted
ahead of other surrounding code.  Splitting a micro-fused load+ALU into a
mov-load is always at least as good as xor-zeroing for dep breaking when the
source is in memory.  (A reg-reg mov is not always better; putting a mov on the
critical path is always worse for older CPUs, vs. xor off the critical path.)
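
For the register-source case, a sketch of what I mean (assuming the input is
already in %edi and we're tuning around SnB-family's popcnt false dep):

        xorl    %eax, %eax        # dep-breaking, off the critical path
        popcntl %edi, %eax
vs.
        movl    %edi, %eax        # on the critical path on CPUs without mov-elimination
        popcntl %eax, %eax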

Here are some more examples of current gcc 8.0.0 20170920  getting things wrong
for various tunings:

int pcll_uint(unsigned int a) { return __builtin_popcountll(a); }
   # -O3 -march=bdver2
        movl    %edi, %eax        # zero-extend to 64-bit
        popcntq %rax, %rax        # 64-bit popcntq has 4c throughput, vs. 2c for 16/32 bit (Agner Fog's table for Piledriver)
        ret

Missed optimization to  popcntl %edi, %eax  (without mov) applies to znver1 /
knl / others as well, but especially bad on Bulldozer-family where popcntq is
actually slower than popcntl.  (__builtin_popcountll with -m32 manages to avoid
doing two 32-bit popcount instructions, though.)  Even on Intel SnB-family, it
wastes a REX prefix.
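
i.e. what I'd want to see for pcll_uint, at least on tunings where popcnt has no
false output dependency (bdver2 / znver1 / knl), as a sketch:

        popcntl %edi, %eax        # writing %eax zero-extends into %rax, and the
                                  # popcount of the zero-extended value is the same
        ret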



int pc(unsigned short a) { return __builtin_popcount(a); }
      # -march=knl, or -mpopcnt -mtune=generic

        popcntw %di, %ax         # Silvermont/KNL: 16-bit popcnt is 2 uops, and thus stalls decoders
        movzwl  %ax, %eax        # Intel SnB-family: mov-elimination fails: same reg
        ret

The good options would be:
        # -mtune=IvyBridge or later, or -mtune=silvermont/KNL
        movzwl  %di, %eax        # mov-elimination on Intel IvB and later
        popcntl %eax, %eax       # most efficient version of popcnt on every CPU
        ret

Or same code size but removes  movzx  from the critical path:

        xor     %eax,%eax
        popcntw %di, %ax         # for -mtune other than Silvermont/KNL

Good on Sandybridge or earlier, and I guess AMD.  (AMD and Nehalem still need
an execution unit for xor-zeroing, but movzx is non-zero latency and also needs
an execution unit, so taking it off the critical path is a win.)



int pcll_signext(signed short a) { return __builtin_popcountll(a); }
    # -O3 -march=bdver2 -m32
        pushl   %ebx            # huh?
        movswl  8(%esp), %eax
        movl    %eax, %ebx
        popcntl %eax, %eax
        sarl    $31, %ebx
        popcntl %ebx, %edx
        addl    %edx, %eax
        popl    %ebx
        ret

That's obviously ridiculous compared to what gcc should emit:

        movswl  4(%esp), %eax         # arg at 4(%esp): no pushl %ebx in this version
        cltd                          # aka cdq.  edx = broadcast sign bit of eax
        popcntl %eax, %eax
        popcntl %edx, %edx
        addl    %edx, %eax
        ret

It's unlikely that a special case optimization for popcnt of a sign-extension
result is useful.  But it's possible: the popcnt(high half) is either 0 or 32. 
A RORX + AND to move the sign bit and clear the rest could replace CLTD +
POPCNT.  Or if popcnt false deps aren't a problem, popcnt into a different
register and then SHR + AND.
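
Spelled out, those two alternatives would look something like this (just a
sketch; the RORX version assumes BMI2, which -march=bdver2 doesn't have):

        movswl  4(%esp), %eax
        rorx    $26, %eax, %edx       # sign bit -> bit 5
        andl    $32, %edx             # 0 or 32 = popcount of the high half
        popcntl %eax, %eax
        addl    %edx, %eax
        ret

or, with popcnt writing a different register:

        movswl  4(%esp), %eax
        popcntl %eax, %edx            # false dep on old %edx on SnB-family
        shrl    $26, %eax             # sign bit -> bit 5
        andl    $32, %eax             # 0 or 32
        addl    %edx, %eax
        ret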

Since this probably only happens in sloppy code that never has negative numbers
but uses popcountll on a signed int, a test+jl might be worth it to skip the
high half count + add.
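
i.e. something like this (sketch, -m32, arg at 4(%esp)):

        movswl  4(%esp), %eax
        testl   %eax, %eax
        js      .Lnegative            # test+js (jl is equivalent after test)
        popcntl %eax, %eax            # common case: non-negative input
        ret
.Lnegative:
        popcntl %eax, %eax
        addl    $32, %eax             # high half of the sign extension is all-ones
        ret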
