[Bug target/53687] _mm_cmpistri generates redundant movslq %ecx, %rcx on x86-64

2024-04-29 Thread lh_mouse at 126 dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53687

LIU Hao changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lh_mouse at 126 dot com

--- Comment #2 from LIU Hao ---
Intel's choice that `_mm_cmpistri()` should return `int` is just awkward. The
hardware always zero-extends the result into RCX; the same holds for `PMOVMSKB`
(with AVX2, `VPMOVMSKB` on a YMM source produces a 32-bit result which is still
zero-extended).

For Clang, casting the result to `uint32_t` is sufficient to eliminate the
redundant extension instruction; for GCC it does not work.
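
A minimal sketch of the cast described above, assuming an x86-64 target with SSE4.2. The helper name `first_match_index` and its 16-byte zero-padded buffer arguments are hypothetical and not taken from the bug report. Whether the extension instruction actually disappears depends on the compiler, as noted in the comment, so the generated assembly is worth checking.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpistri */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the first byte in the 16-byte block `buf16`
 * that equals any byte of `set16`, or 16 if there is none.
 * Assumption: both arguments point to readable, zero-padded 16-byte buffers. */
__attribute__((target("sse4.2")))
size_t first_match_index(const char *buf16, const char *set16)
{
    __m128i set   = _mm_loadu_si128((const __m128i *)set16);
    __m128i chunk = _mm_loadu_si128((const __m128i *)buf16);

    /* The intrinsic returns int; casting to uint32_t tells the compiler
     * that the widening to size_t is a zero-extension, not a sign-extension. */
    return (uint32_t)_mm_cmpistri(
        set, chunk,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);
}
```

Called on a zero-padded block holding `key=value` with the set `=`, this returns 3; with no match it returns 16.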

[Bug target/53687] _mm_cmpistri generates redundant movslq %ecx,%rcx on x86-64

2017-09-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53687

Peter Cordes changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
This behaviour is pretty understandable.  gcc doesn't know that the
return-value range is only 0-16, i.e. guaranteed non-negative integers.  Since
you used a signed int offset, it makes sense that it *sign*-extends from 32 to
64 bits.

If you use an unsigned offset, the missed optimization becomes more obvious:
gcc7.2 still uses a  movl %ecx, %ecx  to zero-extend into rcx.

https://godbolt.org/g/wWvqpa

(Incidentally, using the same register as both source and destination is the
worst possible choice for Intel CPUs.  It means the mov can never be eliminated
in the rename stage, and always needs an execution port with non-zero latency.)

Even a uintptr_t offset doesn't avoid it, because then the conversion from the
intrinsic's int result to the variable is a sign-extension up to 64 bits.  gcc
treats the intrinsic exactly like a function that returns int, which in the
SysV ABI is allowed to have garbage in the upper 32 bits.
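
The three offset types discussed above can be sketched in portable C. The function names are hypothetical stand-ins for "assign the intrinsic's int result to an offset variable". The point is that the conversions agree for values in 0-16 but differ for negative inputs, so without range information the compiler has to emit some extension.

```c
#include <stdint.h>

/* int -> signed 64-bit offset: a sign-extension (movslq on x86-64). */
int64_t offset_as_signed(int r) { return r; }

/* int -> unsigned -> 64-bit offset: a zero-extension of the low 32 bits. */
uint64_t offset_as_unsigned(int r) { return (unsigned)r; }

/* int -> uintptr_t directly: the value conversion is again a sign-extension,
 * so -1 becomes the all-ones pointer-sized value. */
uintptr_t offset_as_uintptr(int r) { return (uintptr_t)r; }
```

For every value in the intrinsic's real range all three functions return the same number. They only diverge for inputs such as -1, which the intrinsic never produces but the compiler cannot rule out.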


(BTW, this use of flags from inline asm is not guaranteed to be safe.  Nothing
stops the optimizer from doing the pointer-increment after the `pcmpistri`,
which would clobber flags.  You could do `pcmpistri` inside the asm and produce
a uintptr_t output operand, except that doesn't work with goto.  So really you
should write the whole loop in inline asm.)


Or better, don't use inline asm at all: gcc can CSE _mm_cmpistri with
_mm_cmpistra, so you can just use both intrinsics on the same operands to get
multiple results, and it will compile to a single instruction.  This is like
using the `/` and `%` operators to get both results of a `div`.
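
A minimal sketch of that intrinsics-only approach, assuming SSE4.2 and a NUL-terminated input that is zero-padded to a multiple of 16 bytes so the block loads never read past the buffer. The function name `find_first_of16` and the "return (size_t)-1 when nothing matches" convention are hypothetical. The two intrinsic calls share the same operands and mode; whether they merge into one `pcmpistri` is up to the compiler and worth confirming in the generated assembly.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpistri, _mm_cmpistra */
#include <stddef.h>

#define SCAN_MODE \
    (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical helper: offset of the first byte of `buf` that equals any
 * byte of `set16`, or (size_t)-1 if the string ends first.
 * Assumptions: `set16` is a zero-padded 16-byte buffer; `buf` is
 * NUL-terminated and zero-padded to a multiple of 16 bytes. */
__attribute__((target("sse4.2")))
size_t find_first_of16(const char *buf, const char *set16)
{
    __m128i needles = _mm_loadu_si128((const __m128i *)set16);
    size_t pos = 0;

    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + pos));

        /* Same operands and mode for both calls: index from ECX, and the
         * "above" condition (no match and no NUL in chunk) from the flags. */
        int idx   = _mm_cmpistri(needles, chunk, SCAN_MODE);
        int above = _mm_cmpistra(needles, chunk, SCAN_MODE);

        if (!above)  /* either a match was found or the string ended */
            return idx < 16 ? pos + (size_t)idx : (size_t)-1;
        pos += 16;
    }
}
```

No inline asm or flag-output tricks are involved, so the optimizer is free to schedule the pointer increment wherever it likes without clobbering anything the code depends on.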