https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #8)
> 
> For this one, we can load *a into %zmm0 to avoid false_dependence.
> 
> vmovdqau ZMMWORD PTR [rdi], zmm0
> vpternlogq      zmm0, zmm0, zmm0, 85

Yes, since ternlog with a memory operand needs two fused-domain uops on Intel
CPUs, breaking out the load into a separate instruction would be more
efficient for both negate1 and negate2.
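
For readers wondering why immediate 85 (0x55) gives a negation at all, here is a minimal scalar sketch of one 64-bit lane of vpternlogq. It is not code from the bug report: the helper names ternlog64 and negate_lane are hypothetical, and it models only the truth-table lookup, not uop counts or dependency behaviour. Since 0x55 has bits set exactly where the third source bit is 0, the result is the complement of the third source and the first two are ignored, which is why a register they name can still become a false dependence.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of vpternlogq.
   For each bit position, the bits of a, b and c form a 3-bit index
   (a is the most significant bit) that selects one bit of imm8. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                  ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}

/* Models "vpternlogq zmm0, zmm0, zmm0, 85" on a single lane:
   with all three operands equal, the result is simply ~x. */
static uint64_t negate_lane(uint64_t x)
{
    return ternlog64(x, x, x, 85);
}
```

With a separate load, all three operands name the freshly written register, so the ternlog has no input left over from earlier code.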
