https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202
--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #8)
>
> For this one, we can load *a into %zmm0 to avoid false_dependence.
>
> vmovdqau ZMMWORD PTR [rdi], zmm0
> vpternlogq zmm0, zmm0, zmm0, 85

Yes, since ternlog with memory operand needs two fused-domain uops on Intel
CPUs, breaking out the load would be more efficient for both negate1 and
negate2.
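
For readers unfamiliar with the immediate, here is a minimal portable sketch of why `vpternlogq zmm0, zmm0, zmm0, 85` acts as a bitwise NOT once the value has been loaded into the register. The helper `ternlog64` is a hypothetical scalar model of a single 64-bit lane, not GCC code and not taken from the bug report. It models only the truth-table lookup and says nothing about uop counts or false dependencies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of vpternlogq.
 * For every bit position, the three input bits form a 3-bit index
 * (a is the most significant bit, c the least significant), and the
 * result bit is the bit of imm at that index. */
static inline uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm)
{
    uint64_t r = 0;
    for (int bit = 0; bit < 64; bit++) {
        unsigned idx = (unsigned)((((a >> bit) & 1) << 2) |
                                  (((b >> bit) & 1) << 1) |
                                   ((c >> bit) & 1));
        r |= (uint64_t)((imm >> idx) & 1) << bit;
    }
    return r;
}

/* With all three operands equal, only index 0 (all bits clear) and
 * index 7 (all bits set) can occur. 85 is 0x55 = 0b01010101, so
 * index 0 yields 1 and index 7 yields 0: a bitwise NOT. */
static inline void check_negation(void)
{
    uint64_t x = 0x0123456789abcdefULL;
    assert(ternlog64(x, x, x, 85) == ~x);
}
```

Because all three operands name the same register, the result depends only on the loaded value, which is why a separate load followed by a register-only ternlog computes the same negation.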