https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #12 from Hongtao.liu <crazylht at gmail dot com> ---
Correct AVX256 load cost outside of register allocation and vectorizer

> they are
> 1. AVX256 Load ---- 16
> 2. FMA3 ymm,ymm,ymm --- 16
> 3. AVX256 Regmove --- 2
> 4. FMA3 mem,ymm,ymm --- 32

That's why pass_combine would combine *avx256 load* and *FMA3 ymm,ymm,ymm* to *FMA3 mem,ymm,ymm*
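
The arithmetic behind that conclusion can be sketched in a few lines of C. This is only a simplified model, not GCC source: the enum names and the helpers `combine_accepts` and `fold_load_into_fma` are hypothetical, and the rule encoded here (a combination is rejected only when the new cost is strictly greater than the summed cost of the original instructions) is an assumption about how the combine pass weighs costs. The four numbers are the ones quoted above.

```c
#include <stdbool.h>

/* Costs quoted in the comment; the names are illustrative only. */
enum {
    COST_AVX256_LOAD    = 16, /* 1. AVX256 Load         */
    COST_FMA3_REG       = 16, /* 2. FMA3 ymm,ymm,ymm    */
    COST_AVX256_REGMOVE = 2,  /* 3. AVX256 Regmove      */
    COST_FMA3_MEM       = 32  /* 4. FMA3 mem,ymm,ymm    */
};

/* Hypothetical helper: simplified profitability rule.
   Assumption: a combination is rejected only if it is strictly
   more expensive than the instructions it replaces. */
static bool combine_accepts(int old_cost, int new_cost)
{
    return !(old_cost > 0 && new_cost > old_cost);
}

/* Hypothetical helper: would the separate load and register-form FMA
   be merged into the memory-operand FMA under the rule above? */
static bool fold_load_into_fma(void)
{
    int old_cost = COST_AVX256_LOAD + COST_FMA3_REG; /* 16 + 16 = 32 */
    int new_cost = COST_FMA3_MEM;                    /* 32 */
    return combine_accepts(old_cost, new_cost);
}
```

Under these assumptions the costs tie at 32, so the memory-operand form is not rejected and the load is folded into the FMA. A memory-form cost of 33 or more would make the same check fail.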