Hi,
this patch enables the logic that avoids FMA in the matrix multiplication loop for 256-bit vectors. The underlying issue is the same as with znver1: while the combined latency of separate multiply and add operations is higher than that of an FMA, the dependency chain in matrix multiplication runs only through the additions, which are faster than an FMA.
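To make the latency argument concrete, here is a rough back-of-the-envelope model of the loop-carried critical path for an accumulation of the form acc += a[i] * b[i]. It is only a sketch: the struct, the function names and any latency values passed in are hypothetical and are not measured znver2 numbers, and the model assumes the core has enough throughput to overlap the independent multiplies.

```c
#include <stddef.h>

/* Instruction latencies in cycles.  Values supplied by a caller are
   placeholders for illustration, not measured znver2 figures.  */
struct latencies {
    unsigned mul;
    unsigned add;
    unsigned fma;
};

/* Fused form, acc = fma (a[i], b[i], acc): each FMA needs the previous
   accumulator value, so the full FMA latency is paid on every iteration.  */
unsigned long chain_cycles_fused(struct latencies lat, size_t n)
{
    return (unsigned long) n * lat.fma;
}

/* Split form, prod = a[i] * b[i]; acc = acc + prod: the multiplies do not
   depend on acc and can overlap across iterations, so only the first
   multiply plus the n dependent adds are serialized.  */
unsigned long chain_cycles_split(struct latencies lat, size_t n)
{
    if (n == 0)
        return 0;
    return lat.mul + (unsigned long) n * lat.add;
}
```

With placeholder latencies of mul = 4, add = 3 and fma = 5, a single iteration costs 7 cycles split versus 5 fused, but over 100 iterations the serialized chain is 304 cycles split versus 500 fused.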
Bootstrapped/regtested x86_64-linux, committed.

	* config/i386/i386-options.c (ix86_option_override_internal):
	Default PARAM_AVOID_FMA_MAX_BITS to 256 for znver2.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_256FMA_CHAINS): Set for
	ZNVER2.

Index: config/i386/i386-options.c
===================================================================
--- config/i386/i386-options.c	(revision 273727)
+++ config/i386/i386-options.c	(working copy)
@@ -2779,7 +2779,11 @@ ix86_option_override_internal (bool main
     opts->x_flag_cf_protection
       = (cf_protection_level) (opts->x_flag_cf_protection | CF_SET);
 
-  if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])
+  if (ix86_tune_features [X86_TUNE_AVOID_256FMA_CHAINS])
+    maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 256,
+			   opts->x_param_values,
+			   opts_set->x_param_values);
+  else if (ix86_tune_features [X86_TUNE_AVOID_128FMA_CHAINS])
     maybe_set_param_value (PARAM_AVOID_FMA_MAX_BITS, 128,
 			   opts->x_param_values,
 			   opts_set->x_param_values);
Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def	(revision 273727)
+++ config/i386/x86-tune.def	(working copy)
@@ -431,6 +431,10 @@ DEF_TUNE (X86_TUNE_USE_GATHER, "use_gath
    smaller FMA chain. */
 DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER)
 
+/* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
+   smaller FMA chain. */
+DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2)
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/