https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100794
Bug ID: 100794
Summary: suboptimal code due to missing pre2 when vectorization fails
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: linkw at gcc dot gnu.org
Target Milestone: ---

I was investigating a degradation in SPEC2017 554.roms_r on Power9. The baseline is -O2 -mcpu=power9 -ffast-math, while the test line is -O2 -mcpu=power9 -ffast-math -ftree-vectorize -fvect-cost-model=very-cheap.

One reduced C test case is below:

#include <math.h>

#define MIN fmin
#define MAX fmax

#define N1 400
#define N2 600
#define N3 800

extern int j_0, j_n, i_0, i_n;
extern double diff2[N1][N2];
extern double dZdx[N1][N2][N3];
extern double dTdz[N1][N2][N3];
extern double dTdx[N1][N2][N3];
extern double FS[N1][N2][N3];

void
test (int k1, int k2)
{
  for (int j = j_0; j < j_n; j++)
    for (int i = i_0; i < i_n; i++)
      {
        double cff = 0.5 * diff2[j][i];
        double cff1 = MIN (dZdx[k1][j][i], 0.0);
        double cff2 = MIN (dZdx[k2][j][i + 1], 0.0);
        double cff3 = MAX (dZdx[k2][j][i], 0.0);
        double cff4 = MAX (dZdx[k1][j][i + 1], 0.0);
        FS[k2][j][i]
          = cff * (cff1 * (cff1 * dTdz[k2][j][i] - dTdx[k1][j][i])
                   + cff2 * (cff2 * dTdz[k2][j][i] - dTdx[k2][j][i + 1])
                   + cff3 * (cff3 * dTdz[k2][j][i] - dTdx[k2][j][i])
                   + cff4 * (cff4 * dTdz[k2][j][i] - dTdx[k1][j][i + 1]));
      }
}

O2 fast:

  <bb 8> [local count: 955630225]:
  # prephitmp_107 = PHI <_6(8), pretmp_106(7)>
  # prephitmp_109 = PHI <_4(8), pretmp_108(7)>
  # prephitmp_111 = PHI <_23(8), pretmp_110(7)>
  # prephitmp_113 = PHI <_13(8), pretmp_112(7)>
  # doloop.9_55 = PHI <doloop.9_57(8), doloop.9_105(7)>
  # ivtmp.33_102 = PHI <ivtmp.33_101(8), ivtmp.44_70(7)>
  _87 = (double[400][600] *) ivtmp.45_60;
  _1 = MEM[(double *)_87 + ivtmp.33_102 * 1];
  cff_38 = _1 * 5.0e-1;
  cff1_40 = MIN_EXPR <prephitmp_107, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.33_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  cff3_43 = MAX_EXPR <prephitmp_109, 0.0>;
  _6 = MEM[(double *)_79 + ivtmp.33_102 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

O2 fast vect (very-cheap):

  <bb 6> [local count: 955630225]:
  # doloop.9_55 = PHI <doloop.9_57(6), doloop.9_105(5)>
  # ivtmp.37_102 = PHI <ivtmp.37_101(6), ivtmp.46_72(5)>
  # ivtmp.38_92 = PHI <ivtmp.38_91(6), ivtmp.38_90(5)>
  _77 = (double[400][600] *) ivtmp.48_62;
  _1 = MEM[(double *)_77 + ivtmp.37_102 * 1];
  cff_38 = _1 * 5.0e-1;
  _2 = MEM[(double *)&dZdx + ivtmp.38_92 * 1];  // redundant load
  cff1_40 = MIN_EXPR <_2, 0.0>;
  _4 = MEM[(double *)&dZdx + 8B + ivtmp.37_102 * 1];
  cff2_42 = MIN_EXPR <_4, 0.0>;
  _5 = MEM[(double *)&dZdx + ivtmp.37_102 * 1];  // redundant load
  cff3_43 = MAX_EXPR <_5, 0.0>;
  _6 = MEM[(double *)&dZdx + 8B + ivtmp.38_92 * 1];
  cff4_44 = MAX_EXPR <_6, 0.0>;

I found the root cause: in the baseline version, PRE reuses some load results from the previous iteration, which saves some loads. In the test line version, the check below

          /* Inhibit the use of an inserted PHI on a loop header when
             the address of the memory reference is a simple induction
             variable.  In other cases the vectorizer won't do anything
             anyway (either it's loop invariant or a complicated
             expression).  */
          if (sprime
              && TREE_CODE (sprime) == SSA_NAME
              && do_pre
              && (flag_tree_loop_vectorize || flag_tree_parallelize_loops > 1)

stops PRE from doing this optimization, to avoid introducing a loop-carried dependence. That makes sense. Unfortunately, the expected downstream loop vectorization isn't performed on the given loop, because the "very-cheap" cost model doesn't allow the vectorizer to peel for niters. No later pass seems to try to optimize it, so it eventually results in sub-optimal code.

Rerunning pre once after loop vectorization did fix the degradation, but I'm not sure it's practical, since iterating pre seems quite time-consuming. Or could we tag this kind of loop and later run pre only on the tagged ones? It also seems impractical to predict whether a loop can be loop-vectorized later.
I'm also not sure whether there are existing passes which can be taught to handle this.
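
For illustration only, the load reuse that PRE performs in the baseline can be written out by hand. The sketch below is a simplified, hypothetical 1-D version of the inner loop: `a` and `b` stand in for the rows dZdx[k1][j] and dZdx[k2][j], the helpers `min0`/`max0` stand in for MIN (x, 0.0) and MAX (x, 0.0), and the function names are invented for this example. It is not GCC output, just a source-level picture of the loop-carried values that the prephitmp PHIs represent.

```c
#include <stddef.h>

/* Stand-ins for MIN (x, 0.0) and MAX (x, 0.0) from the test case
   (assumption: written as plain comparisons to keep the sketch
   free of libm).  */
static double min0 (double x) { return x < 0.0 ? x : 0.0; }
static double max0 (double x) { return x > 0.0 ? x : 0.0; }

/* Shape of the loop as written: each iteration loads a[i], a[i + 1],
   b[i] and b[i + 1], so a[i] and b[i] are loaded again although the
   previous iteration already had them as a[i + 1] and b[i + 1].
   Both arrays must have at least n + 1 elements.  */
double
sum_naive (const double *a, const double *b, size_t n)
{
  double s = 0.0;
  for (size_t i = 0; i < n; i++)
    {
      double cff1 = min0 (a[i]);
      double cff2 = min0 (b[i + 1]);
      double cff3 = max0 (b[i]);
      double cff4 = max0 (a[i + 1]);
      s += cff1 + cff2 + cff3 + cff4;
    }
  return s;
}

/* Hand-written equivalent of the baseline after PRE: the values loaded
   at index i + 1 are carried into the next iteration, so only two loads
   remain in the loop body.  a_cur and b_cur play the role of the
   loop-header PHIs, which is also the loop-carried dependence the
   vectorizer check tries to avoid.  */
double
sum_carried (const double *a, const double *b, size_t n)
{
  if (n == 0)
    return 0.0;

  double s = 0.0;
  double a_cur = a[0];   /* loaded once before the loop */
  double b_cur = b[0];
  for (size_t i = 0; i < n; i++)
    {
      double a_next = a[i + 1];
      double b_next = b[i + 1];
      s += min0 (a_cur) + min0 (b_next) + max0 (b_cur) + max0 (a_next);
      a_cur = a_next;    /* reuse in the next iteration */
      b_cur = b_next;
    }
  return s;
}
```

Both functions compute the same result; the second only trades two loads per iteration for two values kept live across the back edge, which is the saving lost in the test line when vectorization does not happen.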