https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583
Bug ID: 113583
Summary: Main loop in 519.lbm not vectorized.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rdapp at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-* riscv*-*-*

This might be a known issue, but a bugzilla search for lbm didn't turn up
anything related.

With GCC on riscv the main loop in SPEC2017 519.lbm is not vectorized,
while clang vectorizes it.  On x86 neither clang nor GCC appear to
vectorize it.

A (not entirely minimal, but let's start somewhere) example is the
following.  This one, however, is vectorized by clang-17 on x86 but not
by GCC trunk on x86 or the other targets I checked.

#define CST1 (1.0 / 3.0)

typedef enum
{
  C = 0,
  N, S, E, W, T, B,
  NW, NE, A, BB, CC, D, EE, FF, GG, HH, II, JJ,
  FLAGS, NN
} CELL_ENTRIES;

#define SX 100
#define SY 100
#define SZ 130

#define CALC_INDEX(x, y, z, e) \
  ((e) + NN * ((x) + (y) * SX + (z) * SX * SY))
#define GRID_ENTRY_SWEEP(g, dx, dy, dz, e) \
  ((g)[CALC_INDEX (dx, dy, dz, e) + (i)])
#define LOCAL(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))

#define NEIGHBOR_C(g, e) (GRID_ENTRY_SWEEP (g, 0, 0, 0, e))
#define NEIGHBOR_S(g, e) (GRID_ENTRY_SWEEP (g, 0, -1, 0, e))
#define NEIGHBOR_N(g, e) (GRID_ENTRY_SWEEP (g, 0, +1, 0, e))
#define NEIGHBOR_E(g, e) (GRID_ENTRY_SWEEP (g, +1, 0, 0, e))

#define SRC_C(g) (LOCAL (g, C))
#define SRC_N(g) (LOCAL (g, N))
#define SRC_S(g) (LOCAL (g, S))
#define SRC_E(g) (LOCAL (g, E))
#define SRC_W(g) (LOCAL (g, W))

#define DST_C(g) (NEIGHBOR_C (g, C))
#define DST_N(g) (NEIGHBOR_N (g, N))
#define DST_S(g) (NEIGHBOR_S (g, S))
#define DST_E(g) (NEIGHBOR_E (g, E))

typedef double arr[SX * SY * SZ * NN];

#define OMEGA 0.123

void
foo (arr src, arr dst)
{
  double ux, uy, u2;
  const double lambda0 = 1.0 / (0.5 + 3.0 / (16.0 * (1.0 / OMEGA - 0.5)));
  double fs[NN], fa[NN], feqs[NN], feqa[NN];

  for (int i = 0; i < SX * SY * SZ * NN; i += NN)
    {
      ux = 1.0;
      uy = 1.0;

      feqs[C] = CST1 * (1.0);
      feqs[N] = feqs[S] = CST1 * (1.0 + 4.5 * (+uy) * (+uy));
      feqa[C] = 0.0;
      feqa[N] = 0.2;

      fs[C] = SRC_C (src);
      fs[N] = fs[S] = 0.5 * (SRC_N (src) + SRC_S (src));
      fa[C] = 0.0;
      fa[N] = 0.1;

      DST_C (dst) = SRC_C (src) - OMEGA * (fs[C] - feqs[C]);
      DST_N (dst) = SRC_N (src) - OMEGA * (fs[N] - feqs[N])
                    - lambda0 * (fa[N] - feqa[N]);
    }
}

The vectorizer's missed-optimization dump shows:

missed.c:19:2: note: ==> examining statement: _4 = *_3;
missed.c:19:2: missed: no array mode for V8DF[20]
missed.c:19:2: missed: no array mode for V8DF[20]
missed.c:19:2: missed: the size of the group of accesses is not a power of 2 or not equal to 3
missed.c:19:2: missed: not falling back to elementwise accesses
missed.c:43:11: missed: not vectorized: relevant stmt not supported: _4 = *_3;

Also refer to https://godbolt.org/z/P517qc3Yf for riscv and
https://godbolt.org/z/M134KvEEo for aarch64.  For aarch64 it seems clang
would vectorize the snippet but does not consider it profitable to do so.

For riscv and the full lbm workload the clang build executes roughly one
third as many dynamic instructions (measured with qemu) as the GCC build:
340 billion vs. 1200 billion.