[Bug tree-optimization/114635] New: OpenMP reductions fail dependency analysis

tnfchris at gcc dot gnu.org via Gcc-bugs Mon, 08 Apr 2024 02:58:49 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635


            Bug ID: 114635
           Summary: OpenMP reductions fail dependency analysis
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase is reduced from an HPC workload:

#include <math.h>

#define RESTRICT restrict

void work(int n, float *RESTRICT x, float *RESTRICT y,
          float *RESTRICT z, float *RESTRICT mass,
          float x0, float y0, float z0,
          float *RESTRICT ax, float *RESTRICT ay,
          float *RESTRICT az) {
  float lax = 0.0f, lay = 0.0f, laz = 0.0f;

#if _OPENMP >= 201307
#pragma omp simd reduction(+:lax,lay,laz)
#endif
  for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0;
    float dy = y[i] - y0;
    float dz = z[i] - z0;
    float r2 = dx + dy + dz;

    if (r2 == 0.0f)
      continue;

    float f = (1.0f / (r2 * sqrtf(r2))) * mass[i];

    lax += f * dx;
    lay += f * dy;
    laz += f * dz; 
  }

  *ax += lax;
  *ay += lay;
  *az += laz;
}

When compiled with -Ofast -march=armv9-a -fopenmp-simd, the loop vectorizes as
expected, but when the pragma is in effect, e.g. with -Ofast -march=armv9-a
-fopenmp, the main loop fails to vectorize with:

(compute_affine_dependence
  ref_a: D.5962[_33], stmt_a: _69 = D.5962[_33];
  ref_b: D.5962[_33], stmt_b: D.5962[_33] = _ifc__147;
) -> dependence analysis failed
/app/example.c:16:17: missed:  bad data dependence.
/app/example.c:16:17: note:  ***** Analysis  failed with vector mode VNx4SF

This doesn't seem to happen with just 2 reductions, but with 3 reductions
dependency analysis fails.

I don't know much about OpenMP, but my understanding is that this pragma is
intended for architectures that don't have masking support, and that it works
by splitting the loop and removing the reductions from the main loop, creating
OpenMP "workers" that each run on one thread.

The reduction values are turned into local arrays, and these threads then write
into their own slots in these arrays.

The reduction itself is then done as a final post step.
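
As a rough illustration of the shape described above, here is a hand-written
sketch for a single reduction. It is not GCC's actual lowering: the names
LANES and lane_private_sum are made up for this example, the slot count of 4
is an arbitrary assumption, and each "worker" is modelled simply as a slot
index.

```c
#include <stddef.h>

/* Hypothetical number of private slots; chosen only for illustration. */
#define LANES 4

/* Sums f[i] * d[i] using one private accumulator slot per worker,
   then folds the slots together in a separate post step. */
float lane_private_sum(size_t n, const float *f, const float *d)
{
  /* Stands in for the compiler-introduced local array. */
  float slots[LANES] = { 0.0f };

  /* Main loop: iteration i only reads and writes slot i % LANES, so any
     LANES consecutive iterations touch distinct slots. */
  for (size_t i = 0; i < n; ++i)
    slots[i % LANES] += f[i] * d[i];

  /* Post step: reduce the per-slot partial sums to one scalar. */
  float total = 0.0f;
  for (size_t lane = 0; lane < LANES; ++lane)
    total += slots[lane];

  return total;
}
```

In this sketch the load and store of slots[i % LANES] in the main loop play
the role of the two D.5962[_33] references in the dump above.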

It looks like the only thing we can vectorize is the post step.

I wonder, since the compiler is the one introducing these local arrays, can we
not mark them as safe from dependencies between iterations?
