https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123163

            Bug ID: 123163
           Summary: vectorization of pointer arithmetic within struct
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: manu at gcc dot gnu.org
  Target Milestone: ---

I am not sure if this is a missed optimization or my misunderstanding of how
GCC calculates the vectorization profitability.

I have a piece of code that performs the operation in "bar()" for values of n
on the order of thousands or hundreds of thousands.

struct list {
    const double * restrict x;
    struct list ** restrict next;
};

void bar(struct list * restrict p, int n) {
    for (int i = 0; i < n; i++) {
        p[i].x--;
    }
}

void foo(const double ** p, int n) {
    for (int i = 0; i < n; i++) {
        p[i]--;
    }
}

static inline void foo_inline(const double ** p, int n) {
    for (int i = 0; i < n; i++) {
        p[i]--;
    }
}

void baz(struct list * restrict p, int n) {
#ifndef N
#define N 16
#endif
    const double * vec[N];
    int blocks = n / N;
    for (int i = 0; i < blocks; i++) {
        for (int k = 0; k < N; k++)
            vec[k] = p[k].x;
        foo_inline(vec, N);
        for (int k = 0; k < N; k++)
            p[k].x = vec[k];
        p += N;
    }
    for (int i = blocks*N; i < n; i++) {
        p[i].x--;
    }
}


gcc -O3 -march=x86-64-v2 -fopt-info-vec-all :

<source>:7:23: missed: couldn't vectorize loop
<source>:8:13: missed: not vectorized: no vectype for stmt: _4 = *_3.x;
 scalar_type: const double * restrict
<source>:6:6: note: vectorized 0 loops in function.
<source>:10:1: note: ***** Analysis failed with vector mode V2DI
<source>:10:1: note: ***** Skipping vector mode V16QI, which would repeat the
analysis for V2DI
<source>:13:23: optimized: loop vectorized using 16 byte vectors and unroll
factor 2
<source>:12:6: note: vectorized 1 loops in function.
<source>:16:1: note: ***** Analysis failed with vector mode V2DI
<source>:16:1: note: ***** Skipping vector mode V16QI, which would repeat the
analysis for V2DI
<source>:38:30: missed: couldn't vectorize loop
<source>:39:13: missed: not vectorized: no vectype for stmt: _12 = *_11.x;
 scalar_type: const double * restrict
<source>:30:23: missed: couldn't vectorize loop
<source>:32:42: missed: not vectorized: no vectype for stmt: _323 = *p_324.x;
 scalar_type: const double * restrict
<source>:24:6: note: vectorized 0 loops in function.
<source>:41:1: note: ***** Analysis failed with vector mode V2DI
<source>:41:1: note: ***** Skipping vector mode V16QI, which would repeat the
analysis for V2DI

That is, foo() is vectorized but the rest are not. Surprisingly, foo_inline()
is not, even though its body is identical to foo()!

However, with

gcc -O3 -march=x86-64-v2 -fopt-info-vec-all -DN=32 :

<source>:7:23: missed: couldn't vectorize loop
<source>:8:13: missed: not vectorized: no vectype for stmt: _4 = *_3.x;
 scalar_type: const double * restrict
<source>:6:6: note: vectorized 0 loops in function.
<source>:10:1: note: ***** Analysis failed with vector mode V2DI
<source>:10:1: note: ***** Skipping vector mode V16QI, which would repeat the
analysis for V2DI
<source>:13:23: optimized: loop vectorized using 16 byte vectors and unroll
factor 2
<source>:12:6: note: vectorized 1 loops in function.
<source>:16:1: note: ***** Analysis failed with vector mode V2DI
<source>:16:1: note: ***** Skipping vector mode V16QI, which would repeat the
analysis for V2DI
<source>:38:30: missed: couldn't vectorize loop
<source>:39:13: missed: not vectorized: no vectype for stmt: _12 = *_11.x;
 scalar_type: const double * restrict
<source>:30:23: missed: couldn't vectorize loop
<source>:30:23: missed: not vectorized: loop nest containing two or more
consecutive inner loops cannot be vectorized
<source>:34:27: optimized: loop vectorized using 16 byte vectors and unroll
factor 2
<source>:19:23: optimized: loop vectorized using 16 byte vectors and unroll
factor 2
<source>:31:27: optimized: loop vectorized using 16 byte vectors and unroll
factor 2
<source>:24:6: note: vectorized 3 loops in function.
<source>:41:1: note: ***** Analysis failed with vector mode V2DI
<source>:41:1: note: ***** Skipping vector mode V16QI, which would repeat the
analysis for V2DI

The three inner loops in baz() are vectorized, including the one from
foo_inline().

Why does the vectorizer not vectorize bar() exactly as it does baz() with
N=32? Note that using #pragma GCC unroll 32 in bar() does not help.
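
Since the "no vectype" diagnostic prints the scalar type as
"const double * restrict", one experiment that might narrow this down is to
compile the same loop over a struct whose members carry no restrict qualifier
and compare the -fopt-info-vec-all output. The sketch below is only that
experiment: struct list_plain, bar_plain() and bar_plain_check() are
hypothetical names that are not part of the original test case, and whether
this variant gets vectorized has not been checked here.

```c
#include <stddef.h>

/* Hypothetical variant of struct list: same layout, but without
   restrict on the members. */
struct list_plain {
    const double *x;
    struct list_plain **next;
};

/* Same loop as bar(), applied to the unqualified struct. */
void bar_plain(struct list_plain * restrict p, int n) {
    for (int i = 0; i < n; i++) {
        p[i].x--;
    }
}

/* Hypothetical sanity check: every x must move back by one element.
   Returns 1 on success, 0 otherwise. */
int bar_plain_check(void) {
    static double data[9];
    struct list_plain q[8];
    for (int i = 0; i < 8; i++) {
        q[i].x = &data[i + 1];
        q[i].next = NULL;
    }
    bar_plain(q, 8);
    for (int i = 0; i < 8; i++) {
        if (q[i].x != &data[i])
            return 0;
    }
    return 1;
}
```

Compiling this with the same flags as above would show whether the qualifier
in the reported scalar_type matters, or whether the strided access through
the 16-byte struct is what the vectorizer rejects.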


Rewriting bar() as:

void bar(struct list * restrict p, int n) {
    int blocks = n / N;
    for (int i = 0; i < blocks; i++) {
        #pragma GCC ivdep
        for (int k = 0; k < N; k++) {
            p[k].x--;
        }
        p += N;
    }
    for (int i = blocks*N; i < n; i++) {
        p[i].x--;
    }
}

does not help either: the loops are still not vectorized.
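
Note that the tail loop in the rewritten bar() (and likewise in baz()) starts
at i = blocks*N after p has already been advanced by blocks*N elements, so
for n that is not a multiple of N it touches the wrong elements. This does
not affect the vectorization question, but a minimal sketch of the blocked
form with the tail indexed relative to the advanced pointer could look like
the following. The names bar_ref(), bar_blocked() and
blocked_matches_reference() are hypothetical and only used for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

#ifndef N
#define N 16
#endif

struct list {
    const double * restrict x;
    struct list ** restrict next;
};

/* Reference: the plain loop from the original bar(). */
void bar_ref(struct list * restrict p, int n) {
    for (int i = 0; i < n; i++) {
        p[i].x--;
    }
}

/* Blocked variant; the tail counts from 0 because p already points
   just past the last full block. */
void bar_blocked(struct list * restrict p, int n) {
    int blocks = n / N;
    for (int i = 0; i < blocks; i++) {
        #pragma GCC ivdep
        for (int k = 0; k < N; k++) {
            p[k].x--;
        }
        p += N;
    }
    int tail = n - blocks * N;
    for (int i = 0; i < tail; i++) {
        p[i].x--;
    }
}

/* Hypothetical check: run both variants on identical input and compare.
   Returns 1 if the results agree, 0 otherwise. */
int blocked_matches_reference(int n) {
    enum { MAX = 256 };
    static double data[MAX + 1];
    struct list a[MAX], b[MAX];
    assert(n >= 0 && n <= MAX);
    for (int i = 0; i < n; i++) {
        a[i].x = b[i].x = &data[i + 1];
        a[i].next = b[i].next = NULL;
    }
    bar_ref(a, n);
    bar_blocked(b, n);
    for (int i = 0; i < n; i++) {
        if (a[i].x != b[i].x || a[i].x != &data[i])
            return 0;
    }
    return 1;
}
```

Calling blocked_matches_reference() with a value such as 37 exercises two
full blocks plus a tail of 5 elements when N is 16.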
