https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91374

            Bug ID: 91374
           Summary: [Missed optimization] Versioning opportunities to
                    improve performance
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hliu at amperecomputing dot com
  Target Milestone: ---

Consider the following code:

=== begin code ===

#define LENGTH 1024  /* 32 rows * STRIDE, so the height == 32 case stays in bounds */
#define STRIDE 32

char src[LENGTH];
char dst[LENGTH];
static const int height[2] = { 32, 16 };
static const int width[2] = { 16, 8 };

volatile int result;

void foo(int height, int width) {
    char * ptr_src = src;
    char * ptr_dst = dst;

    for( int y = 0; y < height; y++ )
    {
        for( int x = 0; x < width; x++ )
            ptr_dst[x] = ptr_src[x] + ptr_src[x];
        ptr_dst += STRIDE;
        ptr_src += STRIDE;
    }
}

int main(int argc, char *argv[]) {
    for (int i = 0; i < LENGTH; i++) {
        src[i] = i % 256;
    }

    int idx = argc % 2;
    int h = height[idx];
    int w = width[idx];
    foo(h, w);

    result = dst[argc];
}

=== end code ===

The inner loop bound "width" can only be 16 or 8. When compiled with "-O3", gcc
generates nested loops for foo.

But if gcc used the values 16 and 8 to version the code (e.g. 3 versions: one
for 8, one for 16, and a general one), the inner loop could be eliminated and
its body fully vectorized (on both AArch64 and x86_64). Performance would be
much better because only a single loop remains.
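
As a rough illustration of the versioning I have in mind, here is a
hand-written sketch. The names foo_body and foo_versioned are made up for this
example, and the arrays are passed as restrict pointers instead of globals only
to keep the snippet self-contained. It is not a claim about how gcc should
implement this internally.

```c
#define STRIDE 32

/* Hypothetical helper: the same loop nest as foo. When it is inlined with
   a constant width, the inner trip count is known at compile time. */
static inline void foo_body(char *restrict ptr_dst,
                            const char *restrict ptr_src,
                            int height, int width)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            ptr_dst[x] = ptr_src[x] + ptr_src[x];
        ptr_dst += STRIDE;
        ptr_src += STRIDE;
    }
}

/* Hand-versioned variant of foo: two specialized versions for the widths
   seen in practice, plus a general fallback for any other width. */
void foo_versioned(char *restrict dst, const char *restrict src,
                   int height, int width)
{
    if (width == 16)
        foo_body(dst, src, height, 16);     /* constant inner bound */
    else if (width == 8)
        foo_body(dst, src, height, 8);      /* constant inner bound */
    else
        foo_body(dst, src, height, width);  /* general version */
}
```

All three branches compute the same result as the original foo. The only
difference is that in the first two the inner bound is a compile-time
constant, which is the information the optimizer currently does not have.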

This case is simplified from a real benchmark, in which foo is very hot.
