https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91374
Bug ID: 91374
Summary: [Missed optimization] Versioning opportunities to improve performance
Product: gcc
Version: tree-ssa
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hliu at amperecomputing dot com
Target Milestone: ---

Consider the following code:

=== begin code ===
#define LENGTH 512
#define STRIDE 32

char src[LENGTH];
char dst[LENGTH];

static const int height[2] = { 32, 16 };
static const int width[2] = { 16, 8 };

volatile int result;

void foo(int height, int width)
{
  char * ptr_src = src;
  char * ptr_dst = dst;

  for( int y = 0; y < height; y++ )
    {
      for( int x = 0; x < width; x++ )
        ptr_dst[x] = ptr_src[x] + ptr_src[x];
      ptr_dst += STRIDE;
      ptr_src += STRIDE;
    }
}

int main(int argc, char *argv[])
{
  for (int i = 0; i < LENGTH; i++)
    {
      src[i] = i % 256;
    }

  int idx = argc % 2;
  int h = height[idx];
  int w = width[idx];
  foo(h, w);
  result = dst[argc];
}
=== end code ===

The inner loop bound "width" is always either 16 or 8. Compiled with "-O3",
gcc generates nested loops for foo. If gcc used the values 16 and 8 to version
the code (e.g. 3 versions: width 8, width 16, and general), the inner loop in
the two specialized versions could be removed and fully vectorized (on both
AArch64 and x86_64). Performance would be much better, because only a single
loop remains.

The test case is simplified from a real benchmark, in which foo is very hot.
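For illustration, the versioning requested above can be written by hand. This is only a sketch of the transformation the report asks gcc to perform automatically, not what gcc emits today. The names foo_general, foo_w16, foo_w8, foo_versioned, versions_agree and the macro DEFINE_FOO_FIXED are hypothetical and do not appear in the original test case. The self-check uses height 16, which keeps every access inside the 512-byte arrays.

```c
#include <assert.h>
#include <string.h>

#define LENGTH 512
#define STRIDE 32

char src[LENGTH];
char dst[LENGTH];

/* General version: the nested loops from the report, unchanged. */
void foo_general(int height, int width)
{
    char *ptr_src = src;
    char *ptr_dst = dst;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            ptr_dst[x] = ptr_src[x] + ptr_src[x];
        ptr_dst += STRIDE;
        ptr_src += STRIDE;
    }
}

/* Hypothetical helper macro: defines a copy of the loop nest whose inner
   trip count is the compile-time constant W, so the compiler is free to
   unroll or vectorize the inner loop completely. */
#define DEFINE_FOO_FIXED(W)                               \
    void foo_w##W(int height)                             \
    {                                                     \
        char *ptr_src = src;                              \
        char *ptr_dst = dst;                              \
        for (int y = 0; y < height; y++) {                \
            for (int x = 0; x < (W); x++)                 \
                ptr_dst[x] = ptr_src[x] + ptr_src[x];     \
            ptr_dst += STRIDE;                            \
            ptr_src += STRIDE;                            \
        }                                                 \
    }

DEFINE_FOO_FIXED(16)   /* defines foo_w16 */
DEFINE_FOO_FIXED(8)    /* defines foo_w8  */

/* Dispatcher: the "3 versions" (8, 16, general) mentioned in the report. */
void foo_versioned(int height, int width)
{
    if (width == 16)
        foo_w16(height);
    else if (width == 8)
        foo_w8(height);
    else
        foo_general(height, width);
}

/* Returns 1 if the versioned code writes exactly the same bytes to dst
   as the general code for the given bounds, 0 otherwise. */
int versions_agree(int height, int width)
{
    char expected[LENGTH];

    /* The loops must stay inside the LENGTH-byte arrays. */
    assert(height <= 0 || width <= 0 ||
           (height - 1) * STRIDE + width <= LENGTH);

    for (int i = 0; i < LENGTH; i++)
        src[i] = (char)(i % 256);

    memset(dst, 0, sizeof dst);
    foo_general(height, width);
    memcpy(expected, dst, sizeof dst);

    memset(dst, 0, sizeof dst);
    foo_versioned(height, width);
    return memcmp(expected, dst, sizeof dst) == 0;
}
```

Calling foo_versioned(h, w) in place of foo(h, w) keeps the behavior of the original test case. Whether the specialized bodies are actually reduced to a single vectorized loop depends on the compiler version, target and flags, so that should be confirmed by inspecting the generated assembly.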