[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #24 from edison ---
(In reply to Hongtao Liu from comment #23)
> (In reply to edison from comment #22)
> > [...]
> Because E-cores run slower than P-cores, if you bind the threads to
> each core, it prevents threads from migrating from the E-cores to the
> P-cores.

I know, but I think there is no way to bind each thread to a core in
OpenMP (CPU2017 speed) mode; only multi-job (CPU2017 rate) runs can do
that.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #23 from Hongtao Liu ---
(In reply to edison from comment #22)
> [...]

Because E-cores run slower than P-cores, if you bind the threads to each
core, it prevents threads from migrating from the E-cores to the
P-cores.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

edison changed:
           What |Removed |Added
           CC   |        |edison_chan_gz at hotmail dot com

--- Comment #22 from edison ---
For 607.cactuBSSN_s, if preENV_GOMP_CPU_AFFINITY = 0-23 is set in the
CPU2017 .cfg, all P-core (i9-13900K) usage drops to 15% (the E-cores are
almost at 100%); if it is commented out, P-core usage rises to 60%.

607.cactuBSSN_s on i9-13900K, gcc 14.1:

  preENV_GOMP_CPU_AFFINITY = 0-23:    60.1  (-41.7% slower)
  # preENV_GOMP_CPU_AFFINITY = 0-23:  103

But for AMD Zen 4(+) it may be another story so far (AMD Zen 4 needs
preENV_GOMP_CPU_AFFINITY to make the threads run on the high-performance
cores first).
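For reference, the preENV line above corresponds to setting libgomp's
documented GOMP_CPU_AFFINITY environment variable directly. A minimal
sketch, assuming CPUs 0-23 are the P-core hardware threads on the
i9-13900K (the numbering is an assumption — verify with
`lscpu --extended`); `true` stands in for the real benchmark binary:

```shell
# Pin libgomp threads: thread i runs on the i-th CPU in the list.
export GOMP_CPU_AFFINITY="0-23"
export OMP_NUM_THREADS=24
true   # ./607.cactuBSSN_s would go here
echo "affinity=$GOMP_CPU_AFFINITY"
```

With the pinning in place the scheduler can no longer migrate an OpenMP
thread off the listed CPUs, which is exactly why a wrong list hurts on a
hybrid part.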
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

liuhongt at gcc dot gnu.org changed:
           What |Removed |Added
           CC   |        |liuhongt at gcc dot gnu.org

--- Comment #21 from liuhongt at gcc dot gnu.org ---
The main gap comes from OpenMP behavior on hybrid machines.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka ---
On Zen 4 hardware I now get, for Operation: Resizing:

  GCC 13   -O3 -flto -march=native -fopenmp: 2163 2161 2153, average 2159 Iterations Per Minute
  clang 17 -O3 -flto -march=native -fopenmp: 2004 1988 1991, average 1994 Iterations Per Minute
  trunk    -O3 -flto -march=native -fopenmp: 2126 2135 2123, average 2128 Iterations Per Minute

So no big changes here...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #19 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:e1e127de18dbee47b88fa0ce74a1c7f4d658dc68

commit r14-4571-ge1e127de18dbee47b88fa0ce74a1c7f4d658dc68
Author: Zhang, Jun
Date:   Fri Sep 22 23:56:37 2023 +0800

    x86: set spincount 1 for x86 hybrid platform

    By test, we find that on hybrid platforms a spincount of 1 is
    better.  Using '-march=native -Ofast -funroll-loops -flto', the
    results are as follows:

    spec2017 speed        RPL      ADL
    657.xz_s             0.00%    0.50%
    603.bwaves_s        10.90%   26.20%
    607.cactuBSSN_s      5.50%   72.50%
    619.lbm_s            2.40%    2.50%
    621.wrf_s           -7.70%    2.40%
    627.cam4_s           0.50%    0.70%
    628.pop2_s          48.20%  153.00%
    638.imagick_s       -0.10%    0.20%
    644.nab_s            2.30%    1.40%
    649.fotonik3d_s      8.00%   13.80%
    654.roms_s           1.20%    1.10%
    Geomean-int          0.00%    0.50%
    Geomean-fp           6.30%   21.10%
    Geomean-all          5.70%   19.10%

    omp2012               RPL      ADL
    350.md              -1.81%   -1.75%
    351.bwaves           7.72%   12.50%
    352.nab             14.63%   19.71%
    357.bt331           -0.20%    1.77%
    358.botsalgn         0.00%    0.00%
    359.botsspar         0.00%    0.65%
    360.ilbdc            0.00%    0.25%
    362.fma3d            2.66%   -0.51%
    363.swim            10.44%    0.00%
    367.imagick          0.00%    0.12%
    370.mgrid331         2.49%   25.56%
    371.applu331         1.06%    4.22%
    372.smithwa          0.74%    3.34%
    376.kdtree          10.67%   16.03%
    GEOMEAN              3.34%    5.53%

    include/ChangeLog:

            PR target/109812
            * spincount.h: New file.

    libgomp/ChangeLog:

            * env.c (initialize_env): Use do_adjust_default_spincount.
            * config/linux/x86/spincount.h: New file.
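The commit above only changes libgomp's built-in default spin count on
hybrid x86; GOMP_SPINCOUNT is a documented libgomp environment variable,
so the same behaviour can be forced by hand on any build. A sketch
(`true` stands in for the real OpenMP binary):

```shell
# Ask each idle libgomp thread to spin only once before sleeping,
# instead of busy-waiting on (possibly slow E-core) CPUs.
export GOMP_SPINCOUNT=1
true   # ./benchmark_binary would go here
echo "spincount=$GOMP_SPINCOUNT"
```

This is useful for checking whether a pre-r14-4571 GCC reproduces the
gap: if GOMP_SPINCOUNT=1 closes it, the spin-wait default was the cause.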
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #18 from Uroš Bizjak ---
One interesting observation: clang is able to do this:

  0.09 │ vmovddup    -0x8(%rdx,%rsi,1),%xmm3
  ...
  0.11 │ vfmadd231sd %xmm2,%xmm3,%xmm1
  ...
  0.74 │ vfmadd231pd %xmm2,%xmm3,%xmm0

It figures out that the duplicated V2DFmode value in %xmm3 can also be
accessed in the same register as a DFmode value.  OTOH, current gcc
does:

        vmovsd      (%rsi,%rax,8), %xmm1
        ...
        vmovddup    %xmm1, %xmm4
        ...
        vfmadd231pd %xmm4, %xmm0, %xmm2
        ...
        vfmadd231sd %xmm1, %xmm0, %xmm3

The above code needs two registers.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #17 from Jan Hubicka ---
I was also thinking of DCE.  It looks like a plausible idea.  It may
lead to a surprise where you store the same undefined variable to two
places and later compare them for equality, but that is undefined
anyway.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jakub Jelinek changed:
           What |Removed |Added
           CC   |        |jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek ---
Shouldn't we DCE

  something = x_N(D);

stores when x is a VAR_DECL, at least provided something can't trap?
I mean, the previous content is one of the possible uninitialized
values.
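In source terms the store Jakub wants to eliminate looks like this
sketch (field names mirror the testcase in comment #10; the function
name is made up for the example). After SRA, returning the aggregate
copies the never-written scalar replacement of `.o` back into memory,
and since the previous memory content is an equally valid
"uninitialized" value, that store is a dead-code-elimination candidate:

```c
#include <assert.h>

struct drgb { double r, g, b, o; };

struct drgb sum_sketch(double w)
{
    struct drgb r;
    r.r = r.g = r.b = 0.0;   /* .o deliberately left uninitialized */
    r.r += 1.0 * w;
    r.g += 2.0 * w;
    r.b += 3.0 * w;
    /* After SRA, this return also stores r$o_NN(D) -- an undefined
       value -- into the result's .o slot; DCE-ing that store would
       shrink the 4-element interleaving group back to 3. */
    return r;
}
```

Only the defined fields can be observed; reading the returned `.o`
would itself be undefined behavior.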
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #15 from Martin Jambor ---
Oh, because I missed the -DOPACITY in the second command line.  The
reason for SRA creating the replacement is total scalarization :-/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #14 from Martin Jambor ---
(In reply to Jan Hubicka from comment #13)
> The only difference between slp vectorization is:
>
> -  # _68 = PHI <_5(3)>
> -  # _67 = PHI <_11(3)>
> -  # _66 = PHI <_16(3)>
> -  .r = _68;
> -  .g = _67;
> -  .b = _66;
> +  # _70 = PHI <_5(3)>
> +  # _69 = PHI <_11(3)>
> +  # _68 = PHI <_16(3)>
> +  .r = _70;
> +  .g = _69;
> +  .b = _68;
> +  .o = r$o_33(D);
>
> so SRA invents r$o_33(D) even if that variable is undefined.

Is this the testcase from comment #10?  I don't see r$o in my dumps.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka changed:
           What     |Removed |Added
           CC       |        |rguenther at suse dot de
           See Also |        |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #13 from Jan Hubicka ---
The only difference between slp vectorization is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  .r = _68;
-  .g = _67;
-  .b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  .r = _70;
+  .g = _69;
+  .b = _68;
+  .o = r$o_33(D);

so SRA invents r$o_33(D) even if that variable is undefined.  The SLP
vectorizer then sees it as interleaving stores:

-t.c:19:16: note:  _1 = rgbs[i_35].r;
-t.c:19:16: note:  _7 = rgbs[i_35].g;
-t.c:19:16: note:  _12 = rgbs[i_35].b;
-t.c:19:16: note: Detected interleaving store of size 3
-t.c:19:16: note:  .r = _68;
-t.c:19:16: note:  .g = _67;
-t.c:19:16: note:  .b = _66;
+t.c:19:16: note:  _1 = rgbs[i_37].r;
+t.c:19:16: note:  _7 = rgbs[i_37].g;
+t.c:19:16: note:  _12 = rgbs[i_37].b;
+t.c:19:16: note: Detected interleaving store of size 4
+t.c:19:16: note:  .r = _70;
+t.c:19:16: note:  .g = _69;
+t.c:19:16: note:  .b = _68;
+t.c:19:16: note:  .o = r$o_33(D);

For the first case it first tries to vectorize for a vector of 3 doubles
and fails:

-t.c:19:16: note:  .r = _68;
-t.c:19:16: note:  .g = _67;
-t.c:19:16: note:  .b = _66;
-t.c:19:16: note: starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note: Build SLP for .r = _68;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: Build SLP for .g = _67;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: Build SLP for .b = _66;
-t.c:19:16: note: get vectype for scalar type (group size 3): double
-t.c:19:16: note: vectype: vector(2) double
-t.c:19:16: note: nunits = 2
-t.c:19:16: missed: Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note: SLP discovery for node 0x2cb4fe8 failed

And later it tries to vectorize the first 2 items:

-t.c:19:16: note: Splitting SLP group at stmt 2
-t.c:19:16: note: Split group into 2 and 1
-t.c:19:16: note: Starting SLP discovery for
-t.c:19:16: note:  .r = _68;
-t.c:19:16: note:  .g = _67;
-t.c:19:16 ...

and after a lot of blablabla succeeds.  If the opaque field is present
we start with a vector of size 4:

+t.c:19:16: note:  .r = _70;
+t.c:19:16: note:  .g = _69;
+t.c:19:16: note:  .b = _68;
+t.c:19:16: note:  .o = r$o_33(D);
+t.c:19:16: note: vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand _69 = PHI <_11(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand _68 = PHI <_16(3)>, type of def: internal
+t.c:19:16: note: vect_is_simple_use: operand r$o_33(D), type of def: external
+t.c:19:16: missed: treating operand as external
+t.c:19:16: note: SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note: SLP size 1 vs. limit 23.
+t.c:19:16: note: Final SLP tree for instance 0x2def840:
+t.c:19:16: note: node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note: op template: .r = _70;
+t.c:19:16: note:       stmt 0 .r = _70;
+t.c:19:16: note:       stmt 1 .g = _69;
+t.c:19:16: note:       stmt 2 .b = _68;
+t.c:19:16: note:       stmt 3 .o = r$o_33(D);
+t.c:19:16: note:       children 0x2e800d8
+t.c:19:16: note: node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note:       { _70, _69, _68, r$o_33(D) }

So it seems to succeed vectorizing with 4 entries, but it does so for
the single return statement:

   [local count: 1063004409]:
  # i_37 = PHI 
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI 
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
    goto ; [99.00%]
  else
    goto ; [1.00%]

   [local count: 1052374367]:
  goto ; [100.00%]

   [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 =
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #12 from Jan Hubicka ---
> /home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
> function `main':
> :(.text.startup+0x1): undefined reference to `GMCommand'

I wonder if your linker plugin is configured correctly.  Can you try to
build with -flto -fuse-linker-plugin?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #11 from jun zhang ---
Hello, Hubicka and Artem,

I tried to reproduce this issue on Raptor Lake.  Using -fopenmp -O3
-flto I hit the following error, but with -fopenmp -O3 and no -flto it
builds OK.  Could you help me?

libtool: link: /home/sdp/jun/gcc0/install/bin/gcc -fopenmp -O3 -flto
-march=native -Wall -o utilities/gm utilities/gm.o
-L/home/sdp/jun/omp/Ofast/pts_g_gomp/install/.phoronix-test-suite/installed-tests/pts/graphics-magick-2.1.0/gm_/lib
magick/.libs/libGraphicsMagick.a -lfreetype -ljbig -ltiff -ljpeg -lXext
-lSM -lICE -lX11 -llzma -lbz2 -lz -lzstd -lm -lpthread -fopenmp
/home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
function `main':
:(.text.startup+0x1): undefined reference to `GMCommand'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:6411: utilities/gm] Error 1
make[1]: Leaving directory

hubicka at gcc dot gnu.org wrote on Mon, May 29, 2023 at 02:50:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812
>
> --- Comment #10 from Jan Hubicka ---
> This is a benchmarkable version of the simplified testcase:
> [...]
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #10 from Jan Hubicka ---
This is a benchmarkable version of the simplified testcase:

jan@localhost:/tmp> cat t.c
#define N 1000
struct rgb {unsigned char r,g,b;} rgbs[N];
int *addr;
struct drgb {double r,g,b;
#ifdef OPACITY
  double o;
#endif
};

struct drgb sum(double w)
{
  struct drgb r;
  for (int i = 0; i < N; i++)
    {
      r.r += rgbs[i].r * w;
      r.g += rgbs[i].g * w;
      r.b += rgbs[i].b * w;
    }
  return r;
}

jan@localhost:/tmp> cat q.c
struct drgb {double r,g,b;
#ifdef OPACITY
  double o;
#endif
};
struct drgb sum(double w);
int
main()
{
  for (int i = 0; i < 1000; i++)
    sum(i);
}

jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g ; objdump -d a.out | grep vfmadd231pd ; perf stat ./a.out
  40119d: c4 e2 d9 b8 d1  vfmadd231pd %xmm1,%xmm4,%xmm2

 Performance counter stats for './a.out':

          12,148.04 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                736      page-faults:u             #   60.586 /sec
     50,018,421,148      cycles:u                  #    4.117 GHz
            220,502      stalled-cycles-frontend:u #    0.00% frontend cycles idle
     39,950,154,369      stalled-cycles-backend:u  #   79.87% backend cycles idle
    120,000,191,713      instructions:u            #    2.40 insn per cycle
                                                   #    0.33 stalled cycles per insn
     10,000,048,918      branches:u                #  823.182 M/sec
              7,959      branch-misses:u           #    0.00% of all branches

       12.149466078 seconds time elapsed

       12.149084000 seconds user
        0.0 seconds sys

jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g -DOPACITY ; objdump -d a.out | grep vfmadd231pd ; perf stat ./a.out

 Performance counter stats for './a.out':

          12,141.11 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                735      page-faults:u             #   60.538 /sec
     50,018,839,129      cycles:u                  #    4.120 GHz
            185,034      stalled-cycles-frontend:u #    0.00% frontend cycles idle
     29,963,999,798      stalled-cycles-backend:u  #   59.91% backend cycles idle
    120,000,191,729      instructions:u            #    2.40 insn per cycle
                                                   #    0.25 stalled cycles per insn
     10,000,048,913      branches:u                #  823.652 M/sec
              7,311      branch-misses:u           #    0.00% of all branches

       12.142252354 seconds time elapsed

       12.138237000 seconds user
        0.00400 seconds sys

So on zen2 hardware I get the same performance on both.  It may be
interesting to test it on Raptor Lake.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #9 from Jan Hubicka ---
Oddly enough, a simplified version of the loop SLP vectorizes for me:

struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b;};

struct drgb sum()
{
  struct drgb r;
  for (int i = 0; i < 10; i++)
    {
      int j = addr[i];
      double w = weights[i];
      r.r += rgbs[j].r * w;
      r.g += rgbs[j].g * w;
      r.b += rgbs[j].b * w;
    }
  return r;
}

I get:

.L2:
        movslq  (%r9,%rdx,4), %rax
        vmovsd  (%r8,%rdx,8), %xmm1
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rsi, %rax
        movzbl  (%rax), %ecx
        vmovddup %xmm1, %xmm4
        vmovd   %ecx, %xmm0
        movzbl  1(%rax), %ecx
        movzbl  2(%rax), %eax
        vpinsrd $1, %ecx, %xmm0, %xmm0
        vcvtdq2pd %xmm0, %xmm0
        vfmadd231pd %xmm4, %xmm0, %xmm2
        vcvtsi2sdl %eax, %xmm5, %xmm0
        vfmadd231sd %xmm1, %xmm0, %xmm3
        cmpq    $10, %rdx
        jne     .L2

I think the actual loop is:

   [local count: 44202554]:
  _106 = _262->pixel;
  _109 = *source_231(D).columns;

   [local count: 401841405]:
  # pixel$green_332 = PHI <_124(89), pixel$green_265(53)>
  # i_357 = PHI 
  # pixel$red_371 = PHI <_119(89), pixel$red_263(53)>
  # pixel$blue_377 = PHI <_129(89), pixel$blue_267(53)>
  i.51_102 = (long unsigned int) i_357;
  _103 = i.51_102 * 16;
  _104 = _262 + _103;
  _105 = _104->pixel;
  _107 = _105 - _106;
  _108 = (long unsigned int) _107;
  _110 = _108 * _109;
  _112 = _110 + _621;
  weight_297 = _104->weight;
  _113 = _112 * 4;
  _114 = _276 + _113;
  _115 = _114->red;
  _116 = (int) _115;
  _117 = (double) _116;
  _118 = _117 * weight_297;
  _119 = _118 + pixel$red_371;
  _120 = _114->green;
  _121 = (int) _120;
  _122 = (double) _121;
  _123 = _122 * weight_297;
  _124 = _123 + pixel$green_332;
  _125 = _114->blue;
  _126 = (int) _125;
  _127 = (double) _126;
  _128 = _127 * weight_297;
  _129 = _128 + pixel$blue_377;
  i_298 = i_357 + 1;
  if (n_195 > i_298)
    goto ; [89.00%]
  else
    goto ; [11.00%]

   [local count: 44202554]:
  # _607 = PHI <_124(54)>
  # _606 = PHI <_119(54)>
  # _605 = PHI <_129(54)>
  goto ; [100.00%]

   [local count: 357638851]:
  goto ; [100.00%]

and the SLP vectorizer seems to claim:

../magick/resize.c:1284:52: note: _125 = _114->blue;
../magick/resize.c:1284:52: note: _120 = _114->green;
../magick/resize.c:1284:52: note: _115 = _114->red;
../magick/resize.c:1284:52: missed: not consecutive access weight_297 = _104->weight;
../magick/resize.c:1284:52: missed: not consecutive access _105 = _104->pixel;
../magick/resize.c:1284:52: missed: not consecutive access _134->red = iftmp.57_207;
../magick/resize.c:1284:52: missed: not consecutive access _134->green = iftmp.60_208;
../magick/resize.c:1284:52: missed: not consecutive access _134->blue = iftmp.63_209;
../magick/resize.c:1284:52: missed: not consecutive access _134->opacity = 0;
../magick/resize.c:1284:52: missed: not consecutive access _63 = *source_231(D).columns;
../magick/resize.c:1284:52: missed: not consecutive access _60 = _262->pixel;

Not sure if that is related to the real testcase:

struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b,o;};

struct drgb sum()
{
  struct drgb r;
  for (int i = 0; i < 10; i++)
    {
      int j = addr[i];
      double w = weights[i];
      r.r += rgbs[j].r * w;
      r.g += rgbs[j].g * w;
      r.b += rgbs[j].b * w;
    }
  return r;
}

which makes us miss the vectorization even though nothing uses drgb->o:

sum:
.LFB0:
        .cfi_startproc
        movq    %rdi, %r8
        movq    weights(%rip), %rsi
        movq    addr(%rip), %rdi
        vxorps  %xmm2, %xmm2, %xmm2
        movq    rgbs(%rip), %rcx
        xorl    %edx, %edx
        .p2align 4
        .p2align 3
.L2:
        movslq  (%rdi,%rdx,4), %rax
        vmovsd  (%rsi,%rdx,8), %xmm0
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rcx, %rax
        movzbl  (%rax), %r9d
        vcvtsi2sdl %r9d, %xmm2, %xmm1
        movzbl  1(%rax), %r9d
        movzbl  2(%rax), %eax
        vfmadd231sd %xmm0, %xmm1, %xmm3
        vcvtsi2sdl %r9d, %xmm2, %xmm1
        vfmadd231sd %xmm0, %xmm1, %xmm5
        vcvtsi2sdl %eax, %xmm2, %xmm1
        vfmadd231sd %xmm0, %xmm1, %xmm4
        cmpq    $10, %rdx
        jne     .L2
        vmovq   %xmm4, %xmm4
        vunpcklpd %xmm5, %xmm3, %xmm0
        movq    %r8, %rax
        vinsertf128 $0x1, %xmm4, %ymm0, %ymm0
        vmovupd %ymm0, (%r8)
        vzeroupper
        ret
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #8 from Jan Hubicka ---
Created attachment 55178
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55178&action=edit
Preprocessed source of VerticalFiller and HorisontalFiller
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka changed:
           What    |Removed                    |Added
           Summary |GraphicsMagick resize is a |GraphicsMagick resize is a
                   |lot slower in GCC 13.1 vs  |lot slower in GCC 13.1 vs
                   |Clang 16                   |Clang 16 on Intel Raptor
                   |                           |Lake

--- Comment #7 from Jan Hubicka ---
On zen3 hardware I get

GCC:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:00 UTC]
        Started Run 1 @ 16:57:17
        Started Run 2 @ 16:58:22
        Started Run 3 @ 16:59:26

    Operation: Resizing:
        1390
        1386
        1383

    Average: 1386 Iterations Per Minute
    Deviation: 0.25%

clang16:
GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [16:54 UTC]
        Started Run 1 @ 16:51:48
        Started Run 2 @ 16:52:52
        Started Run 3 @ 16:53:56

    Operation: Resizing:
        180
        180
        180

    Average: 180 Iterations Per Minute
    Deviation: 0.00%

GCC profile:
    52.07%  VerticalFilter._omp_fn.0
    24.59%  HorizontalFilter._omp_fn.0
    11.78%  ReadCachePixels.isra.0

The clang build does not seem to have OpenMP in it, so to get comparable
runs I added OMP_THREAD_LIMIT=1.  With this I get:

GraphicsMagick 1.3.38:
    pts/graphics-magick-2.1.0 [Operation: Resizing]
    Test 1 of 1
    Estimated Trial Run Count:    3
    Estimated Time To Completion: 4 Minutes [17:17 UTC]
        Started Run 1 @ 17:14:14
        Started Run 2 @ 17:15:18
        Started Run 3 @ 17:16:22

    Operation: Resizing:
        184
        186
        186

    Average: 185 Iterations Per Minute
    Deviation: 0.62%

so the GCC build is still a bit faster.

The internal loop of VerticalFilter is:

  0.00 │4a0:  mov     0x8(%rdx),%rax
  1.33 │      vmovsd  (%rdx),%xmm1
  1.58 │      add     $0x10,%rdx
  0.00 │      sub     %r13,%rax
  4.77 │      imul    %r11,%rax
  1.01 │      add     %rcx,%rax
  0.04 │      movzbl  0x2(%r15,%rax,4),%r10d
  8.38 │      vcvtsi2sd %r10d,%xmm2,%xmm0
  2.44 │      movzbl  0x1(%r15,%rax,4),%r10d
  1.55 │      movzbl  (%r15,%rax,4),%eax
  0.00 │      vfmadd231sd %xmm0,%xmm1,%xmm4
 13.91 │      vcvtsi2sd %r10d,%xmm2,%xmm0
  1.86 │      vfmadd231sd %xmm0,%xmm1,%xmm5
 13.00 │      vcvtsi2sd %eax,%xmm2,%xmm0
  2.02 │      vfmadd231sd %xmm0,%xmm1,%xmm3
 12.54 │      cmp     %rdx,%rdi
  0.00 │      jne     4a0

HorizontalFilter:

  0.01 │520:  mov     0x8(%r8),%rdx
  0.96 │      vmovsd  (%r8),%xmm1
  1.93 │      add     $0x10,%r8
  0.50 │      sub     %r15,%rdx
  4.02 │      add     %r11,%rdx
  2.26 │      movzbl  0x2(%r14,%rdx,4),%ebx
  0.09 │      vcvtsi2sd %ebx,%xmm2,%xmm0
 10.10 │      movzbl  0x1(%r14,%rdx,4),%ebx
  0.92 │      movzbl  (%r14,%rdx,4),%edx
  1.84 │      vfmadd231sd %xmm0,%xmm1,%xmm4
  6.82 │      vcvtsi2sd %ebx,%xmm2,%xmm0
 11.15 │      vfmadd231sd %xmm0,%xmm1,%xmm3
 13.81 │      vcvtsi2sd %edx,%xmm2,%xmm0
  6.16 │      vfmadd231sd %xmm0,%xmm1,%xmm5
  8.61 │      cmp     %rsi,%r8
  1.56 │      jne     520

ReadCachePixels:

       │2e0:  mov     (%rbx,%rax,4),%edx
 83.03 │      mov     %edx,(%r12,%rax,4)
 12.34 │      inc     %rax
  0.02 │      cmp     %rsi,%rax

With Clang I get: