[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2024-07-02 Thread edison_chan_gz at hotmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #24 from edison  ---
(In reply to Hongtao Liu from comment #23)
> (In reply to edison from comment #22)
> > For 607.cactuBSSN_s, if I set preENV_GOMP_CPU_AFFINITY = 0-23 in the
> > CPU2017 .cfg, usage of all P-cores (i9-13900K) drops to 15% (the E-cores
> > are at almost 100%). If I comment it out, P-core usage rises to 60%.
> > 
> > 607.cactuBSSN_s on i9-13900K
> > gcc 14.1
> > 
> > preENV_GOMP_CPU_AFFINITY = 0-23:   60.1 (-41.7 % slower)
> > # preENV_GOMP_CPU_AFFINITY = 0-23: 103
> > 
> > For AMD Zen4(+) it may be another story so far (AMD Zen4 needs
> > preENV_GOMP_CPU_AFFINITY to make the threads run on the high-performance
> > cores first).
> 
> Because E-cores run slower than P-cores, binding each thread to a core
> prevents threads from migrating from an E-core to a P-core.

I know, but I think there is no way to bind a thread to each core in OpenMP
(CPU2017 speed) mode; only multi-job (CPU2017 rate) runs can do that.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2024-07-02 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #23 from Hongtao Liu  ---
(In reply to edison from comment #22)
> For 607.cactuBSSN_s, if I set preENV_GOMP_CPU_AFFINITY = 0-23 in the
> CPU2017 .cfg, usage of all P-cores (i9-13900K) drops to 15% (the E-cores
> are at almost 100%). If I comment it out, P-core usage rises to 60%.
> 
> 607.cactuBSSN_s on i9-13900K
> gcc 14.1
> 
> preENV_GOMP_CPU_AFFINITY = 0-23:   60.1 (-41.7 % slower)
> # preENV_GOMP_CPU_AFFINITY = 0-23: 103
> 
> For AMD Zen4(+) it may be another story so far (AMD Zen4 needs
> preENV_GOMP_CPU_AFFINITY to make the threads run on the high-performance
> cores first).

Because E-cores run slower than P-cores, binding each thread to a core
prevents threads from migrating from an E-core to a P-core.
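
To make the binding concrete, here is a minimal sketch of how a
GOMP_CPU_AFFINITY-style list such as "0-23" expands into per-thread CPU
numbers. It follows the list syntax described in the libgomp manual (single
CPUs, M-N ranges, M-N:S strides, wrapping back to the start of the list). The
helper names expand_cpu_list and cpu_for_thread are hypothetical and are not
libgomp functions; the real runtime may also take OMP_PROC_BIND and
OMP_PLACES into account.

```c
#include <stdlib.h>

/* Expand a GOMP_CPU_AFFINITY-style list ("0-23", "0 3 1-2 4-15:2") into
   individual CPU numbers.  Returns the number of entries written to out[],
   or -1 on a syntax error or when more than max entries would be needed.
   Hypothetical helper for illustration only. */
int expand_cpu_list(const char *s, int *out, int max)
{
    int n = 0;

    while (*s) {
        if (*s == ' ' || *s == ',') {       /* separators between entries */
            s++;
            continue;
        }

        char *end;
        long lo = strtol(s, &end, 10);      /* first CPU of the entry */
        if (end == s || lo < 0)
            return -1;
        long hi = lo, stride = 1;
        s = end;

        if (*s == '-') {                    /* optional range M-N */
            s++;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
            s = end;
            if (*s == ':') {                /* optional stride M-N:S */
                s++;
                stride = strtol(s, &end, 10);
                if (end == s || stride <= 0)
                    return -1;
                s = end;
            }
        }

        for (long c = lo; c <= hi; c += stride) {
            if (n >= max)
                return -1;
            out[n++] = (int)c;
        }
    }
    return n;
}

/* CPU for a given thread number: the list is consumed in order and wraps
   around once every listed CPU has been used. */
int cpu_for_thread(const int *cpus, int n, int thread)
{
    return cpus[thread % n];
}
```

With "0-23" the list has 24 entries, so thread k is pinned to logical CPU k
for k < 24, whatever kind of core that CPU is. Once pinned, the scheduler
cannot move a thread to a different core.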

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2024-07-02 Thread edison_chan_gz at hotmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

edison  changed:

   What|Removed |Added

 CC||edison_chan_gz at hotmail dot com

--- Comment #22 from edison  ---
For 607.cactuBSSN_s, if I set preENV_GOMP_CPU_AFFINITY = 0-23 in the CPU2017
.cfg, usage of all P-cores (i9-13900K) drops to 15% (the E-cores are at almost
100%). If I comment it out, P-core usage rises to 60%.

607.cactuBSSN_s on i9-13900K
gcc 14.1

preENV_GOMP_CPU_AFFINITY = 0-23:   60.1 (-41.7 % slower)
# preENV_GOMP_CPU_AFFINITY = 0-23: 103

For AMD Zen4(+) it may be another story so far (AMD Zen4 needs
preENV_GOMP_CPU_AFFINITY to make the threads run on the high-performance
cores first).
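
For reference, the "-41.7 %" figure is the relative change of the bound score
(60.1) against the unbound score (103). A minimal sketch of that arithmetic
follows; percent_change is a hypothetical helper name, and it assumes that
higher scores are better.

```c
/* Relative change of `value` with respect to `baseline`, in percent.
   A negative result means `value` is lower than the baseline, which is a
   slower result when higher scores are better. */
double percent_change(double baseline, double value)
{
    return (value - baseline) / baseline * 100.0;
}
```

percent_change(103.0, 60.1) is about -41.65. Read the other way round, the
unbound run scores roughly 1.71 times the bound one (103 / 60.1).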

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-11-25 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

liuhongt at gcc dot gnu.org changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #21 from liuhongt at gcc dot gnu.org ---
The main gap comes from OpenMP on hybrid machines.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-11-24 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #20 from Jan Hubicka  ---
On zen4 hardware I now get

GCC13 with -O3 -flto -march=native -fopenmp
2163
2161
2153

Average: 2159 Iterations Per Minute

clang 17 with -O3 -flto -march=native -fopenmp
2004
1988
1991

Average: 1994 Iterations Per Minute

trunk -O3 -flto -march=native -fopenmp
Operation: Resizing:
2126
2135
2123

Average: 2128 Iterations Per Minute

So no big changes here...

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-10-11 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #19 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:e1e127de18dbee47b88fa0ce74a1c7f4d658dc68

commit r14-4571-ge1e127de18dbee47b88fa0ce74a1c7f4d658dc68
Author: Zhang, Jun 
Date:   Fri Sep 22 23:56:37 2023 +0800

x86: set spincount 1 for x86 hybrid platform

By test, we find in hybrid platform spincount 1 is better.

Use '-march=native -Ofast -funroll-loops -flto',
results as follows:

spec2017 speed   RPL      ADL
657.xz_s         0.00%    0.50%
603.bwaves_s     10.90%   26.20%
607.cactuBSSN_s  5.50%    72.50%
619.lbm_s        2.40%    2.50%
621.wrf_s        -7.70%   2.40%
627.cam4_s       0.50%    0.70%
628.pop2_s       48.20%   153.00%
638.imagick_s    -0.10%   0.20%
644.nab_s        2.30%    1.40%
649.fotonik3d_s  8.00%    13.80%
654.roms_s       1.20%    1.10%
Geomean-int      0.00%    0.50%
Geomean-fp       6.30%    21.10%
Geomean-all      5.70%    19.10%

omp2012          RPL      ADL
350.md           -1.81%   -1.75%
351.bwaves       7.72%    12.50%
352.nab          14.63%   19.71%
357.bt331        -0.20%   1.77%
358.botsalgn     0.00%    0.00%
359.botsspar     0.00%    0.65%
360.ilbdc        0.00%    0.25%
362.fma3d        2.66%    -0.51%
363.swim         10.44%   0.00%
367.imagick      0.00%    0.12%
370.mgrid331     2.49%    25.56%
371.applu331     1.06%    4.22%
372.smithwa      0.74%    3.34%
376.kdtree       10.67%   16.03%
GEOMEAN          3.34%    5.53%

include/ChangeLog:

PR target/109812
* spincount.h: New file.

libgomp/ChangeLog:

* env.c (initialize_env): Use do_adjust_default_spincount.
* config/linux/x86/spincount.h: New file.
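
As an illustration of the idea in the commit above, the sketch below shows
one way to detect an x86 hybrid part and pick a spin count of 1 for it. This
is not the code from the commit: cpu_is_hybrid and pick_spincount are
hypothetical names. The only facts relied on are that CPUID leaf 7
(subleaf 0) reports the hybrid flag in EDX bit 15 and that GCC's <cpuid.h>
provides __get_cpuid_count.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID reports a hybrid part, 0 if it does not, and -1 when
   the answer is unknown (non-x86 build or CPUID leaf 7 unsupported). */
int cpu_is_hybrid(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count returns 0 when the requested leaf is unsupported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    return (edx >> 15) & 1;     /* CPUID.(EAX=7,ECX=0):EDX[15] = Hybrid */
#else
    return -1;
#endif
}

/* Hypothetical policy mirroring the commit title: spin count 1 on hybrid
   parts, the caller's default everywhere else. */
long long pick_spincount(long long default_spincount)
{
    return cpu_is_hybrid() == 1 ? 1 : default_spincount;
}
```

The spin count can also be set by hand through libgomp's GOMP_SPINCOUNT
environment variable, which is convenient for experiments on builds without
this change.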

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-21 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #18 from Uroš Bizjak  ---
One interesting observation:

clang is able to do this:

  0.09 │ │  vmovddup     -0x8(%rdx,%rsi,1),%xmm3
  ...
  0.11 │ │  vfmadd231sd  %xmm2,%xmm3,%xmm1
  ...
  0.74 │ │  vfmadd231pd  %xmm2,%xmm3,%xmm0

It figures out that duplicated V2DFmode value in %xmm3 can also be accessed in
the same register as DFmode value.

OTOH, current gcc does:

vmovsd       (%rsi,%rax,8), %xmm1
...
vmovddup     %xmm1, %xmm4
...
vfmadd231pd  %xmm4, %xmm0, %xmm2
...
vfmadd231sd  %xmm1, %xmm0, %xmm3

The above code needs two registers.
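
The observation can be mirrored at source level with GCC/Clang vector
extensions: lane 0 of the duplicated vector is the scalar weight itself, so
the scalar multiply-add can read it from there instead of keeping a second
copy. This is only a sketch of the equivalence, with hypothetical names
(v2df, accumulate). Which registers a compiler actually allocates for it is
not guaranteed.

```c
/* Two doubles in one 128-bit vector (GCC/Clang vector extension). */
typedef double v2df __attribute__((vector_size(16)));

struct drgb { double r, g, b; };

/* Accumulate one weighted pixel: r and g as a packed pair, b as a scalar. */
void accumulate(struct drgb *acc, unsigned char r, unsigned char g,
                unsigned char b, double w)
{
    v2df dup = { w, w };                    /* the duplicated weight */
    v2df rg  = { (double)r, (double)g };
    v2df sum = { acc->r, acc->g };

    sum += rg * dup;                        /* packed multiply-add for r, g */
    acc->r = sum[0];
    acc->g = sum[1];

    /* Scalar multiply-add for b, reading the weight from lane 0 of the
       duplicated vector rather than from a separate scalar copy. */
    acc->b += (double)b * dup[0];
}
```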

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-01 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #17 from Jan Hubicka  ---
I was also thinking of DCE. It looks like a plausible idea.  It may lead to a
surprise where you store the same undefined variable to two places and later
compare them for equality, but that is undefined anyway.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-01 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #16 from Jakub Jelinek  ---
Shouldn't we DCE something = x_N(D); stores when x is a VAR_DECL, at least
provided
something can't trap?  I mean, the previous content is one of the possible
uninitialized values.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-06-01 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #15 from Martin Jambor  ---
Oh, because I missed the -DOPACITY in the second command line.  The reason
SRA creates the replacement is total scalarization :-/

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #14 from Martin Jambor  ---
(In reply to Jan Hubicka from comment #13)
> The only difference between slp vectorization is:
> 
> -  # _68 = PHI <_5(3)>
> -  # _67 = PHI <_11(3)>
> -  # _66 = PHI <_16(3)>
> -  .r = _68;
> -  .g = _67;
> -  .b = _66;
> +  # _70 = PHI <_5(3)>
> +  # _69 = PHI <_11(3)>
> +  # _68 = PHI <_16(3)>
> +  .r = _70;
> +  .g = _69;
> +  .b = _68;
> +  .o = r$o_33(D);
> 
> so SRA invents r$o_33(D) even if that variable is undefined.

Is this the testcase from comment #10 ?  I don't see r$o in my dumps.

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka  changed:

   What|Removed |Added

 CC||rguenther at suse dot de
   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062

--- Comment #13 from Jan Hubicka  ---
The only difference between slp vectorization is:

-  # _68 = PHI <_5(3)>
-  # _67 = PHI <_11(3)>
-  # _66 = PHI <_16(3)>
-  .r = _68;
-  .g = _67;
-  .b = _66;
+  # _70 = PHI <_5(3)>
+  # _69 = PHI <_11(3)>
+  # _68 = PHI <_16(3)>
+  .r = _70;
+  .g = _69;
+  .b = _68;
+  .o = r$o_33(D);

so SRA invents r$o_33(D) even if that variable is undefined.

SLP vectorizer then sees it as interleaving stores:

-t.c:19:16: note:   _1 = rgbs[i_35].r;
-t.c:19:16: note:   _7 = rgbs[i_35].g;
-t.c:19:16: note:   _12 = rgbs[i_35].b;
-t.c:19:16: note:   Detected interleaving store of size 3
-t.c:19:16: note:   .r = _68;
-t.c:19:16: note:   .g = _67;
-t.c:19:16: note:   .b = _66;
+t.c:19:16: note:   _1 = rgbs[i_37].r;
+t.c:19:16: note:   _7 = rgbs[i_37].g;
+t.c:19:16: note:   _12 = rgbs[i_37].b;
+t.c:19:16: note:   Detected interleaving store of size 4
+t.c:19:16: note:   .r = _70;
+t.c:19:16: note:   .g = _69;
+t.c:19:16: note:   .b = _68;
+t.c:19:16: note:   .o = r$o_33(D);

In the first case it first tries to vectorize as a vector of 3 doubles and fails:

-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16: note: .b = _66;
-t.c:19:16: note:   starting SLP discovery for node 0x2cb4fe8
-t.c:19:16: note:   Build SLP for .r = _68;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for .g = _67;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   Build SLP for .b = _66;
-t.c:19:16: note:   get vectype for scalar type (group size 3): double
-t.c:19:16: note:   vectype: vector(2) double
-t.c:19:16: note:   nunits = 2
-t.c:19:16: missed:   Build SLP failed: unrolling required in basic block SLP
-t.c:19:16: note:   SLP discovery for node 0x2cb4fe8 failed

And later it tries to vectorize first 2 items:

-t.c:19:16: note:   Splitting SLP group at stmt 2
-t.c:19:16: note:   Split group into 2 and 1
-t.c:19:16: note:   Starting SLP discovery for
-t.c:19:16: note: .r = _68;
-t.c:19:16: note: .g = _67;
-t.c:19:16

... and after a lot of blablabla succeeds.

If the opacity field is present we start with a vector of size 4:
+t.c:19:16: note: .r = _70;
+t.c:19:16: note: .g = _69;
+t.c:19:16: note: .b = _68;
+t.c:19:16: note: .o = r$o_33(D);


+t.c:19:16: note:   vect_is_simple_use: operand _70 = PHI <_5(3)>, type of def:
internal
+t.c:19:16: note:   vect_is_simple_use: operand _69 = PHI <_11(3)>, type of
def: internal
+t.c:19:16: note:   vect_is_simple_use: operand _68 = PHI <_16(3)>, type of
def: internal
+t.c:19:16: note:   vect_is_simple_use: operand r$o_33(D), type of def:
external
+t.c:19:16: missed:   treating operand as external
+t.c:19:16: note:   SLP discovery for node 0x2e80058 succeeded
+t.c:19:16: note:   SLP size 1 vs. limit 23.
+t.c:19:16: note:   Final SLP tree for instance 0x2def840:
+t.c:19:16: note:   node 0x2e80058 (max_nunits=4, refcnt=2) vector(4) double
+t.c:19:16: note:   op template: .r = _70;
+t.c:19:16: note:   stmt 0 .r = _70;
+t.c:19:16: note:   stmt 1 .g = _69;
+t.c:19:16: note:   stmt 2 .b = _68;
+t.c:19:16: note:   stmt 3 .o = r$o_33(D);
+t.c:19:16: note:   children 0x2e800d8
+t.c:19:16: note:   node (external) 0x2e800d8 (max_nunits=1, refcnt=1)
+t.c:19:16: note:   { _70, _69, _68, r$o_33(D) }

So it seems to succeed vectorizing with 4 entries but it does so for the single
return statement:

   [local count: 1063004409]:
  # i_37 = PHI 
  # r$r_40 = PHI <_5(5), r$r_25(D)(2)>
  # r$g_42 = PHI <_11(5), r$g_26(D)(2)>
  # r$b_44 = PHI <_16(5), r$b_27(D)(2)>
  # ivtmp_67 = PHI 
  _1 = rgbs[i_37].r;
  _2 = (int) _1;
  _3 = (double) _2;
  _4 = _3 * w_21(D);
  _5 = _4 + r$r_40;
  _7 = rgbs[i_37].g;
  _8 = (int) _7;
  _9 = (double) _8;
  _10 = _9 * w_21(D);
  _11 = _10 + r$g_42;
  _12 = rgbs[i_37].b;
  _13 = (int) _12;
  _14 = (double) _13;
  _15 = _14 * w_21(D);
  _16 = _15 + r$b_44;
  i_22 = i_37 + 1;
  ivtmp_66 = ivtmp_67 - 1;
  if (ivtmp_66 != 0)
goto ; [99.00%]
  else
goto ; [1.00%]

   [local count: 1052374367]:
  goto ; [100.00%]

   [local count: 10737416]:
  # _70 = PHI <_5(3)>
  # _69 = PHI <_11(3)>
  # _68 = PHI <_16(3)>
  _65 = 

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-31 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #12 from Jan Hubicka  ---
> /home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
> function `main':
> :(.text.startup+0x1): undefined reference to `GMCommand'

I wonder if your plugin is configured correctly.  Can you try to build
with -flto -fuse-linker-plugin?

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-29 Thread zhangjungcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #11 from jun zhang  ---
Hello Hubicka and Artem,
I am trying to reproduce this issue on Raptor Lake.
With -fopenmp -O3 -flto I hit the following error,
but with -fopenmp -O3 and no -flto the build succeeds.
Could you help me?

libtool: link: /home/sdp/jun/gcc0/install/bin/gcc -fopenmp -O3 -flto
-march=native -Wall -o utilities/gm utilities/gm.o
-L/home/sdp/jun/omp/Ofast/pts_g_gomp/install/.phoronix-test-suite/installed-tests/pts/graphics-magick-2.1.0/gm_/lib
magick/.libs/libGraphicsMagick.a -lfreetype -ljbig -ltiff -ljpeg
-lXext -lSM -lICE -lX11 -llzma -lbz2 -lz -lzstd -lm -lpthread -fopenmp
/home/sdp/jun/btl0/install/bin/ld: /tmp/ccnX75zI.ltrans0.ltrans.o: in
function `main':
:(.text.startup+0x1): undefined reference to `GMCommand'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:6411: utilities/gm] Error 1
make[1]: Leaving directory



[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #10 from Jan Hubicka  ---
This is a benchmarkable version of the simplified testcase:

jan@localhost:/tmp> cat t.c
#define N 1000
struct rgb {unsigned char r,g,b;} rgbs[N];
int *addr;
struct drgb {double r,g,b;
#ifdef OPACITY
  double o;
#endif
};

struct drgb sum(double w)
{
  struct drgb r;
  for (int i = 0; i < N; i++)
    {
      r.r += rgbs[i].r * w;
      r.g += rgbs[i].g * w;
      r.b += rgbs[i].b * w;
    }
  return r;
}
jan@localhost:/tmp> cat q.c
struct drgb {double r,g,b;
#ifdef OPACITY
  double o;
#endif
};
struct drgb sum(double w);
int
main()
{
  for (int i = 0; i < 1000; i++)
    sum(i);
}


jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g ; objdump -d a.out | grep vfmadd231pd  ; perf stat ./a.out
  40119d:   c4 e2 d9 b8 d1  vfmadd231pd %xmm1,%xmm4,%xmm2

 Performance counter stats for './a.out':

         12,148.04 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               736      page-faults:u             #   60.586 /sec
    50,018,421,148      cycles:u                  #    4.117 GHz
           220,502      stalled-cycles-frontend:u #    0.00% frontend cycles idle
    39,950,154,369      stalled-cycles-backend:u  #   79.87% backend cycles idle
   120,000,191,713      instructions:u            #    2.40  insn per cycle
                                                  #    0.33  stalled cycles per insn
    10,000,048,918      branches:u                #  823.182 M/sec
             7,959      branch-misses:u           #    0.00% of all branches

  12.149466078 seconds time elapsed

  12.149084000 seconds user
   0.0 seconds sys


jan@localhost:/tmp> gcc t.c q.c -march=native -O3 -g -DOPACITY ; objdump -d a.out | grep vfmadd231pd  ; perf stat ./a.out

 Performance counter stats for './a.out':

         12,141.11 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
               735      page-faults:u             #   60.538 /sec
    50,018,839,129      cycles:u                  #    4.120 GHz
           185,034      stalled-cycles-frontend:u #    0.00% frontend cycles idle
    29,963,999,798      stalled-cycles-backend:u  #   59.91% backend cycles idle
   120,000,191,729      instructions:u            #    2.40  insn per cycle
                                                  #    0.25  stalled cycles per insn
    10,000,048,913      branches:u                #  823.652 M/sec
             7,311      branch-misses:u           #    0.00% of all branches

  12.142252354 seconds time elapsed

  12.138237000 seconds user
   0.00400 seconds sys


So on zen2 hardware I get the same performance with both.  It may be
interesting to test it on Raptor Lake.
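
The testcase above accumulates into `r` without initialising it, which the
later comments in this thread describe as undefined. For anyone who wants to
check results as well as timings, here is a sketch of a variant with a
zero-initialised accumulator. It is an assumption of this note rather than
something posted in the thread: sum_pixels and its parameters are
hypothetical names, and whether this variant shows the same vectorization
behaviour has not been tested.

```c
#include <stddef.h>

struct rgb  { unsigned char r, g, b; };
struct drgb { double r, g, b, o; };

/* Weighted sum of n pixels.  The accumulator starts at zero, so every
   field, including the otherwise unused o, has a defined value. */
struct drgb sum_pixels(const struct rgb *px, size_t n, double w)
{
    struct drgb acc = { 0.0, 0.0, 0.0, 0.0 };

    for (size_t i = 0; i < n; i++) {
        acc.r += px[i].r * w;
        acc.g += px[i].g * w;
        acc.b += px[i].b * w;
    }
    return acc;
}
```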

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #9 from Jan Hubicka  ---
Oddly enough simplified version of the loop SLP vectorizes for me:
struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b;};

struct drgb sum()
{
  struct drgb r;
  for (int i = 0; i < 10; i++)
    {
      int j = addr[i];
      double w = weights[i];
      r.r += rgbs[j].r * w;
      r.g += rgbs[j].g * w;
      r.b += rgbs[j].b * w;
    }
  return r;
}
I get:
.L2:
        movslq  (%r9,%rdx,4), %rax
        vmovsd  (%r8,%rdx,8), %xmm1
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rsi, %rax
        movzbl  (%rax), %ecx
        vmovddup        %xmm1, %xmm4
        vmovd   %ecx, %xmm0
        movzbl  1(%rax), %ecx
        movzbl  2(%rax), %eax
        vpinsrd $1, %ecx, %xmm0, %xmm0
        vcvtdq2pd       %xmm0, %xmm0
        vfmadd231pd     %xmm4, %xmm0, %xmm2
        vcvtsi2sdl      %eax, %xmm5, %xmm0
        vfmadd231sd     %xmm1, %xmm0, %xmm3
        cmpq    $10, %rdx
        jne     .L2


I think the actual loop is:
  [local count: 44202554]:
  _106 = _262->pixel;
  _109 = *source_231(D).columns;

   [local count: 401841405]:
  # pixel$green_332 = PHI <_124(89), pixel$green_265(53)>
  # i_357 = PHI 
  # pixel$red_371 = PHI <_119(89), pixel$red_263(53)>
  # pixel$blue_377 = PHI <_129(89), pixel$blue_267(53)>
  i.51_102 = (long unsigned int) i_357;
  _103 = i.51_102 * 16;
  _104 = _262 + _103;
  _105 = _104->pixel;
  _107 = _105 - _106;
  _108 = (long unsigned int) _107;
  _110 = _108 * _109;
  _112 = _110 + _621;
  weight_297 = _104->weight;
  _113 = _112 * 4;
  _114 = _276 + _113;
  _115 = _114->red;
  _116 = (int) _115;
  _117 = (double) _116;
  _118 = _117 * weight_297;
  _119 = _118 + pixel$red_371;
  _120 = _114->green;
 _121 = (int) _120;
  _122 = (double) _121;
  _123 = _122 * weight_297;
  _124 = _123 + pixel$green_332;
  _125 = _114->blue;
  _126 = (int) _125;
  _127 = (double) _126;
  _128 = _127 * weight_297;
  _129 = _128 + pixel$blue_377;
  i_298 = i_357 + 1;
  if (n_195 > i_298)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 44202554]:
  # _607 = PHI <_124(54)>
  # _606 = PHI <_119(54)>
  # _605 = PHI <_129(54)>
  goto ; [100.00%]

   [local count: 357638851]:
  goto ; [100.00%]


and SLP vectorizer seems to claim:
../magick/resize.c:1284:52: note:   _125 = _114->blue;
../magick/resize.c:1284:52: note:   _120 = _114->green;
../magick/resize.c:1284:52: note:   _115 = _114->red;
../magick/resize.c:1284:52: missed:   not consecutive access weight_297 =
_104->weight;
../magick/resize.c:1284:52: missed:   not consecutive access _105 =
_104->pixel;
../magick/resize.c:1284:52: missed:   not consecutive access _134->red =
iftmp.57_207;
../magick/resize.c:1284:52: missed:   not consecutive access _134->green =
iftmp.60_208;
../magick/resize.c:1284:52: missed:   not consecutive access _134->blue =
iftmp.63_209;
../magick/resize.c:1284:52: missed:   not consecutive access _134->opacity = 0;
../magick/resize.c:1284:52: missed:   not consecutive access _63 =
*source_231(D).columns;
../magick/resize.c:1284:52: missed:   not consecutive access _60 = _262->pixel;

Not sure if that is related to the real testcase:


struct rgb {unsigned char r,g,b;} *rgbs;
int *addr;
double *weights;
struct drgb {double r,g,b,o;};

struct drgb sum()
{
  struct drgb r;
  for (int i = 0; i < 10; i++)
    {
      int j = addr[i];
      double w = weights[i];
      r.r += rgbs[j].r * w;
      r.g += rgbs[j].g * w;
      r.b += rgbs[j].b * w;
    }
  return r;
}

makes us miss the vectorization even though nothing uses drgb->o:

sum:
.LFB0:
        .cfi_startproc
        movq    %rdi, %r8
        movq    weights(%rip), %rsi
        movq    addr(%rip), %rdi
        vxorps  %xmm2, %xmm2, %xmm2
        movq    rgbs(%rip), %rcx
        xorl    %edx, %edx
        .p2align 4
        .p2align 3
.L2:
        movslq  (%rdi,%rdx,4), %rax
        vmovsd  (%rsi,%rdx,8), %xmm0
        incq    %rdx
        leaq    (%rax,%rax,2), %rax
        addq    %rcx, %rax
        movzbl  (%rax), %r9d
        vcvtsi2sdl      %r9d, %xmm2, %xmm1
        movzbl  1(%rax), %r9d
        movzbl  2(%rax), %eax
        vfmadd231sd     %xmm0, %xmm1, %xmm3
        vcvtsi2sdl      %r9d, %xmm2, %xmm1
        vfmadd231sd     %xmm0, %xmm1, %xmm5
        vcvtsi2sdl      %eax, %xmm2, %xmm1
        vfmadd231sd     %xmm0, %xmm1, %xmm4
        cmpq    $10, %rdx
        jne     .L2
        vmovq   %xmm4, %xmm4
        vunpcklpd       %xmm5, %xmm3, %xmm0
        movq    %r8, %rax
        vinsertf128     $0x1, %xmm4, %ymm0, %ymm0
        vmovupd %ymm0, (%r8)
        vzeroupper
        ret

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

--- Comment #8 from Jan Hubicka  ---
Created attachment 55178
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55178&action=edit
Preprocessed source of VerticalFilter and HorizontalFilter

[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake

2023-05-28 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812

Jan Hubicka  changed:

   What|Removed |Added

Summary|GraphicsMagick resize is a  |GraphicsMagick resize is a
   |lot slower in GCC 13.1 vs   |lot slower in GCC 13.1 vs
   |Clang 16|Clang 16 on Intel Raptor
   ||Lake

--- Comment #7 from Jan Hubicka  ---
On zen3 hardware I get GCC:

GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count:3 
Estimated Time To Completion: 4 Minutes [17:00 UTC] 
Started Run 1 @ 16:57:17
Started Run 2 @ 16:58:22
Started Run 3 @ 16:59:26

Operation: Resizing:
1390
1386
1383

Average: 1386 Iterations Per Minute
Deviation: 0.25%

clang16:

GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count:3
Estimated Time To Completion: 4 Minutes [16:54 UTC]
Started Run 1 @ 16:51:48
Started Run 2 @ 16:52:52
Started Run 3 @ 16:53:56

Operation: Resizing:
180
180
180

Average: 180 Iterations Per Minute
Deviation: 0.00%


GCC profile:
  52.07%  VerticalFilter._omp_fn.0  
  24.59%  HorizontalFilter._omp_fn.0
  11.78%  ReadCachePixels.isra.0

The Clang build does not seem to use OpenMP, so to get comparable runs I added
OMP_THREAD_LIMIT=1

With this I get:
GraphicsMagick 1.3.38:
pts/graphics-magick-2.1.0 [Operation: Resizing]
Test 1 of 1
Estimated Trial Run Count:3
Estimated Time To Completion: 4 Minutes [17:17 UTC]
Started Run 1 @ 17:14:14
Started Run 2 @ 17:15:18
Started Run 3 @ 17:16:22

Operation: Resizing:
184
186
186

Average: 185 Iterations Per Minute
Deviation: 0.62%

so the GCC build is still a bit faster. The inner loop of VerticalFilter is:
  0.00 │4a0:┌─→mov          0x8(%rdx),%rax
  1.33 │    │  vmovsd       (%rdx),%xmm1
  1.58 │    │  add          $0x10,%rdx
  0.00 │    │  sub          %r13,%rax
  4.77 │    │  imul         %r11,%rax
  1.01 │    │  add          %rcx,%rax
  0.04 │    │  movzbl       0x2(%r15,%rax,4),%r10d
  8.38 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0
  2.44 │    │  movzbl       0x1(%r15,%rax,4),%r10d
  1.55 │    │  movzbl       (%r15,%rax,4),%eax
  0.00 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4
 13.91 │    │  vcvtsi2sd    %r10d,%xmm2,%xmm0
  1.86 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5
 13.00 │    │  vcvtsi2sd    %eax,%xmm2,%xmm0
  2.02 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3
 12.54 │    ├──cmp          %rdx,%rdi
  0.00 │    └──jne          4a0

HorizontalFilter:
  0.01 │520:┌─→mov          0x8(%r8),%rdx
  0.96 │    │  vmovsd       (%r8),%xmm1
  1.93 │    │  add          $0x10,%r8
  0.50 │    │  sub          %r15,%rdx
  4.02 │    │  add          %r11,%rdx
  2.26 │    │  movzbl       0x2(%r14,%rdx,4),%ebx
  0.09 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0
 10.10 │    │  movzbl       0x1(%r14,%rdx,4),%ebx
  0.92 │    │  movzbl       (%r14,%rdx,4),%edx
  1.84 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm4
  6.82 │    │  vcvtsi2sd    %ebx,%xmm2,%xmm0
 11.15 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm3
 13.81 │    │  vcvtsi2sd    %edx,%xmm2,%xmm0
  6.16 │    │  vfmadd231sd  %xmm0,%xmm1,%xmm5
  8.61 │    ├──cmp          %rsi,%r8
  1.56 │    └──jne          520

ReadCachePixels:
       │2e0:┌─→mov          (%rbx,%rax,4),%edx
 83.03 │    │  mov          %edx,(%r12,%rax,4)
 12.34 │    │  inc          %rax
  0.02 │    ├──cmp          %rsi,%rax

With Clang I get: