https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
--- Comment #2 from Jan Hubicka <hubicka at ucw dot cz> ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
>
> --- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Because store data races are allowed with -Ofast masked stores are not used so
> we instead get
>
>   vect__ifc__80.24_114 = VEC_COND_EXPR <mask__58.15_104, vect__45.20_109, vect__ifc__78.23_113>;
>   _ifc__80 = _58 ? _45 : _ifc__78;
>   MEM <vector(8) double> [(double *)vectp_c.25_116] = vect__ifc__80.24_114;
>
> which somehow is later turned into masked stores?  In fact we expand from
>
>   vect__43.18_107 = MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1];
>   vect__ifc__78.23_113 = MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1];
>   _97 = .COND_FMA (mask__58.15_104, vect_pretmp_36.14_102, vect_pretmp_36.14_102, vect__43.18_107, vect__ifc__78.23_113);
>   MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1] = _97;
>   vect__38.29_121 = MEM <vector(8) double> [(double *)&c + ivtmp.75_134 * 1];
>   vect__39.32_124 = MEM <vector(8) double> [(double *)&e + ivtmp.75_134 * 1];
>   _98 = vect__35.11_99 >= { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
>   _100 = .COND_FMA (_98, vect_pretmp_36.14_102, vect__39.32_124, vect__38.29_121, vect__43.18_107);
>   MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1] = _100;
>
> the vectorizer has optimize_mask_stores () which is supposed to replace
> .MASK_STORE with
>
>   if (mask != { 0, 0, 0 ... })
>     <code depending on the mask store>
>
> and thus optimize the mask == 0 case.  But that only triggers for .MASK_STORE.
> You can see this when you force .MASK_STORE via -O3 -ffast-math (without
> -fallow-store-data-races) you get this effect:

Yep, -fno-allow-store-data-races fixes the problem

jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast -march=native -lm
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

          37,289.50 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                431      page-faults:u             #   11.558 /sec
    137,411,365,539      cycles:u                  #    3.685 GHz                      (83.33%)
        991,673,172      stalled-cycles-frontend:u #    0.72% frontend cycles idle     (83.34%)
            506,793      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.34%)
      3,400,375,204      instructions:u            #    0.02  insn per cycle
                                                   #    0.29  stalled cycles per insn  (83.34%)
        200,235,802      branches:u                #    5.370 M/sec                    (83.34%)
             73,962      branch-misses:u           #    0.04% of all branches          (83.33%)

       37.305121352 seconds time elapsed

       37.285467000 seconds user
        0.000000000 seconds sys

jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast -march=native -lm -fno-allow-store-data-races
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

             667.95 msec task-clock:u              #    0.999 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                367      page-faults:u             #  549.439 /sec
      2,434,906,671      cycles:u                  #    3.645 GHz                      (83.24%)
             19,681      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.24%)
             12,495      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.24%)
      2,793,482,139      instructions:u            #    1.15  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.24%)
        598,879,536      branches:u                #  896.588 M/sec                    (83.78%)
             50,649      branch-misses:u           #    0.01% of all branches          (83.26%)

        0.668807640 seconds time elapsed

        0.668660000 seconds user
         0.000000000 seconds sys

So I suppose it is L1 thrashing.  l1-dcache-loads goes up from 2,000,413,936 to
11,044,576,207

I suppose it would be too fancy for vectorizer to work out the overall memory
consumption here :)  It sort of should have all the info...

Honza