https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
--- Comment #2 from Jan Hubicka <hubicka at ucw dot cz> ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107715
>
> --- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
> Because store data races are allowed with -Ofast masked stores are not used so
> we instead get
>
>   vect__ifc__80.24_114 = VEC_COND_EXPR <mask__58.15_104, vect__45.20_109, vect__ifc__78.23_113>;
>   _ifc__80 = _58 ? _45 : _ifc__78;
>   MEM <vector(8) double> [(double *)vectp_c.25_116] = vect__ifc__80.24_114;
>
> which somehow is later turned into masked stores?  In fact we expand from
>
>   vect__43.18_107 = MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1];
>   vect__ifc__78.23_113 = MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1];
>   _97 = .COND_FMA (mask__58.15_104, vect_pretmp_36.14_102, vect_pretmp_36.14_102, vect__43.18_107, vect__ifc__78.23_113);
>   MEM <vector(8) double> [(double *)&c + 8B + ivtmp.75_134 * 1] = _97;
>   vect__38.29_121 = MEM <vector(8) double> [(double *)&c + ivtmp.75_134 * 1];
>   vect__39.32_124 = MEM <vector(8) double> [(double *)&e + ivtmp.75_134 * 1];
>   _98 = vect__35.11_99 >= { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
>   _100 = .COND_FMA (_98, vect_pretmp_36.14_102, vect__39.32_124, vect__38.29_121, vect__43.18_107);
>   MEM <vector(8) double> [(double *)&a + ivtmp.75_134 * 1] = _100;
>
> the vectorizer has optimize_mask_stores () which is supposed to replace
> .MASK_STORE with
>
>   if (mask != { 0, 0, 0 ... })
>     <code depending on the mask store>
>
> and thus optimize the mask == 0 case.  But that only triggers for .MASK_STORE.
> You can see this when you force .MASK_STORE via -O3 -ffast-math (without
> -fallow-store-data-races) you get this effect:

Yep, -fno-allow-store-data-races fixes the problem

jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast -march=native -lm
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

          37,289.50 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                431      page-faults:u             #   11.558 /sec
    137,411,365,539      cycles:u                  #    3.685 GHz                      (83.33%)
        991,673,172      stalled-cycles-frontend:u #    0.72% frontend cycles idle     (83.34%)
            506,793      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.34%)
      3,400,375,204      instructions:u            #    0.02  insn per cycle
                                                   #    0.29  stalled cycles per insn  (83.34%)
        200,235,802      branches:u                #    5.370 M/sec                    (83.34%)
             73,962      branch-misses:u           #    0.04% of all branches          (83.33%)

       37.305121352 seconds time elapsed

       37.285467000 seconds user
        0.000000000 seconds sys

jh@alberti:~/tsvc/bin> /home/jh/trunk-install/bin/gcc test.c -Ofast -march=native -lm -fno-allow-store-data-races
jh@alberti:~/tsvc/bin> perf stat ./a.out

 Performance counter stats for './a.out':

             667.95 msec task-clock:u              #    0.999 CPUs utilized
                  0      context-switches:u        #    0.000 /sec
                  0      cpu-migrations:u          #    0.000 /sec
                367      page-faults:u             #  549.439 /sec
      2,434,906,671      cycles:u                  #    3.645 GHz                      (83.24%)
             19,681      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.24%)
             12,495      stalled-cycles-backend:u  #    0.00% backend cycles idle      (83.24%)
      2,793,482,139      instructions:u            #    1.15  insn per cycle
                                                   #    0.00  stalled cycles per insn  (83.24%)
        598,879,536      branches:u                #  896.588 M/sec                    (83.78%)
             50,649      branch-misses:u           #    0.01% of all branches          (83.26%)

        0.668807640 seconds time elapsed

        0.668660000 seconds user
         0.000000000 seconds sys

So I suppose it is L1 thrashing.  l1-dcache-loads goes up from 2,000,413,936 to
11,044,576,207

I suppose it would be too fancy for vectorizer to work out the overall memory
consumption here :)  It sort of should have all the info...

Honza