[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
           Keywords|            |missed-optimization
           Priority|P3          |P2
--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-30 13:56:46 UTC ---
__builtin_assume_aligned is now supported on the trunk.  Leaving this open
even there so that the default cost model is adjusted.
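As a usage sketch (the function and array names below are illustrative, not from the report): with the builtin on trunk, a caller can promise 16-byte alignment so the vectorizer may emit aligned stores without the runtime alignment check.

```c
#include <stddef.h>

/* Illustrative loop, not from the report: the assignments tell GCC that
   out and in are 16-byte aligned, so the vectorized loop can use aligned
   (movapd-style) accesses instead of movlpd/movhpd pairs.  Passing a
   pointer that is not actually 16-byte aligned would be undefined
   behavior.  */
void
scale (double *out, const double *in, double f, int len)
{
  out = __builtin_assume_aligned (out, 16);
  in = __builtin_assume_aligned (in, 16);
  for (int i = 0; i < len; i++)
    out[i] = in[i] * f;
}
```

The builtin returns its argument, so the `ptr = __builtin_assume_aligned (ptr, 16);` form costs nothing at run time; it only records alignment for the optimizers.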
--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-21 12:43:04 UTC ---
Created attachment 24571
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24571
gcc47-builtins.patch

Prerequisite patch for the __builtin_assume_aligned patch.  These are just
random things I've noticed when working on that patch.
--- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-21 12:50:11 UTC ---
Created attachment 24572
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24572
gcc47-assume-aligned.patch

Current version of the __builtin_assume_aligned support, on top of the
previous patch.  Still unfinished areas:
1) the vectorizer/data-refs unfortunately ignore the computed alignment;
   they should use get_pointer_alignment/get_object_alignment
2) it would be nice if the builtin were special-cased in the C/C++ FEs and
   acted like
     template <typename T> T __builtin_assume_aligned (T, size_t, ...);
   (for pointer types only) instead of
     void *__builtin_assume_aligned (const void *, size_t, ...);
3) testcases need to be added
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-21 12:50:47 UTC ---
Of course something like #c4 is highly desirable independently of this.
Ira Rosen <irar at il dot ibm.com> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
                 CC|            |irar at il dot ibm.com

--- Comment #4 from Ira Rosen <irar at il dot ibm.com> 2011-06-19 08:25:05 UTC ---
We can try to fix this with the cost model and an additional heuristic in
vect_enhance_data_refs_alignment.  Currently we decide not to do versioning
for alignment, because all the accesses are supported anyway.  Maybe
something like the following condition for versioning could help (when all
the alignment values are unknown):

  if (number_of_loads * cost_of_misaligned_load
      + number_of_stores * cost_of_misaligned_store
      + approx_vector_iteration_cost_without_drs
      > approx_scalar_iteration_cost * vectorization_factor)
    do_versioning = true;

Ira
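A sketch of this heuristic in plain C, with illustrative placeholder names rather than real tree-vectorizer fields: versioning fires when a misaligned vector iteration would cost more than vectorization_factor scalar iterations.

```c
#include <stdbool.h>

/* Illustrative sketch of comment #4's proposed check; all identifiers
   are placeholders, not vectorizer internals.  Version the loop for
   alignment when the misalignment penalties plus the rest of the vector
   body outweigh the equivalent number of scalar iterations.  */
static bool
should_version_for_alignment (int n_loads, int n_stores,
                              int misaligned_load_cost,
                              int misaligned_store_cost,
                              int vector_iter_cost_without_drs,
                              int scalar_iter_cost,
                              int vectorization_factor)
{
  int misaligned_vector_iter_cost
    = n_loads * misaligned_load_cost
      + n_stores * misaligned_store_cost
      + vector_iter_cost_without_drs;
  return misaligned_vector_iter_cost
         > scalar_iter_cost * vectorization_factor;
}
```

In the testcase from this bug the misaligned movlpd/movhpd stores are what tip the balance: with three such stores per iteration, the misaligned vector body can exceed the scalar cost, so the aligned-versioned loop plus scalar fallback would win.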
--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-17 08:15:38 UTC ---
ICC apparently has __assume_aligned (ptr, align) for this, and also
#pragma vector {aligned,unaligned,always,nontemporal}.
For alignment, I'd say having both hard alignment and likely alignment
hints somewhere in the code would be better than a ptr_align attribute on
the arguments, so something like
  __builtin_assume_aligned (ptr, align[, misalign])
and
  __builtin_likely_aligned (ptr, align[, misalign])
would be helpful.  The question is whether they shouldn't return the
pointer again and let the user write it in the form
  ptr = __builtin_assume_aligned (ptr, 16);
which would be optimized away when we compute the alignment.
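A sketch of the proposed return-the-pointer form with the optional misalign argument (the wrapper name and values are illustrative; the three-argument shape matches the signature the builtin eventually got, where `__builtin_assume_aligned (p, align, misalign)` promises `(uintptr_t) p % align == misalign`):

```c
#include <stdint.h>

/* Illustrative wrapper: promise that p is 8 bytes past a 32-byte
   boundary, i.e. (uintptr_t) p % 32 == 8.  The promise must actually
   hold at the call site, otherwise behavior is undefined.  */
double *
promise_misalignment (double *p)
{
  return __builtin_assume_aligned (p, 32, 8);
}
```

The misalign form matters for loops that peel toward a known offset: the compiler can still pick an aligned access strategy even though the pointer itself is not align-byte aligned.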
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What          |Removed     |Added
----------------------------------------------------------------------------
             Status      |UNCONFIRMED |NEW
   Last reconfirmed      |            |2011.06.16 15:23:05
     Ever Confirmed      |0           |1

--- Comment #1 from Richard Guenther <rguenth at gcc dot gnu.org> 2011-06-16 15:23:05 UTC ---
Does -mtune=barcelona improve it?  What Intel CPUs?  I suppose the
vectorizer cost model could be adjusted for -mtune=generic?  I suppose the
old rev. is equivalent to -fno-tree-vectorize?

On AMD K8 I get

  38.26user 0.12system 0:38.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

with vectorization and

  31.09user 0.08system 0:31.21elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

without.  With -mtune=barcelona I get

  37.08user 0.20system 0:37.39elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

and the following with native tuning

  32.93user 0.25system 0:33.20elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

(movlpd instead of movsd and incl instead of add are the only differences
to generic).  So, confirmed on AMD K8 as well.
Richard Guenther <rguenth at gcc dot gnu.org> changed:

           What          |Removed |Added
----------------------------------------------------------------------------
                 CC      |        |hjl at gcc dot gnu.org,
                         |        |hubicka at gcc dot gnu.org
   Target Milestone      |---     |4.5.4
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2011-06-16 16:33:16 UTC ---
I was testing on SandyBridge, but it was reported to us for Core2.
The loop used to be vectorized in 4.4 and is now too; in both cases it does
a huge, hard to decipher test with many conditions and either uses the
non-vectorized loop or the vectorized loop.  In r148210 the condition also
contained:

  vect_p.44_29 = (vector double *) out1_6(D);
  addr2int0.45_28 = (long int) vect_p.44_29;
  vect_p.48_37 = (vector double *) out2_21(D);
  addr2int1.49_38 = (long int) vect_p.48_37;
  orptrs1.50_40 = addr2int0.45_28 | addr2int1.49_38;
  vect_p.53_41 = (vector double *) out3_34(D);
  addr2int2.54_42 = (long int) vect_p.53_41;
  orptrs2.55_51 = orptrs1.50_40 | addr2int2.54_42;
  andmask.56_52 = orptrs2.55_51 & 15;
  ...
  D.2833_72 = andmask.56_52 == 0;

but the new condition does not contain it, and previously it used aligned
movapd stores in the loop:

  movapd  %xmm0, (%rdi,%r10)
  ...
  movapd  %xmm0, (%rsi,%r10)
  ...
  movapd  %xmm0, (%rdx,%r10)

while newly it uses:

  movlpd  %xmm0, (%rdi,%rbx)
  movhpd  %xmm0, 8(%rdi,%rbx)
  ...
  movlpd  %xmm0, (%rsi,%rbx)
  movhpd  %xmm0, 8(%rsi,%rbx)
  ...
  movlpd  %xmm0, (%rdx,%rbx)
  movhpd  %xmm0, 8(%rdx,%rbx)

Surprisingly, the new code is slower even when the pointers aren't aligned:

r128110:
Strip out best and worst realtime result
minimum: 8.849950347 sec real / 0.85810 sec CPU
maximum: 9.278652529 sec real / 0.000153471 sec CPU
average: 9.055898562 sec real / 0.000138755 sec CPU
stdev  : 0.073603342 sec real / 0.16469 sec CPU

r128111:
Strip out best and worst realtime result
minimum: 12.089365836 sec real / 0.81233 sec CPU
maximum: 12.378188295 sec real / 0.000158253 sec CPU
average: 12.234883839 sec real / 0.000136920 sec CPU
stdev  : 0.073461527 sec real / 0.17463 sec CPU

(same baz routine, but with

  double a[6] __attribute__((aligned (32)));
  int main ()
  {
    int i;
    for (i = 0; i < 50; i++)
      baz (a + 1, a + 10001, a + 3, a + 4, a + 5, 1);
    return 0;
  }

instead).
Here, in the r128110-generated code it uses the scalar loop, while in
r128111 it uses the vectorized one with those movlpd+movhpd stores.  So in
this particular case, for this particular CPU, it would be better if the
cost model said that it should verify whether all store pointers are
sufficiently aligned and only use the vectorized loop in that case.

BTW, the vectorization condition is really long; is it a good idea to let
it go through with just a single branch at the end?  Wouldn't it be better
to test the few checks most likely to fail first, then a conditional
branch, then some other tests, then again a conditional branch?

I've talked with Richard on IRC about how users could promise the compiler
that the pointers are sufficiently aligned, so that it can just assume the
alignment (instead of testing for it) and use it in the loop, both for
loads and stores.  Possibilities include
__attribute__((ptr_align (align [, misalign]))) on const pointer
parameters and const pointer variables, or assertions added using
__builtin_unreachable ().  But now that I think about it more: we already
version the loop for vectorization in this case, so wouldn't it be better
to just add some extension which would allow the user to say something is
likely?  Such a hint could e.g. say that some pointer is likely to be so
and so aligned/misaligned, or that pointers don't overlap (yeah, I know,
we have restrict, but e.g. on STL containers it is more fun to add those).
E.g. if this loop was hinted that all 5 pointers are likely 16-byte
aligned and that neither in1[0..len-1] nor in2[0..len-1] overlaps
out{1,2,3}[0..len-1], the vectorizer could verify those conditions at
runtime and use a faster vectorized loop assuming correct alignment and
__restrict, while the fallback case (vectorization not beneficial, some
overlap somewhere, or misaligned pointers) would be a scalar loop assuming
none of that.
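The __builtin_unreachable-based assertion mentioned above can already be written today; a minimal sketch (the macro name is made up for illustration):

```c
#include <stdint.h>

/* Illustrative macro: if the pointer were not 16-byte aligned, control
   would reach __builtin_unreachable (), so GCC may assume the alignment
   holds on every path that follows.  The test compiles away; only the
   alignment fact remains for the optimizers.  */
#define ASSUME_ALIGNED_16(p)                    \
  do                                            \
    {                                           \
      if (((uintptr_t) (p) & 15) != 0)          \
        __builtin_unreachable ();               \
    }                                           \
  while (0)

void
fill (double *out, double v, int len)
{
  ASSUME_ALIGNED_16 (out);
  for (int i = 0; i < len; i++)
    out[i] = v;
}
```

Unlike the proposed "likely" hints, this is a hard promise: calling fill with a misaligned pointer is undefined behavior, which is exactly why a separate likely-alignment hint (checked at runtime, with a scalar fallback) would be the safer user-facing interface.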
Or perhaps the hints could tell the vectorizer to emit 3 different versions instead of two, each with different assumptions or something similar.