[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-08-01 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

   Keywords||missed-optimization
   Priority|P3  |P2


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-30 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #8 from Jakub Jelinek jakub at gcc dot gnu.org 2011-06-30 
13:56:46 UTC ---
__builtin_assume_aligned is now supported on the trunk.  Leaving this open
even on the trunk so that the default cost model can be adjusted.
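
For illustration, a minimal sketch of how the builtin is meant to be used
(the function and variable names here are made up, not from this PR's
testcase):

void
scale (double *out, const double *in, int len)
{
  int i;
  /* Promise 16-byte alignment of both pointers; the builtin returns its
     argument, so the assignments go away once the alignment is known.  */
  out = __builtin_assume_aligned (out, 16);
  in = __builtin_assume_aligned (in, 16);
  for (i = 0; i < len; i++)
    out[i] = 2.0 * in[i];
}

With the alignment known, the vectorizer can use aligned vector stores
instead of the movlpd/movhpd pairs discussed below.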


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-21 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #5 from Jakub Jelinek jakub at gcc dot gnu.org 2011-06-21 
12:43:04 UTC ---
Created attachment 24571
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24571
gcc47-builtins.patch

Prerequisite patch for the __builtin_assume_aligned patch.  These are just
random things I've noticed while working on that patch.


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-21 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #6 from Jakub Jelinek jakub at gcc dot gnu.org 2011-06-21 
12:50:11 UTC ---
Created attachment 24572
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24572
gcc47-assume-aligned.patch

Current version of the __builtin_assume_aligned support.  On top of the
previous patch.
Still unfinished areas:
1) vectorizer/data-refs unfortunately ignores the computed alignment;
   it should use get_pointer_alignment/get_object_alignment.
2) it would be nice if the builtin were special-cased in the C/C++ FEs and
   acted like
   template <typename T> T __builtin_assume_aligned (T, size_t, ...);
   (for pointer types only) instead of
   void *__builtin_assume_aligned (const void *, size_t, ...);
   (see the sketch after this list)
3) testcases need to be added.
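
A small C sketch of item 2's point, assuming the current void * prototype
(the function name is made up):

double *
aligned16 (double *q)
{
  /* The C FE converts the void * result implicitly.  */
  return __builtin_assume_aligned (q, 16);
  /* In C++ an explicit cast would be needed, e.g.
       return (double *) __builtin_assume_aligned (q, 16);
     which is what the template-like form above would make unnecessary.  */
}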


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-21 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #7 from Jakub Jelinek jakub at gcc dot gnu.org 2011-06-21 
12:50:47 UTC ---
Of course something like #c4 is highly desirable independently of this.


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-19 Thread irar at il dot ibm.com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

Ira Rosen irar at il dot ibm.com changed:

   What|Removed |Added

 CC||irar at il dot ibm.com

--- Comment #4 from Ira Rosen irar at il dot ibm.com 2011-06-19 08:25:05 UTC 
---
We can try to fix this with the cost model and additional heuristic in
vect_enhance_data_refs_alignment. Currently we decide not to do versioning for
alignment, because all the accesses are supported anyway. Maybe something like
the following condition for versioning could help (when all the alignment
values are unknown):
if (number_of_loads * cost_of_misaligned_load
    + number_of_stores * cost_of_misaligned_store
    + approx_vector_iteration_cost_without_drs
    > approx_scalar_iteration_cost * vectorization_factor)
  do_versioning = true;

Ira
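
A standalone C sketch of the heuristic above; the parameter names mirror the
pseudocode and are not actual vectorizer fields:

#include <stdbool.h>

/* Version for alignment when a vector iteration full of misaligned
   accesses would cost more than the vectorization_factor scalar
   iterations it replaces.  */
static bool
should_version_for_alignment (int number_of_loads, int number_of_stores,
                              int cost_of_misaligned_load,
                              int cost_of_misaligned_store,
                              int approx_vector_iteration_cost_without_drs,
                              int approx_scalar_iteration_cost,
                              int vectorization_factor)
{
  return number_of_loads * cost_of_misaligned_load
         + number_of_stores * cost_of_misaligned_store
         + approx_vector_iteration_cost_without_drs
         > approx_scalar_iteration_cost * vectorization_factor;
}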


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-17 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #3 from Jakub Jelinek jakub at gcc dot gnu.org 2011-06-17 
08:15:38 UTC ---
ICC apparently has __assume_aligned (ptr, align) for this, and also
#pragma vector {aligned,unaligned,always,nontemporal}.
For alignment, I'd say having both hard-alignment and likely-alignment hints
somewhere in the code would be better than a ptr_align attribute on the
arguments, so something like
__builtin_assume_aligned (ptr, align[, misalign])
and
__builtin_likely_aligned (ptr, align[, misalign]);
would be helpful.  The question is whether they shouldn't instead return the
pointer and let the user write it in the form
ptr = __builtin_assume_aligned (ptr, 16);
which would be optimized away once the alignment has been computed.
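
The optional third argument in the proposed form would express a known
misalignment; a sketch with made-up values:

double *
offset_ptr (double *q)
{
  /* The caller guarantees q % 32 == 8 (illustrative numbers only).  */
  q = __builtin_assume_aligned (q, 32, 8);
  return q;
}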


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2011.06.16 15:23:05
 Ever Confirmed|0   |1

--- Comment #1 from Richard Guenther rguenth at gcc dot gnu.org 2011-06-16 
15:23:05 UTC ---
Does -mtune=barcelona improve it?  What Intel CPUs?  I suppose the vectorizer
cost model could be adjusted for -mtune=generic?  I suppose the old rev.
is equivalent to -fno-tree-vectorize?

On AMD K8 I get

38.26user 0.12system 0:38.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

with vectorization and

31.09user 0.08system 0:31.21elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

without.  With -mtune=barcelona I get

37.08user 0.20system 0:37.39elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

and the following with native tuning

32.93user 0.25system 0:33.20elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

(the only differences from generic tuning are movlpd instead of movsd and
incl instead of add).
So, confirmed on AMD K8 as well.


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 CC||hjl at gcc dot gnu.org,
   ||hubicka at gcc dot gnu.org
   Target Milestone|--- |4.5.4


[Bug tree-optimization/49442] [4.5/4.6/4.7 Regression] Misaligned store support pessimization

2011-06-16 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49442

--- Comment #2 from Jakub Jelinek jakub at gcc dot gnu.org 2011-06-16 
16:33:16 UTC ---
I was testing on SandyBridge, but it was reported to us for Core2.
The loop used to be vectorized in 4.4 and still is now; in both cases a huge,
hard-to-decipher test with many conditions decides whether the non-vectorized
or the vectorized loop is used.  In r148210 the condition also included:
  vect_p.44_29 = (vector double *) out1_6(D);
  addr2int0.45_28 = (long int) vect_p.44_29;
  vect_p.48_37 = (vector double *) out2_21(D);
  addr2int1.49_38 = (long int) vect_p.48_37;
  orptrs1.50_40 = addr2int0.45_28 | addr2int1.49_38;
  vect_p.53_41 = (vector double *) out3_34(D);
  addr2int2.54_42 = (long int) vect_p.53_41;
  orptrs2.55_51 = orptrs1.50_40 | addr2int2.54_42;
  andmask.56_52 = orptrs2.55_51 & 15;
...
  D.2833_72 = andmask.56_52 == 0;
but the new condition does not include it.  Previously the loop used aligned
movapd stores:
movapd  %xmm0, (%rdi,%r10)
...
movapd  %xmm0, (%rsi,%r10)
...
movapd  %xmm0, (%rdx,%r10)
while newly it uses:
movlpd  %xmm0, (%rdi,%rbx)
movhpd  %xmm0, 8(%rdi,%rbx)
...
movlpd  %xmm0, (%rsi,%rbx)
movhpd  %xmm0, 8(%rsi,%rbx)
...
movlpd  %xmm0, (%rdx,%rbx)
movhpd  %xmm0, 8(%rdx,%rbx)

Surprisingly, the new code is slower even when the pointers aren't aligned:
r148210:
Strip out best and worst realtime result
minimum: 8.849950347 sec real / 0.85810 sec CPU
maximum: 9.278652529 sec real / 0.000153471 sec CPU
average: 9.055898562 sec real / 0.000138755 sec CPU
stdev  : 0.073603342 sec real / 0.16469 sec CPU
r148211:
Strip out best and worst realtime result
minimum: 12.089365836 sec real / 0.81233 sec CPU
maximum: 12.378188295 sec real / 0.000158253 sec CPU
average: 12.234883839 sec real / 0.000136920 sec CPU
stdev  : 0.073461527 sec real / 0.17463 sec CPU
(same baz routine, and
double a[6] __attribute__((aligned (32)));
int
main ()
{
  int i;
  for (i = 0; i < 50; i++)
    baz (a + 1, a + 10001, a + 3, a + 4, a + 5, 1);
  return 0;
}
instead).  Here, the r148210 code uses the scalar loop, while the r148211
code uses the vectorized one with those movlpd+movhpd stores.
So in this particular case, for this particular CPU, it would be better if the
cost model required verifying that all store pointers are sufficiently aligned
and used the vectorized loop only in that case.

BTW, the vectorization condition is really long; is it a good idea to let it
all go through with just a single branch at the end?  Wouldn't it be better to
test the checks most likely to fail first, take a conditional branch, then do
some other tests, then another conditional branch?
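
A rough C sketch of that shape (all names, the 16-byte requirement and the
single overlap test are made up for the example, not vectorizer output):

#include <stddef.h>
#include <stdint.h>

static int
use_vector_loop (const double *out1, const double *out2, const double *out3,
                 const double *in1, size_t len, size_t vf)
{
  /* Check the condition most likely to fail (store alignment) first and
     branch early; only then evaluate the remaining, usually-passing tests.  */
  if ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) out3) & 15) != 0)
    return 0;           /* some store pointer is misaligned */
  if (len < vf)
    return 0;           /* too few iterations to vectorize */
  if ((uintptr_t) in1 < (uintptr_t) (out1 + len)
      && (uintptr_t) out1 < (uintptr_t) (in1 + len))
    return 0;           /* in1 overlaps out1 */
  return 1;
}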

I've talked with Richard on IRC about how users could promise the compiler
that the pointers are sufficiently aligned, so that it can simply assume the
alignment (instead of testing for it) and rely on it in the loop, both for
loads and stores.  Possibilities include
__attribute__((ptr_align (align [, misalign]))) on const pointer parameters
and const pointer variables, or __builtin_unreachable ()-based assertions.
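
The __builtin_unreachable ()-based assertion would look roughly like this
(the macro name is made up):

#include <stdint.h>

/* If PTR were not ALIGN-byte aligned, control would reach
   __builtin_unreachable (), so the optimizer may assume it is.  */
#define ASSUME_ALIGNED(ptr, align)                              \
  do {                                                          \
    if (((uintptr_t) (ptr) & ((uintptr_t) (align) - 1)) != 0)   \
      __builtin_unreachable ();                                 \
  } while (0)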

But now that I think about it more, since we already version the loop for
vectorization in this case, wouldn't it be better to just add some extension
that lets the user say something is likely?  Such a hint could say, e.g., that
some pointer is likely to be aligned/misaligned in a particular way, or that
pointers don't overlap (yeah, I know, we have restrict, but e.g. on STL
containers it is more fun to add those).

E.g. if this loop were hinted that all 5 pointers are 16-byte aligned and
that neither in1[0..len-1] nor in2[0..len-1] overlaps out{1,2,3}[0..len-1], the
vectorizer could verify those conditions at runtime and use a faster
vectorized loop that assumes correct alignment and __restrict, while the
fallback path (vectorization not beneficial, some overlap somewhere, or
misaligned pointers) would be a scalar loop assuming none of that.
Or perhaps the hints could tell the vectorizer to emit three different
versions instead of two, each with different assumptions, or something
similar.
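
As a rough source-level illustration of the two-version idea (the function,
the loop body, the 16-byte alignment and the abbreviated overlap check are
all made up; the real versioning decision would be the vectorizer's):

#include <stddef.h>
#include <stdint.h>

void
baz2 (double *out1, double *out2, const double *in1, size_t len)
{
  size_t i;
  /* Runtime check of the hinted conditions (overlap test abbreviated).  */
  if ((((uintptr_t) out1 | (uintptr_t) out2 | (uintptr_t) in1) & 15) == 0
      && ((uintptr_t) (in1 + len) <= (uintptr_t) out1
          || (uintptr_t) (out1 + len) <= (uintptr_t) in1))
    {
      /* Fast path: aligned and non-overlapping, so the compiler can
         vectorize with aligned accesses and no aliasing assumptions.  */
      double *restrict o1 = __builtin_assume_aligned (out1, 16);
      double *restrict o2 = __builtin_assume_aligned (out2, 16);
      const double *restrict i1 = __builtin_assume_aligned (in1, 16);
      for (i = 0; i < len; i++)
        {
          o1[i] = i1[i] + 1.0;
          o2[i] = i1[i] * 2.0;
        }
    }
  else
    {
      /* Fallback: scalar loop assuming nothing extra.  */
      for (i = 0; i < len; i++)
        {
          out1[i] = in1[i] + 1.0;
          out2[i] = in1[i] * 2.0;
        }
    }
}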