[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3

whaley at cs dot utsa dot edu Mon, 07 Aug 2006 19:59:27 -0700


------- Comment #45 from whaley at cs dot utsa dot edu  2006-08-08 02:59 -------
Guys,


OK, with Dorit's -fdump-tree-vect-details, I made a little progress on
vectorization.  In order to get vectorization to work, I had to add the flag
'-funsafe-math-optimizations'.  I will try to create a tarfile with everything
tomorrow so you guys can see all the output, but is it normal to need this
flag to get vectorization?  SSE is IEEE compliant (unless you turn it off), and
ATLAS needs to stay IEEE, so I can't turn on unsafe-math-opt in general . . .

With these flags, gcc can vectorize the kernel if I do no unrolling at all.  I
have not yet run the full search with these flags, but I've done quite a few
hand-called cases, and the performance is lower than either the x87 (best) or
scalar SSE for double on both the P4E and Ath64X2.  For single precision, there
is a modest speedup over the x87 code on both systems, but the total is *way*
below my assembly SSE kernels.

I just quickly glanced at the code, and I see that it never uses "movapd" from
memory, which is a key to getting decent performance.  ATLAS ensures that the
input matrices (A & B) are 16-byte aligned.  Is there any pragma/flag/etc I can
set that says "pointer X points to data that is 16-byte aligned"?

Thanks,
Clint
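
For illustration, the kind of alignment hint asked about above can be sketched
as follows.  This is a minimal sketch, not code from ATLAS or from this bug
report: the function name dot_aligned is hypothetical, and it assumes a
compiler that provides __builtin_assume_aligned, which appeared in GCC 4.7 and
is not available in the gcc 4.0/4.1 releases discussed here.  It also assumes
the inner loop is a floating-point sum reduction; vectorizing such a loop
reorders the additions, which is one plausible reason the vectorizer asks for
'-funsafe-math-optimizations'.

```c
#include <stddef.h>

/* Hypothetical kernel, not taken from ATLAS: dot product of two double
 * arrays that the caller guarantees are 16-byte aligned. */
double dot_aligned(const double *a, const double *b, size_t n)
{
    /* __builtin_assume_aligned returns its first argument and lets the
     * compiler assume the result is aligned to the given byte boundary.
     * Assumption: GCC 4.7 or later (or a compatible compiler). */
    const double *pa = __builtin_assume_aligned(a, 16);
    const double *pb = __builtin_assume_aligned(b, 16);
    double sum = 0.0;
    size_t i;

    /* Sum reduction: a vectorized version accumulates partial sums in
     * a different order than this scalar loop does. */
    for (i = 0; i < n; i++)
        sum += pa[i] * pb[i];

    return sum;
}
```

Whether the compiler then emits aligned loads such as "movapd" depends on the
compiler version and flags; the -fdump-tree-vect-details output mentioned
above is one way to check what the vectorizer decided.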


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827

[Bug target/27827] [4.0/4.1 Regression] gcc 4 produces worse x87 code on all platforms than gcc 3
