[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-15 Thread jamborm at gcc dot gnu dot org


--- Comment #18 from jamborm at gcc dot gnu dot org  2010-06-15 09:48 
---
Subject: Bug 44423

Author: jamborm
Date: Tue Jun 15 09:48:39 2010
New Revision: 160775

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=160775
Log:
2010-06-15  Martin Jambor  mjam...@suse.cz

PR tree-optimization/44423
* tree-sra.c (dump_access): Dump also grp_assignment_read.
(analyze_access_subtree): Pass negative allow_replacements to children
if the current type is scalar.

* testsuite/gcc.dg/tree-ssa/pr44423.c: New test.


Added:
branches/gcc-4_5-branch/gcc/testsuite/gcc.dg/tree-ssa/pr44423.c
Modified:
branches/gcc-4_5-branch/gcc/ChangeLog
branches/gcc-4_5-branch/gcc/testsuite/ChangeLog
branches/gcc-4_5-branch/gcc/tree-sra.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-15 Thread jamborm at gcc dot gnu dot org


--- Comment #19 from jamborm at gcc dot gnu dot org  2010-06-15 10:04 
---
This is now fixed on both the trunk and the 4.5 branch.


-- 

jamborm at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution||FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-14 Thread jamborm at gcc dot gnu dot org


--- Comment #15 from jamborm at gcc dot gnu dot org  2010-06-14 12:39 
---
(In reply to comment #14)
 SSE performance is fine again, thanks a lot!
 
 One more question, if that's OK...
 Depending on ARRSZ the testcase uses wildly varying amounts of CPU time; it's
 about half a second for ARRSZ=1024, but almost 10 seconds for ARRSZ=20 on my
 machine, which is extremely strange because the operation count is the same in
 both cases. I suspect that something weird is happening with respect to the
 cache and prefetching. Should I open another PR for this?
 

The generated assembly is not different for the two cases, except that
there are much smaller offsets, of course.  This means that the lpic
and pre1 arrays are much closer to each other which may be something
the processor does not like.  I find this surprising but unless you
can think of a specific missed optimization opportunity (I can't), I
don't think it is a PR material.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-14 Thread martin at mpa-garching dot mpg dot de


--- Comment #16 from martin at mpa-garching dot mpg dot de  2010-06-14 
12:46 ---
(In reply to comment #15)

I have found the problem in the meantime ... it's my mistake, sorry about the
noise :(

The problem is that I did not explicitly zero the arrays in main(), so they
apparently contained NaN or similar nastinesses for the small ARRSZ, and
usual numbers for large ARRSZ. Of course the processor chokes on the
unusual numbers and takes much longer to execute the code.
I'm not sure whether the zeroing should be added for the regression test case
... but since you check for compiler diagnostic and do not try to run the
resulting executable that's probably not necessary.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-14 Thread jamborm at gcc dot gnu dot org


--- Comment #17 from jamborm at gcc dot gnu dot org  2010-06-14 12:50 
---
OK, I did not put much effort into my thinking about it :-)
Yes, the testcase is fine as it is.
I'm not testing the patch on the 4.5 branch and will commit it today
if everything goes fine.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-09 Thread jamborm at gcc dot gnu dot org


--- Comment #11 from jamborm at gcc dot gnu dot org  2010-06-09 09:02 
---
(In reply to comment #10)
 (In reply to comment #9)
  (In reply to comment #8)
   I don't think you need flow-sensitivity.
   
   Basically when you have only aggregate uses (as in this case)
  
  Vectors are considered scalars in GCC.  That is why the solutions
  described above work.
  
   then you only want to scalarize if the destination of the use is
   scalarized as well (to be able to copyprop out the aggregate copy).
  
  Well, that is what I thought until someone filed PR 43846.
 
 Hm, yes.  But there you know that
 
   D.2464.m[0] = D.2473_20;
   D.2464.m[1] = D.2472_19;
   D.2464.m[2] = D.2471_18;
   *b_1(D) = D.2464;
 
 D.2464 will be dead after scalarization.

If D.2464 was larger than just m, that would not necessarily be the
case and we would still want to avoid the extra copies.

However, I it is true that it would make sense to take
grp_assignment_read into account only if the whole access subtree
would end up with grp_unscalarized_data set to zero but that would
require quite a rewrite of analyze_access_subtree and would not help
in this case because grp_unscalarized_data is zero, the union is
covered by scalar replacements.

The real issue is that

  In the particular case of the
 current bug the aggregate remains live because of the load from va.v
 which we cannot scalarize(*).

we determine this very late, in sra_modify_assign (right after the big
comment) and in the most general form this can be determined only when
we already have the whole access tree (so if we wanted to do this
during analysis, we would have to scan the function body twice).
Nevertheless, for scalar accesses that have scalar sub-accesses this
is always true and it can be easily detected and so we can simply
disallow them, like I wrote in comment #7.  And disallow them always,
since otherwise it would be easy to _add_ stuff to the function that
is causing trouble now so that any heuristics is confused and decides
to produce replacements.

I'll submit a patch in a while.

 
 (*) we can scalarize this particular case if you manage to build a
 proper constructor from the elements - but that's probably a bit
 involved.
 

Well, I don't think I want to implement that... but I am curious,
would that actually lead to better (or even different) code if I
placed something like that into the loop?  And I also thought that in
gimple, constructors only could have invariants in them.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-09 Thread jamborm at gcc dot gnu dot org


--- Comment #12 from jamborm at gcc dot gnu dot org  2010-06-09 09:05 
---
(In reply to comment #11)
D.2464.m[0] = D.2473_20;
D.2464.m[1] = D.2472_19;
D.2464.m[2] = D.2471_18;
*b_1(D) = D.2464;
  
  D.2464 will be dead after scalarization.
 
 If D.2464 was larger than just m, that would not necessarily be the
 case and we would still want to avoid the extra copies.
 

Of course this would be true only if the last assignment was 

*b_1(D) = D.2464.m;

but the argument is the same.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-09 Thread jamborm at gcc dot gnu dot org


--- Comment #13 from jamborm at gcc dot gnu dot org  2010-06-09 11:20 
---
Subject: Bug 44423

Author: jamborm
Date: Wed Jun  9 11:20:03 2010
New Revision: 160462

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=160462
Log:
2010-06-09  Martin Jambor  mjam...@suse.cz

PR tree-optimization/44423
* tree-sra.c (dump_access): Dump also grp_assignment_read.
(analyze_access_subtree): Pass negative allow_replacements to children
if the current type is scalar.

* testsuite/gcc.dg/tree-ssa/pr44423.c: New test.


Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr44423.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-sra.c


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-09 Thread martin at mpa-garching dot mpg dot de


--- Comment #14 from martin at mpa-garching dot mpg dot de  2010-06-09 
12:06 ---
SSE performance is fine again, thanks a lot!

One more question, if that's OK...
Depending on ARRSZ the testcase uses wildly varying amounts of CPU time; it's
about half a second for ARRSZ=1024, but almost 10 seconds for ARRSZ=20 on my
machine, which is extremely strange because the operation count is the same in
both cases. I suspect that something weird is happening with respect to the
cache and prefetching. Should I open another PR for this?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread jamborm at gcc dot gnu dot org


--- Comment #4 from jamborm at gcc dot gnu dot org  2010-06-08 13:16 ---
Mine


-- 

jamborm at gcc dot gnu dot org changed:

   What|Removed |Added

 AssignedTo|unassigned at gcc dot gnu   |jamborm at gcc dot gnu dot
   |dot org |org
 Status|NEW |ASSIGNED
   Last reconfirmed|2010-06-05 10:47:29 |2010-06-08 13:16:32
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread martin at mpa-garching dot mpg dot de


--- Comment #5 from martin at mpa-garching dot mpg dot de  2010-06-08 13:54 
---
(In reply to comment #2)
 We have (4.4):
 
 bb 2:
   va.f[0] = a-r;
   va.f[1] = a-g;
   va.f[2] = a-b;
   va.f[3] = 0.0;
   pretmp.40 = va.v;
   ivtmp.61 = 0;

[...]

Could you please tell me the compiler flag(s) needed to produce this kind of
information? That seems much more useful for debugging and chasing performance
bottlenecks than assembler dumps...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread rguenth at gcc dot gnu dot org


--- Comment #6 from rguenth at gcc dot gnu dot org  2010-06-08 14:02 ---
(In reply to comment #5)
 (In reply to comment #2)
  We have (4.4):
  
  bb 2:
va.f[0] = a-r;
va.f[1] = a-g;
va.f[2] = a-b;
va.f[3] = 0.0;
pretmp.40 = va.v;
ivtmp.61 = 0;
 
 [...]
 
 Could you please tell me the compiler flag(s) needed to produce this kind of
 information? That seems much more useful for debugging and chasing performance
 bottlenecks than assembler dumps...

-fdump-tree-all will leave you with a hunded dump files, program state
after each individual tree optimization pass.  -da will add dumps after
each RTL pass.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread jamborm at gcc dot gnu dot org


--- Comment #7 from jamborm at gcc dot gnu dot org  2010-06-08 14:29 ---
I don't think I can fix this bug in its most general form without
doing some flow-sensitive decisions (which can be difficult for
aggregates) and without causing PR 43846 again.  (Aggregate
copy-propagation and either of the two things described below should
do the trick, though).

As noted, this is caused by a fix to PR 43846 which on the other hand
is certainly not necessary for non-aggregate types when we do type
punning of register types through unions.  I've got a two line patch
testing that and it works (and bootstraps and tests) fine.

However, that is only a change in the new heuristics and if the array
elements are individually read somewhere else in the function too, a
different decision making condition will kick in and we will end up
with the replacements and extra statements in the loop again.

Therefore, I now tend to think that these accesses to SSE vectors are
a good reason to simply disallow scalarization of anything that has a
non-aggregate parent in the SRA access tree.  This would only affect
type punning through unions and weird typecasts (none of which could
be processed by previous SRA).

Actually, I had this disallowed when I was developing the new SRA most
of the time and then decided to allow it only very late.  I don't
remember why I did that.  I'm now testing a patch doing that, maybe
some testcase will remind me what the reason was.

I will ponder about this a bit more but probably will soon submit a
patch doing the latter.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread rguenth at gcc dot gnu dot org


--- Comment #8 from rguenth at gcc dot gnu dot org  2010-06-08 14:50 ---
I don't think you need flow-sensitivity.

Basically when you have only aggregate uses (as in this case) then you only
want to scalarize if the destination of the use is scalarized as well
(to be able to copyprop out the aggregate copy).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread jamborm at gcc dot gnu dot org


--- Comment #9 from jamborm at gcc dot gnu dot org  2010-06-08 15:00 ---
(In reply to comment #8)
 I don't think you need flow-sensitivity.
 
 Basically when you have only aggregate uses (as in this case)

Vectors are considered scalars in GCC.  That is why the solutions
described above work.

 then you only want to scalarize if the destination of the use is
 scalarized as well (to be able to copyprop out the aggregate copy).

Well, that is what I thought until someone filed PR 43846.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-08 Thread rguenth at gcc dot gnu dot org


--- Comment #10 from rguenth at gcc dot gnu dot org  2010-06-08 15:11 
---
(In reply to comment #9)
 (In reply to comment #8)
  I don't think you need flow-sensitivity.
  
  Basically when you have only aggregate uses (as in this case)
 
 Vectors are considered scalars in GCC.  That is why the solutions
 described above work.
 
  then you only want to scalarize if the destination of the use is
  scalarized as well (to be able to copyprop out the aggregate copy).
 
 Well, that is what I thought until someone filed PR 43846.

Hm, yes.  But there you know that

  D.2464.m[0] = D.2473_20;
  D.2464.m[1] = D.2472_19;
  D.2464.m[2] = D.2471_18;
  *b_1(D) = D.2464;

D.2464 will be dead after scalarization.  In the particular case of the
current bug the aggregate remains live because of the load from va.v
which we cannot scalarize(*).

(*) we can scalarize this particular case if you manage to build a
proper constructor from the elements - but that's probably a bit
involved.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423



[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA

2010-06-05 Thread rguenth at gcc dot gnu dot org


--- Comment #3 from rguenth at gcc dot gnu dot org  2010-06-05 10:56 ---
Ok.  Fact is that no pass can move invariant store/load pairs.  But that's
pre-existing - the main issue is that the new SRA implementation ends up
rematerializing the stores inside the loop!

Diff of pre-esra vs. esra:

 bb 2:
   D.4339_3 = a_2(D)-r;
-  va.f[0] = D.4339_3;
+  va$f$0_33 = D.4339_3;
   D.4340_4 = a_2(D)-g;
-  va.f[1] = D.4340_4;
+  va$f$1_32 = D.4340_4;
   D.4341_5 = a_2(D)-b;
-  va.f[2] = D.4341_5;
-  va.f[3] = 0.0;
+  va$f$2_31 = D.4341_5;
+  va$f$3_30 = 0.0;
   y_6 = 0;
   goto bb 4;

@@ -504,6 +203,10 @@
   tmpatt_37 = {D.4375_36, D.4375_36, D.4375_36, D.4375_36};
   tmpatt_40 = tmpatt_37;
   tmpatt_15 = tmpatt_40;
+  va.f[0] = va$f$0_33;
+  va.f[1] = va$f$1_32;
+  va.f[2] = va$f$2_31;
+  va.f[3] = va$f$3_30;
   D.4347_16 = va.v;
   tmpatt_38 = __builtin_ia32_mulps (tmpatt_15, D.4347_16);
   tmpatt_41 = tmpatt_38;

that's of course bad (and the scalarization in this particular case looks
useless, too - the only use is an aggregate one, covering all stores).


-- 

rguenth at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||jamborm at gcc dot gnu dot
   ||org
  Component|regression  |tree-optimization
Summary|[4.5/4.6] Massive   |[4.5/4.6 Regression] Massive
   |performance regression in   |performance regression in
   |SSE code|SSE code due to SRA
   Target Milestone|--- |4.5.1


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423