[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #18 from jamborm at gcc dot gnu dot org 2010-06-15 09:48 --- Subject: Bug 44423 Author: jamborm Date: Tue Jun 15 09:48:39 2010 New Revision: 160775 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=160775 Log: 2010-06-15 Martin Jambor mjam...@suse.cz PR tree-optimization/44423 * tree-sra.c (dump_access): Dump also grp_assignment_read. (analyze_access_subtree): Pass negative allow_replacements to children if the current type is scalar. * testsuite/gcc.dg/tree-ssa/pr44423.c: New test. Added: branches/gcc-4_5-branch/gcc/testsuite/gcc.dg/tree-ssa/pr44423.c Modified: branches/gcc-4_5-branch/gcc/ChangeLog branches/gcc-4_5-branch/gcc/testsuite/ChangeLog branches/gcc-4_5-branch/gcc/tree-sra.c -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #19 from jamborm at gcc dot gnu dot org 2010-06-15 10:04 --- This is now fixed on both the trunk and the 4.5 branch. -- jamborm at gcc dot gnu dot org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #15 from jamborm at gcc dot gnu dot org 2010-06-14 12:39 --- (In reply to comment #14) SSE performance is fine again, thanks a lot! One more question, if that's OK... Depending on ARRSZ the testcase uses wildly varying amounts of CPU time; it's about half a second for ARRSZ=1024, but almost 10 seconds for ARRSZ=20 on my machine, which is extremely strange because the operation count is the same in both cases. I suspect that something weird is happening with respect to the cache and prefetching. Should I open another PR for this? The generated assembly is not different for the two cases, except that there are much smaller offsets, of course. This means that the lpic and pre1 arrays are much closer to each other which may be something the processor does not like. I find this surprising but unless you can think of a specific missed optimization opportunity (I can't), I don't think it is a PR material. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #16 from martin at mpa-garching dot mpg dot de 2010-06-14 12:46 --- (In reply to comment #15) I have found the problem in the meantime ... it's my mistake, sorry about the noise :( The problem is that I did not explicitly zero the arrays in main(), so they apparently contained NaN or similar nastinesses for the small ARRSZ, and usual numbers for large ARRSZ. Of course the processor chokes on the unusual numbers and takes much longer to execute the code. I'm not sure whether the zeroing should be added for the regression test case ... but since you check for compiler diagnostic and do not try to run the resulting executable that's probably not necessary. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #17 from jamborm at gcc dot gnu dot org 2010-06-14 12:50 --- OK, I did not put much effort into my thinking about it :-) Yes, the testcase is fine as it is. I'm not testing the patch on the 4.5 branch and will commit it today if everything goes fine. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #11 from jamborm at gcc dot gnu dot org 2010-06-09 09:02 --- (In reply to comment #10) (In reply to comment #9) (In reply to comment #8) I don't think you need flow-sensitivity. Basically when you have only aggregate uses (as in this case) Vectors are considered scalars in GCC. That is why the solutions described above work. then you only want to scalarize if the destination of the use is scalarized as well (to be able to copyprop out the aggregate copy). Well, that is what I thought until someone filed PR 43846. Hm, yes. But there you know that D.2464.m[0] = D.2473_20; D.2464.m[1] = D.2472_19; D.2464.m[2] = D.2471_18; *b_1(D) = D.2464; D.2464 will be dead after scalarization. If D.2464 was larger than just m, that would not necessarily be the case and we would still want to avoid the extra copies. However, I it is true that it would make sense to take grp_assignment_read into account only if the whole access subtree would end up with grp_unscalarized_data set to zero but that would require quite a rewrite of analyze_access_subtree and would not help in this case because grp_unscalarized_data is zero, the union is covered by scalar replacements. The real issue is that In the particular case of the current bug the aggregate remains live because of the load from va.v which we cannot scalarize(*). we determine this very late, in sra_modify_assign (right after the big comment) and in the most general form this can be determined only when we already have the whole access tree (so if we wanted to do this during analysis, we would have to scan the function body twice). Nevertheless, for scalar accesses that have scalar sub-accesses this is always true and it can be easily detected and so we can simply disallow them, like I wrote in comment #7. And disallow them always, since otherwise it would be easy to _add_ stuff to the function that is causing trouble now so that any heuristics is confused and decides to produce replacements. I'll submit a patch in a while. (*) we can scalarize this particular case if you manage to build a proper constructor from the elements - but that's probably a bit involved. Well, I don't think I want to implement that... but I am curious, would that actually lead to better (or even different) code if I placed something like that into the loop? And I also thought that in gimple, constructors only could have invariants in them. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #12 from jamborm at gcc dot gnu dot org 2010-06-09 09:05 --- (In reply to comment #11) D.2464.m[0] = D.2473_20; D.2464.m[1] = D.2472_19; D.2464.m[2] = D.2471_18; *b_1(D) = D.2464; D.2464 will be dead after scalarization. If D.2464 was larger than just m, that would not necessarily be the case and we would still want to avoid the extra copies. Of course this would be true only if the last assignment was *b_1(D) = D.2464.m; but the argument is the same. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #13 from jamborm at gcc dot gnu dot org 2010-06-09 11:20 --- Subject: Bug 44423 Author: jamborm Date: Wed Jun 9 11:20:03 2010 New Revision: 160462 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=160462 Log: 2010-06-09 Martin Jambor mjam...@suse.cz PR tree-optimization/44423 * tree-sra.c (dump_access): Dump also grp_assignment_read. (analyze_access_subtree): Pass negative allow_replacements to children if the current type is scalar. * testsuite/gcc.dg/tree-ssa/pr44423.c: New test. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/pr44423.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #14 from martin at mpa-garching dot mpg dot de 2010-06-09 12:06 --- SSE performance is fine again, thanks a lot! One more question, if that's OK... Depending on ARRSZ the testcase uses wildly varying amounts of CPU time; it's about half a second for ARRSZ=1024, but almost 10 seconds for ARRSZ=20 on my machine, which is extremely strange because the operation count is the same in both cases. I suspect that something weird is happening with respect to the cache and prefetching. Should I open another PR for this? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #4 from jamborm at gcc dot gnu dot org 2010-06-08 13:16 --- Mine -- jamborm at gcc dot gnu dot org changed: What|Removed |Added AssignedTo|unassigned at gcc dot gnu |jamborm at gcc dot gnu dot |dot org |org Status|NEW |ASSIGNED Last reconfirmed|2010-06-05 10:47:29 |2010-06-08 13:16:32 date|| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #5 from martin at mpa-garching dot mpg dot de 2010-06-08 13:54 --- (In reply to comment #2) We have (4.4): bb 2: va.f[0] = a-r; va.f[1] = a-g; va.f[2] = a-b; va.f[3] = 0.0; pretmp.40 = va.v; ivtmp.61 = 0; [...] Could you please tell me the compiler flag(s) needed to produce this kind of information? That seems much more useful for debugging and chasing performance bottlenecks than assembler dumps... -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #6 from rguenth at gcc dot gnu dot org 2010-06-08 14:02 --- (In reply to comment #5) (In reply to comment #2) We have (4.4): bb 2: va.f[0] = a-r; va.f[1] = a-g; va.f[2] = a-b; va.f[3] = 0.0; pretmp.40 = va.v; ivtmp.61 = 0; [...] Could you please tell me the compiler flag(s) needed to produce this kind of information? That seems much more useful for debugging and chasing performance bottlenecks than assembler dumps... -fdump-tree-all will leave you with a hunded dump files, program state after each individual tree optimization pass. -da will add dumps after each RTL pass. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #7 from jamborm at gcc dot gnu dot org 2010-06-08 14:29 --- I don't think I can fix this bug in its most general form without doing some flow-sensitive decisions (which can be difficult for aggregates) and without causing PR 43846 again. (Aggregate copy-propagation and either of the two things described below should do the trick, though). As noted, this is caused by a fix to PR 43846 which on the other hand is certainly not necessary for non-aggregate types when we do type punning of register types through unions. I've got a two line patch testing that and it works (and bootstraps and tests) fine. However, that is only a change in the new heuristics and if the array elements are individually read somewhere else in the function too, a different decision making condition will kick in and we will end up with the replacements and extra statements in the loop again. Therefore, I now tend to think that these accesses to SSE vectors are a good reason to simply disallow scalarization of anything that has a non-aggregate parent in the SRA access tree. This would only affect type punning through unions and weird typecasts (none of which could be processed by previous SRA). Actually, I had this disallowed when I was developing the new SRA most of the time and then decided to allow it only very late. I don't remember why I did that. I'm now testing a patch doing that, maybe some testcase will remind me what the reason was. I will ponder about this a bit more but probably will soon submit a patch doing the latter. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #8 from rguenth at gcc dot gnu dot org 2010-06-08 14:50 --- I don't think you need flow-sensitivity. Basically when you have only aggregate uses (as in this case) then you only want to scalarize if the destination of the use is scalarized as well (to be able to copyprop out the aggregate copy). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #9 from jamborm at gcc dot gnu dot org 2010-06-08 15:00 --- (In reply to comment #8) I don't think you need flow-sensitivity. Basically when you have only aggregate uses (as in this case) Vectors are considered scalars in GCC. That is why the solutions described above work. then you only want to scalarize if the destination of the use is scalarized as well (to be able to copyprop out the aggregate copy). Well, that is what I thought until someone filed PR 43846. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #10 from rguenth at gcc dot gnu dot org 2010-06-08 15:11 --- (In reply to comment #9) (In reply to comment #8) I don't think you need flow-sensitivity. Basically when you have only aggregate uses (as in this case) Vectors are considered scalars in GCC. That is why the solutions described above work. then you only want to scalarize if the destination of the use is scalarized as well (to be able to copyprop out the aggregate copy). Well, that is what I thought until someone filed PR 43846. Hm, yes. But there you know that D.2464.m[0] = D.2473_20; D.2464.m[1] = D.2472_19; D.2464.m[2] = D.2471_18; *b_1(D) = D.2464; D.2464 will be dead after scalarization. In the particular case of the current bug the aggregate remains live because of the load from va.v which we cannot scalarize(*). (*) we can scalarize this particular case if you manage to build a proper constructor from the elements - but that's probably a bit involved. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423
[Bug tree-optimization/44423] [4.5/4.6 Regression] Massive performance regression in SSE code due to SRA
--- Comment #3 from rguenth at gcc dot gnu dot org 2010-06-05 10:56 --- Ok. Fact is that no pass can move invariant store/load pairs. But that's pre-existing - the main issue is that the new SRA implementation ends up rematerializing the stores inside the loop! Diff of pre-esra vs. esra: bb 2: D.4339_3 = a_2(D)-r; - va.f[0] = D.4339_3; + va$f$0_33 = D.4339_3; D.4340_4 = a_2(D)-g; - va.f[1] = D.4340_4; + va$f$1_32 = D.4340_4; D.4341_5 = a_2(D)-b; - va.f[2] = D.4341_5; - va.f[3] = 0.0; + va$f$2_31 = D.4341_5; + va$f$3_30 = 0.0; y_6 = 0; goto bb 4; @@ -504,6 +203,10 @@ tmpatt_37 = {D.4375_36, D.4375_36, D.4375_36, D.4375_36}; tmpatt_40 = tmpatt_37; tmpatt_15 = tmpatt_40; + va.f[0] = va$f$0_33; + va.f[1] = va$f$1_32; + va.f[2] = va$f$2_31; + va.f[3] = va$f$3_30; D.4347_16 = va.v; tmpatt_38 = __builtin_ia32_mulps (tmpatt_15, D.4347_16); tmpatt_41 = tmpatt_38; that's of course bad (and the scalarization in this particular case looks useless, too - the only use is an aggregate one, covering all stores). -- rguenth at gcc dot gnu dot org changed: What|Removed |Added CC||jamborm at gcc dot gnu dot ||org Component|regression |tree-optimization Summary|[4.5/4.6] Massive |[4.5/4.6 Regression] Massive |performance regression in |performance regression in |SSE code|SSE code due to SRA Target Milestone|--- |4.5.1 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44423