[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #62 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #61)
> (In reply to amker from comment #60)
> > (In reply to Bill Schmidt from comment #59)
> > > We don't have a lot of data yet, but we have seen several examples in
> > > SPEC and other benchmarks where turning on -funroll-loops is helpful,
> > > but should be much more helpful -- in many cases performance improves
> > > with a much higher unroll factor.  However, the effectiveness of
> > > unrolling is very much tied up with these issues in IVOPTS, where we
> > > currently end up with too many separate base registers for IVs.  As we
> > > increase the unroll factor, we
> > By this, do you mean too many candidates are chosen?  Or the issue just
> > like this PR describes?  Thanks.
>
> On the surface, it's the issue from this PR where we have lots of separate
> induction variables with their own index registers each requiring an add
> during each iteration.  The presence of this issue masks whether we have too

IMHO, this issue should be fixed by a GIMPLE unroller run before IVOPTS, or
in the RTL unroller.  It's not really practical to fix it in IVOPTS.

> many candidates, but in the sense that we often see register spill
> associated with this kind of code, we do have too many.  I.e., the register
> pressure model may not be in tune with the kind of addressing mode that's
> being selected, but that's just a theory.  Or perhaps pressure is just being
> generically under-predicted for POWER.

IVOPTS' register-pressure model sometimes fails to preserve a small IV set on
aarch64 too; I have this issue on my list.  On the other hand, the loops I
saw are generally very big, so it might be inappropriate for the RTL unroller
to decide to unroll them in the first place.

> Up till now we haven't done a lot of detailed analysis.  Hopefully we can
> free somebody up to start looking at some of our unrolling issues soon.
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #61 from Bill Schmidt ---
(In reply to amker from comment #60)
> (In reply to Bill Schmidt from comment #59)
> > We don't have a lot of data yet, but we have seen several examples in SPEC
> > and other benchmarks where turning on -funroll-loops is helpful, but
> > should be much more helpful -- in many cases performance improves with a
> > much higher unroll factor.  However, the effectiveness of unrolling is
> > very much tied up with these issues in IVOPTS, where we currently end up
> > with too many separate base registers for IVs.  As we increase the unroll
> > factor, we
> By this, do you mean too many candidates are chosen?  Or the issue just
> like this PR describes?  Thanks.

On the surface, it's the issue from this PR where we have lots of separate
induction variables with their own index registers each requiring an add
during each iteration.  The presence of this issue masks whether we have too
many candidates, but in the sense that we often see register spill associated
with this kind of code, we do have too many.  I.e., the register pressure
model may not be in tune with the kind of addressing mode that's being
selected, but that's just a theory.  Or perhaps pressure is just being
generically under-predicted for POWER.

Up till now we haven't done a lot of detailed analysis.  Hopefully we can
free somebody up to start looking at some of our unrolling issues soon.
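[Editorial note] To make the issue concrete, here is a minimal C sketch (mine, not from the thread; the function names are hypothetical) of the two code shapes being discussed: a form where every array reference carries its own pointer induction variable, each costing a register and an add per iteration, versus the single-IV form with base+index addressing that the commenters would like IVOPTS to prefer as the unroll factor grows.

```c
#include <assert.h>

/* Many-IV shape: four pointer IVs, four adds per iteration.  Under a high
   unroll factor, variants of this pattern can exhaust base registers. */
void sum3_many_ivs(const int *a, const int *b, const int *c, int *out, int n)
{
    const int *pa = a, *pb = b, *pc = c;
    int *po = out;
    while (n-- > 0) {
        *po = *pa + *pb + *pc;
        pa++; pb++; pc++; po++;   /* one IV (and one add) per reference */
    }
}

/* Single-IV shape: one index register; the bases fold into base+index
   addressing modes, so only 'i' is updated each iteration. */
void sum3_one_iv(const int *a, const int *b, const int *c, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i] + c[i];
}
```

Both functions compute the same result; the difference is only in how many registers must be live and incremented across the loop body.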
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #60 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #59)
> (In reply to rguent...@suse.de from comment #57)
> >
> > It's been a long time since I've done SPEC measuring with/without
> > -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> > secondary effects as well:
> >
> > toplev.c:flag_web = flag_unroll_loops || flag_peel_loops;
> > toplev.c:flag_rename_registers = flag_unroll_loops || flag_peel_loops;
>
> We don't have a lot of data yet, but we have seen several examples in SPEC
> and other benchmarks where turning on -funroll-loops is helpful, but should
> be much more helpful -- in many cases performance improves with a much
> higher unroll factor.  However, the effectiveness of unrolling is very much
> tied up with these issues in IVOPTS, where we currently end up with too many
> separate base registers for IVs.  As we increase the unroll factor, we

By this, do you mean too many candidates are chosen?  Or the issue just like
this PR describes?  Thanks.

> eventually hit this as a limiting factor, so fixing this IVOPTS issue would
> be very helpful for POWER.
>
> As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll
> hot loop traces in loops that would otherwise be too complex to unroll.
> I.e., if there is a single hot trace through a loop, you can do tail
> duplication on the trace to force it into superblock form, and then peel and
> unroll that superblock while falling into the original loop if the trace is
> left.  Complete unrolling and unrolling by a factor are both possible.  I
> don't know of specific benchmarks that would be helped by this, though.
>
> (An RTL unroller could do this as well, but it seems much more natural and
> implementable in GIMPLE.)
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #59 from Bill Schmidt ---
(In reply to rguent...@suse.de from comment #57)
>
> It's been a long time since I've done SPEC measuring with/without
> -funroll-loops (or/and -fpeel-loops).  Note that these flags have
> secondary effects as well:
>
> toplev.c:flag_web = flag_unroll_loops || flag_peel_loops;
> toplev.c:flag_rename_registers = flag_unroll_loops || flag_peel_loops;

We don't have a lot of data yet, but we have seen several examples in SPEC
and other benchmarks where turning on -funroll-loops is helpful, but should
be much more helpful -- in many cases performance improves with a much
higher unroll factor.  However, the effectiveness of unrolling is very much
tied up with these issues in IVOPTS, where we currently end up with too many
separate base registers for IVs.  As we increase the unroll factor, we
eventually hit this as a limiting factor, so fixing this IVOPTS issue would
be very helpful for POWER.

As a side note, with -fprofile-use a GIMPLE unroller could peel and unroll
hot loop traces in loops that would otherwise be too complex to unroll.
I.e., if there is a single hot trace through a loop, you can do tail
duplication on the trace to force it into superblock form, and then peel and
unroll that superblock while falling into the original loop if the trace is
left.  Complete unrolling and unrolling by a factor are both possible.  I
don't know of specific benchmarks that would be helped by this, though.

(An RTL unroller could do this as well, but it seems much more natural and
implementable in GIMPLE.)
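[Editorial note] The superblock side note above can be illustrated with a loose, hand-written C analogy (mine, not GCC output; the hot/cold split and the unroll factor of 2 are assumed for illustration): the hot trace through the loop is duplicated and unrolled as straight-line code, and control falls back into the original loop body when the trace is left.

```c
#include <assert.h>

/* Original loop: assume profiling says x[i] >= 0 is the hot case. */
long sum_pos(const int *x, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (x[i] >= 0) s += x[i];  /* hot trace */
        else           s -= 1;     /* cold path */
    }
    return s;
}

/* Superblock-style version: the hot trace is unrolled by 2 as straight-line
   code; when a cold element is seen, we leave the superblock and finish in
   the original loop body. */
long sum_pos_superblock(const int *x, int n)
{
    long s = 0;
    int i = 0;
    while (i + 1 < n && x[i] >= 0 && x[i + 1] >= 0) {
        s += x[i] + x[i + 1];      /* unrolled hot trace */
        i += 2;
    }
    for (; i < n; i++) {           /* fall back into the original loop */
        if (x[i] >= 0) s += x[i];
        else           s -= 1;
    }
    return s;
}
```

A real superblock transform would re-enter the trace after a cold excursion; this sketch only shows the basic shape of duplicating the hot path and keeping the original loop as the catch-all.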
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #58 from amker at gcc dot gnu.org ---
(In reply to Bill Schmidt from comment #56)
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach
> > sounds quite promising to me.
>
> I would be willing to soften this statement.  I think that an early
> unroller might well be a profitable approach for most systems with large
> caches and so forth, where if the unrolling heuristics are not completely
> accurate we are still likely to make a reasonably good decision.  However,
> I would expect to see ports with limited caches/memory to want more
> accurate control over unrolling decisions.  So I could see allowing ports
> to select between a GIMPLE unroller and an RTL unroller (I doubt anybody
> would want both).

Thanks for the comments.  As David suggested, we can try to implement a
relatively conservative unroller and make sure it's a win in most unrolled
cases, even with some opportunities missed.  Then we can enable it at the
O3/Ofast level; that would be desirable, I think, since we currently don't
have a general unroller enabled by default.

About the cost model: is it possible to introduce a cache-information model
in GCC?  I don't think it's a difficult problem, and it could be a starting
point for cache-sensitive optimizations in the future.  Another general
question is: what kinds of costs do we need in a good unroller, besides
cache/branch ones?

> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
>
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...

(In reply to rguent...@suse.de from comment #57)
> I think that a separate unrolling on GIMPLE would be a hard sell
> due to the lack of a good cost model.  _But_ doing unrolling as part
> of another transform like we are doing now makes sense.  So does
> eventually moving parts of an RTL pass involving unrolling to
> GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
> to RTL).
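[Editorial note] The "relatively conservative unroller" and cost-model questions above can be sketched as a toy heuristic. This is purely hypothetical code of my own, not GCC's cost model: the struct fields, budgets, and thresholds are all invented for illustration. The idea is to cap the unroll factor so the unrolled body stays within an assumed instruction-cache budget and the replicated IVs stay within an assumed register budget, returning 1 (no unrolling) when in doubt.

```c
#include <assert.h>

/* Hypothetical loop summary -- these fields are illustrative only. */
struct loop_info {
    int body_insns;   /* instructions in one copy of the loop body */
    int live_ivs;     /* induction variables live across the body  */
};

/* Pick the largest power-of-two factor such that the unrolled body fits
   an assumed i-cache budget and the replicated IV count stays under an
   assumed register budget.  Falls back to 1, i.e. no unrolling. */
int conservative_unroll_factor(const struct loop_info *li)
{
    const int insn_budget = 64;  /* assumed i-cache-friendly body size */
    const int reg_budget  = 12;  /* assumed registers available for IVs */
    int factor = 8;              /* assumed hard cap on unrolling */

    while (factor > 1 &&
           (li->body_insns * factor > insn_budget ||
            li->live_ivs * factor > reg_budget))
        factor /= 2;
    return factor;
}
```

Under these assumptions a tiny loop (4 insns, 1 IV) gets the full factor of 8, while a large loop (40 insns, 2 IVs) is left alone, which is the "win in most unrolled cases, even with some opportunities missed" behavior described above.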
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #57 from rguenther at suse dot de ---
On Tue, 11 Aug 2015, wschmidt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256
>
> --- Comment #56 from Bill Schmidt ---
> (In reply to Bill Schmidt from comment #53)
> > I'm not a fan of a tree-level unroller.  It's impossible to make good
> > decisions about unroll factors that early.  But your second approach
> > sounds quite promising to me.
>
> I would be willing to soften this statement.  I think that an early
> unroller might well be a profitable approach for most systems with large
> caches and so forth, where if the unrolling heuristics are not completely
> accurate we are still likely to make a reasonably good decision.  However,
> I would expect to see ports with limited caches/memory to want more
> accurate control over unrolling decisions.  So I could see allowing ports
> to select between a GIMPLE unroller and an RTL unroller (I doubt anybody
> would want both).
>
> In general it seems like PowerPC could benefit from more aggressive
> unrolling much of the time, provided we can also solve the related IVOPTS
> problems that cause too much register spill.
>
> I may have an interest in working on a GIMPLE unroller, depending on how
> quickly I can complete or shed some other projects...

I think that a separate unrolling on GIMPLE would be a hard sell
due to the lack of a good cost model.  _But_ doing unrolling as part
of another transform like we are doing now makes sense.  So does
eventually moving parts of an RTL pass involving unrolling to
GIMPLE, like modulo scheduling or SMS (leaving the scheduling part
to RTL).

Note that the RTL unroller is not enabled by default at any optimization
level, and note that unfortunately the RTL unroller shares flags with the
GIMPLE-level complete peeling (where they mainly control cost modeling).
Oh, but it is enabled with -fprofile-use.

It's been a long time since I've done SPEC measuring with/without
-funroll-loops (or/and -fpeel-loops).  Note that these flags have
secondary effects as well:

toplev.c:flag_web = flag_unroll_loops || flag_peel_loops;
toplev.c:flag_rename_registers = flag_unroll_loops || flag_peel_loops;
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #56 from Bill Schmidt ---
(In reply to Bill Schmidt from comment #53)
> I'm not a fan of a tree-level unroller.  It's impossible to make good
> decisions about unroll factors that early.  But your second approach
> sounds quite promising to me.

I would be willing to soften this statement.  I think that an early unroller
might well be a profitable approach for most systems with large caches and
so forth, where if the unrolling heuristics are not completely accurate we
are still likely to make a reasonably good decision.  However, I would
expect to see ports with limited caches/memory to want more accurate control
over unrolling decisions.  So I could see allowing ports to select between a
GIMPLE unroller and an RTL unroller (I doubt anybody would want both).

In general it seems like PowerPC could benefit from more aggressive
unrolling much of the time, provided we can also solve the related IVOPTS
problems that cause too much register spill.

I may have an interest in working on a GIMPLE unroller, depending on how
quickly I can complete or shed some other projects...
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Jakub Jelinek changed:

           What             |Removed |Added
   -----------------------------------------
   Target Milestone         |4.9.3   |4.9.4
[Bug target/29256] [4.9/5/6 regression] loop performance regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #55 from Jakub Jelinek ---
GCC 4.9.3 has been released.