Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Jeff, On Thu, Sep 17, 2020 at 05:12:17PM -0600, Jeff Law wrote: > On 9/3/20 4:37 PM, Segher Boessenkool wrote: > >> Apart from that, one P9 specific point is that the update form load isn't > >> preferred, the reason is that the instruction can not retire until both > >> parts complete, it can hold up subsequent instructions from retiring. > >> If the addi stalls (starvation), the instruction can not retire and can > >> cause things stuck. It seems also something we can model here? > > This is (almost) no problem on p9, since we no longer have issue groups. > > It can hold up older insns from retiring, sure, but they *will* have > > finished, and p9 can retire 64 insns per cycle. The "completion wall" > > is gone. The only problem is if things stick around so long that > > resources run out... but you're talking 100s of insns there. > > So the PA8xxx had the same issue with its dual output insns -- the big > difference is the PA8xxx systems were considered retirement bandwidth > limited (2 memory and 2 non-memory per cycle, with just a 56 entry > reorder buffer, split between memory and non-memory ops). Holding a > slot in the reorder buffer was relatively costly. > > > If you can retire 64 ops per cycle and you've probably got an enormous > reorder buffer, so I wouldn't worry much about holding up insns from > retiring on your target. Power9 doesn't have a reorder buffer or anything similar at all -- it uses history buffers, so committing insns (pretty much what you call retiring) is essentially for free (restoring old register values after flushes now costs more though). Segher
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
On 9/3/20 4:37 PM, Segher Boessenkool wrote: >> Apart from that, one P9 specific point is that the update form load isn't >> preferred, the reason is that the instruction can not retire until both >> parts complete, it can hold up subsequent instructions from retiring. >> If the addi stalls (starvation), the instruction can not retire and can >> cause things stuck. It seems also something we can model here? > This is (almost) no problem on p9, since we no longer have issue groups. > It can hold up older insns from retiring, sure, but they *will* have > finished, and p9 can retire 64 insns per cycle. The "completion wall" > is gone. The only problem is if things stick around so long that > resources run out... but you're talking 100s of insns there. So the PA8xxx had the same issue with its dual output insns -- the big difference is the PA8xxx systems were considered retirement bandwidth limited (2 memory and 2 non-memory per cycle, with just a 56 entry reorder buffer, split between memory and non-memory ops). Holding a slot in the reorder buffer was relatively costly. If you can retire 64 ops per cycle and you've probably got an enormous reorder buffer, so I wouldn't worry much about holding up insns from retiring on your target. jeff pEpkey.asc Description: application/pgp-keys
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Hans, on 2020/9/6 上午10:47, Hans-Peter Nilsson wrote: > On Tue, 1 Sep 2020, Bin.Cheng via Gcc-patches wrote: >>> Great idea! With explicitly specified -funroll-loops, it's bootstrapped >>> but the regression testing did show one failure (the only one): >>> >>> PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1 >>> >>> It exposes two issues: >>> >>> 1) Currently address_cost hook on rs6000 always return zero, but at least >>> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we >>> have to take the address update into account (scalar normal operation). >>> Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer >>> ainc candidates. In this case, the cand/group cost is -4 (minus cost_step), >>> with scaling up, the off becomes much. With one simple hack on for pre_inc/ >>> pre_dec in rs6000 address_cost, the case passed. It should be handled in >>> one separated issue. >>> >>> 2) This case makes me think we should exclude ainc candidates in function >>> mark_reg_offset_candidates. The justification is that: ainc candidate >>> handles step update itself and when we calculate the cost for it against >>> its ainc_use, the cost_step has been reduced. When unrolling happens, >>> the ainc computations are replicated and it doesn't save step updates >>> like normal reg_offset_p candidates. >> Though auto-inc candidate embeds stepping operation into memory >> access, we might want to avoid it in case of unroll if there are many >> sequences of memory accesses, and if the unroll factor is big. The >> rationale is embedded stepping is a u-arch operation and does have its >> cost. > > Forgive me for barging in here (though the context is powerpc, > the dialogue and the patch seems to be generic ivopts), but > that's not a general remark I hope, about auto-inc (always) > having a cost? > > For some architectures, auto-inc *is* free, as free as > register-indirect, so the more auto-inc use, the better. All > this should be reflected by the address-cost, IMHO, and not > hardcoded into ivopts. > Yeah, now ivopts doesn't hardcode the cost for auto-inc (always), instead it allows targets to set its cost by themselves through address_cost hook. As the function get_address_cost_ainc, it checks auto-inc operations supported or not and set the cost as address_cost hook further. One example on Power is listed as below: Group 0: cand costcompl. inv.expr. inv.vars 1 4 1 NIL;1 3 0 0 NIL;NIL; 4 0 1 NIL;1 5 0 1 NIL;NIL; 130 1 NIL;NIL; 18-4 0 NIL;NIL; Cand 18 is one auto-inc candidate, whose group 0/cand cost is -4 (minus step_cost), the iv_cost of cand 18 is 5 (step_cost + non-original_iv cost), when it's selected, the step_cost parts counteract, the remaining cost (1) is for non-original iv, it shows it doesn't put any hardcoded cost to this ainc_cost candidate. I guess some misunderstanding was derived from some discussion above. Sorry if some of my previous comments misled you. BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
On Tue, 1 Sep 2020, Bin.Cheng via Gcc-patches wrote: > > Great idea! With explicitly specified -funroll-loops, it's bootstrapped > > but the regression testing did show one failure (the only one): > > > > PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1 > > > > It exposes two issues: > > > > 1) Currently address_cost hook on rs6000 always return zero, but at least > > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we > > have to take the address update into account (scalar normal operation). > > Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer > > ainc candidates. In this case, the cand/group cost is -4 (minus cost_step), > > with scaling up, the off becomes much. With one simple hack on for pre_inc/ > > pre_dec in rs6000 address_cost, the case passed. It should be handled in > > one separated issue. > > > > 2) This case makes me think we should exclude ainc candidates in function > > mark_reg_offset_candidates. The justification is that: ainc candidate > > handles step update itself and when we calculate the cost for it against > > its ainc_use, the cost_step has been reduced. When unrolling happens, > > the ainc computations are replicated and it doesn't save step updates > > like normal reg_offset_p candidates. > Though auto-inc candidate embeds stepping operation into memory > access, we might want to avoid it in case of unroll if there are many > sequences of memory accesses, and if the unroll factor is big. The > rationale is embedded stepping is a u-arch operation and does have its > cost. Forgive me for barging in here (though the context is powerpc, the dialogue and the patch seems to be generic ivopts), but that's not a general remark I hope, about auto-inc (always) having a cost? For some architectures, auto-inc *is* free, as free as register-indirect, so the more auto-inc use, the better. All this should be reflected by the address-cost, IMHO, and not hardcoded into ivopts. brgds, H-P
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Segher, on 2020/9/4 下午10:16, Segher Boessenkool wrote: > Hi! > > On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote: Apart from that, one P9 specific point is that the update form load isn't preferred, the reason is that the instruction can not retire until both parts complete, it can hold up subsequent instructions from retiring. If the addi stalls (starvation), the instruction can not retire and can cause things stuck. It seems also something we can model here? >>> >>> This is (almost) no problem on p9, since we no longer have issue groups. >>> It can hold up older insns from retiring, sure, but they *will* have >>> finished, and p9 can retire 64 insns per cycle. The "completion wall" >>> is gone. The only problem is if things stick around so long that >>> resources run out... but you're talking 100s of insns there. >> >> Theoretically it's fine, but the addi starvation was observed in the FP/SIMD >> instructions intensive loop code, which did cause some worse performance. :( > > "addi starvation" has nothing to do with addi (it also happens for other > insns), and nothing with update form memory insns either. What happens > is simply that no shorter latency insns are issued by the core so long > as longer latency insns (like most float insns) are available. So in > really nice floating point loops we execute the few integer add insns > much too late, much later than they were in the machine code, which then > makes the memory insns late as well, etc. > Yeah, the starvation issue isn't addi specific, but in the FP/SIMD insns intensive loop, "addi/add" is the major/all portion of the shorter latency insns in most time. So I'd argue that it's related. :) Since they are mainly for IV updates, memory insns depend on it, the FP/SIMD insns depend on the memory insns, ..., it can easily cause the stall chain reaction, I guess that's why some people call it as "addi starvation". As the example Bin gave in another email, more auto-inc candidates would have more iv updates (cracked ADDIs), if one/several common index iv can be shared among the memory insns (fewer ADDIs), we can reduce the number of shorter latency insns. As I know, some compiler did implement not to perfer auto-inc candidates, it can mitigate starvation issue in those FP/SIMD intensive loops to some extent. BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi! On Fri, Sep 04, 2020 at 04:47:37PM +0800, Kewen.Lin wrote: > >> Apart from that, one P9 specific point is that the update form load isn't > >> preferred, the reason is that the instruction can not retire until both > >> parts complete, it can hold up subsequent instructions from retiring. > >> If the addi stalls (starvation), the instruction can not retire and can > >> cause things stuck. It seems also something we can model here? > > > > This is (almost) no problem on p9, since we no longer have issue groups. > > It can hold up older insns from retiring, sure, but they *will* have > > finished, and p9 can retire 64 insns per cycle. The "completion wall" > > is gone. The only problem is if things stick around so long that > > resources run out... but you're talking 100s of insns there. > > Theoretically it's fine, but the addi starvation was observed in the FP/SIMD > instructions intensive loop code, which did cause some worse performance. :( "addi starvation" has nothing to do with addi (it also happens for other insns), and nothing with update form memory insns either. What happens is simply that no shorter latency insns are issued by the core so long as longer latency insns (like most float insns) are available. So in really nice floating point loops we execute the few integer add insns much too late, much later than they were in the machine code, which then makes the memory insns late as well, etc. Segher
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Bin, On Fri, Sep 04, 2020 at 04:27:32PM +0800, Bin.Cheng wrote: > On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool > wrote: > > It should have cost, certainly, but not address_cost I think. The total > > cost of an ldu should be a tiny bit less than that of ld + that of addi; > > the address_cost of ldu should be the same as that of ld. > Hi Segher, > In simple cases, yes, and it is also the (rough) idea of modeling > auto-inc addressing mode in ivopts, however, things are different if > loop gets complicated. The address_cost function is used for many other things, not just ivopts, so this shouldn't be done there. That is all :-) > Considering the case choosing 10 auto-inc > addressing_mode/candidate vs. [base_x + iv_index]. The latter only > needs one add instruction, while the former needs 10 embedded auto-inc > operations. Yeah. > Another issue is register pressure, choosing auto-inc candidates could > result in more IV, while choosing IV_index results in one IV (and more > Base pointers), however, spilling base pointer (which is loop > invariant) is usually cheaper than IV. > Another issue is auto-inc candidates probably lead to more bloated > setup code in the preheader BB, due to problems in expression > canonicalization, CSE, etc.. > > So it's not that easy to answer the question for complicated cases. > As for simple cases, the current model works fine with auto-inc > (somehow) preferred. Right, I wasn't saying that at all, sorry if I confused things. Thanks, Segher
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Segher, >> Good question! I agree that they can execute in parallel, but it depends >> on how we interprete the addressing cost, if it's for required execution >> resource, I think it's off, since comparing with ld, the ldu has two iops >> and extra ALU requirement. > > OTOH, if you do not use an ldu you need to use a real addi insn, which > gives you all the same cost (plus it takes more code space and decode etc. > resources). Agreed. > >> I'm not sure its usage elsewhere, but in the >> context of IVOPTs on Power, for one normal candidate, its step cost is 4, >> the cost for group (1) is zero, total cost is 4 for this combination. >> for the scenario like: >> ldx rx, iv // (1) >> ... >> iv = iv + step // (2) >> >> While for ainc_use candidate (like ldu), its step cost is 4, but the cost >> for group (1) is (-4 // minus step cost), total cost is 0. It looks to >> say the step update is free. > > That seems wrong, but the address_cost is used in more places, that is > not where to fix this? Good point, I had this question in mind too, it's used somewhere, one of them even uses one magic number, I planned to check all its usages once started to investigate it. But as your comment below, this hook looks not appropriate. > >> We can also see (1) and (2) can also execute in parallel (same iteration). >> If we consider the next iteration, it will have the dependency, but it's >> the same for ldu. So basically they are similar, I think it's unfair to >> have this difference in the cost modeling. The cracked addi should have >> its cost here. Does it make sense? > > It should have cost, certainly, but not address_cost I think. The total > cost of an ldu should be a tiny bit less than that of ld + that of addi; > the address_cost of ldu should be the same as that of ld. OK, I'll check whether there is some other way suitable for this in the context of IVOPTs. Good to see that we agree on the current modeling is a bit off on Power. :) > >> Apart from that, one P9 specific point is that the update form load isn't >> preferred, the reason is that the instruction can not retire until both >> parts complete, it can hold up subsequent instructions from retiring. >> If the addi stalls (starvation), the instruction can not retire and can >> cause things stuck. It seems also something we can model here? > > This is (almost) no problem on p9, since we no longer have issue groups. > It can hold up older insns from retiring, sure, but they *will* have > finished, and p9 can retire 64 insns per cycle. The "completion wall" > is gone. The only problem is if things stick around so long that > resources run out... but you're talking 100s of insns there. > Theoretically it's fine, but the addi starvation was observed in the FP/SIMD instructions intensive loop code, which did cause some worse performance. :( BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
On Fri, Sep 4, 2020 at 6:37 AM Segher Boessenkool wrote: > > On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote: > > on 2020/9/2 下午6:25, Segher Boessenkool wrote: > > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote: > > >> on 2020/9/1 上午3:41, Segher Boessenkool wrote: > > >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote: > > 1) Currently address_cost hook on rs6000 always return zero, but at > > least > > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we > > have to take the address update into account (scalar normal operation). > > >>> > > >>> From Power4 on already (not sure about Power6, but does anyone care?) > > >> > > >> Thanks for the information, it looks this issue exists for a long time. > > > > > > Well, *is* it an issue? The addressing doesn't get more expensive... > > > For example, an > > > ldu 3,16(4) > > > is cracked to an > > > ld 3,16(4) > > > and an > > > addi 4,4,16 > > > (the addi is not on the critical path of the load). So it seems to me > > > this shouldn't increase the addressing cost at all? (The instruction of > > > course is really two insns in one.) > > > > Good question! I agree that they can execute in parallel, but it depends > > on how we interprete the addressing cost, if it's for required execution > > resource, I think it's off, since comparing with ld, the ldu has two iops > > and extra ALU requirement. > > OTOH, if you do not use an ldu you need to use a real addi insn, which > gives you all the same cost (plus it takes more code space and decode etc. > resources). > > > I'm not sure its usage elsewhere, but in the > > context of IVOPTs on Power, for one normal candidate, its step cost is 4, > > the cost for group (1) is zero, total cost is 4 for this combination. > > for the scenario like: > > ldx rx, iv // (1) > > ... > > iv = iv + step // (2) > > > > While for ainc_use candidate (like ldu), its step cost is 4, but the cost > > for group (1) is (-4 // minus step cost), total cost is 0. It looks to > > say the step update is free. > > That seems wrong, but the address_cost is used in more places, that is > not where to fix this? > > > We can also see (1) and (2) can also execute in parallel (same iteration). > > If we consider the next iteration, it will have the dependency, but it's > > the same for ldu. So basically they are similar, I think it's unfair to > > have this difference in the cost modeling. The cracked addi should have > > its cost here. Does it make sense? > > It should have cost, certainly, but not address_cost I think. The total > cost of an ldu should be a tiny bit less than that of ld + that of addi; > the address_cost of ldu should be the same as that of ld. Hi Segher, In simple cases, yes, and it is also the (rough) idea of modeling auto-inc addressing mode in ivopts, however, things are different if loop gets complicated. Considering the case choosing 10 auto-inc addressing_mode/candidate vs. [base_x + iv_index]. The latter only needs one add instruction, while the former needs 10 embedded auto-inc operations. Another issue is register pressure, choosing auto-inc candidates could result in more IV, while choosing IV_index results in one IV (and more Base pointers), however, spilling base pointer (which is loop invariant) is usually cheaper than IV. Another issue is auto-inc candidates probably lead to more bloated setup code in the preheader BB, due to problems in expression canonicalization, CSE, etc.. So it's not that easy to answer the question for complicated cases. As for simple cases, the current model works fine with auto-inc (somehow) preferred. Thanks, bin > > > Apart from that, one P9 specific point is that the update form load isn't > > preferred, the reason is that the instruction can not retire until both > > parts complete, it can hold up subsequent instructions from retiring. > > If the addi stalls (starvation), the instruction can not retire and can > > cause things stuck. It seems also something we can model here? > > This is (almost) no problem on p9, since we no longer have issue groups. > It can hold up older insns from retiring, sure, but they *will* have > finished, and p9 can retire 64 insns per cycle. The "completion wall" > is gone. The only problem is if things stick around so long that > resources run out... but you're talking 100s of insns there. > > > Segher
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
On Thu, Sep 03, 2020 at 10:24:21AM +0800, Kewen.Lin wrote: > on 2020/9/2 下午6:25, Segher Boessenkool wrote: > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote: > >> on 2020/9/1 上午3:41, Segher Boessenkool wrote: > >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote: > 1) Currently address_cost hook on rs6000 always return zero, but at least > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we > have to take the address update into account (scalar normal operation). > >>> > >>> From Power4 on already (not sure about Power6, but does anyone care?) > >> > >> Thanks for the information, it looks this issue exists for a long time. > > > > Well, *is* it an issue? The addressing doesn't get more expensive... > > For example, an > > ldu 3,16(4) > > is cracked to an > > ld 3,16(4) > > and an > > addi 4,4,16 > > (the addi is not on the critical path of the load). So it seems to me > > this shouldn't increase the addressing cost at all? (The instruction of > > course is really two insns in one.) > > Good question! I agree that they can execute in parallel, but it depends > on how we interprete the addressing cost, if it's for required execution > resource, I think it's off, since comparing with ld, the ldu has two iops > and extra ALU requirement. OTOH, if you do not use an ldu you need to use a real addi insn, which gives you all the same cost (plus it takes more code space and decode etc. resources). > I'm not sure its usage elsewhere, but in the > context of IVOPTs on Power, for one normal candidate, its step cost is 4, > the cost for group (1) is zero, total cost is 4 for this combination. > for the scenario like: > ldx rx, iv // (1) > ... > iv = iv + step // (2) > > While for ainc_use candidate (like ldu), its step cost is 4, but the cost > for group (1) is (-4 // minus step cost), total cost is 0. It looks to > say the step update is free. That seems wrong, but the address_cost is used in more places, that is not where to fix this? > We can also see (1) and (2) can also execute in parallel (same iteration). > If we consider the next iteration, it will have the dependency, but it's > the same for ldu. So basically they are similar, I think it's unfair to > have this difference in the cost modeling. The cracked addi should have > its cost here. Does it make sense? It should have cost, certainly, but not address_cost I think. The total cost of an ldu should be a tiny bit less than that of ld + that of addi; the address_cost of ldu should be the same as that of ld. > Apart from that, one P9 specific point is that the update form load isn't > preferred, the reason is that the instruction can not retire until both > parts complete, it can hold up subsequent instructions from retiring. > If the addi stalls (starvation), the instruction can not retire and can > cause things stuck. It seems also something we can model here? This is (almost) no problem on p9, since we no longer have issue groups. It can hold up older insns from retiring, sure, but they *will* have finished, and p9 can retire 64 insns per cycle. The "completion wall" is gone. The only problem is if things stick around so long that resources run out... but you're talking 100s of insns there. Segher
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Segher, on 2020/9/2 下午6:25, Segher Boessenkool wrote: > Hi! > > On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote: >> on 2020/9/1 上午3:41, Segher Boessenkool wrote: >>> On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote: 1) Currently address_cost hook on rs6000 always return zero, but at least from Power7, pre_inc/pre_dec kind instructions are cracked, it means we have to take the address update into account (scalar normal operation). >>> >>> From Power4 on already (not sure about Power6, but does anyone care?) >> >> Thanks for the information, it looks this issue exists for a long time. > > Well, *is* it an issue? The addressing doesn't get more expensive... > For example, an > ldu 3,16(4) > is cracked to an > ld 3,16(4) > and an > addi 4,4,16 > (the addi is not on the critical path of the load). So it seems to me > this shouldn't increase the addressing cost at all? (The instruction of > course is really two insns in one.) > Good question! I agree that they can execute in parallel, but it depends on how we interprete the addressing cost, if it's for required execution resource, I think it's off, since comparing with ld, the ldu has two iops and extra ALU requirement. I'm not sure its usage elsewhere, but in the context of IVOPTs on Power, for one normal candidate, its step cost is 4, the cost for group (1) is zero, total cost is 4 for this combination. for the scenario like: ldx rx, iv // (1) ... iv = iv + step // (2) While for ainc_use candidate (like ldu), its step cost is 4, but the cost for group (1) is (-4 // minus step cost), total cost is 0. It looks to say the step update is free. We can also see (1) and (2) can also execute in parallel (same iteration). If we consider the next iteration, it will have the dependency, but it's the same for ldu. So basically they are similar, I think it's unfair to have this difference in the cost modeling. The cracked addi should have its cost here. Does it make sense? Apart from that, one P9 specific point is that the update form load isn't preferred, the reason is that the instruction can not retire until both parts complete, it can hold up subsequent instructions from retiring. If the addi stalls (starvation), the instruction can not retire and can cause things stuck. It seems also something we can model here? BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi! On Wed, Sep 02, 2020 at 11:16:00AM +0800, Kewen.Lin wrote: > on 2020/9/1 上午3:41, Segher Boessenkool wrote: > > On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote: > >> 1) Currently address_cost hook on rs6000 always return zero, but at least > >> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we > >> have to take the address update into account (scalar normal operation). > > > > From Power4 on already (not sure about Power6, but does anyone care?) > > Thanks for the information, it looks this issue exists for a long time. Well, *is* it an issue? The addressing doesn't get more expensive... For example, an ldu 3,16(4) is cracked to an ld 3,16(4) and an addi 4,4,16 (the addi is not on the critical path of the load). So it seems to me this shouldn't increase the addressing cost at all? (The instruction of course is really two insns in one.) Segher
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Bin, I've updated the patch to punt ainc_use candidates as below: > + /* Skip AINC candidate since it contains address update itself, > +the replicated AINC computations when unrolling still have > +updates, unlike reg_offset_p candidates can save step updates > +when unrolling. */ > + if (cand->ainc_use) > + continue; > + For this new attached patch, it's bootstrapped and regress-tested without explicit unrolling, while the only one failure has been identified as rs6000 specific address_cost issue in bootstrapping and regression testing with explicit unrolling. By the way, with above simple hack of address_cost, I also did one bootstrapping and regression testing with explicit unrolling, the above sms-4.c did pass as I expected but had two failures instead: PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors) PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2 By further investigation, the 2nd one is expected due to the adddress_cost hook hack, while the 1st one one exposed -fcompare-debug issue in sms. The RTL sequence starts to off from sms, just some NOTE_INSN_DELETED positions change. I believe it's just exposed by this combination unluckily/luckily ;-) I will send a patch separately for it once I got time to look into it, but it should be unrelated to this patch series for sure. >>> This is the kind of situation I intended to avoid before. IMHO, this >>> isn't a neat change (it can't be given we are predicting the future >>> transformation in compilation pipeline), accumulation of such changes >>> could make IVOPTs break in one way or another. So as long as you make >>> sure it doesn't have functional impact in case of no-rtl_unroll, I am >>> fine. >> >> Yeah, I admit it's not neat, but the proposals in the previous discussions >> without predicting unroll factor can not work well for all scenarios with >> different unroll factors, they could over-blame some kind of candidates. >> For the case of no-rtl_unroll, unroll factor estimation should set >> loop->estimated_unroll to zero, all these changes won't take effect. The Oops, one correction, should set it to *one* rather than zero. My memory... :( >> estimation function follows the same logics as that of RTL unroll factor >> calculation, I did test with explicit unrolling disablement before, it >> worked expectedly. > Thanks for working on this, also sorry for being nitpicking. Oh, you don't! Your constructive advices/comments make me consider/test more scenarios/things that I didn't realize before, it's really helpful to get this patch better, really appreciate that!!! BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
On Wed, Sep 2, 2020 at 11:50 AM Kewen.Lin wrote: > > Hi Bin, > > >> 2) This case makes me think we should exclude ainc candidates in function > >> mark_reg_offset_candidates. The justification is that: ainc candidate > >> handles step update itself and when we calculate the cost for it against > >> its ainc_use, the cost_step has been reduced. When unrolling happens, > >> the ainc computations are replicated and it doesn't save step updates > >> like normal reg_offset_p candidates. > > Though auto-inc candidate embeds stepping operation into memory > > access, we might want to avoid it in case of unroll if there are many > > sequences of memory accesses, and if the unroll factor is big. The > > rationale is embedded stepping is a u-arch operation and does have its > > cost. > > > > Thanks for the comments! Agree! Excluding them from reg_offset_p > candidates here is consistent with this expectation, it makes us > consider the unroll factor effect when checking the corresponding > step cost and the embedded stepping cost (in group/candidate cost, > minus step cost and use the cost from the address_cost hook). > > >> > >> I've updated the patch to punt ainc_use candidates as below: > >> > >>> + /* Skip AINC candidate since it contains address update itself, > >>> +the replicated AINC computations when unrolling still have > >>> +updates, unlike reg_offset_p candidates can save step updates > >>> +when unrolling. */ > >>> + if (cand->ainc_use) > >>> + continue; > >>> + > >> > >> For this new attached patch, it's bootstrapped and regress-tested without > >> explicit unrolling, while the only one failure has been identified as > >> rs6000 specific address_cost issue in bootstrapping and regression testing > >> with explicit unrolling. > >> > >> By the way, with above simple hack of address_cost, I also did one > >> bootstrapping and regression testing with explicit unrolling, the above > >> sms-4.c did pass as I expected but had two failures instead: > >> > >> PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors) > >> PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts > >> "PHI" 2 > >> > >> By further investigation, the 2nd one is expected due to the adddress_cost > >> hook > >> hack, while the 1st one one exposed -fcompare-debug issue in sms. The RTL > >> sequence starts to off from sms, just some NOTE_INSN_DELETED positions > >> change. > >> I believe it's just exposed by this combination unluckily/luckily ;-) I > >> will > >> send a patch separately for it once I got time to look into it, but it > >> should > >> be unrelated to this patch series for sure. > > This is the kind of situation I intended to avoid before. IMHO, this > > isn't a neat change (it can't be given we are predicting the future > > transformation in compilation pipeline), accumulation of such changes > > could make IVOPTs break in one way or another. So as long as you make > > sure it doesn't have functional impact in case of no-rtl_unroll, I am > > fine. > > Yeah, I admit it's not neat, but the proposals in the previous discussions > without predicting unroll factor can not work well for all scenarios with > different unroll factors, they could over-blame some kind of candidates. > For the case of no-rtl_unroll, unroll factor estimation should set > loop->estimated_unroll to zero, all these changes won't take effect. The > estimation function follows the same logics as that of RTL unroll factor > calculation, I did test with explicit unrolling disablement before, it > worked expectedly. Thanks for working on this, also sorry for being nitpicking. Thanks, bin
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Bin, >> 2) This case makes me think we should exclude ainc candidates in function >> mark_reg_offset_candidates. The justification is that: ainc candidate >> handles step update itself and when we calculate the cost for it against >> its ainc_use, the cost_step has been reduced. When unrolling happens, >> the ainc computations are replicated and it doesn't save step updates >> like normal reg_offset_p candidates. > Though auto-inc candidate embeds stepping operation into memory > access, we might want to avoid it in case of unroll if there are many > sequences of memory accesses, and if the unroll factor is big. The > rationale is embedded stepping is a u-arch operation and does have its > cost. > Thanks for the comments! Agree! Excluding them from reg_offset_p candidates here is consistent with this expectation, it makes us consider the unroll factor effect when checking the corresponding step cost and the embedded stepping cost (in group/candidate cost, minus step cost and use the cost from the address_cost hook). >> >> I've updated the patch to punt ainc_use candidates as below: >> >>> + /* Skip AINC candidate since it contains address update itself, >>> +the replicated AINC computations when unrolling still have >>> +updates, unlike reg_offset_p candidates can save step updates >>> +when unrolling. */ >>> + if (cand->ainc_use) >>> + continue; >>> + >> >> For this new attached patch, it's bootstrapped and regress-tested without >> explicit unrolling, while the only one failure has been identified as >> rs6000 specific address_cost issue in bootstrapping and regression testing >> with explicit unrolling. >> >> By the way, with above simple hack of address_cost, I also did one >> bootstrapping and regression testing with explicit unrolling, the above >> sms-4.c did pass as I expected but had two failures instead: >> >> PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors) >> PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2 >> >> By further investigation, the 2nd one is expected due to the adddress_cost >> hook >> hack, while the 1st one one exposed -fcompare-debug issue in sms. The RTL >> sequence starts to off from sms, just some NOTE_INSN_DELETED positions >> change. >> I believe it's just exposed by this combination unluckily/luckily ;-) I will >> send a patch separately for it once I got time to look into it, but it should >> be unrelated to this patch series for sure. > This is the kind of situation I intended to avoid before. IMHO, this > isn't a neat change (it can't be given we are predicting the future > transformation in compilation pipeline), accumulation of such changes > could make IVOPTs break in one way or another. So as long as you make > sure it doesn't have functional impact in case of no-rtl_unroll, I am > fine. Yeah, I admit it's not neat, but the proposals in the previous discussions without predicting unroll factor can not work well for all scenarios with different unroll factors, they could over-blame some kind of candidates. For the case of no-rtl_unroll, unroll factor estimation should set loop->estimated_unroll to zero, all these changes won't take effect. The estimation function follows the same logics as that of RTL unroll factor calculation, I did test with explicit unrolling disablement before, it worked expectedly. BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Segher, on 2020/9/1 上午3:41, Segher Boessenkool wrote: > Hi! > > Just a note: > > On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote: >> 1) Currently address_cost hook on rs6000 always return zero, but at least >> from Power7, pre_inc/pre_dec kind instructions are cracked, it means we >> have to take the address update into account (scalar normal operation). > > From Power4 on already (not sure about Power6, but does anyone care?) > Thanks for the information, it looks this issue exists for a long time. BR, Kewen
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
On Tue, Aug 25, 2020 at 8:47 PM Kewen.Lin wrote: > > Hi Bin, > > >> > >> For one particular case like: > >> > >> for (i = 0; i < SIZE; i++) > >> y[i] = a * x[i] + z[i]; > >> > >> we will mark reg_offset_p for IV candidates on x as below: > >>- (unsigned long) (x_18(D) + 8)// only mark this before. > >>- x_18(D) + 8 > >>- (unsigned long) (x_18(D) + 24) > >>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + > >> 18446744073709551600) > >>... > >> > >> Do you mind to have a review again? Thanks in advance! > > I trust you with the change. > > Thanks again! Sorry for the late since it took some time to investigate > the exposed issues. > > >> > >> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9. > >> > >> SPEC2017 P9 performance run has no remarkable degradations/improvements. > > Is this run with unroll-loops? > > Yes, the options what I used were: > >-Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops > > I also re-tested the newly attached patch, nothing changes for SPEC2017 data. > > > Could you exercise the code with unroll-loops enabled when > > bootstrap/regtest please? It doesn't matter if cases fail with > > unroll-loops, just making sure there is no fallout. Otherwise it's > > fine with me. > > Great idea! With explicitly specified -funroll-loops, it's bootstrapped > but the regression testing did show one failure (the only one): > > PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1 > > It exposes two issues: > > 1) Currently address_cost hook on rs6000 always return zero, but at least > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we > have to take the address update into account (scalar normal operation). > Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer > ainc candidates. In this case, the cand/group cost is -4 (minus cost_step), > with scaling up, the off becomes much. With one simple hack on for pre_inc/ > pre_dec in rs6000 address_cost, the case passed. It should be handled in > one separated issue. > > 2) This case makes me think we should exclude ainc candidates in function > mark_reg_offset_candidates. The justification is that: ainc candidate > handles step update itself and when we calculate the cost for it against > its ainc_use, the cost_step has been reduced. When unrolling happens, > the ainc computations are replicated and it doesn't save step updates > like normal reg_offset_p candidates. Though auto-inc candidate embeds stepping operation into memory access, we might want to avoid it in case of unroll if there are many sequences of memory accesses, and if the unroll factor is big. The rationale is embedded stepping is a u-arch operation and does have its cost. > > I've updated the patch to punt ainc_use candidates as below: > > > + /* Skip AINC candidate since it contains address update itself, > > +the replicated AINC computations when unrolling still have > > +updates, unlike reg_offset_p candidates can save step updates > > +when unrolling. */ > > + if (cand->ainc_use) > > + continue; > > + > > For this new attached patch, it's bootstrapped and regress-tested without > explicit unrolling, while the only one failure has been identified as > rs6000 specific address_cost issue in bootstrapping and regression testing > with explicit unrolling. > > By the way, with above simple hack of address_cost, I also did one > bootstrapping and regression testing with explicit unrolling, the above > sms-4.c did pass as I expected but had two failures instead: > > PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors) > PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2 > > By further investigation, the 2nd one is expected due to the adddress_cost > hook > hack, while the 1st one one exposed -fcompare-debug issue in sms. The RTL > sequence starts to off from sms, just some NOTE_INSN_DELETED positions change. > I believe it's just exposed by this combination unluckily/luckily ;-) I will > send a patch separately for it once I got time to look into it, but it should > be unrelated to this patch series for sure. This is the kind of situation I intended to avoid before. IMHO, this isn't a neat change (it can't be given we are predicting the future transformation in compilation pipeline), accumulation of such changes could make IVOPTs break in one way or another. So as long as you make sure it doesn't have functional impact in case of no-rtl_unroll, I am fine. > > In a word, bootstrapping/regress-testing with unroll-loops enabled shows this > patch looks fine. > > BR, > Kewen > - > gcc/ChangeLog: > > * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p. > (struct ivopts_data): New field consider_reg_offset_for_unroll_p. > (mark_reg_offset_candidates): New function. > (add_candidate_1): Set reg
Re: [PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi! Just a note: On Tue, Aug 25, 2020 at 08:46:55PM +0800, Kewen.Lin wrote: > 1) Currently address_cost hook on rs6000 always return zero, but at least > from Power7, pre_inc/pre_dec kind instructions are cracked, it means we > have to take the address update into account (scalar normal operation). >From Power4 on already (not sure about Power6, but does anyone care?) Segher
[PATCH 3/4 v3] ivopts: Consider cost_step on different forms during unrolling
Hi Bin, >> >> For one particular case like: >> >> for (i = 0; i < SIZE; i++) >> y[i] = a * x[i] + z[i]; >> >> we will mark reg_offset_p for IV candidates on x as below: >>- (unsigned long) (x_18(D) + 8)// only mark this before. >>- x_18(D) + 8 >>- (unsigned long) (x_18(D) + 24) >>- (unsigned long) ((vector(2) double *) (x_18(D) + 8) + >> 18446744073709551600) >>... >> >> Do you mind to have a review again? Thanks in advance! > I trust you with the change. Thanks again! Sorry for the late since it took some time to investigate the exposed issues. >> >> Bootstrapped/regtested on powerpc64le-linux-gnu P8 and P9. >> >> SPEC2017 P9 performance run has no remarkable degradations/improvements. > Is this run with unroll-loops? Yes, the options what I used were: -Ofast -mcpu=power9 -fpeel-loops -mrecip -funroll-loops I also re-tested the newly attached patch, nothing changes for SPEC2017 data. > Could you exercise the code with unroll-loops enabled when > bootstrap/regtest please? It doesn't matter if cases fail with > unroll-loops, just making sure there is no fallout. Otherwise it's > fine with me. Great idea! With explicitly specified -funroll-loops, it's bootstrapped but the regression testing did show one failure (the only one): PASS->FAIL: gcc.dg/sms-4.c scan-rtl-dump-times sms "SMS succeeded" 1 It exposes two issues: 1) Currently address_cost hook on rs6000 always return zero, but at least from Power7, pre_inc/pre_dec kind instructions are cracked, it means we have to take the address update into account (scalar normal operation). Since IVOPTs reduces the cost_step for ainc candidates, it makes us prefer ainc candidates. In this case, the cand/group cost is -4 (minus cost_step), with scaling up, the off becomes much. With one simple hack on for pre_inc/ pre_dec in rs6000 address_cost, the case passed. It should be handled in one separated issue. 2) This case makes me think we should exclude ainc candidates in function mark_reg_offset_candidates. The justification is that: ainc candidate handles step update itself and when we calculate the cost for it against its ainc_use, the cost_step has been reduced. When unrolling happens, the ainc computations are replicated and it doesn't save step updates like normal reg_offset_p candidates. I've updated the patch to punt ainc_use candidates as below: > + /* Skip AINC candidate since it contains address update itself, > +the replicated AINC computations when unrolling still have > +updates, unlike reg_offset_p candidates can save step updates > +when unrolling. */ > + if (cand->ainc_use) > + continue; > + For this new attached patch, it's bootstrapped and regress-tested without explicit unrolling, while the only one failure has been identified as rs6000 specific address_cost issue in bootstrapping and regression testing with explicit unrolling. By the way, with above simple hack of address_cost, I also did one bootstrapping and regression testing with explicit unrolling, the above sms-4.c did pass as I expected but had two failures instead: PASS->FAIL: gcc.dg/sms-compare-debug-1.c (test for excess errors) PASS->FAIL: gcc.dg/tree-ssa/ivopts-lt.c scan-tree-dump-times ivopts "PHI" 2 By further investigation, the 2nd one is expected due to the adddress_cost hook hack, while the 1st one one exposed -fcompare-debug issue in sms. The RTL sequence starts to off from sms, just some NOTE_INSN_DELETED positions change. I believe it's just exposed by this combination unluckily/luckily ;-) I will send a patch separately for it once I got time to look into it, but it should be unrelated to this patch series for sure. In a word, bootstrapping/regress-testing with unroll-loops enabled shows this patch looks fine. BR, Kewen - gcc/ChangeLog: * tree-ssa-loop-ivopts.c (struct iv_cand): New field reg_offset_p. (struct ivopts_data): New field consider_reg_offset_for_unroll_p. (mark_reg_offset_candidates): New function. (add_candidate_1): Set reg_offset_p to false for new candidate. (set_group_iv_cost): Scale up group cost with estimate_unroll_factor if consider_reg_offset_for_unroll_p. (determine_iv_cost): Increase step cost with estimate_unroll_factor if consider_reg_offset_for_unroll_p. (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update consider_reg_offset_for_unroll_p. diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c index 1d2697ae1ba..4b58b620ddd 100644 --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -473,6 +473,9 @@ struct iv_cand struct iv *orig_iv; /* The original iv if this cand is added from biv with smaller type. */ bool doloop_p; /* Whether this is a doloop candidate. */ + bool reg_offset_p; /* Whether this is available for an address type group