Re: [PATCH] AArch64: Improve address rematerialization costs
Wilco Dijkstra writes:
> Hi Richard,
>
>> But even if the costs are too high, the patch seems to be overcompensating.
>> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.
>
> An LDR is not a replacement for ADRP+LDR, you need a store in addition to the
> original ADRP+LDR. Basically a simple spill would be comparing these 2
> sequences:
>
> ADRP x0, ...
> LDR x0, [x0, ...]
> STR x0, [SP, ...]
> ...
> LDR x0, [SP, ...]
>
> ADRP x0, ...
> LDR x0, [x0, ...]
> ...
> ADRP x0, ...
> LDR x0, [x0, ...]
>
> Obviously it's far cheaper to do the latter than the former.

Sure. Like I say, I'm not disagreeing with the intent of reducing
spilling and promoting rematerialisation. I agree we should do that.

I'm just disagreeing with the approach of using rtx_costs. The rtx_cost
hook isn't being asked the question: is spilling this better value than
rematerialising it? It's being asked for the cost of an operation, on the
understanding that that cost will be compared with the cost of other
operations. An ADRP+LDR operation then ought to be at least as costly
as an LDR, because in a two-way comparison, it is.

[…]

>> Maybe it would help to turn the question around for a minute. Can we
>> describe the cases in which it's *better* for the RA to spill a constant
>> address to the stack and reload it, rather than rematerialise on demand?
>
> Rematerialization is almost always better than spilling and reloading from
> the stack. If the constant requires multiple instructions and there are more
> than 2 references it would be better for codesize to spill, but for
> performance it is better to rematerialize unless there are many references.
>
> You also want to prefer rematerialization over spilling a different liferange
> when other aspects are comparable.

Yeah, that's what I thought the answer would be. So the question is:
why is the RA choosing to spill and reload rather than rematerialise
these values? Does it not know how to rematerialise them, and so we
rely on earlier passes not reusing the constants? Or does the RA know
how but decides it isn't worthwhile, because of the way that the RA uses
the target costs?

If the latter, I would be much happier with a new hook that allows
the target to force the RA to rematerialise a given value, if that's
the heuristic we want to use when optimising for speed.

Thanks,
Richard
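The distinction drawn in this message, between the cost of an operation and
the cost of a spill-versus-rematerialise decision, can be sketched in C. This
is only an illustration and not code from GCC: the struct, the function names
and the per-instruction costs are all hypothetical, and the real allocator's
accounting is more involved.

```c
/* Hypothetical per-instruction costs; these are NOT GCC's cost tables.  */
struct insn_costs
{
  int adrp;
  int ldr;
  int str;
};

/* Cost of the operation on its own, as an rtx_cost-style query sees it.
   ADRP+LDR can never be cheaper than a lone LDR here.  */
int
op_cost_adrp_ldr (const struct insn_costs *c)
{
  return c->adrp + c->ldr;
}

int
op_cost_ldr (const struct insn_costs *c)
{
  return c->ldr;
}

/* Extra work for one later use if the value was spilled:
   STR after the definition plus LDR from the stack at the use.  */
int
spill_extra_cost (const struct insn_costs *c)
{
  return c->str + c->ldr;
}

/* Extra work for one later use if the value is rematerialised:
   repeat the ADRP+LDR pair at the use.  */
int
remat_extra_cost (const struct insn_costs *c)
{
  return c->adrp + c->ldr;
}

/* The decision compares whole sequences, so it reduces to ADRP vs STR.
   Ties go to rematerialisation (an assumption of this sketch).  */
int
prefer_remat (const struct insn_costs *c)
{
  return remat_extra_cost (c) <= spill_extra_cost (c);
}
```

With any non-negative costs, op_cost_adrp_ldr stays at least as large as
op_cost_ldr, yet prefer_remat can still return 1, because the spill path pays
for a store that the single-operation comparison never sees.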
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi Richard,

> But even if the costs are too high, the patch seems to be overcompensating.
> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.

An LDR is not a replacement for ADRP+LDR, you need a store in addition to the
original ADRP+LDR. Basically a simple spill would be comparing these 2
sequences:

ADRP x0, ...
LDR x0, [x0, ...]
STR x0, [SP, ...]
...
LDR x0, [SP, ...]

ADRP x0, ...
LDR x0, [x0, ...]
...
ADRP x0, ...
LDR x0, [x0, ...]

Obviously it's far cheaper to do the latter than the former.

> Giving X zero cost means that a sequence like:
>
> (set (reg x0) X)
> (set (reg x1) X)
>
> should stay as-is rather than be changed to:
>
> (set (reg x0) X)
> (set (reg x1) (reg x0))
>
> I don't think we want that for multi-instruction constants when
> optimising for size.

I don't believe this is a real problem. The cost queries for address constants
come from register allocation, I don't see them affect other optimizations.

> Yeah, I wasn't suggesting that we increase the spill costs. I'm saying

I'm saying that because we've set the spill costs low on purpose to work around
register allocation bugs. There have been some fixes since, so increasing the
spill costs may now be feasible (but not trivial).

> that we should look at whether the target-independent RA heuristics need
> to change, whether new target hooks are needed, etc. We shouldn't go
> into this with the assumption that the target-independent code is
> invariant and that any fix must be in existing aarch64 hooks (rtx costs
> or spill costs).

But what bug do you think exists in target independent code? It behaves
correctly once we supply more accurate costs. If there was no rematerialization
irrespective of the cost settings then you could claim there was a bug.

> Maybe it would help to turn the question around for a minute. Can we
> describe the cases in which it's *better* for the RA to spill a constant
> address to the stack and reload it, rather than rematerialise on demand?

Rematerialization is almost always better than spilling and reloading from the
stack. If the constant requires multiple instructions and there are more than 2
references it would be better for codesize to spill, but for performance it is
better to rematerialize unless there are many references.

You also want to prefer rematerialization over spilling a different liferange
when other aspects are comparable.

Cheers,
Wilco
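The codesize rule of thumb in this message ("multiple instructions and more
than 2 references") can be checked with a small counting model. This is a
simplified sketch, not how the register allocator computes anything: it
assumes the first reference uses the value straight from the defining
instructions, that a spill adds one store, and that every later reference
needs one reload. The function names are hypothetical.

```c
/* Instructions emitted if the constant is recomputed at each of `refs`
   references, where one computation takes `insns_per_const` instructions.  */
int
remat_insns (int insns_per_const, int refs)
{
  return insns_per_const * refs;
}

/* Instructions emitted if the constant is computed once, stored to the
   stack once, and reloaded for every reference after the first.  */
int
spill_insns (int insns_per_const, int refs)
{
  return insns_per_const + 1 + (refs - 1);
}

/* Nonzero when spilling produces strictly fewer instructions.  */
int
spill_is_smaller (int insns_per_const, int refs)
{
  return spill_insns (insns_per_const, refs)
	 < remat_insns (insns_per_const, refs);
}
```

For a two-instruction constant such as ADRP+LDR, two references give 4
instructions either way (matching the two sequences quoted above), and
spilling only becomes smaller from the third reference on. For a
single-instruction constant the model never favours spilling.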
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi,

>> It's also said that chosen alternatives might be the reason that
>> rematerialization is not chosen and alternatives are chosen based on
>> reload heuristics, not based on actual costs.
>
> Thanks for the pointer. Yeah, it'd be interesting to know if this
> is the same issue, although I fear the only way of knowing for sure
> is to fix it first and see whether both targets benefit. ;-)

I don't believe this is the same issue - there are lots of register allocation
problems indeed, many are caused by the complex design. All the alternatives
and register classes create a huge crossproduct, making it almost impossible
to make good allocation decisions even if they were accurately costed.

I've found that the correct way to deal with this is to reduce all this choice
as much as possible. That means splitting instructions into simpler ones with
fewer alternatives and register classes. You also need to block it from
treating all register classes as equivalent - on AArch64 we had to force
floating point values to be allocated to floating point registers (which is
obviously how any register allocator should work by default), but maybe x86
doesn't do that yet.

Cheers,
Wilco
Re: [PATCH] AArch64: Improve address rematerialization costs
Richard Biener writes: > On Wed, May 11, 2022 at 2:23 PM Richard Sandiford via Gcc-patches > wrote: >> >> Wilco Dijkstra writes: >> > Hi Richard, >> > >> >> Yeah, I'm not disagreeing with any of that. It's just a question of >> >> whether the problem should be fixed by artificially lowering the general >> >> rtx costs with one particular user (RA spill costs) in mind, or whether >> >> it should be fixed by making the RA spill code take the factors above >> >> into account. >> > >> > The RA spill code already works fine on immediates but not on address >> > constants. And the reason is that the current rtx costs for addresses are >> > set artificially high without justification (I checked the patch that >> > increased >> > the costs but there was nothing explaining why it was beneficial). >> >> But even if the costs are too high, the patch seems to be overcompensating. >> It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR. >> >> Giving X zero cost means that a sequence like: >> >> (set (reg x0) X) >> (set (reg x1) X) >> >> should stay as-is rather than be changed to: >> >> (set (reg x0) X) >> (set (reg x1) (reg x0)) >> >> I don't think we want that for multi-instruction constants when >> optimising for size. >> >> > It's certainly possible to experiment with increasing the spill costs, but >> > that >> > won't improve the issue with address constants unless they are at least >> > doubled. >> > And it has the effect of halving all rtx costs in the register allocator >> > which is >> > likely to cause regressions. So we'd need to adjust many rtx costs to keep >> > the >> > allocator working, plus fix any further regressions this causes. >> >> Yeah, I wasn't suggesting that we increase the spill costs. I'm saying >> that we should look at whether the target-independent RA heuristics need >> to change, whether new target hooks are needed, etc. 
>> We shouldn't go into this with the assumption that the target-independent
>> code is invariant and that any fix must be in existing aarch64 hooks (rtx
>> costs or spill costs).
>>
>> Maybe it would help to turn the question around for a minute. Can we
>> describe the cases in which it's *better* for the RA to spill a constant
>> address to the stack and reload it, rather than rematerialise on demand?
>
> From the discussion in PR102178 it seems that LRA cannot rematerialize
> all "constants" (though here it is constant pool loads). Some constants
> might also not be 'constant'. See the PR for more fun "spilling" behavior
> on x86_64.
>
> It's also said that chosen alternatives might be the reason that
> rematerialization is not chosen and alternatives are chosen based on
> reload heuristics, not based on actual costs.

Thanks for the pointer. Yeah, it'd be interesting to know if this
is the same issue, although I fear the only way of knowing for sure
is to fix it first and see whether both targets benefit. ;-)

Richard
Re: [PATCH] AArch64: Improve address rematerialization costs
On Wed, May 11, 2022 at 2:23 PM Richard Sandiford via Gcc-patches wrote: > > Wilco Dijkstra writes: > > Hi Richard, > > > >> Yeah, I'm not disagreeing with any of that. It's just a question of > >> whether the problem should be fixed by artificially lowering the general > >> rtx costs with one particular user (RA spill costs) in mind, or whether > >> it should be fixed by making the RA spill code take the factors above > >> into account. > > > > The RA spill code already works fine on immediates but not on address > > constants. And the reason is that the current rtx costs for addresses are > > set artificially high without justification (I checked the patch that > > increased > > the costs but there was nothing explaining why it was beneficial). > > But even if the costs are too high, the patch seems to be overcompensating. > It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR. > > Giving X zero cost means that a sequence like: > > (set (reg x0) X) > (set (reg x1) X) > > should stay as-is rather than be changed to: > > (set (reg x0) X) > (set (reg x1) (reg x0)) > > I don't think we want that for multi-instruction constants when > optimising for size. > > > It's certainly possible to experiment with increasing the spill costs, but > > that > > won't improve the issue with address constants unless they are at least > > doubled. > > And it has the effect of halving all rtx costs in the register allocator > > which is > > likely to cause regressions. So we'd need to adjust many rtx costs to keep > > the > > allocator working, plus fix any further regressions this causes. > > Yeah, I wasn't suggesting that we increase the spill costs. I'm saying > that we should look at whether the target-independent RA heuristics need > to change, whether new target hooks are needed, etc. We shouldn't go > into this with the assumption that the target-independent code is > invariant and that any fix must be in existing aarch64 hooks (rtx costs > or spill costs). 
>
> Maybe it would help to turn the question around for a minute. Can we
> describe the cases in which it's *better* for the RA to spill a constant
> address to the stack and reload it, rather than rematerialise on demand?

From the discussion in PR102178 it seems that LRA cannot rematerialize
all "constants" (though here it is constant pool loads). Some constants
might also not be 'constant'. See the PR for more fun "spilling" behavior
on x86_64.

It's also said that chosen alternatives might be the reason that
rematerialization is not chosen and alternatives are chosen based on
reload heuristics, not based on actual costs.

Richard.

> Thanks,
> Richard
Re: [PATCH] AArch64: Improve address rematerialization costs
Wilco Dijkstra writes:
> Hi Richard,
>
>> Yeah, I'm not disagreeing with any of that. It's just a question of
>> whether the problem should be fixed by artificially lowering the general
>> rtx costs with one particular user (RA spill costs) in mind, or whether
>> it should be fixed by making the RA spill code take the factors above
>> into account.
>
> The RA spill code already works fine on immediates but not on address
> constants. And the reason is that the current rtx costs for addresses are
> set artificially high without justification (I checked the patch that
> increased the costs but there was nothing explaining why it was beneficial).

But even if the costs are too high, the patch seems to be overcompensating.
It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR.

Giving X zero cost means that a sequence like:

(set (reg x0) X)
(set (reg x1) X)

should stay as-is rather than be changed to:

(set (reg x0) X)
(set (reg x1) (reg x0))

I don't think we want that for multi-instruction constants when
optimising for size.

> It's certainly possible to experiment with increasing the spill costs, but
> that won't improve the issue with address constants unless they are at least
> doubled. And it has the effect of halving all rtx costs in the register
> allocator which is likely to cause regressions. So we'd need to adjust many
> rtx costs to keep the allocator working, plus fix any further regressions
> this causes.

Yeah, I wasn't suggesting that we increase the spill costs. I'm saying
that we should look at whether the target-independent RA heuristics need
to change, whether new target hooks are needed, etc. We shouldn't go
into this with the assumption that the target-independent code is
invariant and that any fix must be in existing aarch64 hooks (rtx costs
or spill costs).

Maybe it would help to turn the question around for a minute. Can we
describe the cases in which it's *better* for the RA to spill a constant
address to the stack and reload it, rather than rematerialise on demand?

Thanks,
Richard
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi Richard,

> Yeah, I'm not disagreeing with any of that. It's just a question of
> whether the problem should be fixed by artificially lowering the general
> rtx costs with one particular user (RA spill costs) in mind, or whether
> it should be fixed by making the RA spill code take the factors above
> into account.

The RA spill code already works fine on immediates but not on address
constants. And the reason is that the current rtx costs for addresses are
set artificially high without justification (I checked the patch that increased
the costs but there was nothing explaining why it was beneficial).

It's certainly possible to experiment with increasing the spill costs, but that
won't improve the issue with address constants unless they are at least doubled.
And it has the effect of halving all rtx costs in the register allocator which
is likely to cause regressions. So we'd need to adjust many rtx costs to keep
the allocator working, plus fix any further regressions this causes.

Cheers,
Wilco
Re: [PATCH] AArch64: Improve address rematerialization costs
Wilco Dijkstra writes:
> Hi Richard,
>
>>> There isn't really a better way of doing this within the existing costing
>>> code.
>>
>> Yeah, I was wondering whether we could change something there.
>> ADRP+LDR is logically more expensive than a single LDR, especially
>> when optimising for size, so I think it's reasonable for the rtx_costs
>> to say so. But that doesn't/shouldn't mean that spilling is better
>> (for either size or speed).
>>
>> So it feels like there's something missing in the way the costs are
>> being applied.
>
> Calculating accurate spill costs is hard. Spill optimization is done later, so
> until then you can't know the actual cost of a spill decision already made.
> Spills are also more expensive than you think due to store latency, more
> dirty cachelines etc. There is little benefit in lifting an ADRP to the start
> of a function but keeping ADD/LDR close to references. Basically ADRP/MOV are
> very cheap, so it's a waste to allocate these to long-lived registers.
>
> Given that there are significant codesize and performance improvements,
> it is clear that doing more rematerialization is better even in cases where it
> takes 2 instructions to recompute the address. Binaries show a significant
> reduction in stack-based loads and stores.

Yeah, I'm not disagreeing with any of that. It's just a question of
whether the problem should be fixed by artificially lowering the general
rtx costs with one particular user (RA spill costs) in mind, or whether
it should be fixed by making the RA spill code take the factors above
into account.

Thanks,
Richard
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi Richard,

>> There isn't really a better way of doing this within the existing costing
>> code.
>
> Yeah, I was wondering whether we could change something there.
> ADRP+LDR is logically more expensive than a single LDR, especially
> when optimising for size, so I think it's reasonable for the rtx_costs
> to say so. But that doesn't/shouldn't mean that spilling is better
> (for either size or speed).
>
> So it feels like there's something missing in the way the costs are
> being applied.

Calculating accurate spill costs is hard. Spill optimization is done later, so
until then you can't know the actual cost of a spill decision already made.
Spills are also more expensive than you think due to store latency, more
dirty cachelines etc. There is little benefit in lifting an ADRP to the start
of a function but keeping ADD/LDR close to references. Basically ADRP/MOV are
very cheap, so it's a waste to allocate these to long-lived registers.

Given that there are significant codesize and performance improvements,
it is clear that doing more rematerialization is better even in cases where it
takes 2 instructions to recompute the address. Binaries show a significant
reduction in stack-based loads and stores.

Cheers,
Wilco
Re: [PATCH] AArch64: Improve address rematerialization costs
Wilco Dijkstra via Gcc-patches writes:
> Hi Richard,
>
>> I'm not questioning the results, but I think we need to look in more
>> detail why rematerialisation requires such low costs. The point of
>> comparison should be against a spill and reload, so any constant
>> that is as cheap as a load should be rematerialised. If that isn't
>> happening then it sounds like changes are needed elsewhere.
>
> The simple answer is that rematerializable expressions must have a lower cost
> than the spill cost (potentially of something else), otherwise it will never
> happen. The previous costs were set way too high (eg. 12 for ADRP+LDR vs 4 for
> a reload). This patch basically ensures that is indeed the case. In principle
> a zero cost works fine for anything that can be rematerialized. However it may
> use more instructions than a spill (of something else), so a small non-zero
> cost avoids bloating codesize.
>
> There isn't really a better way of doing this within the existing costing
> code.

Yeah, I was wondering whether we could change something there.
ADRP+LDR is logically more expensive than a single LDR, especially
when optimising for size, so I think it's reasonable for the rtx_costs
to say so. But that doesn't/shouldn't mean that spilling is better
(for either size or speed).

So it feels like there's something missing in the way the costs are
being applied.

Thanks,
Richard

> We could try doubling or quadrupling the spill costs but that would create a
> lot of fallout since it affects everything.
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi Richard,

> I'm not questioning the results, but I think we need to look in more
> detail why rematerialisation requires such low costs. The point of
> comparison should be against a spill and reload, so any constant
> that is as cheap as a load should be rematerialised. If that isn't
> happening then it sounds like changes are needed elsewhere.

The simple answer is that rematerializable expressions must have a lower cost
than the spill cost (potentially of something else), otherwise it will never
happen. The previous costs were set way too high (eg. 12 for ADRP+LDR vs 4 for
a reload). This patch basically ensures that is indeed the case. In principle
a zero cost works fine for anything that can be rematerialized. However it may
use more instructions than a spill (of something else), so a small non-zero
cost avoids bloating codesize.

There isn't really a better way of doing this within the existing costing code.
We could try doubling or quadrupling the spill costs but that would create a
lot of fallout since it affects everything.

Cheers,
Wilco
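To make the numbers in this message concrete, the sketch below evaluates the
cost expressions from the posted patch against the reload cost of 4 quoted
above. It is a standalone illustration, not part of aarch64_rtx_costs: the
COSTS_N_INSNS macro is copied from GCC's rtl.h definition, while the enum
names and helper functions are hypothetical and exist only for this example.

```c
/* Same definition as in GCC's rtl.h: one simple instruction is 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

enum
{
  RELOAD_COST = 4,        /* "4 for a reload", from the message.  */
  OLD_ADRP_LDR_COST = 12  /* "12 for ADRP+LDR", from the message.  */
};

/* Patched SYMBOL_REF cost: 0, or half an instruction for the small-PIC
   GOT case (is_small_got stands in for the patch's cmodel/symbol test).  */
int
patched_symbol_ref_cost (int is_small_got)
{
  return is_small_got ? COSTS_N_INSNS (1) / 2 : 0;
}

/* Patched HIGH cost.  */
int
patched_high_cost (void)
{
  return 0;
}

/* Patched LO_SUM cost: three quarters of an instruction.  */
int
patched_lo_sum_cost (void)
{
  return COSTS_N_INSNS (3) / 4;
}

/* The condition described in the message: a rematerialisable expression
   must cost less than the spill/reload it competes with.  */
int
cheaper_than_reload (int cost)
{
  return cost < RELOAD_COST;
}
```

Under these definitions the patched values come out as 0, 2 and 3 units, all
below the reload cost of 4, whereas the previous 12 for ADRP+LDR is not.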
Re: [PATCH] AArch64: Improve address rematerialization costs
Wilco Dijkstra writes: > Improve rematerialization costs of addresses. The current costs are set too > high > which results in extra register pressure and spilling. Using lower costs > means > addresses will be rematerialized more often rather than being spilled or > causing > spills. This results in significant codesize reductions and performance > gains. > SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO. Codesize is > 0.12% > smaller. I'm not questioning the results, but I think we need to look in more detail why rematerialisation requires such low costs. The point of comparison should be against a spill and reload, so any constant that is as cheap as a load should be rematerialised. If that isn't happening then it sounds like changes are needed elsewhere. Thanks, Richard > Passes bootstrap and regress. OK for commit? > > ChangeLog: > 2021-06-01 Wilco Dijkstra > > * config/aarch64/aarch64.cc (aarch64_rtx_costs): Use better > rematerialization > costs for HIGH, LO_SUM and SYMREF. > > --- > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index > 43d87d1b9c4ef1a85094e51f81745f98f1ef27fb..7341849121ffd6b3b0b77c9730e74e751742e852 > 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -14529,45 +14529,28 @@ cost_plus: > return false; /* All arguments need to be in registers. */ > } > > +/* The following costs are used for rematerialization of addresses. > + Set a low cost for all global accesses - this ensures they are > + preferred for rematerialization, blocks them from being spilled > + and reduces register pressure. The result is significant codesize > + reductions and performance gains. */ > + > case SYMBOL_REF: > + *cost = 0; > > - if (aarch64_cmodel == AARCH64_CMODEL_LARGE > - || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC) > - { > - /* LDR. 
*/ > - if (speed) > - *cost += extra_cost->ldst.load; > - } > - else if (aarch64_cmodel == AARCH64_CMODEL_SMALL > - || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC) > - { > - /* ADRP, followed by ADD. */ > - *cost += COSTS_N_INSNS (1); > - if (speed) > - *cost += 2 * extra_cost->alu.arith; > - } > - else if (aarch64_cmodel == AARCH64_CMODEL_TINY > - || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC) > - { > - /* ADR. */ > - if (speed) > - *cost += extra_cost->alu.arith; > - } > + /* Use a separate remateralization cost for GOT accesses. */ > + if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC > + && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G) > + *cost = COSTS_N_INSNS (1) / 2; > > - if (flag_pic) > - { > - /* One extra load instruction, after accessing the GOT. */ > - *cost += COSTS_N_INSNS (1); > - if (speed) > - *cost += extra_cost->ldst.load; > - } >return true; > > case HIGH: > + *cost = 0; > + return true; > + > case LO_SUM: > - /* ADRP/ADD (immediate). */ > - if (speed) > - *cost += extra_cost->alu.arith; > + *cost = COSTS_N_INSNS (3) / 4; >return true; > > case ZERO_EXTRACT:
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi Richard, > Can you fold in the rtx costs part of the original GOT relaxation patch? Sure, see below for the updated version. > I don't think there's enough information here for me to be able to review > the patch though. I'll need to find testcases, look in detail at what > the rtl passes are doing, and try to work out whether (and why) this is > a good way of fixing things. Well today GCC does everything with costs rather than backend callbacks. I'd be interested in hearing about alternatives that have the same effect without a callback that allows a backend to decide between spilling and rematerialization. Cheers, Wilco v2: fold in GOT remat cost Improve rematerialization costs of addresses. The current costs are set too high which results in extra register pressure and spilling. Using lower costs means addresses will be rematerialized more often rather than being spilled or causing spills. This results in significant codesize reductions and performance gains. SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO. Codesize is 0.12% smaller. Passes bootstrap and regress. OK for commit? ChangeLog: 2021-06-01 Wilco Dijkstra * config/aarch64/aarch64.c (aarch64_rtx_costs): Use better rematerialization costs for HIGH, LO_SUM and SYMREF. --- diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 39de231d8ac6d10362cdd2b48eb9bd9de60c6703..a7f99ece55383168fb0f77e5c11c501d0bb2f013 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -13610,45 +13610,28 @@ cost_plus: return false; /* All arguments need to be in registers. */ } +/* The following costs are used for rematerialization of addresses. + Set a low cost for all global accesses - this ensures they are + preferred for rematerialization, blocks them from being spilled + and reduces register pressure. The result is significant codesize + reductions and performance gains. 
*/ + case SYMBOL_REF: - if (aarch64_cmodel == AARCH64_CMODEL_LARGE - || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC) - { - /* LDR. */ - if (speed) - *cost += extra_cost->ldst.load; - } - else if (aarch64_cmodel == AARCH64_CMODEL_SMALL - || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC) - { - /* ADRP, followed by ADD. */ - *cost += COSTS_N_INSNS (1); - if (speed) - *cost += 2 * extra_cost->alu.arith; - } - else if (aarch64_cmodel == AARCH64_CMODEL_TINY - || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC) - { - /* ADR. */ - if (speed) - *cost += extra_cost->alu.arith; - } + /* Use a separate remateralization cost for GOT accesses. */ + if (aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC + && aarch64_classify_symbol (x, 0) == SYMBOL_SMALL_GOT_4G) + *cost = COSTS_N_INSNS (1) / 2; - if (flag_pic) - { - /* One extra load instruction, after accessing the GOT. */ - *cost += COSTS_N_INSNS (1); - if (speed) - *cost += extra_cost->ldst.load; - } + *cost = 0; return true; case HIGH: + *cost = 0; + return true; + case LO_SUM: - /* ADRP/ADD (immediate). */ - if (speed) - *cost += extra_cost->alu.arith; + *cost = COSTS_N_INSNS (3) / 4; return true; case ZERO_EXTRACT:
Re: [PATCH] AArch64: Improve address rematerialization costs
Wilco Dijkstra writes: > ping Can you fold in the rtx costs part of the original GOT relaxation patch? I don't think there's enough information here for me to be able to review the patch though. I'll need to find testcases, look in detail at what the rtl passes are doing, and try to work out whether (and why) this is a good way of fixing things. I don't mind doing that, but I don't think I'll have time before stage 3. Thanks, Richard > > > From: Wilco Dijkstra > Sent: 02 June 2021 11:21 > To: GCC Patches > Cc: Kyrylo Tkachov ; Richard Sandiford > > Subject: [PATCH] AArch64: Improve address rematerialization costs > > Hi, > > Given the large improvements from better register allocation of GOT accesses, > I decided to generalize it to get large gains for normal addressing too: > > Improve rematerialization costs of addresses. The current costs are set too > high > which results in extra register pressure and spilling. Using lower costs > means > addresses will be rematerialized more often rather than being spilled or > causing > spills. This results in significant codesize reductions and performance > gains. > SPECINT2017 improves by 0.27% with LTO and 0.16% without LTO. Codesize is > 0.12% > smaller. > > Passes bootstrap and regress. OK for commit? > > ChangeLog: > 2021-06-01 Wilco Dijkstra > > * config/aarch64/aarch64.c (aarch64_rtx_costs): Use better > rematerialization > costs for HIGH, LO_SUM and SYMBOL_REF. > > --- > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index > 641c83b479e76cbcc75b299eb7ae5f634d9db7cd..08245827daa3f8199b29031e754244c078f0f500 > 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -13444,45 +13444,22 @@ cost_plus: >return false; /* All arguments need to be in registers. */ > } > > -case SYMBOL_REF: > +/* The following costs are used for rematerialization of addresses. 
> + Set a low cost for all global accesses - this ensures they are > + preferred for rematerialization, blocks them from being spilled > + and reduces register pressure. The result is significant codesize > + reductions and performance gains. */ > > - if (aarch64_cmodel == AARCH64_CMODEL_LARGE > - || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC) > - { > - /* LDR. */ > - if (speed) > - *cost += extra_cost->ldst.load; > - } > - else if (aarch64_cmodel == AARCH64_CMODEL_SMALL > - || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC) > - { > - /* ADRP, followed by ADD. */ > - *cost += COSTS_N_INSNS (1); > - if (speed) > - *cost += 2 * extra_cost->alu.arith; > - } > - else if (aarch64_cmodel == AARCH64_CMODEL_TINY > - || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC) > - { > - /* ADR. */ > - if (speed) > - *cost += extra_cost->alu.arith; > - } > - > - if (flag_pic) > - { > - /* One extra load instruction, after accessing the GOT. */ > - *cost += COSTS_N_INSNS (1); > - if (speed) > - *cost += extra_cost->ldst.load; > - } > +case SYMBOL_REF: > + *cost = 0; >return true; > > case HIGH: > + *cost = 0; > + return true; > + > case LO_SUM: > - /* ADRP/ADD (immediate). */ > - if (speed) > - *cost += extra_cost->alu.arith; > + *cost = COSTS_N_INSNS (3) / 4; >return true; > > case ZERO_EXTRACT:
Re: [PATCH] AArch64: Improve address rematerialization costs
ping

From: Wilco Dijkstra
Sent: 02 June 2021 11:21
To: GCC Patches
Cc: Kyrylo Tkachov; Richard Sandiford
Subject: [PATCH] AArch64: Improve address rematerialization costs

Hi,

Given the large improvements from better register allocation of GOT
accesses, I decided to generalize it to get large gains for normal
addressing too:

Improve rematerialization costs of addresses. The current costs are set
too high which results in extra register pressure and spilling. Using
lower costs means addresses will be rematerialized more often rather than
being spilled or causing spills. This results in significant codesize
reductions and performance gains. SPECINT2017 improves by 0.27% with LTO
and 0.16% without LTO. Codesize is 0.12% smaller.

Passes bootstrap and regress. OK for commit?

ChangeLog:
2021-06-01  Wilco Dijkstra

        * config/aarch64/aarch64.c (aarch64_rtx_costs): Use better
        rematerialization costs for HIGH, LO_SUM and SYMBOL_REF.

---

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 641c83b479e76cbcc75b299eb7ae5f634d9db7cd..08245827daa3f8199b29031e754244c078f0f500 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -13444,45 +13444,22 @@ cost_plus:
           return false; /* All arguments need to be in registers. */
         }
 
-    case SYMBOL_REF:
+    /* The following costs are used for rematerialization of addresses.
+       Set a low cost for all global accesses - this ensures they are
+       preferred for rematerialization, blocks them from being spilled
+       and reduces register pressure. The result is significant codesize
+       reductions and performance gains. */
 
-      if (aarch64_cmodel == AARCH64_CMODEL_LARGE
-          || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
-        {
-          /* LDR. */
-          if (speed)
-            *cost += extra_cost->ldst.load;
-        }
-      else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
-               || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
-        {
-          /* ADRP, followed by ADD. */
-          *cost += COSTS_N_INSNS (1);
-          if (speed)
-            *cost += 2 * extra_cost->alu.arith;
-        }
-      else if (aarch64_cmodel == AARCH64_CMODEL_TINY
-               || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
-        {
-          /* ADR. */
-          if (speed)
-            *cost += extra_cost->alu.arith;
-        }
-
-      if (flag_pic)
-        {
-          /* One extra load instruction, after accessing the GOT. */
-          *cost += COSTS_N_INSNS (1);
-          if (speed)
-            *cost += extra_cost->ldst.load;
-        }
+    case SYMBOL_REF:
+      *cost = 0;
       return true;
 
     case HIGH:
+      *cost = 0;
+      return true;
+
     case LO_SUM:
-      /* ADRP/ADD (immediate). */
-      if (speed)
-        *cost += extra_cost->alu.arith;
+      *cost = COSTS_N_INSNS (3) / 4;
       return true;
 
     case ZERO_EXTRACT:
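To make the size of the change concrete, here is a minimal standalone sketch of the old and new cost values. It is not GCC code: `struct extra_costs`, `enum code_model` and all function names are hypothetical stand-ins, the three code models are a simplification of the six `aarch64_cmodel` values tested in the removed lines, and only the `COSTS_N_INSNS` scaling (4 units per instruction) is taken from GCC itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same scaling as GCC's rtl.h: one simple instruction costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical stand-ins for two fields of the per-CPU extra_cost table.  */
struct extra_costs
{
  int alu_arith;
  int ldst_load;
};

/* Simplified: one enumerator per branch of the removed if/else chain.  */
enum code_model { MODEL_TINY, MODEL_SMALL, MODEL_LARGE };

/* Cost of a SYMBOL_REF as computed by the lines the patch removes.  */
static int
old_symbol_ref_cost (enum code_model model, bool pic, bool speed,
                     const struct extra_costs *extra)
{
  assert (extra != NULL);
  int cost = COSTS_N_INSNS (1);   /* Default set before the switch.  */

  switch (model)
    {
    case MODEL_LARGE:             /* LDR.  */
      if (speed)
        cost += extra->ldst_load;
      break;
    case MODEL_SMALL:             /* ADRP, followed by ADD.  */
      cost += COSTS_N_INSNS (1);
      if (speed)
        cost += 2 * extra->alu_arith;
      break;
    case MODEL_TINY:              /* ADR.  */
      if (speed)
        cost += extra->alu_arith;
      break;
    }

  if (pic)                        /* One extra load after accessing the GOT.  */
    {
      cost += COSTS_N_INSNS (1);
      if (speed)
        cost += extra->ldst_load;
    }
  return cost;
}

/* Costs assigned by the patch; they no longer depend on model, PIC or speed.  */
static int new_symbol_ref_cost (void) { return 0; }
static int new_high_cost (void) { return 0; }
static int new_lo_sum_cost (void) { return COSTS_N_INSNS (3) / 4; }
```

With size costing (`speed` false) the old small-model SYMBOL_REF cost is 8 units, or 12 with PIC, against 0 after the patch. The new LO_SUM cost is 3 units, i.e. less than the 4 units of a single instruction.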
Re: [PATCH] AArch64: Improve address rematerialization costs
Hi Richard,

> No. It's never correct to completely wipe out the existing cost - you
> don't know the context where this is being used.
>
> The most you can do is not add any additional cost.

Remember that aarch64_rtx_costs starts like this:

  /* By default, assume that everything has equivalent cost to the
     cheapest instruction. Any additional costs are applied as a delta
     above this default. */
  *cost = COSTS_N_INSNS (1);

This is literally the last statement executed before the big switch...
Given the cost is always initialized, there is no existing cost besides
this default value, and thus changing it to something else is not an
issue.

We could of course do something like:

  *cost -= COSTS_N_INSNS (1);

But that is less clear and problematic if the default value ever changes.

Cheers,
Wilco
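The three options under discussion can be compared side by side in a small sketch. This is illustrative only and not taken from aarch64.c: the function names are hypothetical, and each helper receives the value that `*cost` would hold on entry to the switch, so the effect of a different default can be tried out.

```c
#include <assert.h>

/* Same scaling as GCC's rtl.h: one simple instruction costs 4 units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Patch as posted: overwrite whatever the default was.  */
static int
cost_overwrite (int default_cost)
{
  (void) default_cost;
  return 0;
}

/* Alternative mentioned in the reply: subtract one instruction.  */
static int
cost_subtract (int default_cost)
{
  assert (default_cost >= COSTS_N_INSNS (1));
  return default_cost - COSTS_N_INSNS (1);
}

/* Reviewer's suggestion: add nothing on top of the default.  */
static int
cost_keep_default (int default_cost)
{
  return default_cost;
}
```

With the current default of `COSTS_N_INSNS (1)`, overwriting and subtracting both give 0, while keeping the default gives 4. If the default were ever raised to `COSTS_N_INSNS (2)`, the subtracting form would give 4 rather than 0, which is the fragility the reply refers to.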
Re: [PATCH] AArch64: Improve address rematerialization costs
On 02/06/2021 11:21, Wilco Dijkstra via Gcc-patches wrote:
> Hi,
>
> Given the large improvements from better register allocation of GOT
> accesses, I decided to generalize it to get large gains for normal
> addressing too:
>
> Improve rematerialization costs of addresses. The current costs are set
> too high which results in extra register pressure and spilling. Using
> lower costs means addresses will be rematerialized more often rather than
> being spilled or causing spills. This results in significant codesize
> reductions and performance gains. SPECINT2017 improves by 0.27% with LTO
> and 0.16% without LTO. Codesize is 0.12% smaller.
>
> Passes bootstrap and regress. OK for commit?
>
> ChangeLog:
> 2021-06-01  Wilco Dijkstra
>
>         * config/aarch64/aarch64.c (aarch64_rtx_costs): Use better
>         rematerialization costs for HIGH, LO_SUM and SYMBOL_REF.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 641c83b479e76cbcc75b299eb7ae5f634d9db7cd..08245827daa3f8199b29031e754244c078f0f500 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -13444,45 +13444,22 @@ cost_plus:
>            return false; /* All arguments need to be in registers. */
>          }
>
> -    case SYMBOL_REF:
> +    /* The following costs are used for rematerialization of addresses.
> +       Set a low cost for all global accesses - this ensures they are
> +       preferred for rematerialization, blocks them from being spilled
> +       and reduces register pressure. The result is significant codesize
> +       reductions and performance gains. */
>
> -      if (aarch64_cmodel == AARCH64_CMODEL_LARGE
> -          || aarch64_cmodel == AARCH64_CMODEL_SMALL_SPIC)
> -        {
> -          /* LDR. */
> -          if (speed)
> -            *cost += extra_cost->ldst.load;
> -        }
> -      else if (aarch64_cmodel == AARCH64_CMODEL_SMALL
> -               || aarch64_cmodel == AARCH64_CMODEL_SMALL_PIC)
> -        {
> -          /* ADRP, followed by ADD. */
> -          *cost += COSTS_N_INSNS (1);
> -          if (speed)
> -            *cost += 2 * extra_cost->alu.arith;
> -        }
> -      else if (aarch64_cmodel == AARCH64_CMODEL_TINY
> -               || aarch64_cmodel == AARCH64_CMODEL_TINY_PIC)
> -        {
> -          /* ADR. */
> -          if (speed)
> -            *cost += extra_cost->alu.arith;
> -        }
> -
> -      if (flag_pic)
> -        {
> -          /* One extra load instruction, after accessing the GOT. */
> -          *cost += COSTS_N_INSNS (1);
> -          if (speed)
> -            *cost += extra_cost->ldst.load;
> -        }
> +    case SYMBOL_REF:
> +      *cost = 0;
>        return true;

No. It's never correct to completely wipe out the existing cost - you
don't know the context where this is being used.

The most you can do is not add any additional cost.

Similarly for all the other cases.

>
>      case HIGH:
> +      *cost = 0;
> +      return true;
> +
>      case LO_SUM:
> -      /* ADRP/ADD (immediate). */
> -      if (speed)
> -        *cost += extra_cost->alu.arith;
> +      *cost = COSTS_N_INSNS (3) / 4;
>        return true;
>
>      case ZERO_EXTRACT: