RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
Many thanks.  Here's the version that I've committed, with a ??? comment
as requested (and even a no-op else clause to make the logic easier to
understand).

2021-08-24  Roger Sayle
	    Richard Biener

gcc/ChangeLog
	* config/i386/i386-features.c (compute_convert_gain): Provide
	more accurate values for CONST_INT, when optimizing for size.
	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
	* config/i386/i386.h (COSTS_N_BYTES): to here.

Cheers,
Roger
--

-----Original Message-----
From: Richard Biener
Sent: 23 August 2021 14:47
To: Roger Sayle
Cc: GCC Patches
Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

On Fri, Aug 20, 2021 at 9:55 PM Roger Sayle wrote:
> Hi Richard,
>
> Benchmarking this patch using CSiBE on x86_64-pc-linux-gnu with -Os -m32
> saves 2432 bytes.  Of the 893 tests, 34 have size differences, 30 are
> improvements, 4 are regressions (of a few bytes).
>
> > Also I'm missing an 'else' - in the default case there's no
> > cost/benefit of using SSE vs. GPR regs?  For SSE it would be a
> > constant pool load.
>
> The code size regression I primarily wanted to tackle was the zero vs.
> non-zero case when dealing with immediate operands, which was the piece
> affected by my and Jakub's xor improvements.
>
> Alas, my first attempt to specify a non-zero gain in the default
> (doesn't fit in SImode) case increased the code size slightly.  The use
> of the constant pool complicates things, as the number of times the
> same value is used becomes an issue.  If the constant being loaded is
> unique, then clearly the increase in constant pool size should
> (ideally) be taken into account.  But if the same constant is used
> multiple times in a chain (or is already in the constant pool), the
> observed cost is much cheaper.  Empirically, a value of zero isn't a
> poor choice, so the decision on whether to use vector instructions is
> shifted to the gains from the operations being performed, rather than
> the loading of integer constants.  No doubt, like rtx_costs, these are
> free parameters that future generations will continue to tweak and
> refine.
>
> Given that this patch reduces code size with -Os, both with and without
> -m32, ok for mainline?

OK if you add a comment for the missing 'else'.

Thanks,
Richard.

> Thanks in advance,
> Roger
> --

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index d9c6652..5a99ea7 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -610,12 +610,40 @@ general_scalar_chain::compute_convert_gain ()
 
       case CONST_INT:
 	if (REG_P (dst))
-	  /* DImode can be immediate for TARGET_64BIT and SImode always.  */
-	  igain += m * COSTS_N_INSNS (1);
+	  {
+	    if (optimize_insn_for_size_p ())
+	      {
+		/* xor (2 bytes) vs. xorps (3 bytes).  */
+		if (src == const0_rtx)
+		  igain -= COSTS_N_BYTES (1);
+		/* movdi_internal vs. movv2di_internal.  */
+		/* => mov (5 bytes) vs. movaps (7 bytes).  */
+		else if (x86_64_immediate_operand (src, SImode))
+		  igain -= COSTS_N_BYTES (2);
+		else
+		  /* ??? Larger immediate constants are placed in the
+		     constant pool, where the size benefit/impact of
+		     STV conversion is affected by whether and how
+		     often each constant pool entry is shared/reused.
+		     The value below is empirically derived from the
+		     CSiBE benchmark (and the optimal value may drift
+		     over time).  */
+		  igain += COSTS_N_BYTES (0);
+	      }
+	    else
+	      {
+		/* DImode can be immediate for TARGET_64BIT
+		   and SImode always.  */
+		igain += m * COSTS_N_INSNS (1);
+		igain -= vector_const_cost (src);
+	      }
+	  }
 	else if (MEM_P (dst))
-	  igain += (m * ix86_cost->int_store[2]
-		    - ix86_cost->sse_store[sse_cost_idx]);
-	igain -= vector_const_cost (src);
+	  {
+	    igain += (m * ix86_cost->int_store[2]
+		      - ix86_cost->sse_store[sse_cost_idx]);
+	    igain -= vector_const_cost (src);
+	  }
 	break;
 
       default:
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4d4ab6a..5abf2a6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -19982,8 +19982,6 @@ ix86_division_cost (const struct processor_costs *cost,
   return cost->divide[MODE_INDEX (mode)];
 }
 
-#define COSTS_N_BYTES(N) ((N) * 2)
-
 /* Return cost of shift in MODE.
    If CONSTANT_OP1 is true, the op1 value is known and set in OP1_VAL.
    AND_IN_OP1 specify in op1 is result of and and SHIFT_AND_TRUNCATE
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 21fe51b..edbfcaf 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -88,6 +88,11 @@ struct stringop_algs
 } size [MAX_STRINGOP_ALGS];
 };
 
+/* Analog of COSTS_N_INSNS when optimizing for size.  */
+#ifndef COSTS_N_BYTES
+#define COSTS_N_BYTES(N) ((N) * 2)
+#endif
+
 /* Define the specific costs for a given cpu.  NB: hard_register is used
    by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
    hard register move costs by register allocator.  Relative costs of
Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
On Fri, Aug 20, 2021 at 9:55 PM Roger Sayle wrote:
>
> Hi Richard,
>
> Benchmarking this patch using CSiBE on x86_64-pc-linux-gnu with -Os -m32
> saves 2432 bytes.  Of the 893 tests, 34 have size differences, 30 are
> improvements, 4 are regressions (of a few bytes).
>
> > Also I'm missing an 'else' - in the default case there's no
> > cost/benefit of using SSE vs. GPR regs?  For SSE it would be a
> > constant pool load.
>
> The code size regression I primarily wanted to tackle was the zero vs.
> non-zero case when dealing with immediate operands, which was the piece
> affected by my and Jakub's xor improvements.
>
> Alas, my first attempt to specify a non-zero gain in the default
> (doesn't fit in SImode) case increased the code size slightly.  The use
> of the constant pool complicates things, as the number of times the
> same value is used becomes an issue.  If the constant being loaded is
> unique, then clearly the increase in constant pool size should
> (ideally) be taken into account.  But if the same constant is used
> multiple times in a chain (or is already in the constant pool), the
> observed cost is much cheaper.  Empirically, a value of zero isn't a
> poor choice, so the decision on whether to use vector instructions is
> shifted to the gains from the operations being performed, rather than
> the loading of integer constants.  No doubt, like rtx_costs, these are
> free parameters that future generations will continue to tweak and
> refine.
>
> Given that this patch reduces code size with -Os, both with and without
> -m32, ok for mainline?

OK if you add a comment for the missing 'else'.

Thanks,
Richard.

> Thanks in advance,
> Roger
> --
>
> -----Original Message-----
> From: Richard Biener
> Sent: 20 August 2021 08:29
> To: Roger Sayle
> Cc: GCC Patches
> Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
> On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle wrote:
> >
> > Doh!  ENOPATCH.
> >
> > -----Original Message-----
> > From: Roger Sayle
> > Sent: 19 August 2021 16:59
> > To: 'GCC Patches'
> > Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
> >
> > Back in June I briefly mentioned in one of my gcc-patches posts that a
> > change that should have always reduced code size would mysteriously
> > occasionally result in slightly larger code (according to CSiBE):
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
> >
> > Investigating further, the cause turns out to be that x86_64's
> > scalar-to-vector (stv) pass is relying on poor estimates of the size
> > costs/benefits.  This patch tweaks the backend's compute_convert_gain
> > method to provide slightly more accurate values when compiling with
> > -Os.  Compilation without -Os is (should be) unaffected.  And for
> > completeness, I'll mention that the stv pass is a net win for code
> > size, so it's much better to improve its heuristics than simply gate
> > the pass on !optimize_for_size.
> >
> > The net effect of this change is to save 1399 bytes on the CSiBE code
> > size benchmark when compiling with -Os.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> > and "make -k check" with no new failures.
> >
> > Ok for mainline?
>
> +	  /* xor (2 bytes) vs. xorps (3 bytes).  */
> +	  if (src == const0_rtx)
> +	    igain -= COSTS_N_BYTES (1);
> +	  /* movdi_internal vs. movv2di_internal.  */
> +	  /* => mov (5 bytes) vs. movaps (7 bytes).  */
> +	  else if (x86_64_immediate_operand (src, SImode))
> +	    igain -= COSTS_N_BYTES (2);
>
> doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the
> non-SSE cost should be times 'm'?  For const0_rtx we may eventually
> re-use the zero reg for the high part so that is eventually correct.
>
> Also I'm missing an 'else' - in the default case there's no cost/benefit
> of using SSE vs. GPR regs?  For SSE it would be a constant pool load.
>
> I also wonder, since I now see COSTS_N_BYTES for the first time (heh),
> whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling
> with COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS
> for the size part as well.
>
> That said, it looks like we're eventually mixing apples and oranges now
> or even previously?
>
> Thanks,
> Richard.
>
> > 2021-08-19  Roger Sayle
> >
> > gcc/ChangeLog
> > 	* config/i386/i386-features.c (compute_convert_gain): Provide
> > 	more accurate values for CONST_INT, when optimizing for size.
> > 	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
> > 	* config/i386/i386.h (COSTS_N_BYTES): to here.
> >
> > Roger
> > --
RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
Hi Richard,

Benchmarking this patch using CSiBE on x86_64-pc-linux-gnu with -Os -m32
saves 2432 bytes.  Of the 893 tests, 34 have size differences, 30 are
improvements, 4 are regressions (of a few bytes).

> Also I'm missing an 'else' - in the default case there's no cost/benefit
> of using SSE vs. GPR regs?  For SSE it would be a constant pool load.

The code size regression I primarily wanted to tackle was the zero vs.
non-zero case when dealing with immediate operands, which was the piece
affected by my and Jakub's xor improvements.

Alas, my first attempt to specify a non-zero gain in the default (doesn't
fit in SImode) case increased the code size slightly.  The use of the
constant pool complicates things, as the number of times the same value
is used becomes an issue.  If the constant being loaded is unique, then
clearly the increase in constant pool size should (ideally) be taken into
account.  But if the same constant is used multiple times in a chain (or
is already in the constant pool), the observed cost is much cheaper.
Empirically, a value of zero isn't a poor choice, so the decision on
whether to use vector instructions is shifted to the gains from the
operations being performed, rather than the loading of integer constants.
No doubt, like rtx_costs, these are free parameters that future
generations will continue to tweak and refine.

Given that this patch reduces code size with -Os, both with and without
-m32, ok for mainline?

Thanks in advance,
Roger
--

-----Original Message-----
From: Richard Biener
Sent: 20 August 2021 08:29
To: Roger Sayle
Cc: GCC Patches
Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle wrote:
>
> Doh!  ENOPATCH.
>
> -----Original Message-----
> From: Roger Sayle
> Sent: 19 August 2021 16:59
> To: 'GCC Patches'
> Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
> Back in June I briefly mentioned in one of my gcc-patches posts that a
> change that should have always reduced code size would mysteriously
> occasionally result in slightly larger code (according to CSiBE):
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
>
> Investigating further, the cause turns out to be that x86_64's
> scalar-to-vector (stv) pass is relying on poor estimates of the size
> costs/benefits.  This patch tweaks the backend's compute_convert_gain
> method to provide slightly more accurate values when compiling with -Os.
> Compilation without -Os is (should be) unaffected.  And for
> completeness, I'll mention that the stv pass is a net win for code
> size, so it's much better to improve its heuristics than simply gate
> the pass on !optimize_for_size.
>
> The net effect of this change is to save 1399 bytes on the CSiBE code
> size benchmark when compiling with -Os.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?

+	  /* xor (2 bytes) vs. xorps (3 bytes).  */
+	  if (src == const0_rtx)
+	    igain -= COSTS_N_BYTES (1);
+	  /* movdi_internal vs. movv2di_internal.  */
+	  /* => mov (5 bytes) vs. movaps (7 bytes).  */
+	  else if (x86_64_immediate_operand (src, SImode))
+	    igain -= COSTS_N_BYTES (2);

doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the
non-SSE cost should be times 'm'?  For const0_rtx we may eventually
re-use the zero reg for the high part so that is eventually correct.

Also I'm missing an 'else' - in the default case there's no cost/benefit
of using SSE vs. GPR regs?  For SSE it would be a constant pool load.

I also wonder, since I now see COSTS_N_BYTES for the first time (heh),
whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling with
COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS for
the size part as well.

That said, it looks like we're eventually mixing apples and oranges now
or even previously?

Thanks,
Richard.

> 2021-08-19  Roger Sayle
>
> gcc/ChangeLog
> 	* config/i386/i386-features.c (compute_convert_gain): Provide
> 	more accurate values for CONST_INT, when optimizing for size.
> 	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
> 	* config/i386/i386.h (COSTS_N_BYTES): to here.
>
> Roger
> --
RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
Hi Richard,

Fortunately, it's (moderately) safe to mix COSTS_N_INSNS and
COSTS_N_BYTES in the i386 backend.  The average length of an x86_64
instruction in typical code is between 2 and 3 bytes, so the definition
of N*4 for COSTS_N_INSNS(N) and N*2 for COSTS_N_BYTES(N) allows these to
be mixed, with no more approximation error than the more common problem
that rtx_costs usually encodes cycle counts (but converted to units of
COSTS_N_INSNS).

The thing that I like about STV's use of a "gain calculator" is that it
allows much more accurate fine-tuning.  Many passes, particularly
combine, base their decisions on obtaining total cost estimates (with
large approximate values) and then comparing those two totals.  As any
physicist will confirm, it's much better to parameterize on the observed
delta between those two large approximate values.

But you're also right that I need to run CSiBE with -m32 and get back to
you with the results.

Roger
--

-----Original Message-----
From: Richard Biener
Sent: 20 August 2021 08:29
To: Roger Sayle
Cc: GCC Patches
Subject: Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle wrote:
>
> Doh!  ENOPATCH.
>
> -----Original Message-----
> From: Roger Sayle
> Sent: 19 August 2021 16:59
> To: 'GCC Patches'
> Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
> Back in June I briefly mentioned in one of my gcc-patches posts that a
> change that should have always reduced code size would mysteriously
> occasionally result in slightly larger code (according to CSiBE):
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
>
> Investigating further, the cause turns out to be that x86_64's
> scalar-to-vector (stv) pass is relying on poor estimates of the size
> costs/benefits.  This patch tweaks the backend's compute_convert_gain
> method to provide slightly more accurate values when compiling with -Os.
> Compilation without -Os is (should be) unaffected.  And for
> completeness, I'll mention that the stv pass is a net win for code
> size, so it's much better to improve its heuristics than simply gate
> the pass on !optimize_for_size.
>
> The net effect of this change is to save 1399 bytes on the CSiBE code
> size benchmark when compiling with -Os.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?

+	  /* xor (2 bytes) vs. xorps (3 bytes).  */
+	  if (src == const0_rtx)
+	    igain -= COSTS_N_BYTES (1);
+	  /* movdi_internal vs. movv2di_internal.  */
+	  /* => mov (5 bytes) vs. movaps (7 bytes).  */
+	  else if (x86_64_immediate_operand (src, SImode))
+	    igain -= COSTS_N_BYTES (2);

doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the
non-SSE cost should be times 'm'?  For const0_rtx we may eventually
re-use the zero reg for the high part so that is eventually correct.

Also I'm missing an 'else' - in the default case there's no cost/benefit
of using SSE vs. GPR regs?  For SSE it would be a constant pool load.

I also wonder, since I now see COSTS_N_BYTES for the first time (heh),
whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling with
COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS for
the size part as well.

That said, it looks like we're eventually mixing apples and oranges now
or even previously?

Thanks,
Richard.

> 2021-08-19  Roger Sayle
>
> gcc/ChangeLog
> 	* config/i386/i386-features.c (compute_convert_gain): Provide
> 	more accurate values for CONST_INT, when optimizing for size.
> 	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
> 	* config/i386/i386.h (COSTS_N_BYTES): to here.
>
> Roger
> --
Re: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
On Thu, Aug 19, 2021 at 6:01 PM Roger Sayle wrote:
>
> Doh!  ENOPATCH.
>
> -----Original Message-----
> From: Roger Sayle
> Sent: 19 August 2021 16:59
> To: 'GCC Patches'
> Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
>
> Back in June I briefly mentioned in one of my gcc-patches posts that a
> change that should have always reduced code size would mysteriously
> occasionally result in slightly larger code (according to CSiBE):
> https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html
>
> Investigating further, the cause turns out to be that x86_64's
> scalar-to-vector (stv) pass is relying on poor estimates of the size
> costs/benefits.  This patch tweaks the backend's compute_convert_gain
> method to provide slightly more accurate values when compiling with -Os.
> Compilation without -Os is (should be) unaffected.  And for
> completeness, I'll mention that the stv pass is a net win for code size,
> so it's much better to improve its heuristics than simply gate the pass
> on !optimize_for_size.
>
> The net effect of this change is to save 1399 bytes on the CSiBE code
> size benchmark when compiling with -Os.
>
> This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
> and "make -k check" with no new failures.
>
> Ok for mainline?

+	  /* xor (2 bytes) vs. xorps (3 bytes).  */
+	  if (src == const0_rtx)
+	    igain -= COSTS_N_BYTES (1);
+	  /* movdi_internal vs. movv2di_internal.  */
+	  /* => mov (5 bytes) vs. movaps (7 bytes).  */
+	  else if (x86_64_immediate_operand (src, SImode))
+	    igain -= COSTS_N_BYTES (2);

doesn't it need two GPR xor for 32bit DImode and two mov?  Thus the
non-SSE cost should be times 'm'?  For const0_rtx we may eventually
re-use the zero reg for the high part so that is eventually correct.

Also I'm missing an 'else' - in the default case there's no cost/benefit
of using SSE vs. GPR regs?  For SSE it would be a constant pool load.

I also wonder, since I now see COSTS_N_BYTES for the first time (heh),
whether with -Os we'd need to replace all COSTS_N_INSNS (1) scaling with
COSTS_N_BYTES scaling?  OTOH costs_add_n_insns uses COSTS_N_INSNS for
the size part as well.

That said, it looks like we're eventually mixing apples and oranges now
or even previously?

Thanks,
Richard.

> 2021-08-19  Roger Sayle
>
> gcc/ChangeLog
> 	* config/i386/i386-features.c (compute_convert_gain): Provide
> 	more accurate values for CONST_INT, when optimizing for size.
> 	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
> 	* config/i386/i386.h (COSTS_N_BYTES): to here.
>
> Roger
> --
RE: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.
Doh!  ENOPATCH.

-----Original Message-----
From: Roger Sayle
Sent: 19 August 2021 16:59
To: 'GCC Patches'
Subject: [x86_64 PATCH] Tweak -Os costs for scalar-to-vector pass.

Back in June I briefly mentioned in one of my gcc-patches posts that a
change that should have always reduced code size would mysteriously
occasionally result in slightly larger code (according to CSiBE):
https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573233.html

Investigating further, the cause turns out to be that x86_64's
scalar-to-vector (stv) pass is relying on poor estimates of the size
costs/benefits.  This patch tweaks the backend's compute_convert_gain
method to provide slightly more accurate values when compiling with -Os.
Compilation without -Os is (should be) unaffected.  And for completeness,
I'll mention that the stv pass is a net win for code size, so it's much
better to improve its heuristics than simply gate the pass on
!optimize_for_size.

The net effect of this change is to save 1399 bytes on the CSiBE code
size benchmark when compiling with -Os.

This patch has been tested on x86_64-pc-linux-gnu with "make bootstrap"
and "make -k check" with no new failures.

Ok for mainline?

2021-08-19  Roger Sayle

gcc/ChangeLog
	* config/i386/i386-features.c (compute_convert_gain): Provide
	more accurate values for CONST_INT, when optimizing for size.
	* config/i386/i386.c (COSTS_N_BYTES): Move definition from here...
	* config/i386/i386.h (COSTS_N_BYTES): to here.

Roger
--

diff --git a/gcc/config/i386/i386-features.c b/gcc/config/i386/i386-features.c
index d9c6652..cdae3dc 100644
--- a/gcc/config/i386/i386-features.c
+++ b/gcc/config/i386/i386-features.c
@@ -610,12 +610,31 @@ general_scalar_chain::compute_convert_gain ()
 
       case CONST_INT:
 	if (REG_P (dst))
-	  /* DImode can be immediate for TARGET_64BIT and SImode always.  */
-	  igain += m * COSTS_N_INSNS (1);
+	  {
+	    if (optimize_insn_for_size_p ())
+	      {
+		/* xor (2 bytes) vs. xorps (3 bytes).  */
+		if (src == const0_rtx)
+		  igain -= COSTS_N_BYTES (1);
+		/* movdi_internal vs. movv2di_internal.  */
+		/* => mov (5 bytes) vs. movaps (7 bytes).  */
+		else if (x86_64_immediate_operand (src, SImode))
+		  igain -= COSTS_N_BYTES (2);
+	      }
+	    else
+	      {
+		/* DImode can be immediate for TARGET_64BIT
+		   and SImode always.  */
+		igain += m * COSTS_N_INSNS (1);
+		igain -= vector_const_cost (src);
+	      }
+	  }
 	else if (MEM_P (dst))
-	  igain += (m * ix86_cost->int_store[2]
-		    - ix86_cost->sse_store[sse_cost_idx]);
-	igain -= vector_const_cost (src);
+	  {
+	    igain += (m * ix86_cost->int_store[2]
+		      - ix86_cost->sse_store[sse_cost_idx]);
+	    igain -= vector_const_cost (src);
+	  }
 	break;
 
       default:
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 4d4ab6a..5abf2a6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -19982,8 +19982,6 @@ ix86_division_cost (const struct processor_costs *cost,
   return cost->divide[MODE_INDEX (mode)];
 }
 
-#define COSTS_N_BYTES(N) ((N) * 2)
-
 /* Return cost of shift in MODE.
    If CONSTANT_OP1 is true, the op1 value is known and set in OP1_VAL.
    AND_IN_OP1 specify in op1 is result of and and SHIFT_AND_TRUNCATE
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 21fe51b..edbfcaf 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -88,6 +88,11 @@ struct stringop_algs
 } size [MAX_STRINGOP_ALGS];
 };
 
+/* Analog of COSTS_N_INSNS when optimizing for size.  */
+#ifndef COSTS_N_BYTES
+#define COSTS_N_BYTES(N) ((N) * 2)
+#endif
+
 /* Define the specific costs for a given cpu.  NB: hard_register is used
    by TARGET_REGISTER_MOVE_COST and TARGET_MEMORY_MOVE_COST to compute
    hard register move costs by register allocator.  Relative costs of