[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #19 from Wilco ---
Author: wilco
Date: Wed Jun 28 14:13:02 2017
New Revision: 249740

URL: https://gcc.gnu.org/viewcvs?rev=249740&root=gcc&view=rev
Log:
Improve Cortex-A53 shift bypass

The aarch_forward_to_shift_is_not_shifted_reg bypass always returns true
on AArch64 shifted instructions.  This causes the bypass to activate in
too many cases, resulting in slower execution on Cortex-A53, as reported
in PR79665.  This patch uses the arm_no_early_alu_shift_dep condition
instead, which improves the example in PR79665 by ~7%.  Given it is no
longer used, remove aarch_forward_to_shift_is_not_shifted_reg.  Also
remove an unnecessary REG_P check.

    gcc/
	PR target/79665
	* config/arm/aarch-common.c (arm_no_early_alu_shift_dep):
	Remove redundant if.
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/aarch-common-protos.h
	(aarch_forward_to_shift_is_not_shifted_reg): Remove.
	* config/arm/cortex-a53.md: Use arm_no_early_alu_shift_dep in bypass.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/aarch-common-protos.h
    trunk/gcc/config/arm/aarch-common.c
    trunk/gcc/config/arm/cortex-a53.md
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #18 from tnfchris at gcc dot gnu.org ---
Author: tnfchris
Date: Mon May 8 09:45:46 2017
New Revision: 247734

URL: https://gcc.gnu.org/viewcvs?rev=247734&root=gcc&view=rev
Log:
2017-05-08  Tamar Christina

	PR middle-end/79665
	* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
	CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.

Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/expr.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #16)
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk. Although my patch
above hasn't gone in, I get a 15% speedup on Cortex-A53 with -mcpu=cortex-a53
and 8% with -mcpu=cortex-a72.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #16 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #14)
> The issue is due to inefficient code generated for unsigned modulo:
> [...]
> It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> manually remove the redundant shifts I get a 15% speedup. I'll have a look.

See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #15 from tnfchris at gcc dot gnu.org ---
Author: tnfchris
Date: Thu Apr 27 09:58:27 2017
New Revision: 247307

URL: https://gcc.gnu.org/viewcvs?rev=247307&root=gcc&view=rev
Log:
2017-04-26  Tamar Christina

	PR middle-end/79665
	* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
	CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

wilco at gcc dot gnu.org changed:

           What    |Removed    |Added
----------------------------------------------------------------
                 CC|           |wilco at gcc dot gnu.org

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #13)
> Still, the 5% regression must have happened very recently. The fast gcc was
> built on 20170220 and the slow one yesterday, using the original patch. Once
> again, switching away from Cortex-A53 codegen restores the expected
> performance.

The issue is due to inefficient code generated for unsigned modulo:

        umull   x0, w0, w4
        umull   x1, w1, w4
        lsr     x0, x0, 32
        lsr     x1, x1, 32
        lsr     w0, w0, 6
        lsr     w1, w1, 6

It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
manually remove the redundant shifts I get a 15% speedup. I'll have a look.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665 --- Comment #13 from PeteVine --- Still, the 5% regression must have happened very recently. The fast gcc was built on 20170220 and the slow one yesterday, using the original patch. Once again, switching away from Cortex-A53 codegen restores the expected performance.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #12 from ktkachov at gcc dot gnu.org ---
Huh, never mind. That sdiv was there even before this change; it is unrelated.
I don't see how there could be a slowdown from this change.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #11 from ktkachov at gcc dot gnu.org ---
Looks like the sdiv comes from the % 300 expansion (an sdiv followed by a
multiply-subtract). Need to figure out why MOD expressions are affected by
this change.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #10 from Jakub Jelinek ---
You can put a breakpoint at the new expr.c code, look at both the unsigned and
signed sequences, and see what seq_cost gcc computed for each. If the costs
don't match the hw, then it can of course choose a worse sequence.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

ktkachov at gcc dot gnu.org changed:

           What    |Removed    |Added
----------------------------------------------------------------
                 CC|           |ktkachov at gcc dot gnu.org

--- Comment #9 from ktkachov at gcc dot gnu.org ---
Hmm, for -mcpu=cortex-a53 one of the divisions ends up unexpanded and
generates an sdiv with this patch, whereas before it used to be expanded.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Jakub Jelinek changed:

           What    |Removed    |Added
----------------------------------------------------------------
             Status|ASSIGNED   |RESOLVED
         Resolution|---        |FIXED

--- Comment #8 from Jakub Jelinek ---
Fixed. If this makes -mcpu=cortex-a53 slower, then it doesn't have the right
cost function.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #7 from Jakub Jelinek ---
Author: jakub
Date: Thu Feb 23 07:49:06 2017
New Revision: 245676

URL: https://gcc.gnu.org/viewcvs?rev=245676&root=gcc&view=rev
Log:
	PR middle-end/79665
	* internal-fn.c (get_range_pos_neg): Moved to ...
	* tree.c (get_range_pos_neg): ... here.  No longer static.
	* tree.h (get_range_pos_neg): New prototype.
	* expr.c (expand_expr_real_2): If both arguments are known to be
	in between 0 and signed maximum inclusive, try to expand both
	unsigned and signed divmod and use the cheaper one from those.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/expr.c
    trunk/gcc/internal-fn.c
    trunk/gcc/tree.c
    trunk/gcc/tree.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665 --- Comment #6 from PeteVine --- But that's related to -mcpu=cortex-a53 again, so never mind I guess.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

PeteVine changed:

           What    |Removed    |Added
----------------------------------------------------------------
                 CC|           |tulipawn at gmail dot com

--- Comment #5 from PeteVine ---
Psst! GCC 7 was already 1.75x faster than Clang 3.8 on my aarch64 machine when
I benchmarked this code 3 weeks ago, but with this patch it seems to take a 5%
hit.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Jakub Jelinek changed:

           What    |Removed                       |Added
----------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2017-02-22
           Assignee|unassigned at gcc dot gnu.org |jakub at gcc dot gnu.org
     Ever confirmed|0                             |1

--- Comment #4 from Jakub Jelinek ---
Created attachment 40811
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40811&action=edit
gcc7-pr79665.patch

Untested fix.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665 --- Comment #3 from Josh Stone --- I'm using just -O3, and then I compared effects with and without -fwrapv to figure out what's going on. Clang is only faster without -fwrapv. With -march=native on my Sandy Bridge, a few instructions are placed in a different order, but it's otherwise identical.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Jakub Jelinek changed:

           What    |Removed    |Added
----------------------------------------------------------------
                 CC|           |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek ---
What we can do IMHO is in expand_divmod, if we have division or modulo by
constant and the first operand is known to have 0 MSB (from get_range_info or
get_nonzero_bits), then we can expand it as both signed or unsigned
division/modulo, regardless of what unsignedp is. Then we could compare costs
of both expansions and decide which is faster.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Andrew Pinski changed:

           What    |Removed    |Added
----------------------------------------------------------------
          Component|c          |middle-end

--- Comment #1 from Andrew Pinski ---
What options are you using? Did you try -march=native?