[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-06-28 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #19 from Wilco  ---
Author: wilco
Date: Wed Jun 28 14:13:02 2017
New Revision: 249740

URL: https://gcc.gnu.org/viewcvs?rev=249740&root=gcc&view=rev
Log:
Improve Cortex-A53 shift bypass

The aarch_forward_to_shift_is_not_shifted_reg bypass always returns true
on AArch64 shifted instructions.  This causes the bypass to activate in
too many cases, resulting in slower execution on Cortex-A53 like reported
in PR79665.

This patch uses the arm_no_early_alu_shift_dep condition instead which
improves the example in PR79665 by ~7%.  Given it is no longer used,
remove aarch_forward_to_shift_is_not_shifted_reg.  Also remove an
unnecessary REG_P check.

gcc/
PR target/79665
* config/arm/aarch-common.c (arm_no_early_alu_shift_dep):
Remove redundant if.
(aarch_forward_to_shift_is_not_shifted_reg): Remove.
* config/arm/aarch-common-protos.h
(aarch_forward_to_shift_is_not_shifted_reg): Remove.
* config/arm/cortex-a53.md: Use arm_no_early_alu_shift_dep in bypass.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/arm/aarch-common-protos.h
trunk/gcc/config/arm/aarch-common.c
trunk/gcc/config/arm/cortex-a53.md

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-05-08 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #18 from tnfchris at gcc dot gnu.org ---
Author: tnfchris
Date: Mon May  8 09:45:46 2017
New Revision: 247734

URL: https://gcc.gnu.org/viewcvs?rev=247734&root=gcc&view=rev
Log:
2017-05-08  Tamar Christina  

PR middle-end/79665
* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.


Modified:
branches/gcc-7-branch/gcc/ChangeLog
branches/gcc-7-branch/gcc/expr.c

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-05-05 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #16)
> (In reply to wilco from comment #14)
> > (In reply to PeteVine from comment #13)
> > > Still, the 5% regression must have happened very recently. The fast gcc
> > > was built on 20170220 and the slow one yesterday, using the original
> > > patch. Once again, switching away from Cortex-A53 codegen restores the
> > > expected performance.
> > 
> > The issue is due to inefficient code generated for unsigned modulo:
> > 
> > umull   x0, w0, w4
> > umull   x1, w1, w4
> > lsr x0, x0, 32
> > lsr x1, x1, 32
> > lsr w0, w0, 6
> > lsr w1, w1, 6
> > 
> > It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> > manually remove the redundant shifts I get a 15% speedup. I'll have a look.
> 
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk. Although my patch
above hasn't gone in, I get a 15% speedup on Cortex-A53 with -mcpu=cortex-a53
and 8% with -mcpu=cortex-a72.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-04-27 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #16 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #14)
> (In reply to PeteVine from comment #13)
> > Still, the 5% regression must have happened very recently. The fast gcc was
> > built on 20170220 and the slow one yesterday, using the original patch. Once
> > again, switching away from Cortex-A53 codegen restores the expected
> > performance.
> 
> The issue is due to inefficient code generated for unsigned modulo:
> 
> umull   x0, w0, w4
> umull   x1, w1, w4
> lsr x0, x0, 32
> lsr x1, x1, 32
> lsr w0, w0, 6
> lsr w1, w1, 6
> 
> It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> manually remove the redundant shifts I get a 15% speedup. I'll have a look.

See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-04-27 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #15 from tnfchris at gcc dot gnu.org ---
Author: tnfchris
Date: Thu Apr 27 09:58:27 2017
New Revision: 247307

URL: https://gcc.gnu.org/viewcvs?rev=247307&root=gcc&view=rev
Log:
2017-04-26  Tamar Christina  

PR middle-end/79665
* expr.c (expand_expr_real_2): Move TRUNC_MOD_EXPR, FLOOR_MOD_EXPR,
CEIL_MOD_EXPR, ROUND_MOD_EXPR cases.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/expr.c

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-04-18 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

wilco at gcc dot gnu.org changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #13)
> Still, the 5% regression must have happened very recently. The fast gcc was
> built on 20170220 and the slow one yesterday, using the original patch. Once
> again, switching away from Cortex-A53 codegen restores the expected
> performance.

The issue is due to inefficient code generated for unsigned modulo:

umull   x0, w0, w4
umull   x1, w1, w4
lsr x0, x0, 32
lsr x1, x1, 32
lsr w0, w0, 6
lsr w1, w1, 6

It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
manually remove the redundant shifts I get a 15% speedup. I'll have a look.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-23 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #13 from PeteVine  ---
Still, the 5% regression must have happened very recently. The fast gcc was
built on 20170220 and the slow one yesterday, using the original patch. Once
again, switching away from Cortex-A53 codegen restores the expected
performance.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-23 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #12 from ktkachov at gcc dot gnu.org ---
Huh, never mind. That sdiv was there even before this change; it is unrelated
to it. I don't see how there could be a slowdown from this change.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-23 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #11 from ktkachov at gcc dot gnu.org ---
Looks like the sdiv comes from the % 300 expansion (an sdiv followed by a
multiply-subtract). Need to figure out why MOD expressions are affected by this
change.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-23 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #10 from Jakub Jelinek  ---
You can put a breakpoint at the new expr.c code and see what both the unsigned
and signed sequences are and see what seq_cost gcc computed.  If the costs
don't match the hw, then it can of course choose a worse sequence.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-23 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ktkachov at gcc dot gnu.org

--- Comment #9 from ktkachov at gcc dot gnu.org ---
Hmm, for -mcpu=cortex-a53 one of the divisions ends up not being expanded and
generates an sdiv with this patch, whereas before it used to be expanded.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Jakub Jelinek  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Jakub Jelinek  ---
Fixed.  If this makes -mcpu=cortex-a53 slower, then it doesn't have the right
cost function.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #7 from Jakub Jelinek  ---
Author: jakub
Date: Thu Feb 23 07:49:06 2017
New Revision: 245676

URL: https://gcc.gnu.org/viewcvs?rev=245676&root=gcc&view=rev
Log:
PR middle-end/79665
* internal-fn.c (get_range_pos_neg): Moved to ...
* tree.c (get_range_pos_neg): ... here.  No longer static.
* tree.h (get_range_pos_neg): New prototype.
* expr.c (expand_expr_real_2) : If both arguments
are known to be in between 0 and signed maximum inclusive, try to
expand both unsigned and signed divmod and use the cheaper one from
those.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/expr.c
trunk/gcc/internal-fn.c
trunk/gcc/tree.c
trunk/gcc/tree.h

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-22 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #6 from PeteVine  ---
But that's related to -mcpu=cortex-a53 again, so never mind I guess.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-22 Thread tulipawn at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

PeteVine  changed:

   What|Removed |Added

 CC||tulipawn at gmail dot com

--- Comment #5 from PeteVine  ---
Psst! GCC 7 was already 1.75x faster than Clang 3.8 on my aarch64 machine when
I benchmarked this code 3 weeks ago, but with this patch, it seems to take a 5%
hit.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-22 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Jakub Jelinek  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2017-02-22
   Assignee|unassigned at gcc dot gnu.org  |jakub at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #4 from Jakub Jelinek  ---
Created attachment 40811
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40811&action=edit
gcc7-pr79665.patch

Untested fix.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-21 Thread jistone at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

--- Comment #3 from Josh Stone  ---
I'm using just -O3, and then I compared effects with and without -fwrapv to
figure out what's going on.  Clang is only faster without -fwrapv.

With -march=native on my Sandy Bridge, a few instructions are placed in a
different order, but it's otherwise identical.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-21 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek  ---
What we can do IMHO is in expand_divmod, if we have division or modulo by a
constant and the first operand is known to have 0 MSB (from get_range_info or
get_nonzero_bits), then we can expand it as either signed or unsigned
division/modulo, regardless of what unsignedp is.  Then we could compare the
costs of both expansions and decide which is faster.

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

2017-02-21 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

Andrew Pinski  changed:

   What|Removed |Added

  Component|c   |middle-end

--- Comment #1 from Andrew Pinski  ---
What options are you using?

Did you try -march=native ?