[Bug middle-end/50481] builtin to reverse the bit order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- Yes it would be good to have a generic builtin, this issue keeps coming up: https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01187.html
[Bug target/84521] aarch64: Frame-pointer corruption with __builtin_setjmp/__builtin_longjmp and -fomit-frame-pointer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84521 --- Comment #29 from Wilco --- Author: wilco Date: Wed Jun 19 12:52:43 2019 New Revision: 272473 URL: https://gcc.gnu.org/viewcvs?rev=272473&root=gcc&view=rev

Log: Simplify setjmp and non-local goto implementation (PR84521)

This fixes and simplifies the setjmp and non-local goto implementation. Currently the virtual frame pointer is saved when using __builtin_setjmp or a non-local goto. Depending on whether a frame pointer is used, this may save either SP or FP with an immediate offset. However the goto or longjmp always updates the hard frame pointer.

A receiver veneer in the original function then assigns the hard frame pointer to the virtual frame pointer, which should, if it works correctly, again assign SP or FP. However the special elimination code in eliminate_regs_in_insn doesn't do this correctly unless the frame pointer is used, and even if it worked by writing SP, the frame pointer would still be corrupted.

A much simpler implementation is to always save and restore the hard frame pointer. This avoids two redundant instructions which add/subtract the virtual frame offset. A large amount of code can be removed as a result, including all implementations of TARGET_BUILTIN_SETJMP_FRAME_VALUE (all of which already use the hard frame pointer). The expansion of nonlocal_goto on PA can be simplified to just restore the hard frame pointer.

This fixes the most obvious issues, however there are still issues on targets which define HARD_FRAME_POINTER_IS_FRAME_POINTER (arm, mips). Each function could have a different hard frame pointer, so a non-local goto may restore the wrong frame pointer (TARGET_BUILTIN_SETJMP_FRAME_VALUE could be useful for this).

The i386 TARGET_BUILTIN_SETJMP_FRAME_VALUE was incorrect: if stack_realign_fp is true, it would save the hard frame pointer value but restore the virtual frame pointer, which according to ix86_initial_elimination_offset can have a non-zero offset from the hard frame pointer.
The ia64 implementation of nonlocal_goto seems incorrect since the helper function moves the frame pointer value into the static chain register (so this patch does nothing to make it better or worse).

AArch64 + x86-64 bootstrap OK, new test passes on AArch64, x86-64 and Arm.

gcc/
	PR middle-end/84521
	* builtins.c (expand_builtin_setjmp_setup): Save
	hard_frame_pointer_rtx.
	(expand_builtin_setjmp_receiver): Do not emit sfp = fp move since we
	restore fp.
	* function.c (expand_function_start): Save hard_frame_pointer_rtx for
	non-local goto.
	* lra-eliminations.c (eliminate_regs_in_insn): Remove sfp = fp
	elimination code.
	(remove_reg_equal_offset_note): Remove unused function.
	* reload1.c (eliminate_regs_in_insn): Remove sfp = hfp elimination
	code.
	* config/arc/arc.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(arc_builtin_setjmp_frame_value): Remove function.
	* config/avr/avr.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(avr_builtin_setjmp_frame_value): Remove function.
	* config/i386/i386.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(ix86_builtin_setjmp_frame_value): Remove function.
	* config/pa/pa.md (nonlocal_goto): Remove FP adjustment.
	* config/sparc/sparc.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(sparc_builtin_setjmp_frame_value): Remove function.
	* config/vax/vax.c (TARGET_BUILTIN_SETJMP_FRAME_VALUE): Remove.
	(vax_builtin_setjmp_frame_value): Remove function.
	* config/xtensa/xtensa.c (xtensa_frame_pointer_required): Force frame
	pointer if has_nonlocal_label.

testsuite/
	PR middle-end/84521
	* gcc.c-torture/execute/pr84521.c: New test.

Added: trunk/gcc/testsuite/gcc.c-torture/execute/pr84521.c
Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/config/arc/arc.c trunk/gcc/config/avr/avr.c trunk/gcc/config/i386/i386.c trunk/gcc/config/pa/pa.md trunk/gcc/config/sparc/sparc.c trunk/gcc/config/vax/vax.c trunk/gcc/config/xtensa/xtensa.c trunk/gcc/function.c trunk/gcc/lra-eliminations.c trunk/gcc/reload1.c trunk/gcc/testsuite/ChangeLog
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #33 from Wilco --- Author: wilco Date: Mon Jun 17 11:25:12 2019 New Revision: 272382 URL: https://gcc.gnu.org/viewcvs?rev=272382&root=gcc&view=rev

Log: Improve PR64242 testcase

Clear the input array to avoid the testcase accidentally passing with an incorrect frame pointer. Committed as obvious.

testsuite/
	PR middle-end/64242
	* gcc.c-torture/execute/pr64242.c: Improve test.

Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.c-torture/execute/pr64242.c
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #27 from Wilco --- (In reply to dave.anglin from comment #26)
> On 2019-06-10 9:51 a.m., wilco at gcc dot gnu.org wrote:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242
> >
> > --- Comment #25 from Wilco ---
> > I believe this is now fixed for generic code - however targets which
> > implement the nonlocal_goto expander (eg. pa, sparc, vax) need similar
> > fixes in their backends.
>
> The test no longer fails on pa. We need a test that demonstrates the
> problem in the current pa expander.

It allocates the buf variables at exactly the same offset in both frames, so the code below accidentally reads the right values from the wrong address:

broken_longjmp:
	.PROC
	.CALLINFO FRAME=192,CALLS,SAVE_RP,SAVE_SP,ENTRY_GR=3
	.ENTRY
	copy %r3,%r1
	stw %r2,-20(%r30)
	copy %r30,%r3
	stwm %r1,192(%r30)
	copy %r26,%r25
	ldi 20,%r24
	bl memcpy,%r2
	ldo 8(%r3),%r26
	ldw 8(%r3),%r28    -> load frame pointer of parent function
	ldo -8(%r28),%r3   -> set hard_frame_pointer
	ldw 16(%r3),%r30   -> use wrong frame pointer to read stack pointer
	ldw 12(%r3),%r1    -> use wrong frame pointer to read return address
	be,n 0(%sr4,%r1)

Given there are many possible stack layouts, the easiest option would be to clear the input buffer so it will jump to a null pointer. Eg.

__attribute ((noinline)) void
broken_longjmp (void *p)
{
  void *buf[32];
  __builtin_memcpy (buf, p, 5 * sizeof (void*));
  __builtin_memset (p, 0, 5 * sizeof (void*));
  /* Corrupts stack pointer...  */
  __builtin_longjmp (buf, 1);
}
[Bug tree-optimization/90836] Missing popcount pattern matching
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90836 --- Comment #2 from Wilco --- (In reply to Richard Biener from comment #1) > While it could be matched in match.pd variations would be quite a lot. > I don't see where else it fits (apart from forwprop which already does > fancy pattern matching...). > > You probably want to match generic INTEGER_CSTs and delay the actual > verification into a with {}. > > I suppose popcount on integers other than long long work with the same > pattern? There are lots of possible variations and given proving that a particular sequence is equivalent to popcount may be too difficult, it's about supporting the cases that occur in code that people care about.
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #25 from Wilco --- I believe this is now fixed for generic code - however targets which implement the nonlocal_goto expander (eg. pa, sparc, vax) need similar fixes in their backends.
[Bug middle-end/90693] Missing popcount simplifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693 --- Comment #2 from Wilco --- (In reply to Dávid Bolvanský from comment #1)
> > __builtin_popcount (x) == 1 into x == (x & -x)
>
> This will not work for x = 0.
>
> Should work:
> x && x == (x & -x)
> x && (x & x-1) == 0

Good point, though that's not needed for (x & (x-1)) != 0 given you can only have 2 or more bits set if x was non-zero to start with. It's worth finding a branchless sequence, otherwise it may not be faster. Maybe (x << clz (x)) == INT_MIN or (x-1)
[Bug middle-end/64242] Longjmp expansion incorrect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64242 --- Comment #24 from Wilco --- Author: wilco Date: Mon Jun 3 13:55:15 2019 New Revision: 271870 URL: https://gcc.gnu.org/viewcvs?rev=271870&root=gcc&view=rev

Log: Fix PR64242 - Longjmp expansion incorrect

Improve the fix for PR64242. Various optimizations can change a memory reference into a frame access. Given there are multiple virtual frame pointers which may be replaced by multiple hard frame pointers, there are no checks for writes to the various frame pointers. So updates to a frame pointer tend to generate incorrect code. Improve the previous fix to also add clobbers of several frame pointers and add a scheduling barrier. This should work in most cases until GCC supports a generic "don't optimize across this instruction" feature.

Bootstrap OK. Testcase passes on AArch64 and x86-64. Inspected x86, Arm, Thumb-1 and Thumb-2 assembler which looks correct.

gcc/
	PR middle-end/64242
	* builtins.c (expand_builtin_longjmp): Add frame clobbers and schedule
	block.
	(expand_builtin_nonlocal_goto): Likewise.

testsuite/
	PR middle-end/64242
	* gcc.c-torture/execute/pr64242.c: Update test.

Modified: trunk/gcc/ChangeLog trunk/gcc/builtins.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.c-torture/execute/pr64242.c
[Bug driver/90684] New alignment options incorrectly report error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90684 --- Comment #3 from Wilco --- Author: wilco Date: Mon Jun 3 11:27:50 2019 New Revision: 271864 URL: https://gcc.gnu.org/viewcvs?rev=271864&root=gcc&view=rev

Log: Fix alignment option parser (PR90684)

Fix the alignment option parser to always allow up to 4 alignments. Now -falign-functions=16:8:8:8 no longer reports an error.

gcc/
	PR driver/90684
	* opts.c (parse_and_check_align_values): Allow 4 alignment values.

Modified: trunk/gcc/ChangeLog trunk/gcc/opts.c
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #36 from Wilco --- (In reply to Orr Shalom Dvory from comment #35) > Hi, thanks for your respond. can someone mark this bug as need to be > improved? > Does anyone agree/disagree with my new proposed method? It's best to create a new bug if there are still easy cases that could be optimized. Also it seems the constants it uses are quite complex - it may be feasible to simplify them. Eg. long f(long x) { return x % 35 == 0; }
[Bug middle-end/90693] New: Missing popcount simplifications
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90693 Bug ID: 90693 Summary: Missing popcount simplifications Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- While GCC optimizes __builtin_popcount (x) != 0 into x != 0, we can also optimize __builtin_popcount (x) == 1 into x == (x & -x), and __builtin_popcount (x) > 1 into (x & (x-1)) != 0.
[Bug driver/90684] New alignment options incorrectly report error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90684 Wilco changed: What|Removed |Added Component|middle-end |driver Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #1 from Wilco --- Proposed patch: https://gcc.gnu.org/ml/gcc-patches/2019-05/msg02030.html
[Bug middle-end/90684] New: New alignment options incorrectly report error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90684 Bug ID: 90684 Summary: New alignment options incorrectly report error Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: --- GCC9 always reports an error when using -falign-functions=16:8:8

cc1: error: invalid number of arguments for ‘-falign-functions’ option: ‘16:8:8’

This is not working as documented in https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options. The issue is that the parser expects an undocumented macro SUBALIGN_LOG to be defined even though that is not necessary for the option parsing.
[Bug middle-end/12849] testing divisibility by constant
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |FIXED --- Comment #6 from Wilco --- Fixed in GCC9.
[Bug target/90317] [7/8/9/10] ICE for arm sha1h and wrong optimisations on sha1h/c/m/p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90317 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-05-29 CC||wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #1 from Wilco --- Confirmed
[Bug tree-optimization/90594] New: [9/10 regression] Spurious popcount emitted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90594 Bug ID: 90594 Summary: [9/10 regression] Spurious popcount emitted Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: ---

The following testcase emits a popcount which computes the final pointer value. This is redundant given the loop already computes the pointer value. The popcount causes the code to be significantly larger and slower than previous GCC versions.

int *bad_popcount (unsigned x, int *p)
{
  for (; x != 0; )
    {
      int tmp = __builtin_ctz (x);
      x = x & (x - 1);
      *p++ = tmp;
    }
  return p;
}

GCC8:
	cbz	w0, .L2
.L3:
	rbit	w2, w0
	clz	w2, w2
	str	w2, [x1], 4
	sub	w2, w0, #1
	ands	w0, w0, w2
	bne	.L3
.L2:
	mov	x0, x1
	ret

GCC9:
	cbz	w0, .L12
	mov	x4, x1
	mov	w2, w0
.L11:
	rbit	w3, w2
	clz	w3, w3
	str	w3, [x4], 4
	sub	w3, w2, #1
	ands	w2, w2, w3
	bne	.L11
	fmov	s0, w0
	cnt	v0.8b, v0.8b
	addv	b0, v0.8b
	umov	w0, v0.b[0]
	sub	w0, w0, #1
	add	x0, x0, 1
	add	x0, x1, x0, lsl 2
	ret
.L12:
	mov	x0, x1
	ret
[Bug target/41999] Bug in generation of interrupt function code for ARM processor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41999 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #4 from Wilco --- Works since at least GCC4.5.4.
[Bug other/16996] [meta-bug] code size improvements
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=16996 Bug 16996 depends on bug 38570, which changed state. Bug 38570 Summary: [arm] -mthumb generates sub-optimal prolog/epilog https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |WORKSFORME
[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570 Wilco changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |WORKSFORME --- Comment #13 from Wilco --- Close
[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #12 from Wilco --- This has been fixed since at least GCC5.4.1: https://www.godbolt.org/z/BlaE1o
[Bug target/9831] [ARM] Peephole for multiple load/store could be more effective.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=9831 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |WONTFIX --- Comment #8 from Wilco --- There doesn't appear to be anything that can be improved here. Literal pool loads can't be easily peepholed into LDM, and there aren't many opportunities anyway.
[Bug target/42017] gcc compiling C for ARM has stopped using r14 in leaf functions?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42017 Wilco changed: What|Removed |Added Status|NEW |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |WORKSFORME --- Comment #6 from Wilco --- This has been fixed since at least GCC5.4: https://www.godbolt.org/z/6IAGfh
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #20 from Wilco --- (In reply to Martin Liška from comment #19) > Created attachment 46265 [details] > Patch candidate v2 > > Update patch that should be fine. Tests on x86_64 work except: > FAIL: gcc.c-torture/execute/builtins/mempcpy.c execution, -O1 > > Where the test expects that mempcpy without LHS should be transformed into > memcpy. Thanks, this version works as expected on AArch64. In principle we could keep mempcpy for tailcalls with -Os, but I'm not sure it's worth the trouble.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #18 from Wilco --- (In reply to Martin Liška from comment #14)
> Created attachment 46262 [details]
> Patch candidate
>
> Patch candidate that handles:
>
> $ cat ~/Programming/testcases/mempcpy.c
> int *mempcopy2 (int *p, int *q, long n)
> {
>   return __builtin_mempcpy (p, q, n);
> }
>
> $ ./xgcc -B. -O2 -S ~/Programming/testcases/mempcpy.c -o/dev/stdout
> ...
> mempcopy2:
> .LFB0:
> 	.cfi_startproc
> 	pushq	%rbx
> 	.cfi_def_cfa_offset 16
> 	.cfi_offset 3, -16
> 	movq	%rdx, %rbx
> 	call	memcpy
> 	addq	%rbx, %rax
> 	popq	%rbx
> 	.cfi_def_cfa_offset 8
> 	ret
> ...
>
> There's a single failing test: gcc.dg/20050503-1.c

On AArch64 it calls mempcpy but memcpy on x86, which is the wrong way around...

-  if (retmode == RETURN_END && target != const0_rtx)
+  if (!TARGET_HAS_FAST_MEMPCPY_ROUTINE && retmode == RETURN_END && target != const0_rtx)

When I uncomment this, it uses memcpy. Also:

+	TARGET_HAS_FAST_MEMPCPY_ROUTINE
+	&& retmode == RETURN_END_MINUS_ONE,
+	&is_move_done);

Should that be RETURN_END? And is_move_done isn't used, is that right?
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #17 from Wilco --- (In reply to Wilco from comment #16) > (In reply to Martin Sebor from comment #15) > > I just noticed I have been misreading mempcpy as memccpy and so making no > > sense. Sorry about that! Please ignore my comments. > > I see, yes we have too many and the names are too similar. Now all we need > is someone to propose strlpcpy or stplcpy... https://github.com/fishilico/shared/blob/master/linux/nolibc/uname.c Aargh...
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #16 from Wilco --- (In reply to Martin Sebor from comment #15) > I just noticed I have been misreading mempcpy as memccpy and so making no > sense. Sorry about that! Please ignore my comments. I see, yes we have too many and the names are too similar. Now all we need is someone to propose strlpcpy or stplcpy...
[Bug rtl-optimization/90249] [9/10 Regression] Code size regression on thumb2 due to sub-optimal register allocation starting with r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90249 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- (In reply to Segher Boessenkool from comment #4) > That is code *size*. Code size is expected to grow a tiny bit, because of > *better* register allocation. > > But we could not do make_more_copies at -Os, if that helps? (The hard > register > changes themselves are required for correctness). Better register allocation implies lower average codesize due to fewer spills, fewer callee-saves, fewer moves etc. I still don't understand what specific problem make_more_copies is trying to solve. Is it trying to do life-range splitting of argument registers?
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #12 from Wilco --- (In reply to Martin Sebor from comment #11)
> My concern is that transforming memccpy to memcpy would leave little
> incentive for libraries like glibc to provide a more optimal
> implementation. Would implementing the function simply as memcpy and
> having the latter return the result of the former be a viable option?
> Basically, rename memcpy to memccpy, parameterizing it on the termination
> character in the process, and change memcpy to call memccpy and return
> the first pointer.

I have no idea what you mean - there is no way you can implement memcpy using memccpy. It never makes sense to slow down a performance-critical function in order to speed up an infrequently used one.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #9 from Wilco --- (In reply to Martin Sebor from comment #7) > Rather than unconditionally transforming mempcpy to memcpy I would prefer to > see libc implementations of memccpy optimized. WG14 N2349 discusses a > rationale for this: > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2349.htm I don't think one precludes the other. memccpy is an odd interface, basically just a shorthand for memchr+memcpy. GLIBC already has an efficient generic implementation doing this - and like mempcpy it's unlikely many targets will add an optimized assembler version. So inlining in GCC would be best. Having so many non-standard string functions just means they are not optimized at all on most targets and never get significant use. I've spent a lot of time optimizing the generic string functions in GLIBC so they are at least not ridiculously slow (a loop doing 1 byte per iteration is unacceptable today).
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #6 from Wilco --- (In reply to Martin Liška from comment #5) > The discussion looks familiar to me. Isn't that PR70140, where I was > suggesting something like: > > https://marc.info/?l=gcc-patches=150166433909242=2 > > with a new target macro: TARGET_HAS_FAST_MEMPCPY_ROUTINE ? Yes something like that should be fine. Unfortunately that patch doesn't apply anymore, and when I got it to compile it didn't do anything.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #4 from Wilco --- (In reply to Jakub Jelinek from comment #3) > Because then you penalize properly maintained targets which do have > efficient mempcpy. And even if some targets don't have efficient mempcpy > right now, that doesn't mean they can't have it in the future. On the > caller side, when not expanded inline, calling mempcpy instead of memcpy is > smaller, often requires fewer registers to be live across the call etc. Few targets have an optimized mempcpy, particularly when you look at other libraries besides GLIBC. We could easily have a target hook to emit mempcpy, but I believe the default should be to do what's fastest for most targets.
[Bug middle-end/90263] Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 --- Comment #2 from Wilco --- (In reply to Jakub Jelinek from comment #1) > As stated several times in the past, I strongly disagree. Why? GCC already does this for bzero and bcopy.
[Bug middle-end/90263] New: Calls to mempcpy should use memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90263 Bug ID: 90263 Summary: Calls to mempcpy should use memcpy Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: ---

While GCC now inlines fixed-size mempcpy like memcpy, GCC still emits calls to mempcpy rather than converting to memcpy. Since most libraries, including GLIBC, do not have optimized mempcpy for most targets, calling mempcpy is less efficient than calling memcpy and emitting an extra addition to compute the result.

int *mempcopy1 (int *p, int *q)
{
  return __builtin_mempcpy (p, q, 16);
}

int *mempcopy2 (int *p, int *q, long n)
{
  return __builtin_mempcpy (p, q, n);
}

mempcopy1:
	ldp	x2, x3, [x1]
	stp	x2, x3, [x0], 16
	ret
mempcopy2:
	b	mempcpy
[Bug middle-end/90262] New: Inline small constant memmoves
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90262 Bug ID: 90262 Summary: Inline small constant memmoves Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: wilco at gcc dot gnu.org Target Milestone: ---

GCC does not inline fixed-size memmoves. However memmove can be as easily inlined as memcpy. The existing memcpy infrastructure could be reused/expanded for this - all loads would be emitted first, followed by the stores. Large memmoves could be inlined on targets with vector registers. Targets without vector registers could emit an overlap check and use inlined memcpy for the no overlap case.

void copy(int *p, int *q) { __builtin_memcpy(p, q, 24); }
void move(int *p, int *q) { __builtin_memmove(p, q, 24); }

copy:
	ldp	x2, x3, [x1]
	stp	x2, x3, [x0]
	ldr	x1, [x1, 16]
	str	x1, [x0, 16]
	ret
move:
	mov	x2, 24
	b	memmove

The memmove could be expanded using the same number of registers:

	ldp	x2, x3, [x1]
	ldr	x1, [x1, 16]
	stp	x2, x3, [x0]
	str	x1, [x0, 16]
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #52 from Wilco --- (In reply to Segher Boessenkool from comment #48) > With just Peter's and Jakub's patch, it *improves* code size by 0.090%. > That does not fix this PR though :-/ But it does fix most of the codesize regression. The shrinkwrapping testcase seems a preexisting problem that was exposed by the combine changes, so it doesn't need to hold up the release. The regalloc change might fix addr-modes-float.c too.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #47 from Wilco --- (In reply to Segher Boessenkool from comment #46) > With all three patches together (Peter's, mine, Jakub's), I get a code size > increase of only 0.047%, much more acceptable. Now looking what that diff > really *is* :-) I think with Jakub's change you don't need to disable the movsi_compare0 pattern in combine. If regalloc works as expected, it will get split into a compare so shrinkwrap can handle it.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #38 from Wilco --- (In reply to Segher Boessenkool from comment #37) > Yes, it is a balancing act. Which option works better? Well the question really is what is bad about movsi_compare0 that could be easily fixed? The move is for free so there is no need for the "r,0" variant in principle, so if that helps reducing constraints on register allocation then we could remove or reorder that alternative.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #54 from Wilco --- (In reply to Jeffrey A. Law from comment #53) > Realistically the register allocation issues are not going to get addressed > this cycle nor are improvements to the overall handling of RMW insns in > combine. So we're going to be stuck with bandaids. > > I've got an updated backend pattern that should address the remainder of the > insv_1 and insv_2 regressions and Steve has a backend pattern to address the > other regression in this BZ. I'd prefer not to add quick hacks that aren't beneficial in the long term. Adding a very general pattern to handle any bitfield insert of any constant would be much more useful. There is no issue with xfailing these tests - neither insv pattern was used frequently so the regression for these is not significant compared to the register allocation and move issues.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #52 from Wilco --- (In reply to Jeffrey A. Law from comment #49) > I think the insv_1 (and it's closely related insv_2) regressions can be > fixed by a single ior/and pattern in the backend or by hacking up combine a > bit. I'm still playing with the latter, but may have to put it on the back > burner because of the pain of note management :( Hoping to draw a > conclusion on that by the end of this week. If I can't get a clean combine > solution, then my recommendation would be to build the suitable backend > pattern. It just has to match stuff like > > (set (reg1) (ior (and (reg2) ...)) with a matching constraint on reg1 and > reg2 to ensure it's a RMW operand. I don't think the current insv patterns are very useful, a more general approach would be to support bitfield insert of any immediate which is not currently supported. This can then be expanded into a bic/orr, bic/add, mov/bfi or movk depending on the mask/immediate. Note the register allocation issue as discussed in PR87871 which causes the codesize regressions after combine inserts extra moves is still the worst part of this issue.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #21 from Wilco --- (In reply to Vladimir Makarov from comment #20)
> (In reply to Wilco from comment #19)
> > (In reply to Peter Bergner from comment #18)
> > > (In reply to Segher Boessenkool from comment #15)
> > > > Popping a5(r116,l0) -- assign reg 3
> > > > Popping a3(r112,l0) -- assign reg 4
> > > > Popping a2(r114,l0) -- assign reg 3
> > > > Popping a0(r111,l0) -- assign reg 0
> > > > Popping a4(r117,l0) -- assign reg 0
> > > > Popping a1(r113,l0) -- assign reg 2
> > > > Assigning 4 to a5r116
> > > > Disposition:
> > > >     0:r111 l0 0    3:r112 l0 4    1:r113 l0 2    2:r114 l0 3
> > > >     5:r116 l0 4    4:r117 l0 0
> > > >
> > > > r116 does not conflict with *any* other pseudo. It is alive in the
> > > > first two insns of the function, which are
> > >
> > > So we initially assign r3 to r116 presumably because it has the same
> > > cost as the other gprs and it occurs first in REG_ALLOC_ORDER. Then
> > > improve_allocation() decides that r4 is a better hard reg and switches
> > > the assignment to that. I'm not sure why it wouldn't choose r0 there
> > > instead.
> >
> > I would expect that r116 has a strong preference for r0 given the
> > r116 = mov r0 and thus allocating r116 to r0 should have the lowest cost
> > by a large margin.
>
> p116 conflicts with hr0. Therefore it can not get hr0. p112 is connected
> with p116. p112 got hr4 and p116 got 3. Assigning 4 to 116 is profitable.
> Therefore assignment of p116 is changed to 4.
>
> The question is why p116 conflicts with hr0. Before RA we have

That's a bug since register copies should not create a conflict. It's one of the most basic optimizations of a register allocator. And there is also the question why we move r0 into a virtual register but do not assign the virtual register to an argument register.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 --- Comment #19 from Wilco --- (In reply to Peter Bergner from comment #18)
> (In reply to Segher Boessenkool from comment #15)
> > Popping a5(r116,l0) -- assign reg 3
> > Popping a3(r112,l0) -- assign reg 4
> > Popping a2(r114,l0) -- assign reg 3
> > Popping a0(r111,l0) -- assign reg 0
> > Popping a4(r117,l0) -- assign reg 0
> > Popping a1(r113,l0) -- assign reg 2
> > Assigning 4 to a5r116
> > Disposition:
> >     0:r111 l0 0    3:r112 l0 4    1:r113 l0 2    2:r114 l0 3
> >     5:r116 l0 4    4:r117 l0 0
> >
> > r116 does not conflict with *any* other pseudo. It is alive in the first
> > two insns of the function, which are
>
> So we initially assign r3 to r116 presumably because it has the same cost
> as the other gprs and it occurs first in REG_ALLOC_ORDER. Then
> improve_allocation() decides that r4 is a better hard reg and switches the
> assignment to that. I'm not sure why it wouldn't choose r0 there instead.

I would expect that r116 has a strong preference for r0 given the r116 = mov r0 and thus allocating r116 to r0 should have the lowest cost by a large margin.
[Bug target/81800] [8/9 regression] on aarch64 ilp32 lrint should not be inlined as two instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81800

--- Comment #16 from Wilco ---
(In reply to Jakub Jelinek from comment #15)
> (In reply to Wilco from comment #14)
> > (In reply to Jakub Jelinek from comment #13)
> > > Patches should be pinged after a week if they aren't reviewed,
> > > furthermore, it is better to CC explicitly relevant maintainers.
> >
> > I've got about 10 patches waiting, I'll ping after stage 1 opens.
>
> This PR is marked as 8/9 Regression though, therefore it should be resolved
> if possible already for GCC 9 and maybe even for GCC 8.4 too.

It's a minor issue. I can't see how it could be more important than serious
code generation bugs (e.g. setjmp/longjmp) that affect many GCC versions and
targets.
[Bug target/81800] [8/9 regression] on aarch64 ilp32 lrint should not be inlined as two instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81800 --- Comment #14 from Wilco --- (In reply to Jakub Jelinek from comment #13) > Patches should be pinged after a week if they aren't reviewed, furthermore, > it is better to CC explicitly relevant maintainers. I've got about 10 patches waiting, I'll ping after stage 1 opens.
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834

--- Comment #16 from Wilco ---
(In reply to kugan from comment #15)
> (In reply to Wilco from comment #11)
> > There is also something odd with the way the loop iterates, this doesn't
> > look right:
> >
> > whilelo p0.s, x3, x4
> > incw    x3
> > ptest   p1, p0.b
> > bne     .L3
>
> I am not sure I understand this. I tried with qemu using an execution
> testcase and it seems to work.
>
> whilelo p0.s, x4, x5
> incw    x4
> ptest   p1, p0.b
> bne     .L3
>
> In my case I have the above (register allocation difference only). incw is
> correct considering two vector word registers? Am I missing something here?

I'm talking about the completely redundant ptest, where does that come from?
[Bug target/88834] [SVE] Poor addressing mode choices for LD2 and ST2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88834 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #11 from Wilco --- There is also something odd with the way the loop iterates, this doesn't look right:

whilelo p0.s, x3, x4
incw    x3
ptest   p1, p0.b
bne     .L3
[Bug target/89607] Missing optimization for store of multiple registers on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607 Wilco changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED CC||wilco at gcc dot gnu.org Resolution|--- |FIXED Target Milestone|--- |9.0 --- Comment #9 from Wilco --- Fixed in GCC9 already, so closing.
[Bug ada/89493] [9 Regression] Stack smashing on armv7hl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89493 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #2 from Wilco --- So is this an Ada unwind bug? What kind of unwinder does Ada use?
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 --- Comment #10 from Wilco --- It seems that rewriting "+rm" into "=rm" and "0" is not equivalent. Eg. __asm__ ("" : [a0] "=m" (A0) : "0" (A0)); gives a million warnings "matching constraint does not allow a register", so "0" appears to imply a register operand. Reload seems to deal with "=rm" and "rm" or "=rm" and "0m", but a single "0" prefers a register and that's why it tries to insert an incorrect BLK move.
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 --- Comment #4 from Wilco --- Small example which generates the same ICE on every GCC version:

typedef struct { int x, y, z; } X;

void f (void)
{
  X A0, A1;
  __asm__ ("" : [a0] "+rm" (A0), [a1] "+rm" (A1));
}

So it's completely invalid inline asm...
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 --- Comment #3 from Wilco --- Full instruction: (insn 531 530 532 19 (parallel [ (set (mem/c:BLK (reg:DI 3842) [29 A0+0 S2 A64]) (asm_operands:BLK ("") ("=rm") 0 [ (mem/c:BLK (reg:DI 3846) [29 A0+0 S2 A64]) (mem/c:BLK (reg:DI 3848) [29 A1+0 S2 A128]) ] [ (asm_input:BLK ("0") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) (asm_input:BLK ("1") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) ] [] external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418)) (set (mem/c:BLK (reg:DI 3844) [29 A1+0 S2 A128]) (asm_operands:BLK ("") ("=rm") 1 [ (mem/c:BLK (reg:DI 3846) [29 A0+0 S2 A64]) (mem/c:BLK (reg:DI 3848) [29 A1+0 S2 A128]) ] [ (asm_input:BLK ("0") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) (asm_input:BLK ("1") external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418) ] [] external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h:1418)) ]) "external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/products/GeneralBlockPanelKernel.h":1418:511 -1 (nil)) I guess this comes from the source like this with some of the arguments being immediates due to template substitution: __asm__ ("" : [a0] "+rm" (A0),[a1] "+rm" (A1)); It's not obvious what this is supposed to achieve, let alone whether it is valid...
[Bug target/89752] [8/9 Regression] ICE in emit_move_insn, at expr.c:3723
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89752 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-03-18 CC||wilco at gcc dot gnu.org Ever confirmed|0 |1 --- Comment #2 from Wilco --- Confirmed. It ICEs in Eigen::internal::gebp_kernel, 2, 4, false, false>::operator() It seems to choke on this asm during reload: 531: {[r3842:DI]=asm_operands;[r3844:DI]=asm_operands;} and somehow emit a move between these operands: (reg:BLK 3849) (mem/c:BLK (reg:DI 3846) [29 A0+0 S2 A64])
[Bug target/89222] [7/8 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 Wilco changed: What|Removed |Added Target Milestone|--- |8.5 Summary|[7/8/9 regression] ARM |[7/8 regression] ARM |thumb-2 misoptimisation of |thumb-2 misoptimisation of |func ptr call with -O2 or |func ptr call with -O2 or |-Os |-Os
[Bug target/89222] [7/8/9 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222

--- Comment #9 from Wilco ---
Author: wilco
Date: Tue Mar  5 15:04:01 2019
New Revision: 269390

URL: https://gcc.gnu.org/viewcvs?rev=269390&root=gcc&view=rev
Log:
[ARM] Fix PR89222

The GCC optimizer can generate symbols with non-zero offset from simple
if-statements. Bit zero is used for the Arm/Thumb state bit, so relocations
with offsets fail if it changes bit zero and the relocation forces bit zero
to true. The fix is to disable offsets on function pointer symbols.

    gcc/
        PR target/89222
        * config/arm/arm.md (movsi): Use targetm.cannot_force_const_mem
        to decide when to split off a non-zero offset from a symbol.
        * config/arm/arm.c (arm_cannot_force_const_mem): Disallow offsets
        in function symbols.

    testsuite/
        PR target/89222
        * gcc.target/arm/pr89222.c: Add new test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr89222.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/config/arm/arm.md
    trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/89437] [9 regression] incorrect result for sinl (atanl (x))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437 Wilco changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Wilco --- Fixed
[Bug tree-optimization/89437] [9 regression] incorrect result for sinl (atanl (x))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437

--- Comment #1 from Wilco ---
Author: wilco
Date: Mon Mar  4 12:36:04 2019
New Revision: 269364

URL: https://gcc.gnu.org/viewcvs?rev=269364&root=gcc&view=rev
Log:
Fix PR89437

Fix PR89437. Fix the sinatan-1.c testcase to not run without a C99 target
system. Use nextafterl for long double initialization. Fix an issue with
sinl (atanl (sqrtl (LDBL_MAX))) returning 0.0 instead of 1.0 by using
x < sqrtl (LDBL_MAX) in match.pd.

    gcc/
        PR tree-optimization/89437
        * match.pd: Use lt in sin(atan(x)) and cos(atan(x)) simplifications.

    testsuite/
        PR tree-optimization/89437
        * gcc.dg/sinatan-1.c: Fix testcase.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/sinatan-1.c
[Bug tree-optimization/86829] Missing sin(atan(x)) and cos(atan(x)) optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86829 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco --- This has caused a regression, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437 and https://gcc.gnu.org/ml/gcc-patches/2019-02/msg01739.html
[Bug tree-optimization/89437] New: [9 regression] incorrect result for sinl (atanl (x))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89437

            Bug ID: 89437
           Summary: [9 regression] incorrect result for sinl (atanl (x))
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

A recently added optimization uses an inline expansion for sinl (atanl (x)).
As it involves computing sqrtl (x * x + 1) which can overflow for large x,
there is a check to ensure x is within the right range. The check emitted is
x <= sqrtl (LDBL_MAX). It's not clear whether that's a bug in mpfr or
build_sinatan_real; however it is safe to change the test to
x < sqrtl (LDBL_MAX). In principle the comparison should use a much smaller
constant given that x * x + 1 equals x * x for x > 2^113 for any long double
format, so we know sinl (atanl (x)) must be 1.0 for all larger values.

Testcase is gcc.dg/sinatan-1.c after using nextafterl instead of incorrect
nextafter for the long double. I'll post a patch to fix this.
[Bug libfortran/78314] [aarch64] ieee_support_halting does not report unsupported fpu traps correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78314 --- Comment #25 from Wilco --- (In reply to Steve Ellcey from comment #24) > See email strings at: > > https://gcc.gnu.org/ml/fortran/2019-01/msg00276.html > https://gcc.gnu.org/ml/fortran/2019-02/msg00057.html > > For more discussion. Sure, it looks like we simply need to reinstate Szabolc's patch with a #ifdef arm/aarch64 around it. I need to find out whether a feclearexcept (FE_ALL_EXCEPT) is necessary on Arm targets which can trap (very few exist IIRC).
[Bug libfortran/78314] [aarch64] ieee_support_halting does not report unsupported fpu traps correctly
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78314 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #23 from Wilco --- (In reply to Steve Ellcey from comment #22) > It looks like the recent checkins are causing gfortran.dg/ieee/ieee_6.f90 > to fail on aarch64. Reading through the comments it looks like this isn't a > new problem but the failure showing up in the GCC testsuite run is new. Well it's a regression since it used to work in GCC7/8/9. Note it's incorrect to allow folding of these calls, for example support_fpu_flag can return true or false depending on the flag (FE_DENORMAL if not supported on most targets), and yet the constant folding always returns true for all flags.
[Bug middle-end/89037] checking ice emitting 128-bit bit-field initializer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89037 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org Version|9.0 |6.5.0 Target Milestone|--- |9.0 Known to fail||6.0, 7.0, 8.0 --- Comment #5 from Wilco --- Note this fails on GCC6, GCC7 and GCC8 too.
[Bug target/85711] [8 regression] ICE in aarch64_classify_address, at config/aarch64/aarch64.c:5678
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85711 Wilco changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2019-02-15 CC||wilco at gcc dot gnu.org Summary|ICE in |[8 regression] ICE in |aarch64_classify_address, |aarch64_classify_address, |at |at |config/aarch64/aarch64.c:56 |config/aarch64/aarch64.c:56 |78 |78 Ever confirmed|0 |1 Known to fail||8.0 --- Comment #4 from Wilco --- Fixed in GCC9, still failing in GCC8.
[Bug target/89190] [8/9 regression][ARM] armv8-m.base invalid ldm ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89190

--- Comment #2 from Wilco ---
Author: wilco
Date: Wed Feb 13 16:22:25 2019
New Revision: 268848

URL: https://gcc.gnu.org/viewcvs?rev=268848&root=gcc&view=rev
Log:
[ARM] Fix Thumb-1 ldm (PR89190)

This patch fixes an ICE in the Thumb-1 LDM peepholer. Thumb-1 LDMs always
update the base register except if the base is loaded. The current
implementation rejects LDMs where the base is not dead, however this doesn't
exclude the case where the base is loaded as well as dead. Fix this by
explicitly checking whether the base is loaded. Also enable LDMs which load
the first register.

    gcc/
        PR target/89190
        * config/arm/arm.c (ldm_stm_operation_p): Set addr_reg_in_reglist
        correctly for first register.
        (load_multiple_sequence): Remove dead base check.
        (gen_ldm_seq): Correctly set write_back for Thumb-1.

    testsuite/
        PR target/89190
        * gcc.target/arm/pr89190.c: New test.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/testsuite/ChangeLog
[Bug target/89222] [7/8/9 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 --- Comment #8 from Wilco --- (In reply to Wilco from comment #7) > Patch: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00780.html Updated patch: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00947.html
[Bug tree-optimization/86637] [9 Regression] ICE: tree check: expected block, have in inlining_chain_to_json, at optinfo-emit-json.cc:293
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86637

--- Comment #14 from Wilco ---
Author: wilco
Date: Mon Feb 11 18:14:37 2019
New Revision: 268777

URL: https://gcc.gnu.org/viewcvs?rev=268777&root=gcc&view=rev
Log:
[COMMITTED] Fix pthread errors in pr86637-2.c

Fix test errors on targets which do not support pthreads. Committed as
obvious.

    testsuite/
        PR tree-optimization/86637
        * gcc.c-torture/compile/pr86637-2.c: Test pthread and graphite target.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.c-torture/compile/pr86637-2.c
[Bug target/89222] [7/8/9 regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 --- Comment #7 from Wilco --- Patch: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00780.html
[Bug target/89222] [7.x regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-02-08 Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org Ever confirmed|0 |1
[Bug target/89222] [7.x regression] ARM thumb-2 misoptimisation of func ptr call with -O2 or -Os
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89222 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #5 from Wilco --- (In reply to Andrew Pinski from comment #4) > I think the bug is in the assembler or the linker: > .L22: > .word myhandler2-1 > > > > Basically what is happening is: > (__handler != ((__sighandler_t) 2)) && (__handler != ((__sighandler_t) > SIG_DFL)) > > is converted to: > > (((size_t)__handler)-1) <= 1 > > And then GCC emits myhandler2-1 in the constant pool which is correct but > the assembler/linker decides to put 0x23a7 in that location (See the > .L22 above) and then GCC adds +1 to it to try to make it myhandler2 (again). > > This is why using SIG_DFL of 5 works, it is just by accident because GCC > decides not to do the transformation or put myhandler2-1 in the constant > pool. > > Again I think this is an assembler/linker issue of putting the wrong value > for > > .L22: > .word myhandler2-1 The +1 is added by the assembler since it is a Thumb function. Then the compiler adds another +1, making it +2. Basically one shouldn't do arithmetic with function symbols since bit 0 encodes the Arm/Thumb state.
[Bug rtl-optimization/87871] [9 Regression] testcases fail after r265398 on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87871 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #8 from Wilco ---
(In reply to Segher Boessenkool from comment #5)
> The first one just needs an xfail. I don't know if it should be *-*-* there
> or only arm*-*-* should be added.
>
> The other two need some debugging by someone who knows the target and/or
> these tests.

The previous code for Arm was:

        cbz     r0, .L5
        push    {r4, lr}
        mov     r4, r0
        bl      foo
        movw    r2, #:lower16:.LANCHOR0
        movt    r2, #:upper16:.LANCHOR0
        add     r4, r4, r0
        str     r4, [r2]
        pop     {r4, pc}
.L5:
        movs    r0, #1
        bx      lr

Now it fails to shrinkwrap:

        push    {r4, lr}
        mov     r4, r0
        cmp     r4, #0
        moveq   r0, #1
        beq     .L3
        bl      foo
        ldr     r2, .L7
        add     r3, r4, r0
        str     r3, [r2]
.L3:
        pop     {r4, lr}
        bx      lr

It seems shrinkwrapping is more random: sometimes it's done as expected,
sometimes it is not. It was more consistent on older GCCs.
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 --- Comment #11 from Wilco --- (In reply to Segher Boessenkool from comment #9) > That patch is pre-approved if it regchecks fine (on more than just x86). > Thanks! check-gcc is clean on aarch64_be-none-elf
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195

--- Comment #10 from Wilco ---
(In reply to Jakub Jelinek from comment #8)
> Created attachment 45606 [details]
> gcc9-pr89195.patch
>
> Now in patch form (untested so far).

That works fine indeed. It avoids accessing the object out of bounds but
still allows the optimization for e.g. i:16 using a 16-bit access within the
struct.
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 --- Comment #4 from Wilco --- (In reply to Segher Boessenkool from comment #3) > (In reply to Wilco from comment #1) > > len is unsigned HOST_WIDE_INT, so bits_to_bytes_round_down does an unsigned > > division... > > That shouldn't make a difference though, both dividend and divisor should be > non-negative. Are they? Well I guess not... So pos points outside of the > register here?! pos is a frame offset so always negative. And yes this code is then creating an access outside the original object (starting at -1 or +1 depending on endianness). > Was it correct before that? At least it was symmetric so it *seemed* > correct.. It was broken in the same way. If I cast just len to HOST_WIDE_INT it works fine.
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 --- Comment #1 from Wilco --- make_extraction does:

  if (MEM_P (inner))
    {
      poly_int64 offset;

      /* POS counts from lsb, but make OFFSET count in memory order.  */
      if (BYTES_BIG_ENDIAN)
        offset = bits_to_bytes_round_down (GET_MODE_PRECISION (is_mode)
                                           - len - pos);
      else
        offset = pos / BITS_PER_UNIT;

      new_rtx = adjust_address_nv (inner, tmode, offset);

len is unsigned HOST_WIDE_INT, so bits_to_bytes_round_down does an unsigned division...
[Bug rtl-optimization/89195] [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195 Wilco changed: What|Removed |Added Target||aarch64 Target Milestone|--- |9.0 Known to fail||7.0, 8.0, 9.0
[Bug rtl-optimization/89195] New: [7/8/9 regression] Corrupted stack offset after combine
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89195

            Bug ID: 89195
           Summary: [7/8/9 regression] Corrupted stack offset after combine
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following testcase generates incorrect stack offsets on AArch64 since
GCC7 when compiled with -O1 -mbig-endian:

struct S
{
  unsigned i : 24;
};

volatile unsigned char x;

int f (struct S d)
{
  return d.i & x;
}

This produces:

        sub     sp, sp, #16
        str     x0, [sp, 8]
        adrp    x0, x
        ldrb    w0, [x0, #:lo12:x]
        and     w0, w0, 255
        mov     x1, 2305843009213693952
        add     x1, sp, x1
        ldr     w1, [x1, 7]
        and     w0, w1, w0
        add     sp, sp, 16
        ret

The combine output is as follows:

Trying 10, 11 -> 12:
   10: r100:SI=[sfp:DI-0x8]
   11: r101:SI#0=zero_extract(r100:SI#0,0x18,0x8)
      REG_DEAD r100:SI
   12: r98:SI=r101:SI&r92:SI
      REG_DEAD r101:SI
      REG_DEAD r92:SI
Failed to match this instruction:
(set (reg:SI 98)
    (and:SI (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int 2305843009213693943 [0x1ffffffffffffff7])) [1
dD.3407+2305843009213693951 S4 A8])
        (reg:SI 92 [ x.0_3+-3 ])))
Successfully matched this instruction:
(set (reg:SI 101)
    (mem/c:SI (plus:DI (reg/f:DI 64 sfp)
            (const_int 2305843009213693943 [0x1ffffffffffffff7])) [1
dD.3407+2305843009213693951 S4 A8]))
Successfully matched this instruction:
(set (reg:SI 98)
    (and:SI (reg:SI 101)
        (reg:SI 92 [ x.0_3+-3 ])))
allowing combination of insns 10, 11 and 12
original costs 16 + 4 + 4 = 24
replacement costs 16 + 4 = 20
deferring deletion of insn with uid = 10.
modifying insn i2    11: r101:SI=[sfp:DI+0x1ffffffffffffff7]
deferring rescan insn with uid = 11.
modifying insn i3    12: r98:SI=r101:SI&r92:SI
      REG_DEAD r92:SI
      REG_DEAD r101:SI
deferring rescan insn with uid = 12.

So it appears to want to change the offset -8 to -7 to optimize the
zero-extend away (this is an out-of-bound access, but maybe OK for locals?).
It must be converting a bit offset to a byte offset using an unsigned shift, losing the top 3 bits, which results in a wildly out of range offset...
[Bug target/89190] [8/9 regression][ARM] armv8-m.base invalid ldm ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89190 Wilco changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2019-02-04 Ever confirmed|0 |1 --- Comment #1 from Wilco --- Fix: https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00201.html
[Bug target/89190] New: [8/9 regression][ARM] armv8-m.base invalid ldm ICE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89190

            Bug ID: 89190
           Summary: [8/9 regression][ARM] armv8-m.base invalid ldm ICE
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wilco at gcc dot gnu.org
  Target Milestone: ---

The following testcase ICEs with -march=armv8-m.base on arm-none-eabi:

long long a;
int b, c;

int d (int e, int f)
{
  return e << f;
}

void g (void)
{
  long long h;
  char i = d (b >= 7, 2);
  c = i == 0 ?: 1 / i;
  h = c && a ?: c + a;
  b = h;
}

error: unrecognizable insn:
   10 | }
      | ^
(insn 133 114 105 3 (parallel [
            (set (reg:SI 2 r2)
                (plus:SI (reg:SI 2 r2)
                    (const_int 8 [0x8])))
            (set (reg:SI 0 r0)
                (mem/c:SI (reg/f:SI 2 r2 [127]) [2 a+0 S4 A64]))
            (set (reg:SI 2 r2)
                (mem/c:SI (plus:SI (reg/f:SI 2 r2 [127])
                        (const_int 4 [0x4])) [2 a+4 S4 A32]))
        ]) -1
     (nil))

The reason is that the Thumb-1 LDM code sets writeback if the base register
is dead. However it does not ensure the base register is not also loaded. The
fix is to disallow writeback if the base is loaded.
[Bug ipa/89104] ICE: Segmentation fault (in tree_int_cst_elt_check)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89104 --- Comment #5 from Wilco --- (In reply to Jakub Jelinek from comment #4) > I really don't like these aarch64 warnings, declare simd is an optimization > (admittedly with ABI consequences) and warning about this by default is > weird, > + it is going to be a pain, any time any declare simd testcase is added > there is potential "regression" on aarch64. > Plus it really looks like a bug in this case, there is no mixed type at all, > the int * argument is uniform, so should be passed as any other pointer, and > all the others are int and so should use the same vector int type. I agree backend specific warnings are not ideal but it's unclear whether a better solution exists beyond just not emitting these warnings at all and letting the user figure it out. However the key question is why do testcases which do not specifically test for a specific warning fail if an extra warning is emitted? That's completely harmless in most cases.
[Bug ipa/89104] ICE: Segmentation fault (in tree_int_cst_elt_check)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89104 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #3 from Wilco --- (In reply to Jakub Jelinek from comment #2) gcc.dg/gomp/pr89104.c fails with a warning: gcc/gcc/testsuite/gcc.dg/gomp/pr89104.c:8:1: warning: GCC does not currently support mixed size types for 'simd' functions Should we just ignore warnings with -w?
[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instrcutions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101 Wilco changed: What|Removed |Added Status|WAITING |NEW Known to work||9.0 Version|unknown |8.2.0 Target Milestone|--- |9.0 Known to fail||8.2.0 --- Comment #3 from Wilco ---
(In reply to Gael Guennebaud from comment #2)
> Indeed, it fails to remove the dup only if the coefficient is used multiple
> times as in the following reduced exemple: (https://godbolt.org/z/hmSaE0)
>
> #include <arm_neon.h>
>
> void foo(const float* a, const float * b, float * c, int n) {
>   float32x4_t c0, c1, c2, c3;
>   c0 = vld1q_f32(c+0*4);
>   c1 = vld1q_f32(c+1*4);
>   for(int k=0; k<n; k++)
>   {
>     float32x4_t a0 = vld1q_f32(a+0*4+k*4);
>     float32x4_t b0 = vld1q_f32(b+k*4);
>     c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
>     c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
>   }
>   vst1q_f32(c+0*4, c0);
>   vst1q_f32(c+1*4, c1);
> }
>
> I tested with gcc 7 and 8.

Confirmed for GCC8, fixed on trunk. I tried the above example with up to 4
uses and it always generates the expected code on trunk. So this is fixed for
GCC9, however it seems unlikely the fix (multi-use support in Combine) could
be backported.
[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instrcutions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #1 from Wilco ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
>                  float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
>                                     __aarch64_vdupq_laneq_f32 (__c, __lane),
>                                     __a);
> }
>
> thus leading to unoptimized code as:
>
>     ldr  q1, [x2, 16]
>     dup  v28.4s, v1.s[0]
>     dup  v27.4s, v1.s[1]
>     dup  v26.4s, v1.s[2]
>     dup  v1.4s, v1.s[3]
>     fmla v22.4s, v25.4s, v28.4s
>     fmla v3.4s, v25.4s, v27.4s
>     fmla v6.4s, v25.4s, v26.4s
>     fmla v17.4s, v25.4s, v1.4s
>
> instead of:
>
>     ldr  q1, [x2, 16]
>     fmla v22.4s, v25.4s, v1.s[0]
>     fmla v3.4s, v25.4s, v1.s[1]
>     fmla v6.4s, v25.4s, v1.s[2]
>     fmla v17.4s, v25.4s, v1.s[3]
>
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC6, 7, 8, and 9 with -O2:

#include "arm_neon.h"

float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

    fmla v0.4s, v1.4s, v2.4s[0]
    fmla v0.4s, v1.4s, v2.4s[1]
    ret

In all cases the optimizer is able to merge the dups as expected. If it still
fails for you, could you provide a compilable example like above that shows
the issue?

> For the record, I managed to partly workaround this issue by writing my own
> version as:
>
> if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that's of course not ideal.
> This change yields a 32% speed up in Eigen's matrix product:
> http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler since most people make
mistakes writing it, and GCC cannot optimize code that uses it.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #23 from Wilco ---
(In reply to ktkachov from comment #22)
> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and micro-ops) vs 7 x 4 = 28 for
the non-unrolled version, a significant reduction (without LDP it would be 21
vs 28). There is potential to use 2 more LDPs and use load+writeback, which
would make it 15 vs 28 instructions, close to a 2x reduction in instruction
count.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763

--- Comment #32 from Wilco ---
Author: wilco
Date: Fri Jan 25 13:29:06 2019
New Revision: 268265

URL: https://gcc.gnu.org/viewcvs?rev=268265&root=gcc&view=rev
Log:
[PATCH][AArch64] Fix generation of tst (PR87763)

The TST instruction no longer matches in all cases due to changes in Combine.
The fix is simple: we now need to allow a subreg as well when selecting the
cc_mode. This fixes the tst_5.c and tst_6.c failures.

AArch64 regress & bootstrap OK.

    PR rtl-optimization/87763
    * config/aarch64/aarch64.c (aarch64_select_cc_mode): Allow SUBREG when
    matching CC_NZmode compare.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #21 from Wilco --- (In reply to rguent...@suse.de from comment #20) > On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #19 from Wilco --- > > (In reply to rguent...@suse.de from comment #18) > > > > > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's > > > > suggestion) > > > > > > If that helps, sure (I'd have guessed uarchs are going to split > > > load-multiple into separate loads, but eventually it avoids > > > load-port contention?) > > > > Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a > > 128-bit LDP in a single cycle (see Optimization Guide). > > > > > > 2) Unrolling and breaking accumulator dependencies. > > > > > > IIRC RTL unrolling can do this (as side-effect, not as main > > > cost motivation) guarded with an extra switch. > > > > > > > I think more general unrolling and the peeling associated with it can be > > > > discussed independently of 1) and 2) once we collect more data on more > > > > microarchitectures. > > > > > > I think both of these can be "implemented" on the RTL unroller > > > side. > > > > You still need dependence analysis, alias info, ivopt to run again. The > > goal is > > to remove the increment of the index, use efficient addressing modes > > (base+imm) > > and allow scheduling to move instructions between iterations. I don't > > believe > > the RTL unroller supports any of this today. > > There's no way to encode load-multiple on GIMPLE that wouldn't be > awkward to other GIMPLE optimizers. I don't think anyone want LDP/STP directly in GIMPLE - that doesn't seem useful. We don't even form LDP until quite late in RTL. The key to forming LDP/STP is using base+imm addressing modes and having correct alias info (so loads/stores from different iterations can be interleaved and then combined into LDP/STP). 
The main thing a backend would need to do is tune address costs to take future LDP formation into account (and yes, the existing cost models need to be improved anyway). > Yes, the RTL unroller supports scheduling (sched runs after unrolling) > and scheduling can do dependence analysis. Yes, the RTL unroller > does _not_ do dependence analysis at the moment, so if you want to > know beforehand whether you can interleave iterations you have to > actually perform dependence analysis. Of course you can do that > on RTL. And of course you can do IVOPTs on RTL (yes, we don't do that > at the moment). Sure we *could* duplicate all high-level loop optimizations to work on RTL. However is that worth the effort given we have them already at tree level? > Note I'm not opposed to have IVOPTs on GIMPLE itself perform > unrolling (I know Bin was against this given IVOPTs is already > so comples) and a do accumulator breakup. But I don't see how > to do the load-multiple thing (yes, you could represent it > as a vector load plus N element extracts on GIMPLE and it > would be easy to teach SLP vectorization to perform this > transform on its own if it is really profitable - which I > doubt you can reasonably argue before RA, let alone on GIMPLE). Let's forget about load-multiple in GIMPLE. Kyrill's example shows that unrolling at the high level means the existing loop optimizations and analysis work as expected and we end up with good addressing modes, LDPs and interleaving of different iterations. With the existing RTL unroller this just isn't happening.
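To make the point concrete, here is a hypothetical C sketch (not from the bug report) of what unrolling with base+imm addressing looks like at the source level: every access is a constant offset from a single base, so adjacent same-base accesses such as s[i]/s[i+1] are exactly the pairs a late RTL pass can merge into LDP/STP.

```c
#include <assert.h>

/* Illustrative 2x-unrolled copy.  The function name and shape are made
   up for this sketch; the point is that s[i]/s[i+1] and d[i]/d[i+1]
   are base+imm addresses off one base register each, which is the form
   the AArch64 backend can later pair into a single LDP and STP.  */
void copy_pairs (long *restrict d, const long *restrict s, int n)
{
  int i;
  for (i = 0; i + 1 < n; i += 2)
    {
      d[i]     = s[i];       /* s[i], s[i+1]: candidate LDP */
      d[i + 1] = s[i + 1];   /* d[i], d[i+1]: candidate STP */
    }
  if (i < n)                 /* trailing iteration for odd n */
    d[i] = s[i];
}
```

Whether the pairing actually happens depends on the alias information proving the interleaved accesses independent, which is the point made above.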
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #19 from Wilco --- (In reply to rguent...@suse.de from comment #18) > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's > > suggestion) > > If that helps, sure (I'd have guessed uarchs are going to split > load-multiple into separate loads, but eventually it avoids > load-port contention?) Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a 128-bit LDP in a single cycle (see Optimization Guide). > > 2) Unrolling and breaking accumulator dependencies. > > IIRC RTL unrolling can do this (as side-effect, not as main > cost motivation) guarded with an extra switch. > > > I think more general unrolling and the peeling associated with it can be > > discussed independently of 1) and 2) once we collect more data on more > > microarchitectures. > > I think both of these can be "implemented" on the RTL unroller > side. You still need dependence analysis, alias info, ivopt to run again. The goal is to remove the increment of the index, use efficient addressing modes (base+imm) and allow scheduling to move instructions between iterations. I don't believe the RTL unroller supports any of this today.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #23 from Wilco --- Author: wilco Date: Tue Jan 22 17:49:46 2019 New Revision: 268159 URL: https://gcc.gnu.org/viewcvs?rev=268159&root=gcc&view=rev Log: Fix vect-nop-move.c test Fix a failing test - changes in Combine mean the test now fails even though the generated code is the same. Given there are several AArch64-specific tests for vec-select, remove the scanning of Combine output. Committed as trivial fix. testsuite/ PR rtl-optimization/87763 * gcc.dg/vect/vect-nop-move.c: Fix testcase on AArch64. Modified: trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/vect/vect-nop-move.c
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #22 from Wilco --- (In reply to Steve Ellcey from comment #21) > If I look at this specific example: > > int f2 (int x, int y) > { > return (x & ~0x0ff000) | ((y & 0x0ff) << 12); > } > > Is this because of x0 (a hard register) at the destination in insn 15? Yes, the hard reg change affects the rtl shape in these tests so it fails to match in combine. I have a simple fix for the tst_5 and tst_6 failures. You can check this by ensuring there are no hard registers in the test: int f2a (int x, int y) { x++; y++; return (x & ~0x0ff000) | ((y & 0x0ff) << 12); } This also fails to match before r265398.
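For reference, the pattern the failing tests expect is a bitfield insert; the helper name bfi_12_8 below is invented for illustration and simply models the semantics of the AArch64 instruction "bfi w0, w1, 12, 8" that combine should form from f2.

```c
#include <assert.h>
#include <stdint.h>

/* The testcase from the comment: insert the low 8 bits of y into bits
   12-19 of x.  When combine matches the insert pattern this becomes a
   single bfi instruction on AArch64.  */
int f2 (int x, int y)
{
  return (x & ~0x0ff000) | ((y & 0x0ff) << 12);
}

/* Reference model of bfi #12, #8: copy an 8-bit field from the low
   bits of y into x at bit position 12, leaving other bits of x alone.
   Hypothetical helper, not part of the testsuite.  */
uint32_t bfi_12_8 (uint32_t x, uint32_t y)
{
  uint32_t mask = 0xffu << 12;
  return (x & ~mask) | ((y & 0xffu) << 12);
}
```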
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #19 from Wilco --- (In reply to Segher Boessenkool from comment #18) > https://gcc.gnu.org/ml/gcc/2019-01/msg00112.html Thanks, I hadn't noticed that yet... I need to look at it in more detail, but are you saying that combine no longer generates zero_extracts today and all patterns that rely on it must be changed to a different canonical form? I suspect the tst_5/6 cases are similar if the canonical form of the RTL has changed. Note that (x & 63) != 0 works just fine; it's only 255 and 65535 that appear to be treated differently...
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #17 from Wilco --- (In reply to Vladimir Makarov from comment #14) > I've checked cvtf_1.c generated code and I don't see additional fmov > anymore. I guess it was fixed by an ira-costs.c change (a special > consideration of moves containing hard regs). I think this PR is resolved > (although the change resulted in a new PR 88560 but it is a different story). Some failures disappeared, however various failures still exist. It appears the ones reported above are not all directly related to the move change, but another change around the same time. For example all of these are due to combine no longer creating bfi or tst instructions in trivial cases:

FAIL: gcc.target/aarch64/combine_bfi_1.c scan-assembler-times \\tbfi\\t 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 0, 8
FAIL: gcc.target/aarch64/insv_1.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 16, 5
FAIL: gcc.target/aarch64/insv_1.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 32
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 43, 5
FAIL: gcc.target/aarch64/insv_2.c scan-assembler bfi\tx[0-9]+, x[0-9]+, 56, 8
FAIL: gcc.target/aarch64/insv_2.c scan-assembler movk\tx[0-9]+, 0x1d6b, lsl 16
FAIL: gcc.target/aarch64/lsl_asr_sbfiz.c scan-assembler sbfiz\tw
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*255
FAIL: gcc.target/aarch64/tst_5.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535
FAIL: gcc.target/aarch64/tst_6.c scan-assembler tst\t(x|w)[0-9]+,[ \t]*65535

> If we want to improve the generated code size, it would be better to find > a small testcase from SPEC2006 showing what is wrong. I understand it is a > hard work but otherwise we could only speculate what is going wrongly. I > don't think that reverting the combiner change would be a solution. Yes it's hard to get good reproducible examples, especially from large functions. It's easier to first sort out the known test failures.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #13 from Wilco --- (In reply to Segher Boessenkool from comment #12) > Before the change combine forwarded all argument (etc.) hard registers > wherever > it could, doing part of RA's job (and doing a lousy job of it). If after the > change it no longer two ranges, than a) that is a good decision, or b) there > is > some bug in RA. I think it's important not to conflate avoiding increasing live ranges of hard registers and inserting redundant extra moves which cannot be optimized. The former unconditionally looks like a good idea, however I can't see any valid reasoning for the latter. Basically we now always get 2 moves for every physical register, and this seems to thwart register preferencing. > 0.05% code size is nothing, most other biggish changes have a much bigger > effect. Compiler optimization is all about making many small changes which add up to a large improvement. This is actually quite significant given practically all code is affected negatively.
[Bug rtl-optimization/87763] [9 Regression] aarch64 target testcases fail after r265398
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87763 --- Comment #11 from Wilco --- A SPEC2006 run shows the codesize cost of make_more_copies is 0.05%. Practically all tests are worse, the largest increases are perlbench at 0.20%, gromacs 0.12%, calculix 0.12%, soplex 0.08%, xalancbmk 0.07%, wrf 0.06%. In total ~26KBytes of extra moves... Overall this doesn't look like a good change. So the question is, what is the purpose of adding extra live ranges when the register allocator doesn't appear to be able to merge them?
[Bug middle-end/88560] [9 Regression] armv8_2-fp16-move-1.c and related regressions after r266385
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88560 --- Comment #6 from Wilco --- (In reply to Vladimir Makarov from comment #5) > We have too many tests checking expected generated code. We should more > focus on overall effect of the change. SPEC would be a good criterium > although it is hard to check SPEC for each patch. > > I've checked the generated code of arm8_2-fp16-move-1.c and found that in > most cases GCC generates better code with the patch:
>
> @@ -80,7 +80,6 @@ test_load_store_1:
>  	@ frame_needed = 0, uses_anonymous_args = 0
>  	@ link register save eliminated.
>  	lsl	r1, r1, #1
> -	vmov.f16	s0, r3	@ __fp16
>  	ldrh	r3, [r2, r1]	@ __fp16
>  	strh	r3, [r0, r1]	@ __fp16
>  	bx	lr

When I tested it, this test added that vmov, not removed it - see comment #2.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #16 from Wilco --- (In reply to rguent...@suse.de from comment #15) > which is what I refered to for branch prediction. Your & prompts me > to a way to do sth similar as duffs device, turning the loop into a nest.
>
> head:
>    if (n == 0) exit;
>    <1 iteration>
>    if (n & 1)
>      n -= 1, goto head;
>    <1 iteration>
>    if (n & 2)
>      n -= 2, goto head;
>    <2 iterations>
>    if (n & 4)
>      n -= 4, goto head;
>    <4 iterations>
>    n -= 8, goto head;
>
> the inner loop tests should become well-predicted quickly.
>
> But as always - more branches make it more likely that one is > mispredicted. For a single non-unrolled loop you usually only > get the actual exit mispredicted.

Yes, overlapping the branches of the tail loop and the main loop will result in more mispredictions. And there are still 4 branches for an 8x unrolled loop, blocking optimizations and scheduling. So Duff's device is always inefficient - the above loop is much faster like this:

   if (n & 1)
     do 1 iteration
   if (n & 2)
     do 2 iterations
   if (n >= 4)
     do 4 iterations
     while ((n -= 4) > 0)
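The restructured remainder handling sketched above can be written out as a runnable example; inc_all is a hypothetical stand-in body (add 1 to each element) chosen only to make the control flow concrete, not code from the bug report.

```c
#include <assert.h>

/* Handle the n&1 and n&2 remainders up front, then run the 4x-unrolled
   body with a single backward branch - instead of jumping into the
   middle of the unrolled body as Duff's device does.  */
void inc_all (int *a, int n)
{
  int i = 0;
  if (n & 1)                       /* 1 leftover iteration */
    a[i]++, i++;
  if (n & 2)                       /* 2 leftover iterations */
    {
      a[i]++; a[i + 1]++; i += 2;
    }
  n &= ~3;                         /* remainder handled above */
  if (n)
    do                             /* 4x-unrolled main loop */
      {
        a[i]++; a[i + 1]++; a[i + 2]++; a[i + 3]++;
        i += 4;
      }
    while ((n -= 4) > 0);
}
```

The three forward tests are taken at most once each, so the only branch executed repeatedly is the well-predicted loop-back branch.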
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #14 from Wilco --- (In reply to rguent...@suse.de from comment #13) > Usually the peeling is done to improve branch prediction on the > prologue/epilogue. Modern branch predictors do much better on a loop than with this kind of code:

	ands	x12, x11, 7
	beq	.L70
	cmp	x12, 1
	beq	.L55
	cmp	x12, 2
	beq	.L57
	cmp	x12, 3
	beq	.L59
	cmp	x12, 4
	beq	.L61
	cmp	x12, 5
	beq	.L63
	cmp	x12, 6
	bne	.L72

That's way too many branches close together so most predictors will hit the maximum branches per fetch block limit and not predict them. If you wanted to peel it would have to be like:

	if (n & 4) do 4 iterations
	if (n & 2) do 2 iterations
	if (n & 1) do 1 iteration

However that's still way too much code explosion for little or no gain. The only case where this makes sense is in a handwritten memcpy.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #37 from Wilco --- (In reply to rsand...@gcc.gnu.org from comment #35) > Yeah, the expr.c patch makes the original testcase work, but we still fail > for: That's the folding in ccp1 after inlining, which will require a similar fix. There are likely more places that need to be fixed to handle the 'short' bit types.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #34 from Wilco --- With just the expr.c patch the gcc regression tests all pass on big-endian AArch64. Interestingly, this includes the new torture test, i.e. it does not trigger the union bug.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #33 from Wilco --- (In reply to Richard Biener from comment #32)
>
> > Index: gcc/expr.c
> > ===
> > --- gcc/expr.c	(revision 267553)
> > +++ gcc/expr.c	(working copy)
> > @@ -10562,6 +10562,15 @@ expand_expr_real_1 (tree exp, rtx target
> >	     infinitely recurse.  */
> >	  gcc_assert (tem != exp);
> >
> > +	  /* When extracting from non-mode bitsize entities adjust the
> > +	     bit position for BYTES_BIG_ENDIAN.  */
> > +	  if (INTEGRAL_TYPE_P (TREE_TYPE (tem))
> > +	      && (TYPE_PRECISION (TREE_TYPE (tem))
> > +		  < GET_MODE_BITSIZE (as_a <scalar_int_mode> (TYPE_MODE (TREE_TYPE (tem)))))
> > +	      && BYTES_BIG_ENDIAN)
> > +	    bitpos += (GET_MODE_BITSIZE (as_a <scalar_int_mode> (TYPE_MODE (TREE_TYPE (tem))))
> > +		       - TYPE_PRECISION (TREE_TYPE (tem)));
> > +
> >	  /* If TEM's type is a union of variable size, pass TARGET to the inner
> >	     computation, since it will need a temporary and TARGET is known
> >	     to have to do.  This occurs in unchecked conversion in Ada.  */
>
> Btw, this needs to be amended for WORDS_BIG_ENDIAN of course. I guess > we might even run into the case that such BIT_FIELD_REF references > a non-contiguous set of bits... (that's also true for BITS_BIG_ENDIAN != > BYTES_BIG_ENDIAN I guess).

Was that meant to be instead of, or in addition to, the tree-ssa-sccvn.c patch? With both I get:

	lsr	w20, w1, 2
	...
	and	w1, w20, 65535

With only the expr.c patch it starts to look as expected:

	lsr	w20, w1, 2
	...
	lsr	w1, w20, 14

And with the latter case the new torture test now passes on big-endian!
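The adjustment the patch makes can be modelled in isolation. adjust_bitpos below is an illustrative helper (not GCC code) implementing the arithmetic bitpos += container - precision for the big-endian case, applied to the 30-bit-value-in-32-bit-mode situation from this bug.

```c
#include <assert.h>

/* Model of the expr.c adjustment: a value with PRECISION significant
   bits lives in a CONTAINER-bit mode.  On little-endian, a field at
   bit position POS of the value is at container bit POS; on
   byte-big-endian targets the value sits at the top of the container,
   so the position must be shifted past the unused padding bits.  */
static unsigned adjust_bitpos (unsigned pos, unsigned precision,
                               unsigned container, int bytes_big_endian)
{
  if (bytes_big_endian && precision < container)
    pos += container - precision;   /* bitpos += padding bits */
  return pos;
}
```

For the testcase here (extract 16 bits at position 14 of a 30-bit value in a 32-bit mode), big-endian extraction must start at bit 16, matching the lsr #14 of the 30-bit value (14 + 2 padding bits) that the fixed compiler emits.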
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #7 from Wilco --- (In reply to rguent...@suse.de from comment #6) > On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > > > > --- Comment #5 from Wilco --- > > (In reply to Wilco from comment #4) > > > (In reply to ktkachov from comment #2) > > > > Created attachment 45386 [details] > > > > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > > > > > > > I'm attaching the full LLVM aarch64 output. > > > > > > > > The output you quoted is with -funroll-loops. If that's not given, GCC > > > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my > > > > testing). > > > > > > > > Is there anything we can do to make the default unrolling a bit more > > > > aggressive? > > > > > > I don't think the RTL unroller works at all. It doesn't have the right > > > settings, and doesn't understand how to unroll, so we always get > > > inefficient > > > and bloated code. > > > > > > To do unrolling correctly it has to be integrated at tree level - for > > > example when vectorization isn't possible/beneficial, unrolling might > > > still > > > be a good idea. > > > > To add some numbers to the conversation, the gain LLVM gets from default > > unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017. > > > > This clearly shows there is huge potential from unrolling, *if* we can teach > > GCC to unroll properly like LLVM. That means early unrolling, using good > > default settings and using a trailing loop rather than inefficient peeling. > > I don't see why this cannot be done on RTL where we have vastly more > information of whether there are execution resources that can be > used by unrolling. 
> Note we also want unrolling to interleave instructions to not rely on > pre-reload scheduling which in turn means having a good eye on register > pressure (again sth not very well handled on GIMPLE)

The main issue is that other loop optimizations are done at the tree level, so things like addressing modes, loop invariants and CSEs are decided on the non-unrolled version. Then when we unroll in RTL we end up with very non-optimal code. A typical unrolled loop starts like this:

	add	x13, x2, 1
	add	x14, x2, 2
	add	x11, x2, 3
	add	x10, x2, 4
	ldr	w30, [x4, x13, lsl 2]
	add	x9, x2, 5
	add	x5, x2, 6
	add	x12, x2, 7
	ldr	d23, [x3, x2, lsl 3]
	... rest of unrolled loop

So basically it decides to create a new induction variable for every unrolled copy in the loop. This often leads to spills simply because it creates far too many redundant addressing instructions. It also blocks scheduling between iterations since the alias optimization doesn't appear to understand simple constant differences between indices. So unrolling should definitely be done at a high level, just like vectorization.
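The two unrolling styles can be contrasted as C functions that compute the same result; both function names are made up for illustration. The first mirrors what the RTL unroller effectively produces (a fresh induction variable per unrolled copy), the second what tree-level unrolling gives (one index, constant offsets that map directly onto base+imm addressing modes).

```c
#include <assert.h>

/* Style 1: one induction variable per unrolled copy, each used once -
   extra adds and extra register pressure, as in the asm above.  */
long sum_per_copy_ivs (const long *a, long n)   /* n a multiple of 4 */
{
  long s = 0;
  for (long i = 0; i < n; i += 4)
    {
      long i1 = i + 1, i2 = i + 2, i3 = i + 3;  /* redundant IVs */
      s += a[i]; s += a[i1]; s += a[i2]; s += a[i3];
    }
  return s;
}

/* Style 2: a single IV with constant offsets; each access is
   base + small immediate, so no per-copy address arithmetic.  */
long sum_base_imm (const long *a, long n)       /* n a multiple of 4 */
{
  long s = 0;
  for (long i = 0; i < n; i += 4)
    s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
  return s;
}
```

At the C level the two are equivalent; the difference only matters for the RTL the compiler has to clean up afterwards.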
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 --- Comment #5 from Wilco --- (In reply to Wilco from comment #4) > (In reply to ktkachov from comment #2) > > Created attachment 45386 [details] > > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > > > I'm attaching the full LLVM aarch64 output. > > > > The output you quoted is with -funroll-loops. If that's not given, GCC > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my > > testing). > > > > Is there anything we can do to make the default unrolling a bit more > > aggressive? > > I don't think the RTL unroller works at all. It doesn't have the right > settings, and doesn't understand how to unroll, so we always get inefficient > and bloated code. > > To do unrolling correctly it has to be integrated at tree level - for > example when vectorization isn't possible/beneficial, unrolling might still > be a good idea. To add some numbers to the conversation, the gain LLVM gets from default unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017. This clearly shows there is huge potential from unrolling, *if* we can teach GCC to unroll properly like LLVM. That means early unrolling, using good default settings and using a trailing loop rather than inefficient peeling.
[Bug tree-optimization/88760] GCC unrolling is suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #4 from Wilco --- (In reply to ktkachov from comment #2) > Created attachment 45386 [details] > aarch64-llvm output with -Ofast -mcpu=cortex-a57 > > I'm attaching the full LLVM aarch64 output. > > The output you quoted is with -funroll-loops. If that's not given, GCC > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my > testing). > > Is there anything we can do to make the default unrolling a bit more > aggressive? I don't think the RTL unroller works at all. It doesn't have the right settings, and doesn't understand how to unroll, so we always get inefficient and bloated code. To do unrolling correctly it has to be integrated at tree level - for example when vectorization isn't possible/beneficial, unrolling might still be a good idea.
[Bug tree-optimization/88739] [7/8/9 Regression] Big-endian union bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88739 --- Comment #29 from Wilco --- (In reply to Richard Biener from comment #26) > Did anybody test the patch? Testing on x86_64 will be quite pointless... Well that generates _18 = BIT_FIELD_REF <_2, 16, 14>; and becomes: ubfx x1, x20, 2, 16 This extracts bits 2-17 of the 30-bit value instead of bits 14-29. The issue is that we're using a bitfield reference on a value that is claimed not to be a bitfield in comment 6. So I can't see how using BIT_FIELD_REF could ever work correctly.
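As a sanity check of the mismatch described, here is a small model of the ubfx semantics (extract width bits starting at bit lsb, zero-extended); the helper is illustrative, not compiler or architecture code.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics of AArch64 "ubfx dst, src, #lsb, #width":
   extract WIDTH bits starting at bit LSB, zero-extended.
   Assumes width < 64 so the mask computation doesn't overflow.  */
static uint64_t ubfx (uint64_t src, unsigned lsb, unsigned width)
{
  return (src >> lsb) & ((1ull << width) - 1);
}
```

With a test value whose bits 14-29 are all set (0x3FFFC000), extracting at lsb 14 yields the full 0xFFFF field, while the wrongly-generated extraction at lsb 2 yields 0xF000 - a different value, demonstrating the bug.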