[Bug libstdc++/80579] std::vector::reserve should not require T to be moveable.

2017-04-30 Thread ville.voutilainen at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80579

--- Comment #4 from Ville Voutilainen  ---
(In reply to Marc Glisse from comment #2)
> If I remember correctly (very doubtful), a paper was presented about option
> 2, and option 1 was rejected (although we could still provide it as an
> extension if someone provides a patch and it is clean enough that
> maintainers don't feel this will complicate maintenance).

The only semi-clean way to do it requires if constexpr and therefore C++17.
While we have on some occasions made an effort to support types with deleted
move operations, the committee is strongly recommending that we shouldn't.

As far as Carlo's actual use case goes, std::atomics don't go into vectors
by design, because vector may shuffle things around and such shuffling is
not an atomic operation.
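
To make the "if constexpr" remark concrete, here is a minimal sketch of what option 1 (relaxing reserve for non-movable types) might look like. Everything in it is hypothetical: toy_vector and pinned are illustrative names, this is not libstdc++ code, and exception safety is deliberately ignored. It assumes C++17.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <new>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical toy container for illustration; this is NOT libstdc++'s vector.
template <typename T>
class toy_vector {
    std::allocator<T> alloc_;
    T* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t cap_ = 0;

public:
    toy_vector() = default;
    toy_vector(const toy_vector&) = delete;
    toy_vector& operator=(const toy_vector&) = delete;

    ~toy_vector() {
        for (std::size_t i = 0; i < size_; ++i) data_[i].~T();
        if (data_) alloc_.deallocate(data_, cap_);
    }

    std::size_t size() const { return size_; }
    std::size_t capacity() const { return cap_; }
    T& operator[](std::size_t i) { return data_[i]; }

    void reserve(std::size_t n) {
        if (n <= cap_) return;
        if constexpr (std::is_move_constructible_v<T>) {
            // Movable types: relocate the existing elements as usual.
            // (Simplified: a throwing move constructor would leak here.)
            T* fresh = alloc_.allocate(n);
            for (std::size_t i = 0; i < size_; ++i) {
                ::new (static_cast<void*>(fresh + i)) T(std::move(data_[i]));
                data_[i].~T();
            }
            if (data_) alloc_.deallocate(data_, cap_);
            data_ = fresh;
        } else {
            // Non-movable types: growing is only possible while empty.
            if (size_ != 0)
                throw std::logic_error("cannot grow a non-empty vector of non-movable T");
            if (data_) alloc_.deallocate(data_, cap_);
            data_ = alloc_.allocate(n);
        }
        cap_ = n;
    }

    template <typename... Args>
    T& emplace_back(Args&&... args) {
        if (size_ == cap_) reserve(cap_ ? 2 * cap_ : 1);
        T* p = ::new (static_cast<void*>(data_ + size_)) T(std::forward<Args>(args)...);
        ++size_;
        return *p;
    }
};

// Not movable: the atomic member deletes the implicit copy and move constructors.
struct pinned {
    std::atomic_bool flag{false};
};
```

With this sketch, toy_vector<pinned> can reserve(8) while empty and then emplace_back up to eight elements; a later reserve(16) on the non-empty container throws instead of failing to compile.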

[Bug libstdc++/80579] std::vector::reserve should not require T to be moveable.

2017-04-30 Thread carlo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80579

Carlo Wood  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #3 from Carlo Wood  ---
Yeah, there seems to have been confusion (by me) about how to invoke clang via
the irc bot that I used... I was under the impression that since reserve()
existed before move constructors existed it just had to work when I did nothing
special, and I didn't, I just used a std::atomic_bool as member in my class
(which in turn caused the move constructor to be deleted; I didn't delete it
manually).

I guess that this is not a bug then.
Sorry,
Carlo Wood

[Bug libstdc++/80579] std::vector::reserve should not require T to be moveable.

2017-04-30 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80579

--- Comment #2 from Marc Glisse  ---
I think there were discussions a couple years (?) ago in the committee about
extending std::vector for types that do not satisfy its current requirements. I
remember roughly 3 options:
- relax restrictions on the current functions. For instance, if a type is
non-movable, reserve is an operation that only works on empty vectors (and
emplace_back only works if size

[Bug libstdc++/80579] std::vector::reserve should not require T to be moveable.

2017-04-30 Thread ville.voutilainen at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80579

Ville Voutilainen  changed:

   What|Removed |Added

 CC||ville.voutilainen at gmail dot com
  Component|c++ |libstdc++

--- Comment #1 from Ville Voutilainen  ---
Well, clang 5.0 on wandbox rejects the code:
https://wandbox.org/permlink/W6hDjCOfqNRecpnU

vector::reserve requires MoveInsertable(*), B isn't MoveInsertable.
This is not an implementation bug, libstdc++ conforms to what the
standard specifies.

(*) ..because reserve might reallocate, so if it does, the elements
need to be moved to a new buffer. An implementation is not required
to copy the elements if moving them isn't valid, quite the opposite;
an implementation is allowed to assume that it can move. We could
be really nice and do that as a response to the violation of the
MoveInsertable precondition.
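
As an illustration of the requirement, the following sketch checks that a class with an atomic member is neither movable nor copyable, and shows two ways to hold such objects without calling reserve. The names flagged, make_sized_vector and grow_deque are made up for this example; with the default allocator, MoveInsertable essentially means "constructible from an rvalue of T".

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <type_traits>
#include <vector>

// Same shape as the class described in the report: an atomic member
// implicitly deletes the copy and move constructors.
struct flagged {
    std::atomic_bool ready{false};
};

static_assert(!std::is_move_constructible<flagged>::value, "flagged cannot be moved");
static_assert(!std::is_copy_constructible<flagged>::value, "flagged cannot be copied");

// Sizing the vector at construction only default-constructs the elements,
// so no move is ever needed as long as the vector is not resized later.
std::size_t make_sized_vector(std::size_t n) {
    std::vector<flagged> v(n);
    return v.size();
}

// std::deque does not relocate existing elements when growing at either end,
// so emplace_back works for non-movable types.
std::size_t grow_deque(std::size_t n) {
    std::deque<flagged> d;
    for (std::size_t i = 0; i < n; ++i) d.emplace_back();
    return d.size();
}
```

Neither alternative gives the "reserve then fill" pattern of the report; they only avoid the operations that need to move elements.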

[Bug testsuite/80580] New: GIMPLEFE ICE on invalid code (fuzz testing)

2017-04-30 Thread miyuki at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80580

Bug ID: 80580
   Summary: GIMPLEFE ICE on invalid code (fuzz testing)
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: testsuite
  Assignee: unassigned at gcc dot gnu.org
  Reporter: miyuki at gcc dot gnu.org
  Target Milestone: ---

Created attachment 41290
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41290&action=edit
test cases

I did some fuzz testing of the GIMPLE front end and found several ICEs.

I realize that the GIMPLE FE is intended for internal use in the GCC test suite,
so the requirements for its robustness are not as high as for the user-facing
front ends. Nevertheless, I think correct handling of erroneous input would be
useful for reducing GIMPLE code from real-world bug reports in the future
(because tools like C-Reduce tend to generate lots of erroneous intermediate
inputs).

I used a technique created by John Regehr, which is described in his blog
http://blog.regehr.org/archives/1284 to generate these test cases. Kudos to
John.

There are 46 test cases in the archive and they all produce ICEs with distinct
backtraces. Although they seem to be different bugs, I decided not to file 46
bug reports :).

Example: 

$ cat test001.c
__GIMPLE a() {
  if
  goto

$ cc1 -fgimple -w test001.c
test001.c: In function 'a':
test001.c:3:3: error: expected '(' before 'goto'
   goto
   ^~~~
test001.c:3:3: internal compiler error: Segmentation fault
0xbde80f crash_signal
/home/miyuki/gcc/src/gcc/toplev.c:337
0x62d467 tree_check
/home/miyuki/gcc/src/gcc/tree.h:3076
0x62d467 i_label_binding
/home/miyuki/gcc/src/gcc/c/c-decl.c:289
0x62d467 lookup_label(tree_node*)
/home/miyuki/gcc/src/gcc/c/c-decl.c:3567
0x62d624 lookup_label_for_goto(unsigned int, tree_node*)
/home/miyuki/gcc/src/gcc/c/c-decl.c:3615
0x6b9adf c_parser_gimple_if_stmt
/home/miyuki/gcc/src/gcc/c/gimple-parser.c:1318
0x6b9adf c_parser_gimple_compound_statement
/home/miyuki/gcc/src/gcc/c/gimple-parser.c:172
0x6b9adf c_parser_parse_gimple_body(c_parser*)
/home/miyuki/gcc/src/gcc/c/gimple-parser.c:92
0x6a2b2b c_parser_declaration_or_fndef
/home/miyuki/gcc/src/gcc/c/c-parser.c:2104
0x6aa913 c_parser_external_declaration
/home/miyuki/gcc/src/gcc/c/c-parser.c:1469
0x6ab1d1 c_parser_translation_unit
/home/miyuki/gcc/src/gcc/c/c-parser.c:1349
0x6ab1d1 c_parse_file()
/home/miyuki/gcc/src/gcc/c/c-parser.c:18181
0x708582 c_common_parse_file()
/home/miyuki/gcc/src/gcc/c-family/c-opts.c:1107
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.

P.S. I have a patch series to fix some of these bugs. I am planning to rebase,
retest and post these patches soon.

[Bug c++/80579] New: std::vector::reserve should not require T to be moveable.

2017-04-30 Thread carlo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80579

Bug ID: 80579
   Summary: std::vector::reserve should not require T to be
moveable.
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: carlo at gcc dot gnu.org
  Target Milestone: ---

g++ 7.0.1 fails on

#include <vector>
struct B { B(B&&) = delete; };
std::vector<B> v;
int main() { v.reserve(8); }

error: use of deleted function 'B::B(B&&)'

while clang 5.0.0 compiles it.

[Bug testsuite/65941] FAIL: g++.dg/other/pr59492.C: no such instruction: rdrand

2017-04-30 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65941

--- Comment #2 from Tom de Vries  ---
Created attachment 41289
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41289&action=edit
tentative patch

[Bug sanitizer/80578] New: -fsanitize=undefined report yields memory leak

2017-04-30 Thread gcc at gms dot tf
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80578

Bug ID: 80578
   Summary: -fsanitize=undefined report yields memory leak
   Product: gcc
   Version: 6.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gcc at gms dot tf
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org
  Target Milestone: ---

Compiling a program with both -fsanitize=address and -fsanitize=undefined,
where the undefined sanitizer complains about a UB issue, yields a memory leak
which is detected by LeakSanitizer.

Example:

$ cat main.cc 
#include <iostream>

struct A { virtual ~A()=default; int a; };
struct B { virtual ~B()=default; int a; };

int main(int argc, char **argv)
{
  A *a = new A;
  a->a = argc;
  std::cout << a->a << '\n';
  B *b = reinterpret_cast<B*>(a);
  delete b;
  return 0;
}
$ ./a.out 
1
main.cc:12:10: runtime error: member call on address 0x6020eff0 which does
not point to an object of type 'B'
0x6020eff0: note: object is of type 'A'
 01 00 80 5f  f8 17 40 00 00 00 00 00  01 00 00 00 be be be be  00 00 00 00 00
00 00 00  00 00 00 00
  ^~~
  vptr for 'A'

=
==10149==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 2 object(s) allocated from:
#0 0x7f806bb20210 in realloc (/lib64/libasan.so.3+0xc7210)
#1 0x7f806b763033  (/lib64/libstdc++.so.6+0x92033)

SUMMARY: AddressSanitizer: 4 byte(s) leaked in 2 allocation(s).

$ echo $?
1


Expected behaviour: Just the runtime error message and no reported memory
leaks.


GDB says that this is in:

(gdb) l *0x92033
0x92033 is in d_growable_string_callback_adapter (cp-demangle.c:3863).


When compiling without undefined sanitizer the leak is gone:


$ g++ -fsanitize=address -g main.cc 
$ ./a.out 
1
$ echo $?
0


Also, as expected, when just compiling with undefined sanitizer:

$ g++  -fsanitize=undefined -g main.cc
$ ./a.out 
1
main.cc:12:10: runtime error: member call on address 0x01250c20 which does
not point to an object of type 'B'
0x01250c20: note: object is of type 'A'
 00 00 00 00  70 10 40 00 00 00 00 00  01 00 00 00 00 00 00 00  00 00 00 00 00
00 00 00  11 04 00 00
  ^~~
  vptr for 'A'
$ echo $?
0

[Bug c++/80577] New: Avoid using adj in member function pointers

2017-04-30 Thread drepper.fsp+rhbz at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80577

Bug ID: 80577
   Summary: Avoid using adj in member function pointers
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: drepper.fsp+rhbz at gmail dot com
  Target Milestone: ---

Consider the following code:

struct foo final {
  int a = 0, b = 0;
  int get1() const { return a; }
  int get2() const { return a + b; }
};

foo f;
int (foo::*mfp)() const = &foo::get1;

int get()
{
  return (f.*mfp)();
}

When compiled get() looks on x86-64 like this:

movq    mfp+8(%rip), %rax
leaq    f(%rax), %rdi
jmp     *mfp(%rip)

The compiler knows the type 'foo'.  It can determine that there is no multiple
inheritance.  This means that the adj field in the member function pointer will
always be zero.  Hence the generated code should be

movl    $f, %esi
jmp     *mfp(%rip)
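
For readers unfamiliar with the adj field, the following sketch peeks at the representation of a member function pointer. It assumes the two-word { ptr, adj } layout used by the Itanium C++ ABI on x86-64; mfp_repr, decode and call are names invented for this example, and the memcpy-based inspection is not portable to other ABIs.

```cpp
#include <cstddef>
#include <cstring>

struct foo final {
    int a = 0, b = 0;
    int get1() const { return a; }
    int get2() const { return a + b; }
};

// Assumed layout (Itanium C++ ABI): function address or vtable offset,
// followed by the this-pointer adjustment.
struct mfp_repr {
    std::ptrdiff_t ptr;
    std::ptrdiff_t adj;
};

// Copy the raw bytes of the member function pointer into the assumed layout.
mfp_repr decode(int (foo::*mfp)() const) {
    static_assert(sizeof(mfp) == sizeof(mfp_repr),
                  "this sketch assumes a two-word member function pointer");
    mfp_repr r;
    std::memcpy(&r, &mfp, sizeof r);
    return r;
}

// The call site discussed in the report, parameterised for testing.
int call(const foo& f, int (foo::*mfp)() const) {
    return (f.*mfp)();
}
```

Since foo has no base classes, every pointer to one of its member functions should decode with adj equal to zero, which is the property the report asks the compiler to exploit.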

[Bug tree-optimization/80576] New: dead strcpy and strncpy followed by memset not eliminated

2017-04-30 Thread msebor at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80576

Bug ID: 80576
   Summary: dead strcpy and strncpy followed by memset not
eliminated
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: msebor at gcc dot gnu.org
  Target Milestone: ---

Similar to bug 80487, due to the subsequent memset that overwrites the contents
of the destination arrays, the strcpy and strncpy calls in the following two
functions constitute dead stores and could be eliminated.  The output shows
that GCC does not yet take advantage of this optimization opportunity.  The
first case might be related to (or the same as) bug 79716.

$ cat y.c && gcc  -O2 -S -Wall -Wextra -Wpedantic
-fdump-tree-optimized=/dev/stdout y.c
void sink (void*);

void f (const char *s)
{
  char a[256];

  __builtin_strcpy (a, s);   // dead store
  __builtin_memset (a, 0, sizeof a); 

  sink (a);
}

void g (const char *s)
{
  char a[256];

  __builtin_strncpy (a, s, sizeof a);   // dead store
  __builtin_memset (a, 0, sizeof a);   

  sink (a);
}


;; Function f (f, funcdef_no=0, decl_uid=1795, cgraph_uid=0, symbol_order=0)

f (const char * s)
{
  char a[256];

  <bb 2> [100.00%]:
  __builtin_strcpy (&a, s_2(D));
  __builtin_memset (&a, 0, 256);
  sink (&a);
  a ={v} {CLOBBER};
  return;

}



;; Function g (g, funcdef_no=1, decl_uid=1799, cgraph_uid=1, symbol_order=1)

g (const char * s)
{
  char a[256];

  <bb 2> [100.00%]:
  __builtin_strncpy (&a, s_2(D), 256);
  __builtin_memset (&a, 0, 256);
  sink (&a);
  a ={v} {CLOBBER};
  return;

}
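
To spell out what "dead store" means here, the sketch below compares f with a hand-written version that omits the strcpy. The sink body, the captured buffer and the helper names are hypothetical stand-ins added so the effect can be observed; in the report sink is only declared.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for whatever sink() would observe.
static char captured[256];

void sink(void* p) { std::memcpy(captured, p, sizeof captured); }

// As in the report: the strcpy result is overwritten before anyone reads it.
void f(const char* s) {
    char a[256];
    std::strcpy(a, s);               // dead store
    std::memset(a, 0, sizeof a);
    sink(a);
}

// What the optimizer could produce: only the memset remains.
void f_without_dead_store(const char*) {
    char a[256];
    std::memset(a, 0, sizeof a);
    sink(a);
}

// Helper to check what sink() received.
bool captured_is_all_zero() {
    for (std::size_t i = 0; i < sizeof captured; ++i)
        if (captured[i] != 0) return false;
    return true;
}
```

For any source string that fits in the array, both versions hand the same all-zero buffer to sink, so dropping the copy does not change observable behaviour.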

[Bug tree-optimization/79224] [7/8 Regression] Large C-Ray slowdown

2017-04-30 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79224

--- Comment #14 from Jan Hubicka  ---
Author: hubicka
Date: Sun Apr 30 15:02:11 2017
New Revision: 247417

URL: https://gcc.gnu.org/viewcvs?rev=247417&root=gcc&view=rev
Log:

PR ipa/79224
* ipa-inline-analysis.c (dump_predicate): Add optional parameter NL.
(account_size_time): Use two predicates - exec_pred and
nonconst_pred_ptr.
(evaluate_conditions_for_known_args): Compute both clause and
nonspec_clause.
(evaluate_properties_for_edge): Evaluate both clause and
nonspec_clause.
(inline_summary_t::duplicate): Update.
(estimate_function_body_sizes): Calculate exec and nonconst predicates
separately.
(compute_inline_parameters): Likewise.
(estimate_edge_size_and_time): Update calculation of time.
(estimate_node_size_and_time): Compute both time and nonspecialized
time.
(estimate_ipcp_clone_size_and_time): Update.
(inline_merge_summary): Update.
(do_estimate_edge_time): Update.
(do_estimate_edge_size): Update.
(do_estimate_edge_hints): Update.
(inline_read_section, inline_write_summary): Stream both new
predicates.
* ipa-inline.c (compute_uninlined_call_time): Take uninlined_call_time
as argument.
(compute_inlined_call_time): Cleanup.
(big_speedup_p): Update.
(edge_badness): Update.
* ipa-inline.h (INLINE_TIME_SCALE): Remove.
(size_time_entry): Replace predicate by exec_predicate and
nonconst_predicate.
(edge_growth_cache_entry): Cache both time and nonspecialized time.
(estimate_edge_time): Return also nonspec_time.
(reset_edge_growth_cache): Update.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-analysis.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h

[Bug c++/80575] New: unnecessary virtual function table support in member function call

2017-04-30 Thread drepper.fsp+rhbz at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80575

Bug ID: 80575
   Summary: unnecessary virtual function table support in member
function call
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: drepper.fsp+rhbz at gmail dot com
  Target Milestone: ---

Created attachment 41288
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41288&action=edit
example for ineffectiveness of final with member function pointers

Support for virtual functions requires more complex code at the call site
through a member function pointer.  gcc has some support to elide the handling
of virtual functions.  Take the following code:

struct foo {
  int a = 0, b =0;
  int get1() const { return a; }
  int get2() const { return a + b; }
};

foo f;
int (foo::*mfp)() const = &foo::get1;

int get()
{
  return (f.*mfp)();
}

The generated code (on x86-64) for get is:

movq    mfp+8(%rip), %rax
leaq    f(%rax), %rdi
jmp     *mfp(%rip)

Perfectly fine.  If now the variable 'f' is changed to a pointer the code looks
like this:

movq    mfp(%rip), %rax
movq    mfp+8(%rip), %rdi
addq    f(%rip), %rdi
testb   $1, %al
je      .L4
movq    (%rdi), %rdx
movq    -1(%rdx,%rax), %rax
.L4:
jmp     *%rax

This is due to the fact that other classes can be derived from 'foo'
and those could have virtual functions.

To prevent this it should be possible to mark 'foo' as final.  If you do this
nothing changes, though.

--- u.cc-old    2017-04-30 16:30:50.704469153 +0200
+++ u.cc        2017-04-30 16:24:56.619672469 +0200
@@ -1,4 +1,4 @@
-struct foo {
+struct foo final {
   int a = 0, b =0;
   int get1() const { return a; }
   int get2() const { return a + b; }

[Bug tree-optimization/80574] GCC fail to optimize nested ternary

2017-04-30 Thread gjl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80574

Georg-Johann Lay  changed:

   What|Removed |Added

 CC||gjl at gcc dot gnu.org

--- Comment #2 from Georg-Johann Lay  ---
GCC performs poorly on code expanded from macros that (recursively) duplicate
macro arguments.  Some time ago I dug into it, and the reason was that it
failed to recognize MIN_EXPR / MAX_EXPR because some optimizations factor out
common subexpressions.  Yet another problem is that the expressions inside the
conditions get promoted to int resp. unsigned, whereas the target values remain
types smaller than int.  (It was actually more complex code that implemented
saturation by nested MIN / MAX expressions.)

As a workaround, you can try to use inline functions so that GCC will
recognize MAX_EXPR and MIN_EXPR as expected.  The drawback is that you need a
series of macros for each input type like: int8_t, uint8_t, int16_t, ...
(unsigned and signed should be enough though).

Sample workaround for unsigned char:

#define MAX_1(VAR, ...) \
  (VAR)

#define MAX_2(VAR, ...) \
  (((VAR)>MAX_1(__VA_ARGS__))?(VAR):MAX_1(__VA_ARGS__))

__attribute__((__always_inline__))
static inline unsigned char max2 (unsigned char a, unsigned char b)
{
return MAX_2 (a, b);
}

#undef  MAX_2
#define MAX_2(a, b)  max2 (a, b)

#define MAX_3(VAR, ...) \
  (MAX_2 ((VAR), MAX_2(__VA_ARGS__)))

#define MAX_4(VAR, ...) \
  (MAX_2 ((VAR), MAX_3(__VA_ARGS__)))

#define MAX_5(VAR, ...) \
  (MAX_2 ((VAR), MAX_4(__VA_ARGS__)))

#define MAX_6(VAR, ...) \
  (MAX_2 ((VAR), MAX_5(__VA_ARGS__)))


The .original dump as generated with -fdump-tree-original reads now:


;; Function max2 (null)
{
  return MAX_EXPR ;
}

;; Function f1_unsigned (null)
{
  return max2 (a1, max2 (a2, max2 (a3, max2 (a1, max2 (a2, a3)))));
}

;; Function f2_unsigned (null)
{
  return max2 (a1, max2 (a2, a3));
}

and after inline expansion everything is nice with MAX_EXPR whereas your
original code leads to:


;; Function f1_unsigned (null)
{
  return (unsigned char) MAX_EXPR  a3 ?
(int) a2 : (int) a3, (int) a1>, (int) a3>, (int) a2>, (int) a1>;
}

;; Function f2_unsigned (null)
{
  return (unsigned char) MAX_EXPR  a3 ? (int) a2 : (int) a3, (int) a1>;
}


Not all of the expressions are recognized as MAX_EXPR.
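
A stripped-down version of the inline-function approach, without the macro layer, might look like the sketch below. The signatures of f1_unsigned and f2_unsigned are assumptions, since the attachment is not reproduced here; only the call shapes are taken from the dump above.

```cpp
// Two-argument maximum as an inline function, so each argument is
// evaluated exactly once and has type unsigned char.
static inline unsigned char max2(unsigned char a, unsigned char b) {
    return a > b ? a : b;
}

// Redundant nesting, following the shape of the dump for f1_unsigned.
unsigned char f1_unsigned(unsigned char a1, unsigned char a2, unsigned char a3) {
    return max2(a1, max2(a2, max2(a3, max2(a1, max2(a2, a3)))));
}

// Minimal nesting, following the shape of the dump for f2_unsigned.
unsigned char f2_unsigned(unsigned char a1, unsigned char a2, unsigned char a3) {
    return max2(a1, max2(a2, a3));
}
```

Both functions compute the same value for every input, so an optimizer that recognizes the maximum pattern can in principle reduce the first to the second.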

[Bug tree-optimization/80574] GCC fail to optimize nested ternary

2017-04-30 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80574

--- Comment #1 from Marc Glisse  ---
With -fdump-tree-original, the signed case looks perfect:

  return MAX_EXPR , a1>, a3>,
a2>, a1>;

(which reassoc eventually simplifies)
while in the unsigned case, we fail to recognize the innermost max:

  return (unsigned char) MAX_EXPR  a3 ?
(int) a2 : (int) a3, (int) a1>, (int) a3>, (int) a2>, (int) a1>;

and we also fail during gimple, probably because of the conversions.

[Bug tree-optimization/80574] New: GCC fail to optimize nested ternary

2017-04-30 Thread SztfG at yandex dot ru
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80574

Bug ID: 80574
   Summary: GCC fail to optimize nested ternary
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: SztfG at yandex dot ru
  Target Milestone: ---

Created attachment 41287
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41287&action=edit
nested ternary sample

GCC fails to optimize a ternary chain (nested ternary) for finding the maximum
value with unsigned types. See the attachment.

[Bug other/80573] New: ICE: internal compiler error: in assign_temp, at function.c:961

2017-04-30 Thread gjl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80573

Bug ID: 80573
   Summary: ICE: internal compiler error: in assign_temp, at
function.c:961
   Product: gcc
   Version: 6.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: gjl at gcc dot gnu.org
  Target Milestone: ---

This PR looks similar to PR71210, however I am just using v6.3 from 2016-12-22
whereas PR71210 has been fixed for v6.2 on 2016-05-20.

== C-Program ==

extern char cards[];

void fun (void)
{
__asm volatile ("" : "+r" (cards));
}


$ avr-gcc ice-inv.c -S  -v

Target: avr
Configured with: ../../gcc.gnu.org/gcc-6-branch/configure --target=avr
--prefix=/local/gnu/install/gcc-6-avr-mingw32 --host=i386-mingw32
--build=x86_64-linux-gnu --enable-languages=c,c++ --disable-nls
--disable-shared --enable-lto --with-dwarf2 --with-gnu-ld --with-gnu-as
Thread model: single
gcc version 6.3.1 20161222 [gcc-6-branch revision 243886] (GCC) 
GNU C11 (GCC) version 6.3.1 20161222 [gcc-6-branch revision 243886] (avr)
compiled by GNU C version 3.4.5 (mingw-vista special r2), GMP version
4.3.2, MPFR version 2.4.2, MPC version 0.8.1, isl version none

ice-inv.c: In function 'fun':
ice-inv.c:5:5: internal compiler error: in assign_temp, at function.c:961
 __asm volatile ("" : "+r" (cards));
 ^

ice-inv.c:5:5: internal compiler error: Segmentation fault

[Bug target/69460] ARM Cortex M0 produces suboptimal code vs Cortex M3

2017-04-30 Thread strntydog at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69460

strntydog at gmail dot com changed:

   What|Removed |Added

Version|5.2.0   |6.3.1

--- Comment #4 from strntydog at gmail dot com ---
Ok, so I just tested to see if this problem with Cortex M0/M0+ code generation
persists in GCC 6.3.1, which is the latest GCC Binary distributed by the Arm
Embedded folks.  And it does.

To put the Optimisation failure into perspective, this is the difference
between the 6 tests in the test case:

Test 1 - Code Size is 40% Bigger for M0, and the Function is 114% bigger.
Test 2 - Code Size is 20% bigger for M0, and the Function is 44% bigger.
Test 3 - Code Size is same between M0 and M3, but the Function is 43% bigger.
Test 4 - Code Size is 40% Bigger for M0, and the Function is 86% bigger.
Test 5 - Code Size is same between M0 and M3, but the Function is 14% bigger.
Test 6 - Code Size is 38% Bigger for M0, and the Function is 100% bigger.

These are HUGE.  

This means that on average these functions will run about 22% slower than they
should and consume 67% more FLASH space than they should. But in the worst case
from my tests, they could be over twice as large as they need to be and need
40% more instructions to achieve the same thing.

This problem is easily shown to occur when accessing memory locations at known
addresses, something which microcontroller programs do all the time. This
problem affects every single M0 Application written which is compiled with GCC,
wasting Flash and running slower.

Note: Code Size refers to the number of instructions in the function, and the
function size is the code size plus its Literal data.  Code size is a measure
of performance on the M0, because more instructions means more cycles to
execute. And Function size is a measure of flash wastage.

[Bug bootstrap/69790] LTO compiling GCC does not work (lib/bfd-plugin path has unclear location)

2017-04-30 Thread trippels at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69790

Markus Trippelsdorf  changed:

   What|Removed |Added

 Status|WAITING |RESOLVED
 Resolution|--- |INVALID

--- Comment #5 from Markus Trippelsdorf  ---
Ah, you're doing it wrong. 
Please use --with-build-config=bootstrap-lto instead.
Also --disable-werror would get you past warnings.

Let's close as invalid.

[Bug bootstrap/69790] LTO compiling GCC does not work (lib/bfd-plugin path has unclear location)

2017-04-30 Thread dilyan.palauzov at aegee dot org
rror 1
make[4]: Leaving directory '/git/gcc/x86_64-pc-linux-gnu/libgomp'
make[3]: *** [Makefile:493: all] Error 2
make[3]: Leaving directory '/git/gcc/x86_64-pc-linux-gnu/libgomp'


and gcc 6.3.1 20170430, when compiled with 6.3.1 20170421, with the FLAGS above
and ./configure --enable-threads=posix --enable-nls --enable-interpreter
--with-system-zlib --enable-libgcj-multifile --enable-languages=all
--enable-targets=all --with-system-unwind --without-x
--with-linker-hash-style=gnu --disable-multilib --enable-shared

fails at stage1 with

make[3]: Leaving directory '/git/gcc/host-x86_64-pc-linux-gnu/libdecnumber'
make[3]: Entering directory '/git/gcc/host-x86_64-pc-linux-gnu/gcc'
make[3]: Leaving directory '/git/gcc/host-x86_64-pc-linux-gnu/gcc'
Checking multilib configuration for libgcc...
make[3]: Entering directory '/git/gcc/x86_64-pc-linux-gnu/libgcc'
# If this is the top-level multilib, build all the other
# multilibs.
# Early copyback; see "all" above for the rationale.  The
# early copy is necessary so that the gcc -B options find
# the right startup files when linking shared libgcc.
/bin/sh ../.././libgcc/../mkinstalldirs ../../host-x86_64-pc-linux-gnu/gcc
parts="crtbegin.o crtbeginS.o crtbeginT.o crtend.o crtendS.o crtprec32.o
crtprec64.o crtprec80.o crtfastmath.o";   
\
for file in $parts; do  \
  rm -f ../../host-x86_64-pc-linux-gnu/gcc/$file;   \
  /usr/local/bin/install -c -m 644 $file ../../host-x86_64-pc-linux-gnu/gcc/;  
\
  case $file in \
*.a)\
  /usr/local/x86_64-pc-linux-gnu/bin/ranlib
../../host-x86_64-pc-linux-gnu/gcc/$file ;; \
  esac; \
done
# @multilib_flags@ is still needed because this may use
# /git/gcc/host-x86_64-pc-linux-gnu/gcc/xgcc
-B/git/gcc/host-x86_64-pc-linux-gnu/gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/
-B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem
/usr/local/x86_64-pc-linux-gnu/include -isystem
/usr/local/x86_64-pc-linux-gnu/sys-includeand -O2  -g -O2 -Wall -Wextra
-pipe -O3 -fno-fat-lto-objects -flto -DIN_GCC-W -Wall -Wno-narrowing
-Wwrite-strings -Wcast-qual -Wno-format -Wstrict-prototypes
-Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fpic
-mlong-double-80 -DUSE_ELF_SYMVER -g -DIN_LIBGCC2 -fbuilding-libgcc
-fno-stack-protector  directly.
# @multilib_dir@ is not really necessary, but sometimes it has
# more uses than just a directory name.
/bin/sh ../.././libgcc/../mkinstalldirs .
/git/gcc/host-x86_64-pc-linux-gnu/gcc/xgcc
-B/git/gcc/host-x86_64-pc-linux-gnu/gcc/ -B/usr/local/x86_64-pc-linux-gnu/bin/
-B/usr/local/x86_64-pc-linux-gnu/lib/ -isystem
/usr/local/x86_64-pc-linux-gnu/include -isystem
/usr/local/x86_64-pc-linux-gnu/sys-include-O2  -g -O2 -Wall -Wextra -pipe
-O3 -fno-fat-lto-objects -flto -DIN_GCC-W -Wall -Wno-narrowing
-Wwrite-strings -Wcast-qual -Wno-format -Wstrict-prototypes
-Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fpic
-mlong-double-80 -DUSE_ELF_SYMVER -g -DIN_LIBGCC2 -fbuilding-libgcc
-fno-stack-protector  -shared -nodefaultlibs -Wl,--soname=libgcc_s.so.1
-Wl,--version-script=libgcc.map -o ./libgcc_s.so.1.tmp -g -O2 -Wall -Wextra
-pipe -O3 -fno-fat-lto-objects -flto -B./ _muldi3_s.o _negdi2_s.o _lshrdi3_s.o
_ashldi3_s.o _ashrdi3_s.o _cmpdi2_s.o _ucmpdi2_s.o _clear_cache_s.o
_trampoline_s.o __main_s.o _absvsi2_s.o _absvdi2_s.o _addvsi3_s.o _addvdi3_s.o
_subvsi3_s.o _subvdi3_s.o _mulvsi3_s.o _mulvdi3_s.o _negvsi2_s.o _negvdi2_s.o
_ctors_s.o _ffssi2_s.o _ffsdi2_s.o _clz_s.o _clzsi2_s.o _clzdi2_s.o _ctzsi2_s.o
_ctzdi2_s.o _popcount_tab_s.o _popcountsi2_s.o _popcountdi2_s.o _paritysi2_s.o
_paritydi2_s.o _powisf2_s.o _powidf2_s.o _powixf2_s.o _mulsc3_s.o _muldc3_s.o
_mulxc3_s.o _divsc3_s.o _divdc3_s.o _divxc3_s.o _bswapsi2_s.o _bswapdi2_s.o
_clrsbsi2_s.o _clrsbdi2_s.o _fixunssfsi_s.o _fixunsdfsi_s.o _fixunsxfsi_s.o
_fixsfdi_s.o _fixdfdi_s.o _fixxfdi_s.o _fixunssfdi_s.o _fixunsdfdi_s.o
_fixunsxfdi_s.o _floatdisf_s.o _floatdidf_s.o _floatdixf_s.o _floatundisf_s.o
_floatundidf_s.o _floatundixf_s.o _divdi3_s.o _moddi3_s.o _udivdi3_s.o
_umoddi3_s.o _udiv_w_sdiv_s.o _udivmoddi4_s.o cpuinfo_s.o sfp-exceptions_s.o
addtf3_s.o divtf3_s.o multf3_s.o negtf2_s.o subtf3_s.o unordtf2_s.o fixtfsi_s.o
fixunstfsi_s.o floatsitf_s.o floatunsitf_s.o fixtfdi_s.o fixunstfdi_s.o
floatditf_s.o floatunditf_s.o fixtfti_s.o fixunstfti_s.o floattitf_s.o
floatuntitf_s.o extendsftf2_s.o extenddftf2_s.o extendxftf2_s.o trunctfsf2_s.o
trunctfdf2_s.o trunctfxf2_s.o getf2_s.o letf2_s.o eqtf2_s.o _divtc3_s.o
_multc3_s.o _powitf2_s.o enable-execute-stack_s.o unwind-dw2_s.o
unwind-dw2-fde-dip_s.o unwind-sjlj_s.o unwind-c_s.o emutls_s.o libgcc.a -lc &&
rm -f ./libgcc_s.so && if [ -f ./libgcc_s.so.1 ]; then mv -f ./libgcc_s.so.1
./libgcc_s.so.1.backup; else true; fi &&am

[Bug lto/77954] LTO_STREAMER_DEBUG ICE with OpenMP SIMD clones

2017-04-30 Thread tschwinge at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77954

Thomas Schwinge  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #2 from Thomas Schwinge  ---
(In reply to Martin Liška from comment #1)
> Can I reproduce it on a normal GCC (no target compiler) on my
> x86_64-linux-gnu target?
> 
> After setting the macro, following works for me:
> ./gcc/xgcc -B gcc 
> /home/marxin/Programming/gcc/libgomp/testsuite/libgomp.fortran/declare-simd-4.f90
>  -flto -c -mavx -fno-use-linker-plugin -fno-inline

Drop "-c", and "-fno-use-linker-plugin", and add "-fopenmp" (and
"-Bx86_64-pc-linux-gnu/libgomp/{,.libs/}", or similar):

$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
-Bbuild-gcc/x86_64-pc-linux-gnu/libgomp/{,.libs/}
source-gcc/libgomp/testsuite/libgomp.fortran/declare-simd-4.f90 -flto -fopenmp 
lto1: internal compiler error: in lto_orig_address_remove, at
lto-streamer.c:369
0x901fbd lto_orig_address_remove(tree_node*)
[...]/source-gcc/gcc/lto-streamer.c:369
0x902f77 lto_read_tree
[...]/source-gcc/gcc/lto-streamer-in.c:1363
0x902f77 lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned
int)
[...]/source-gcc/gcc/lto-streamer-in.c:1475
0x903192 lto_input_scc(lto_input_block*, data_in*, unsigned int*, unsigned
int*)
[...]/source-gcc/gcc/lto-streamer-in.c:1387
0x5c02a6 lto_read_decls
[...]/source-gcc/gcc/lto/lto.c:1694
0x5c2f5d lto_file_finalize
[...]/source-gcc/gcc/lto/lto.c:2038
0x5c2f5d lto_create_files_from_ids
[...]/source-gcc/gcc/lto/lto.c:2048
0x5c2f5d lto_file_read
[...]/source-gcc/gcc/lto/lto.c:2089
0x5c2f5d read_cgraph_and_symbols
[...]/source-gcc/gcc/lto/lto.c:2801
0x5c2f5d lto_main()
[...]/source-gcc/gcc/lto/lto.c:3306
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.
lto-wrapper: fatal error: build-gcc/gcc/xgcc returned 1 exit status
compilation terminated.
/usr/bin/ld: lto-wrapper failed
collect2: error: ld returned 1 exit status

[Bug c++/80572] New: crash reporting warning from precompiled header

2017-04-30 Thread th at zoon dot cc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80572

Bug ID: 80572
   Summary: crash reporting warning from precompiled header
   Product: gcc
   Version: 6.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: th at zoon dot cc
  Target Milestone: ---

Created attachment 41286
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41286&action=edit
Test case

[build]
g++ -c -o pch.h.gch pch.h
g++ -c -o foo.o -include pch.h foo.cc

[pch.h]
#include 

[foo.cc]
bool minor();


When pch.h is #included directly, it stops crashing and prints a warning about
minor being #defined somewhere. So I'm guessing it crashes when it tries to
emit the same warning from the pch.

Version:
Current Arch: g++ (GCC) 6.3.1 20170306

Error message: 
foo.cc:2:1: internal compiler error: Segmentation fault
 bool minor();
 ^~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.

[Bug target/80571] New: AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2017-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571

            Bug ID: 80571
           Summary: AVX allows multiple vcvtsi2ss/sd (integer ->
                    float/double) to reuse a single dep-breaking vxorps,
                    even hoisting it out of loops
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

See also more discussion on a clang bug about what's optimal for scalar int->fp
conversion (https://bugs.llvm.org/show_bug.cgi?id=22024#c11).

The important point is that using vcvtsi2sd/ss with a src vector reg different
from the destination reg allows reusing the same "safe" source to avoid false
dependencies, using one vxorps for multiple conversions.  (Even hoisted out
of loops).

// see https://godbolt.org/g/c9k4SH for gcc8, clang4, icc17, and MSVC CL
void cvti64f64_loop(double *dp, long long *ip) {
    for (int i=0; i<1000 ; i++) {
        dp[i] = ip[i];
        // int64 can't vectorize without AVX512
    }
}


compiles with gcc6.3.1 and gcc8 20170429 -O3 -march=haswell:

cvti64f64_loop:
        xorl    %eax, %eax
.L6:
        vxorpd  %xmm0, %xmm0, %xmm0
        vcvtsi2sdq  (%rsi,%rax), %xmm0, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret

But it could compile to

        xorl    %eax, %eax
        vxorpd  %xmm1, %xmm1, %xmm1     # hoisted out of the loop
.L6:
        vcvtsi2sdq  (%rsi,%rax), %xmm1, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret



This trick requires AVX, of course.
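
For illustration, here is a rough sketch of the same idea using intrinsics.
The function name cvti64f64_hoisted is made up for this example and is not
part of the test case above.  _mm_cvtsi64_sd takes the merge source as an
explicit first argument, so the zeroed vector can be created once outside the
loop.  This only expresses the intent: with AVX enabled the compiler is free
to keep that register live across iterations, but nothing guarantees it, and
without AVX the two-operand encoding forces the destination to be the merge
source anyway.  The scalar fallback is only there so the sketch builds on
other targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the report): convert n int64 values to
   double, passing one loop-invariant zeroed vector as the merge source. */
void cvti64f64_hoisted(double *dp, const long long *ip, size_t n) {
    assert(dp != NULL && ip != NULL);
#if defined(__x86_64__) && defined(__SSE2__)
    /* One "safe" source register, created once before the loop. */
    const __m128d safe_src = _mm_setzero_pd();
    for (size_t i = 0; i < n; i++) {
        /* Low element = (double)ip[i]; high element is copied from safe_src. */
        __m128d v = _mm_cvtsi64_sd(safe_src, ip[i]);
        _mm_store_sd(&dp[i], v);   /* store only the low double */
    }
#else
    /* Portable fallback with the same results. */
    for (size_t i = 0; i < n; i++)
        dp[i] = (double)ip[i];
#endif
}
```

Comparing the asm for this with and without -mavx is one way to see whether
a given compiler version re-zeroes the register inside the loop.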

If the loop needs a vector constant, you can use that instead of a
specially-zeroed register.  (The upper elements don't have to be 0.0 for ...ss
instructions, and the x86-64 SysV ABI allows passing scalar float/double args
with garbage (not zeros) in the high bytes.  This already happens in some code,
so library implementers already have to avoid assuming that high elements are
always clear, even with hand-written asm).

clang already hoists vxorps out of loops, but only by looking for "dead"
registers, not by reusing a reg holding a constant.  (e.g. enabling those
multiplies in the godbolt link to use up all the xmm regs gets clang to put a
vxorps in the loop instead of merging into a reg holding one of them.)


Using a constant has the downside that loading the constant might cache-miss,
delaying a bunch of loop-setup work from happening (and loop-setup is where
int->float is probably most common).  So maybe it would be best to vxorps-zero
a register and use it for int->FP conversion if several other instructions
depend on that before any depend on the constant.  The zeroed register can
later have a constant loaded into it or whatever, and use that as a safe source
register inside the loop.

Or if there's a constant that's about to be used with the result of the int->fp
conversion, it's maybe not bad to load that constant and then use that xmm reg
as the merge-target for a scalar conversion.  If OOO execution can do the load
early (and it doesn't miss in cache), then there's no extra latency.  Or if it
does miss, then there's only an extra cycle or two of latency for that
dependency chain vs. spending an instruction on vxorpd-zeroing a target to
convert into, separate from loading the constant.





This trick works for non-loops, of course:

void cvti32f32(float *A, float *B, int x, int y) {
    *B = y;
    *A = x;
}

currently compiles to (-O3 -march=haswell)

cvti32f32:
vxorps  %xmm0, %xmm0, %xmm0
vcvtsi2ss   %ecx, %xmm0, %xmm0
vmovss  %xmm0, (%rsi)
vxorps  %xmm0, %xmm0, %xmm0
vcvtsi2ss   %edx, %xmm0, %xmm0
vmovss  %xmm0, (%rdi)

But could compile (with no loss of safety against false deps) to

cvti32f32:
vxorps  %xmm1, %xmm1, %xmm1 # reused for both
vcvtsi2ss  %ecx, %xmm1, %xmm0
vmovss  %xmm0, (%rsi)
vcvtsi2ss  %edx, %xmm1, %xmm0   # this convert can go into xmm1 if
                                # we want to have both in regs at once
vmovss  %xmm0, (%rdi)
ret


Or, with the same amount of safety and fewer fused-domain uops on Intel
SnB-family CPUs, and not requiring AVX (but of course works fine with AVX):

cvti32f32:
movd  %ecx, %xmm0         # breaks deps on xmm0
cvtdq2ps  %xmm0, %xmm0    # 1 uop for port1
movss %xmm0, (%rsi)
movd  %edx, %xmm0
cvtdq2ps  %xmm0, %xmm0
movss %xmm0, (%rdi)
ret
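
The movd + cvtdq2ps sequence can also be written with SSE2 intrinsics.  This
is only a sketch: the name cvti32f32_packed is invented here, and the
compiler is not obliged to emit exactly the instructions named in the
comments.  The scalar fallback is only there so it builds on non-x86
targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical variant of cvti32f32 (not from the report): use the packed
   int32 -> float conversion on a register whose upper lanes were zeroed by
   the integer -> vector move, so there is no merge into an old register. */
void cvti32f32_packed(float *A, float *B, int x, int y) {
    assert(A != NULL && B != NULL);
#if defined(__SSE2__)
    __m128i vy = _mm_cvtsi32_si128(y);       /* movd: writes the whole xmm reg */
    *B = _mm_cvtss_f32(_mm_cvtepi32_ps(vy)); /* cvtdq2ps, then take the low float */
    __m128i vx = _mm_cvtsi32_si128(x);
    *A = _mm_cvtss_f32(_mm_cvtepi32_ps(vx));
#else
    *B = (float)y;
    *A = (float)x;
#endif
}
```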

But this is only good for 32-bit integer -> float, not double.  (Because
cvtdq2pd also takes a port5 uop, unlike conversion to ps).  The code-size is
also slightly larger, taking 2 

[Bug target/80570] New: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2017-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570

            Bug ID: 80570
           Summary: auto-vectorizing int->double conversion should use
                    half-width memory operands to avoid shuffles, instead
                    of load+extract
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

When auto-vectorizing int->double conversion, gcc loads a full-width vector
into a register and then unpacks the upper half to feed (v)cvtdq2pd.  e.g. with
AVX, we get a 256b load and then vextracti128.

It's even worse with an unaligned src pointer with
-mavx256-split-unaligned-load, where it does vinsertf128 -> vextractf128,
without ever doing anything with the full 256b vector!

On Intel SnB-family CPUs, this will bottleneck the loop on port5 throughput,
because VCVTDQ2PD reg -> reg needs a port5 uop as well as a port1 uop.  (And
vextracti128 can only run on the shuffle unit on port5).

VCVTDQ2PD with a memory source operand doesn't need the shuffle port at all on
Intel Haswell and later, just the FP-add unit and a load, so it's a much better
choice.  (Throughput of one per clock on Sandybridge and Haswell, 2 per clock
on Skylake).  It's still 2 fused-domain uops, though, so I guess it can't
micro-fuse the load according to Agner Fog's testing.  (Or 3 on SnB).

I'm pretty sure using twice as many half-width memory operands is not worse on
other AVX CPUs either (AMD BD-family or Zen, or KNL), vs. max-width loads and
extracting the high half.


void cvti32f64_loop(double *dp, int *ip) {
    // ICC avoids the mistake when it doesn't emit a prologue to align the pointers
#ifdef __GNUC__
    dp = __builtin_assume_aligned(dp, 64);
    ip = __builtin_assume_aligned(ip, 64);
#endif
    for (int i=0; i<10000 ; i++) {
        double tmp = ip[i];
        dp[i] = tmp;
    }
}

https://godbolt.org/g/329C3P
gcc.godbolt.org's "gcc7" snapshot: g++ (GCC-Explorer-Build) 8.0.0 20170429
(experimental)

gcc -O3 -march=sandybridge
cvti32f64_loop:
        xorl    %eax, %eax
.L2:
        vmovdqa (%rsi,%rax), %ymm0
        vcvtdq2pd   %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rdi,%rax,2)
        vcvtdq2pd   %xmm0, %ymm0
        vmovapd %ymm0, 32(%rdi,%rax,2)
        addq    $32, %rax
        cmpq    $40000, %rax
        jne     .L2
        vzeroupper
        ret

gcc does the same thing for -march=haswell, but uses vextracti128.  This is
obviously really silly.

For comparison, clang 4.0 -O3 -march=sandybridge -fno-unroll-loops emits:
        xorl    %eax, %eax
.LBB0_1:
        vcvtdq2pd   (%rsi,%rax,4), %ymm0
        vmovaps %ymm0, (%rdi,%rax,8)
        addq    $4, %rax
        cmpq    $10000, %rax            # imm = 0x2710
        jne     .LBB0_1
        vzeroupper
        retq

This should come close to one 256b store per clock (on Haswell), even with
unrolling disabled.



With -march=nehalem, gcc gets away with it for this simple not-unrolled loop
(without hurting throughput I think), but only because this strategy
effectively unrolls the loop (doing two stores per add + cmp/jne), and Nehalem
can run shuffles on two execution ports (so the pshufd can run on port1, while
the cvtdq2pd can run on ports 1+5).  So it's 10 fused-domain uops per 2 stores
instead of 5 per 1 store.  Depending on how the loop buffer handles
non-multiple-of-4 uop counts, this might be a wash.  (Of course, with any other
work in the loop, or with unrolling, the memory-operand strategy is much
better).

CVTDQ2PD's memory operand is only 64 bits, so even the non-AVX version doesn't
fault if misaligned.
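
As a sketch of the half-width-load strategy in source form (SSE2 only, so it
converts two ints per step rather than four): the name cvti32f64_halfwidth is
hypothetical, and whether the 64-bit load gets folded into cvtdq2pd's memory
operand is up to the compiler.  The point is just that no full-width load or
shuffle is needed to feed the conversion.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the report): int -> double conversion that
   feeds each packed conversion from a 64-bit load of exactly two ints. */
void cvti32f64_halfwidth(double *dp, const int *ip, size_t n) {
    assert(dp != NULL && ip != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        /* 64-bit load: no alignment requirement, upper half zeroed. */
        __m128i two_ints = _mm_loadl_epi64((const __m128i *)&ip[i]);
        /* cvtdq2pd: the low two int32 lanes become two doubles. */
        __m128d two_dbls = _mm_cvtepi32_pd(two_ints);
        _mm_storeu_pd(&dp[i], two_dbls);
    }
#endif
    /* Scalar tail (and the whole loop on non-SSE2 targets). */
    for (; i < n; i++)
        dp[i] = (double)ip[i];
}
```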

--

It's even more horrible without aligned pointers, when the sandybridge version
(which splits unaligned 256b loads/stores) uses vinsertf128 to emulate a 256b
load, and then does vextractf128 right away:

 inner_loop:   # gcc8 -march=sandybridge without __builtin_assume_aligned
vmovdqu (%r8,%rax), %xmm0
vinsertf128 $0x1, 16(%r8,%rax), %ymm0, %ymm0
vcvtdq2pd   %xmm0, %ymm1
vextractf128    $0x1, %ymm0, %xmm0
vmovapd %ymm1, (%rcx,%rax,2)
vcvtdq2pd   %xmm0, %ymm0
vmovapd %ymm0, 32(%rcx,%rax,2)

This is obviously really really bad, and should probably be checked for and
avoided in case there are things other than int->double autovec that could lead
to doing this.

---

With -march=skylake-avx512, gcc does the AVX512 version of the same thing: zmm
load and then extract the upper 256b

[Bug bootstrap/80565] [8 Regression] ICE at -O2 and -O3 in 32-bit mode (not 64-bit) on x86_64-linux-gnu (in edge_badness, at ipa-inline.c:1028)

2017-04-30 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80565

Martin Liška  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
      Known to work|                            |7.0.1
           Keywords|                            |ice-on-valid-code
   Last reconfirmed|                            |2017-04-30
                 CC|                            |hubicka at ucw dot cz,
                   |                            |marxin at gcc dot gnu.org
     Ever confirmed|0                           |1
            Summary|ICE at -O2 and -O3 in       |[8 Regression] ICE at -O2
                   |32-bit mode (not 64-bit) on |and -O3 in 32-bit mode (not
                   |x86_64-linux-gnu (in        |64-bit) on x86_64-linux-gnu
                   |edge_badness, at            |(in edge_badness, at
                   |ipa-inline.c:1028)          |ipa-inline.c:1028)
   Target Milestone|---                         |8.0
      Known to fail|                            |8.0

--- Comment #1 from Martin Liška  ---
Confirmed, started with r247380.