[Bug testsuite/80759] gcc.target/x86_64/abi/ms-sysv FAILs

2017-05-20 Thread daniel.santos at pobox dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80759

--- Comment #9 from Daniel Santos  ---
Thank you again for the assistance.

(In reply to r...@cebitec.uni-bielefeld.de from comment #8)
> Daniel,
>
> > Would you be so kind as to test this on Solaris for me please?  I don't have
> > access to a Solaris machine and I've never set it up before, so I wouldn't 
> > even
> know where to start to try to build an OpenSolaris VM.
>
> sure, though there's no need at all (except for the .struct part) to do
> the testing on Solaris.  I believe there are ready-made Solaris/x86
> VirtualBox images, though.

I've found a few, so I'm going to try them out when I get some time.  Oracle
even has something on their downloads page.  I haven't used Solaris since the
early aughts.

> For the multilib problem, you can easily
> configure gcc for i686-pc-linux-gnu with --enable-targets=all on a
> Linux/x86_64 box (with a few necessary 32-bit development packages
> added), so the default multilib is non-x86_64, while the x86_64 multilib
> is only used with -m64.

Hmm, I seem to be having problems getting this to work.  Would I configure with
--target=i686-pc-linux-gnu --enable-targets=all --enable-multilib?

> However, I still don't understand why you are jumping through all these
> hoops in ms-sysv.exp doing the compilations etc. manually rather than
> just relying on dg-runtest or similar.  This would avoid all this
> multilib trouble nicely, and massively reduce ms-sysv.exp.

Well, quite frankly, because dg-runtest et al. don't offer support for tests
that use code generators.  The generated headers using the default options are
between 4.4 and 6 MiB in size, and there are more things that need to be tested
(-fsplit-stack, to name one) that aren't tested now.  I would also like to add
a feature where defining an environment variable generates more comprehensive
tests that I wouldn't want to run for every test run (as it could take hours
with --enable-checking=all,rtl).

The most behaviorally similar test currently in the tree is
gcc/testsuite/gcc.dg/compat/struct-layout-1.exp, which builds a generator
(using remote_exec), runs the generator (remote_exec again) to generate sources
for all tests, and then builds and runs each test (using compat-execute).
Calls to remote_exec are not automatically parallelized.  I don't fully
understand how the gcc/testsuite/lib/compat.exp library works, but I'm guessing
that calls to compat-execute are parallelized by dejagnu.

The scheme that struct-layout-1 uses builds the generator and creates sources
for all of the tests in job directory (i.e.,
gcc/testsuite/gcc{,1,2,3,4,5,6,etc.}/gcc.dg-struct-layout-1).  They take up
1.21 MiB per job, so -j48 results in 58 MiB of space usage.  My generator and
generated sources are larger, and currently take about 11.65 MiB per job, so
-j48 would eat 559 MiB of disk space, even though there are only 6 tests at the
moment.  This could be mitigated if there was a way to build and run the
generator only once and have the output go to a directory shared across jobs,
but I'm not yet aware of any such existing mechanism.

This doesn't mean that my approach is the only solution.  In fact, I built this
with Mike Stump's counsel and later discovered that when I ran multiple jobs,
each test was run once per job, so -j8 would run all of the tests 8 times,
rather than split them apart!  That's when I added the parallelization scheme.

So if you have some better ideas on how to accomplish this then please do
present them.  Or maybe I'm misunderstanding something about the way
dg-runtest, gcc_target_compile, etc. work in relation to parallelism?  My
understanding is that if I use them in succession for a single test run (i.e.,
build the generator, run the generator, build & run the test), they could
end up being run on different jobs and then fail.

> One or two nits about PR management, btw.: it is good practice to take
> the PR if you're working on it.  And just add the URL to the patch
> submission into the URL field.
>
> Thanks.
>
>   Rainer

I very much appreciate hints and guidance about proper PR management, coding
standards, etiquette, procedures, norms, etc.! I'm still pretty new to this
project but I find it really enjoyable.  However, I don't seem to have the
privileges to change those fields.  Do I need to seek advanced privileges from
somebody?

[Bug c++/80746] [concepts] ICE evaluating constraints for concepts with dependent template parameters

2017-05-20 Thread hstong at ca dot ibm.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80746

Hubert Tong  changed:

   What|Removed |Added

 CC||hstong at ca dot ibm.com

--- Comment #3 from Hubert Tong  ---
The description matches the case in Bug 79759.
The tracebacks for both start from the same line when I try a recent build.

[Bug target/80846] New: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

Bug ID: 80846
   Summary: auto-vectorized AVX2 horizontal sum should narrow to
128b right away, to be more efficient for Ryzen and
Intel
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc's tune=generic strategy for horizontal sums at the end of an AVX2
auto-vectorized reduction is sub-optimal even for Intel CPUs, and horrible for
CPUs that split 256b ops into 2 uops (i.e. AMD, including Ryzen).

The first step should always be vextracti/f128 to reduce down to xmm vectors. 
It has low latency and good throughput on all CPUs.  Keep narrowing in half
until you're down to one element.  See also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026
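
As an aside, here is a minimal intrinsics sketch of that narrow-first strategy
for 32-bit elements (my own illustration, assuming AVX2 and <immintrin.h>; this
is not gcc's output):

#include <immintrin.h>

int hsum_epi32_avx2(__m256i v) {
  __m128i lo   = _mm256_castsi256_si128(v);
  __m128i hi   = _mm256_extracti128_si256(v, 1);    // vextracti128 $1
  __m128i sum  = _mm_add_epi32(lo, hi);             // now a 128b problem
  __m128i hi64 = _mm_unpackhi_epi64(sum, sum);      // vpunpckhqdq
  sum = _mm_add_epi32(sum, hi64);
  __m128i hi32 = _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)); // pshufd
  sum = _mm_add_epi32(sum, hi32);
  return _mm_cvtsi128_si32(sum);                    // vmovd
}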

The one exception is a horizontal sum of 8-bit elements without overflow, where
you should use VPSADBW ymm against a zeroed ymm to do a horizontal add without
overflow, and then extract and hsum the resulting four 64-bit values.  (For
signed 8-bit, you can range-shift to unsigned and then correct the scalar hsum
result.)



gcc tends to keep working with ymm vectors until the last step, even using
VPERM2I128 (which is horrible on AMD CPUs, e.g. Ryzen: 8 uops with 3c
throughput/latency vs. VEXTRACTI128 being 1 uop with 1c latency and 0.33c
throughput).

// https://godbolt.org/g/PwX6yT
int sumint(const int arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  int sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum+=arr[i];
  return sum;
}

Compiled with gcc8.0.0 20170520  -mavx2 -funroll-loops -O3 -std=gnu11, we get

vpxor   %xmm7, %xmm7, %xmm7
leaq    4096(%rdi), %rax
.L24:
vpaddd  (%rdi), %ymm7, %ymm0
addq    $256, %rdi               # doing this later would let more instructions use a disp8 instead of disp32
vpaddd  -224(%rdi), %ymm0, %ymm1
vpaddd  -192(%rdi), %ymm1, %ymm2
vpaddd  -160(%rdi), %ymm2, %ymm3
vpaddd  -128(%rdi), %ymm3, %ymm4
vpaddd  -96(%rdi), %ymm4, %ymm5
vpaddd  -64(%rdi), %ymm5, %ymm6
vpaddd  -32(%rdi), %ymm6, %ymm7  # unrolling without multiple accumulators loses a lot of the benefit.
cmpq    %rdi, %rax
jne .L24

# our single accumulator is currently in ymm7
vpxor   %xmm8, %xmm8, %xmm8 # Ryzen uops: 1  latency: x
vperm2i128  $33, %ymm8, %ymm7, %ymm9    # 8   3
vpaddd  %ymm7, %ymm9, %ymm10            # 2   1
vperm2i128  $33, %ymm8, %ymm10, %ymm11  # 8   3
vpalignr    $8, %ymm10, %ymm11, %ymm12  # 2   1
vpaddd  %ymm12, %ymm10, %ymm13          # 2   1
vperm2i128  $33, %ymm8, %ymm13, %ymm14  # 8   3
vpalignr    $4, %ymm13, %ymm14, %ymm15  # 2   1
vpaddd  %ymm15, %ymm13, %ymm0   # 2   1
vmovd   %xmm0, %eax # 1   3

vzeroupper
ret

Using x/ymm8-15 as src1 needs a 3-byte VEX prefix instead of 2-byte, so the
epilogue should reuse xmm0-6 to save code-size.  They're dead, and no x86 CPUs
have write-after-write dependencies.

More importantly, the shuffle strategy is just bad.  There should be only one
shuffle between each VPADDD.  I'd suggest

 vextracti128 $1, %ymm7, %xmm0
 vpaddd  %xmm7,%xmm0,%xmm0
 # Then a 128b hsum, which can use the same strategy as if we'd started with 128b
 vpunpckhqdq %xmm0,%xmm0,%xmm1  # Avoids an immediate, but without AVX use PSHUFD to copy+shuffle
 vpaddd  %xmm1,%xmm0,%xmm0
 vpshuflw    $0x4e,%xmm0,%xmm1  # or PSHUFD, or MOVSHDUP
 vpaddd  %xmm1,%xmm0,%xmm0
 vmovd   %xmm0,%eax

This is faster on Haswell by a significant margin, from avoiding the
lane-crossing VPERM2I128 shuffles.  It's also smaller.

All of these instructions are 1 uop / 1c latency on Ryzen (except movd), so
this is 7 uops / 9c latency.  GCC's current code is 36 uops, 17c on Ryzen. 
Things are similar on Bulldozer-family, where vector ops have at least 2c latency.

An interesting alternative is possible for the last narrowing step with BMI2:

 vmovq  %xmm0, %rax
 rorx   $32, %rax, %rdx
 add%edx, %eax

RORX can run on ports 0 and 6 on Intel CPUs that support it, and it's fast on
Ryzen (1c latency 0.25c throughput).  If there are further vector instructions,
this reduces pressure on the vector ALU ports.  The only CPU where this is
really good is Excavator (bdver4?), assuming vector shuffle and VPADDD are
still 2c latency each, while RORX and ADD are 1c.  (Agner Fog's spreadsheet
doesn't have an Excavator tab).

I think it's a code-size win, s

[Bug target/52991] attribute packed broken on mingw32?

2017-05-20 Thread sherpya at netfarm dot it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52991

--- Comment #22 from Gianluigi Tiesi  ---
(In reply to Ladislav Láska from comment #21)
> Hi!
> 
> I'm still seeing this problem on recent release 6.3.1, and it seems to be
> enabled by default on at least some builds (msys2 for example). 
> 
> Can I help somehow to get this into trunk sooner?
> 
> Thanks!

everyone uses -mno-ms-bitfields nowadays

[Bug target/80845] nvptx backend generates cvt.u32.u32

2017-05-20 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80845

--- Comment #2 from Tom de Vries  ---
(In reply to Tom de Vries from comment #0)
> //(insn 14 13 15 2
> // (set (reg:QI 23)
> //  (subreg/s/u:QI (reg:SI 44) 0)) 2 {*movqi_insn}
> // (nil))

For this insn, we enter nvptx_output_mov_insn with dst_inner == QI and src_inner
== SI. There are two clauses (disregarding the CONSTANT_P (src) one) that emit a
mov.u32 insn, but neither of them triggers:
...
  if (src_inner == dst_inner)
return "%.\tmov%t0\t%0, %1;";

  ...

  if (GET_MODE_SIZE (dst_inner) == GET_MODE_SIZE (src_inner))
return "%.\tmov.b%T0\t%0, %1;";
...

So, we end up at the default, cvt case:
...
  return "%.\tcvt%t0%t1\t%0, %1;";
...

[Bug target/80845] nvptx backend generates cvt.u32.u32

2017-05-20 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80845

--- Comment #1 from Tom de Vries  ---
Created attachment 41392
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41392&action=edit
Patch adding assert on cvt.t.t

Found PR using this patch.

[Bug target/80845] nvptx backend generates cvt.u32.u32

2017-05-20 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80845

Tom de Vries  changed:

   What|Removed |Added

 Target||nvptx
   Severity|normal  |minor

[Bug target/80845] New: nvptx backend generates cvt.u32.u32

2017-05-20 Thread vries at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80845

Bug ID: 80845
   Summary: nvptx backend generates cvt.u32.u32
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

When compiling f.i. this testcase:

$ gcc gcc/testsuite/gcc.target/nvptx/abi-complex-arg.c -S
...

we generate these ptx insns:
...
cvt.u32.u32 %r23, %r44;
cvt.u32.u32 %r24, %r45;
cvt.u32.u32 %r23, %r22;
cvt.u32.u32 %r25, %r24;
...

The first looks like this in more detail:
...
//(insn 14 13 15 2
// (set (reg:QI 23)
//  (subreg/s/u:QI (reg:SI 44) 0)) 2 {*movqi_insn}
// (nil))
cvt.u32.u32 %r23, %r44; // 14   *movqi_insn/1   
...

While ptxas seems to accept cvt.u32.u32, we should probably emit mov.u32
instead.

[Bug target/80844] New: OpenMP SIMD doesn't know how to efficiently zero a vector (it stores zeros and reloads)

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844

Bug ID: 80844
   Summary: OpenMP SIMD doesn't know how to efficiently zero a
vector (it stores zeros and reloads)
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

float sumfloat_omp(const float arr[]) {
  float sum=0;
   #pragma omp simd reduction(+:sum) aligned(arr : 64)
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}
// https://godbolt.org/g/6KnMXM

x86-64 gcc7.1 and gcc8-snapshot-20170520 -mavx2 -ffast-math -funroll-loops
-fopenmp -O3 emit:

# omitted integer code to align the stack by 32
vpxor   %xmm0, %xmm0, %xmm0  # tmp119
vmovaps %xmm0, -48(%rbp) # tmp119, MEM[(void *)&D.2373]
vmovaps %xmm0, -32(%rbp) # tmp119, MEM[(void *)&D.2373]
vmovaps -48(%rbp), %ymm8 # MEM[(float *)&D.2373], vect__23.20
# then the loop

The store-forwarding stall part of this is very similar to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

With gcc4/5/6, we get four integer stores like movq $0, -48(%rbp) before the
vector load.  Either way, this is ridiculous because vpxor already zeroed the
whole ymm vector, and causes a store-forwarding stall.

It's also silly because -ffast-math allows 0.0+x to optimize to x.  It could
start off by simply loading the first vector instead of adding it to 0, in this
special case where the loop count is a compile-time constant.
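
A rough sketch of that idea (mine, using AVX intrinsics; it assumes the same
64-byte-aligned, 1024-element input as the example above): peel the first
iteration so the first vector load becomes the accumulator, with no zeroing
and no store/reload at all:

#include <immintrin.h>

float sumfloat_peeled(const float *arr) {
  __m256 sum = _mm256_load_ps(arr);          // first vector is the accumulator
  for (int i = 8; i < 1024; i += 8)
    sum = _mm256_add_ps(sum, _mm256_load_ps(arr + i));
  // hsum: narrow in half each step (see bug 80846 for the full discussion)
  __m128 lo = _mm256_castps256_ps128(sum);
  __m128 hi = _mm256_extractf128_ps(sum, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_movehdup_ps(s));
  return _mm_cvtss_f32(s);
}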

---

It's not just AVX 256b vectors, although it's far worse there.

With just SSE2:

pxor    %xmm0, %xmm0
movaps  %xmm0, -24(%rsp)    # dead store
pxor    %xmm0, %xmm0


Or from older gcc versions, the same int store->vector reload
store-forwarding-stall inducing code.

-


Even though -funroll-loops unrolls, it doesn't use multiple accumulators to run
separate dep chains in parallel.  So it still bottlenecks on the 4 cycle
latency of VADDPS, instead of the 0.5c throughput (Example numbers for Intel
Skylake, and yes the results are the same with -march=skylake).  clang uses 4
vector accumulators, so it runs 4x faster when data is hot in L2.  (up to 8x is
possible with data hot in L1).  Intel pre-skylake has VADDPS latency=3c
throughput=1c, so there's still a factor of three to be had.  But Haswell has
FMA lat=5c, tput=0.5, so you need 10 accumulators to saturate the 2 load&FMA
per clock max throughput.
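
For reference, a plain-C sketch of the multiple-accumulator idea (illustrative
only; like the original it relies on -ffast-math, since it reorders the
additions).  Four independent dependency chains hide the VADDPS latency, and
the vectorizer can turn each chain into its own vector accumulator:

float sumfloat_4acc(const float arr[]) {
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (int i = 0; i < 1024; i += 4) {   // each sN depends only on itself
    s0 += arr[i+0];
    s1 += arr[i+1];
    s2 += arr[i+2];
    s3 += arr[i+3];
  }
  return (s0 + s1) + (s2 + s3);
}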

Is there already an open bug for this?  I know it's totally separate from this
issue.

[Bug c++/80841] Fails to match template specialization with polymorphic non-type template argument

2017-05-20 Thread daniel.kruegler at googlemail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80841

Daniel Krügler  changed:

   What|Removed |Added

 CC||daniel.kruegler@googlemail.
   ||com

--- Comment #1 from Daniel Krügler  ---
(In reply to Jason Bell from comment #0)

An interesting case. Reduced example:

//#
template 
struct A {};

template 
struct B {};

template 
struct B> 
{
  using result = T;
};

static double input;

int main() {
  using result1 = typename B>::result; // OK
  using result2 = typename B>::result; // OK
  using result3 = typename B>::result; // Error
}
//#

results in:

prog.cc: In function 'int main()':
prog.cc:18:59: error: 'result' in 'struct B >' does
not name a type
   using result3 = typename B>::result;
   ^~
Clang behaves exactly the same, and from the compiler behaviour it looks like a
non-deducible situation, but I fail to match this with the current non-deduced
context rules at the moment.

[Bug c++/80839] Memory leak in _Rb_tree

2017-05-20 Thread mail at kitsu dot me
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80839

mail at kitsu dot me changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #3 from mail at kitsu dot me ---
thanks

[Bug c++/80839] Memory leak in _Rb_tree

2017-05-20 Thread eugeni.stepanov at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80839

--- Comment #2 from Evgeniy Stepanov  ---
Hmm, where is the button to mark this resolved/wontfix?

[Bug c++/80839] Memory leak in _Rb_tree

2017-05-20 Thread eugeni.stepanov at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80839

Evgeniy Stepanov  changed:

   What|Removed |Added

 CC||eugeni.stepanov at gmail dot 
com

--- Comment #1 from Evgeniy Stepanov  ---
MemorySanitizer requires that all code in the program be instrumented; libc is
an exception, but libstdc++/libc++ is not.

https://clang.llvm.org/docs/MemorySanitizer.html#handling-external-code

[Bug bootstrap/80843] New: [8 Regression] bootstrap fails in stage1 on powerpc-linux-gnu

2017-05-20 Thread doko at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80843

Bug ID: 80843
   Summary: [8 Regression] bootstrap fails in stage1 on
powerpc-linux-gnu
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: doko at gcc dot gnu.org
  Target Milestone: ---

seen with trunk 20170520; trunk 20170512 succeeded to build.

original build log at
https://buildd.debian.org/status/fetch.php?pkg=gcc-snapshot&arch=powerpc&ver=20170520-1&stamp=1495309132&raw=0

/«PKGBUILDDIR»/build/./gcc/xgcc -B/«PKGBUILDDIR»/build/./gcc/
-B/usr/lib/gcc-snapshot/powerpc-linux-gnu/bin/
-B/usr/lib/gcc-snapshot/powerpc-linux-gnu/lib/ -isystem
/usr/lib/gcc-snapshot/powerpc-linux-gnu/include -isystem
/usr/lib/gcc-snapshot/powerpc-linux-gnu/sys-include -g -O2 -O2  -g -O2
-DIN_GCC -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wno-format
-Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition  -isystem
./include   -fPIC -mlong-double-128 -mno-minimal-toc -g -DIN_LIBGCC2
-fbuilding-libgcc -fno-stack-protector   -fPIC -mlong-double-128
-mno-minimal-toc -I. -I. -I../.././gcc -I../../../src/libgcc
-I../../../src/libgcc/. -I../../../src/libgcc/../gcc
-I../../../src/libgcc/../include -I../../../src/libgcc/../libdecnumber/dpd
-I../../../src/libgcc/../libdecnumber -DHAVE_CC_TLS  -Wno-type-limits -mvsx
-mfloat128 -mno-float128-hardware -I../../../src/libgcc/soft-fp
-I../../../src/libgcc/config/rs6000  -o _mulkc3.o -MT _mulkc3.o -MD -MP -MF
_mulkc3.dep  -c ../../../src/libgcc/config/rs6000/_mulkc3.c -fvisibility=hidden
-DHIDE_EXPORTS
*** Error in `/«PKGBUILDDIR»/build/./gcc/cc1': free(): invalid next size
(fast): 0x124eb500 ***
=== Backtrace: =
/lib/powerpc-linux-gnu/libc.so.6(+0x81adc)[0xf971adc]
/lib/powerpc-linux-gnu/libc.so.6(+0x8b254)[0xf97b254]
/lib/powerpc-linux-gnu/libc.so.6(+0x8bd74)[0xf97bd74]
/«PKGBUILDDIR»/build/./gcc/cc1(_Z12sbitmap_freeP17simple_bitmap_def+0x20)[0x104d2d2c]
/«PKGBUILDDIR»/build/./gcc/cc1[0x10c40ea4]
/«PKGBUILDDIR»/build/./gcc/cc1(_Z28try_shrink_wrapping_separateP15basic_block_def+0x3d8)[0x10c43f2c]
/«PKGBUILDDIR»/build/./gcc/cc1(_Z34thread_prologue_and_epilogue_insnsv+0xc0)[0x107adbc8]
/«PKGBUILDDIR»/build/./gcc/cc1[0x107aee34]
/«PKGBUILDDIR»/build/./gcc/cc1[0x107aef18]
/«PKGBUILDDIR»/build/./gcc/cc1(_Z16execute_one_passP8opt_pass+0x318)[0x10b0be00]
/«PKGBUILDDIR»/build/./gcc/cc1[0x10b0c1f8]
/«PKGBUILDDIR»/build/./gcc/cc1[0x10b0c238]
/«PKGBUILDDIR»/build/./gcc/cc1[0x10b0c238]
/«PKGBUILDDIR»/build/./gcc/cc1(_Z17execute_pass_listP8functionP8opt_pass+0x50)[0x10b0c2c0]
/«PKGBUILDDIR»/build/./gcc/cc1(_ZN11cgraph_node6expandEv+0x224)[0x105b1360]
/«PKGBUILDDIR»/build/./gcc/cc1[0x105b1ae4]
/«PKGBUILDDIR»/build/./gcc/cc1(_ZN12symbol_table7compileEv+0x2f4)[0x105b297c]
/«PKGBUILDDIR»/build/./gcc/cc1(_ZN12symbol_table25finalize_compilation_unitEv+0x150)[0x105b2c34]
/«PKGBUILDDIR»/build/./gcc/cc1[0x10c9206c]
/«PKGBUILDDIR»/build/./gcc/cc1[0x10c957e8]
/«PKGBUILDDIR»/build/./gcc/cc1(_ZN6toplev4mainEiPPc+0x1bc)[0x10c95c04]
/«PKGBUILDDIR»/build/./gcc/cc1(main+0x48)[0x118983c4]
/lib/powerpc-linux-gnu/libc.so.6(+0x21274)[0xf911274]
/lib/powerpc-linux-gnu/libc.so.6(__libc_start_main+0xdc)[0xf91147c]
=== Memory map: 
0010-0012 r-xp  00:00 0  [vdso]
0f8f-0fa8 r-xp  00:1c 21642306  
/lib/powerpc-linux-gnu/libc-2.24.so
0fa8-0fa9 r--p 0018 00:1c 21642306  
/lib/powerpc-linux-gnu/libc-2.24.so
0fa9-0faa rw-p 0019 00:1c 21642306  
/lib/powerpc-linux-gnu/libc-2.24.so
0fab-0fb8 r-xp  00:1c 21642302  
/lib/powerpc-linux-gnu/libm-2.24.so
0fb8-0fb9 r--p 000c 00:1c 21642302  
/lib/powerpc-linux-gnu/libm-2.24.so
0fb9-0fba rw-p 000d 00:1c 21642302  
/lib/powerpc-linux-gnu/libm-2.24.so
0fbb-0fbd r-xp  00:1c 21642231  
/lib/powerpc-linux-gnu/libz.so.1.2.8
0fbd-0fbe r--p 0001 00:1c 21642231  
/lib/powerpc-linux-gnu/libz.so.1.2.8
0fbe-0fbf rw-p 0002 00:1c 21642231  
/lib/powerpc-linux-gnu/libz.so.1.2.8
0fc0-0fc1 r-xp  00:1c 21642303  
/lib/powerpc-linux-gnu/libdl-2.24.so
0fc1-0fc2 r--p  00:1c 21642303  
/lib/powerpc-linux-gnu/libdl-2.24.so
0fc2-0fc3 rw-p 0001 00:1c 21642303  
/lib/powerpc-linux-gnu/libdl-2.24.so
0fc4-0fcd r-xp  00:1c 21621381  
/usr/lib/powerpc-linux-gnu/libgmp.so.10.3.2
0fcd-0fce r--p 0008 00:1c 21621381  
/usr/

[Bug tree-optimization/80842] New: gcc ICE at -O3 on x86_64-linux-gnu in "set_lattice_value"

2017-05-20 Thread helloqirun at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80842

Bug ID: 80842
   Summary: gcc ICE at -O3 on x86_64-linux-gnu in
"set_lattice_value"
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: helloqirun at gmail dot com
  Target Milestone: ---

The following code causes an ICE when compiled with the current gcc trunk at
-O3 on x86_64-linux-gnu in both 32- and 64-bit modes. 



$ gcc-trunk -v
Using built-in specs.
COLLECT_GCC=gcc-trunk
COLLECT_LTO_WRAPPER=/home/absozero/trunk/root-gcc/libexec/gcc/x86_64-pc-linux-gnu/8.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc/configure --prefix=/home/absozero/trunk/root-gcc
--enable-languages=c,c++ --disable-werror --enable-multilib
Thread model: posix
gcc version 8.0.0 20170520 (experimental) [trunk revision 248308] (GCC)

$ gcc-trunk -O3 abc.c
abc.c: In function ‘fn3’:
abc.c:10:6: internal compiler error: in set_lattice_value, at
tree-ssa-ccp.c:505
 void fn3() {
  ^~~
0xc9efc8 set_lattice_value
../../gcc/gcc/tree-ssa-ccp.c:505
0xca3f2c visit_assignment
../../gcc/gcc/tree-ssa-ccp.c:2322
0xca40da ccp_visit_stmt
../../gcc/gcc/tree-ssa-ccp.c:2396
0xd2f518 simulate_stmt
../../gcc/gcc/tree-ssa-propagate.c:241
0xd30fb2 process_ssa_edge_worklist
../../gcc/gcc/tree-ssa-propagate.c:341
0xd30fb2 ssa_propagate(ssa_prop_result (*)(gimple*, edge_def**, tree_node**),
ssa_prop_result (*)(gphi*))
../../gcc/gcc/tree-ssa-propagate.c:813
0xc9de10 do_ssa_ccp
../../gcc/gcc/tree-ssa-ccp.c:2436
0xc9de10 execute
../../gcc/gcc/tree-ssa-ccp.c:2480
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.




$ cat abc.c
unsigned a;
short b;
char c, d, e;
void fn1();
void fn2() {
  a++;
  for (; a;)
fn1(0, 0);
}
void fn3() {
  fn2();
l1:;
  unsigned char f;
  short g;
  unsigned char *h = &f;
  g += &h ? e ? g = 1 : 0 : 0;
  d = g;
  c *f;
  if (d & (b %= *h) < f * d / (d -= 0))
goto l1;
}

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #5 from Peter Cordes  ---
(In reply to Thiago Macieira from comment #3)
> (In reply to Jakub Jelinek from comment #1)
> > For double-word compare and exchange you need double-word alignment, so I
> > think the current alignment is correct.
> 
> The instruction manual says that CMPXCHG16B requires 128-bit alignment, but
> doesn't say the same for CMPXCHG8B. It says that the AC(0) alignment check
> fault could happen if it's not aligned, but doesn't say what the required
> alignment is.

The more important point is that simple loads and stores are not atomic on
cache-line splits, so requiring natural alignment for atomic objects would
avoid that.  LOCKed read-modify-write ops are also *much* slower on cache-line
splits.


#AC isn't really relevant, but I'd assume it requires 8B alignment since it's
really a single 8B atomic RMW.

#AC faults only happen if the kernel sets the AC bit in EFLAGS, which will
cause *any* unaligned access to fault.  Code all over the place assumes that
unaligned accesses are safe.  e.g. glibc memcpy commonly uses unaligned loads
for small non-power-of-2 sizes or unaligned inputs.  So you can't really enable
the AC flag with normal code.

I assume this is why Intel was lazy about documenting the exact details of #AC
behaviour for this instruction, or figured it was obvious.

[Bug libstdc++/80835] Reading a member of an atomic can load just that member, not the whole struct

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835

Peter Cordes  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=70490

--- Comment #1 from Peter Cordes  ---
related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490: lack of a way for
CPUs to advertise when 16B SSE loads/stores are atomic, and that cmpxchg16b
segfaults on mmap(PROT_READ), which is one of the reasons for not inlining
cmpxchg16b (https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html)

I tried to add a CC to Torvald Riegel, since a read-mostly pointer +
ABA-counter may be a relevant use-case for 16-byte atomic objects, where the
reads only want the current pointer value.

Using locking for 16B objects would break this use-case.

[Bug target/70490] __atomic_load_n(const __int128 *, ...) generates CMPXCHG16B with no warning

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #5 from Peter Cordes  ---
It would be great if there was a CPUID feature-bit for (aligned?) 16B
loads/stores being atomic.  IDK if that's likely to ever happen, but it might
be something we could ask CPU vendors for.  I suspect that a lot of current
CPUs do in practice have this feature, especially single-socket systems.  For
example, Intel Haswell's internal data paths between different layers of cache
are all at least 32B wide.

I mention single-socket because narrower transfers in a coherency protocol can
cause tearing even on CPUs with a 16B data path to/from L1d.  e.g. experimental
verification of non-atomicity on AMD Opteron 2435 (K10) with threads running on
separate sockets, connected with HyperTransport:
http://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations/7647825#7647825

Still, CPUID could report something that was detected at power-on, if there are
still cases where multi-socket coherency traffic only supports 8B atomicity.

---

There's also a difference between atomicity guarantees for stuff like MMIO
observed by non-CPU system devices, vs. only WB regions of normal DRAM observed
by other CPUs.

As I understand it, the current atomicity guarantees for aligned accesses up to
64b apply even for uncached accesses, except for the P6 unaligned guarantee
which specifically only applies to *cached* accesses.  Relevant quotes
extracted from the Intel's manual with commentary:
http://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic

[Bug testsuite/64221] contrib/compare_tests confused by c-c++-common/ubsan/shift-5.c

2017-05-20 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64221

Marc Glisse  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Marc Glisse  ---
I am not seeing it these days, so tentatively marking it as fixed...

[Bug fortran/80610] Compiler crashes ungraciously when large static array is initialized with anything other than zero

2017-05-20 Thread tkoenig at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80610

Thomas Koenig  changed:

   What|Removed |Added

   Keywords||memory-hog
 Status|RESOLVED|NEW
 Resolution|INVALID |---
   Severity|normal  |enhancement

--- Comment #15 from Thomas Koenig  ---
So, confirming as an enhancement.

[Bug fortran/80610] Compiler crashes ungraciously when large static array is initialized with anything other than zero

2017-05-20 Thread jvdelisle at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80610

--- Comment #14 from Jerry DeLisle  ---
(In reply to Steve Kargl from comment #13)
> On Sat, May 20, 2017 at 04:59:10AM +, jvdelisle at gcc dot gnu.org wrote:
> > 
> > Yes that will take some frontend magic and we have so few people to support
> > gfortran (for free remember) that we may not be able to get to it.
> > 
> > I don't think the report is invalid at all.
> > 
> 
> I have thought about this type of issue.
> 
> 2005-08-21  Steven G. Kargl  
> 
>   * array.c: Bump GFC_MAX_AC_EXPAND from 100 to 65535.
> 
> Good luck.

Exactly my point. I remember when we had this discussion before. We know it's
possible to do this all differently, but we don't have the resources to do it.
Other bugs are much higher priority.

[Bug bootstrap/80838] PGO/LTO bootstrapped compiler 5% slower than pure PGO bootstrapped one

2017-05-20 Thread hubicka at ucw dot cz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80838

--- Comment #1 from Jan Hubicka  ---
If it is easy to do, can you attach profiles please?
I will try to reproduce this...

Honza

[Bug c++/80841] New: Fails to match template specialization with polymorphic non-type template argument

2017-05-20 Thread cipherjason at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80841

Bug ID: 80841
   Summary: Fails to match template specialization with
polymorphic non-type template argument
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cipherjason at hotmail dot com
  Target Milestone: ---

This code fails to match the specialization of MaybeDestruct:

#include 

template 
struct Just;

template  class Fjust, class Maybe>
struct MaybeDestruct;

template  class Fjust, T X>
struct MaybeDestruct> {
using result = typename Fjust::result;
};

template 
struct Number {
using result = Just;
};

static constexpr double input = 2.0;
int main() {
using result = typename MaybeDestruct>::result;
}



If the use of template parameter T is stripped out then it seems to work:

#include 

template 
struct Just;

template  class Fjust, class Maybe>
struct MaybeDestruct;

template  class Fjust, const double& X>
struct MaybeDestruct> {
using result = typename Fjust::result;
};

template 
struct Number {
using result = Just;
};

static constexpr double input = 2.0;
int main() {
using result = typename MaybeDestruct>::result;
}

[Bug c++/80840] New: ICE in convert_nontype_argument reference to double

2017-05-20 Thread cipherjason at hotmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80840

Bug ID: 80840
   Summary: ICE in convert_nontype_argument reference to double
   Product: gcc
   Version: 7.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: cipherjason at hotmail dot com
  Target Milestone: ---

The following code causes an internal compiler error:

#include 

template 
struct Just;

template 
struct Number {
static constexpr double value = X;
using result = Just;
};

int main() {}



prog.cc:9:45: internal compiler error: in convert_nontype_argument, at
cp/pt.c:6828
 using result = Just;
 ^
0x5e59bd convert_nontype_argument
../../gcc-7.1.0/gcc/cp/pt.c:6827
0x5e59bd convert_template_argument
../../gcc-7.1.0/gcc/cp/pt.c:7668
0x5e669c coerce_template_parms
../../gcc-7.1.0/gcc/cp/pt.c:8128
0x5e8a09 lookup_template_class_1
../../gcc-7.1.0/gcc/cp/pt.c:8664
0x5e8a09 lookup_template_class(tree_node*, tree_node*, tree_node*, tree_node*,
int, int)
../../gcc-7.1.0/gcc/cp/pt.c:9009
0x6827dd finish_template_type(tree_node*, tree_node*, int)
../../gcc-7.1.0/gcc/cp/semantics.c:3151
0x631ff4 cp_parser_template_id
../../gcc-7.1.0/gcc/cp/parser.c:15495
0x63214f cp_parser_class_name
../../gcc-7.1.0/gcc/cp/parser.c:21953
0x63c737 cp_parser_qualifying_entity
../../gcc-7.1.0/gcc/cp/parser.c:6286
0x63c737 cp_parser_nested_name_specifier_opt
../../gcc-7.1.0/gcc/cp/parser.c:5972
0x63f452 cp_parser_simple_type_specifier
../../gcc-7.1.0/gcc/cp/parser.c:16826
0x62828d cp_parser_type_specifier
../../gcc-7.1.0/gcc/cp/parser.c:16499
0x63e6c2 cp_parser_type_specifier_seq
../../gcc-7.1.0/gcc/cp/parser.c:20781
0x6355f1 cp_parser_type_id_1
../../gcc-7.1.0/gcc/cp/parser.c:20627
0x63bb28 cp_parser_type_id
../../gcc-7.1.0/gcc/cp/parser.c:20697
0x63bb28 cp_parser_alias_declaration
../../gcc-7.1.0/gcc/cp/parser.c:18593
0x625d3c cp_parser_member_declaration
../../gcc-7.1.0/gcc/cp/parser.c:23041
0x62696a cp_parser_member_specification_opt
../../gcc-7.1.0/gcc/cp/parser.c:22945
0x62696a cp_parser_class_specifier_1
../../gcc-7.1.0/gcc/cp/parser.c:22098
0x6285c1 cp_parser_class_specifier
../../gcc-7.1.0/gcc/cp/parser.c:22350
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

--- Comment #8 from janus at gcc dot gnu.org ---
(In reply to janus from comment #5)
> This rather simple patch fixes the ICE on trunk:
> 
> Index: gcc/fortran/resolve.c
> ===
> --- gcc/fortran/resolve.c (revision 247818)
> +++ gcc/fortran/resolve.c (working copy)
> @@ -13833,6 +13833,9 @@ resolve_fl_derived (gfc_symbol *sym)
> gcc_assert (vtab);
> vptr->ts.u.derived = vtab->ts.u.derived;
>   }
> +
> +  if (!resolve_fl_derived0 (vptr->ts.u.derived))
> + return false;
>  }
>  
>if (!resolve_fl_derived0 (sym))

This shows one ICE in the testsuite:

FAIL: gfortran.dg/typebound_proc_32.f90   -O  (internal compiler error)

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

--- Comment #7 from janus at gcc dot gnu.org ---
(In reply to janus from comment #6)
> I just verified that reverting the class.c parts of r241450 fixes the ICE as
> well. It does not show any failures on the select_type_* test cases either.
> I'm about to start a full regtest ...

That unfortunately shows one runtime failure (at all optimization levels):

FAIL: gfortran.dg/submodule_6.f08   -O0  execution test

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

janus at gcc dot gnu.org changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |janus at gcc dot gnu.org

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

--- Comment #6 from janus at gcc dot gnu.org ---
(In reply to janus from comment #4)
> I actually think that r241450 might be more relevant here than r241403 (in
> particular since it messes with gfc_find_derived_vtab).

I just verified that reverting the class.c parts of r241450 fixes the ICE as
well. It does not show any failures on the select_type_* test cases either. I'm
about to start a full regtest ...

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

--- Comment #5 from janus at gcc dot gnu.org ---
I have investigated a bit on the origin of the problem, and it seems that it is
related to the vtype symbols not being resolved properly (and the TBP component
having the type BT_UNKNOWN although it's a REAL function).

This rather simple patch fixes the ICE on trunk:

Index: gcc/fortran/resolve.c
===
--- gcc/fortran/resolve.c   (revision 247818)
+++ gcc/fortran/resolve.c   (working copy)
@@ -13833,6 +13833,9 @@ resolve_fl_derived (gfc_symbol *sym)
  gcc_assert (vtab);
  vptr->ts.u.derived = vtab->ts.u.derived;
}
+
+  if (!resolve_fl_derived0 (vptr->ts.u.derived))
+   return false;
 }

   if (!resolve_fl_derived0 (sym))

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

--- Comment #4 from janus at gcc dot gnu.org ---
(In reply to Dominique d'Humieres from comment #3)
> Revision r241395+patch (2016-10-21) is OK, r241433+patch (2016-10-21) gives
> the ICE. AFAICT the only possible revision in the range is r241403. However
> in the patches of r241433 I have the patch for r241450 (pr69834), but it
> looks also unrelated.

I actually think that r241450 might be more relevant here than r241403 (in
particular since it messes with gfc_find_derived_vtab).

[Bug c++/80839] New: Memory leak in _Rb_tree

2017-05-20 Thread mail at kitsu dot me
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80839

Bug ID: 80839
   Summary: Memory leak in _Rb_tree
   Product: gcc
   Version: 6.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: mail at kitsu dot me
  Target Milestone: ---

Greetings!

I want to use a static const std::map with aggregate initialization. A
simplified example of my case is:
#include 

int main() {
static const std::map a {{1,2}};
return 0;
}

Besides I also want to use clang's memory sanitizer, so the compilation line
is:
clang++ a.cc -fsanitize=memory -std=c++11

And after launch, memory sanitizer complains about use-of-uninitialized-value
(filepaths removed):
==7422==WARNING: MemorySanitizer: use-of-uninitialized-value
#0 0x493408 in std::_Rb_tree,
std::_Select1st >, std::less,
std::allocator >
>::_M_erase(std::_Rb_tree_node >*)
#1 0x49349f in std::_Rb_tree,
std::_Select1st >, std::less,
std::allocator >
>::_M_erase(std::_Rb_tree_node >*)
#2 0x493214 in std::_Rb_tree,
std::_Select1st >, std::less,
std::allocator > >::~_Rb_tree()
#3 0x493161 in std::map,
std::allocator > >::~map()
#4 0x4220a3 in MSanAtExitWrapper(void*)
#5 0x7f933520f6bf in __run_exit_handlers
#6 0x7f933520f719 in __GI_exit
#7 0x7f93351f9517 in __libc_start_main
#8 0x41a1b9 in _start

SUMMARY: MemorySanitizer: use-of-uninitialized-value in std::_Rb_tree, std::_Select1st >,
std::less, std::allocator >
>::_M_erase(std::_Rb_tree_node >*)
Exiting

Compiling with debug symbols seems to lead somewhere around here:
https://github.com/gcc-mirror/gcc/blob/gcc-6-branch/libstdc++-v3/include/bits/stl_tree.h#L1633

System info:

Linux: Arch Linux, kernel 4.11.2 + pf-patch
clang: 4.0.0 (last one from off repo)
gcc: 6.3.1 (last one from off repo)

[Bug fortran/80766] [7/8 Regression] [OOP] ICE with type-bound procedure returning an array

2017-05-20 Thread dominiq at lps dot ens.fr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

--- Comment #3 from Dominique d'Humieres  ---
> > Confirmed, likely r241403 (pr69566).
>
> Not so sure here. Looks rather unrelated to me. Why do you think this one
> is the culprit?

Revision r241395+patch (2016-10-21) is OK, r241433+patch (2016-10-21) gives the
ICE. AFAICT the only possible revision in the range is r241403. However in the
patches of r241433 I have the patch for r241450 (pr69834), but it looks also
unrelated.

[Bug driver/80836] final binaries missing rpath despite configure with LDFLAGS=-Wl,-rpath=$prefix/lib

2017-05-20 Thread rjvbertin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80836

--- Comment #3 from René J.V. Bertin  ---
A bit complicated, no?

Also, how does one get binaries built with the resulting compilers to use the
corresponding runtime libraries (libstdc++, libgfortran, ...)? There should be
a configure option for that too, IMHO, possibly even a single option.

Sidewise related: I configured with `--with-libdir=/opt/local/lib/gcc7`,
thinking I'd get the runtime libraries installed there. This is the case on
Mac, but on Linux I discovered libstdc++ and family in /opt/local/lib/lib64
instead.

That kind of surprise doesn't make it easy for the end-user to add an explicit
rpath option.

[Bug bootstrap/80838] New: PGO/LTO bootstrapped compiler 5% slower than pure PGO bootstrapped one

2017-05-20 Thread trippels at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80838

Bug ID: 80838
   Summary: PGO/LTO bootstrapped compiler 5% slower than pure PGO
bootstrapped one
   Product: gcc
   Version: 7.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: trippels at gcc dot gnu.org
CC: hubicka at ucw dot cz
  Target Milestone: ---

On the new Ryzen compile farm machine a PGO/LTO bootstrapped compiler is ~5%
slower than a pure PGO bootstrapped one.
I've seen the same effect on other X86_64 machines, too.

 % ../gcc/configure --disable-libstdcxx-pch --disable-libvtv --disable-libitm
--disable-libcilkrts --disable-libssp --disable-libgomp --disable-werror
--disable-multilib --enable-languages=c,c++,fortran --enable-checking=release
--with-build-config="bootstrap-O3 (bootstrap-lto)"
 % make -j16 BOOT_CFLAGS="-Wno-error=coverage-mismatch -march=native -O3 -pipe"
STAGE1_CFLAGS="-Wno-error=coverage-mismatch -march=native -O3 -pipe"
CFLAGS_FOR_TARGET="-Wno-error=coverage-mismatch -march=native -O3 -pipe"
CXXFLAGS_FOR_TARGET="-Wno-error=coverage-mismatch -march=native -O3 -pipe"
profiledbootstrap

bootstrap-lto/PGO:
trippels@gcc67 ~ % time g++ -Ofast -w tramp3d-v4.cpp
g++ -Ofast -w tramp3d-v4.cpp  16.12s user 0.30s system 99% cpu 16.465 total

pure PGO:
trippels@gcc67 ~ % time g++ -Ofast -w tramp3d-v4.cpp
g++ -Ofast -w tramp3d-v4.cpp  15.17s user 0.41s system 99% cpu 15.626 total

(The resulting binary runs very quickly:
--cartvis 1.0 0.0 --rhomin 1e-8 -n 20
Time spent in iteration: 0.93937
(on Sandy Bridge:1.78999))

[Bug fortran/80766] [7/8 Regression] ICE with type bound procedures returning an array

2017-05-20 Thread janus at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80766

janus at gcc dot gnu.org changed:

   What|Removed |Added

 CC||janus at gcc dot gnu.org

--- Comment #2 from janus at gcc dot gnu.org ---
(In reply to Dominique d'Humieres from comment #1)
> Confirmed, likely r241403 (pr69566).

Not so sure here. Looks rather unrelated to me. Why do you think this one is
the culprit?

[Bug c/80806] gcc does not warn if local array is memset only

2017-05-20 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80806

--- Comment #6 from Daniel Fruzynski  ---
I have checked list of my issues reported here and found Bug 68034 which is
closely related to this one. This patch probably will fix that one too, user
will know that his memset was removed by compiler.

[Bug driver/80836] final binaries missing rpath despite configure with LDFLAGS=-Wl,-rpath=$prefix/lib

2017-05-20 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80836

--- Comment #2 from Andrew Pinski  ---
BOOT_LDFLAGS should be used if you want them for non-stage1.

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

--- Comment #3 from Peter Cordes  ---
The point about moves also applies to integer code, since a 64-bit mov requires
an extra byte for the REX prefix (unless a REX prefix was already required for
r8-r15).

I just noticed a case where gcc uses a 64-bit mov to copy a just-zeroed integer
register, when setting up for a 16-byte atomic load (see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load for
a single member, and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 for a
7.1.0 regression.  And https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for
the store-forwarding stalls from this code with -m32)

// https://godbolt.org/g/xnyI0l
// return the first 8-byte member of a 16-byte atomic object.
#include <atomic>
#include <cstdint>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;       // non-atomic pointer-to-atomic
    uintptr_t count;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

gcc6.3 -std=gnu++11 -O3 -mcx16 compiles this to

pushq   %rbx
xorl    %ecx, %ecx
xorl    %eax, %eax
xorl    %edx, %edx
movq    %rcx, %rbx    ### BAD: should be movl %ecx,%ebx.  Or another xor
lock cmpxchg16b (%rdi)
popq    %rbx
ret

MOVQ is obviously sub-optimal, unless done for padding to avoid NOPs later.

It's debatable whether %rbx should be zeroed with xorl %ebx,%ebx or movl
%ecx,%ebx.

* AMD: copying a zeroed register is always at least as good, sometimes better.
* Intel: xor-zeroing is always best, but on IvB and later copying a zeroed reg
is as good most of the time.  (But not in cases where mov %r10d, %ebx would
cost a REX and xor %ebx,%ebx wouldn't.)

Unfortunately, -march/-mtune doesn't affect the code-gen either way.  OTOH,
there's not much to gain here, and the current strategy of mostly using xor is
not horrible for any CPUs.  Just avoiding useless REX prefixes to save code
size would be good enough.

But if anyone does care about optimally zeroing multiple registers:

-mtune=bdver1/2/3 should maybe use one xorl and three movl (since integer MOV
can run on ports AGU01 as well as EX01, but integer xor-zeroing still takes an
execution unit, AFAIK, and can only run on EX01.)  Copying a zeroed register is
definitely good for vectors, since vector movdqa is handled at rename with no
execution port or latency.

-mtune=znver1 (AMD Ryzen) needs an execution port for integer xor-zeroing (and
maybe vector), but integer and vector mov run with no execution port or latency
(in the rename stage).  XOR-zeroing one register and copying it (with 32-bit
integer or 128-bit vector mov) is clearly optimal.  In
http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen3_InstLatX64.txt, mov
r32,r32 throughput is 0.2, but integer xor-zeroing throughput is only 0.25. 
IDK why vector movdqa throughput isn't 0.2, but the latency data tells us it's
handled at rename, which Agner Fog's data confirms.


-mtune=nehalem and earlier Intel P6-family don't care much: both mov and
xor-zeroing use an execution port.  But mov has non-zero latency, so the
mov-zeroed registers are ready at the earliest 2 cycles after the xor and mov
uops issue.  Also, mov may not preserve the upper-bytes-zeroes property that
avoids partial register stalls if you write AL and then read EAX.  Definitely
don't MOV a register that was zeroed a long time ago: that will contribute to
register-read stalls.  (http://stackoverflow.com/a/41410223/224132). 
mov-zeroing is only ok within about 5 cycles of the xor-zeroing.

-mtune=sandybridge should definitely use four XOR-zeroing instructions, because
MOV needs an execution unit (and has 1c latency), but xor-zeroing doesn't.  
XOR-zeroing also avoids consuming space in the physical register file:
http://stackoverflow.com/a/33668295/224132.

-mtune=ivybridge and later Intel shouldn't care most of the time, but
xor-zeroing is sometimes better (and never worse):  They can handle integer and
SSE MOV instructions in the rename stage with no execution port, the same way
they and SnB handle xor-zeroing.  However, mov-zeroing reads more registers,
which can be a bottleneck (especially if they're cold?) on HSW/SKL.
http://www.agner.org/optimize/blog/read.php?i=415#852.  Apparently
mov-elimination isn't perfect, and it sometimes does use an execution port. 
IDK when it fails.  Also, a kernel save/restore might leave the zeroed source
register no longer in the special zeroed state (pointing to the physical
zero-register, so it and its copies don't take up a register-file entry).  So
mov-zeroing is likely to be worse in the same cases as Nehalem and earlier:
when the source was zeroed a while ago. 


IDK about Silvermont/KNL or Jaguar, except that 64-bit xorq same,same isn't a
dependency-breaker on Silvermont/KNL.  Fortunately, gcc always uses 32-bit xor
for integer registers.


-mtune=generic might take a balanced approach and zero two or three with XOR

[Bug target/80837] New: [7.1.0 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

Bug ID: 80837
   Summary: [7.1.0 regression] x86 accessing a member of a 16-byte
atomic object generates terrible code:
splitting/merging the bytes
   Product: gcc
   Version: 7.1.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

This code compiles to much worse asm on gcc7.1.0 than on gcc6 or gcc8, doing an
insane split/merge of separate bytes of the atomic load result.  Sorry I can't
easily check if it's fixed in the latest gcc7, and I don't know what to search
for to check for an existing report.

#include <atomic>
#include <cstdint>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
    node *ptr;       // non-atomic pointer-to-atomic
    uintptr_t count;
    //int foo;
};

node *load_nounion(std::atomic<counted_ptr> *p) {
  return p->load(std::memory_order_acquire).ptr;
}

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 re: using a narrow load
of just the requested member.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 for the store-forwarding
stalls from a bad int->xmm strategy when compiling this with -m32.

With gcc7.1.0 -mx32 -O3 -march=nehalem (https://godbolt.org/g/jeCwsP)


movq    (%edi), %rcx   # 8-byte load of the whole object
xorl    %edx, %edx     # and then split/merge the bytes
movzbl  %ch, %eax
movb    %cl, %dl
movq    %rcx, %rsi
movb    %al, %dh       # put the 2nd byte back where it belongs, after getting it into the low byte of eax for no reason.  This could have been movb %ch, %dh.
andl    $16711680, %esi
andl    $4278190080, %ecx
movzwl  %dx, %eax      # partial-register stall on reading %dx; 3 cycle stall on core2/nehalem to insert a merging uop (and even compiling with -march=core2 doesn't change the strategy to use an extra OR insn)
orq     %rsi, %rax
orq     %rcx, %rax     # why 64-bit operand-size?
ret

-m32 uses basically the same code, but with 32-bit OR at the end.  It does the
8-byte atomic load using SSE or x87.  (And then stores/reloads instead of using
movq %xmm0, %eax, but that's a separate missed-optimization)

-m64 (without -msse4) uses a lot of movabs to set up masks for the high bytes
to split/merge all 8.  (The 16B atomic load is done with a call to libatomic
instead of inlining cmpxchg16b since this patch:
https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html).

---

-msse4 results in a different strategy for a useless 8-byte split/merge:

The atomic-load result in rdx:rax is stored to the stack, then reloaded into
xmm1, resulting in a store-forwarding stall.  Looks like the 64-bit version of
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833: bad int->xmm strategies.


Then the insane byte-at-a-time copy is from the low 8 bytes of %xmm1 into
%xmm0, using pextrb and pinsrb (I'm not going to paste in the whole code), see
https://godbolt.org/g/SCUIeu.

It initializes %xmm0 with
  movq    $0, 16(%rsp)
  movq    $0, 24(%rsp)
  movb    %r10b, 16(%rsp)  # r10 holds the first byte of the source data
  movdqa  16(%rsp), %xmm0
which is obviously horrible compared to movd %r10d, %xmm0.  Or movzbl %al,%eax
/ movd %eax,%xmm0 instead of messing around with %r10 at all.  (%r10d was
written with movzbl (%rsp), %r10d, reloading from the store of RAX.)

IDK how gcc managed to produce code with so many of the small steps done in
horribly sub-optimal ways, nevermind the fact that it's all useless in the
first place.

[Bug driver/80836] final binaries missing rpath despite configure with LDFLAGS=-Wl,-rpath=$prefix/lib

2017-05-20 Thread rjvbertin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80836

--- Comment #1 from René J.V. Bertin  ---
I found some suggestions here

http://stackoverflow.com/questions/13813737/how-can-i-set-rpath-on-gcc-binaries-during-bootstrap

which I haven't yet been able to check (a full build runs overnight on my
hardware). It seems though that this is a sufficiently likely issue to warrant
a dedicated --with-rpath option, especially since --with-stage1-ldflags (and
--with-bootstrap-ldflags) have a platform-specific default that apparently gets
overridden instead of completed by the user-supplied value.

[Bug driver/80836] New: final binaries missing rpath despite configure with LDFLAGS=-Wl,-rpath=$prefix/lib

2017-05-20 Thread rjvbertin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80836

Bug ID: 80836
   Summary: final binaries missing rpath despite configure with
LDFLAGS=-Wl,-rpath=$prefix/lib
   Product: gcc
   Version: 7.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: driver
  Assignee: unassigned at gcc dot gnu.org
  Reporter: rjvbertin at gmail dot com
  Target Milestone: ---

I'm building GCC 7.1.0 on Linux for installation into a separate prefix
(/opt/local) that is not declared via ldconfig. I thus need to store rpath
information in my binaries because I do not want to set LD_LIBRARY_PATH
systematically (that would defeat my purposes).

It's unclear how to set up GCC to build that way, or I overlooked any specific
options other than adding `-Wl,-rpath=/opt/local/lib` to LDFLAGS when running
configure.

When I do that (and add LD_LIBRARY_PATH=/opt/local/lib to the environment
during the build and `make install`), the build terminates OK. Just to be
certain I repeated the install with LDFLAGS set too as shown (because I notice
relinking during the step) but that doesn't change anything.

This affects all libraries from /opt/local/lib that the GCC executables link
to, meaning some won't be found and for some an older version from the host
will be used at runtime.

This looks like such a serious oversight that I cannot imagine nothing has been
foreseen to avoid the issue at hand. Also, is there anything I can do to fix
this without doing a full rebuild from scratch (I used the bootstrap-lean
target)?

[Bug libstdc++/80835] New: Reading a member of an atomic can load just that member, not the whole struct

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835

Bug ID: 80835
   Summary: Reading a member of an atomic can load just
that member, not the whole struct
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

For std::atomic or similar small struct, accessing just
one member with foo.load().m compiles to an atomic load of the entire thing. 
This is very slow for a 16B object on x86 (lock cmpxchg16b).  It's also not
free for 8-byte objects in 32-bit code (bouncing through SSE).

I think it's safe to optimize partial accesses to atomic object into narrow
loads.  We need to be sure:

* it's still atomic
* it still Synchronizes With release-stores of the whole object (i.e. that
other threads' ops become globally visible to us in the right order)
* it can't lead to any of this thread's operations becoming globally visible in
the wrong order

For larger objects (non-lock-free), obviously once you hold the lock you can
copy as much or as little as you want, but I think libatomic would need new
entry-points to support small copies out of larger objects.  Acquiring the same
lock that the writer used gives us acquire semantics for the load.

--

For lock-free objects, foo.load(mo_relaxed).m can be a narrow load if a load
with that size & alignment is still atomic.  There are no ordering requirements
for the load itself, except that it's still ordered by fences and other acq/rel
operations (which is the case).

mo_acquire loads need to Synchronize With release-stores of the whole object. 
AFAIK, all architectures that use MESI-style coherent caches synchronize based
on cache lines, not object addresses.  So an acquire-load Synchronizes With a
release-store if they both touch the same cache-line, even if they use
different address and size.  Thus it's safe to use a narrow acquire-load on
part of an object, even if the object spans two cache lines but the load only
touches one.  (But note that for x86, cmpxchg16b requires a 16B aligned memory
operand, so this would only be possible with an 8B or smaller object, in which
case we'd need lock cmpxchg loads and xchg stores, and those would be even
slower than usual because of the cl-split.)

For platforms other than x86, there may be issues if using synchronization
primitives that only synchronize a particular address, but I'm not familiar
with them.



You can use a union of an atomic struct and a struct of atomic<> members to
implement efficient read-only access to a single member with current gcc/clang,
on targets where this optimization is safe.  See
http://stackoverflow.com/questions/38984153/implement-aba-counter-with-c11-cas/38991835#38991835.
It would be nice if gcc would do the optimization for us on targets where it
really is safe.
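
A rough sketch of that union approach (mine, following the SO answer; the type
and member names are illustrative, and it relies on union type-punning, which
gcc/clang support in practice).  A reader of just the pointer then gets a plain
narrow acquire-load:

#include <atomic>
#include <cstdint>
struct node;
struct counted_ptr { node *ptr; uintptr_t count; };

union alignas(2 * sizeof(void*)) counted_ptr_atomic {
  std::atomic<counted_ptr> whole;      // used for the full-width CAS updates
  struct {
    std::atomic<node*> ptr;            // overlays the first member
    std::atomic<uintptr_t> count;
  } parts;
};

node *load_ptr_only(counted_ptr_atomic *p) {
  return p->parts.ptr.load(std::memory_order_acquire);   // narrow 8B load
}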

 If someone uses a union to store to part of an atomic object, the resulting UB
is their problem.  Lock-free atomic stores must always store the whole object. 
It usually doesn't make sense to store less than the whole object anyway, and
C++ std::atomic has no way to get an lvalue referring to a single member of an
atomic.

-

One worry I had was Intel's x86 manual saying:

> 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

> The Intel-64 memory-ordering model allows a load to be reordered with an
> earlier store to a different location. However, loads are not reordered with
> stores to the same location.

We know (from 
http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it)
that there can be reordering when the load and store addresses don't match,
even if they overlap.

But it seems Intel only means within a single thread, i.e. that a thread
observes its own operations to happen in program order.  Store-forwarding
allows a store/reload to become globally visible load-first.

But I think if the store fully overlaps the load, opposite of what the SO
question is about, there's no way you can observe that reordering, because all
the load data comes from the store.  That just means the store and reload are
adjacent to each other in the total order, with no way for a store from another
thread to sneak in between them.

See discussion in comments for more about this: 

http://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it/35910141?noredirect=1#comment75186782_35910141


-

A simplified example of loading one member of a pointer+counter struct:

// https://godbolt.org/g/k063wH for the code with a union, from the SO Q&A.

// https://godbolt.org/g/hNoJzj for this simplified code:
#include <atomic>
#include <cstdint>
struct node;
struct alignas(2*sizeof(void*)) counted_ptr {
node *ptr;//

[Bug target/80817] [missed optimization][x86] relaxed atomics

2017-05-20 Thread Joost.VandeVondele at mat dot ethz.ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80817

Joost VandeVondele  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-05-20
 CC||Joost.VandeVondele at mat dot 
ethz
   ||.ch
 Ever confirmed|0   |1