build breakage in libsanitizer
Hey all, I've just tried building trunk from branches/google/main/ and got the following failure in libsanitizer:

In file included from /mnt/test/rvbd-root-gcc49/usr/include/features.h:356:0,
                 from /mnt/test/rvbd-root-gcc49/usr/include/arpa/inet.h:22,
                 from ../../../../gcc49-google-main/libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.cc:20:
/mnt/test/rvbd-root-gcc49/usr/include/sys/timex.h:145:31: error: expected initializer before 'throw'
      __asm__ ("ntp_gettimex") __THROW;
                               ^

This is an intel-to-intel cross-compiler that is being built against linux-2.6.32.27 headers and glibc-2.12 (with a few Red Hat patches). The system header in question contains the following:

#if defined __GNUC__ && __GNUC__ >= 2
extern int ntp_gettime (struct ntptimeval *__ntv)
     __asm__ ("ntp_gettimex") __THROW;
#else
extern int ntp_gettimex (struct ntptimeval *__ntv) __THROW;
# define ntp_gettime ntp_gettimex
#endif

The same header works with gcc 4.8 (or is not used). Could someone clarify what libsanitizer needs here and what gcc dislikes, please? Which should I patch?

Thanks in advance!
Oleg.
controlling the default C++ dialect
Hey all, this thread started on the libstdc++ list, where I asked a couple of questions about patching std::string for C++11 compliance. I have figured out how to do that, and it yields a library that only works in C++11 mode. That is not an issue here, as we deploy a versioned runtime into a specific path. However, the whole thing is a bit inconvenient for C++ devs, as they have to pass -std=c++11 to get anything done.

So, my question is: how do I patch the notion of the "default" C++ dialect that gcc has? I have patched "cxx_dialect" in gcc/c-family/c-common.c, yet that is not enough. Is there some other thing that overwrites the default?

Thanks in advance,
Oleg.
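In case it is useful while hunting for whatever else resets the default: one way to sidestep the front end entirely is to inject a driver self-spec, so that g++ adds -std=gnu++11 whenever the user passes no -std option. This is only a sketch; the configure path is a placeholder and it assumes nothing else in the build supplies custom specs:

```
# at configure time:
../gcc-src/configure ... --with-specs='%{!std=*:-std=gnu++11}'

# or, equivalently, in a file passed to the driver with -specs=<file>:
*self_spec:
+ %{!std=*:-std=gnu++11}
```

The %{!std=*:...} guard means an explicit -std=... on the command line still wins, which keeps the patched default from surprising anyone who asks for another dialect.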
Re: google branch breakage
On 2013-10-01 09:00, Paul Pluzhnikov wrote:
> I'll test trunk breakage with --enable-symvers above, and then the
> attached patch should fix it.
>
> libstdc++-v3/ChangeLog:
>
> 2013-10-01  Paul Pluzhnikov
>
> 	* src/c++11/snprintf_lite.cc: Add missing
> 	_GLIBCXX_{BEGIN,END}_NAMESPACE_VERSION

Thanks. This patch addresses the immediate issue and my build moves forward. Would you like me to build the whole compiler package as well as a test program?

Oleg.
Re: google branch breakage
On 2013-10-01 08:49, Paul Pluzhnikov wrote:
>> I've attached the full log.
>
> How was this GCC configured?

Hey Paul, thanks for the quick reply. Here is how the final compiler is configured (it's an intel-to-intel sysroot'd cross-compiler):

../gcc48-google/configure \
    --prefix=%{PREFIX} --target=%{TARGET} --with-sysroot=%{SYSROOT} \
    --enable-languages=c,c++ \
    --disable-lto \
    --with-gnu-as --with-gnu-ld \
    --enable-threads=posix --disable-nls --disable-multilib --enable-shared \
    --disable-sjlj-exceptions \
    --with-build-time-tools=%{PREFIX}/%{TARGET}/bin \
    --enable-gold=default \
    --enable-symvers=gnu-versioned-namespace

I'd guess that only the very last line is pertinent to the issue.

Oleg.
google branch breakage
Hey all, hey Paul, it looks like the code in branches/google/gcc-4_8 does not compile any more:

../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc: In function 'int __gnu_cxx::__concat_size_t(char*, std::size_t, std::size_t)':
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:84:35: error: call of overloaded '__int_to_char(char*, long long unsigned int&, const char*&, const fmtflags&, bool)' is ambiguous
      std::ios_base::dec, true);
                                   ^

I've attached the full log. As a guess, it may be due to the following commit:

r202927 | ppluzhnikov | 2013-09-25 15:12:11 -0700 (Wed, 25 Sep 2013) | 11 lines

For Google b/10323610, partially backport upstream revisions r202818, r202832 and r202836.

2013-09-25  Paul Pluzhnikov

	* libstdc++-v3/config/abi/pre/gnu.ver: Add _ZSt24__throw_out_of_range_fmtPKcz
	* libstdc++-v3/src/c++11/Makefile.am: Add snprintf_lite.
	* libstdc++-v3/src/c++11/Makefile.in: Regenerate.
	* libstdc++-v3/src/c++11/snprintf_lite.cc: New.
	* libstdc++-v3/src/c++11/functexcept.cc (__throw_out_of_range_fmt): New.
Making all in c++11
make[6]: Entering directory `/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src/c++11'
/bin/sh ../../libtool --tag CXX --tag disable-shared --mode=compile /work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc/xgcc -shared-libgcc -B/work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc -nostdinc++ -L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src -L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src/.libs -B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/bin/ -B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/lib/ -isystem /mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/include -isystem /mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/sys-include -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/../libgcc -I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include/x86_64-unknown-linux -I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/libsupc++ -std=gnu++11 -prefer-pic -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=snprintf_lite.lo -std=gnu++11 -g -O2 -D_GNU_SOURCE -c -o snprintf_lite.lo ../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc
libtool: compile: /work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc/xgcc -shared-libgcc -B/work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc -nostdinc++ -L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src -L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src/.libs -B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/bin/ -B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/lib/ -isystem /mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/include -isystem /mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/sys-include -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/../libgcc -I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include/x86_64-unknown-linux -I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/libsupc++ -std=gnu++11 -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra -Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once -ffunction-sections -fdata-sections -frandom-seed=snprintf_lite.lo -std=gnu++11 -g -O2 -D_GNU_SOURCE -c ../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc -fPIC -DPIC -D_GLIBCXX_SHARED -o snprintf_lite.o
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc: In function 'int __gnu_cxx::__concat_size_t(char*, std::size_t, std::size_t)':
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:84:35: error: call of overloaded '__int_to_char(char*, long long unsigned int&, const char*&, const fmtflags&, bool)' is ambiguous
      std::ios_base::dec, true);
                                   ^
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:84:35: note: candidates are:
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:33:3: note: int std::__int_to_char(_CharT*, _ValueT, const _CharT*, std::__7::ios_base::fmtflags, bool) [with _CharT = char; _ValueT = long long unsigned int; std::__7::ios_base::fmtflags = std::__7::_Ios_Fmtflags]
   __int_to_char(_CharT* __bufend, _ValueT __v, const _CharT* __lit,
   ^
In file included from /work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-un
gcc 4.8: broken headers when using gnu-versioned-namespace
Hey all, I've just built gcc 4.8 with --enable-symvers=gnu-versioned-namespace and compilation of a small test fails with the following:

In file included from /work/opt/gcc-4.8/include/c++/4.8.0/array:324:0,
                 from /work/opt/gcc-4.8/include/c++/4.8.0/tuple:39,
                 from /work/opt/gcc-4.8/include/c++/4.8.0/bits/stl_map.h:63,
                 from /work/opt/gcc-4.8/include/c++/4.8.0/map:61, from
/work/opt/gcc-4.8/include/c++/4.8.0/debug/array:296:12: error: ‘tuple_size’ is not a class template
     struct tuple_size<__debug::array<_Tp, _Nm>>
...
...

It looks like include/c++/4.8.0/debug/array is missing a couple of wrappers:

_GLIBCXX_BEGIN_NAMESPACE_VERSION
_GLIBCXX_END_NAMESPACE_VERSION

I.e. it is similar to this change: http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193584

Should I re-open the bug?

Oleg.
_GLIBCXX_DEBUG with v7 namespace (bug 55028)
Hi, below is a patch for Bug 55028. My tests link now, yet more symbols may need to be exposed... Could someone check it, please?

Thanks! Oleg.

--- abi/pre/gnu-versioned-namespace.ver	(revision 192953)
+++ abi/pre/gnu-versioned-namespace.ver	(working copy)
@@ -116,6 +116,13 @@
     _ZN11__gnu_debug19_Safe_sequence_base22_M_revalidate_singularEv;
     _ZN11__gnu_debug19_Safe_sequence_base7_M_swapERS0_;
 
+    # __gnu_debug::_Safe_unordered_container_base and _Safe_local_iterator_base
+    _ZN11__gnu_debug30_Safe_unordered_container_base7_M_swapERS0_;
+    _ZN11__gnu_debug30_Safe_unordered_container_base13_M_detach_allEv;
+    _ZN11__gnu_debug25_Safe_local_iterator_base9_M_attachEPNS_19_Safe_sequence_baseEb;
+    _ZN11__gnu_debug25_Safe_local_iterator_base9_M_detachEv;
+
     _ZN11__gnu_debug19_Safe_iterator_base9_M_attach*;
     _ZN11__gnu_debug19_Safe_iterator_base16_M_attach_single*;
     _ZN11__gnu_debug19_Safe_iterator_base9_M_detachEv;
Re: RFC: -Wall by default
On 2012-04-11 01:50, Vincent Lefevre wrote:
> On 2012-04-09 13:03:38 -0500, Gabriel Dos Reis wrote:
>> On Mon, Apr 9, 2012 at 12:44 PM, Robert Dewar wrote:
>>> On 4/9/2012 1:36 PM, Jonathan Wakely wrote:
>>>> Maybe -Wstandard isn't the best name though, as "standard" usually
>>>> means something quite specific for compilers, and the warning switch
>>>> wouldn't have anything to do with standards conformance.
>
> -Wdefault might be better except if people want warnings about
> "defaults" in C++11 (which can mean a lot of things).
>
> How about a warning level?
>
>   -W0: no warnings (equivalent to -w)
>   -W1: default
>   -W2: equivalent to the current -Wall
>   -W3: equivalent to the current -Wall -Wextra

This is exactly what the Microsoft C++ compiler does and what their Visual Studio IDE exposes in the UI. So there is a reasonable precedent.

Oleg.
Re: C Compiler benchmark: gcc 4.6.3 vs. Intel v11 and others
Nice work! The only thing is that you didn't enable WPO/LTCG on the VC++ builds, so that test is a little skewed...

On 2012/1/18 20:35, willus.com wrote:
> Hello, For those who might be interested, I've recently benchmarked
> gcc 4.6.3 (and 3.4.2) vs. Intel v11 and Microsoft (in Windows 7) here:
>
> http://willus.com/ccomp_benchmark2.shtml
>
> willus.com
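For reference, the whole-program options in question look roughly like this (file names are placeholders; exact spellings can vary by toolchain version):

```
cl /O2 /GL bench.c /link /LTCG
gcc -O3 -flto bench.c -o bench
```

/GL turns on whole-program optimization at compile time and /LTCG enables link-time code generation in the Microsoft toolchain; -flto is gcc's rough counterpart, so enabling one side but not the other skews cross-compiler comparisons.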
Re: Performance degradation on g++ 4.6
Sure. I've just attached it to the bug.

On 2011/8/24 14:56, Xinliang David Li wrote:
> Thanks. Can you make the test case a standalone preprocessed file (using -E)?
>
> David
>
> On Wed, Aug 24, 2011 at 2:26 PM, Oleg Smolsky wrote:
>> On 2011/8/24 13:02, Xinliang David Li wrote:
>>>> On 2011/8/23 11:38, Xinliang David Li wrote:
>>>>> Partial register stall happens when there is a 32bit register read
>>>>> followed by a partial register write. In your case, the stall probably
>>>>> happens in the next iteration when 'add eax, 0Ah' executes, so your
>>>>> manual patch does not work. Try change
>>>>>     add al, [dx]
>>>>> into two instructions (assuming esi is available here)
>>>>>     movzx esi, ds:data8[dx]
>>>>>     add eax, esi
>>>> I patched the code to use "movzx edi" but the result is a little clumsy
>>>> as the loop is based on the virtual address rather than index.
>>> my bad -- I did copy&paste without making it precise.
>> No worries. The fragment did fit into the padding :)
>>
>>>> So, this is one test out of the suite. Many of them degraded... Are you
>>>> guys interested in looking at other ones? Or is there something to be
>>>> fixed in the register allocation logic?
>>> File bugs --- the isolated examples like this one would be very helpful
>>> in the bug report.
>> Done: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182
>>
>> Regards, Oleg.
Re: Performance degradation on g++ 4.6
On 2011/8/24 13:02, Xinliang David Li wrote:
>>> On 2011/8/23 11:38, Xinliang David Li wrote:
>>>> Partial register stall happens when there is a 32bit register read
>>>> followed by a partial register write. In your case, the stall probably
>>>> happens in the next iteration when 'add eax, 0Ah' executes, so your
>>>> manual patch does not work. Try change
>>>>     add al, [dx]
>>>> into two instructions (assuming esi is available here)
>>>>     movzx esi, ds:data8[dx]
>>>>     add eax, esi
>>> I patched the code to use "movzx edi" but the result is a little clumsy
>>> as the loop is based on the virtual address rather than index.
> my bad -- I did copy&paste without making it precise.

No worries. The fragment did fit into the padding :)

>>> So, this is one test out of the suite. Many of them degraded... Are you
>>> guys interested in looking at other ones? Or is there something to be
>>> fixed in the register allocation logic?
> File bugs --- the isolated examples like this one would be very helpful
> in the bug report.

Done: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

Regards, Oleg.
Re: Performance degradation on g++ 4.6
On 2011/8/23 11:38, Xinliang David Li wrote:
> Partial register stall happens when there is a 32bit register read
> followed by a partial register write. In your case, the stall probably
> happens in the next iteration when 'add eax, 0Ah' executes, so your
> manual patch does not work. Try change
>     add al, [dx]
> into two instructions (assuming esi is available here)
>     movzx esi, ds:data8[dx]
>     add eax, esi

I patched the code to use "movzx edi" but the result is a little clumsy as the loop is based on the virtual address rather than index. Also, the sequence is a bit bigger, so I had to spill the patch into the preceding padding:

.text:00400D80 loc_400D80:
.text:00400D80                 mov     edx, offset data8
.text:00400D85                 xor     eax, eax
.text:00400D87                 nop
.text:00400D88                 nop
.text:00400D89                 nop
.text:00400D8A                 nop
.text:00400D8B                 nop
.text:00400D8C
.text:00400D8C loc_400D8C:
.text:00400D8C                 movzx   edi, byte ptr [rdx+0]
.text:00400D90                 add     eax, edi
.text:00400D92                 add     eax, 0Ah
.text:00400D95                 add     rdx, 1
.text:00400D99                 cmp     rdx, 503480h
.text:00400DA0                 jnz     short loc_400D8C
.text:00400DA2                 movsx   eax, al
.text:00400DA5                 add     ecx, 1
.text:00400DA8                 add     ebx, eax
.text:00400DAA                 cmp     ecx, esi
.text:00400DAC                 jnz     short loc_400D80

The performance improved from 2.84 sec (563.38 M ops/s) to 1.51 sec (1059.60 M ops/s). It's close to the code emitted by g++ 4.1 now. Very funky!

So, this is one test out of the suite. Many of them degraded... Are you guys interested in looking at other ones? Or is there something to be fixed in the register allocation logic?

Oleg.
Re: Performance degradation on g++ 4.6
Hey Andrew,

On 2011/8/22 18:37, Andrew Pinski wrote:
> On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky wrote:
>> On 2011/8/22 18:09, Oleg Smolsky wrote:
>>> Both compilers fully inline the templated function and the emitted code
>>> looks very similar. I am puzzled as to why one of these loops is
>>> significantly slower than the other. I've attached disassembled
>>> listings - perhaps someone could have a look please? (the body of the
>>> loop starts at 00400FD for gcc41 and at 00400D90 for gcc46)
>> The difference, theoretically, should be due to the inner loop:
>>
>> v4.6:
>> .text:00400DA0 loc_400DA0:
>> .text:00400DA0                 add     eax, 0Ah
>> .text:00400DA3                 add     al, [rdx]
>> .text:00400DA5                 add     rdx, 1
>> .text:00400DA9                 cmp     rdx, 5034E0h
>> .text:00400DB0                 jnz     short loc_400DA0
>>
>> v4.1:
>> .text:00400FE0 loc_400FE0:
>> .text:00400FE0                 movzx   eax, ds:data8[rdx]
>> .text:00400FE7                 add     rdx, 1
>> .text:00400FEB                 add     eax, 0Ah
>> .text:00400FEE                 cmp     rdx, 1F40h
>> .text:00400FF5                 lea     ecx, [rax+rcx]
>> .text:00400FF8                 jnz     short loc_400FE0
>>
>> However, I cannot see how the first version would be slow... The custom
>> templated "shifter" degenerates into "add 0xa", which is the point of
>> the test... Hmm...
> It is slower because of the subregister dependency between eax and al.

Hmm... it is a little difficult to reason about these fragments as they are not equivalent in functionality. The g++ 4.1 version discards the result while the other version (correctly) accumulates. Oh, I've just realized that I grabbed the first iteration of the inner loop, which was factored out (perhaps due to unrolling?) Oops, my apologies.

Here are the complete loops, out of a further digested test:

g++ 4.1 (1.35 sec, 1185 M ops/s):

.text:00400FDB loc_400FDB:
.text:00400FDB                 xor     ecx, ecx
.text:00400FDD                 xor     edx, edx
.text:00400FDF                 nop
.text:00400FE0
.text:00400FE0 loc_400FE0:
.text:00400FE0                 movzx   eax, ds:data8[rdx]
.text:00400FE7                 add     rdx, 1
.text:00400FEB                 add     eax, 0Ah
.text:00400FEE                 cmp     rdx, 1F40h
.text:00400FF5                 lea     ecx, [rax+rcx]
.text:00400FF8                 jnz     short loc_400FE0
.text:00400FFA                 movsx   eax, cl
.text:00400FFD                 add     esi, 1
.text:00401000                 add     ebx, eax
.text:00401002                 cmp     esi, edi
.text:00401004                 jnz     short loc_400FDB

g++ 4.6 (2.86 sec, 563 M ops/s):

.text:00400D80 loc_400D80:
.text:00400D80                 mov     edx, offset data8
.text:00400D85                 xor     eax, eax
.text:00400D87                 db 66h, 66h
.text:00400D87                 nop
.text:00400D8A                 db 66h, 66h
.text:00400D8A                 nop
.text:00400D8D                 db 66h, 66h
.text:00400D8D                 nop
.text:00400D90
.text:00400D90 loc_400D90:
.text:00400D90                 add     eax, 0Ah
.text:00400D93                 add     al, [rdx]
.text:00400D95                 add     rdx, 1
.text:00400D99                 cmp     rdx, 503480h
.text:00400DA0                 jnz     short loc_400D90
.text:00400DA2                 movsx   eax, al
.text:00400DA5                 add     ecx, 1
.text:00400DA8                 add     ebx, eax
.text:00400DAA                 cmp     ecx, esi
.text:00400DAC                 jnz     short loc_400D80

Your observation still holds - there are two sequential instructions that operate on the same register. So, I manually patched the 4.6 binary's inner loop to the following:

.text:00400D90                 add     al, [rdx]
.text:00400D92                 add     rdx, 1
.text:00400D96                 add     eax, 0Ah
.text:00400D99                 cmp     rdx, 503480h
.text:00400DA0                 jnz     short loc_400D90

and that made no significant difference in performance. Is this dependency really a performance issue? BTW, the outer loop executes 200,000 times...

Thanks!
Oleg.

P.S. GDB disassembles the v4.6-emitted padding as:

0x00400d87 <+231>:	data32 xchg ax,ax
0x00400d8a <+234>:	data32 xchg ax,ax
0x00400d8d <+237>:	data32 xchg ax,ax
Re: Performance degradation on g++ 4.6
On 2011/8/22 18:09, Oleg Smolsky wrote:
> Both compilers fully inline the templated function and the emitted code
> looks very similar. I am puzzled as to why one of these loops is
> significantly slower than the other. I've attached disassembled listings -
> perhaps someone could have a look please? (the body of the loop starts at
> 00400FD for gcc41 and at 00400D90 for gcc46)

The difference, theoretically, should be due to the inner loop:

v4.6:
.text:00400DA0 loc_400DA0:
.text:00400DA0                 add     eax, 0Ah
.text:00400DA3                 add     al, [rdx]
.text:00400DA5                 add     rdx, 1
.text:00400DA9                 cmp     rdx, 5034E0h
.text:00400DB0                 jnz     short loc_400DA0

v4.1:
.text:00400FE0 loc_400FE0:
.text:00400FE0                 movzx   eax, ds:data8[rdx]
.text:00400FE7                 add     rdx, 1
.text:00400FEB                 add     eax, 0Ah
.text:00400FEE                 cmp     rdx, 1F40h
.text:00400FF5                 lea     ecx, [rax+rcx]
.text:00400FF8                 jnz     short loc_400FE0

However, I cannot see how the first version would be slow... The custom templated "shifter" degenerates into "add 0xa", which is the point of the test... Hmm...

Oleg.
Re: Performance degradation on g++ 4.6
Hey David, these two --param options made no difference to the test. I've cut the suite down to a single test (attached), which yields the following results:

./simple_types_constant_folding_os (gcc 4.1)
  test            description        time       operations/s
     0  "int8_t constant add"        1.34 sec   1194.03 M

./simple_types_constant_folding_os (gcc 4.6)
  test            description        time       operations/s
     0  "int8_t constant add"        2.84 sec   563.38 M

Both compilers fully inline the templated function and the emitted code looks very similar. I am puzzled as to why one of these loops is significantly slower than the other. I've attached disassembled listings - perhaps someone could have a look please? (the body of the loop starts at 00400FD for gcc41 and at 00400D90 for gcc46)

Thanks, Oleg.

On 2011/8/1 22:48, Xinliang David Li wrote:
> Try isolating the int8_t constant folding test from the rest to see if
> the slowdown can be reproduced with the isolated case. If the problem
> disappears, it is likely due to the following inline parameters:
> large-function-insns, large-function-growth, large-unit-insns,
> inline-unit-growth. For instance, set
>   --param large-function-insns=1 --param large-unit-insns=2
>
> David
>
> On Mon, Aug 1, 2011 at 11:43 AM, Oleg Smolsky wrote:
>> On 2011/7/29 14:07, Xinliang David Li wrote:
>>> Profiling tools are your best friend here. If you don't have access to
>>> any, the least you can do is to build the program with the -pg option
>>> and use the gprof tool to find out the differences.
>> The test suite has a bunch of very basic C++ tests that are executed an
>> enormous number of times. I've built one with the obvious performance
>> degradation and attached the source, output and reports. Here are some
>> highlights:
>>
>>   v4.1: Total absolute time for int8_t constant folding: 30.42 sec
>>   v4.6: Total absolute time for int8_t constant folding: 43.32 sec
>>
>> Every one of the tests in this section had degraded... the first half
>> more than the second. I am not sure how much further I can take this -
>> the benchmarked code is very short and plain. I can post disassembly for
>> one (some?) of them if anyone is willing to take a look...
>>
>> Thanks, Oleg.

/*
    Copyright 2007-2008 Adobe Systems Incorporated
    Distributed under the MIT License (see accompanying file LICENSE_1_0_0.txt
    or a copy at http://stlab.adobe.com/licenses.html )

    Source file for tests shared among several benchmarks
*/

/**/

template <typename T>
inline bool tolerance_equal(T &a, T &b) {
    T diff = a - b;
    return (abs(diff) < 1.0e-6);
}

template<>
inline bool tolerance_equal(int32_t &a, int32_t &b) { return (a == b); }

template<>
inline bool tolerance_equal(uint32_t &a, uint32_t &b) { return (a == b); }

template<>
inline bool tolerance_equal(uint64_t &a, uint64_t &b) { return (a == b); }

template<>
inline bool tolerance_equal(int64_t &a, int64_t &b) { return (a == b); }

template<>
inline bool tolerance_equal(double &a, double &b) {
    double diff = a - b;
    double reldiff = diff;
    if (fabs(a) > 1.0e-8)
        reldiff = diff / a;
    return (fabs(reldiff) < 1.0e-6);
}

template<>
inline bool tolerance_equal(float &a, float &b) {
    float diff = a - b;
    double reldiff = diff;
    if (fabs(a) > 1.0e-4)
        reldiff = diff / a;
    return (fabs(reldiff) < 1.0e-3);    // single precision divide test is really imprecise
}

/**/

template <typename T, typename Shifter>
inline void check_shifted_sum(T result) {
    T temp = (T)SIZE * Shifter::do_shift((T)init_value);
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_sum_CSE(T result) {
    T temp = (T)0.0;
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var) {
    T temp = (T)SIZE * Shifter::do_shift((T)init_value, var);
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var1, T var2, T var3, T var4) {
    T temp = (T)SIZE * Shifter::do_shift((T)init_value, var1, var2, var3, var4);
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum_CSE(T result, T var) {
    T temp = (T)0.0;
    if (!tolerance_equal(result, temp))
        printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void ch
Re: Performance degradation on g++ 4.6
Hi Benjamin,

On 2011/7/30 06:22, Benjamin Redelings I wrote:
> I had some performance degradation with 4.6 as well. However, I was able
> to cure it by using -finline-limit=800 or 1000, I think. However, this
> led to a code size increase. Were the old higher-performance binaries
> larger?

Yes, the older binary for the degraded test was indeed larger: 107K vs 88K. However, I have just re-built and re-run the test and there was no significant difference in performance. I.e. the degradation in the "simple_types_constant_folding" test remains when building with -finline-limit=800 (or =1000).

> IIRC, setting finline-limit=n actually sets two params to n/2, but I
> think you may only need to change 1 to get the old performance back.
> --param max-inline-insns-single defaults to 450, but --param
> max-inline-insns-auto defaults to 90. Perhaps you can get the old
> performance back by adjusting just one of these two parameters, or by
> setting them to different values, instead of the same value, as would be
> achieved by -finline-limit.

"--param max-inline-insns-auto=800" by itself does not help. The "--param max-inline-insns-single=800 --param max-inline-insns-auto=1000" combination makes no significant difference either. BTW, some of these tweaks increase the binary size to 99K, yet there is no performance increase.

Oleg.
Re: Performance degradation on g++ 4.6
Hey David, here are a couple of answers and notes:

- I built the test suite with -O3 and cannot see anything else related to inlining that isn't already ON (except for -finline-limit=n, which I do not know how to use): http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- FDO looks like a very different kettle of fish; I'd prefer to leave it aside to limit the number of data points (at least for the initial investigation)
- I've just rerun the suite with -flto and there are no significant differences in performance

What else is there?

Oleg.

On 2011/7/29 11:07, Xinliang David Li wrote:
> My guess is inlining differences. Try more aggressive inline parameters
> to see if that helps. Also try FDO to see if there is any performance
> difference between the two versions. You will probably need to do
> first-level triage and file bug reports.
>
> David
>
> On Fri, Jul 29, 2011 at 10:56 AM, Oleg Smolsky wrote:
>> Hi there, I have compiled and run a set of C++ benchmarks on a CentOS4/64
>> box using the following compilers:
>>   a) g++ 4.1 that is available for this distro (GCC version 4.1.2
>>      20071124 (Red Hat 4.1.2-42))
>>   b) g++ 4.6 that I built (stock version 4.6.1)
>>
>> The machine has two Intel quad-core processors in x86_64 mode
>> (/proc/cpuinfo attached). Benchmarks were taken from this page:
>> http://stlab.adobe.com/performance/
>>
>> Results:
>>   - some of these tests showed 20..30% performance degradation (eg the
>>     second section in the simple_types_constant_folding test: 30s -> 44s)
>>   - a few were quicker
>>   - full reports are attached
>>
>> I would assume that performance of the generated code is closely
>> monitored by the dev community and obvious blunders should not sneak
>> in... However, my findings are reproducible with these synthetic
>> benchmarks as well as production code at work. The latter shows
>> approximately 25% degradation on CPU-bound tests. Is there a trick to
>> building the compiler or using a specific -mtune/-march flag for my CPU?
>> I built the compiler with all the default options (it just has a
>> distinct installation path):
>>
>> ../gcc-%{version}/configure --prefix=/work/tools/gcc46 \
>>     --enable-languages=c,c++ --with-system-zlib \
>>     --with-mpfr=/work/tools/mpfr24 --with-gmp=/work/tools/gmp \
>>     --with-mpc=/work/tools/mpc \
>>     LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib
>>
>> Are there any published benchmarks? I'd appreciate any advice or
>> pointers.
>>
>> Thanks in advance, Oleg.