build breakage in libsanitizer

2014-02-24 Thread Oleg Smolsky
Hey all, I've just tried building Trunk from branches/google/main/ and 
got the following failure in libsanitizer:


In file included from /mnt/test/rvbd-root-gcc49/usr/include/features.h:356:0,
 from /mnt/test/rvbd-root-gcc49/usr/include/arpa/inet.h:22,
 from ../../../../gcc49-google-main/libsanitizer/sanitizer_common/sanitizer_platform_limits_posix.cc:20:
/mnt/test/rvbd-root-gcc49/usr/include/sys/timex.h:145:31: error: expected initializer before 'throw'

  __asm__ ("ntp_gettimex") __THROW;
   ^

This is an intel-to-intel cross-compiler that is being built against 
linux-2.6.32.27 headers and glibc-2.12 (with a few redhat patches). The 
system header in question contains the following:


#if defined __GNUC__ && __GNUC__ >= 2
extern int ntp_gettime (struct ntptimeval *__ntv)
 __asm__ ("ntp_gettimex") __THROW;
#else
extern int ntp_gettimex (struct ntptimeval *__ntv) __THROW;
# define ntp_gettime ntp_gettimex
#endif

This same header works with gcc 4.8 (or is not used). Could someone 
please clarify what libsanitizer needs here and what gcc dislikes? Which 
one should I patch?


Thanks in advance!
Oleg.


controlling the default C++ dialect

2013-12-20 Thread Oleg Smolsky
Hey all, this thread started on the libstdc++ list where I asked a 
couple of questions about patching std::string for C++11 compliance. I 
have figured out how to do that, and it yields a library that only works in 
the C++11 mode. This is not an issue here as we deploy a versioned 
runtime into a specific path. However, the whole thing is a bit 
inconvenient to C++ devs as they have to pass -std=c++11 to get anything 
done.


So, my question is - how do I patch the notion of the "default" C++ 
dialect that gcc has? I have patched "cxx_dialect" in 
gcc/c-family/c-common.c, yet that is not enough. Is there something else 
that overrides the default?


Thanks in advance,
Oleg.


Re: google branch breakage

2013-10-01 Thread Oleg Smolsky

On 2013-10-01 09:00, Paul Pluzhnikov wrote:



I'll test trunk breakage with --enable-symvers above, and then
attached patch should fix it.


libstdc++-v3/ChangeLog:

2013-10-01  Paul Pluzhnikov  

* src/c++11/snprintf_lite.cc: Add missing
_GLIBCXX_{BEGIN,END}_NAMESPACE_VERSION

Thanks. This patch addresses the immediate issue and my build moves forward.

Would you like me to build the whole compiler package as well as a test 
program?


Oleg.


Re: google branch breakage

2013-10-01 Thread Oleg Smolsky

On 2013-10-01 08:49, Paul Pluzhnikov wrote:

I've attached the full log.

How was this GCC configured?


Hey Paul, thanks for the quick reply. Here is how the final compiler is 
configured (it's an intel-to-intel sysroot'd cross-compiler):


../gcc48-google/configure \
--prefix=%{PREFIX} --target=%{TARGET} --with-sysroot=%{SYSROOT} \
--enable-languages=c,c++ \
--disable-lto \
--with-gnu-as --with-gnu-ld \
--enable-threads=posix --disable-nls --disable-multilib \
--enable-shared \
--disable-sjlj-exceptions \
--with-build-time-tools=%{PREFIX}/%{TARGET}/bin \
--enable-gold=default \
--enable-symvers=gnu-versioned-namespace

I'd guess that only the very last line is pertinent to the issue.

Oleg.


google branch breakage

2013-10-01 Thread Oleg Smolsky
Hey all, hey Paul, it looks like the branches/google/gcc-4_8 code does 
not compile any more:


../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc: In function 'int __gnu_cxx::__concat_size_t(char*, std::size_t, std::size_t)':
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:84:35: error: call of overloaded '__int_to_char(char*, long long unsigned int&, const char*&, const fmtflags&, bool)' is ambiguous
   std::ios_base::dec, true);
   ^

I've attached the full log. As a guess, it may be due to the following 
commit.


r202927 | ppluzhnikov | 2013-09-25 15:12:11 -0700 (Wed, 25 Sep 2013) | 
11 lines


For Google b/10323610, partially backport upstream revisions r202818,
r202832 and r202836.

2013-09-25  Paul Pluzhnikov  

* libstdc++-v3/config/abi/pre/gnu.ver: Add _ZSt24__throw_out_of_range_fmtPKcz

* libstdc++-v3/src/c++11/Makefile.am: Add snprintf_lite.
* libstdc++-v3/src/c++11/Makefile.in: Regenerate.
* libstdc++-v3/src/c++11/snprintf_lite.cc: New.
* libstdc++-v3/src/c++11/functexcept.cc 
(__throw_out_of_range_fmt): New.


Making all in c++11
realmake[6]: Entering directory 
`/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src/c++11'
/bin/sh ../../libtool --tag CXX --tag disable-shared   --mode=compile 
/work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc/xgcc -shared-libgcc 
-B/work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc -nostdinc++ 
-L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src 
-L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src/.libs
 -B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/bin/ 
-B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/lib/ 
-isystem 
/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/include 
-isystem 
/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/sys-include
-I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/../libgcc 
-I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include/x86_64-unknown-linux
 
-I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include
 -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/libsupc++  
-std=gnu++11 -prefer-pic -D_GLIBCXX_SHARED -fno-implicit-templates -Wall 
-Wextra -Wwrite-strings -Wcast-qual -Wabi  -fdiagnostics-show-location=once   
-ffunction-sections -fdata-sections  -frandom-seed=snprintf_lite.lo 
-std=gnu++11  -g -O2 -D_GNU_SOURCE  -c -o snprintf_lite.lo 
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc
libtool: compile:  /work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc/xgcc 
-shared-libgcc -B/work/osmolsky/_rpm_builds/BUILD/_gcc3/./gcc -nostdinc++ 
-L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src 
-L/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/src/.libs
 -B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/bin/ 
-B/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/lib/ 
-isystem 
/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/include 
-isystem 
/mnt/project/granite/toolchains/elixir/rvbd-gcc48/x86_64-unknown-linux/sys-include
 -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/../libgcc 
-I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include/x86_64-unknown-linux
 
-I/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-unknown-linux/libstdc++-v3/include
 -I/work/osmolsky/_rpm_builds/BUILD/gcc48-google/libstdc++-v3/libsupc++ 
-std=gnu++11 -D_GLIBCXX_SHARED -fno-implicit-templates -Wall -Wextra 
-Wwrite-strings -Wcast-qual -Wabi -fdiagnostics-show-location=once 
-ffunction-sections -fdata-sections -frandom-seed=snprintf_lite.lo -std=gnu++11 
-g -O2 -D_GNU_SOURCE -c 
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc  -fPIC 
-DPIC -D_GLIBCXX_SHARED  -o snprintf_lite.o
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc: In 
function 'int __gnu_cxx::__concat_size_t(char*, std::size_t, std::size_t)':
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:84:35: 
error: call of overloaded '__int_to_char(char*, long long unsigned int&, const 
char*&, const fmtflags&, bool)' is ambiguous
   std::ios_base::dec, true);
   ^
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:84:35: 
note: candidates are:
../../../../../gcc48-google/libstdc++-v3/src/c++11/snprintf_lite.cc:33:3: note: 
int std::__int_to_char(_CharT*, _ValueT, const _CharT*, 
std::__7::ios_base::fmtflags, bool) [with _CharT = char; _ValueT = long long 
unsigned int; std::__7::ios_base::fmtflags = std::__7::_Ios_Fmtflags]
   __int_to_char(_CharT* __bufend, _ValueT __v, const _CharT* __lit,
   ^
In file included from 
/work/osmolsky/_rpm_builds/BUILD/_gcc3/x86_64-un

gcc 4.8: broken headers when using gnu-versioned-namespace

2013-05-21 Thread Oleg Smolsky
Hey all, I've just built gcc 4.8 with 
--enable-symvers=gnu-versioned-namespace and compilation of a small test 
fails with the following:


In file included from /work/opt/gcc-4.8/include/c++/4.8.0/array:324:0,
 from /work/opt/gcc-4.8/include/c++/4.8.0/tuple:39,
 from /work/opt/gcc-4.8/include/c++/4.8.0/bits/stl_map.h:63,
 from /work/opt/gcc-4.8/include/c++/4.8.0/map:61,
 from /work/opt/gcc-4.8/include/c++/4.8.0/debug/array:296:12: error: ‘tuple_size’ is not a class template

 struct tuple_size<__debug::array<_Tp, _Nm>>
...
...

It looks like include/c++/4.8.0/debug/array is missing a couple of wrappers:

_GLIBCXX_BEGIN_NAMESPACE_VERSION
_GLIBCXX_END_NAMESPACE_VERSION

I.e., it is similar to this change:
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193584

Should I re-open the bug?

Oleg.


_GLIBCXX_DEBUG with v7 namespace (bug 55028)

2012-10-29 Thread Oleg Smolsky
Hi, below is a patch for Bug 55028. My tests link now, yet more symbols 
may need to be exposed...


Could someone check please?

Thanks!
Oleg.

--- abi/pre/gnu-versioned-namespace.ver (revision 192953)
+++ abi/pre/gnu-versioned-namespace.ver (working copy)
@@ -116,6 +116,13 @@
_ZN11__gnu_debug19_Safe_sequence_base22_M_revalidate_singularEv;
 _ZN11__gnu_debug19_Safe_sequence_base7_M_swapERS0_;

+# __gnu_debug::_Safe_unordered_container_base and _Safe_local_iterator_base
+_ZN11__gnu_debug30_Safe_unordered_container_base7_M_swapERS0_;
+_ZN11__gnu_debug30_Safe_unordered_container_base13_M_detach_allEv;
+_ZN11__gnu_debug25_Safe_local_iterator_base9_M_attachEPNS_19_Safe_sequence_baseEb;
+_ZN11__gnu_debug25_Safe_local_iterator_base9_M_detachEv;
+
 _ZN11__gnu_debug19_Safe_iterator_base9_M_attach*;
 _ZN11__gnu_debug19_Safe_iterator_base16_M_attach_single*;
 _ZN11__gnu_debug19_Safe_iterator_base9_M_detachEv;



Re: RFC: -Wall by default

2012-04-13 Thread Oleg Smolsky

On 2012-04-11 01:50, Vincent Lefevre wrote:

On 2012-04-09 13:03:38 -0500, Gabriel Dos Reis wrote:

On Mon, Apr 9, 2012 at 12:44 PM, Robert Dewar  wrote:

On 4/9/2012 1:36 PM, Jonathan Wakely wrote:


Maybe -Wstandard isn't the best name though, as "standard" usually
means something quite specific for compilers, and the warning switch
wouldn't have anything to do with standards conformance.

-Wdefault

might be better

except if people want warnings about "defaults" in C++11 (which can mean 
a lot of things).

How about a warning level?

-W0: no warnings (equivalent to -w)
-W1: default
-W2: equivalent to the current -Wall
-W3: equivalent to the current -Wall -Wextra

This is exactly what the Microsoft C++ compiler does and what its Visual 
Studio IDE exposes in the UI. So, there is a reasonable precedent.


Oleg.


Re: C Compiler benchmark: gcc 4.6.3 vs. Intel v11 and others

2012-01-19 Thread Oleg Smolsky
Nice work! The only thing is that you didn't enable WPO/LTCG on the VC++ 
builds, so that test is a little skewed...


On 2012/1/18 20:35, willus.com wrote:

Hello,

For those who might be interested, I've recently benchmarked gcc 4.6.3 
(and 3.4.2) vs. Intel v11 and Microsoft (in Windows 7) here:


http://willus.com/ccomp_benchmark2.shtml

willus.com




Re: Performance degradation on g++ 4.6

2011-08-24 Thread Oleg Smolsky

Sure. I've just attached it to the bug.

On 2011/8/24 14:56, Xinliang David Li wrote:

Thanks.

Can you make the test case a standalone preprocessed file (using -E)?

David

On Wed, Aug 24, 2011 at 2:26 PM, Oleg Smolsky  wrote:

On 2011/8/24 13:02, Xinliang David Li wrote:

On 2011/8/23 11:38, Xinliang David Li wrote:

Partial register stall happens when there is a 32bit register read
followed by a partial register write. In your case, the stall probably
happens in the next iteration when 'add eax, 0Ah' executes, so your
manual patch does not work.  Try changing

add al, [dx] into two instructions (assuming esi is available here)

movzx esi, ds:data8[dx]
add  eax, esi


I patched the code to use "movzx edi" but the result is a little clumsy
as the loop is based on the virtual address rather than the index.

my bad -- I did copy&paste without making it precise.

No worries. The fragment did fit into the padding :)


So, this is one test out of the suite. Many of them degraded... Are you
guys
interested in looking at other ones? Or is there something to be fixed in
the register allocation logic?

File bugs --- the isolated examples like this one would be very helpful in
the bug report.

Done:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

Regards,
Oleg.





Re: Performance degradation on g++ 4.6

2011-08-24 Thread Oleg Smolsky

On 2011/8/24 13:02, Xinliang David Li wrote:

On 2011/8/23 11:38, Xinliang David Li wrote:

Partial register stall happens when there is a 32bit register read
followed by a partial register write. In your case, the stall probably
happens in the next iteration when 'add eax, 0Ah' executes, so your
manual patch does not work.  Try changing

add al, [dx] into two instructions (assuming esi is available here)

movzx esi, ds:data8[dx]
add  eax, esi


I patched the code to use  "movzx edi" but the result is a little clumsy as
the loop is based on the virtual address rather than index.

my bad -- I did copy & paste without making it precise.

No worries. The fragment did fit into the padding :)


So, this is one test out of the suite. Many of them degraded... Are you guys
interested in looking at other ones? Or is there something to be fixed in
the register allocation logic?
File bugs --- the isolated examples like this one would be very 
helpful in the bug report. 

Done:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

Regards,
Oleg.


Re: Performance degradation on g++ 4.6

2011-08-24 Thread Oleg Smolsky

On 2011/8/23 11:38, Xinliang David Li wrote:

Partial register stall happens when there is a 32bit register read
followed by a partial register write. In your case, the stall probably
happens in the next iteration when 'add eax, 0Ah' executes, so your
manual patch does not work.  Try changing

add al, [dx] into two instructions (assuming esi is available here)

movzx esi, ds:data8[dx]
add  eax, esi

I patched the code to use  "movzx edi" but the result is a little clumsy 
as the loop is based on the virtual address rather than the index. Also, the 
sequence is a bit bigger so I had to spill the patch into the preceding 
padding:


.text:00400D80 loc_400D80:
.text:00400D80 mov edx, offset data8
.text:00400D85 xor eax, eax
.text:00400D87 nop
.text:00400D88 nop
.text:00400D89 nop
.text:00400D8A nop
.text:00400D8B nop
.text:00400D8C
.text:00400D8C loc_400D8C:
.text:00400D8C movzx   edi, byte ptr [rdx+0]
.text:00400D90 add eax, edi
.text:00400D92 add eax, 0Ah
.text:00400D95 add rdx, 1
.text:00400D99 cmp rdx, 503480h
.text:00400DA0 jnz short loc_400D8C
.text:00400DA2 movsx   eax, al
.text:00400DA5 add ecx, 1
.text:00400DA8 add ebx, eax
.text:00400DAA cmp ecx, esi
.text:00400DAC jnz short loc_400D80

The performance improved from 2.84 sec (563.38 M ops/s) to 1.51 sec 
(1059.60 M ops/s). It's close to the code emitted by g++4.1 now. Very funky!


So, this is one test out of the suite. Many of them degraded... Are you 
guys interested in looking at other ones? Or is there something to be 
fixed in the register allocation logic?


Oleg.


Re: Performance degradation on g++ 4.6

2011-08-23 Thread Oleg Smolsky

Hey Andrew,

On 2011/8/22 18:37, Andrew Pinski wrote:

On Mon, Aug 22, 2011 at 6:34 PM, Oleg Smolsky  wrote:

On 2011/8/22 18:09, Oleg Smolsky wrote:

Both compilers fully inline the templated function and the emitted code
looks very similar. I am puzzled as to why one of these loops is
significantly slower than the other. I've attached disassembled listings -
perhaps someone could have a look please? (the body of the loop starts at
00400FD for gcc41 and at 00400D90 for gcc46)

The difference, theoretically, should be due to the inner loop:

v4.6:
.text:00400DA0 loc_400DA0:
.text:00400DA0 add eax, 0Ah
.text:00400DA3 add al, [rdx]
.text:00400DA5 add rdx, 1
.text:00400DA9 cmp rdx, 5034E0h
.text:00400DB0 jnz short loc_400DA0

v4.1:
.text:00400FE0 loc_400FE0:
.text:00400FE0 movzx   eax, ds:data8[rdx]
.text:00400FE7 add rdx, 1
.text:00400FEB add eax, 0Ah
.text:00400FEE cmp rdx, 1F40h
.text:00400FF5 lea ecx, [rax+rcx]
.text:00400FF8 jnz short loc_400FE0

However, I cannot see how the first version would be slow... The custom
templated "shifter" degenerates into "add 0xa", which is the point of the
test... Hmm...

It is slower because of the subregister depedency between eax and al.

Hmm... it is a little difficult to reason about these fragments as they 
are not equivalent in functionality. The g++4.1 version discards the 
result while the other version (correctly) accumulates. Oh, I've just 
realized that I grabbed the first iteration of the inner loop, which was 
factored out (perhaps due to unrolling?). Oops, my apologies.


Here are complete loops, out of a further digested test:

g++ 4.1 (1.35 sec, 1185M ops/s):

.text:00400FDB loc_400FDB:
.text:00400FDB xor ecx, ecx
.text:00400FDD xor edx, edx
.text:00400FDF nop
.text:00400FE0
.text:00400FE0 loc_400FE0:
.text:00400FE0 movzx   eax, ds:data8[rdx]
.text:00400FE7 add rdx, 1
.text:00400FEB add eax, 0Ah
.text:00400FEE cmp rdx, 1F40h
.text:00400FF5 lea ecx, [rax+rcx]
.text:00400FF8 jnz short loc_400FE0
.text:00400FFA movsx   eax, cl
.text:00400FFD add esi, 1
.text:00401000 add ebx, eax
.text:00401002 cmp esi, edi
.text:00401004 jnz short loc_400FDB

g++ 4.6 (2.86s, 563M ops/s) :

.text:00400D80 loc_400D80:
.text:00400D80 mov edx, offset data8
.text:00400D85 xor eax, eax
.text:00400D87 db  66h, 66h
.text:00400D87 nop
.text:00400D8A db  66h, 66h
.text:00400D8A nop
.text:00400D8D db  66h, 66h
.text:00400D8D nop
.text:00400D90
.text:00400D90 loc_400D90:
.text:00400D90 add eax, 0Ah
.text:00400D93 add al, [rdx]
.text:00400D95 add rdx, 1
.text:00400D99 cmp rdx, 503480h
.text:00400DA0 jnz short loc_400D90
.text:00400DA2 movsx   eax, al
.text:00400DA5 add ecx, 1
.text:00400DA8 add ebx, eax
.text:00400DAA cmp ecx, esi
.text:00400DAC jnz short loc_400D80

Your observation still holds - there are two sequential instructions 
that operate on the same register. So, I manually patched the 4.6 
binary's inner loop to the following:


.text:00400D90 add al, [rdx]
.text:00400D92 add rdx, 1
.text:00400D96 add eax, 0Ah
.text:00400D99 cmp rdx, 503480h
.text:00400DA0 jnz short loc_400D90

and that made no significant difference in performance.

Is this dependency really a performance issue? BTW, the outer loop 
executes 200,000 times...


Thanks!

Oleg.

P.S. GDB disassembles the v4.6 emitted padding as:

   0x00400d87 <+231>:   data32 xchg ax,ax
   0x00400d8a <+234>:   data32 xchg ax,ax
   0x00400d8d <+237>:   data32 xchg ax,ax


Re: Performance degradation on g++ 4.6

2011-08-22 Thread Oleg Smolsky

On 2011/8/22 18:09, Oleg Smolsky wrote:
Both compilers fully inline the templated function and the emitted 
code looks very similar. I am puzzled as to why one of these loops is 
significantly slower than the other. I've attached disassembled 
listings - perhaps someone could have a look please? (the body of the 
loop starts at 00400FD for gcc41 and at 00400D90 for 
gcc46)

The difference, theoretically, should be due to the inner loop:

v4.6:
.text:00400DA0 loc_400DA0:
.text:00400DA0 add eax, 0Ah
.text:00400DA3 add al, [rdx]
.text:00400DA5 add rdx, 1
.text:00400DA9 cmp rdx, 5034E0h
.text:00400DB0 jnz short loc_400DA0

v4.1:
.text:00400FE0 loc_400FE0:
.text:00400FE0 movzx   eax, ds:data8[rdx]
.text:00400FE7 add rdx, 1
.text:00400FEB add eax, 0Ah
.text:00400FEE cmp rdx, 1F40h
.text:00400FF5 lea ecx, [rax+rcx]
.text:00400FF8 jnz short loc_400FE0

However, I cannot see how the first version would be slow... The custom 
templated "shifter" degenerates into "add 0xa", which is the point of 
the test... Hmm...


Oleg.


Re: Performance degradation on g++ 4.6

2011-08-22 Thread Oleg Smolsky

Hey David, these two --param options made no difference to the test.

I've cut the suite down to a single test (attached), which yields the 
following results:


./simple_types_constant_folding_os (gcc 41)
test description   time   operations/s
 0 "int8_t constant add"   1.34 sec   1194.03 M

./simple_types_constant_folding_os (gcc 46)
test description   time   operations/s
 0 "int8_t constant add"   2.84 sec   563.38 M

Both compilers fully inline the templated function and the emitted code 
looks very similar. I am puzzled as to why one of these loops is 
significantly slower than the other. I've attached disassembled listings 
- perhaps someone could have a look please? (the body of the loop starts 
at 00400FD for gcc41 and at 00400D90 for gcc46)


Thanks,
Oleg.


On 2011/8/1 22:48, Xinliang David Li wrote:

Try isolating the int8_t constant folding test from the rest to see
if the slowdown can be reproduced with the isolated case. If the
problem disappears, it is likely due to the following inline
parameters:

large-function-insns, large-function-growth, large-unit-insns,
inline-unit-growth. For instance set

--param large-function-insns=1
--param large-unit-insns=2

David

On Mon, Aug 1, 2011 at 11:43 AM, Oleg Smolsky  wrote:

On 2011/7/29 14:07, Xinliang David Li wrote:

Profiling tools are your best friend here. If you don't have access to
any, the least you can do is to build the program with -pg option and
use gprof tool to find out differences.

The test suite has a bunch of very basic C++ tests that are executed an
enormous number of times. I've built one with the obvious performance
degradation and attached the source, output and reports.

Here are some highlights:
v4.1: Total absolute time for int8_t constant folding: 30.42 sec
v4.6: Total absolute time for int8_t constant folding: 43.32 sec

Every one of the tests in this section had degraded... the first half more
than the second. I am not sure how much further I can take this - the
benchmarked code is very short and plain. I can post disassembly for one
(some?) of them if anyone is willing to take a look...

Thanks,
Oleg.



/*
Copyright 2007-2008 Adobe Systems Incorporated
Distributed under the MIT License (see accompanying file LICENSE_1_0_0.txt
or a copy at http://stlab.adobe.com/licenses.html )


Source file for tests shared among several benchmarks
*/

/**/

template <typename T>
inline bool tolerance_equal(T &a, T &b) {
T diff = a - b;
return (abs(diff) < 1.0e-6);
}


template<>
inline bool tolerance_equal(int32_t &a, int32_t &b) {
return (a == b);
}
template<>
inline bool tolerance_equal(uint32_t &a, uint32_t &b) {
return (a == b);
}
template<>
inline bool tolerance_equal(uint64_t &a, uint64_t &b) {
return (a == b);
}
template<>
inline bool tolerance_equal(int64_t &a, int64_t &b) {
return (a == b);
}

template<>
inline bool tolerance_equal(double &a, double &b) {
double diff = a - b;
double reldiff = diff;
if (fabs(a) > 1.0e-8)
reldiff = diff / a;
return (fabs(reldiff) < 1.0e-6);
}

template<>
inline bool tolerance_equal(float &a, float &b) {
float diff = a - b;
double reldiff = diff;
if (fabs(a) > 1.0e-4)
reldiff = diff / a;
return (fabs(reldiff) < 1.0e-3);// single precision divide test is really imprecise
}

/**/

template <typename T, typename Shifter>
inline void check_shifted_sum(T result) {
T temp = (T)SIZE * Shifter::do_shift((T)init_value);
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_sum_CSE(T result) {
T temp = (T)0.0;
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var) {
T temp = (T)SIZE * Shifter::do_shift((T)init_value, var);
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum(T result, T var1, T var2, T var3, T var4) {
T temp = (T)SIZE * Shifter::do_shift((T)init_value, var1, var2, var3, 
var4);
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template <typename T, typename Shifter>
inline void check_shifted_variable_sum_CSE(T result, T var) {
T temp = (T)0.0;
if (!tolerance_equal(result,temp))
printf("test %i failed\n", current_test);
}

template 
inline void ch

Re: Performance degradation on g++ 4.6

2011-08-01 Thread Oleg Smolsky

Hi Benjamin,

On 2011/7/30 06:22, Benjamin Redelings I wrote:

I had some performance degradation with 4.6 as well.

However, I was able to cure it by using -finline-limit=800 or 1000 I 
think.  However, this led to a code size increase.  Were the old 
higher-performance binaries larger?

Yes, the older binary for the degraded test was indeed larger: 107K vs 88K.

However, I have just re-built and re-run the test and there was no 
significant difference in performance. I.e., the degradation in the 
"simple_types_constant_folding" test remains when building with 
-finline-limit=800 (or =1000)


IIRC, setting finline-limit=n actually sets two params to n/2, but I 
think you may only need to change 1 to get the old performance back.  
--param max-inline-insns-single defaults to 450, but --param 
max-inline-insns-auto defaults to 90.  Perhaps you can get the old 
performance back by adjusting just one of these two parameters, or by 
setting them to different values, instead of the same value, as would 
be achieved by -finline-limit.
"--param max-inline-insns-auto=800" by itself does not help. The 
"--param max-inline-insns-single=800 --param max-inline-insns-auto=1000" 
combination makes no significant difference either.


BTW, some of these tweaks increase the binary size to 99K, yet there is 
no performance increase.


Oleg.


Re: Performance degradation on g++ 4.6

2011-07-29 Thread Oleg Smolsky

Hey David, here are a couple of answers and notes:
- I built the test suite with -O3 and cannot see anything else 
related to inlining that isn't already ON (except for -finline-limit=n, 
which I do not know how to use)

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- FDO looks like a very different kettle of fish; I'd prefer to 
leave it aside to limit the number of data points (at least for the 
initial investigation)
- I've just rerun the suite with -flto and there are no significant 
differences in performance


What else is there?

Oleg.

On 2011/7/29 11:07, Xinliang David Li wrote:

My guess is inlining differences. Try more aggressive inline
parameters to see if that helps. Also try FDO to see if there is any
performance difference between the two versions. You will probably need to
do first level triage and file bug reports.

David


On Fri, Jul 29, 2011 at 10:56 AM, Oleg Smolsky
  wrote:

Hi there, I have compiled and run a set of C++ benchmarks on a CentOS4/64
box using the following compilers:
a) g++4.1 that is available for this distro (GCC version 4.1.2 20071124
(Red Hat 4.1.2-42))
b) g++4.6 that I built (stock version 4.6.1)

The machine has two Intel quad core processors in x86_64 mode (/proc/cpuinfo
attached)

Benchmarks were taken from this page:
http://stlab.adobe.com/performance/

Results:
- some of these tests showed 20..30% performance degradation
  (e.g. the second section in the simple_types_constant_folding test: 30s
->  44s)
- a few were quicker
- full reports are attached

I would assume that performance of the generated code is closely monitored
by the dev community and obvious blunders should not sneak in... However, my
findings are reproducible with these synthetic benchmarks as well as
production code at work. The latter shows approximately 25% degradation on
CPU bound tests.

Is there a trick to building the compiler or using a specific -mtune/-march
flag for my CPU? I built the compiler with all the default options (it just
has a distinct installation path):
../gcc-%{version}/configure --prefix=/work/tools/gcc46
--enable-languages=c,c++ --with-system-zlib --with-mpfr=/work/tools/mpfr24
--with-gmp=/work/tools/gmp --with-mpc=/work/tools/mpc
LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib

Are there any published benchmarks? I'd appreciate any advice or pointers.

Thanks in advance,
Oleg.