Re: Rename unwind.h to unwind-gcc.h
On 4/16/2014 03:22, Ian Lance Taylor wrote:
> On Tue, Apr 15, 2014 at 4:45 AM, Douglas B Rupp wrote:
>> On 04/14/2014 02:01 PM, Ian Lance Taylor wrote:
>>
>> No I considered that but I think that number will be very small. Will you
>> concede, in hindsight, that it would be better had the name been chosen to
>> be more unique?
>
> No, I won't concede that. The unwind.h file provides the interface
> for the C++ exception handling interface
> (http://mentorembedded.github.io/cxx-abi/abi-eh.html). That interface
> is implemented by several different compilers, not just GCC.

The header can provide the exact same interface with a different, better file name.

He's basically asking, "If you had it to do all over again, would you still call it unwind.h or would you call it something different?"

It's just an academic discussion because answering yes or no changes nothing, but I think the majority of the people would give it a different file name if they could do it all over again. It's not a big concession.

John
Fragile test case nsdmi-union5.C
Ran into a fragile test case:
FAIL: g++.dg/cpp0x/nsdmi-union5.C -std=c++11 scan-assembler 7

$ cat nsdmi-union5.C
// PR c++/58701
// { dg-require-effective-target c++11 }
// { dg-final { scan-assembler "7" } }

static union
{
  union
  {
    int i = 7;
  };
};

Two issues make it very fragile. It only seems to pass with -O0, as the code will be optimized away at any non-O0 level. This is somewhat acceptable, but the following is annoying: it scans for the digit 7 in the resulting asm, which will pass whenever
* any CPU name, file name, svn/git revision number, or ARM eabi_attribute contains the digit 7;
* on all GCC x.7 versions;
* on any GCC built in July, or on the 7th/17th/27th of the month, or in 2017.
Actually I ran into this issue when I checked my test result against a reference. Unfortunately one has the digit 7 in its revision number and one has not.

I tend to just not scan anything and make it a do-compile case against the ICE. But I'm not sure if there was a reason to scan for the digit. Please comment.

Thanks, Joey
Re: Fragile test case nsdmi-union5.C
> On Apr 16, 2014, at 12:42 AM, "Joey Ye" wrote:
>
> Ran into a fragile test case:
> FAIL: g++.dg/cpp0x/nsdmi-union5.C -std=c++11 scan-assembler 7
>
> $ cat nsdmi-union5.C
> // PR c++/58701
> // { dg-require-effective-target c++11 }
> // { dg-final { scan-assembler "7" } }
>
> static union
> {
>   union
>   {
>     int i = 7;
>   };
> };
>
> Two issues make it very fragile. It only seems to pass with -O0, as the code
> will be optimized away with any non-O0 levels. This is somewhat acceptable,
> but following is annoying: It scans digit 7 in resulting asm, which will
> pass whenever
> * Any CPU name, file name, svn/git revision number and ARM eabi_attribute
> contain digit 7.
> * All GCC x.7 versions
> * Any GCC built in July or on 7th/17th/27th of the month, or in 2017
> Actually I ran into this issue when I checked my test result with a
> reference. Unfortunately one has digit 7 in revision number and one has not.
>
> I tend to just not scan anything and makes it a do-compile case against ICE.
> But I'm not sure if there was a reason to scan the digit. Please comment.

What about adding an explicit -O0 and doing a scan on the gimple dump?

Thanks, Andrew

> Thanks,
> Joey
Re: Fragile test case nsdmi-union5.C
On Wed, Apr 16, 2014 at 10:24 AM, wrote:
>
>> On Apr 16, 2014, at 12:42 AM, "Joey Ye" wrote:
>>
>> Ran into a fragile test case:
>> FAIL: g++.dg/cpp0x/nsdmi-union5.C -std=c++11 scan-assembler 7
>>
>> $ cat nsdmi-union5.C
>> // PR c++/58701
>> // { dg-require-effective-target c++11 }
>> // { dg-final { scan-assembler "7" } }
>>
>> static union
>> {
>>   union
>>   {
>>     int i = 7;
>>   };
>> };
>>
>> Two issues make it very fragile. It only seems to pass with -O0, as the code
>> will be optimized away with any non-O0 levels. This is somewhat acceptable,
>> but following is annoying: It scans digit 7 in resulting asm, which will
>> pass whenever
>> * Any CPU name, file name, svn/git revision number and ARM eabi_attribute
>> contain digit 7.
>> * All GCC x.7 versions
>> * Any GCC built in July or on 7th/17th/27th of the month, or in 2017
>> Actually I ran into this issue when I checked my test result with a
>> reference. Unfortunately one has digit 7 in revision number and one has not.
>>
>> I tend to just not scan anything and makes it a do-compile case against ICE.
>> But I'm not sure if there was a reason to scan the digit. Please comment.
>
> What about adding an explicit -O0 and doing a scan on the gimple dump?

Or make it a runtime testcase?

Richard.

> Thanks,
> Andrew
>
>> Thanks,
>> Joey
Fwd: Using GCC to convert markup to C++
Forwarding this to the gcc list, since it seems to be more relevant to the topic of this list. Sorry for the confusion.

Regards,
Akos Vandra

-- Forwarded message --
From: Akos Vandra
Date: 16 April 2014 12:48
Subject: Using GCC to convert markup to C++
To: gcc-h...@gcc.gnu.org

Hi!

We are developing a C++ web framework, and we use ERB for the markup of the views. I would like to use the GCC tool collection to convert that ERB markup to C++ code that would actually do the rendering, so the markup interpretation would happen at compile time.

How is that possible? Can you give me some pointers on what part of the documentation I should read up on? Is a frontend capable of generating C++ code? Is that the way this should be solved? Or should I take another route?

The reason for using GCC would be its (presumed) capability of parsing a language grammar and extracting the tokens, which can easily be translated into code afterwards.

Thanks,
Akos Vandra

P.S. Something like:

"
<%#include >
Hello ERB World!
You are the <%= count %>. visitor!
The current time is <%= strftime(time()) %>.
"

into

"
#include "abstractview.h"
#include "context.h"

class MyView : public AbstractView {
  virtual std::string render(Context ctx);
}
"

"
#include "my_view.h"
#include 

std::string MyView::render(Context ctx)
{
  std::string s;
  s.reserve(1);
  s.append("Hello ERB World!\n You are the ");
  s.append(count);
  s.append(". visitor!\n The current time is: ");
  s.append(strftime(time()));
  s.append(".");
  return s;
}
"
Performance gain through dereferencing?
I have made a curious performance observation with gcc under 64 bit cygwin on a Core i7. I'm genuinely puzzled and couldn't find any information about it. Perhaps this is only indirectly a gcc question though; bear with me.

I have two trivial programs which assign a loop variable to a local variable 10^8 times. One does it the obvious way; the other one accesses the variable through a pointer, which means it must dereference the pointer first. This is reflected nicely in the disassembly snippets of the respective loop bodies below. Funny enough, the loop with the extra dereferencing runs considerably faster than the loop with the direct assignment (>10%). While the issue (indeed the whole program ;-) ) goes away with optimization, in less trivial scenarios that may not be so.

My first question is: what makes the smaller code slower?
The gcc question is: should assignment always be performed through a pointer if it is faster? (Probably not, but why not?) A session transcript including the compilable source is below.

Here are the disassembled loop bodies:

Direct access
=============
  localInt = i;
  1004010e6:   8b 45 fc        mov    -0x4(%rbp),%eax
  1004010e9:   89 45 f8        mov    %eax,-0x8(%rbp)

Pointer access
==============
  *localP = i;
  1004010ee:   48 8b 45 f0     mov    -0x10(%rbp),%rax
  1004010f2:   8b 55 fc        mov    -0x4(%rbp),%edx
  1004010f5:   89 10           mov    %edx,(%rax)

Note the first instruction, which moves the address into %rax. The other two are similar to the direct assignment above.

Here is a session transcript:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: /cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure --srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2 --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var --sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin --target=x86_64-pc-cygwin --without-libiconv-prefix --without-libintl-prefix --enable-shared --enable-shared-libgcc --enable-static --enable-version-specific-runtime-libs --enable-bootstrap --disable-__cxa_atexit --with-dwarf2 --with-tune=generic --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite --enable-threads=posix --enable-libatomic --enable-libgomp --disable-libitm --enable-libquadmath --enable-libquadmath-support --enable-libssp --enable-libada --enable-libgcj-sublibs --disable-java-awt --disable-symvers --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix --without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib
Thread model: posix
gcc version 4.8.2 (GCC)

peter@peter-lap ~/src/test/obj_vs_ptr
$ cat ./t
#!/bin/bash

cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1

peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 100000000; ++i)
        localInt = i;
    return 0;
}

real    0m0.248s
user    0m0.234s
sys     0m0.015s

peter@peter-lap ~/src/test/obj_vs_ptr
$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 100000000; ++i)
        *localP = i;
    return 0;
}

real    0m0.215s
user    0m0.203s
sys     0m0.000s
Re: Performance gain through dereferencing?
Hi, You cannot learn useful timing information from a single run of a short test like this - there are far too many other factors that come into play. You cannot learn useful timing information from unoptimised code. There is too much luck involved in a test like this to be useful. You need optimised code (at least -O1), longer times, more tests, varied code, etc., before being able to conclude anything. Otherwise the result could be nothing more than a quirk of the way caching worked out. mvh., David On 16/04/14 16:26, Peter Schneider wrote: > I have made a curious performance observation with gcc under 64 bit > cygwin on a corei7. I'm genuinely puzzled and couldn't find any > information about it. Perhaps this is only indirectly a gcc question > though, bear with me. > > I have two trivial programs which assign a loop variable to a local > variable 10^8 times. One does it the obvious way, the other one accesses > the variable through a pointer, which means it must dereference the > pointer first. This is reflected nicely in the disassembly snippets of > the respective loop bodies below. Funny enough, the loop with the extra > dereferencing runs considerably faster than the loop with the direct > assignment (>10%). While the issue (indeed the whole program ;-) ) goes > away with optimization, in less trivial scenarios that may not be so. > > My first question is: What makes the smaller code slower? > The gcc question is: Should assignment always be performed through a > pointer if it is faster? (Probably not, but why not?) A session > transcript including the compilable source is below. > > Here are the disassembled loop bodies: > > Direct access > = > localInt = i; >1004010e6: 8b 45 fcmov-0x4(%rbp),%eax >1004010e9: 89 45 f8mov%eax,-0x8(%rbp) > > > Pointer access > = > *localP = i; >1004010ee: 48 8b 45 f0 mov-0x10(%rbp),%rax >1004010f2: 8b 55 fcmov-0x4(%rbp),%edx >1004010f5: 89 10 mov%edx,(%rax) > > Note the first instruction which moves the address into %rax. 
The other > two are similar to the direct assignment above.-- > > Here is a session transcript: > > $ gcc -v > Using built-in specs. > COLLECT_GCC=gcc > COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe > Target: x86_64-pc-cygwin > Configured with: > /cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure > --srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2 > --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin > --libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var > --sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share > --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C > --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin > --target=x86_64-pc-cygwin --without-libiconv-prefix > --without-libintl-prefix --enable-shared --enable-shared-libgcc > --enable-static --enable-version-specific-runtime-libs > --enable-bootstrap --disable-__cxa_atexit --with-dwarf2 > --with-tune=generic > --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite > --enable-threads=posix --enable-libatomic --enable-libgomp > --disable-libitm --enable-libquadmath --enable-libquadmath-support > --enable-libssp --enable-libada --enable-libgcj-sublibs > --disable-java-awt --disable-symvers > --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as > --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix > --without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib > Thread model: posix > gcc version 4.8.2 (GCC) > > peter@peter-lap ~/src/test/obj_vs_ptr > $ cat ./t > #!/bin/bash > > cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1 > > > peter@peter-lap ~/src/test/obj_vs_ptr > $ ./t obj > int main() > { > int localInt; > for (int i = 0; i < 1; ++i) > localInt = i; > return 0; > } > real0m0.248s > user0m0.234s > sys 0m0.015s > > peter@peter-lap ~/src/test/obj_vs_ptr > $ ./t ptr > int main() > { > int localInt; > int *localP = &localInt; > for 
(int i = 0; i < 1; ++i) > *localP = i; > return 0; > } > > real0m0.215s > user0m0.203s > sys 0m0.000s > >
Re: Performance gain through dereferencing?
Hello,

I completely agree with David. Note that your results will vary greatly depending on the machine you run the tests on. Performance on such tests is very machine-dependent, so the conclusion cannot be generalized.

David

2014-04-16 16:49 GMT+02:00 David Brown :
>
> Hi,
>
> You cannot learn useful timing information from a single run of a short
> test like this - there are far too many other factors that come into play.
>
> You cannot learn useful timing information from unoptimised code.
>
> There is too much luck involved in a test like this to be useful. You
> need optimised code (at least -O1), longer times, more tests, varied
> code, etc., before being able to conclude anything. Otherwise the
> result could be nothing more than a quirk of the way caching worked out.
>
> mvh.,
>
> David
>
>
> On 16/04/14 16:26, Peter Schneider wrote:
>> I have made a curious performance observation with gcc under 64 bit
>> cygwin on a corei7. I'm genuinely puzzled and couldn't find any
>> information about it. Perhaps this is only indirectly a gcc question
>> though, bear with me.
>>
>> I have two trivial programs which assign a loop variable to a local
>> variable 10^8 times. One does it the obvious way, the other one accesses
>> the variable through a pointer, which means it must dereference the
>> pointer first. This is reflected nicely in the disassembly snippets of
>> the respective loop bodies below. Funny enough, the loop with the extra
>> dereferencing runs considerably faster than the loop with the direct
>> assignment (>10%). While the issue (indeed the whole program ;-) ) goes
>> away with optimization, in less trivial scenarios that may not be so.
>>
>> My first question is: What makes the smaller code slower?
>> The gcc question is: Should assignment always be performed through a
>> pointer if it is faster? (Probably not, but why not?) A session
>> transcript including the compilable source is below.
>> >> Here are the disassembled loop bodies: >> >> Direct access >> = >> localInt = i; >>1004010e6: 8b 45 fcmov-0x4(%rbp),%eax >>1004010e9: 89 45 f8mov%eax,-0x8(%rbp) >> >> >> Pointer access >> = >> *localP = i; >>1004010ee: 48 8b 45 f0 mov-0x10(%rbp),%rax >>1004010f2: 8b 55 fcmov-0x4(%rbp),%edx >>1004010f5: 89 10 mov%edx,(%rax) >> >> Note the first instruction which moves the address into %rax. The other >> two are similar to the direct assignment above.-- >> >> Here is a session transcript: >> >> $ gcc -v >> Using built-in specs. >> COLLECT_GCC=gcc >> COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/4.8.2/lto-wrapper.exe >> Target: x86_64-pc-cygwin >> Configured with: >> /cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2/configure >> --srcdir=/cygdrive/i/szsz/tmpp/cygwin64/gcc/gcc-4.8.2-3/src/gcc-4.8.2 >> --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin >> --libexecdir=/usr/libexec --datadir=/usr/share --localstatedir=/var >> --sysconfdir=/etc --libdir=/usr/lib --datarootdir=/usr/share >> --docdir=/usr/share/doc/gcc --htmldir=/usr/share/doc/gcc/html -C >> --build=x86_64-pc-cygwin --host=x86_64-pc-cygwin >> --target=x86_64-pc-cygwin --without-libiconv-prefix >> --without-libintl-prefix --enable-shared --enable-shared-libgcc >> --enable-static --enable-version-specific-runtime-libs >> --enable-bootstrap --disable-__cxa_atexit --with-dwarf2 >> --with-tune=generic >> --enable-languages=ada,c,c++,fortran,lto,objc,obj-c++ --enable-graphite >> --enable-threads=posix --enable-libatomic --enable-libgomp >> --disable-libitm --enable-libquadmath --enable-libquadmath-support >> --enable-libssp --enable-libada --enable-libgcj-sublibs >> --disable-java-awt --disable-symvers >> --with-ecj-jar=/usr/share/java/ecj.jar --with-gnu-ld --with-gnu-as >> --with-cloog-include=/usr/include/cloog-isl --without-libiconv-prefix >> --without-libintl-prefix --with-system-zlib --libexecdir=/usr/lib >> Thread model: posix >> gcc version 4.8.2 (GCC) >> >> 
peter@peter-lap ~/src/test/obj_vs_ptr >> $ cat ./t >> #!/bin/bash >> >> cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1 >> >> >> peter@peter-lap ~/src/test/obj_vs_ptr >> $ ./t obj >> int main() >> { >> int localInt; >> for (int i = 0; i < 1; ++i) >> localInt = i; >> return 0; >> } >> real0m0.248s >> user0m0.234s >> sys 0m0.015s >> >> peter@peter-lap ~/src/test/obj_vs_ptr >> $ ./t ptr >> int main() >> { >> int localInt; >> int *localP = &localInt; >> for (int i = 0; i < 1; ++i) >> *localP = i; >> return 0; >> } >> >> real0m0.215s >> user0m0.203s >> sys 0m0.000s >> >> >
Re: Performance gain through dereferencing?
Hi David,

Sorry, I had included more information in an earlier draft which I edited out for brevity.

> You cannot learn useful timing information from a single run of a short
> test like this - there are far too many other factors that come into play.

I didn't mention that I have run it dozens of times. I know that blunt runtime measurements on a non-realtime system tend to be non-reproducible, and that they are inadequate for exact measurements. But the difference here is so large that the result is highly significant, in spite of the "amateurish" setup. The run I am showing here is typical. One of my four cores is surely idle at any given moment, and there is no I/O, so the variations are small.

> You cannot learn useful timing information from unoptimised code.

I beg to disagree. While in this case the problem (and indeed eventually the whole program ;-) ) goes away with optimization, that may not be the case in less trivial scenarios. And optimization or not -- I would always contend that *p = n is **not slower** than i = n. But it is. Something is wrong ;-).

So I'd like to direct our attention to the generated code and its performance (because such code could conceivably appear as the result of an optimized compiler run as well, in less trivial scenarios). What puzzles me is: how can it be that two instructions are slower than a very similar pair of instructions plus another one? (And that question is totally unrelated to optimization.)

> Otherwise the result could be nothing more than a quirk of the way caching worked out.

Could you explain how caching could play a role here if all variables and addresses are on the stack and are likely to be in the same memory page? (I'm not being sarcastic -- I may be missing something obvious.) I can imagine that somehow the processor architecture is better utilized by the faster version (e.g. because short inner loops pipeline worse, or whatever).

For what it's worth, the programs were running on an i7-3632QM.
Re: Rename unwind.h to unwind-gcc.h
On 04/16/2014 12:01 AM, John Marino wrote:
> On 4/16/2014 03:22, Ian Lance Taylor wrote:
>> On Tue, Apr 15, 2014 at 4:45 AM, Douglas B Rupp wrote:
>>> On 04/14/2014 02:01 PM, Ian Lance Taylor wrote:
>>>
>>> No I considered that but I think that number will be very small. Will you
>>> concede, in hindsight, that it would be better had the name been chosen to
>>> be more unique?
>>
>> No, I won't concede that. The unwind.h file provides the interface
>> for the C++ exception handling interface
>> (http://mentorembedded.github.io/cxx-abi/abi-eh.html). That interface
>> is implemented by several different compilers, not just GCC.
>
> The header can provide the exact same interface with a different, better
> file name.
>
> He's basically asking, "If you had it to do all over again, would you
> still call it unwind.h or would you call it something different?"
>
> It's just an academic discussion because answering yes or no changes
> nothing, but I think the majority of the people would give it a
> different file name if they could do it all over again. It's not a big
> concession.

No, I don't think the majority would. Because GCC would then be already incompatible with the Intel compiler from which this interface was drawn, way back when the ia64 support was added to GCC and we redesigned GCC's exception handling.

r~
Re: Performance gain through dereferencing?
In order to see what difference a different processor makes I also tried the same code on a fairly old 32 bit "AMD Athlon(tm) XP 3000+" with the current stable gcc (4.7.2). The difference is even more striking (dereferencing is much faster). I see that the size of the code inside the loop for the faster pointer access is exactly 8. No idea whether that has any significance.

Here as well I performed several runs with similar results. Statistical significance was established around n=2 ;-).

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.7/lto-wrapper
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --enable-targets=all --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)

ppeterr@www:~/src/test/obj-vs-ptr$ cat t
#!/bin/bash
cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1

ppeterr@www:~/src/test/obj-vs-ptr$ ./t obj
int main()
{
    int localInt;
    for (int i = 0; i < 100000000; ++i)
        localInt = i;
    return 0;
}

real    0m0.418s
user    0m0.416s
sys     0m0.004s

ppeterr@www:~/src/test/obj-vs-ptr$ ./t ptr
int main()
{
    int localInt;
    int *localP = &localInt;
    for (int i = 0; i < 100000000; ++i)
        *localP = i;
    return 0;
}

real    0m0.243s
user    0m0.240s
sys     0m0.000s

===============================

The disassembly is for the direct access (slower):

  localInt = i;
  80483eb:   8b 45 fc   mov    -0x4(%ebp),%eax
  80483ee:   89 45 f8   mov    %eax,-0x8(%ebp)

And for the pointer access (faster):

  *localP = i;
  80483f1:   8b 45 f8   mov    -0x8(%ebp),%eax
  80483f4:   8b 55 fc   mov    -0x4(%ebp),%edx
  80483f7:   89 10      mov    %edx,(%eax)
Re: Performance gain through dereferencing?
On April 16, 2014 7:45:55 PM CEST, Peter Schneider wrote: >In order to see what difference a different processor makes I also >tried >the same code on a fairly old 32 bit "AMD Athlon(tm) XP 3000+" with the > >current stable gcc (4.7.2). The difference is even more striking >(dereferencing is much faster). I see that the size of the code inside >the loop for the faster pointer access is exactly 8. No idea whether >that has any significance. Alignment of jump targets are important. I don't think we do anything special there at O0, so the result will be pure luck. Richard. >Here as well I performed several runs with similar results. Statistical > >significance was established around n=2 ;-). > >gcc -v >Using built-in specs. >COLLECT_GCC=gcc >COLLECT_LTO_WRAPPER=/usr/lib/gcc/i486-linux-gnu/4.7/lto-wrapper >Target: i486-linux-gnu >Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' > >--with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs >--enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr >--program-suffix=-4.7 --enable-shared --enable-linker-build-id >--with-system-zlib --libexecdir=/usr/lib --without-included-gettext >--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 >--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu >--enable-libstdcxx-debug --enable-libstdcxx-time=yes >--enable-gnu-unique-object --enable-plugin --enable-objc-gc >--enable-targets=all --with-arch-32=i586 --with-tune=generic >--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu >--target=i486-linux-gnu >Thread model: posix >gcc version 4.7.2 (Debian 4.7.2-5) > >ppeterr@www:~/src/test/obj-vs-ptr$ cat t >#!/bin/bash >cat $1.c && gcc -std=c99 -O0 -g -o $1 $1.c && time ./$1 > >ppeterr@www:~/src/test/obj-vs-ptr$ ./t obj >int main() >{ > int localInt; > for (int i = 0; i < 1; ++i) > localInt = i; > return 0; >} > >real0m0.418s >user0m0.416s >sys 0m0.004s >ppeterr@www:~/src/test/obj-vs-ptr$ ./t ptr >int main() >{ > int 
localInt; > int *localP = &localInt; > for (int i = 0; i < 1; ++i) > *localP = i; > return 0; >} > >real0m0.243s >user0m0.240s >sys 0m0.000s > >=== > >The disassembly is for the direct access (slower): > > localInt = i; > 80483eb: 8b 45 fcmov-0x4(%ebp),%eax > 80483ee: 89 45 f8mov%eax,-0x8(%ebp) > >And for the pointer access (faster): > > *localP = i; > 80483f1: 8b 45 f8mov-0x8(%ebp),%eax > 80483f4: 8b 55 fcmov-0x4(%ebp),%edx > 80483f7: 89 10 mov%edx,(%eax)
Re: Rename unwind.h to unwind-gcc.h
> Because GCC would then be already incompatible with the Intel compiler from > which this interface was drawn, way back when the ia64 support was added to > GCC and we redesigned GCC's exception handling. The irony being that WindRiver is now owned by Intel... Doug, what does this unwind.h from VxWorks 7 contain exactly? Is it something that is derived from the original ICC implementation? -- Eric Botcazou
Re: Rename unwind.h to unwind-gcc.h
On 04/16/2014 12:38 PM, Eric Botcazou wrote:
>> Because GCC would then be already incompatible with the Intel compiler from
>> which this interface was drawn, way back when the ia64 support was added to
>> GCC and we redesigned GCC's exception handling.
>
> The irony being that WindRiver is now owned by Intel...
>
> Doug, what does this unwind.h from VxWorks 7 contain exactly? Is it something
> that is derived from the original ICC implementation?

There's no reference in it to a derivative work. It's
Copyright 2010 Wind River
and the description says it's the C++ ABI interface to the unwinder.
Re: Rename unwind.h to unwind-gcc.h
On Wed, Apr 16, 2014 at 12:01 AM, John Marino wrote:
> On 4/16/2014 03:22, Ian Lance Taylor wrote:
>> On Tue, Apr 15, 2014 at 4:45 AM, Douglas B Rupp wrote:
>>> On 04/14/2014 02:01 PM, Ian Lance Taylor wrote:
>>>
>>> No I considered that but I think that number will be very small. Will you
>>> concede, in hindsight, that it would be better had the name been chosen to
>>> be more unique?
>>
>> No, I won't concede that. The unwind.h file provides the interface
>> for the C++ exception handling interface
>> (http://mentorembedded.github.io/cxx-abi/abi-eh.html). That interface
>> is implemented by several different compilers, not just GCC.
>
> The header can provide the exact same interface with a different, better
> file name.
>
> He's basically asking, "If you had it to do all over again, would you
> still call it unwind.h or would you call it something different?"
>
> It's just an academic discussion because answering yes or no changes
> nothing, but I think the majority of the people would give it a
> different file name if they could do it all over again. It's not a big
> concession.

I agree that it doesn't matter at this date, but I would still vote to call it unwind.h. It's a good descriptive name for the interface described by the file. I certainly wouldn't call it unwind-gcc.h; it's intentionally not GCC-specific.

Ian
Re: Rename unwind.h to unwind-gcc.h
On 04/16/2014 12:48 PM, Douglas B Rupp wrote:
> On 04/16/2014 12:38 PM, Eric Botcazou wrote:
>>> Because GCC would then be already incompatible with the Intel compiler from
>>> which this interface was drawn, way back when the ia64 support was added to
>>> GCC and we redesigned GCC's exception handling.
>>
>> The irony being that WindRiver is now owned by Intel...
>>
>> Doug, what does this unwind.h from VxWorks 7 contain exactly? Is it
>> something that is derived from the original ICC implementation?
>
> There's no reference in it to a derivative work. It's
> Copyright 2010 Wind River
> and the description says it's the C++ ABI interface to the unwinder.

Is it a (reasonably) close match, interface-wise? Ought we be using
--with-system-libunwind for VxWorks7, like we do for hpux?

r~
Re: Rename unwind.h to unwind-gcc.h
On 04/16/2014 12:55 PM, Richard Henderson wrote:
> Is it a (reasonably) close match, interface-wise? Ought we be using
> --with-system-libunwind for VxWorks7, like we do for hpux?

It looks reasonable at first glance, but there's a disturbing comment in the code, something to the effect of: "until we have a GCC compatible unwinding library, hide the API".
Re: Redundant / wasted stack space and instructions
On 04/16/14 00:30, pshor...@dataworx.com.au wrote:
> I had left the movsi patterns unimplemented because I was told that if I did
> this then gcc would create expands/splits to use 16 bit moves. So, I removed
> my movsi patterns and all seemed well.

Correct. GCC can synthesize movsi from movhi.

> However, in comparing the output of the expand pass from the msp430 port and
> mine I could see the movsi instructions for the msp430 and the movhi subreg
> instructions in mine. I wondered if using the subreg:HI to split the SI moves
> into HI moves was hiding the real nature of the moves from reload and future
> optimizations, so I reverted to my movsi patterns and ... the spills now
> resolve back to the incoming stack parameter slots as desired!

Seems plausible. In general the compiler is not as good at optimizing code with SUBREG expressions.

There's a natural tension between making the MD file closely match the hardware capabilities and presenting a somewhat less accurate, but easier to optimize, description. Determining what's "best" is difficult, even for those of us with many years of experience with GCC. In the end, it always comes down to looking at how GCC compiles code of interest and determining how to improve things.

> Oh, and I tried using the LRA on this test case and Jeff, you're correct, the
> generated code is far, far better.

Good to know. LRA is definitely the future, but it's not a "turn on the bit and everything gets magically better", at least not most of the time.

jeff
Re: LRA Stuck in a loop until aborting
Solved... kind of. *ldsi is one of the patterns movsi is expanded to and as the name suggests it only handles register loads. I know that at some stages memory references will pass the register_operand predicate so I changed the predicate for operand 0 and added an alternative to *ldsi that could store to memory and the problem went away. What is interesting is that when LRA is disabled this case was handled fine by the old reload pass, suggesting that the new LRA pass handles this situation differently - or not at all ? BTW. Can someone tell me whether I should be top or bottom posting ? Cheers, Paul. On 16/04/14 16:38, pshor...@dataworx.com.au wrote: I've got a small test case there the ira pass produces this ... (insn 35 38 36 5 (set (reg/v:SI 29 [orig:17 _b ] [17]) (reg/v:SI 17 [ _b ])) 48 {*ldsi} (expr_list:REG_DEAD (reg/v:SI 17 [ _b ]) (nil))) and the LRA processes it as follows ... Spilling non-eliminable hard regs: 6 0 Non input pseudo reload: reject++ alt=0,overall=607,losers=1,rld_nregs=2 0 Non input pseudo reload: reject++ 1 Spill pseudo into memory: reject+=3 alt=1,overall=616,losers=2,rld_nregs=2 0 Non input pseudo reload: reject++ alt=2: Bad operand -- refuse Choosing alt 0 in insn 35: (0) =r (1) r {*ldsi} Creating newreg=34 from oldreg=29, assigning class GENERAL_REGS to r34 35: r34:SI=r17:SI REG_DEAD r17:SI Inserting insn reload after: 45: r29:SI=r34:SI 0 Non input pseudo reload: reject++ 1 Non pseudo reload: reject++ alt=0,overall=608,losers=1,rld_nregs=2 0 Non input pseudo reload: reject++ alt=1,overall=613,losers=2,rld_nregs=2 0 Non input pseudo reload: reject++ alt=2: Bad operand -- refuse Choosing alt 0 in insn 45: (0) =r (1) r {*ldsi} Creating newreg=35 from oldreg=29, assigning class GENERAL_REGS to r35 45: r35:SI=r34:SI Inserting insn reload after: 46: r29:SI=r35:SI so, it is stuck in a loop (continues on for 90 attempts then aborts) but I can't see what is causing it. 
> The pattern (below) shouldn't require a reload, so I can't see why it
> would be doing this:
>
> (define_insn "*ldsi"
>   [(set (match_operand:SI 0 "register_operand" "=r,r,r")
>         (match_operand:SI 1 "general_operand"  "r,m,i"))]
>   ""
>
> Can anyone shed any light on this behaviour?
>
> Thanks, Paul.
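[For concreteness, this is roughly the shape of the fix Paul describes in his follow-up: operand 0 widened to nonimmediate_operand and a store alternative added.  A sketch only - the constraint letters and ordering are assumptions, not the actual port's pattern:]

```
(define_insn "*ldsi"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,r,r,m")
        (match_operand:SI 1 "general_operand"       "r,m,i,r"))]
  ""
  ...)
```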
Re: LRA Stuck in a loop until aborting
On 04/16/2014 03:05 PM, Paul Shortis wrote:
> Solved... kind of.
>
> *ldsi is one of the patterns movsi is expanded to and as the name suggests it
> only handles register loads. I know that at some stages memory references will
> pass the register_operand predicate so I changed the predicate for operand 0
> and added an alternative to *ldsi that could store to memory and the problem
> went away.
>
> What is interesting is that when LRA is disabled this case was handled fine by
> the old reload pass, suggesting that the new LRA pass handles this situation
> differently - or not at all ?

No, old reload didn't handle this either.  That you got away with it is bizarre.

It's always the case that you should combine patterns such that a given operation can be performed with as many different kinds of inputs and outputs as possible.  This allows the register allocator freedom to move data around as it sees fit.

But the move patterns are especially special, in that a *single* pattern must describe *all* possible ways that a given data type (mode) can be moved around.  The register allocators only select an alternative for a move.  They do not choose between N different patterns, separately describing loads, stores, and register-to-register movement.

I'm fairly sure the documentation is quite clear on this, and GCC has required this since the beginning of time.

r~
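[To illustrate the rule Richard describes, here is a minimal sketch of a single movsi insn whose alternatives cover register-register, load, immediate, and store in one pattern.  The mnemonics and constraint letters are invented for illustration, not taken from any real port:]

```
(define_insn "*movsi"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,r,r,m")
        (match_operand:SI 1 "general_operand"       "r,m,i,r"))]
  ""
  "@
   mov\t%0,%1
   ld\t%0,%1
   ldi\t%0,%1
   st\t%1,%0")
```

[With this shape, the allocator merely picks an alternative; mem-to-mem moves are excluded because no alternative pairs "m" with "m".]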
gcc-4.9-20140416 is now available
Snapshot gcc-4.9-20140416 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.9-20140416/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.9 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_9-branch revision 209450

You'll find:

 gcc-4.9-20140416.tar.bz2             Complete GCC

  MD5=dda6cefa1ed78845e1e4862dc7f7522d
  SHA1=41bf62ed3008fa6fa092731e4a8bf5702521eae4

Diffs from 4.9-20140406 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.9 link is updated and a message is sent to the gcc list.  Please do not use a snapshot before it has been announced that way.
Re: Rename unwind.h to unwind-gcc.h
> It looks reasonable at first glance, but there's a disturbing comment in
> the code, something to the effect of:
>
>   until we have a GCC compatible unwinding library, hide the API

Indeed, it looks like it was written from scratch for the Diab compiler and is not fully compatible, but it is meant to implement the common C++ ABI.

-- 
Eric Botcazou
Re: Rename unwind.h to unwind-gcc.h
On Wed, Apr 16, 2014 at 3:53 PM, Eric Botcazou wrote:
>> It looks reasonable at first glance, but there's a disturbing comment in
>> the code, something to the effect of:
>>
>>   until we have a GCC compatible unwinding library, hide the API
>
> Indeed, it looks like it was written from scratch for the Diab compiler and
> is not fully compatible, but it is meant to implement the common C++ ABI.

I'm still not clear on what the real problem is.  It seems to me that when using GCC, if you #include <unwind.h>, you will get the GCC version, since it will come from the installed GCC include directory, which is searched ahead of /usr/include.  That seems OK.

The original e-mail suggested that there was a problem building libgcc, but I don't see why that would be.  Is that the real problem?  If so, we need more details.

Ian
Re: Rename unwind.h to unwind-gcc.h
On 04/16/2014 05:56 PM, Ian Lance Taylor wrote:
> I'm still not clear on what the real problem is.  It seems to me that
> when using GCC, if you #include <unwind.h>, you will get the GCC
> version, since it will come from the installed GCC include directory,
> which is searched ahead of /usr/include.  That seems OK.
>
> The original e-mail suggested that there was a problem building libgcc,
> but I don't see why that would be.  Is that the real problem?  If so we
> need more details.

The root of the problem is a hack in libgcc/config/t-vxworks put in to resolve a name clash for "regs.h", but that clash was in the other direction, i.e. the vxworks version is needed in preference to the gcc version.  Now with vxworks7 we have a name clash in the opposite direction and a catch-22 trying to fix it.  I'm working now to remove that first hack; then the unwind.h problem should resolve itself.
Re: LRA Stuck in a loop until aborting
On 04/16/14 16:19, Richard Henderson wrote:
> The register allocators only select an alternative for a move.  They do
> not choose between N different patterns, separately describing loads,
> stores, and register-to-register movement.
>
> I'm fairly sure the documentation is quite clear on this, and GCC has
> required this since the beginning of time.

Correct on all counts; many an hour was spent reimplementing the PA movXX patterns to satisfy that requirement.

jeff
stack-protection vs alloca vs dwarf2
While debugging some gdb-related FAILs, I discovered that gcc's -fstack-check option effectively calls alloca() to adjust the stack pointer.  However, it doesn't mark the stack adjustment as FRAME_RELATED even when it's setting up the local variables for the function.  In the case of rx-elf, for this testcase, the CFA for the function is defined in terms of the stack pointer - and thus is incorrect after the alloca call.

My question is: whose fault is this?  Should alloca() tell the debug stuff that the stack pointer has changed?  Should it tell it to not use $sp at all?  Should the debug stuff "just know" that $sp isn't a valid choice for the CFA?

The testcase from gdb is pretty simple:

void
medium_frame ()
{
  char S [16384];
  small_frame ();
}
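[For reference, a minimal GCC-internals sketch (not runnable on its own) of the marking DJ mentions: a port's prologue flags its stack adjustment as frame-related so the dwarf2 machinery emits CFA notes for it.  The gen_addsi3 call and frame_size variable are placeholders, not rx-specific code:]

```c
/* Hypothetical back-end prologue fragment: emit the stack adjustment
   and flag it as frame-related so dwarf2out records the CFA change.  */
rtx insn = emit_insn (gen_addsi3 (stack_pointer_rtx,
                                  stack_pointer_rtx,
                                  GEN_INT (-frame_size)));
RTX_FRAME_RELATED_P (insn) = 1;
```

[The complaint above is that the stack-check/alloca path performs an equivalent adjustment without any such marking, so the sp-relative CFA goes stale.]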
Re: LRA Stuck in a loop until aborting
On 17.04.2014 13:00, Jeff Law wrote:
> On 04/16/14 16:19, Richard Henderson wrote:
>> The register allocators only select an alternative for a move.  They do
>> not choose between N different patterns, separately describing loads,
>> stores, and register-to-register movement.
>>
>> I'm fairly sure the documentation is quite clear on this, and GCC has
>> required this since the beginning of time.
> Correct on all counts; many an hour was spent reimplementing the PA
> movXX patterns to satisfy that requirement.
> jeff

I'm convinced :-) but... the gcc internals info about movm is fairly comprehensive, and I had taken care to ensure that I satisfied

  "The constraints on a 'movm' must permit moving any hard register to
   any other hard register provided..."

by providing a define_expand that assigns from a general_operand to a nonimmediate_operand, a *ldsi instruction that can load from a general_operand to a nonimmediate_operand, and a *storesi instruction that can store a register_operand to a memory_operand.

In any case, out of curiosity, and to convince myself I hadn't imagined the old reload pass handling this, I reverted my recent fixes so that *ldsi and *storesi were once again as described above, then repeated the exercise with full rtl dumping on and compared the rtl generated both with and without LRA enabled.

In both cases the *.ira dump produced the triggering ...

(insn 57 61 58 5 (set (reg/v:SI 46 [orig:31 s ] [31])
        (reg/v:SI 31 [ s ])) 48 {*ldsi}
     (expr_list:REG_DEAD (reg/v:SI 31 [ s ])
        (nil)))

The non-LRA reload rtl produced ...

(insn 57 61 67 3 (set (reg:SI 1 r1)
        (mem/c:SI (plus:HI (reg/f:HI 3 r3)
                (const_int 4 [0x4])) [4 %sfp+4 S4 A16])) 48 {*ldsi}
     (nil))
(insn 67 57 58 3 (set (mem/c:SI (plus:HI (reg/f:HI 3 r3)
                (const_int 4 [0x4])) [4 %sfp+4 S4 A16])
        (reg:SI 1 r1)) 47 {*storesi}
     (nil))

while LRA just got stuck in a loop, unable to perform the reload of insn 57 that the old reload pass handled (or, more correctly, didn't choke over - it seems to be a redundant load/store).
I'm really just highlighting this because I know LRA is quite young and this might be a hint of a deeper or other issue.

Paul.