Boehm-gc performance data
I'm still waiting for the testsuite to complete (it's been running for about 24 hours so far). In the meantime I'd like to discuss the first performance results, which I've put on the Wiki. The first number is GCC with Boehm's GC; the number in parentheses is GCC with the page collector.

combine.c: top mem usage: 52180k (13915k). GC execution time 0.66 (0.61) 4% (4%). User running time: 0m16 (0m14).
reload.c: top mem usage: 35764k (10049k). GC execution time 0.44 (0.53) 5% (6%). User running time: 0m10 (0m9).
PR/8361 (C++): top mem usage: 289128k (62510k). GC execution time 3.97 (5.77) 5% (9%). User running time: 1m17 (1m6). System running time: 0m2 (0m2).
PR/19614 (C++): top mem usage: 289140k (139520k). GC execution time 5 (4.68) 15% (17%). User running time: 0m35 (0m27). System running time: 0m1 (0m1).

My observations and some hypotheses:

1. Top memory usage is rather bad - the collector is not as aggressive as it should be. I've done a sanity check and verified that the number of GC-allocated bytes is the same in both cases.
2. The GC share of total runtime decreases, but the total runtime increases - perhaps more efficient GC algorithms, but worse locality of the allocated data?

Why this data might have some inaccuracies:

1. The debugging version of Boehm's GC API is used at the moment, though the collector itself is not compiled with debug options. I will re-run the tests with the non-debugging API, but I don't expect a significant impact on the results.
2. The decision when to collect is made a bit differently than with the old collectors: they track the increase in the number of allocated bytes in the GC heap, whereas I track the growth of the GC heap itself. Again, it will be easy to re-run the tests the other way around.

All in all, IMHO this data argues against Boehm's GC in GCC. But before deciding I would like to enable the generational GC features, to see if that helps with run time. On the other hand, I don't see how peak memory usage could be reduced. What do you think?

-- Laurynas
Re: Boehm-gc performance data
On 6/23/06, Laurynas Biveinis <[EMAIL PROTECTED]> wrote:
> All in all, IMHO this data argues against Boehm's GC in GCC. But
> before deciding I would like to enable generational GC features, if
> that will help with run time. On the other hand, I don't see how peak
> memory usage could be reduced. What do you think?

First of all, I'm impressed by how quickly you've done all this.

Don't write off Boehm's GC just yet. You can't expect to beat something that has seen a lot of tuning for GCC with something that you got working only a few days ago. There are a lot of special tricks, especially in ggc-page, that may put it at an advantage, but with some tuning perhaps you can get Boehm's to perform better for GCC.

For the locality thing: have you already tried using something like cachegrind or oprofile to compare the cache behavior of gcc with Boehm's and gcc with ggc?

What about allocation strategies? Perhaps that's another thing you could toy with to improve the peak memory usage issue. I don't know how Boehm's GC works, but in ggc-page e.g. all binary expression 'tree's are allocated on the same bag of pages, which may help (or not, dunno).

Keep up the good work!

Gr.
Steven
Re: Boehm-gc performance data
On Jun 23, 2006, at 8:51 AM, Laurynas Biveinis wrote:
> First number is GCC with Boehm's GC and the number in parentheses is
> GCC with page collector.
>
> combine.c: top mem usage: 52180k (13915k). GC execution time 0.66
> (0.61) 4% (4%). User running time: 0m16 (0m14).

Are these with checking on or off? Normally checking is on; you have to go out of your way to turn it off. If it was on, the real numbers are going to look much worse than the ones you've presented.

Also, I've not been following real closely, but the GTY markers are used by PCH, and the dual use of them by GC allows one to find PCH bugs more quickly and easily. If we moved entirely to Boehm's, do you have a plan for the GTY markers and PCH?
Re: Boehm-gc performance data
> > On Jun 23, 2006, at 8:51 AM, Laurynas Biveinis wrote: > > First number is GCC with Boehm's GC and the number in parentheses is > > GCC with page collector. > > > > combine.c: top mem usage: 52180k (13915k). GC execution time 0.66 > > (0.61) 4% (4%). User running time: 0m16 (0m14). > > Are these with checking on or off? Normally checking is on, you have > to go out of your way to turn it off. If it were on, the real > numbers are going to look much worse than the ones you're presented. > > Also, I've not been following real closely, but the GTY markers are > used by PCH and the dual use of them by GC allow one to find PCH bugs > more quickly and easily. If we moved entirely to Boehm's, did you > have a plan for the GTY markers and PCH? GTY markers are still used to mark roots with the boehm-gc. Thanks, Andrew Pinski
Re: Getting to the GCC Summit web page
Thanks! I put an updated page up at http://kegel.com/gcc/summit2006.html

I won't be attending myself this year (I needed a break from travel), but if anyone's blogging the event, please let me know and I'll link to their blog from my page.
- Dan

On 6/23/06, Andrey Belevantsev <[EMAIL PROTECTED]> wrote:

Hi Daniel,

Last year, when I was at the GCC Summit for the first time, I found your web page with directions on how to get there really helpful (http://kegel.com/gcc/summit2005.html). By now, some links from the page are no longer working:

1. The transitway info and map is now at http://www.octranspo.com/mapscheds/transitway/transitway_map.html instead of http://www.octranspo.com/mapscheds/transitway/tway_map.html
2. Mackenzie King is now http://www.octranspo.com/mapscheds/transitway/station_layout.asp?station_id=MAC instead of http://www.octranspo.com/mapscheds/transitway/mackenzie_king.htm
3. The area walking map is now http://www.octranspo.com/mapscheds/transitway/area_map.asp?station_id=MAC instead of http://www.octranspo.com/mapscheds/transitway/areamaps/mackenzie_king_area.htm

All the others seem to be ok. Hope that helps.
Andrey

-- Wine for Windows ISVs: http://kegel.com/wine/isv
Fwd: Lots of gfortran testsuite failures on sparc64-linux: undefined reference to `_gfortran_reshape_r8
Bugger, this went to testresults instead of here... sorry for that...

-- Forwarded message --
From: Christian Joensson <[EMAIL PROTECTED]>
Date: Jun 23, 2006 8:09 PM
Subject: Lots of gfortran testsuite failures on sparc64-linux: undefined reference to `_gfortran_reshape_r8
To: [EMAIL PROTECTED]

Aurora SPARC Linux release 2.1 (Snowshoe FC3)/TI UltraSparc IIi (Sabre) sun4u:

binutils 2.17.50 20060610
bison-1.875c-2.sparc
dejagnu-1.4.4-2.noarch
expect-5.42.1-1.sparc
gcc-3.4.2-6.fc3.sparc
gcc-c++-3.4.2-6.fc3.sparc
gcc-gnat-3.4.2-6.fc3.sparc
glibc-2.3.6-0.fc3.1.sparc64
glibc-2.3.6-0.fc3.1.sparcv9
glibc-devel-2.3.6-0.fc3.1.sparc64
glibc-devel-2.3.6-0.fc3.1.sparc
glibc-headers-2.3.6-0.fc3.1.sparc
glibc-kernheaders-2.6-20sparc.sparc
gmp-4.1.4-3sparc.sparc
gmp-4.1.4-3sparc.sparc64
gmp-devel-4.1.4-3sparc.sparc
gmp-devel-4.1.4-3sparc.sparc64
kernel-2.6.16-1.2241sp1.sparc64
kernel-devel-2.6.16-1.2241sp1.sparc64
libgcc-3.4.2-6.fc3.sparc
libgcc-3.4.2-6.fc3.sparc64
libgcj-3.4.2-6.fc3.sparc
libgcj-devel-3.4.2-6.fc3.sparc
libstdc++-3.4.2-6.fc3.sparc
libstdc++-3.4.2-6.fc3.sparc64
libstdc++-devel-3.4.2-6.fc3.sparc
libstdc++-devel-3.4.2-6.fc3.sparc64
make-3.80-5.sparc
nptl-devel-2.3.6-0.fc3.1.sparcv9
tcl-8.4.7-2.sparc

LAST_UPDATED: Thu Jun 22 17:11:44 UTC 2006 (revision 114896)
Platform: sparc64-unknown-linux-gnu
configure flags: --enable-__cxa_atexit --enable-shared --with-cpu=v7 --enable-languages=c,ada,c++,fortran,java,objc,obj-c++,treelang

I get a lot of gfortran testsuite failures like this:

PASS: gfortran.dg/append-1.f90 -Os execution test
Executing on host: /usr/local/src/trunk/objdir/gcc/testsuite/gfortran/../../gfortran -B/usr/local/src/trunk/objdir/gcc/testsuite/gfortran/../../ /usr/local/src/trunk/gcc/gcc/testsuite/gfortran.dg/array-1.f90 -O0 -pedantic-errors -L/usr/local/src/trunk/objdir/sparc64-unknown-linux-gnu/64/libgfortran/.libs -L/usr/local/src/trunk/objdir/sparc64-unknown-linux-gnu/64/libgfortran/.libs 
-L/usr/local/src/trunk/objdir/sparc64-unknown-linux-gnu/64/libiberty -lm -m64 -o ./array-1.exe (timeout = 1800)

/tmp/ccwsoiqs.o: In function `MAIN__':
array-1.f90:(.text+0x33c): undefined reference to `_gfortran_reshape_r8'
collect2: ld returned 1 exit status
compiler exited with status 1
output is:
/tmp/ccwsoiqs.o: In function `MAIN__':
array-1.f90:(.text+0x33c): undefined reference to `_gfortran_reshape_r8'
collect2: ld returned 1 exit status
FAIL: gfortran.dg/array-1.f90 -O0 (test for excess errors)
Excess errors: array-1.f90:(.text+0x33c): undefined reference to `_gfortran_reshape_r8'
WARNING: gfortran.dg/array-1.f90 -O0 compilation failed to produce executable

Any ideas? The FAILs were not present in my last test suite run: http://gcc.gnu.org/ml/gcc-testresults/2006-06/msg01081.html

Would you like me to file a bug?

-- Cheers, /ChJ

-- Cheers, /ChJ
RE: Intermixing powerpc-eabi and powerpc-linux C code
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> Ron McCall
> Sent: Thursday, June 01, 2006 2:33 PM
> To: gcc@gcc.gnu.org
> Subject: Intermixing powerpc-eabi and powerpc-linux C code
>
> Hi!
>
> Does anyone happen to know if it is possible to link
> (and run) C code compiled with a powerpc-eabi targeted
> gcc with C code compiled with a powerpc-linux targeted
> gcc? The resulting program would be run on a PowerPC
> Linux system (ELDK 4.0).

When I last played with the powerpc many years ago, the main differences between Linux and eabi were some details that you may or may not run into (note these are from memory, so you probably need to check what the current reality is):

1) eabi had different stack alignments than Linux;
2) eabi uses 2 small data registers (r2, r13) and Linux only 1 (r13?);
3) there are eabi relocations not officially in Linux and vice versa, but the GNU linker should support any relocations the compiler uses;
4) eabi can be little endian, Linux is only big endian;
5) different system libraries were linked in by default.

-- Michael Meissner, AMD, MS 83-29, 90 Central Street, Boxborough, MA 01719
Project RABLET
Last fall I produced the RABLE document, which described the approach I thought should be taken to write a new register allocator for GCC. A new register allocator written from scratch is a very long term project (measured in years), and there is no guarantee after all that work that we'd end up with something which is remarkably better. One would hope that it is a lot more maintainable, but the generated code is a crapshoot. It will surely look better, but will it really run faster? The current plate of spaghetti we call the register allocator has had a lot of fine tuning go into it over the years, and it generally generates pretty darn good code IF it doesn't have to spill much, which is much of the time.

This describes my current work-in-progress, RABLET, which stands for RABLE-Themes, and conveniently implies something smaller. Rather than write a new allocator, I think there are things we can do that are a lot less work which will reap many of the benefits. This works from the premise that we generate good code when we don't have to spill, so try to detect early that we are going to spill and do something about it at a point where we have good analysis.

THEMES
------

1 - One of the core themes in RABLE was very early selection of instructions from patterns. RTL patterns are initially chosen by the EXPAND pass. EXPAND tends to generate better rtl patterns by being handed complex trees which it can process and get better combinations. When TREE-SSA was first implemented, we got very poor RTL because expand was seeing very small trees. TER (Temporary Expression Replacement) was invented, which mashed any single-def/single-use ssa_names together into more complex trees. This gave expand a better chance of selecting better instructions, and made a huge difference.

2 - Rematerialization/register pressure reduction was another major component. The tree-ssa optimizers pay no attention to potential register pressure (which is as it should be). This sometimes results in very high register pressure entering the back end. RABLE defined passes performing expression rematerialization and other register pressure reducing techniques to reduce excessive register pressure before actually trying to do allocation. If you have 120 registers live at some point, and only 16 physical registers, the allocator is clearly going to have a difficult time. Try to present it with something more manageable.

RABLET CORE
-----------

On to the meat of the subject. RABLET involves reworking the out-of-ssa pass, rewriting expand such that it is tightly integrated with the out-of-ssa pass, and adding some new work to reduce register pressure. That's a lot less work than a whole new register allocator.

If we look at the RABLE architecture, the passes before global allocation are as follows:

  live-range-disjointer
  instruction-selection
  register coalescing
  register pressure reduction
  <.. then regular register allocation activity ...>

In the context of RABLET:

- Out of SSA naturally performs live range disjointing.
- Initial RTL pattern selection is performed by expand.
- Register coalescing - ssa_name coalescing is done by out-of-ssa, and ssa_name == register.
- Register pressure reduction - this is new work which would be implemented.

If we create a new black box super pass called out-of-ssa-expand-register-pressure-reducer (ewww, let's just refer to it as ssa-to-rtl :-), then we'd have something close to the first part of RABLE. Not exactly of course, because 'instruction selection' means something slightly different. In RABLE it means choosing an instruction alternative from an RTL pattern; in RABLET it simply means "choose good RTL patterns".

"But wait..." I hear some clever person saying... "A lot of things happen between ssa-to-rtl and global register allocation." Hold that thought until I get through describing ssa-to-rtl, since there are some important considerations which affect that discussion.

SSA-TO-RTL
----------

When out-of-ssa was originally written, tree-ssa was still evolving. We didn't know exactly what was going to be expected of it, so it was written to be flexible and had a lot of gorp added after the fact. TREE-SSA is now mature and we understand much more fully what is expected of translation out of ssa. I have begun rewriting it to be smaller and faster and to eliminate all the unused features which were initially provided. It is also being rewritten with RABLET in mind.

The new generation out-of-ssa will eliminate PHIs, much as it does today. This is done by coalescing together ssa_names connected by copies (and PHIs are just copies), and issuing copies on edges required by the PHIs. Instead of then mapping these back to VAR_DECLs and writing this all back to the trees, it will simply be maintained in a partition list map. This is *key*. At this point, we have a mapping of which ssa_names have been coalesced together, and the copies required to perform this. The trees themselves are unaltered.
Re: Boehm-gc performance data
On 6/23/06, Laurynas Biveinis <[EMAIL PROTECTED]> wrote:
> What do you think?

Is it possible to turn garbage collection totally off for a null-case run-time comparison, or would that cause thrashing except for very small jobs?

-- David L Nicol
"if life were like Opera, this would probably have poison in it"
-- Lyric Opera promotional coffee cup sleeve from Latte Land
Re: Project RABLET
Andrew MacLeod wrote:
> A new register allocator written from scratch is a very long term
> project (measured in years), and there is no guarantee after all that
> work that we'd end up with something which is remarkably better. One
> would hope that it is a lot more maintainable, but the generated code
> is a crapshoot. It will surely look better but will it really run
> faster? The current plate of spaghetti we call the register allocator
> has had a lot of fine tuning go into it over the years, and it
> generally generates pretty darn good code IF it doesn't have to spill
> much, which is much of the time.

If you are starting from scratch, would it not be better to adopt the approach of combining register allocation and scheduling? Significant progress has been made in this area in recent years.
Visibility and C++ Classes/Templates
I'm currently working on a massive overhaul of the visibility code to make it play nice with C++. One of the issues I've run into is the question of priority of #pragma visibility versus other sources of visibility information. Consider:

  #pragma GCC visibility push(hidden)
  class __attribute((visibility("default"))) A
  {
    void f ();
  };

  void A::f() { }

Here I think we'd all agree that f should get default visibility.

  class A
  {
    void f ();
  };

  #pragma GCC visibility push(hidden)
  void A::f() { }
  #pragma GCC visibility pop

This case is less clear; A does not have a specified visibility, but the context of f's definition does. However, we don't want to encourage this kind of code; the visibility should be specified as early as possible so that callers use the right calling convention. Waiting until the definition to specify visibility is bad practice. Also, the status quo is that f gets A's visibility. I would preserve that and possibly give a warning to tell the user that they might want to add __attribute((visibility)) to the declaration of f in A.

Now, templates:

  template <class T> __attribute((visibility ("hidden"))) T f(T);
  #pragma GCC visibility push(default)
  extern template int f(int);
  #pragma GCC visibility pop

This could really go either way. It could be considered similar to the above case in that f<int> is in a way "part" of f<T>, but there isn't the same scoping relationship. Also, there isn't the declaration/definition problem, as the extern template directive is the first declaration of the instantiation. In this case I am inclined to respect the #pragma rather than the attribute on the template.

Using an attribute would be less ambiguous:

  extern template __attribute ((visibility ("default"))) int f(int);

In a PR Geoff asked if we really want to allow different visibility for different instantiations. I think we do; perhaps one instantiation is part of the interface of an exported class, but we want other instantiations to be generated locally in each shared object.

Jason
Re: Project RABLET
On Fri, 2006-06-23 at 15:29 -0400, Robert Dewar wrote:
> Andrew MacLeod wrote:
>
> > A new register allocator written from scratch is a very long term
> > project (measured in years), and there is no guarantee after all that
> > work that we'd end up with something which is remarkably better. One
> > would hope that it is a lot more maintainable, but the generated code is
> > a crapshoot. It will surely look better but will it really run faster?
> > The current plate of spaghetti we call the register allocator has had a
> > lot of fine tuning go into it over the years, and it generally generates
> > pretty darn good code IF it doesn't have to spill much, which is much of
> > the time.
>
> If you are starting from scratch would it not be better to adopt
> the approach of combining register allocation and scheduling.
> Significant progress has been made in this area in recent years.

I am personally not a believer in combining register allocation and scheduling. They are two different problems, and although there is some interaction, I am still in the "keep them separate" camp.

However, RABLET is not writing a register allocator, so it's moot anyway :-).

Andrew
Re: Project RABLET
Andrew MacLeod wrote:
> I am personally not a believer in combining register allocation and
> scheduling. They are two different problems, and although there is
> some interaction, I am still in the "keep them separate" camp.

I disagree; there is in fact much more than "some interaction", there is a very strong interaction between scheduling and register allocation, particularly on modern machines like the typical x86 chips (which have only a fraction of their registers directly nameable). The research results on combining the two steps look very promising to me.

> However, RABLET is not writing a register allocator so it's moot
> anyway :-).
>
> Andrew

indeed, moot = discussable, undecided, so here we are discussing (or if you like to use the verb, mooting) the issue.
Re: Project RABLET
On Fri, Jun 23, 2006 at 04:30:01PM -0400, Robert Dewar wrote: > >However, RABLET is not writing a register allocator so its moot > >anyway :-). > > indeed, moot = disussable, undecided, so here we are discussing > (or if you like to use the verb, mooting) the issue. Please try the other definition, which he clearly meant: 2. Of purely theoretical or academic interest; having no practical consequence; as, the team won in spite of the bad call, and whether the ruling was correct is a moot question. -- Daniel Jacobowitz CodeSourcery
Re: Project RABLET
Daniel Jacobowitz wrote:
> Please try the other definition, which he clearly meant:
>
> 2. Of purely theoretical or academic interest; having no practical
> consequence; as, the team won in spite of the bad call, and whether
> the ruling was correct is a moot question.

Well I am not sure what he meant, but for sure it is not the case that optimal register allocation and scheduling is of only theoretical or academic interest with no practical consequences!
Re: Project RABLET
On 6/23/06, Robert Dewar <[EMAIL PROTECTED]> wrote: Well I am not sure what he meant, but for sure it is not the case that optimal register allocation and scheduling is of only theoretical or academic interest with no practical consequences! Thanks for making that point. Now, what do you think about this RABLET idea, which has nothing to do with either register allocation or scheduling? ;-) Gr. Steven
Re: Project RABLET
Steven Bosscher wrote: Now, what do you think about this RABLET idea, which has nothing to do with either register allocation or scheduling? ;-) Well I would not say that it has nothing to do with register allocation! But indeed this seems a promising approach. The real question in my mind is whether it can be done in a way that simplifies and clarifies rather than adding to what is now very complex code to follow. I think the answer to that is probably yes.
Fortran Compiler
Hello,

I would like to know if there is a Fortran compiler that runs on AMD 64 bits. I have installed SUSE 10.1 Linux on my computer; I would really appreciate all your help. I heard you also have C and C++.

Thank you very much. I write to you from Argentina,

Héctor Riojas Roldan
RFC: __cxa_atexit for mingw32
Hello,

One of the things the mingw32 C runtime lacks is an implementation of __cxa_atexit. However, as explained in the comment below, some of the behaviour of __cxa_atexit is already in the C runtime atexit implementation. Adding the object below to libstdc++ or libgcc.a and configuring with __cxa_atexit enabled produces PASSes for the three __cxa_atexit-dependent testcases. It works fine in tests with destruction of objects in dlls too, whether these dlls are unloaded at process exit or by earlier calls to FreeLibrary. (No, it doesn't allow exceptions to be thrown and caught across dll boundaries -- that's another story for gcc 4.3 -- but it removes one obstacle.)

Although this keeps the changes local to mingw32 code, I don't really like adding a fake __cxa_atexit to a runtime lib. So, the other option would be to add 'if (flag_use_dllonexit)' code to cp/decl.c and decl2.c to parallel flag_use_cxa_atexit. Adding a real __cxa_atexit to the mingw runtime is of course also possible, but I thought I'd attempt the easy options first.

I would appreciate any comments.

Danny

/* mingw32-cxa_atexit.c
   Contributed by Danny Smith ([EMAIL PROTECTED])
   Copyright (C) 2006 Free Software Foundation, Inc.

This file is part of GCC.

GCC is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2, or (at your option) any later
version.

GCC is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License
along with GCC; see the file COPYING. If not, write to the Free
Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA. 
*/

/* On mingw32, each dll has its own on-exit table, which is initialized
   on dll load.  Calls to atexit or onexit will register functions in
   the on-exit table of the containing module.  Each dll-specific
   on-exit table runs when that dll unloads.  The calls to atexit from
   the main app (ie, including all static libs) are finalized at
   process exit.

   cc1plus currently ignores the argument to __cxa_atexit-registered
   functions.  If that changes, we will need to replace this with a
   real __cxa_atexit implementation in mingw runtime.  */

#include <stdlib.h>

/* We don't need an explicit dll handle.  The handle is always 'this'.  */
void* __dso_handle = NULL;

int __mingw_cxa_atexit (void (*)(void *), void *, void *);

int
__mingw_cxa_atexit (void (*func)(void *),
		    void *arg __attribute__((unused)),
		    void *d __attribute__((unused)))
{
  return atexit ((void (*) (void)) func);
}

int __cxa_atexit (void (*)(void *), void *, void *)
  __attribute__ ((alias ("__mingw_cxa_atexit")));
Re: Project RABLET
Andrew MacLeod <[EMAIL PROTECTED]> writes:

> This describes my current work-in-progress, RABLET, which stands for
> RABLE-Themes, and conveniently implies something smaller.

Thanks for this proposal.

> ssa-to-rtl
> spill cost analysis
> global allocation
> spiller
> spill location optimizer
> instruction rewriter.

You omitted the RTL loop optimizer passes, which still do quite a bit of work despite the tree-ssa loop passes. Also if-conversion and some minor passes, though they are less relevant.

> If expand is made much smarter, I would argue that much of GCSE and CSE
> isn't needed. We've already performed those optimizations at a high
> level, and we can hopefully do a lot of the factoring and things on
> addressing registers exposed during expand. I'm sure there are other
> things to do, but I would argue that they are significantly less than a
> "general purpose" CSE and GCSE pass. And in the cases of high register
> pressure, how much would you want them to do anyway? Its really these
> high register pressure areas that RABLET is attacking anyway.

Here I think you are waving your hands a little too hard. RTL-level CSE is significant for handling common expressions exposed by address calculations and by DImode (and larger) computations. On some processors giving up CSE on address calculations would be very painful. There needs to be a plan to handle that. Also, at present many vector calculations are not exposed at the tree level--they are hidden inside builtin functions until they are expanded--and vector-heavy code can also have a lot of common subexpressions.

> If I recall, scheduling is register pressure aware and normally doesn't
> increase register pressure dramatically. If it does increase pressure,
> well, this won't solve every problem after all.

Unfortunately, scheduling is currently not register pressure aware at all. The scheduler will gleefully increase register pressure. That's why we don't even run the scheduler before register allocation on x86. 
Modulo the above comments, I don't see anything wrong with your basic idea. But I also wonder whether you couldn't get a similar effect by forcing instruction selection to occur before register allocation. If that is done well, reload will have much less work to do. One of the basic issues with the current code is not that we do register allocation well or poorly, but that reload takes the output of the register allocator and munges it unpredictably. That's going to happen with your proposal as well. It doesn't mean that your proposal won't improve things. But no register allocator can do a good job when it can't make the final decisions. Ian
Re: Visibility and C++ Classes/Templates
Jason Merrill wrote:

Nice to see this stuff getting improved!

> #pragma GCC visibility push(hidden)
> class __attribute((visibility("default"))) A
> {
> void f ();
> };
>
> void A::f() { }
>
> Here I think we'd all agree that f should get default visibility.

Agreed.

> class A
> {
> void f ();
> };
>
> #pragma GCC visibility push(hidden)
> void A::f() { }
> #pragma GCC visibility pop
>
> This case is less clear; A does not have a specified visibility, but the
> context of f's definition does. However, we don't want to encourage
> this kind of code; the visibility should be specified as early as
> possible so that callers use the right calling convention. Waiting
> until the definition to specify visibility is bad practice. Also, the
> status quo is that f gets A's visibility. I would preserve that and
> possibly give a warning to tell the user that they might want to add
> __attribute((visibility)) to the declaration of f in A.

Agreed.

> Now, templates:
>
> template <class T> __attribute((visibility ("hidden"))) T f(T);
> #pragma GCC visibility push(default)
> extern template int f(int);
> #pragma GCC visibility pop
>
> This could really go either way. It could be considered similar to the
> above case in that f<int> is in a way "part" of f<T>, but there isn't
> the same scoping relationship. Also, there isn't the
> declaration/definition problem, as the extern template directive is the
> first declaration of the instantiation. In this case I am inclined to
> respect the #pragma rather than the attribute on the template.

I'd tend to say that the attribute wins, and that if you want to specify the visibility on the template instantiation, you must use the attribute on the instantiation, as you suggest:

> Using an attribute would be less ambiguous:
>
> extern template __attribute ((visibility ("default"))) int f(int);
>
> In a PR Geoff asked if we really want to allow different visibility for
> different instantiations. I think we do; perhaps one instantiation is
> part of the interface of an exported class, but we want other
> instantiations to be generated locally in each shared object.

Agreed.

-- Mark Mitchell, CodeSourcery, [EMAIL PROTECTED], (650) 331-3385 x713
Re: Visibility and C++ Classes/Templates
Mark Mitchell <[EMAIL PROTECTED]> writes:

> Jason Merrill wrote:
> > Now, templates:
> >
> > template <class T> __attribute((visibility ("hidden"))) T f(T);
> > #pragma GCC visibility push(default)
> > extern template int f(int);
> > #pragma GCC visibility pop
> >
> > This could really go either way. It could be considered similar to the
> > above case in that f<int> is in a way "part" of f<T>, but there isn't
> > the same scoping relationship. Also, there isn't the
> > declaration/definition problem, as the extern template directive is the
> > first declaration of the instantiation. In this case I am inclined to
> > respect the #pragma rather than the attribute on the template.
>
> I'd tend to say that the attribute wins, and that if you want to specify
> the visibility on the template instantiation, you must use the attribute
> on the instantiation, as you suggest:

Don't you still have to deal with this case?

  #pragma GCC visibility push(hidden)
  template <class T> T f(T);
  #pragma GCC visibility pop
  ...
  #pragma GCC visibility push(default)
  extern template int f(int);
  #pragma GCC visibility pop

Personally I wouldn't mind saying that the attribute always beats the pragma, but it seems to me that there is still the potential for ambiguity.

Ian
gcc-4.1-20060623 is now available
Snapshot gcc-4.1-20060623 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/4.1-20060623/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.1 SVN branch with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_1-branch revision 114953

You'll find:

 gcc-4.1-20060623.tar.bz2           Complete GCC (includes all of below)
 gcc-core-4.1-20060623.tar.bz2      C front end and core compiler
 gcc-ada-4.1-20060623.tar.bz2       Ada front end and runtime
 gcc-fortran-4.1-20060623.tar.bz2   Fortran front end and runtime
 gcc-g++-4.1-20060623.tar.bz2       C++ front end and runtime
 gcc-java-4.1-20060623.tar.bz2      Java front end and runtime
 gcc-objc-4.1-20060623.tar.bz2      Objective-C front end and runtime
 gcc-testsuite-4.1-20060623.tar.bz2 The GCC testsuite

Diffs from 4.1-20060616 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.1 link is updated and a message is sent to the gcc list. Please do not use a snapshot before it has been announced that way.
Re: Project RABLET
On 6/23/06, Andrew MacLeod <[EMAIL PROTECTED]> wrote:
...
> 1 - One of the core themes in RABLE was very early selection of
> instructions from patterns.  RTL patterns are initially chosen by the
> EXPAND pass.  EXPAND tends to generate better rtl patterns by being
> handed complex trees which it can process and get better combinations.
> When TREE-SSA was first implemented, we got very poor RTL because expand
> was seeing very small trees.  TER (Temporary Expression Replacement) was
> invented, which mashed any single-def/single-use ssa_names together into
> more complex trees.  This gave expand a better chance of selecting
> better instructions, and made a huge difference.

Have you considered using BURG/IBURG-style tree-pattern-matching instruction selection?
http://www.cs.princeton.edu/software/iburg/

That approach can certainly provide low-register-pressure, high-quality instruction selection.

--
#pragma ident "Seongbae Park, compiler, http://seongbae.blogspot.com";
sh3e opcodes in sh2e's crt1.o?
It looks like crt1.asm unconditionally includes an sh3e opcode (stc spc,r1) which causes problems trying to build an sh2a-single-only executable, which falls back to sh2e but doesn't have this sh3e opcode.  Comments?

1091                ! Here handler available, call it.
1092                /* Now call the trap handler with as much of the context unchanged as possible.
1093                   Move trapping address into PR to make it look like the trap point */
1094 052a 0142      stc     spc, r1
1095 052c 412A      lds     r1, pr
unable to detect exception model
I have run into a build problem with tonight's gcc trunk on MacOS X which didn't exist in yesterday's svn pull.  The gcc trunk build on MacOS X 10.4.6 crashes with...

checking how to run the C++ preprocessor... /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/xgcc -shared-libgcc -B/sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc -nostdinc++ -L/sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/powerpc-apple-darwin8/libstdc++-v3/src -L/sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/powerpc-apple-darwin8/libstdc++-v3/src/.libs -B/sw/lib/gcc4/powerpc-apple-darwin8/bin/ -B/sw/lib/gcc4/powerpc-apple-darwin8/lib/ -isystem /sw/lib/gcc4/powerpc-apple-darwin8/include -isystem /sw/lib/gcc4/powerpc-apple-darwin8/sys-include -E
loading cache ./config.cache within ltconfig
checking host system type... powerpc-apple-darwin8
checking build system type... powerpc-apple-darwin8
checking for objdir... .libs
checking for /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/xgcc option to produce PIC... -fno-common -DPIC
checking if /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/xgcc PIC flag -fno-common -DPIC works... yes
checking if /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/xgcc static flag -static works... no
finding the maximum length of command line arguments... (cached) 196608
checking if /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/xgcc supports -c -o file.o... (cached) yes
checking if /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/xgcc supports -fno-rtti -fno-exceptions ... yes
checking whether the linker (/sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/collect-ld) supports shared libraries...
checking how to hardcode library paths into programs... unsupported
checking whether stripping libraries is possible... no
checking dynamic linker characteristics... darwin8 dyld
checking command to parse /sw/src/fink.build/gcc4-4.1.999-20060623/darwin_objdir/./gcc/nm output... failed
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
appending configuration tag "CXX" to libtool
checking for exception model to use... configure: error: unable to detect exception model
make[1]: *** [configure-target-libstdc++-v3] Error 1
make: *** [all] Error 2
### execution of /var/tmp/tmp.2.jzc1x4 failed, exit code 2
Failed: phase compiling: gcc4-4.1.999-20060623 failed

Any idea where this is coming from?

Jack
Re: Project RABLET
Ian Lance Taylor wrote:
> Andrew MacLeod <[EMAIL PROTECTED]> writes:
>
>> This describes my current work-in-progress, RABLET, which stands for
>> RABLE-Themes, and conveniently implies something smaller.
>
> Thanks for this proposal.
>
>> ssa-to-rtl
>> spill cost analysis
>> global allocation
>> spiller
>> spill location optimizer
>> instruction rewriter.
>
> You omitted the RTL loop optimizer passes, which still do quite a bit
> of work despite the tree-ssa loop passes.  Also if-conversion and some
> minor passes, though they are less relevant.
>
>> If expand is made much smarter, I would argue that much of GCSE and CSE
>> isn't needed.  We've already performed those optimizations at a high
>> level, and we can hopefully do a lot of the factoring and things on
>> addressing registers exposed during expand.  I'm sure there are other
>> things to do, but I would argue that they are significantly less than a
>> "general purpose" CSE and GCSE pass.  And in the cases of high register
>> pressure, how much would you want them to do anyway?  Its really these
>> high register pressure areas that RABLET is attacking anyway.
>
> Here I think you are waving your hands a little too hard.  RTL level
> CSE is significant for handling common expressions exposed by address
> calculations and by DImode (and larger) computations.  On some
> processors giving up CSE on address calculations would be very
> painful.  There needs to be a plan to handle that.

I agree with Ian completely.

Also, after having stared and worked on df in the backend with Kenny and watched the amount of work that has had to be done, i think you may be underestimating the complexity of what is really going on in the backend right now.

Not that i wouldn't love to see our backend become simpler and have a bunch of relatively non-complex df based passes, because I would, but i *also* don't think RABLET is going to enable that (or the removal of CSE/GCSE) through smarter expand.
It's possible you'd remove GCSE, but only because the last time i remember someone looking (stevenb, i think), it wasn't doing all *that* much.

Again, like Ian, I'd argue you'd need to do real instruction selection before register allocation before that can happen.  Luckily, these days, BURG-based instruction selection has become production usable, so that task isn't as horrid as it used to be.

--Dan
Re: sh3e opcodes in sh2e's crt1.o?
> It looks like crt1.asm unconditionally includes an sh3e opcode (stc
> spc,r1) which causes problems trying to build an sh2a-single-only
> executable, which falls back to sh2e but doesn't have this sh3e
> opcode.  Comments?

It's not actually unconditional, but the condition it depends on is set conditionally with a flawed condition.  Please try the attached patch.

[Attachment: tmp (binary data)]
Re: sh3e opcodes in sh2e's crt1.o?
> It's not actually unconditional, but the condition it depends on is set
> conditionally with a flawed condition.  Please try the attached patch.

That seems to fix it, although I only tested a simple hello.c program.  Thanks!
Re: Project RABLET
On Fri, 2006-06-23 at 23:08 -0400, Daniel Berlin wrote:
> Ian Lance Taylor wrote:
> >
> > Here I think you are waving your hands a little too hard.  RTL level
> > CSE is significant for handling common expressions exposed by address
> > calculations and by DImode (and larger) computations.  On some
> > processors giving up CSE on address calculations would be very
> > painful.  There needs to be a plan to handle that.
>
> I agree with Ian completely.
>
> Also, after having stared and worked on df in the backend with Kenny and
> watched the amount of work that has had to be done, i think you may be
> underestimating the complexity of what is really going on in the backend
> right now.
>
> Not that i wouldn't love to see our backend become simpler and have a
> bunch of relatively non-complex df based passes, because I would, but i
> *also* don't think RABLET is going to enable that (or the removal of
> CSE/GCSE) through smarter expand.  It's possible you'd remove GCSE, but
> only because the last time i remember someone looking (stevenb, i
> think), it wasn't doing all *that* much.
>
> Again, like Ian, I'd argue you'd need to do real instruction selection
> before register allocation before that can happen.  Luckily, these days,
> BURG based instruction selection has become production usable, so that
> task isn't as horrid as it used to be.

It occurs to me that there is a misunderstanding here.  Perhaps I didn't communicate this well enough, or perhaps I got a little carried away trying to make RABLET look like RABLE.  I'm not actually proposing that RABLET will enable the backend to suddenly become simple...

The initial impact of RABLET is to simply remove some of the onus of dealing with excessive register pressure from the register allocator.  RABLET will really do nothing when register pressure is not high; things would be pretty much exactly as they are today.
When register pressure is high, many of the things the RTL optimizations I mentioned do really become irrelevant (I think), since they increase register pressure more, and cause more spilling.  This generally offsets whatever good they do.  I was trying to claim that some level of this work can be done in expand, and in *this* circumstance, that's all that needs to be done.  Making the resulting model look somewhat like RABLE, and simplifying the view of the RTL optimizations.

It's possible that some of this work can simplify the RTL optimizations in other cases, perhaps not.  If we can simplify anything, that's great.  I'd love to see it, and I hope some of it is possible.  I do see possibilities that will hopefully pan out.

Ultimately, RABLET simply tries to present the backend with code that looks more like low register pressure code, which the current backend is pretty good at handling.  Anything else we can get from it is a bonus.  For the work involved in RABLET vs. the work involved in a new allocator like RABLE, I think RABLET is well worth doing (months vs. years).  I think RABLET will show a significant benefit.  (famous last words, ho ho :-)

Andrew
Re: Visibility and C++ Classes/Templates
Ian Lance Taylor wrote:
> Don't you still have to deal with this case?
>
> #pragma GCC visibility push(hidden)
> template <class T> T f(T);
> #pragma GCC visibility pop
> ...
> #pragma GCC visibility push(default)
> extern template int f(int);
> #pragma GCC visibility pop
>
> Personally I wouldn't mind saying that the attribute always beats the
> pragma, but it seems to me that there is still the potential for
> ambiguity.

I would treat that case as if the template had the attribute, and, therefore, ignore the pragma at the point of instantiation.

My concern here is that template instantiation can happen "at any time".  I'm sure we all agree that the pragma should affect *implicit* instantiations; if you happened to say:

#pragma GCC visibility push(default)
int i = f(int);
#pragma GCC visibility pop

we wouldn't want the visibility of "i" to affect "f".  But, an explicit instantiation:

template int f(int);

should really behave just like an implicit instantiation; it's just a manual way of saying "instantiate here".  And, "extern template" is a GNU extension which says "there's an explicit instantiation elsewhere; you needn't bother implicitly instantiating here".  I'm just not comfortable with the idea of #pragmas affecting instantiations.  (I'm OK with them affecting specializations, though; in that case, the original template has basically no impact, so I think it's fine to treat the specialization case as if it were any other function.)

--
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713
Re: Project RABLET
On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:
> You omitted the RTL loop optimizer passes, which still do quite a bit
> of work despite the tree-ssa loop passes.  Also if-conversion and some
> minor passes, though they are less relevant.

Which brings up a good discussion.  I presume the rtl loop optimizers see things exposed by addressing modes which aren't seen in the higher level code.  I wonder what the "big gains" are here... and if they are detectable at expansion time...

In general, I didn't mention anything that tends not to increase register pressure, at least not in any significant manner as far as RABLET is concerned.

> > If expand is made much smarter, I would argue that much of GCSE and CSE
> > isn't needed.  We've already performed those optimizations at a high
> > level, and we can hopefully do a lot of the factoring and things on
> > addressing registers exposed during expand.  I'm sure there are other
> > things to do, but I would argue that they are significantly less than a
> > "general purpose" CSE and GCSE pass.  And in the cases of high register
> > pressure, how much would you want them to do anyway?  Its really these
> > high register pressure areas that RABLET is attacking anyway.
>
> Here I think you are waving your hands a little too hard.  RTL level
> CSE is significant for handling common expressions exposed by address
> calculations and by DImode (and larger) computations.  On some
> processors giving up CSE on address calculations would be very
> painful.  There needs to be a plan to handle that.

Yes, there is some hand waving, mostly because I haven't gotten that far in details yet.  I expect to be able to do some of this type of commoning at rtl generation time as things are generated (much like RABLE's spiller reuses spill loads nearby).  That may turn out to be more difficult than I anticipate, however.
Pain is in the implementation :-)

I am not proposing that CSE necessarily be eliminated *all* the time, but in cases when register pressure is already excessively high, is further commoning of DImode values going to make things better?  It's really this case I'm interested in evaluating, since this is the case where we already have problems.  If we don't spill, RABLET would effectively do nothing.  Clearly there will be a lot of further investigation required once implementation reaches this point.  Ultimately CSE and all RTL optimizations can be re-evaluated to see if things can be simplified.

> Also at present many vector calculations are not exposed at the tree
> level--they are hidden inside builtin functions until they are
> expanded--and vector heavy code can also have a lot of common
> subexpressions.

I have no plan at the moment for vector operations :-).  That could change, but for now we'll have to keep whatever we do today for those.

> > If I recall, scheduling is register pressure aware and normally doesn't
> > increase register pressure dramatically.  If it does increase pressure,
> > well, this won't solve every problem after all.
>
> Unfortunately, scheduling is currently not register pressure aware at
> all.  The scheduler will gleefully increase register pressure.  That's
> why we don't even run the scheduler before register allocation on x86.

hum, too bad.  For some reason I was under the impression that it at least tried not to increase register pressure when it was above a certain threshold value.  Not running it at least means it won't increase register pressure, so that works :-)

> Modulo the above comments, I don't see anything wrong with your basic
> idea.  But I also wonder whether you couldn't get a similar effect by
> forcing instruction selection to occur before register allocation.  If
> that is done well, reload will have much less work to do.

That was one of the premises of RABLE.
Since out-of-ssa needs some TLC and TER has been a wart for years, this seems like a good way of dealing with those issues, and perhaps dealing with some significant RA issues at the same time.  (Anything to avoid actually rewriting RA, eh!)

> One of the basic issues with the current code is not that we do
> register allocation well or poorly, but that reload takes the output
> of the register allocator and munges it unpredictably.  That's going
> to happen with your proposal as well.  It doesn't mean that your
> proposal won't improve things.  But no register allocator can do a
> good job when it can't make the final decisions.

Truer words have never been spoken.  RABLET makes no attempt to do anything about reload.  It simply attempts to present the backend with code that isn't full of excessive register pressure.  If it turns out to be something reload screws up today, it will continue to be screwed up.  I suspect a lot of the time we do have excessive spill, RABLET will show benefit.  It's clearly not as good as a new register allocator would be, but the effort-to-benefit ratio ought to be a lot higher for RABLET than for a register allocator.
Re: Project RABLET
On Jun 23, 2006, at 9:39 PM, Andrew MacLeod wrote:

> On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:
>> You omitted the RTL loop optimizer passes, which still do quite a bit
>> of work despite the tree-ssa loop passes.  Also if-conversion and some
>> minor passes, though they are less relevant.
>
> Which brings up a good discussion.  I presume the rtl loop optimizers see
> things exposed by addressing modes which aren't seen in the higher level
> code.  I wonder what the "big gains" are here... and if they are
> detectable at expansion time...

The one rtl loop optimizer which has nothing to do with addressing modes is the doloop optimizer, which is most likely possible to do at expansion time and is one of the few loop optimizers which lowers register pressure.  The reason why it lowers register pressure is because it makes use of a special register for loops (at least on PowerPC).

Thanks,
Andrew Pinski
Re: Project RABLET
Andrew MacLeod <[EMAIL PROTECTED]> writes:

> On Fri, 2006-06-23 at 15:07 -0700, Ian Lance Taylor wrote:
>
> > You omitted the RTL loop optimizer passes, which still do quite a bit
> > of work despite the tree-ssa loop passes.  Also if-conversion and some
> > minor passes, though they are less relevant.
>
> Which brings up a good discussion.  I presume the rtl loop optimizers see
> things exposed by addressing modes which aren't seen in the higher level
> code.  I wonder what the "big gains" are here... and if they are
> detectable at expansion time...

One obvious gain is hoisting constants exposed by address expansion out of loops.  Also, once addressing modes are expanded, there are new IVs.

> I am not proposing that CSE necessarily be eliminated *all* the time,
> but in cases when register pressure is already excessively high, is
> further commoning of DImode values going to make things better?  Its
> really this case I'm interested in evaluating since this is the case we
> already have problems.  if we don't spill, RABLET would effectively do
> nothing.

I think that even when pressure is high, it helps a lot to do CSE after DImode values have been split up, as will be the case even today for, e.g., DImode bitwise operations.  It tends to reduce register pressure if anything.

As you say, none of these are arguments that RABLET is a bad idea; they are just arguments that we can't expect to remove the RTL passes without a lot more work, whether or not they increase register pressure.

One thing we could perhaps consider would be expanding addressing mode calculations at the tree level.

Ian