"Mark Knecht" <[EMAIL PROTECTED]> posted [EMAIL PROTECTED], excerpted below, on Thu, 14 Sep 2006 07:15:42 -0700:
> I'm just curious whether anyone besides me is noticing their machine
> feeling somewhat sluggish since doing the gcc-4.1 upgrade? Mine seems
> to be using a lot of memory. Alt-tabbing between windows seems slow.
> Ethernet traffic in my browser is causing pretty noticeable
> interruptions in things like MythTV.
>
> The machine is still quite usable, but it doesn't feel as snappy as it
> did last week.
>
> I made no changes in /etc/make.conf for the upgrade. Everything is
> pretty basic as far as I can tell:
>
> CFLAGS="-march=k8 -O2 -pipe"
> CXXFLAGS="${CFLAGS}"

I've noticed rather the opposite, here. gcc-4.1.1-compiled binaries are
/dramatically/ faster and more efficient than 3.x. However, I'm using
rather more elaborate CFLAGS/CXXFLAGS, and it's my conviction that
gcc-4.1 does better at optimizing exactly the way you've told it to.
That is, if you've given it inefficient optimizations, I'm convinced it
makes a bad thing worse, while if you've chosen your optimizations well,
it makes a good thing dramatically better.

Here's my CFLAGS/CXXFLAGS:

CFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
-freorder-blocks-and-partition -combine -funit-at-a-time -ftree-pre
-fgcse-sm -fgcse-las -fgcse-after-reload -fmerge-all-constants"

CXXFLAGS="-march=k8 -Os -pipe -frename-registers -fweb -freorder-blocks
-funit-at-a-time -ftree-pre -fgcse-sm -fgcse-las -fgcse-after-reload
-fmerge-all-constants"

The general strategy here is to take advantage of size optimization.
On modern processors, L1 and L2 cache are FAR FAR faster than main
memory, and the raw CPU core runs circles around even cache speeds.
Thus, optimizing for CPU speed at the expense of size makes little
sense, because all those saved cycles and more are likely to be spent
waiting for memory to return code that /would/ have fit in the cache
were it size optimized.
Thus, for example, where traditional optimizations unroll loops into
flat code where possible, to avoid the expense of the jump back to the
top of the loop, that unrolling spreads the loop out to several times
its original code size, thus taking far more room in fast cache and
forcing the CPU to wait far more often for code to be fetched from main
memory. I prefer to keep the loops, making the code smaller and thus
allowing more of it to fit in faster cache. I believe that for most
code, this technique results in faster execution in the real world,
despite the theoretical loss of a CPU cycle here or there on the jump
back to the top of the loop.

-freorder-blocks-and-partition, OTOH, can make code slightly larger,
but the effect is the same as the above: increased execution speed.
What this optimization does is separate code that is used often from
code that is seldom used, so the "hot" code is smaller and fits better
in high-speed cache, while the "cold" code ends up in slower main
memory most of the time. While a lower percentage of the code may be in
cache due to the larger total size, cache is used far more effectively,
as more "hot" code is retained therein, with the cold code that's not
used so often allowed to drop out of cache into main memory. This
particular optimization doesn't work well with C++, however, so it's in
my CFLAGS but not my CXXFLAGS.

Likewise with -combine, which allows the compiler to optimize across
multiple source files at a time. It's only implemented for C at this
time (according to the gcc manpage), so it's in my CFLAGS but omitted
from my CXXFLAGS.

The other strategy here is to make as full a use as possible of the
extra registers available to amd64 in 64-bit mode (as opposed to 32-bit
x86 mode). Registers operate at the speed of the CPU, with no wait at
all as there is for even L1 cache, so it pays to use them as
efficiently as possible.
Several of the flags in my CFLAGS (-frename-registers of course, -fweb,
etc.) are therefore designed to encourage gcc to do this.

All the flags I've not mentioned specifically further the three common
goals mentioned above: making as efficient a use as possible of the
speed of (1) registers and (2) cache memory, by (3) allowing gcc to
optimize over as wide a scope as possible (whole units with
-funit-at-a-time, or even multiple units with -combine). Of course, see
the gcc manpage for additional details.

As I said, with the above, there's a /dramatic/ improvement in
performance between gcc-3.x and gcc-4.1.x.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

-- 
gentoo-amd64@gentoo.org mailing list