Re: food for optimizer developers
On Sat, Aug 14, 2010 at 12:08:21AM +0200, Tobias Burnus wrote:
>
> In terms of options, I think -funroll-loops should also be used as it
> usually improves performance (it is not enabled by any -O... option).
> I wonder if gfortran should check if -O and -O2 is given, and then add
> -funroll-loops.

In my tests, I've rarely seen a performance issue with -O and -O2. I have
seen problems in the past with -O3 and -funroll-loops.

-- 
Steve
Re: food for optimizer developers
Ralf W. Grosse-Kunstleve wrote:
> > Without knowing the compiler options, the results of any benchmark
> > are meaningless.
>
> I used
>   gfortran -o dsyev_test_gfortran -O3 -ffast-math dsyev_test.f

If you work on a 32bit x86 system, you really should use -march=native and
possibly -mfpmath=sse - otherwise you might generate code which also works
with very old versions of x86 processors. (On x86-64, -march=native is also
useful, albeit the effect is not as large; ifort's equivalent option is
-xHost. By default, ifort uses SSE and assumes more modern processors on
32bit.)

A possibly negligible effect is that ifort ignores parentheses by default -
to be standard conforming, use ifort's -assume protect_parens. (Or
gfortran's -fno-protect-parens, though your gfortran might be too old.)

An interesting test could be to use Intel's libimf with gfortran-compiled
binaries by setting the LD_PRELOAD=/opt/intel/Compiler/.../lib/.../libimf.so
environment variable (complete the path!) before running the binary of your
program. Another check could be to compile with -mveclibabi=svml and then
link Intel's short vector math library. That way you make sure you compare
the compilers themselves and not the compilers plus their libraries.

In terms of options, I think -funroll-loops should also be used as it
usually improves performance (it is not enabled by any -O... option).

GCC 4.5/4.6: You could also try "-flto -fwhole-file". (Best to use 4.6 for
LTO/-fwhole-file.)

Tobias
Re: food for optimizer developers
On Aug 12 2010, Steve Kargl wrote:
> Your observation reinforces the notion that doing benchmarks properly is
> difficult. I forgot about the lapack inquiry routines. One would think
> that some 20+ years after F90, Dongarra and colleagues would use the
> intrinsic numeric inquiry functions. Although the accumulated time is
> small, DLAMCH() is called 2642428 times during execution. Everything
> returned by DLAMCH() can be reduced to a compile time constant.

Part of that is deliberate - to enable the compiled code to be used from
languages like C++ - but I agree that this is a case that SHOULD not cause
trouble. Whether it does, under some compilers, I don't know.

A project that would be useful (but not very interesting) would be to
rewrite the LAPACK reference implementation in proper Fortran 90+. A
variant would be to do that, but update the interfaces, too. Both of these
would be good benchmarks for compilers - especially the latter - and would
encourage good code generation of array operations.

This would be a LOT shorter - I did the latter for the Cholesky solver as a
course example to show how modern Fortran is a transliteration of the
actual algorithm that is published everywhere. Unfortunately, that's
scarcely even a start on the whole library :-(

I believe that some other people have done a bit more, but not amounting to
most of a conversion, though I might be out of date (it is some time since
I looked). I can't think of how to encourage such work, and am not going to
do it myself.

Regards,
Nick Maclaren.
Re: food for optimizer developers
Hi Steve,

> > Can you tell how you obtained the performance numbers you are using?
> > There may be a few compiler flags you could add to reduce that ratio
> > of 1.4 to something better.
>
> Without knowing the compiler options, the results of any benchmark
> are meaningless.

I used

  gfortran -o dsyev_test_gfortran -O3 -ffast-math dsyev_test.f

as per this script (same directory as the .f file) which lists all
compilation commands (ifort, etc.):

  http://cci.lbl.gov/lapack_fem/lapack_fem_001/compile_dsyev_tests.sh

> For various versions of gfortran, I find the
> following average of 5 executions in seconds:
>
> #         A      B      C      D      E      F      G      H
> # gfc43  9.808  9.374  9.314  9.832  9.620  9.526  9.022  9.156
> # gfc44  9.806  9.440  9.222  9.810  9.414  9.320  8.980  9.152
> # gfc45  9.672  9.530  9.250  9.744  9.400  9.204  8.960  8.992
> # gfc4x  9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022
> #
> # A = -march=native -O
> # B = -march=native -O2
> # C = -march=native -O3
> # D = -march=native -O -ffast-math
> # E = -march=native -O2 -ffast-math
> # F = -march=native -O -funroll-loops
> # G = -march=native -O2 -funroll-loops
> # H = -march=native -O3 -funroll-loops
> #
> # Note 1: STOP DLAMC1 failure (10)
> #
> # gfc43 --> 4.3.6 20100728 (prerelease)
> # gfc44 --> 4.4.5 20100728 (prerelease)
> # gfc45 --> 4.5.1 20100728 (prerelease)
> # gfc4x --> 4.6.0 20100810 (experimental)

Very useful! I'm adding a column "I" with "-O3 -ffast-math" (which I've
been using forever...). I'm also trying with ("fc13-n") and without
("fc13") -march=native; I'm embarrassed to admit this option has escaped
me before. On my FC13 machine with gcc 4.4.4 (12-core Opteron 2.2GHz):

  #          A      B      C      D      E      F      G      H      I
  # fc13    3.309  2.755  2.462  3.234  2.787  2.956  2.366  2.381  2.296
  # fc13-n  3.176  2.742  2.037  3.310  2.730  2.899  2.447  1.982  1.894

For comparison, the ifort -O time was 1.790. Which means gfortran is only
6% slower! My original table revised after adding -march=native:

                   absolute  relative
  ifort 11.1.072   1.790s    1.00
  gfortran 4.4.4   1.894s    1.06
  g++ 4.4.4        2.772s    1.55

Ralf
Re: food for optimizer developers
On Thu, Aug 12, 2010 at 08:47:34PM +0200, Toon Moene wrote:
> Steve Kargl wrote:
>
> ># gfc4x  9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022
>
> Column 5 compiled with -march=native -O2 -ffast-math
>
> ># Note 1: STOP DLAMC1 failure (10)
>
> That's probably because a standard compile of the LAPACK sources only
> compiles {S|D}LAM* with -O0.
>
> The code is simply not written for any higher optimization (i.e., it
> assumes the compiler more or less compiles it "literally").

Your observation reinforces the notion that doing benchmarks properly is
difficult. I forgot about the lapack inquiry routines. One would think
that some 20+ years after F90, Dongarra and colleagues would use the
intrinsic numeric inquiry functions. Although the accumulated time is
small, DLAMCH() is called 2642428 times during execution. Everything
returned by DLAMCH() can be reduced to a compile time constant.

-- 
Steve
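[For readers following along: the "compile time constant" point above can be made concrete in C++, the target language of the conversion discussed in this thread, via std::numeric_limits. A minimal sketch - the namespace name and the selection of parameters are illustrative only, and note that reference LAPACK's DLAMCH('E') historically returns ulp/2 under round-to-nearest, i.e. half of the C++ epsilon, so the mapping is not one-to-one:]

```cpp
#include <limits>

// Machine parameters that reference-LAPACK's DLAMCH recomputes at run
// time, expressed as compile-time constants for IEEE double precision.
namespace machine {
  // gap between 1.0 and the next representable double (DBL_EPSILON);
  // DLAMCH's 'eps' is half of this under round-to-nearest
  constexpr double ulp    = std::numeric_limits<double>::epsilon();
  // smallest positive normalized double (underflow threshold)
  constexpr double rmin   = std::numeric_limits<double>::min();
  // largest finite double (overflow threshold)
  constexpr double rmax   = std::numeric_limits<double>::max();
  // floating-point radix, DLAMCH('B')
  constexpr int    base   = std::numeric_limits<double>::radix;
  // number of base digits in the significand, DLAMCH('N')
  constexpr int    digits = std::numeric_limits<double>::digits;
}
```

Every value here is folded at compile time, so the 2642428 run-time calls simply disappear.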
Re: food for optimizer developers
Steve Kargl wrote:

# gfc4x  9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022

Column 5 compiled with -march=native -O2 -ffast-math

# Note 1: STOP DLAMC1 failure (10)

That's probably because a standard compile of the LAPACK sources only
compiles {S|D}LAM* with -O0.

The code is simply not written for any higher optimization (i.e., it
assumes the compiler more or less compiles it "literally").

Cheers,

-- 
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran
Re: food for optimizer developers
On Thu, Aug 12, 2010 at 09:51:42AM +0200, Steven Bosscher wrote:
> On Thu, Aug 12, 2010 at 8:46 AM, Ralf W. Grosse-Kunstleve wrote:
> > Hi Vladimir,
> >
> > Thanks for the feedback! Very interesting.
> >
> >> Intel optimization compiler team (besides researchers) is much bigger
> >> than whole GCC community.
> >
> > That's a surprise to me. I have to say that the GCC community has done
> > amazing work, as you came within a factor of 1.4 (gfortran) and 1.6
> > (g++ compiling converted code) of ifort performance, which is close
> > enough for our purposes, and I think those of many people.
>
> Well, I think a ratio of gfortran/ifort=1.4 isn't so great, really. If
> you look at one of the popular Fortran benchmarks (Polyhedron,
> http://www.polyhedron.com/pb05-linux-f90bench_p40html), the ratio was
> less than 1.2 for gfortran 4.3 vs. ifort 11 on an Intel iCore7.
>
> Can you tell how you obtained the performance numbers you are using?
> There may be a few compiler flags you could add to reduce that ratio
> of 1.4 to something better.

Without knowing the compiler options, the results of any benchmark
are meaningless. For various versions of gfortran, I find the
following average of 5 executions in seconds:

#         A      B      C      D      E      F      G      H
# gfc43  9.808  9.374  9.314  9.832  9.620  9.526  9.022  9.156
# gfc44  9.806  9.440  9.222  9.810  9.414  9.320  8.980  9.152
# gfc45  9.672  9.530  9.250  9.744  9.400  9.204  8.960  8.992
# gfc4x  9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022
#
# A = -march=native -O
# B = -march=native -O2
# C = -march=native -O3
# D = -march=native -O -ffast-math
# E = -march=native -O2 -ffast-math
# F = -march=native -O -funroll-loops
# G = -march=native -O2 -funroll-loops
# H = -march=native -O3 -funroll-loops
#
# Note 1: STOP DLAMC1 failure (10)
#
# gfc43 --> 4.3.6 20100728 (prerelease)
# gfc44 --> 4.4.5 20100728 (prerelease)
# gfc45 --> 4.5.1 20100728 (prerelease)
# gfc4x --> 4.6.0 20100810 (experimental)

I'll note that G is my normal FFLAGS setting along with -ftree-vectorize.
Columns D and E above highlight why I consider -ffast-math to be an evil
option. For this benchmark the math is neither fast nor is it
particularly safe with gfc4x.

-- 
Steve
Re: food for optimizer developers
On 11/08/2010 23:04, Vladimir Makarov wrote:
> On 08/10/2010 09:51 PM, Ralf W. Grosse-Kunstleve wrote:
> > I wrote a Fortran to C++ conversion program that I used to convert
> > selected LAPACK sources. Comparing runtimes with different compilers
> > I get:
> >
> >                    absolute  relative
> >   ifort 11.1.072   1.790s    1.00
> >   gfortran 4.4.4   2.470s    1.38
> >   g++ 4.4.4        2.922s    1.63
>
> To get a full picture, it would be nice to see icc times too.
>
> > This is under Fedora 13, 64-bit, 12-core Opteron 2.2GHz
> >
> > All files needed to easily reproduce the results are here:
> >   http://cci.lbl.gov/lapack_fem/
> > See the README file or the example commands below.
> >
> > Questions:
> > - Is there a way to make the g++ version as fast as ifort?
>
> I think it is more important (and harder) to make gfortran closer to
> ifort. I cannot speak about your fragment of LAPACK, but about 15 years
> ago I worked on manual LAPACK optimization for an Alpha processor. As I
> remember, LAPACK is quite a memory-bound benchmark. The hottest spot
> was matrix multiplication, which is used in many LAPACK places. The
> matrix multiplication in LAPACK is already moderately optimized by
> using a temporary variable, which makes it 1.5 times faster (if the
> cache is not big enough to hold the matrices) than the normal
> algorithm. But proper loop optimizations (tiling mostly) could improve
> it by more than a further factor of 4. So I guess and hope the graphite
> project will finally improve LAPACK by implementing tiling.
>
> After solving the memory-bound problem, loop vectorization is another
> important optimization which could improve LAPACK. Unfortunately, GCC
> vectorizes fewer loops (about two times fewer, the last time I checked)
> than ifort. I did not analyze what the reason for this is.
>
> After solving the vectorization problem, another important lower-level
> loop optimization is modulo scheduling (even if modern x86/x86_64
> processors are out of order), because OOO processors can look only
> through a few branches. As I remember, the Intel compiler does modulo
> scheduling frequently; GCC's modulo scheduling is quite constrained.
>
> Those are my thoughts, but I might be wrong because I have no time to
> confirm my speculations. If you really want to help GCC developers, you
> could make a comparison analysis of the code generated by ifort and
> gfortran and find what optimizations GCC misses. GCC has few resources,
> and the developers who could solve the problems are very busy. The
> Intel optimizing compiler team (besides researchers) is much bigger
> than the whole GCC community. Taking this into account, and that they
> have much more info about their processors, I don't think gfortran will
> generate better or equal code for floating point benchmarks in the
> near future.

This is a little out of my league (being neither a FORTRAN programmer nor
a gcc developer). However, I note that in the code translated from
Fortran to C++, the two-dimensional array accesses are all changed into
manual address calculations done as integer arithmetic. My understanding
of the vectorisation, loop optimisation and more advanced code
transformations from graphite is that they work best when given standard
C array constructs. This gives the compiler the most information, and
thus it can generate the best code.
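[To make the "manual address calculations" point concrete, here is a hedged sketch of what the translated accessor boils down to, next to the plain-array form that loop optimizers analyze most easily. The names `at`, `fill_flat` and `fill_native` are invented for this example; fable's real wrapper classes differ in detail:]

```cpp
#include <cstddef>

// What the translated code's a(i, j) computes: a flat pointer plus
// manual integer arithmetic, column-major, 1-based subscripts.
inline double& at(double* elems, std::ptrdiff_t lead,
                  std::ptrdiff_t i, std::ptrdiff_t j) {
  return elems[(j - 1) * lead + (i - 1)];
}

// Fill a 4 x 3 column-major matrix with 10*j + i through the
// manual-arithmetic accessor.
inline void fill_flat(int m, int n, double* elems) {
  for (int j = 1; j <= n; ++j)
    for (int i = 1; i <= m; ++i)
      at(elems, m, i, j) = 10.0 * j + i;
}

// The same fill written against a plain built-in array, where the
// subscript form native[j][i] leaves the address arithmetic - and the
// dependence analysis - to the compiler. Both produce an identical
// memory layout, since native[j] is one contiguous column of 4.
inline void fill_native(double native[3][4]) {
  for (int j = 0; j < 3; ++j)
    for (int i = 0; i < 4; ++i)
      native[j][i] = 10.0 * (j + 1) + (i + 1);
}
```

Both functions do the same work; the difference is only in how much of the index structure the compiler can see.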
Re: food for optimizer developers
On Thu, Aug 12, 2010 at 8:46 AM, Ralf W. Grosse-Kunstleve wrote:
> Hi Vladimir,
>
> Thanks for the feedback! Very interesting.
>
>> Intel optimization compiler team (besides researchers) is much bigger
>> than whole GCC community.
>
> That's a surprise to me. I have to say that the GCC community has done
> amazing work, as you came within factor 1.4 (gfortran) and 1.6 (g++
> compiling converted code) of ifort performance, which is close enough
> for our purposes, and I think those of many people.

Well, I think a ratio of gfortran/ifort=1.4 isn't so great, really. If
you look at one of the popular Fortran benchmarks (Polyhedron,
http://www.polyhedron.com/pb05-linux-f90bench_p40html), the ratio was
less than 1.2 for gfortran 4.3 vs. ifort 11 on an Intel iCore7.

Can you tell how you obtained the performance numbers you are using?
There may be a few compiler flags you could add to reduce that ratio
of 1.4 to something better.

Ciao!
Steven
Re: food for optimizer developers
Hi Vladimir,

Thanks for the feedback! Very interesting.

> Intel optimization compiler team (besides researchers) is much bigger
> than whole GCC community.

That's a surprise to me. I have to say that the GCC community has done
amazing work, as you came within a factor of 1.4 (gfortran) and 1.6 (g++
compiling converted code) of ifort performance, which is close enough for
our purposes, and I think those of many people. To add to this, icpc vs.
g++ is a tie overall, with g++ even having a slight advantage. Really
great work!

Ralf
Re: food for optimizer developers
Hi Richard,

> How about using an automatic converter to arrange for C++ code to
> call into the generated Fortran code instead? Create nice classes
> and wrappers and such, but in the end arrange for the Fortran code
> to be called to do the real work.

I found it very labor intensive to maintain a mixed Fortran/C++ build
system. I'd rather take the speed hit than deal with the constant trickle
of problems arising from non-existent or incompatible Fortran compilers.
We distribute a pretty large system in source form to users (biologists)
who sometimes don't even know what a command-line prompt is. If
installation doesn't work out of the box, a large fraction of our users
simply give up.

Ralf
Re: food for optimizer developers
On 08/10/2010 09:51 PM, Ralf W. Grosse-Kunstleve wrote:
> I wrote a Fortran to C++ conversion program that I used to convert
> selected LAPACK sources. Comparing runtimes with different compilers
> I get:
>
>                    absolute  relative
>   ifort 11.1.072   1.790s    1.00
>   gfortran 4.4.4   2.470s    1.38
>   g++ 4.4.4        2.922s    1.63

To get a full picture, it would be nice to see icc times too.

> This is under Fedora 13, 64-bit, 12-core Opteron 2.2GHz
>
> All files needed to easily reproduce the results are here:
>   http://cci.lbl.gov/lapack_fem/
> See the README file or the example commands below.
>
> Questions:
> - Is there a way to make the g++ version as fast as ifort?

I think it is more important (and harder) to make gfortran closer to
ifort. I cannot speak about your fragment of LAPACK, but about 15 years
ago I worked on manual LAPACK optimization for an Alpha processor. As I
remember, LAPACK is quite a memory-bound benchmark. The hottest spot was
matrix multiplication, which is used in many LAPACK places. The matrix
multiplication in LAPACK is already moderately optimized by using a
temporary variable, which makes it 1.5 times faster (if the cache is not
big enough to hold the matrices) than the normal algorithm. But proper
loop optimizations (tiling mostly) could improve it by more than a
further factor of 4. So I guess and hope the graphite project will
finally improve LAPACK by implementing tiling.

After solving the memory-bound problem, loop vectorization is another
important optimization which could improve LAPACK. Unfortunately, GCC
vectorizes fewer loops (about two times fewer, the last time I checked)
than ifort. I did not analyze what the reason for this is.

After solving the vectorization problem, another important lower-level
loop optimization is modulo scheduling (even if modern x86/x86_64
processors are out of order), because OOO processors can look only
through a few branches. As I remember, the Intel compiler does modulo
scheduling frequently; GCC's modulo scheduling is quite constrained.

Those are my thoughts, but I might be wrong because I have no time to
confirm my speculations. If you really want to help GCC developers, you
could make a comparison analysis of the code generated by ifort and
gfortran and find what optimizations GCC misses. GCC has few resources,
and the developers who could solve the problems are very busy. The Intel
optimizing compiler team (besides researchers) is much bigger than the
whole GCC community. Taking this into account, and that they have much
more info about their processors, I don't think gfortran will generate
better or equal code for floating point benchmarks in the near future.
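[For readers unfamiliar with the tiling idea mentioned above, here is a textbook sketch in C++: a naive triple loop using the scalar temporary, and a blocked version that works on BS x BS tiles so the operands stay cache-resident while they are reused. The function names and the row-major layout are illustrative only; this is not LAPACK/BLAS code:]

```cpp
#include <algorithm>

// C = A * B for N x N row-major matrices, naive ordering.
void matmul_naive(int N, const double* A, const double* B, double* C) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
      double t = 0.0;  // scalar temporary, as in the hand-optimized code
      for (int k = 0; k < N; ++k)
        t += A[i * N + k] * B[k * N + j];
      C[i * N + j] = t;
    }
}

// Same product, but iterating over BS x BS tiles: each tile of A and B
// is reused many times while it is still in cache.
void matmul_tiled(int N, int BS, const double* A, const double* B,
                  double* C) {
  std::fill(C, C + N * N, 0.0);
  for (int ii = 0; ii < N; ii += BS)
    for (int kk = 0; kk < N; kk += BS)
      for (int jj = 0; jj < N; jj += BS)
        // sweep the current tiles; std::min handles ragged edges
        for (int i = ii; i < std::min(ii + BS, N); ++i)
          for (int k = kk; k < std::min(kk + BS, N); ++k) {
            const double aik = A[i * N + k];
            for (int j = jj; j < std::min(jj + BS, N); ++j)
              C[i * N + j] += aik * B[k * N + j];
          }
}
```

Both orderings compute the same product; the tiled one is what graphite-style loop transformations would ideally derive automatically.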
Re: food for optimizer developers
On 08/11/2010 10:59 AM, Ralf W. Grosse-Kunstleve wrote:
> My original posting shows that gfortran and g++ don't do as good
> a job as ifort in generating efficient machine code. Note that the
> loss going from gfortran to g++ isn't as bad as going from ifort to
> gfortran. This gives me hope that the gcc developers could work over
> time towards bringing the performance of the g++-generated code
> closer to the original ifort performance.

While of course there's room for g++ to improve, I think it's more
likely that gfortran can improve to meet ifort. The biggest issue, from
the compiler writer's perspective, is that the Fortran language provides
more information to the optimizers than the C++ language can. A Really
Good compiler will probably always be able to do better with Fortran
than C++.

> I think speed will be the major argument against using the C++ code
> generated by the automatic converter.

How about using an automatic converter to arrange for C++ code to call
into the generated Fortran code instead? Create nice classes and
wrappers and such, but in the end arrange for the Fortran code to be
called to do the real work.

r~
Re: food for optimizer developers
Hi Tim,

> Do you mean you are adding an additional level of functions and hoping
> for efficient in-lining?

Note that my questions arise in the context of automatic code generation:

  http://cci.lbl.gov/fable

Please compare e.g. the original LAPACK code with the generated C++ code
to see why the C++ code is done the way it is. A goal more important than
speed is that the auto-generated C++ code is similar to the original
Fortran code and not inflated/obfuscated by constructs meant to cater to
optimizers (which change over time anyway).

My original posting shows that gfortran and g++ don't do as good a job as
ifort in generating efficient machine code. Note that the loss going from
gfortran to g++ isn't as bad as going from ifort to gfortran. This gives
me hope that the gcc developers could work over time towards bringing the
performance of the g++-generated code closer to the original ifort
performance.

I think speed will be the major argument against using the C++ code
generated by the automatic converter. If the generated C++ code could
somehow be made to run nearly as fast as the original Fortran compiled
with ifort, there wouldn't be any good reason anymore to still develop in
Fortran, or to bother with the complexities of mixing languages.

Ralf
Re: food for optimizer developers
On 8/10/2010 9:21 PM, Ralf W. Grosse-Kunstleve wrote:
> Most of the time is spent in this function...
>
>   void dlasr(
>     str_cref side, str_cref pivot, str_cref direct,
>     int const& m, int const& n,
>     arr_cref c, arr_cref s,
>     arr_ref a, int const& lda)
>
> in this loop:
>
>   FEM_DOSTEP(j, n - 1, 1, -1) {
>     ctemp = c(j);
>     stemp = s(j);
>     if ((ctemp != one) || (stemp != zero)) {
>       FEM_DO(i, 1, m) {
>         temp = a(i, j + 1);
>         a(i, j + 1) = ctemp * temp - stemp * a(i, j);
>         a(i, j) = stemp * temp + ctemp * a(i, j);
>       }
>     }
>   }
>
> a(i, j) is implemented as
>
>   T* elems_; // member
>
>   T const& operator()(ssize_t i1, ssize_t i2) const
>   {
>     return elems_[dims_.index_1d(i1, i2)];
>   }
>
> with
>
>   ssize_t all[Ndims]; // member
>   ssize_t origin[Ndims]; // member
>
>   size_t index_1d(ssize_t i1, ssize_t i2) const
>   {
>     return (i2 - origin[1]) * all[0] + (i1 - origin[0]);
>   }
>
> The array pointer is buried as the elems_ member in the arr_ref<> class
> template. How can I apply __restrict in this case?

Do you mean you are adding an additional level of functions and hoping
for efficient in-lining? Your programming style is elusive, and your
insistence on top posting will make this thread difficult to deal with.

The conditional inside the loop is likely even more difficult for C++ to
optimize than Fortran. As already discussed, if you don't optimize
otherwise, you will need __restrict to overcome aliasing concerns among
a, c, and s. If you want efficient C++, you will need a lot of hand
optimization, and verification of the effect of each level of obscurity
which you add.

How is this topic appropriate to the gcc mailing list?

-- 
Tim Prince
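[One hedged answer to the "how can I apply __restrict" question, along the lines Tim suggests: peel the wrapper away inside the hot routine and operate on raw pointers declared __restrict, asserting the no-overlap guarantee the Fortran arguments carried. This is a sketch with invented identifiers, not fable's actual code, and the 0-based rewrite of the DO loops should be checked against the original; __restrict is the GNU/MSVC spelling accepted by g++:]

```cpp
#include <cstddef>

// Plane-rotation loop from dlasr with the array wrappers removed:
// c, s and a are promised non-overlapping via __restrict, which is
// what the Fortran argument semantics guaranteed anyway.
void apply_rotations(std::ptrdiff_t m, std::ptrdiff_t n,
                     const double* __restrict c,
                     const double* __restrict s,
                     double* __restrict a, std::ptrdiff_t lda) {
  // 0-based equivalent of FEM_DOSTEP(j, n - 1, 1, -1)
  for (std::ptrdiff_t j = n - 2; j >= 0; --j) {
    const double ctemp = c[j];
    const double stemp = s[j];
    if (ctemp != 1.0 || stemp != 0.0) {
      // columns j and j + 1 of the column-major matrix; they do not
      // overlap as long as lda >= m
      double* __restrict col0 = a + j * lda;
      double* __restrict col1 = a + (j + 1) * lda;
      for (std::ptrdiff_t i = 0; i < m; ++i) {
        const double temp = col1[i];
        col1[i] = ctemp * temp - stemp * col0[i];
        col0[i] = stemp * temp + ctemp * col0[i];
      }
    }
  }
}
```

A generated-code variant of this could expose a `double* elems()` accessor on the wrapper and hoist the pointers once per call, keeping the public interface unchanged.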
Re: food for optimizer developers
Most of the time is spent in this function...

  void dlasr(
    str_cref side, str_cref pivot, str_cref direct,
    int const& m, int const& n,
    arr_cref c, arr_cref s,
    arr_ref a, int const& lda)

in this loop:

  FEM_DOSTEP(j, n - 1, 1, -1) {
    ctemp = c(j);
    stemp = s(j);
    if ((ctemp != one) || (stemp != zero)) {
      FEM_DO(i, 1, m) {
        temp = a(i, j + 1);
        a(i, j + 1) = ctemp * temp - stemp * a(i, j);
        a(i, j) = stemp * temp + ctemp * a(i, j);
      }
    }
  }

a(i, j) is implemented as

  T* elems_; // member

  T const& operator()(ssize_t i1, ssize_t i2) const
  {
    return elems_[dims_.index_1d(i1, i2)];
  }

with

  ssize_t all[Ndims]; // member
  ssize_t origin[Ndims]; // member

  size_t index_1d(ssize_t i1, ssize_t i2) const
  {
    return (i2 - origin[1]) * all[0] + (i1 - origin[0]);
  }

The array pointer is buried as elems_ member in the arr_ref<> class
template. How can I apply __restrict in this case?

Ralf

----- Original Message -----
From: Andrew Pinski
To: Ralf W. Grosse-Kunstleve
Cc: gcc@gcc.gnu.org
Sent: Tue, August 10, 2010 8:47:18 PM
Subject: Re: food for optimizer developers

On Tue, Aug 10, 2010 at 6:51 PM, Ralf W. Grosse-Kunstleve wrote:
> I wrote a Fortran to C++ conversion program that I used to convert
> selected LAPACK sources. Comparing runtimes with different compilers
> I get:
>
>                    absolute  relative
>   ifort 11.1.072   1.790s    1.00
>   gfortran 4.4.4   2.470s    1.38
>   g++ 4.4.4        2.922s    1.63

I wonder if adding __restrict to some of the arguments of the functions
will help. Fortran aliasing is so different from C aliasing.

-- Pinski
Re: food for optimizer developers
On Tue, Aug 10, 2010 at 6:51 PM, Ralf W. Grosse-Kunstleve wrote:
> I wrote a Fortran to C++ conversion program that I used to convert
> selected LAPACK sources. Comparing runtimes with different compilers
> I get:
>
>                    absolute  relative
>   ifort 11.1.072   1.790s    1.00
>   gfortran 4.4.4   2.470s    1.38
>   g++ 4.4.4        2.922s    1.63

I wonder if adding __restrict to some of the arguments of the functions
will help. Fortran aliasing is so different from C aliasing.

-- Pinski