On 11/08/2010 23:04, Vladimir Makarov wrote:
On 08/10/2010 09:51 PM, Ralf W. Grosse-Kunstleve wrote:
I wrote a Fortran to C++ conversion program that I used to convert
selected LAPACK sources. Comparing runtimes with different compilers I get:

compiler         absolute  relative
ifort 11.1.072   1.790s    1.00
gfortran 4.4.4   2.470s    1.38
g++ 4.4.4        2.922s    1.63

To get a full picture, it would be nice to see icc times too.
This is under Fedora 13 (64-bit) on a 12-core 2.2 GHz Opteron.

All files needed to easily reproduce the results are here:

http://cci.lbl.gov/lapack_fem/

See the README file or the example commands below.

Questions:

- Is there a way to make the g++ version as fast as ifort?


I think it is more important (and harder) to make gfortran closer to ifort.

I cannot speak about your fragment of LAPACK. But about 15 years ago I
worked on manual LAPACK optimization for an Alpha processor. As I
remember, LAPACK is quite a memory-bound benchmark. The hottest spot was
matrix multiplication, which is used in many places in LAPACK. The
matrix multiplication in LAPACK is already moderately optimized by using
a temporary variable, which makes it 1.5 times faster than the naive
algorithm (when the cache is not large enough to hold the matrices). But
proper loop optimizations (mostly tiling) could make it more than 4 times
faster.
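To make the tiling point concrete, here is a minimal C sketch (not taken from LAPACK or from the converted sources) of C += A*B on n x n column-major matrices. `matmul_temp` follows the loop order of the reference BLAS DGEMM with alpha = 1, hoisting the scalar `temp` out of the inner loop; `matmul_tiled` performs the same arithmetic in TILE x TILE blocks so that each block can be reused while it is still in cache. The function names, the `IDX` macro and `TILE = 64` are illustrative assumptions; a real tile size has to be tuned to the target's caches.

```c
#include <stddef.h>

/* Column-major element (i, j), 0-based, of a matrix with leading
   dimension ld; mirrors Fortran's A(I,J). Hypothetical helper. */
#define IDX(i, j, ld) ((i) + (size_t)(j) * (ld))

/* C += A*B, j-l-i loop order with the scalar temp hoisted out of the
   inner loop, in the style of the reference BLAS DGEMM (alpha = 1). */
void matmul_temp(int n, const double *a, const double *b, double *c)
{
    for (int j = 0; j < n; ++j)
        for (int l = 0; l < n; ++l) {
            double temp = b[IDX(l, j, n)];
            for (int i = 0; i < n; ++i)
                c[IDX(i, j, n)] += temp * a[IDX(i, l, n)];
        }
}

/* Assumed tile size; the best value depends on the target's caches. */
#define TILE 64

static int min_int(int x, int y) { return x < y ? x : y; }

/* Same computation, tiled: the three outer loops walk over TILE x TILE
   blocks, the three inner loops do the work inside one block. */
void matmul_tiled(int n, const double *a, const double *b, double *c)
{
    for (int jj = 0; jj < n; jj += TILE)
        for (int ll = 0; ll < n; ll += TILE)
            for (int ii = 0; ii < n; ii += TILE) {
                int jmax = min_int(jj + TILE, n);
                int lmax = min_int(ll + TILE, n);
                int imax = min_int(ii + TILE, n);
                for (int j = jj; j < jmax; ++j)
                    for (int l = ll; l < lmax; ++l) {
                        double temp = b[IDX(l, j, n)];
                        for (int i = ii; i < imax; ++i)
                            c[IDX(i, j, n)] += temp * a[IDX(i, l, n)];
                    }
            }
}
```

For every element (i, j) the tiled version still adds the products in ascending order of l, so both functions perform the same sequence of additions; only the memory access pattern changes.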

So I guess, and hope, that the Graphite project will finally improve
LAPACK by implementing tiling.

After solving the memory-bound problem, loop vectorization is another
important optimization which could improve LAPACK. Unfortunately, GCC
vectorizes fewer loops than ifort (about half as many the last time I
checked). I did not analyze the reason for this.
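As an illustration of what a vectorizer needs (a sketch, not code from this thread), the first loop below has independent iterations and is a typical vectorization candidate, while the second carries a dependence from one iteration to the next and cannot be vectorized as written. `restrict` promises the compiler that the arrays do not overlap. The function names are hypothetical. To see which loops GCC vectorized, GCC 4.4 offers `-ftree-vectorizer-verbose=N`, and GCC 4.8 and later offer `-fopt-info-vec`.

```c
#include <stddef.h>

/* y := alpha*x + y (the BLAS DAXPY operation, unit stride). Every
   iteration is independent, so several elements can be processed per
   vector instruction. Without restrict the compiler would have to prove,
   or check at run time, that x and y do not alias. */
void daxpy(size_t n, double alpha, const double *restrict x,
           double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* First-order recurrence: iteration i needs y[i - 1] from the previous
   iteration, so this loop cannot be vectorized as written. */
void recurrence(size_t n, double alpha, const double *restrict x,
                double *restrict y)
{
    for (size_t i = 1; i < n; ++i)
        y[i] = alpha * y[i - 1] + x[i];
}
```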

After solving the vectorization problem, another important lower-level
loop optimization is modulo scheduling (even though modern x86/x86_64
processors are out-of-order), because OOO processors can look ahead only
through a few branches. As I remember, the Intel compiler applies modulo
scheduling frequently. GCC's modulo scheduling is quite constrained.
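Modulo scheduling (software pipelining) overlaps consecutive loop iterations so that, in the steady state, each pass through the loop issues work from several iterations at once. Compilers do this on machine instructions; the C sketch below only imitates the idea at source level for a simple y[i] = a*x[i] loop split into three stages (load, multiply, store), with a prologue, a kernel and an epilogue. It is illustrative only, with hypothetical function names, and an optimizing compiler is free to reorder the source anyway. In GCC the corresponding pass is Swing Modulo Scheduling, enabled with `-fmodulo-sched`.

```c
#include <stddef.h>

/* Reference loop: load x[i], multiply, store y[i], one iteration at a time. */
void scale_plain(size_t n, double a, const double *restrict x,
                 double *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i];
}

/* Hand-pipelined version with three stages. In each kernel pass the store
   belongs to iteration i, the multiply to iteration i+1 and the load to
   iteration i+2, so none of them waits for another result from the same
   pass. */
void scale_pipelined(size_t n, double a, const double *restrict x,
                     double *restrict y)
{
    if (n < 2) {              /* too short to fill the pipeline */
        scale_plain(n, a, x, y);
        return;
    }
    /* Prologue: start iterations 0 and 1. */
    double loaded = x[0];
    double product = a * loaded;  /* iteration 0 multiplied */
    loaded = x[1];                /* iteration 1 loaded */
    size_t i = 0;
    /* Kernel (steady state). */
    for (; i + 2 < n; ++i) {
        y[i] = product;           /* store iteration i */
        product = a * loaded;     /* multiply iteration i+1 */
        loaded = x[i + 2];        /* load iteration i+2 */
    }
    /* Epilogue: drain the last two iterations. */
    y[i] = product;
    y[i + 1] = a * loaded;
}
```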

Those are my thoughts, but I might be wrong because I have had no time to
confirm my speculations. If you really want to help GCC developers, you
could do a comparative analysis of the code generated by ifort and
gfortran and find which optimizations GCC misses. GCC has few resources,
and the developers who could solve these problems are very busy. The
Intel optimizing compiler team (besides researchers) is much bigger than
the whole GCC community. Taking this into account, and that they have
much more information about their processors, I don't think gfortran will
generate better or equal code for floating-point benchmarks in the near
future.


This is a little out of my league (being neither a FORTRAN programmer nor a gcc developer).

However, I note that in the code translated from Fortran to C++, the two-dimensional array accesses are all changed into manual address calculations done as integer arithmetic. My understanding of the vectorisation, loop optimisation and more advanced code transformations from graphite is that they work best when given standard C array constructs. This gives the compiler the most information, and thus it can generate the best code.
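The difference can be sketched in C99 (hypothetical function names; a y = A*x product chosen only as an example). `matvec_flat` indexes a column-major matrix with explicit integer arithmetic, `a[i + j * lda]`, the way a mechanical translation might; `matvec_2d` declares the same storage as a variably modified array parameter, so the access is a genuine two-dimensional subscript `a[j][i]`. C99 is used because such parameters are not part of standard C++. Both compute the same result; whether the second form actually optimizes better with a given GCC version is what is being suggested here, not something this sketch demonstrates.

```c
#include <stddef.h>

/* y := A*x for an m x n column-major matrix with leading dimension lda,
   using a flat pointer and manual address arithmetic for A(i,j). */
void matvec_flat(int m, int n, int lda, const double *a,
                 const double *x, double *y)
{
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[i] += a[i + j * lda] * x[j];
}

/* Same computation with a C99 variably modified array parameter:
   a[j] is column j and a[j][i] is A(i,j); the column length lda is
   part of the parameter's type, so the subscript stays two-dimensional. */
void matvec_2d(int m, int n, int lda, double a[n][lda],
               const double *x, double *y)
{
    for (int i = 0; i < m; ++i)
        y[i] = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            y[i] += a[j][i] * x[j];
}
```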



