Re: food for optimizer developers

2010-08-13 Thread Steve Kargl
On Sat, Aug 14, 2010 at 12:08:21AM +0200, Tobias Burnus wrote:
> 
> In terms of options, I think -funroll-loops should also be used as it 
> usually improves performance (it is not enabled by any -O... option).
> 

I wonder if gfortran should check whether -O or -O2 is given,
and then add -funroll-loops.  In my tests, I've rarely
seen a performance issue with -O and -O2.  I have seen problems
in the past with -O3 and -funroll-loops.

-- 
Steve


Re: food for optimizer developers

2010-08-13 Thread Tobias Burnus

 Ralf W. Grosse-Kunstleve wrote:

Without knowing the compiler options, the results of any benchmark
are meaningless.

I used
   gfortran -o dsyev_test_gfortran -O3 -ffast-math dsyev_test.f


If you work on a 32-bit x86 system, you really should use -march=native 
and possibly -mfpmath=sse - otherwise you might generate code which also 
works with very old x86 processors and thus cannot use newer 
instructions. (On x86-64, -march=native is also useful, albeit the 
effect is not as large; ifort's equivalent option is -xHost. By default, 
ifort assumes SSE and more modern processors even on 32-bit.)


A possibly negligible effect is that ifort ignores parentheses by 
default - to be standard conforming, use ifort's -assume protect_parens. 
(Or gfortran's -fno-protect-parens to match ifort's default, though your 
gfortran might be too old to have that option.)


An interesting test could be to use Intel's libimf with 
gfortran-compiled binaries by setting the

  LD_PRELOAD=/opt/intel/Compiler/.../lib/.../libimf.so

environment variable (complete the path!) before running the binary of 
your program. Another check could be to compile with -mveclibabi=svml 
and then link Intel's short vector math library. That way you make sure 
you compare the compilers themselves and not the compilers plus their 
math libraries.


In terms of options, I think -funroll-loops should also be used as it 
usually improves performance (it is not enabled by any -O... option).


GCC 4.5/4.6: You could also try "-flto -fwhole-file". (Best to use 4.6 
for LTO/-fwhole-file.)


Tobias


Re: food for optimizer developers

2010-08-13 Thread N.M. Maclaren

On Aug 12 2010, Steve Kargl wrote:


Your observation reinforces the notion that doing
benchmarks properly is difficult.  I forgot about
the lapack inquiry routines.  One would think that,
some 20+ years after F90, Dongarra and colleagues
would use the intrinsic numeric inquiry functions.
Although the accumulated time is small, DLAMCH() is
called 2642428 times during execution.  Everything
returned by DLAMCH() can be reduced to a
compile-time constant.


Part of that is deliberate - to enable the compiled code to
be used from languages like C++ - but I agree that this is
a case that SHOULD not cause trouble.  Whether it does, under
some compilers, I don't know.

A project that would be useful (but not very interesting)
would be to rewrite the LAPACK reference implementation in
proper Fortran 90+.  A variant would be to do that, but
update the interfaces, too.  Both of these would be good
benchmarks for compilers - especially the latter - and would
encourage good code generation of array operations.

This would be a LOT shorter - I did the latter for the
Cholesky solver as a course example, to show how modern
Fortran allows a near-transliteration of the actual algorithm
that is published everywhere.  Unfortunately, that's
scarcely even a start on the whole library :-(

I believe that some other people have done a bit more, but
not accumulating to most of a conversion, though I might be
out of date (it is some time since I looked).  I can't think
of how to encourage such work, and am not going to do it
myself.

Regards,
Nick Maclaren.




Re: food for optimizer developers

2010-08-12 Thread Ralf W. Grosse-Kunstleve
Hi Steve,

> > Can you tell how you obtained the performance numbers you are using?
> > There may be a few compiler flags you could add to reduce that ratio
> > of 1.4 to something better.
> > 
>
> Without knowing the compiler options, the results of any benchmark
> are meaningless.

I used

  gfortran -o dsyev_test_gfortran -O3 -ffast-math dsyev_test.f

as per this script (same directory as the .f file) which lists all compilation
commands (ifort, etc.):

  http://cci.lbl.gov/lapack_fem/lapack_fem_001/compile_dsyev_tests.sh

> For various versions of gfortran, I find the
> following average of 5 executions in seconds:

#   A  B  C  D  E  F  G  H
# gfc43   9.808  9.374  9.314  9.832  9.620  9.526  9.022  9.156
# gfc44   9.806  9.440  9.222  9.810  9.414  9.320  8.980  9.152
# gfc45   9.672  9.530  9.250  9.744  9.400  9.204  8.960  8.992
# gfc4x   9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022
#
# A = -march=native -O
# B = -march=native -O2
# C = -march=native -O3
# D = -march=native -O  -ffast-math
# E = -march=native -O2 -ffast-math
# F = -march=native -O  -funroll-loops
# G = -march=native -O2 -funroll-loops
# H = -march=native -O3 -funroll-loops
#
# Note 1:  STOP DLAMC1 failure (10)
#
# gfc43 --> 4.3.6 20100728 (prerelease)
# gfc44 --> 4.4.5 20100728 (prerelease)
# gfc45 --> 4.5.1 20100728 (prerelease)
# gfc4x --> 4.6.0 20100810 (experimental)

Very useful!
I'm adding a column "I" with "-O3 -ffast-math" (which I've been using 
forever...).
I'm also trying with ("fc13-n") and without ("fc13") -march=native; I'm
embarrassed to admit this option has escaped me before.
On my FC13 machine with gcc 4.4.4 (12-core Opteron 2.2GHz):

#   A  B  C  D  E  F  G  H  I
# fc13   3.309  2.755  2.462  3.234  2.787  2.956  2.366  2.381  2.296
# fc13-n 3.176  2.742  2.037  3.310  2.730  2.899  2.447  1.982  1.894

For comparison, the ifort -O time was 1.790s, which means gfortran is
only 6% slower!
My original table revised after adding -march=native:

                 absolute  relative
  ifort 11.1.072   1.790s    1.00
  gfortran 4.4.4   1.894s    1.06
  g++ 4.4.4        2.772s    1.55

Ralf


Re: food for optimizer developers

2010-08-12 Thread Steve Kargl
On Thu, Aug 12, 2010 at 08:47:34PM +0200, Toon Moene wrote:
> Steve Kargl wrote:
> 
> ># gfc4x   9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022
> 
> Column 5 compiled with -march=native -O2 -ffast-math
> 
> ># Note 1:  STOP DLAMC1 failure (10)
> 
> That's probably because a standard compile of the LAPACK sources only 
> compiles {S|D}LAM* with -O0.
> 
> The code is simply not written for any higher optimization (i.e., it 
> assumes the compiler more or less compiles it "literally").
> 

Your observation reinforces the notion that doing
benchmarks properly is difficult.  I forgot about
the lapack inquiry routines.  One would think that,
some 20+ years after F90, Dongarra and colleagues
would use the intrinsic numeric inquiry functions.
Although the accumulated time is small, DLAMCH() is
called 2642428 times during execution.  Everything
returned by DLAMCH() can be reduced to a
compile-time constant.

-- 
Steve


Re: food for optimizer developers

2010-08-12 Thread Toon Moene

Steve Kargl wrote:


# gfc4x   9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022


Column 5 compiled with -march=native -O2 -ffast-math


# Note 1:  STOP DLAMC1 failure (10)


That's probably because a standard compile of the LAPACK sources only 
compiles {S|D}LAM* with -O0.


The code is simply not written for any higher optimization (i.e., it 
assumes the compiler more or less compiles it "literally").


Cheers,

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/gcc-4.5/changes.html#Fortran


Re: food for optimizer developers

2010-08-12 Thread Steve Kargl
On Thu, Aug 12, 2010 at 09:51:42AM +0200, Steven Bosscher wrote:
> On Thu, Aug 12, 2010 at 8:46 AM, Ralf W. Grosse-Kunstleve
>  wrote:
> > Hi Vladimir,
> >
> > Thanks for the feedback! Very interesting.
> >
> >
> >> Intel optimization compiler team (besides researchers) is much bigger than
> >>whole GCC community.
> >
> > That's a surprise to me. I have to say that the GCC community has done
> > amazing work, as you came within a factor of 1.4 (gfortran) and 1.6
> > (g++ compiling converted code) of ifort performance, which is close
> > enough for our purposes, and I think for those of many people.
> 
> Well, I think a ratio of gfortran/ifort=1.4 isn't so great, really. If
> you look at one of the popular Fortran benchmarks (Polyhedron,
> http://www.polyhedron.com/pb05-linux-f90bench_p40html), the ratio was
> less than 1.2 for gfortran 4.3 vs. ifort 11 on an Intel iCore7.
> 
> Can you tell how you obtained the performance numbers you are using?
> There may be a few compiler flags you could add to reduce that ratio
> of 1.4 to something better.
> 

Without knowing the compiler options, the results of any benchmark
are meaningless.  For various versions of gfortran, I find the
following average of 5 executions in seconds:

#   A  B  C  D  E  F  G  H
# gfc43   9.808  9.374  9.314  9.832  9.620  9.526  9.022  9.156
# gfc44   9.806  9.440  9.222  9.810  9.414  9.320  8.980  9.152
# gfc45   9.672  9.530  9.250  9.744  9.400  9.204  8.960  8.992
# gfc4x   9.814  9.358  8.622  9.810  Note1  9.172  8.958  9.022
#
# A = -march=native -O
# B = -march=native -O2
# C = -march=native -O3
# D = -march=native -O  -ffast-math
# E = -march=native -O2 -ffast-math
# F = -march=native -O  -funroll-loops
# G = -march=native -O2 -funroll-loops
# H = -march=native -O3 -funroll-loops
#
# Note 1:  STOP DLAMC1 failure (10)
#
# gfc43 --> 4.3.6 20100728 (prerelease)
# gfc44 --> 4.4.5 20100728 (prerelease)
# gfc45 --> 4.5.1 20100728 (prerelease)
# gfc4x --> 4.6.0 20100810 (experimental)
#

I'll note that G is my normal FFLAGS setting along with
-ftree-vectorize.  Columns D and E above highlight why
I consider -ffast-math to be an evil option.  For this
benchmark the math is neither fast nor particularly
safe with gfc4x.

-- 
Steve


Re: food for optimizer developers

2010-08-12 Thread David Brown

On 11/08/2010 23:04, Vladimir Makarov wrote:

On 08/10/2010 09:51 PM, Ralf W. Grosse-Kunstleve wrote:

I wrote a Fortran to C++ conversion program that I used to convert
selected
LAPACK sources. Comparing runtimes with different compilers I get:

absolute relative
ifort 11.1.072 1.790s 1.00
gfortran 4.4.4 2.470s 1.38
g++ 4.4.4 2.922s 1.63


To get a full picture, it would be nice to see icc times too.

This is under Fedora 13, 64-bit, 12-core Opteron 2.2GHz

All files needed to easily reproduce the results are here:

http://cci.lbl.gov/lapack_fem/

See the README file or the example commands below.

Questions:

- Is there a way to make the g++ version as fast as ifort?



I think it is more important (and harder) to make gfortran closer to ifort.

I cannot speak to your fragment of LAPACK. But about 15 years ago I
worked on manual LAPACK optimization for an Alpha processor. As I
remember, LAPACK is quite a memory-bound benchmark. The hottest spot was
matrix multiplication, which is used in many places in LAPACK. The matrix
multiplication in LAPACK is already moderately optimized by using a
temporary variable, and that makes it 1.5 times faster (if the cache is
not big enough to hold the matrices) than the naive algorithm. But proper
loop optimizations (mostly tiling) could improve it by more than 4 times.

So I guess and hope the graphite project will finally improve LAPACK by
implementing tiling.

After solving the memory-bound problem, loop vectorization is another
important optimization which could improve LAPACK. Unfortunately, GCC
vectorizes fewer loops than ifort (about half as many, when I last
checked). I did not analyze the reason for this.

After solving the vectorization problem, another important lower-level
loop optimization is modulo scheduling (even though modern x86/x86_64
processors are out-of-order), because OOO processors can look only
through a few branches. And as I remember, the Intel compiler does apply
modulo scheduling frequently. GCC's modulo scheduling is quite
constrained.

Those are my thoughts, but I might be wrong because I have no time to
confirm my speculations. If you really want to help the GCC developers,
you could do a comparative analysis of the code generated by ifort and
gfortran and find which optimizations GCC misses. GCC has few resources,
and the developers who could solve the problems are very busy. Intel's
optimizing compiler team (besides researchers) is much bigger than the
whole GCC community. Taking this into account, and that they have much
more info about their processors, I don't think gfortran will generate
better or equal code for floating point benchmarks in the near future.



This is a little out of my league (being neither a FORTRAN programmer 
nor a gcc developer).


However, I note that in the code translated from Fortran to C++, the 
two-dimensional array accesses are all changed into manual address 
calculations done as integer arithmetic.  My understanding of the 
vectorisation, loop optimisation and more advanced code transformations 
from graphite is that they work best when given standard C array 
constructs.  This gives the compiler the most information, and thus it 
can generate the best code.
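As a concrete (hypothetical) illustration of this point, the two functions below perform the same column update. The first mirrors the converter's flat-buffer output with manual index arithmetic; the second hoists the column base so the inner loop becomes a plain unit-stride walk, the easiest shape for the vectorizer. Whether a given GCC version actually treats them differently would need measuring; this is only a sketch:

```cpp
#include <cstddef>

// Flat buffer plus manual index arithmetic, mirroring the converter's
// output: the address of a(i,j) is recomputed from i and j on each
// iteration.
void scale_col_flat(double* a, std::size_t lda, std::size_t j,
                    std::size_t m, double c) {
    for (std::size_t i = 0; i < m; ++i)
        a[i + j * lda] *= c;
}

// The same update with the column base hoisted once; the inner loop is a
// unit-stride walk over col[0..m).
void scale_col_hoisted(double* a, std::size_t lda, std::size_t j,
                       std::size_t m, double c) {
    double* col = a + j * lda;
    for (std::size_t i = 0; i < m; ++i)
        col[i] *= c;
}
```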







Re: food for optimizer developers

2010-08-12 Thread Steven Bosscher
On Thu, Aug 12, 2010 at 8:46 AM, Ralf W. Grosse-Kunstleve
 wrote:
> Hi Vladimir,
>
> Thanks for the feedback! Very interesting.
>
>
>> Intel optimization compiler team (besides researchers) is much bigger than
>>whole GCC community.
>
> That's a surprise to me. I have to say that the GCC community has done
> amazing work, as you came within a factor of 1.4 (gfortran) and 1.6
> (g++ compiling converted code) of ifort performance, which is close
> enough for our purposes, and I think for those of many people.

Well, I think a ratio of gfortran/ifort=1.4 isn't so great, really. If
you look at one of the popular Fortran benchmarks (Polyhedron,
http://www.polyhedron.com/pb05-linux-f90bench_p40html), the ratio was
less than 1.2 for gfortran 4.3 vs. ifort 11 on an Intel iCore7.

Can you tell how you obtained the performance numbers you are using?
There may be a few compiler flags you could add to reduce that ratio
of 1.4 to something better.

Ciao!
Steven


Re: food for optimizer developers

2010-08-11 Thread Ralf W. Grosse-Kunstleve
Hi Vladimir,

Thanks for the feedback! Very interesting.


> Intel optimization compiler team (besides researchers) is much bigger than 
>whole GCC community.

That's a surprise to me. I have to say that the GCC community has done
amazing work, as you came within a factor of 1.4 (gfortran) and 1.6 (g++
compiling converted code) of ifort performance, which is close enough for
our purposes, and I think for those of many people. To add to this, icpc
vs. g++ is a tie overall, with g++ even having a slight advantage. Really
great work!

Ralf


Re: food for optimizer developers

2010-08-11 Thread Ralf W. Grosse-Kunstleve
Hi Richard,

> How about using an automatic converter to arrange for C++ code to
> call into the generated Fortran code instead?  Create nice classes
> and wrappers and such, but in the end arrange for the Fortran code
> to be called to do the real work.

I found it very labor intensive to maintain a mixed Fortran/C++ build
system. I'd rather take the speed hit than deal with the constant
trickle of problems arising from non-existent or incompatible Fortran
compilers.

We distribute a pretty large system in source form to users (biologists)
who sometimes don't even know what a command-line prompt is.
If installation doesn't work out of the box, a large fraction
of our users simply give up.

Ralf


Re: food for optimizer developers

2010-08-11 Thread Vladimir Makarov

 On 08/10/2010 09:51 PM, Ralf W. Grosse-Kunstleve wrote:

I wrote a Fortran to C++ conversion program that I used to convert selected
LAPACK sources. Comparing runtimes with different compilers I get:

                 absolute  relative
ifort 11.1.072     1.790s    1.00
gfortran 4.4.4     2.470s    1.38
g++ 4.4.4          2.922s    1.63


To get a full picture, it would be nice to see icc times too.

This is under Fedora 13, 64-bit, 12-core Opteron 2.2GHz

All files needed to easily reproduce the results are here:

   http://cci.lbl.gov/lapack_fem/

See the README file or the example commands below.

Questions:

- Is there a way to make the g++ version as fast as ifort?



I think it is more important (and harder) to make gfortran closer to ifort.

I cannot speak to your fragment of LAPACK.  But about 15 years ago I 
worked on manual LAPACK optimization for an Alpha processor.  As I 
remember, LAPACK is quite a memory-bound benchmark.  The hottest spot was 
matrix multiplication, which is used in many places in LAPACK.  The matrix 
multiplication in LAPACK is already moderately optimized by using a 
temporary variable, and that makes it 1.5 times faster (if the cache is 
not big enough to hold the matrices) than the naive algorithm.  But proper 
loop optimizations (mostly tiling) could improve it by more than 4 times.


So I guess and hope the graphite project will finally improve LAPACK by 
implementing tiling.


After solving the memory-bound problem, loop vectorization is another 
important optimization which could improve LAPACK.  Unfortunately, GCC 
vectorizes fewer loops than ifort (about half as many, when I last 
checked).  I did not analyze the reason for this.


After solving the vectorization problem, another important lower-level 
loop optimization is modulo scheduling (even though modern x86/x86_64 
processors are out-of-order), because OOO processors can look only 
through a few branches.  And as I remember, the Intel compiler does apply 
modulo scheduling frequently.  GCC's modulo scheduling is quite 
constrained.


Those are my thoughts, but I might be wrong because I have no time to 
confirm my speculations.  If you really want to help the GCC developers, 
you could do a comparative analysis of the code generated by ifort and 
gfortran and find which optimizations GCC misses.  GCC has few resources, 
and the developers who could solve the problems are very busy.  Intel's 
optimizing compiler team (besides researchers) is much bigger than the 
whole GCC community.  Taking this into account, and that they have much 
more info about their processors, I don't think gfortran will generate 
better or equal code for floating point benchmarks in the near future.
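The tiling Vladimir describes can be sketched as follows: a minimal, untuned C++ example of blocked matrix multiplication on column-major storage as LAPACK uses it. The block size is an assumption that would need tuning per cache; Graphite's goal is to derive this kind of blocking automatically from the naive triple loop.

```cpp
#include <algorithm>
#include <cstddef>

// Blocked (tiled) C += A*B for column-major n x n matrices.  Tiling keeps
// a small working set of each matrix in cache across the inner loops; the
// hoisted scalar bkj plays the role of the "temporary variable" trick
// already present in the reference LAPACK/BLAS code.
void matmul_tiled(const double* A, const double* B, double* C, std::size_t n) {
    const std::size_t BLK = 64;  // tile edge, an assumption; tune per target
    for (std::size_t jj = 0; jj < n; jj += BLK)
        for (std::size_t kk = 0; kk < n; kk += BLK)
            for (std::size_t j = jj; j < std::min(jj + BLK, n); ++j)
                for (std::size_t k = kk; k < std::min(kk + BLK, n); ++k) {
                    const double bkj = B[k + j * n];  // hoisted temporary
                    for (std::size_t i = 0; i < n; ++i)
                        C[i + j * n] += A[i + k * n] * bkj;
                }
}
```

For matrices that fit in cache the blocked and naive versions run at about the same speed; the payoff appears once the matrices exceed the cache, which is exactly the case Vladimir mentions.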




Re: food for optimizer developers

2010-08-11 Thread Richard Henderson
On 08/11/2010 10:59 AM, Ralf W. Grosse-Kunstleve wrote:
> My original posting shows that gfortran and g++ don't do as good
> a job as ifort in generating efficient machine code. Note that the
> loss going from gfortran to g++ isn't as bad as going from ifort to
> gfortran. This gives me hope that the gcc developers could work over
> time towards bringing the performance of the g++-generated code
> closer to the original ifort performance.

While of course there's room for g++ to improve, I think it's more
likely that gfortran can improve to meet ifort.

The biggest issue, from the compiler writer's perspective, is that
the Fortran language provides more information to the optimizers 
than the C++ language does.  A Really Good compiler will probably
always be able to do better with Fortran than with C++.

> I think speed will be the major argument against using the C++ code
> generated by the automatic converter.

How about using an automatic converter to arrange for C++ code to
call into the generated Fortran code instead?  Create nice classes
and wrappers and such, but in the end arrange for the Fortran code
to be called to do the real work.


r~


Re: food for optimizer developers

2010-08-11 Thread Ralf W. Grosse-Kunstleve
Hi Tim,

> Do you mean you are adding an additional level of functions and hoping
> for efficient in-lining?

Note that my questions arise in the context of automatic code generation:
  http://cci.lbl.gov/fable
Please compare e.g. the original LAPACK code with the generated C++ code
to see why the C++ code is done the way it is.

A goal more important than speed is that the auto-generated C++ code
is similar to the original Fortran code and not inflated/obfuscated by
constructs meant to cater to optimizers (which change over time anyway).

My original posting shows that gfortran and g++ don't do as good
a job as ifort in generating efficient machine code. Note that the
loss going from gfortran to g++ isn't as bad as going from ifort to
gfortran. This gives me hope that the gcc developers could work over
time towards bringing the performance of the g++-generated code
closer to the original ifort performance.

I think speed will be the major argument against using the C++ code
generated by the automatic converter. If the generated C++ code could somehow
be made to run nearly as fast as the original Fortran compiled with ifort
there wouldn't be any good reason anymore to still develop in Fortran,
or to bother with the complexities of mixing languages.

Ralf


Re: food for optimizer developers

2010-08-10 Thread Tim Prince

On 8/10/2010 9:21 PM, Ralf W. Grosse-Kunstleve wrote:

Most of the time is spent in this function...

void
dlasr(
   str_cref side,
   str_cref pivot,
   str_cref direct,
   int const&  m,
   int const&  n,
   arr_cref<double> c,
   arr_cref<double> s,
   arr_ref<double, 2> a,
   int const&  lda)

in this loop:

 FEM_DOSTEP(j, n - 1, 1, -1) {
   ctemp = c(j);
   stemp = s(j);
   if ((ctemp != one) || (stemp != zero)) {
 FEM_DO(i, 1, m) {
   temp = a(i, j + 1);
   a(i, j + 1) = ctemp * temp - stemp * a(i, j);
   a(i, j) = stemp * temp + ctemp * a(i, j);
 }
   }
 }

a(i, j) is implemented as

   T* elems_; // member

 T const&
 operator()(
   ssize_t i1,
   ssize_t i2) const
 {
   return elems_[dims_.index_1d(i1, i2)];
 }

with

   ssize_t all[Ndims]; // member
   ssize_t origin[Ndims]; // member

 size_t
 index_1d(
   ssize_t i1,
   ssize_t i2) const
 {
   return
   (i2 - origin[1]) * all[0]
 + (i1 - origin[0]);
 }

The array pointer is buried as elems_ member in the arr_ref<>  class template.
How can I apply __restrict in this case?

   
Do you mean you are adding an additional level of functions and hoping 
for efficient in-lining?   Your programming style is elusive, and your 
insistence on top posting will make this thread difficult to deal with.
The conditional inside the loop is likely even more difficult for C++ to 
optimize than for Fortran.  As already discussed, if you don't optimize 
otherwise, you will need __restrict to overcome aliasing concerns among 
a, c, and s.  If you want efficient C++, you will need a lot of hand 
optimization, and verification of the effect of each level of obscurity 
which you add.   How is this topic appropriate to the gcc mailing list?


--
Tim Prince



Re: food for optimizer developers

2010-08-10 Thread Ralf W. Grosse-Kunstleve
Most of the time is spent in this function...

void
dlasr(
  str_cref side,
  str_cref pivot,
  str_cref direct,
  int const& m,
  int const& n,
  arr_cref<double> c,
  arr_cref<double> s,
  arr_ref<double, 2> a,
  int const& lda)

in this loop:

FEM_DOSTEP(j, n - 1, 1, -1) {
  ctemp = c(j);
  stemp = s(j);
  if ((ctemp != one) || (stemp != zero)) {
FEM_DO(i, 1, m) {
  temp = a(i, j + 1);
  a(i, j + 1) = ctemp * temp - stemp * a(i, j);
  a(i, j) = stemp * temp + ctemp * a(i, j);
}
  }
}

a(i, j) is implemented as

  T* elems_; // member

T const&
operator()(
  ssize_t i1,
  ssize_t i2) const
{
  return elems_[dims_.index_1d(i1, i2)];
}

with
  
  ssize_t all[Ndims]; // member
  ssize_t origin[Ndims]; // member

size_t
index_1d(
  ssize_t i1,
  ssize_t i2) const
{
  return
  (i2 - origin[1]) * all[0]
+ (i1 - origin[0]);
}

The array pointer is buried as elems_ member in the arr_ref<> class template.
How can I apply __restrict in this case?

Ralf




- Original Message 
From: Andrew Pinski 
To: Ralf W. Grosse-Kunstleve 
Cc: gcc@gcc.gnu.org
Sent: Tue, August 10, 2010 8:47:18 PM
Subject: Re: food for optimizer developers

On Tue, Aug 10, 2010 at 6:51 PM, Ralf W. Grosse-Kunstleve
 wrote:
> I wrote a Fortran to C++ conversion program that I used to convert selected
> LAPACK sources. Comparing runtimes with different compilers I get:
>
>                  absolute  relative
> ifort 11.1.072     1.790s    1.00
> gfortran 4.4.4     2.470s    1.38
> g++ 4.4.4          2.922s    1.63

I wonder if adding __restrict to some of the arguments of the
functions will help.  Fortran aliasing is so different from C
aliasing.

-- Pinski



Re: food for optimizer developers

2010-08-10 Thread Andrew Pinski
On Tue, Aug 10, 2010 at 6:51 PM, Ralf W. Grosse-Kunstleve
 wrote:
> I wrote a Fortran to C++ conversion program that I used to convert selected
> LAPACK sources. Comparing runtimes with different compilers I get:
>
>                         absolute  relative
> ifort 11.1.072            1.790s    1.00
> gfortran 4.4.4            2.470s    1.38
> g++ 4.4.4                 2.922s    1.63

I wonder if adding __restrict to some of the arguments of the
functions will help.  Fortran aliasing is so different from C
aliasing.

-- Pinski
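A hedged sketch of what Pinski's suggestion could look like for the hot dlasr loop quoted earlier in the thread, assuming the raw pointers can be extracted from the fable wrapper classes (the free function and the 0-based indexing are my rephrasing for illustration, not fable's actual API):

```cpp
// Restrict-qualified raw pointers promise the compiler that c, s and a do
// not alias, which lets it keep a(i,j) and a(i,j+1) in registers across
// the inner loop.  GCC spells the C99 qualifier __restrict in C++.
void rotate_columns(const double* __restrict c, const double* __restrict s,
                    double* __restrict a, long lda, long m, long n) {
    for (long j = n - 2; j >= 0; --j) {       // FEM_DOSTEP(j, n-1, 1, -1), 0-based
        const double ctemp = c[j];
        const double stemp = s[j];
        if (ctemp != 1.0 || stemp != 0.0) {
            for (long i = 0; i < m; ++i) {    // FEM_DO(i, 1, m), 0-based
                const double temp = a[i + (j + 1) * lda];
                a[i + (j + 1) * lda] = ctemp * temp - stemp * a[i + j * lda];
                a[i + j * lda]       = stemp * temp + ctemp * a[i + j * lda];
            }
        }
    }
}
```

The remaining question from Ralf's post still stands: since elems_ is private to arr_ref<>, the wrapper would need some accessor that exposes the raw pointer before this hoisting can be done.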