Re: [Cython] Impressions from EuroSciPy 2010

Sturla Molden Wed, 14 Jul 2010 02:57:10 -0700

> On Tue, Jul 13, 2010 at 10:59 PM, Sturla Molden <[email protected]> wrote:


> Are slower than the 'do' version with explicit indexing -- that is to
> say, slower than they should be.  Apparently the compilers create
> temporary arrays, exactly analogous to how numpy does it (although
> many times faster and with more opportunities for optimization).
>
> Perhaps I was using an old compiler when I tested this, or didn't turn
> on the right compile flags.  Perhaps you can correct me on this.

I think you are right, at least for x86 with gfortran. But it's not only a
matter of speed. Fortran 95 arrays have all the niceness of NumPy arrays,
including broadcasting and slicing. Fortran also knows about complex
numbers, and has a richer standard library. I am willing to suffer some
performance loss to avoid the pain of writing C or Fortran 77.

Regarding performance of Fortran 90 array expressions:

Fortran 90/95 array expressions can give poorer or better performance
than similar do-loops, depending on hardware and compiler. You can see the
same with "forall" loops and "where" constructs. Fortran 90 was designed
for parallel computing. Array expressions, forall loops, and where
constructs therefore have some inherent "parallelism" required by the
standard. This will sometimes require a temporary array to be made, but
memory allocation and access can be slow on x86. An example would be
this:

a(2:n) = a(1:n-1) + a(2:n)

A do loop could use a couple of temporary variables kept in registers
instead:

r1 = a(1)
do i = 2,n
    r2 = a(i)
    a(i) = r1 + a(i)
    r1 = r2
end do

Obviously just putting r1 and r2 in registers would be faster than
allocating a temporary array when using x86, without any other
optimization.
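
To check that the two forms really compute the same thing, here is a
minimal, self-contained sketch. The program name, the size n = 8 and the
array names a_expr and a_loop are assumptions made up for illustration.
Whether a given compiler actually allocates a temporary for the array
expression is not something this program can show; it only confirms that
the results agree.

```fortran
program shift_add
    implicit none
    integer, parameter :: n = 8        ! arbitrary size, illustration only
    real :: a_expr(n), a_loop(n)
    real :: r1, r2
    integer :: i

    ! Same starting data for both versions: 1, 2, ..., n
    a_expr = (/ (real(i), i = 1, n) /)
    a_loop = a_expr

    ! Array expression: the right-hand side is evaluated as if in full
    ! before any element is stored, so a compiler may build a temporary.
    a_expr(2:n) = a_expr(1:n-1) + a_expr(2:n)

    ! Do loop: r1 carries the old value of the previous element.
    r1 = a_loop(1)
    do i = 2, n
        r2 = a_loop(i)
        a_loop(i) = r1 + r2
        r1 = r2
    end do

    ! The values are small integers, so exact comparison is safe here.
    if (any(a_expr /= a_loop)) then
        print *, 'results differ'
    else
        print *, 'results agree'
    end if
end program shift_add
```

Timing the two assignments on larger arrays, with the compiler and flags
you actually use, is the only way to see which one wins on your machine.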

Some f95 compilers (Absoft and Intel) are good at auto-vectorizing array
statements to SIMD or multithreaded code. Sometimes an array expression
can be easier to auto-vectorize. Using a couple of temporary variables
means we must run the code sequentially. So maximum performance on x86
could require manual partial loop unrolling, sort of making the code fit
the hardware.
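
As a rough sketch of what manual partial unrolling could look like for the
example above, the loop below handles two elements per iteration and
finishes with a remainder step. The names unroll_demo and
shift_add_unrolled, the unroll factor of 2 and the test sizes are all
assumptions for illustration. Whether this helps at all depends on the
compiler and hardware, and many compilers will unroll on their own.

```fortran
program unroll_demo
    implicit none
    integer, parameter :: nmax = 9     ! arbitrary upper size for the check
    real :: ref(nmax), a(nmax)
    integer :: n, i

    ! Odd and even lengths both exercise the remainder step.
    do n = 1, nmax
        ref(1:n) = (/ (real(i*i), i = 1, n) /)
        a(1:n) = ref(1:n)
        ref(2:n) = ref(1:n-1) + ref(2:n)       ! reference: array expression
        call shift_add_unrolled(a, n)
        if (any(a(1:n) /= ref(1:n))) print *, 'mismatch for n =', n
    end do
    print *, 'check finished'

contains

    ! Hypothetical helper: x(2:m) = x(1:m-1) + x(2:m), unrolled by two.
    subroutine shift_add_unrolled(x, m)
        real, intent(inout) :: x(:)
        integer, intent(in) :: m
        real :: r1, r2
        integer :: k

        if (m < 2) return
        r1 = x(1)
        do k = 2, m - 1, 2
            r2 = x(k)              ! old x(k)
            x(k) = r1 + r2
            r1 = x(k+1)            ! old x(k+1)
            x(k+1) = r2 + r1
        end do
        ! When m is even, one element is left over after the pairs.
        if (mod(m, 2) == 0) x(m) = r1 + x(m)
    end subroutine shift_add_unrolled

end program unroll_demo
```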

On older vector computers like the Cray, a temporary array would probably
be much faster than a do loop, as it could multiply two arrays in one
operation. They were SIMD monsters.

On modern GPUs a temporary array solution could be faster, due to the
massive multi-threading they have inside, and because memory is faster.

The number of temporary variables matters too. If you have a RISC
processor like PPC, there are a lot of registers available, so a solution
with many temporary variables and a do-loop could be very efficient. On an
x86, by contrast, there are few registers available, which means a
temporary array could be faster if temporaries get too plentiful, but
still slower than using a do loop with few temporaries.

So there is an interaction between code, hardware and compiler. But at
least with gfortran on x86, I expect do-loops to generally perform better
than array expressions, forall and where.
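
For readers who have not used them, here is a small sketch of the three
constructs side by side. The program name, the array names and the
threshold 5.0 are assumptions chosen for illustration. The forall statement
has the same "evaluate every right-hand side first" semantics as the array
expression, which is why both may need a temporary.

```fortran
program constructs_demo
    implicit none
    integer, parameter :: n = 6        ! arbitrary size, illustration only
    real :: a(n), b(n), c(n)
    integer :: i

    a = (/ (real(i), i = 1, n) /)
    b = a

    ! Array expression.
    a(2:n) = a(1:n-1) + a(2:n)

    ! forall: all right-hand sides are evaluated before any assignment,
    ! so this is not a running sum; it matches the array expression.
    forall (i = 2:n) b(i) = b(i-1) + b(i)

    ! where: masked elementwise assignment.
    where (a > 5.0)
        c = sqrt(a)
    elsewhere
        c = 0.0
    end where

    if (all(a == b)) print *, 'forall matches the array expression'
    print *, c
end program constructs_demo
```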

And regarding speed of Fortran vs. C and compilers:

It used to be the case that Fortran was "faster than C". But that was
before C and C++ compilers became alias-analysis champions and Fortran 95
was adopted. I often see gcc produce code that runs twice as fast as
gfortran, whereas g77 used to beat gcc without problems 10 years ago. With
Intel or Absoft compilers, Fortran still gives the better results. When
using Fortran 95 and GNU compilers, I am not surprised if C code would run
twice as fast on x86. gfortran is fairly good, but not the most aggressive
compiler on the market. Intel ifort is probably the best, but very
expensive. Absoft can be a good compromise between quality and price.

Sturla


_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
