subject:"\[Numpy\-discussion\] f2py performance question from a rookie"

[Numpy-discussion] f2py performance question from a rookie

2010-08-16 Thread Vasileios Gkinis



  
  
 Hi all,

This is a question on f2py.
I am using a Crank Nicholson scheme to model a diffusion process
and in the quest of some extra speed I started playing with
f2py.
However I do not seem to be able to get any significant boost in
the performance of the code.

In the following simple example I am building a tridiagonal
array with plain python for loops and by calling a simple
fortran subroutine that does the same thing.

Here goes the python script:

import numpy as np
import time
import learnf90_05
import sys

run = sys.argv[1]

try:
 dim = np.float(sys.argv[2])
except:
 dim = 500
elem = np.ones(dim)

### Python routine

if run == "Python":
 t_python_i = time.time()
 array_python = np.zeros((dim, dim))
 for row in np.arange(1,dim-1):
 array_python[row, row-1] = elem[row]
 array_python[row, row] = elem[row]
 array_python[row,row+1] = elem[row]

 t_python_f = time.time()
 python_time = t_python_f - t_python_i
 print("Python time: %0.5e" %(python_time))

###fortran routine

elif run == "Fortran":
 t_fortran_i = time.time()
 fortran_array = learnf90_05.test(j = dim, element = elem)
 t_fortran_f = time.time()
 fortran_time = t_fortran_f - t_fortran_i
 print("Fortran time: %0.5e" %(fortran_time))

And the fortran subroutine called test is here: 

subroutine test(j, element, a)
integer, intent(in) :: j
integer :: row, col
real(kind = 8), intent(in) :: element(j)
real(kind = 8), intent(out) :: a(j,j)

do row=2,j-1
 a(row,row-1) = element(row)
 a(row, row) = element(row)
 a(row, row+1) = element(row)
enddo

end

The subroutine is compiled with Gfortran 4.2 on Osx 10.5.8. 

Running the python script with array sizes 300x300 and 3000x3000
gives for the python plain for loops:

dhcp103:learnf vasilis$ python fill_array.py Python 3000
Python time: 1.56063e-01
dhcp103:learnf vasilis$ python fill_array.py Python 300
Python time: 4.82297e-03

and for the fortran subroutine:

  dhcp103:learnf vasilis$ python
fill_array.py Fortran 3000
Fortran time: 1.16298e-01
  dhcp103:learnf vasilis$ python
fill_array.py Fortran 300
Fortran time: 8.83102e-04
  
  It looks like the gain in performance is
rather low compared to tests i have seen elsewhere.

Am I missing something here..?

  Cheers...Vasilis
  
  
  
  -- 

  
  Vasileios Gkinis PhD student
   Center for Ice
and Climate
   Niels Bohr Institute
  Juliane Maries Vej 30
  2100 Copenhagen
  Denmark
  

  
email:

v.gki...@nbi.ku.dk
  
  
skype:
vasgkin
  
  
Tel:
+45 353 25913
  

  
  


-- 
  

Vasileios Gkinis PhD student

  Center for Ice and Climate

  Niels Bohr Institute
Juliane Maries Vej 30
2100 Copenhagen
Denmark

  

  email:
  
  v.gki...@nbi.ku.dk


  skype:
  vasgkin


  Tel:
  +45 353 25913

  


  
  

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] f2py performance question from a rookie

2010-08-16 Thread Robin

On Mon, Aug 16, 2010 at 10:12 AM, Vasileios Gkinis v.gki...@nbi.ku.dkwrote:

  Hi all,

 This is a question on f2py.
 I am using a Crank Nicholson scheme to model a diffusion process and in the
 quest of some extra speed I started playing with f2py.
 However I do not seem to be able to get any significant boost in the
 performance of the code.


Try adding
 -DF2PY_REPORT_ON_ARRAY_COPY
to the f2py command line.

This will cause f2py to report any array copies. If any of the
types/ordering of the arrays don't match f2py will silently make a copy -
this can really affect performance.

Cheers

Robin



 In the following simple example I am building a tridiagonal array with
 plain python for loops and by calling a simple fortran subroutine that does
 the same thing.

 Here goes the python script:

 import numpy as np
 import time
 import learnf90_05
 import sys

 run = sys.argv[1]

 try:
 dim = np.float(sys.argv[2])
 except:
 dim = 500
 elem = np.ones(dim)

 ### Python routine

 if run == Python:
 t_python_i = time.time()
 array_python = np.zeros((dim, dim))
 for row in np.arange(1,dim-1):
 array_python[row, row-1] = elem[row]
 array_python[row, row] = elem[row]
 array_python[row,row+1] = elem[row]

 t_python_f = time.time()
 python_time = t_python_f - t_python_i
 print(Python time: %0.5e %(python_time))

 ###fortran routine

 elif run == Fortran:
 t_fortran_i = time.time()
 fortran_array = learnf90_05.test(j = dim, element = elem)
 t_fortran_f = time.time()
 fortran_time = t_fortran_f - t_fortran_i
 print(Fortran time: %0.5e %(fortran_time))

 And the fortran subroutine called test is here:

 subroutine test(j, element, a)
 integer, intent(in) ::  j
 integer :: row, col
 real(kind = 8),  intent(in) :: element(j)
 real(kind = 8), intent(out) :: a(j,j)

 do row=2,j-1
 a(row,row-1) = element(row)
 a(row, row) = element(row)
 a(row, row+1) = element(row)
 enddo

 end

 The subroutine is compiled with Gfortran 4.2 on Osx 10.5.8.

 Running the python script with array sizes 300x300 and 3000x3000 gives for
 the python plain for loops:

 dhcp103:learnf vasilis$ python fill_array.py Python 3000
 Python time: 1.56063e-01
 dhcp103:learnf vasilis$ python fill_array.py Python 300
 Python time: 4.82297e-03

 and for the fortran subroutine:

 dhcp103:learnf vasilis$ python fill_array.py Fortran 3000
 Fortran time: 1.16298e-01
 dhcp103:learnf vasilis$ python fill_array.py Fortran 300
 Fortran time: 8.83102e-04

 It looks like the gain in performance is rather low compared to tests i
 have seen elsewhere.

 Am I missing something here..?

 Cheers...Vasilis


  --
  
 Vasileios Gkinis PhD student
  Center for Ice and Climate http://www.iceandclimate.nbi.ku.dk
  Niels Bohr Institute http://www.nbi.ku.dk/english/
 Juliane Maries Vej 30
 2100 Copenhagen
 Denmark
   email: v.gki...@nbi.ku.dk v.gki...@nbi.ku.dk?  skype: vasgkin  Tel: +45
 353 25913  
  --
  
 Vasileios Gkinis PhD student
  Center for Ice and Climate http://www.iceandclimate.nbi.ku.dk
  Niels Bohr Institute http://www.nbi.ku.dk/english/
 Juliane Maries Vej 30
 2100 Copenhagen
 Denmark
   email: v.gki...@nbi.ku.dk v.gki...@nbi.ku.dk?  skype: vasgkin  Tel: +45
 353 25913  

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] f2py performance question from a rookie

2010-08-16 Thread Sturla Molden


   font face=Andale MonoIt looks like the gain in performance is
 rather low compared to tests i have seen elsewhere.br
 br
 Am I missing something here..?br
 br
   /fontCheers...Vasilisbr

Turn HTML off please.

Use time.clock(), not time.time().

Try some tasks that actually takes a while. Tasks that take 10**-4 or
10**-3  seconds cannot be reliably timed on Windows or Linux.








___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] f2py performance question from a rookie

2010-08-16 Thread Vasileios Gkinis

Sturla Molden sturla at molden.no writes:

 
 
font face=Andale MonoIt looks like the gain in performance is
  rather low compared to tests i have seen elsewhere.br
  br
  Am I missing something here..?br
  br
/fontCheers...Vasilisbr
 
 Turn HTML off please.
 
 Use time.clock(), not time.time().
 
 Try some tasks that actually takes a while. Tasks that take 10**-4 or
 10**-3  seconds cannot be reliably timed on Windows or Linux.
 


Hi again,

After using time.clock, running f2py with the REPORT_ON_ARRAY_COPY enabled and
passing arrays as np.asfortranarray(array) to the fortran routines I still get a
slow performance on f2py. No copied arrays are reported.

Running on f2py with a 6499x6499 array takes about 1.2sec while the python for
loop does the job slightly faster at 0.9 sec.

Comparisons like this:
http://www.scipy.org/PerformancePython
indicate a 100x-1000x boost in performance with f2py when compared to
conventional python for loops.

Still quite puzzled...any help will be very much appreciated

Regards
Vasileios 

The actual fortran subroutine is here:

subroutine fill_B(j, beta, lamda, mu, b)
integer, intent(in) ::  j
integer :: row, col
real(kind = 8),  intent(in) :: beta(j), lamda(j), mu(j)
real(kind = 8), intent(out) :: b(j,j)

do row=2,j-1
do col=1,j
if (col == row-1) then
b(row,col) = beta(row) - lamda(row) + mu(row)
elseif (col == row) then
b(row,col) = 1 - 2*beta(row)
elseif (col == row+1) then
b(row, col) = beta(row) + lamda(row) - mu(row)
else
b(row, col) = 0
endif
enddo
enddo
b(1,1) = 1
b(j,j) = 1

end

and the python code that calls it together with the alternative implementation
with conventional python for loops is here:


def crank_heat(self, profile_i, go, g1, beta, lamda, mu, usef2py = False):




##Crank Nicolson AX = BD
##Make matrix A (coefs of n+1 step
N = np.size(profile_i)
print N
t1 = time.clock()

if usef2py == True:
matrix_A = fill_A.fill_a(j = N, beta = beta, lamda = lamda, mu = mu)
matrix_B = fill_B.fill_b(j = N, beta = beta, lamda = lamda, mu = mu)
else:
matrix_A = np.zeros((N,N))
matrix_A[0,0] = 1
matrix_A[-1,-1] = 1
for row in np.arange(1,N-1):
matrix_A[row,row-1] = -(beta[row] - lamda[row] + mu[row])
matrix_A[row, row] = 1 + 2*beta[row]
matrix_A[row, row+1] = -(beta[row] + lamda[row] - mu[row])


#make matrix B
matrix_B = np.zeros((N,N))
matrix_B[0,0] = 1
matrix_B[-1,-1] = 1
for row in np.arange(1,N-1):
matrix_B[row,row-1] = beta[row] - lamda[row] + mu[row]
matrix_B[row, row] = 1 - 2*beta[row]
matrix_B[row, row+1] = beta[row] + lamda[row] - mu[row]

print(CN function time: %0.5e %(time.clock()-t1))

matrix_C = np.dot(matrix_B, profile_i)
t1 = time.clock()
matrix_X = np.linalg.solve(matrix_A, matrix_C)
print(CN solve time: %0.5e %(time.clock()-t1))
matrix_X[0] = go
matrix_X[-1] = g1



return matrix_X





___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] f2py performance question from a rookie

2010-08-16 Thread Sturla Molden


 After using time.clock, running f2py with the REPORT_ON_ARRAY_COPY enabled
 and
 passing arrays as np.asfortranarray(array) to the fortran routines I still
 get a
 slow performance on f2py. No copied arrays are reported.

That is not any better as np.asfortranarray will make a copy instead. Pass
the transpose to Fortran, and transpose the return value. Make sure you
neither make a copy before or after calling Fortran. A transpose does not
make a copy in NumPy.

You want C ordering in NumPy and Fortran ordering in Fortran. Which means
you probably want to call Fortran like this:

foobar(x.T, y.T)

If you let f2py return y as a result, make sure it is C ordered after
transpose:

y = foobar(x.T).T
assert(y.flags['C_CONTIGUOUS'])

Even if you don't get a copy and transpose immediately (i.e. f2py does
not report a copy), it might happen silently later on if you get Fortran
ordered arrays returned into Python.


 Running on f2py with a 6499x6499 array takes about 1.2sec while the python
 for
 loop does the job slightly faster at 0.9 sec.

I suppose you are including the solve time here? In which case the
dominating factors are np.dot and np.linalg.solve. Fortran will not help a
lot then.

Well if you have access to performance libraries (ACML or MKL) you might
save a little by using Fortran DGEMM instead of np.dot and a carefully
selected Fortran LAPACK solver (that is, np.linalg.solve use Gaussian
substitution, i.e. LU, but there might be structure in your matrices to
exploit, I haven't looked too carefully.)


 subroutine fill_B(j, beta, lamda, mu, b)
 integer, intent(in) ::  j
 integer :: row, col
 real(kind = 8),  intent(in) :: beta(j), lamda(j), mu(j)
 real(kind = 8), intent(out) :: b(j,j)

 do row=2,j-1
 do col=1,j
 if (col == row-1) then
 b(row,col) = beta(row) - lamda(row) + mu(row)
 elseif (col == row) then
 b(row,col) = 1 - 2*beta(row)
 elseif (col == row+1) then
 b(row, col) = beta(row) + lamda(row) - mu(row)
 else
 b(row, col) = 0
 endif
 enddo
 enddo
 b(1,1) = 1
 b(j,j) = 1

 end

That iterates over strided memory in Fortran. Memory access is slow,
computation is fast.

So there are two problems with your Fortran code:

1. You want b to be transposed in Fortran compared to Python.
2. Iterate in column major order in Fortran, not row major order.


Sturla




___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

[Numpy-discussion] f2py performance question from a rookie

Re: [Numpy-discussion] f2py performance question from a rookie

Re: [Numpy-discussion] f2py performance question from a rookie

Re: [Numpy-discussion] f2py performance question from a rookie

Re: [Numpy-discussion] f2py performance question from a rookie

5 matches

Site Navigation

Mail list logo

Footer information