[Numpy-discussion] f2py performance question from a rookie
Hi all, This is a question on f2py. I am using a Crank Nicholson scheme to model a diffusion process and in the quest of some extra speed I started playing with f2py. However I do not seem to be able to get any significant boost in the performance of the code. In the following simple example I am building a tridiagonal array with plain python for loops and by calling a simple fortran subroutine that does the same thing. Here goes the python script: import numpy as np import time import learnf90_05 import sys run = sys.argv[1] try: dim = np.float(sys.argv[2]) except: dim = 500 elem = np.ones(dim) ### Python routine if run == "Python": t_python_i = time.time() array_python = np.zeros((dim, dim)) for row in np.arange(1,dim-1): array_python[row, row-1] = elem[row] array_python[row, row] = elem[row] array_python[row,row+1] = elem[row] t_python_f = time.time() python_time = t_python_f - t_python_i print("Python time: %0.5e" %(python_time)) ###fortran routine elif run == "Fortran": t_fortran_i = time.time() fortran_array = learnf90_05.test(j = dim, element = elem) t_fortran_f = time.time() fortran_time = t_fortran_f - t_fortran_i print("Fortran time: %0.5e" %(fortran_time)) And the fortran subroutine called test is here: subroutine test(j, element, a) integer, intent(in) :: j integer :: row, col real(kind = 8), intent(in) :: element(j) real(kind = 8), intent(out) :: a(j,j) do row=2,j-1 a(row,row-1) = element(row) a(row, row) = element(row) a(row, row+1) = element(row) enddo end The subroutine is compiled with Gfortran 4.2 on Osx 10.5.8. Running the python script with array sizes 300x300 and 3000x3000 gives for the python plain for loops: dhcp103:learnf vasilis$ python fill_array.py Python 3000 Python time: 1.56063e-01 dhcp103:learnf vasilis$ python fill_array.py Python 300 Python time: 4.82297e-03 and for the fortran subroutine: dhcp103:learnf vasilis$ python fill_array.py Fortran 3000 Fortran time: 1.16298e-01 dhcp103:learnf vasilis$ python fill_array.py Fortran 300 Fortran time: 8.83102e-04 It looks like the gain in performance is rather low compared to tests i have seen elsewhere. Am I missing something here..? Cheers...Vasilis -- Vasileios Gkinis PhD student Center for Ice and Climate Niels Bohr Institute Juliane Maries Vej 30 2100 Copenhagen Denmark email: v.gki...@nbi.ku.dk skype: vasgkin Tel: +45 353 25913 -- Vasileios Gkinis PhD student Center for Ice and Climate Niels Bohr Institute Juliane Maries Vej 30 2100 Copenhagen Denmark email: v.gki...@nbi.ku.dk skype: vasgkin Tel: +45 353 25913 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] f2py performance question from a rookie
On Mon, Aug 16, 2010 at 10:12 AM, Vasileios Gkinis v.gki...@nbi.ku.dkwrote: Hi all, This is a question on f2py. I am using a Crank Nicholson scheme to model a diffusion process and in the quest of some extra speed I started playing with f2py. However I do not seem to be able to get any significant boost in the performance of the code. Try adding -DF2PY_REPORT_ON_ARRAY_COPY to the f2py command line. This will cause f2py to report any array copies. If any of the types/ordering of the arrays don't match f2py will silently make a copy - this can really affect performance. Cheers Robin In the following simple example I am building a tridiagonal array with plain python for loops and by calling a simple fortran subroutine that does the same thing. Here goes the python script: import numpy as np import time import learnf90_05 import sys run = sys.argv[1] try: dim = np.float(sys.argv[2]) except: dim = 500 elem = np.ones(dim) ### Python routine if run == Python: t_python_i = time.time() array_python = np.zeros((dim, dim)) for row in np.arange(1,dim-1): array_python[row, row-1] = elem[row] array_python[row, row] = elem[row] array_python[row,row+1] = elem[row] t_python_f = time.time() python_time = t_python_f - t_python_i print(Python time: %0.5e %(python_time)) ###fortran routine elif run == Fortran: t_fortran_i = time.time() fortran_array = learnf90_05.test(j = dim, element = elem) t_fortran_f = time.time() fortran_time = t_fortran_f - t_fortran_i print(Fortran time: %0.5e %(fortran_time)) And the fortran subroutine called test is here: subroutine test(j, element, a) integer, intent(in) :: j integer :: row, col real(kind = 8), intent(in) :: element(j) real(kind = 8), intent(out) :: a(j,j) do row=2,j-1 a(row,row-1) = element(row) a(row, row) = element(row) a(row, row+1) = element(row) enddo end The subroutine is compiled with Gfortran 4.2 on Osx 10.5.8. Running the python script with array sizes 300x300 and 3000x3000 gives for the python plain for loops: dhcp103:learnf vasilis$ python fill_array.py Python 3000 Python time: 1.56063e-01 dhcp103:learnf vasilis$ python fill_array.py Python 300 Python time: 4.82297e-03 and for the fortran subroutine: dhcp103:learnf vasilis$ python fill_array.py Fortran 3000 Fortran time: 1.16298e-01 dhcp103:learnf vasilis$ python fill_array.py Fortran 300 Fortran time: 8.83102e-04 It looks like the gain in performance is rather low compared to tests i have seen elsewhere. Am I missing something here..? Cheers...Vasilis -- Vasileios Gkinis PhD student Center for Ice and Climate http://www.iceandclimate.nbi.ku.dk Niels Bohr Institute http://www.nbi.ku.dk/english/ Juliane Maries Vej 30 2100 Copenhagen Denmark email: v.gki...@nbi.ku.dk v.gki...@nbi.ku.dk? skype: vasgkin Tel: +45 353 25913 -- Vasileios Gkinis PhD student Center for Ice and Climate http://www.iceandclimate.nbi.ku.dk Niels Bohr Institute http://www.nbi.ku.dk/english/ Juliane Maries Vej 30 2100 Copenhagen Denmark email: v.gki...@nbi.ku.dk v.gki...@nbi.ku.dk? skype: vasgkin Tel: +45 353 25913 ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] f2py performance question from a rookie
font face=Andale MonoIt looks like the gain in performance is rather low compared to tests i have seen elsewhere.br br Am I missing something here..?br br /fontCheers...Vasilisbr Turn HTML off please. Use time.clock(), not time.time(). Try some tasks that actually takes a while. Tasks that take 10**-4 or 10**-3 seconds cannot be reliably timed on Windows or Linux. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] f2py performance question from a rookie
Sturla Molden sturla at molden.no writes: font face=Andale MonoIt looks like the gain in performance is rather low compared to tests i have seen elsewhere.br br Am I missing something here..?br br /fontCheers...Vasilisbr Turn HTML off please. Use time.clock(), not time.time(). Try some tasks that actually takes a while. Tasks that take 10**-4 or 10**-3 seconds cannot be reliably timed on Windows or Linux. Hi again, After using time.clock, running f2py with the REPORT_ON_ARRAY_COPY enabled and passing arrays as np.asfortranarray(array) to the fortran routines I still get a slow performance on f2py. No copied arrays are reported. Running on f2py with a 6499x6499 array takes about 1.2sec while the python for loop does the job slightly faster at 0.9 sec. Comparisons like this: http://www.scipy.org/PerformancePython indicate a 100x-1000x boost in performance with f2py when compared to conventional python for loops. Still quite puzzled...any help will be very much appreciated Regards Vasileios The actual fortran subroutine is here: subroutine fill_B(j, beta, lamda, mu, b) integer, intent(in) :: j integer :: row, col real(kind = 8), intent(in) :: beta(j), lamda(j), mu(j) real(kind = 8), intent(out) :: b(j,j) do row=2,j-1 do col=1,j if (col == row-1) then b(row,col) = beta(row) - lamda(row) + mu(row) elseif (col == row) then b(row,col) = 1 - 2*beta(row) elseif (col == row+1) then b(row, col) = beta(row) + lamda(row) - mu(row) else b(row, col) = 0 endif enddo enddo b(1,1) = 1 b(j,j) = 1 end and the python code that calls it together with the alternative implementation with conventional python for loops is here: def crank_heat(self, profile_i, go, g1, beta, lamda, mu, usef2py = False): ##Crank Nicolson AX = BD ##Make matrix A (coefs of n+1 step N = np.size(profile_i) print N t1 = time.clock() if usef2py == True: matrix_A = fill_A.fill_a(j = N, beta = beta, lamda = lamda, mu = mu) matrix_B = fill_B.fill_b(j = N, beta = beta, lamda = lamda, mu = mu) else: matrix_A = np.zeros((N,N)) matrix_A[0,0] = 1 matrix_A[-1,-1] = 1 for row in np.arange(1,N-1): matrix_A[row,row-1] = -(beta[row] - lamda[row] + mu[row]) matrix_A[row, row] = 1 + 2*beta[row] matrix_A[row, row+1] = -(beta[row] + lamda[row] - mu[row]) #make matrix B matrix_B = np.zeros((N,N)) matrix_B[0,0] = 1 matrix_B[-1,-1] = 1 for row in np.arange(1,N-1): matrix_B[row,row-1] = beta[row] - lamda[row] + mu[row] matrix_B[row, row] = 1 - 2*beta[row] matrix_B[row, row+1] = beta[row] + lamda[row] - mu[row] print(CN function time: %0.5e %(time.clock()-t1)) matrix_C = np.dot(matrix_B, profile_i) t1 = time.clock() matrix_X = np.linalg.solve(matrix_A, matrix_C) print(CN solve time: %0.5e %(time.clock()-t1)) matrix_X[0] = go matrix_X[-1] = g1 return matrix_X ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] f2py performance question from a rookie
After using time.clock, running f2py with the REPORT_ON_ARRAY_COPY enabled and passing arrays as np.asfortranarray(array) to the fortran routines I still get a slow performance on f2py. No copied arrays are reported. That is not any better as np.asfortranarray will make a copy instead. Pass the transpose to Fortran, and transpose the return value. Make sure you neither make a copy before or after calling Fortran. A transpose does not make a copy in NumPy. You want C ordering in NumPy and Fortran ordering in Fortran. Which means you probably want to call Fortran like this: foobar(x.T, y.T) If you let f2py return y as a result, make sure it is C ordered after transpose: y = foobar(x.T).T assert(y.flags['C_CONTIGUOUS']) Even if you don't get a copy and transpose immediately (i.e. f2py does not report a copy), it might happen silently later on if you get Fortran ordered arrays returned into Python. Running on f2py with a 6499x6499 array takes about 1.2sec while the python for loop does the job slightly faster at 0.9 sec. I suppose you are including the solve time here? In which case the dominating factors are np.dot and np.linalg.solve. Fortran will not help a lot then. Well if you have access to performance libraries (ACML or MKL) you might save a little by using Fortran DGEMM instead of np.dot and a carefully selected Fortran LAPACK solver (that is, np.linalg.solve use Gaussian substitution, i.e. LU, but there might be structure in your matrices to exploit, I haven't looked too carefully.) subroutine fill_B(j, beta, lamda, mu, b) integer, intent(in) :: j integer :: row, col real(kind = 8), intent(in) :: beta(j), lamda(j), mu(j) real(kind = 8), intent(out) :: b(j,j) do row=2,j-1 do col=1,j if (col == row-1) then b(row,col) = beta(row) - lamda(row) + mu(row) elseif (col == row) then b(row,col) = 1 - 2*beta(row) elseif (col == row+1) then b(row, col) = beta(row) + lamda(row) - mu(row) else b(row, col) = 0 endif enddo enddo b(1,1) = 1 b(j,j) = 1 end That iterates over strided memory in Fortran. Memory access is slow, computation is fast. So there are two problems with your Fortran code: 1. You want b to be transposed in Fortran compared to Python. 2. Iterate in column major order in Fortran, not row major order. Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion