Re: Speed up suggestion for task #7807.

Edward d'Auvergne Wed, 11 Jun 2014 03:02:11 -0700

Hi,

Oh well, I can see you've now have an implementation (new = False)
that beats mine when clustered :)  You can use some of the ideas such
as the out ufunc argument and temporary storage to your advantage
nevertheless.  For example you can use the out argument of these
ufuncs even more, replacing:


"""
            self.dw_struct[:] = 1.0
            self.dw_struct[:] = multiply(self.dw_struct,
tile(asarray(dw).reshape(self.NE, self.NS)[:,:,None,None,None], (1, 1,
self.NM, self.NO, self.ND)), ) * self.disp_struct * self.frqs_a
"""


with:

"""
            self.dw_struct[:] = 1.0
            multiply(self.dw_struct, tile(asarray(dw).reshape(self.NE,
self.NS)[:,:,None,None,None], (1, 1, self.NM, self.NO, self.ND)),
self.dw_struct)
            multiply(self.dw_struct, self.disp_struct, self.dw_struct)
            multiply(self.dw_struct, self.frqs_a, self.dw_struct)
"""


That shaves off a few milliseconds by avoiding automatic array
creation and destruction, with before:

"""
('sfrq: ', 600000000.0, 'number of cpmg frq', 15, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.]))
('sfrq: ', 800000000.0, 'number of cpmg frq', 20, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.]))
('sfrq: ', 900000000.0, 'number of cpmg frq', 22, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.,  82.,  86.]))
('chi2 cluster:', 0.0)
Wed Jun 11 11:45:42 2014    /tmp/tmpwkhLSr

         198252 function calls (197150 primitive calls) in 1.499 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.499    1.499 <string>:1(<module>)
        1    0.001    0.001    1.499    1.499 profiling_cr72.py:449(cluster)
     1000    0.001    0.000    1.427    0.001 profiling_cr72.py:413(calc)
     1000    0.009    0.000    1.425    0.001 relax_disp.py:1020(func_CR72_full)
     1000    0.066    0.000    1.409    0.001 relax_disp.py:544(calc_CR72_chi2)
     1300    0.903    0.001    1.180    0.001 cr72.py:101(r2eff_CR72)
     2300    0.100    0.000    0.222    0.000 numeric.py:2056(allclose)
     3000    0.032    0.000    0.150    0.000 shape_base.py:761(tile)
     4000    0.104    0.000    0.104    0.000 {method 'repeat' of
'numpy.ndarray' objects}
    11828    0.091    0.000    0.091    0.000 {method 'reduce' of
'numpy.ufunc' objects}
        1    0.000    0.000    0.071    0.071 profiling_cr72.py:106(__init__)
        1    0.010    0.010    0.056    0.056
profiling_cr72.py:173(return_r2eff_arrays)
     1000    0.032    0.000    0.048    0.000 chi2.py:72(chi2_rankN)
     4609    0.005    0.000    0.045    0.000 fromnumeric.py:1762(any)
     2300    0.004    0.000    0.036    0.000 fromnumeric.py:1621(sum)
"""


And after:

"""
('sfrq: ', 600000000.0, 'number of cpmg frq', 15, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.]))
('sfrq: ', 800000000.0, 'number of cpmg frq', 20, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.]))
('sfrq: ', 900000000.0, 'number of cpmg frq', 22, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.,  82.,  86.]))
('chi2 cluster:', 0.0)
Wed Jun 11 11:49:29 2014    /tmp/tmpML9Lx5

         198252 function calls (197150 primitive calls) in 1.462 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.462    1.462 <string>:1(<module>)
        1    0.001    0.001    1.462    1.462 profiling_cr72.py:449(cluster)
     1000    0.001    0.000    1.393    0.001 profiling_cr72.py:413(calc)
     1000    0.009    0.000    1.392    0.001 relax_disp.py:1022(func_CR72_full)
     1000    0.056    0.000    1.376    0.001 relax_disp.py:544(calc_CR72_chi2)
     1300    0.887    0.001    1.158    0.001 cr72.py:101(r2eff_CR72)
     2300    0.097    0.000    0.217    0.000 numeric.py:2056(allclose)
     3000    0.031    0.000    0.148    0.000 shape_base.py:761(tile)
     4000    0.103    0.000    0.103    0.000 {method 'repeat' of
'numpy.ndarray' objects}
    11828    0.090    0.000    0.090    0.000 {method 'reduce' of
'numpy.ufunc' objects}
        1    0.000    0.000    0.068    0.068 profiling_cr72.py:106(__init__)
        1    0.010    0.010    0.053    0.053
profiling_cr72.py:173(return_r2eff_arrays)
     1000    0.031    0.000    0.047    0.000 chi2.py:72(chi2_rankN)
     4609    0.006    0.000    0.044    0.000 fromnumeric.py:1762(any)
     2300    0.004    0.000    0.036    0.000 fromnumeric.py:1621(sum)
"""


The additional suggestions I didn't specify before was to use these
ufuncs with the out argument in the lib.dispersion modules themselves.
You don't need to create R2eff here, just pack it into back_calc!

Regards,

Edward

On 11 June 2014 11:55, Troels Emtekær Linnet <[email protected]> wrote:
> Hi Edward.
>
> Some timings.
> Per spin, you have a faster method.
> But I win per cluster.
>
> 1000 iterations
> 1 / 100 spins
>
> Edward
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.000    0.000    0.523    0.523 <string>:1(<module>)
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.000    0.000    3.875    3.875 <string>:1(<module>)
>
> Troels Tile
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.000    0.000    0.563    0.563 <string>:1(<module>)
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.000    0.000    2.102    2.102 <string>:1(<module>)
>
> Troels Outer
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.000    0.000    0.546    0.546 <string>:1(<module>)
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>         1    0.000    0.000    1.974    1.974 <string>:1(<module>)
>
> 2014-06-11 11:46 GMT+02:00 Troels Emtekær Linnet <[email protected]>:
>> Hi Edward.
>>
>> This is a really god page!
>> http://docs.scipy.org/doc/numpy/reference/ufuncs.html
>>
>> ""
>> Tip
>> The optional output arguments can be used to help you save memory for
>> large calculations. If your arrays are large, complicated expressions
>> can take longer than absolutely necessary due to the creation and
>> (later) destruction of temporary calculation spaces. For example, the
>> expression G = a * b + c is equivalent to t1 = A * B; G = T1 + C; del
>> t1. It will be more quickly executed as G = A * B; add(G, C, G) which
>> is the same as G = A * B; G += C.
>> ""
>>
>> 2014-06-10 23:08 GMT+02:00 Edward d'Auvergne <[email protected]>:
>>> Note that masks and numpy.ma.multiply() and numpy.ma.add() may speed
>>> this up even more.  However due to overheads in the numpy masking,
>>> there is a chance that this also makes the dw and R20 data structure
>>> construction slower.
>>>
>>> Regards,
>>>
>>> Edward
>>>
>>>
>>>
>>> On 10 June 2014 22:36, Edward d'Auvergne <[email protected]> wrote:
>>>> Hi Troels,
>>>>
>>>> To make things even simpler, here is what needs to be done for R20,
>>>> R20A and R20B:
>>>>
>>>> """
>>>> from numpy import abs, add, array, float64, multiply, ones, sum, zeros
>>>>
>>>> # Init mimic.
>>>> #############
>>>>
>>>> # Values from Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.
>>>> NE = 1
>>>> NS = 2
>>>> NM = 2
>>>> NO = 1
>>>> ND = 8
>>>> R20A = array([  9.984626320294867,  11.495327724693091,
>>>> 12.991028416082928, 14.498419290021163])
>>>> shape = (NE, NS, NM, NO, ND)
>>>>
>>>> # Final structure for lib.dispersion.
>>>> R20A_struct = zeros(shape, float64)
>>>>
>>>> # Temporary storage to avoid memory allocations and garbage collection.
>>>> R20A_temp = zeros(shape, float64)
>>>>
>>>> # The structure for multiplication with R20A to piecewise build up the
>>>> full R20A structure.
>>>> R20A_mask = zeros((NS*NM,) + shape, float64)
>>>> for si in range(NS):
>>>>     for mi in range(NM):
>>>>         R20A_mask[si*NM+mi, :, si, mi] = 1.0
>>>> print(R20A_mask)
>>>> print("\n\n")
>>>>
>>>> # Values to be found (again taken directly from
>>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster - as a
>>>> printout of dw_frq_a).
>>>> R20A_final = array([[[[[  9.984626320294867,   9.984626320294867,
>>>> 9.984626320294867,
>>>>                           9.984626320294867,   9.984626320294867,
>>>> 9.984626320294867,
>>>>                           9.984626320294867,   9.984626320294867]],
>>>>
>>>>                       [[ 11.495327724693091,  11.495327724693091,
>>>> 11.495327724693091,
>>>>                          11.495327724693091,  11.495327724693091,
>>>> 11.495327724693091,
>>>>                          11.495327724693091,  11.495327724693091]]],
>>>>
>>>>
>>>>                      [[[ 12.991028416082928,  12.991028416082928,
>>>> 12.991028416082928,
>>>>                          12.991028416082928,  12.991028416082928,
>>>> 12.991028416082928,
>>>>                          12.991028416082928,  12.991028416082928]],
>>>>
>>>>                       [[ 14.498419290021163,  14.498419290021163,
>>>> 14.498419290021163,
>>>>                          14.498419290021163,  14.498419290021163,
>>>> 14.498419290021163,
>>>>                          14.498419290021163,  14.498419290021163]]]]])
>>>>
>>>>
>>>> # Target function.
>>>> ##################
>>>>
>>>> # Loop over the R20A elements (one per spin).
>>>> for r20_index in range(NS*NM):
>>>>     # First multiply the spin specific R20A with the spin specific
>>>> frequency mask, using temporary storage.
>>>>     multiply(R20A[r20_index], R20A_mask[r20_index], R20A_temp)
>>>>
>>>>     # The add to the total.
>>>>     add(R20A_struct, R20A_temp, R20A_struct)
>>>>
>>>> # Show that the structure is reproduced perfectly.
>>>> print(R20A_struct)
>>>> print(R20A_struct - R20A_final)
>>>> print(sum(abs(R20A_struct - R20A_final)))
>>>> """
>>>>
>>>>
>>>> You may notice one simplification compared to my previous example for
>>>> the dw parameter
>>>> (http://thread.gmane.org/gmane.science.nmr.relax.devel/6135/focus=6154).
>>>> The values here too come from the
>>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster system test.
>>>>
>>>> Regards,
>>>>
>>>> Edward
>>>>
>>>>
>>>> On 10 June 2014 21:31, Edward d'Auvergne <[email protected]> wrote:
>>>>> Hi Troels,
>>>>>
>>>>> No need for an example.  Here is the code to add to your
>>>>> infrastructure which will make the analytic dispersion models insanely
>>>>> fast:
>>>>>
>>>>>
>>>>> """
>>>>> from numpy import add, array, float64, multiply, ones, zeros
>>>>>
>>>>> # Init mimic.
>>>>> #############
>>>>>
>>>>> # Values from Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.
>>>>> NE = 1
>>>>> NS = 2
>>>>> NM = 2
>>>>> NO = 1
>>>>> ND = 8
>>>>> dw = array([ 1.847792726895652,  0.193719379085542])
>>>>> frqs = [-382.188861036982701, -318.479128911056137]
>>>>> shape = (NE, NS, NM, NO, ND)
>>>>>
>>>>> # Final structure for lib.dispersion.
>>>>> dw_struct = zeros(shape, float64)
>>>>>
>>>>> # Temporary storage to avoid memory allocations and garbage collection.
>>>>> dw_temp = zeros((NS,) + shape, float64)
>>>>>
>>>>> # The structure for multiplication with dw to piecewise build up the
>>>>> full dw structure.
>>>>> dw_mask = zeros((NS,) + shape, float64)
>>>>> for si in range(NS):
>>>>>     for mi in range(NM):
>>>>>         dw_mask[si, :, si, mi] = frqs[mi]
>>>>> print(dw_mask)
>>>>>
>>>>> # Values to be found (again taken directly from
>>>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster - as a
>>>>> printout of dw_frq_a).
>>>>> dw_final = array([[[[[-706.205797724669765, -706.205797724669765,
>>>>>                       -706.205797724669765, -706.205797724669765,
>>>>>                       -706.205797724669765, -706.205797724669765,
>>>>>                       -706.205797724669765, -706.205797724669765]],
>>>>>
>>>>>                     [[-588.483418069912318, -588.483418069912318,
>>>>>                       -588.483418069912318, -588.483418069912318,
>>>>>                       -588.483418069912318, -588.483418069912318,
>>>>>                       -588.483418069912318, -588.483418069912318]]],
>>>>>
>>>>>
>>>>>                    [[[ -74.03738885349469 ,  -74.03738885349469 ,
>>>>>                        -74.03738885349469 ,  -74.03738885349469 ,
>>>>>                        -74.03738885349469 ,  -74.03738885349469 ,
>>>>>                        -74.03738885349469 ,  -74.03738885349469 ]],
>>>>>
>>>>>                     [[ -61.69557910435401 ,  -61.69557910435401 ,
>>>>>                        -61.69557910435401 ,  -61.69557910435401 ,
>>>>>                        -61.69557910435401 ,  -61.69557910435401 ,
>>>>>                        -61.69557910435401 ,  -61.69557910435401 ]]]]])
>>>>>
>>>>>
>>>>> # Target function.
>>>>> ##################
>>>>>
>>>>> # Loop over the dw elements (one per spin).
>>>>> for si in range(NS):
>>>>>     # First multiply the spin specific dw with the spin specific
>>>>> frequency mask, using temporary storage.
>>>>>     multiply(dw[si], dw_mask[si], dw_temp[si])
>>>>>
>>>>>     # The add to the total.
>>>>>     add(dw_struct, dw_temp[si], dw_struct)
>>>>>
>>>>> # Show that the structure is reproduced perfectly.
>>>>> print(dw_struct - dw_final)
>>>>> """
>>>>>
>>>>> As mentioned in the comments, the structures come from the
>>>>> Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.  I just
>>>>> added a check of "if len(dw) > 1: asdfasd" to kill the test, and added
>>>>> printouts to obtain dw, frq_a, dw_frq_a, etc.  This is exactly the
>>>>> implementation I described.  Although there might be an even faster
>>>>> way, this will eliminate all numpy array creation and deletion via
>>>>> Python garbage collection in the target functions (when used for R20
>>>>> as well).
>>>>>
>>>>> Regards,
>>>>>
>>>>> Edward
>>>>>
>>>>> On 10 June 2014 21:09, Edward d'Auvergne <[email protected]> wrote:
>>>>>> If you have a really complicated example of your current 'dw_frq_a'
>>>>>> data structure for multiple spins and multiple fields, that could help
>>>>>> to construct an example.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Edward
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10 June 2014 20:57, Edward d'Auvergne <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'll have a look tomorrow but, as you've probably seen, some of the
>>>>>>> fine details such as indices to be used need to be sorted out when
>>>>>>> implementing this.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Edward
>>>>>>>
>>>>>>>
>>>>>>> On 10 June 2014 20:49, Troels Emtekær Linnet <[email protected]> 
>>>>>>> wrote:
>>>>>>>> What ever I do, I cannot get this to work?
>>>>>>>>
>>>>>>>> Can you show an example ?
>>>>>>>>
>>>>>>>> 2014-06-10 16:29 GMT+02:00 Edward d'Auvergne <[email protected]>:
>>>>>>>>> Here is an example of avoiding automatic numpy data structure creation
>>>>>>>>> and then garbage collection:
>>>>>>>>>
>>>>>>>>> """
>>>>>>>>> from numpy import add, ones, zeros
>>>>>>>>>
>>>>>>>>> a = zeros((5, 4))
>>>>>>>>> a[1] = 1
>>>>>>>>> a[:,1] = 2
>>>>>>>>>
>>>>>>>>> b = ones((5, 4))
>>>>>>>>>
>>>>>>>>> add(a, b, a)
>>>>>>>>> print(a)
>>>>>>>>> """
>>>>>>>>>
>>>>>>>>> The result is:
>>>>>>>>>
>>>>>>>>> [[ 1.  3.  1.  1.]
>>>>>>>>>  [ 2.  3.  2.  2.]
>>>>>>>>>  [ 1.  3.  1.  1.]
>>>>>>>>>  [ 1.  3.  1.  1.]
>>>>>>>>>  [ 1.  3.  1.  1.]]
>>>>>>>>>
>>>>>>>>> The out argument for numpy.add() is used here to operate in a similar
>>>>>>>>> way to the Python "+=" operation.  But it avoids the temporary numpy
>>>>>>>>> data structures that the Python "+=" operation will create.  This will
>>>>>>>>> save a lot of time in the dispersion code.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Edward
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10 June 2014 15:56, Edward d'Auvergne <[email protected]> wrote:
>>>>>>>>>> Hi Troels,
>>>>>>>>>>
>>>>>>>>>> Here is one suggestion, of many that I have, for significantly
>>>>>>>>>> improving the speed of the analytic dispersion models in your
>>>>>>>>>> 'disp_spin_speed' branch.  The speed ups you have currently achieved
>>>>>>>>>> for spin clusters are huge and very impressive.  But now that you 
>>>>>>>>>> have
>>>>>>>>>> the infrastructure in place, you can advance this much more!
>>>>>>>>>>
>>>>>>>>>> The suggestion has to do with the R20, R20A, and R20B numpy data
>>>>>>>>>> structures.  They way they are currently handled is relatively
>>>>>>>>>> inefficient, in that they are created de novo for each function call.
>>>>>>>>>> This means that memory allocation and Python garbage collection
>>>>>>>>>> happens for every single function call - something which should be
>>>>>>>>>> avoided at almost all costs.
>>>>>>>>>>
>>>>>>>>>> A better way to do this would be to have a self.R20_struct,
>>>>>>>>>> self.R20A_struct, and self.R20B_struct created in __init__(), and 
>>>>>>>>>> then
>>>>>>>>>> to pack in the values from the parameter vector into these 
>>>>>>>>>> structures.
>>>>>>>>>> You could create a special structure in __init__() for this.  It 
>>>>>>>>>> would
>>>>>>>>>> have the dimensions [r20_index][ei][si][mi][oi], where the first
>>>>>>>>>> dimension corresponds to the different R20 parameters.  And for each
>>>>>>>>>> r20_index element, you would have ones at the [ei][si][mi][oi]
>>>>>>>>>> positions where you would like R20 to be, and zeros elsewhere.  The
>>>>>>>>>> key is that this is created at the target function start up, and not
>>>>>>>>>> for each function call.
>>>>>>>>>>
>>>>>>>>>> This would be combined with the very powerful 'out' argument set to
>>>>>>>>>> self.R20_struct with the numpy.add() and numpy.multiply() functions 
>>>>>>>>>> to
>>>>>>>>>> prevent all memory allocations and garbage collection.  Masks could 
>>>>>>>>>> be
>>>>>>>>>> used, but I think that that would be much slower than having special
>>>>>>>>>> numpy structures with ones where R20 should be and zeros elsewhere.
>>>>>>>>>> For just creating these structures, looping over a single r20_index
>>>>>>>>>> loop and multiplying by the special [r20_index][ei][si][mi][oi]
>>>>>>>>>> one/zero structure and using numpy.add() and numpy.multiply() with 
>>>>>>>>>> out
>>>>>>>>>> arguments would be much, much faster than masks or the current
>>>>>>>>>> R20_axis logic.  It will also simplify the code.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Edward
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> relax (http://www.nmr-relax.com)
>>>>>>>>>
>>>>>>>>> This is the relax-devel mailing list
>>>>>>>>> [email protected]
>>>>>>>>>
>>>>>>>>> To unsubscribe from this list, get a password
>>>>>>>>> reminder, or change your subscription options,
>>>>>>>>> visit the list information page at
>>>>>>>>> https://mail.gna.org/listinfo/relax-devel

_______________________________________________
relax (http://www.nmr-relax.com)

This is the relax-devel mailing list
[email protected]

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel

Re: Speed up suggestion for task #7807.

Reply via email to