One other tip when setting up your program: remember to reduce memory
accesses as much as possible, and try to maximize the computations you
perform for every memory transfer. In practice, you'll probably want to
load a large chunk of tf and compute on several indices of Mf and farray
at once.
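A minimal NumPy sketch of that chunking idea (CPU only, just to show the access pattern; the function name, the chunk size, and the einsum reduction are my own choices, with the array names borrowed from your MSIN below):

```python
import numpy as np

def msin_chunked(farray, Mf, tf, chunk=256):
    """Accumulate Mf[i] * 2*cos(2*pi*farray[i]*tf) over all i,
    processing `chunk` coefficients per pass so that each tile of tf
    loaded from memory is reused for many cos evaluations."""
    Msin = np.zeros_like(tf)
    for start in range(0, len(farray), chunk):
        f = farray[start:start + chunk]   # (c,) frequencies this pass
        m = Mf[start:start + chunk]       # (c,) matching coefficients
        # Broadcast to (c, H, W), then reduce over the coefficient axis.
        phase = 2.0 * np.pi * f[:, None, None] * tf[None, :, :]
        Msin += np.einsum('i,ijk->jk', m, 2.0 * np.cos(phase))
    return Msin
```

On a GPU the same structure applies: each thread block keeps a tile of tf in shared memory and loops over a slice of farray/Mf, so the expensive tf loads are amortized over many coefficients.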
Craig

On Sat, Mar 28, 2015 at 8:19 PM Craig Stringham <string...@mers.byu.edu>
wrote:

> Hi Bruce,
> That's an excellent problem for a GPU. However, because each problem uses
> a fair amount of memory, being careful about how the memory is accessed
> will dominate your performance gains (as is typical when using a GPU). For
> example, tf won't fit in the shared memory or cache of a multiprocessor,
> so you'll also want to divide the problem again.
> If you don't need to get this working for routine usage, though, you might
> just try using numba primitives to move it to a GPU. I haven't used them
> myself, so I can't attest that they will give you a good answer. On the
> other hand, this is the sort of problem that makes learning CUDA and
> PyCUDA easy, so you might as well give it a shot.
> Regards,
> Craig
>
> On Sat, Mar 28, 2015 at 8:29 AM Bruce Labitt <bdlab...@gmail.com> wrote:
>
>> From reading the documentation, I am unsure whether parallelizing this
>> kind of function is worth doing in PyCUDA.
>>
>> I'm trying to add the effect of phase noise into a radar simulation.
>> The simulation is written in SciPy/NumPy.  Currently I am using joblib to
>> run on multiple cores.  It is too slow for the scenarios I wish to try,
>> though it does work for a small number of targets and reduced phase noise
>> array sizes.  The following is the current approach:
>>
>> Function to parallelize:
>>
>> from numpy import cos, pi
>>
>> def MSIN( farray, Mf, tf, jj ):
>>     """
>>     farray, Mf, tf, jj
>>
>>     farray  array of frequencies   (size = 10000)
>>     Mf      array of coefficients  (size = 10000)
>>     tf      2D array ~[2048 x 256] of time
>>     jj      list of indices (fraction of the problem to solve)
>>     """
>>     Msin = 0.0
>>     for ii in jj:
>>         Msin = Msin + Mf[ii] * 2.0*cos( 2.0*pi*farray[ii]*tf )
>>     return Msin
>>
>> Current method of calling the function in parallel (via joblib):
>>
>> """
>> ====================================================
>> Parallel computes the function MSIN with njobs cores
>> ====================================================
>> """
>> from operator import add
>> from joblib import Parallel, delayed
>>
>> MMM = Parallel(n_jobs=njobs, max_nbytes=None)\
>>     (delayed(MSIN)( f, aa, tf1, ii ) for ii in idx)
>> Msin = reduce(add, MMM)     # add the partial results from all cores
>>
>> Any suggestions for porting this to PyCUDA?  Is it a reasonable candidate?
>>
>> In essence, it is accumulating a scalar-weighted cos function over many
>> elements of a 2D array.  It 'feels' like it should be portable.  Any
>> roadblocks foreseen?  The 2D array of times is contiguous in the sense of
>> stride, but there are discontinuous jumps in the time values within the
>> array, which I do not think is a problem.
>>
>> I have from DumpProperties.py
>> Device #0: GeForce GTX 680M
>>   Compute Capability: 3.0
>>   Total Memory: 4193984 KB
>>   CAN_MAP_HOST_MEMORY: 1
>>   CLOCK_RATE: 758000
>>   MAX_BLOCK_DIM_X: 1024
>>   MAX_BLOCK_DIM_Y: 1024
>>   MAX_BLOCK_DIM_Z: 64
>>   MAX_GRID_DIM_X: 2147483647
>>   MAX_GRID_DIM_Y: 65535
>>   MAX_GRID_DIM_Z: 65535
>>
>> CUDA6.5
>>
>> Thanks in advance for any insight, or suggestions on how to attack the
>> problem
>>
>> -Bruce
>>
>> _______________________________________________
>> PyCUDA mailing list
>> PyCUDA@tiker.net
>> http://lists.tiker.net/listinfo/pycuda
>>
>