Hi Magnus,

On Wed, 23 Feb 2011 12:12:13 +0100, Magnus Paulsson <paulsso...@gmail.com> wrote:
> 1'st, thanks for developing pyCUDA. Just started playing with it last
> week and have already code that outperforms the numpy version 10-100
> fold. However, some things are still unclear to me so I will mix
> explaining how I understand things and ask questions. Please correct
> me if my understanding is faulty.
>
> 1: gpuarray: I only use gpuarray to send data to the device. Even if
> I use my own kernels or scikit.cuda on the data. However, as the
> following example demonstrates, you have to make copies of the numpy
> array before sending it to the gpu to ensure consistent indexing
> (c-type storage without any strange strides) for multi-dimensional
> arrays.
>
> import pycuda.autoinit
> import pycuda.gpuarray as gpuarray
> import numpy as N
>
> a=N.zeros((2,2))
> a[0,1] = 1
> a[1,0] = 2
> print "\na=\n",a
> print "\ngpu a=\n",gpuarray.to_gpu(a).get()
> aT=a.T
> print "\na^T=\n",aT
> print aT.__array_interface__
> print "\ngpu a^T=\n",gpuarray.to_gpu(aT).get()
> print aT.copy().__array_interface__
> print "\ngpu a^T.copy()=\n",gpuarray.to_gpu(aT.copy()).get()
>
> Note that the gpuarray.to_gpu(aT) is not transposed as it should be.
> However, making the copy cures this.

Right-- .T in numpy doesn't change memory layout, it just gives you a
new numpy array pointing to the same storage with different
meta-information about strides and array dimensions. Since PyCUDA
copies the bare bits, the memory layout on the GPU is unchanged as
well.

> 2: Async to device: I read that you need page-locked memory on the
> host for the async copies to work.
> Does pycuda.gpuarray.to_gpu_async(x) lock the memory of the numpy
> array x or copy the data to a locked memory area?

Neither. You need to have made the array using

http://documen.tician.de/pycuda/driver.html#pycuda.driver.pagelocked_empty

> 3: gpuarray.get_async(): Is control returned to python before the
> transfer is completed (as async would indicate)? How do I check when
> the transfer is complete?

Streams and events.

> Page-locked memory?

Automatically allocated for you in get_async().

> Do I have to create streams and events to make
> async copies work?

Not necessarily, but without them it's kind of useless.

> 4: Streams: My understanding is that each stream is executed serially
> while different streams are running in parallel. Except stream "0"
> which waits for all other streams to finish before starting. Any
> simple example?

We don't have a simple one in PyCUDA--if you'd like to write one,
that'd be much appreciated.

Andreas
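
To make the layout point from item 1 concrete, here is a minimal NumPy-only sketch. It does not call PyCUDA at all: the helper bare_bits_copy is a hypothetical stand-in that emulates a copy which moves the underlying bytes and ignores strides, as described above. It shows why the transposed view comes back untransposed, and that a contiguous copy (aT.copy() or numpy.ascontiguousarray) restores the expected result.

```python
import numpy as np

a = np.zeros((2, 2))
a[0, 1] = 1
a[1, 0] = 2

# .T returns a view: same buffer, swapped strides, no data moved.
aT = a.T
print("a strides: ", a.strides)
print("aT strides:", aT.strides)
print("shares memory:", np.shares_memory(a, aT))


def bare_bits_copy(arr):
    """Hypothetical helper (not a PyCUDA function).

    Emulates a transfer that copies the raw bytes in memory order and
    then reads them back as a C-ordered array of the same shape,
    ignoring the stride information of the source array.
    """
    raw = arr.tobytes(order="A")  # 'A' keeps memory order for F-contiguous input
    return np.frombuffer(raw, dtype=arr.dtype).reshape(arr.shape)


# Copying the view's raw bytes reproduces a, not a transposed.
naive = bare_bits_copy(aT)

# Making the data C-contiguous first gives the expected transposed result.
fixed = bare_bits_copy(np.ascontiguousarray(aT))

print("naive equals a: ", np.array_equal(naive, a))
print("fixed equals aT:", np.array_equal(fixed, aT))
```

Checking arr.flags['C_CONTIGUOUS'] before a transfer is a cheap way to catch this case without copying arrays that are already laid out correctly.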
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda