Hi Magnus,

On Wed, 23 Feb 2011 12:12:13 +0100, Magnus Paulsson <paulsso...@gmail.com> 
wrote:
> First, thanks for developing PyCUDA. I just started playing with it last
> week and already have code that outperforms the numpy version 10-100
> fold. However, some things are still unclear to me, so I will mix
> explaining how I understand things with asking questions. Please correct
> me if my understanding is faulty.
> 
> 1: gpuarray: I only use gpuarray to send data to the device, even if
> I use my own kernels or scikit.cuda on the data. However, as the
> following example demonstrates, you have to make a copy of the numpy
> array before sending it to the GPU to ensure consistent indexing
> (C-contiguous storage without any strange strides) for multi-dimensional
> arrays.
> 
> import pycuda.autoinit
> import pycuda.gpuarray as gpuarray
> import numpy as N
> 
> a=N.zeros((2,2))
> a[0,1] = 1
> a[1,0] = 2
> print "\na=\n",a
> print "\ngpu a=\n",gpuarray.to_gpu(a).get()
> aT=a.T
> print "\na^T=\n",aT
> print aT.__array_interface__
> print "\ngpu a^T=\n",gpuarray.to_gpu(aT).get()
> print aT.copy().__array_interface__
> print "\ngpu a^T.copy()=\n",gpuarray.to_gpu(aT.copy()).get()
> 
> Note that gpuarray.to_gpu(aT) does not come back transposed as it should
> be. However, making the copy cures this.

Right-- .T in numpy doesn't change the memory layout; it just gives you a
new numpy array pointing to the same storage with different
meta-information about strides and array dimensions. Since PyCUDA copies
the bare bits, the memory layout on the GPU is unchanged as well.
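If you want the transposed layout on the GPU, force a C-contiguous copy on
the host first, e.g. with numpy.ascontiguousarray (which does the same thing
as the aT.copy() in your example). A minimal sketch:

import numpy as N
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

a = N.zeros((2, 2))
a[0, 1] = 1
a[1, 0] = 2

# a.T only rewrites the strides/shape metadata; the buffer is untouched,
# so to_gpu(a.T) copies exactly the same bytes as to_gpu(a).
aT = N.ascontiguousarray(a.T)    # makes a real, C-contiguous copy
print gpuarray.to_gpu(aT).get()  # now matches a.T on the host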

> 2: Async to device: I read that you need page-locked memory on the
> host for the async copies to work.
> Does pycuda.gpuarray.to_gpu_async(x) lock the memory of the numpy
> array x or copy the data to a locked memory area?

Neither. You need to have made the array using
http://documen.tician.de/pycuda/driver.html#pycuda.driver.pagelocked_empty
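Something along these lines (a rough sketch--the shape, dtype and explicit
stream here are just for illustration):

import numpy as N
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

stream = drv.Stream()

# page-locked (pinned) host buffer, required for a truly asynchronous copy
a = drv.pagelocked_empty((1024, 1024), N.float64)
a[:] = 1.0

# the copy is queued on the stream and may overlap with other work
a_gpu = gpuarray.to_gpu_async(a, stream=stream)

stream.synchronize()   # wait until the copy has actually finished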

> 3: gpuarray.get_async(): Is control returned to python before the
> transfer is completed (as async would indicate)? How do I check when
> the transfer is complete? 

Streams and events.

> Page-locked memory?

Automatically allocated for you in get_async().

> Do I have to create streams and events to make
> async copies work?

Not necessarily, but without them it's kind of useless.
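To check on a transfer, record an event on the same stream right after the
copy and then poll or wait on it. Roughly (sketch; names are arbitrary):

import numpy as N
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

stream = drv.Stream()
a_gpu = gpuarray.to_gpu(N.arange(16).astype(N.float32))

result = a_gpu.get_async(stream=stream)  # returns immediately;
                                         # 'result' gets filled in later
done = drv.Event()
done.record(stream)                      # marker queued after the copy

# ... do other host work here ...

if done.query():       # non-blocking: has the copy finished yet?
    pass
done.synchronize()     # or block until it has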

> 4: Streams: My understanding is that each stream is executed serially,
> while different streams run in parallel, except stream "0", which
> waits for all other streams to finish before starting. Any
> simple example?

We don't have a simple one in PyCUDA--if you'd like to write one, that'd
be much appreciated.
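As a starting point, the pattern would look roughly like this (untested
sketch; the kernel and the sizes are just placeholders):

import numpy as N
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void twice(float *a)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    a[i] *= 2;   // grid exactly covers the array, so no bounds check
}
""")
twice = mod.get_function("twice")

n = 1024*1024
streams = [drv.Stream() for i in range(2)]
host = [drv.pagelocked_empty(n, N.float32) for i in range(2)]
dev = []

for h, s in zip(host, streams):
    h[:] = 1
    d = gpuarray.to_gpu_async(h, stream=s)                   # copy on stream s
    twice(d, block=(256, 1, 1), grid=(n//256, 1), stream=s)  # kernel on stream s
    dev.append(d)

for s in streams:
    s.synchronize()   # each stream's copy+kernel ran in issue order;
                      # the two streams were free to overlap

Work queued on the same stream runs in order; work on different streams may
overlap (copies with kernels in particular, given page-locked memory).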

Andreas
