On Wed, Feb 4, 2009 at 10:43, Andreas Klöckner <[email protected]> wrote:
> On Saturday, 31 January 2009, Nicholas Tung wrote:
> > Sure, no problem. Also, another question: how do I pass an offset from a
> > DeviceAllocation to a kernel? I can't seem to pass int's where it wants
> > pointers despite being able to cast DeviceAllocation to an int.
>
> Two ways: wrap it in a numpy.intp--a pointer sized integer. Or use the
> prepared invocation interface, which handles this seamlessly.
>
> Current git has test cases for this.
Okay, this works well for calling kernels, but what about initializing
structs? Is it correct on all architectures? Please take a look at
http://forums.nvidia.com/index.php?showtopic=88456
-- maybe you could answer it. Is there ever a discrepancy between device
and host pointer sizes? I doubt numpy.intp takes the GPU's device pointer
size into account.
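For the host side at least, here is a quick sanity check that runs without a GPU (just numpy and the standard library). To be clear about my assumptions: the "iiP" format string is my guess at the native layout of the DoubleOperation struct from the attached diff, and this says nothing about the device side, which is still the open question:

```python
import ctypes
import struct

import numpy

# numpy.intp is sized to the *host* pointer width -- it is defined in
# terms of the CPU's void*, and knows nothing about the GPU.
assert numpy.dtype(numpy.intp).itemsize == ctypes.sizeof(ctypes.c_void_p)

# The DoubleOperation struct in the attached diff is laid out as
#   int datalen; int __padding; float *ptr;
# so the pointer member sits at byte offset 8.  Expressed with the
# struct module ("iiP" = two native ints followed by a pointer, with
# native alignment), the total size matches mem_size from the diff:
assert struct.calcsize("ii") == 8                         # pointer offset
assert struct.calcsize("iiP") == 8 + numpy.intp(0).nbytes # total size
print("host pointer size:", numpy.dtype(numpy.intp).itemsize, "bytes")
```

Both assertions hold on 32- and 64-bit hosts, since "P" and numpy.intp track the same host pointer width; whether the device pointer width ever differs is exactly what the forum thread asks.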
I created some documentation for it anyway; diff attached. Please test
demo.py on a 32-bit machine if you can.
cheers,
Nicholas
diff --git a/doc/source/driver.rst b/doc/source/driver.rst
index 8771a41..4e638ed 100644
--- a/doc/source/driver.rst
+++ b/doc/source/driver.rst
@@ -355,7 +355,7 @@ have an attribute *base* that references an object of type
Release the held memory now instead of when this object
becomes unreachable. Any further use of the object (or its
- associated :module:`numpy` array) is an error
+ associated :mod:`numpy` array) is an error
and will lead to undefined behavior.
Arrays and Textures
@@ -488,6 +488,10 @@ Initializing Device Memory
.. function:: memset_d16(dest, data, count)
.. function:: memset_d32(dest, data, count)
+ .. note::
+
+ *count* is the number of elements, not bytes.
+
.. function:: memset_d2d8(dest, pitch, data, width, height)
.. function:: memset_d2d16(dest, pitch, data, width, height)
.. function:: memset_d2d32(dest, pitch, data, width, height)
diff --git a/doc/source/tutorial.rst b/doc/source/tutorial.rst
index 23a02b0..5c8350e 100644
--- a/doc/source/tutorial.rst
+++ b/doc/source/tutorial.rst
@@ -20,15 +20,15 @@ Transferring Data
The next step in most programs is to transfer data onto the device.
In PyCuda, you will mostly transfer data from :mod:`numpy` arrays
-on the host. (But indeed, everything that satifies the Python buffer
+on the host. (But indeed, everything that satisfies the Python buffer
interface will work, even a :class:`str`.) Let's make a 4x4 array
of random numbers::
import numpy
a = numpy.random.randn(4,4)
-But wait--*a* consists of double precision numbers, but Nvidia
-devices only support single precision as of this writing::
+But wait--*a* consists of double precision numbers, while most Nvidia
+devices only support single precision::
a = a.astype(numpy.float32)
@@ -90,6 +90,85 @@ Stick around for some bonus material in the next section, though.
(You can find the code for this demo as :file:`examples/demo.py` in the PyCuda
source distribution.)
+Alternate kernel invocations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :class:`In`, :class:`Out`, and :class:`InOut` argument handlers can
+simplify some of the memory transfers. For example, instead of creating
+*a_gpu*, if overwriting *a* is acceptable, the following code can be used::
+
+ func(cuda.InOut(a), block=(4, 4, 1))
+
+Structures
+^^^^^^^^^^
+
+Suppose we have the following structure, for doubling a number of
+variable-length arrays::
+
+ mod = cuda.SourceModule("""
+ struct DoubleOperation {
+ int datalen, __padding; // so 64-bit ptrs can be aligned
+ float *ptr;
+ };
+
+ __global__ void double_array(DoubleOperation *a) {
+ a = &a[blockIdx.x];
+ for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
+ a->ptr[idx] *= 2;
+ }
+ }
+ """)
+
+Each block in the grid (see CUDA documentation) will double one of the arrays.
+The `for` loop allows more data elements than threads to be doubled,
+though it is not efficient if one can guarantee a sufficient number of
+threads. Next, a wrapper class for the structure is created, and two
+arrays are instantiated::
+
+ class DoubleOpStruct:
+ mem_size = 8 + numpy.intp(0).nbytes
+ def __init__(self, array, struct_arr_ptr):
+ self.data = cuda.to_device(array)
+ self.shape, self.dtype = array.shape, array.dtype
+ cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size))
+ cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.intp(int(self.data)))
+ def __str__(self):
+ return str(cuda.from_device(self.data, self.shape, self.dtype))
+
+ struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
+ do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size
+
+ array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
+ array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
+ print("original arrays", array1, array2)
+
+This code uses the :func:`to_device` and :func:`from_device` functions to
+allocate and copy values, and demonstrates how offsets to an allocated
+block of memory can be used. Finally, the code can be executed; the following
+demonstrates doubling both arrays, then only the second::
+
+ func = mod.get_function("double_array")
+ func(struct_arr, block = (32, 1, 1), grid=(2, 1))
+ print("doubled arrays", array1, array2)
+
+ func(numpy.intp(do2_ptr), block = (32, 1, 1), grid=(1, 1))
+ print("doubled second only", array1, array2, "\n")
+
+Prepared invocations
+^^^^^^^^^^^^^^^^^^^^
+
+Function invocation using the built-in __call__ method may be slow
+(see :ref:`reference-doc`). To achieve the same effect as above,
+the function is bound to argument types (as designated by Python's
+standard library :mod:`struct` module), and then called. This also
+avoids having to cast arguments using `numpy.[type]` functions::
+
+ func.prepare("P", block=(32, 1, 1))
+ func.prepared_call((2, 1), struct_arr)
+ print("doubled again", array1, array2)
+ func.prepared_call((1, 1), do2_ptr)
+ print("doubled second only again", array1, array2, "\n")
+
Bonus: Abstracting Away the Complications
-----------------------------------------
diff --git a/examples/demo.py b/examples/demo.py
index d742b74..ac80d02 100644
--- a/examples/demo.py
+++ b/examples/demo.py
@@ -1,5 +1,6 @@
# Sample source code from the Tutorial Introduction in the documentation.
+from __future__ import print_function
import pycuda.driver as cuda
import pycuda.autoinit
@@ -25,12 +26,63 @@ func(a_gpu, block=(4,4,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
-print a_doubled
-print a
+print("original array:", a, sep="\n")
+print("doubled with kernel:", a_doubled, "", sep="\n")
+
+# alternate kernel invocation -------------------------------------------------
+
+func(cuda.InOut(a), block=(4, 4, 1))
+print("doubled with InOut:", a, "", sep="\n")
+
+# prepared invocations and structures -----------------------------------------
+
+class DoubleOpStruct:
+ # FIXME - see http://forums.nvidia.com/index.php?showtopic=88456
+ mem_size = 8 + numpy.intp(0).nbytes
+ def __init__(self, array, struct_arr_ptr):
+ self.data = cuda.to_device(array)
+ self.shape, self.dtype = array.shape, array.dtype
+ cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size))
+ cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.intp(int(self.data)))
+ def __str__(self):
+ return str(cuda.from_device(self.data, self.shape, self.dtype))
+
+struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
+do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size
+
+array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
+array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
+print("original arrays", array1, array2)
+
+mod = cuda.SourceModule("""
+ struct DoubleOperation {
+ int datalen, __padding; // so 64-bit ptrs can be aligned
+ float *ptr;
+ };
+
+ __global__ void double_array(DoubleOperation *a) {
+ a = &a[blockIdx.x];
+ for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
+ a->ptr[idx] *= 2;
+ }
+ }
+ """)
+func = mod.get_function("double_array")
+func(struct_arr, block=(32, 1, 1), grid=(2, 1))
+print("doubled arrays", array1, array2)
+
+func(numpy.intp(do2_ptr), block=(32, 1, 1), grid=(1, 1))
+print("doubled second only", array1, array2, "\n")
+
+func.prepare("P", block=(32, 1, 1))
+func.prepared_call((2, 1), struct_arr)
+print("doubled again", array1, array2)
+func.prepared_call((1, 1), do2_ptr)
+print("doubled second only again", array1, array2, "\n")
# part 2 ----------------------------------------------------------------------
import pycuda.gpuarray as gpuarray
a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
-print a_doubled
-print a_gpu
+print("original array:", a_gpu, sep="\n")
+print("doubled with gpuarray:", a_doubled, sep="\n")
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net