On Wed, Feb 4, 2009 at 10:43, Andreas Klöckner <[email protected]> wrote:
> On Saturday, 31 January 2009, Nicholas Tung wrote:
> > Sure, no problem. Also, another question: how do I pass an offset from a
> > DeviceAllocation to a kernel? I can't seem to pass int's where it wants
> > pointers despite being able to cast DeviceAllocation to an int.
>
> Two ways: wrap it in a numpy.intp--a pointer sized integer. Or use the
> prepared invocation interface, which handles this seamlessly.
>
> Current git has test cases for this.
Okay, this works well for calling kernels, but what about initializing
structs? Is it correct on all architectures? Please take a look at
http://forums.nvidia.com/index.php?showtopic=88456
-- maybe you could answer it. Is there ever a discrepancy between device
and host pointer sizes? I doubt numpy.intp takes the GPU's device pointer
size into account.
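For the host side at least, here is a quick sanity check that runs without a GPU (just numpy and the standard library). To be clear about my assumptions: the "iiP" format string is my guess at the native layout of the DoubleOperation struct from the attached diff, and this says nothing about the device side, which is still the open question:

```python
import ctypes
import struct

import numpy

# numpy.intp is sized to the *host* pointer width -- it is defined in
# terms of the CPU's void*, and knows nothing about the GPU.
assert numpy.dtype(numpy.intp).itemsize == ctypes.sizeof(ctypes.c_void_p)

# The DoubleOperation struct in the attached diff is laid out as
#   int datalen; int __padding; float *ptr;
# so the pointer member sits at byte offset 8.  Expressed with the
# struct module ("iiP" = two native ints followed by a pointer, with
# native alignment), the total size matches mem_size from the diff:
assert struct.calcsize("ii") == 8                         # pointer offset
assert struct.calcsize("iiP") == 8 + numpy.intp(0).nbytes # total size
print("host pointer size:", numpy.dtype(numpy.intp).itemsize, "bytes")
```

Both assertions hold on 32- and 64-bit hosts, since "P" and numpy.intp track the same host pointer width; whether the device pointer width ever differs is exactly what the forum thread asks.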
I created some documentation for it anyway; diff attached. Please test
demo.py on a 32-bit machine if you can.
cheers,
Nicholas
diff --git a/doc/source/driver.rst b/doc/source/driver.rst
index 8771a41..4e638ed 100644
--- a/doc/source/driver.rst
+++ b/doc/source/driver.rst
@@ -355,7 +355,7 @@ have an attribute *base* that references an object of type
Release the held memory now instead of when this object
becomes unreachable. Any further use of the object (or its
- associated :module:`numpy` array) is an error
+ associated :mod:`numpy` array) is an error
and will lead to undefined behavior.
Arrays and Textures
@@ -488,6 +488,10 @@ Initializing Device Memory
.. function:: memset_d16(dest, data, count)
.. function:: memset_d32(dest, data, count)
+ .. note::
+
+ *count* is the number of elements, not bytes.
+
.. function:: memset_d2d8(dest, pitch, data, width, height)
.. function:: memset_d2d16(dest, pitch, data, width, height)
.. function:: memset_d2d32(dest, pitch, data, width, height)
diff --git a/doc/source/tutorial.rst b/doc/source/tutorial.rst
index 23a02b0..5c8350e 100644
--- a/doc/source/tutorial.rst
+++ b/doc/source/tutorial.rst
@@ -20,15 +20,15 @@ Transferring Data
The next step in most programs is to transfer data onto the device.
In PyCuda, you will mostly transfer data from :mod:`numpy` arrays
-on the host. (But indeed, everything that satifies the Python buffer
+on the host. (But indeed, everything that satisfies the Python buffer
interface will work, even a :class:`str`.) Let's make a 4x4 array
of random numbers::
import numpy
a = numpy.random.randn(4,4)
-But wait--*a* consists of double precision numbers, but Nvidia
-devices only support single precision as of this writing::
+But wait--*a* consists of double precision numbers, while most Nvidia
+devices only support single precision::
a = a.astype(numpy.float32)
@@ -90,6 +90,85 @@ Stick around for some bonus material in the next section, though.
(You can find the code for this demo as :file:`examples/demo.py` in the PyCuda
source distribution.)
+Alternate kernel invocations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The :class:`In`, :class:`Out`, and :class:`InOut` argument handlers can
+simplify some of the memory transfers. For example, instead of creating
+*a_gpu*, if overwriting *a* is acceptable, the following code can be used::
+
+ func(cuda.InOut(a), block=(4, 4, 1))
+
+Structures
+^^^^^^^^^^
+
+Suppose we have the following structure, for doubling a number of
+variable-length arrays::
+
+ mod = cuda.SourceModule("""
+ struct DoubleOperation {
+ int datalen, __padding; // so 64-bit ptrs can be aligned
+ float *ptr;
+ };
+
+ __global__ void double_array(DoubleOperation *a) {
+ a = &a[blockIdx.x];
+ for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
+ a->ptr[idx] *= 2;
+ }
+ }
+ """)
+
+Each block in the grid (see CUDA documentation) will double one of the arrays.
+The `for` loop allows more data elements than threads to be doubled,
+though it is not efficient if one can guarantee a sufficient number of
+threads. Next, a wrapper class for the structure is created, and two
+arrays are instantiated::
+
+ class DoubleOpStruct:
+ mem_size = 8 + numpy.intp(0).nbytes
+ def __init__(self, array, struct_arr_ptr):
+ self.data = cuda.to_device(array)
+ self.shape, self.dtype = array.shape, array.dtype
+ cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size))
+ cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.intp(int(self.data)))
+ def __str__(self):
+ return str(cuda.from_device(self.data, self.shape, self.dtype))
+
+ struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
+ do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size
+
+ array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
+ array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
+ print("original arrays", array1, array2)
+
+This code uses the :func:`to_device` and :func:`from_device` functions to
+allocate and copy values, and demonstrates how offsets to an allocated
+block of memory can be used. Finally, the code can be executed; the following
+demonstrates doubling both arrays, then only the second::
+
+ func = mod.get_function("double_array")
+ func(struct_arr, block = (32, 1, 1), grid=(2, 1))
+ print("doubled arrays", array1, array2)
+
+ func(numpy.intp(do2_ptr), block = (32, 1, 1), grid=(1, 1))
+ print("doubled second only", array1, array2, "\n")
+
+Prepared invocations
+^^^^^^^^^^^^^^^^^^^^
+
+Function invocation using the built-in __call__ method may be slow
+(see :ref:`reference-doc`). To achieve the same effect as above,
+the function is bound to argument types (as designated by Python's
+standard library :mod:`struct` module), and then called. This also
+avoids having to cast arguments using `numpy.[type]` functions::
+
+ func.prepare("P", block=(32, 1, 1))
+ func.prepared_call((2, 1), struct_arr)
+ print("doubled again", array1, array2)
+ func.prepared_call((1, 1), do2_ptr)
+ print("doubled second only again", array1, array2, "\n")
+
Bonus: Abstracting Away the Complications
-----------------------------------------
diff --git a/examples/demo.py b/examples/demo.py
index d742b74..ac80d02 100644
--- a/examples/demo.py
+++ b/examples/demo.py
@@ -1,5 +1,6 @@
# Sample source code from the Tutorial Introduction in the documentation.
+from __future__ import print_function
import pycuda.driver as cuda
import pycuda.autoinit
@@ -25,12 +26,63 @@ func(a_gpu, block=(4,4,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
-print a_doubled
-print a
+print("original array:", a, sep="\n")
+print("doubled with kernel:", a_doubled, "", sep="\n")
+
+# alternate kernel invocation -------------------------------------------------
+
+func(cuda.InOut(a), block=(4, 4, 1))
+print("doubled with InOut:", a, "", sep="\n")
+
+# prepared invocations and structures -----------------------------------------
+
+class DoubleOpStruct:
+ # FIXME - see http://forums.nvidia.com/index.php?showtopic=88456
+ mem_size = 8 + numpy.intp(0).nbytes
+ def __init__(self, array, struct_arr_ptr):
+ self.data = cuda.to_device(array)
+ self.shape, self.dtype = array.shape, array.dtype
+ cuda.memcpy_htod(int(struct_arr_ptr), numpy.int32(array.size))
+ cuda.memcpy_htod(int(struct_arr_ptr) + 8, numpy.intp(int(self.data)))
+ def __str__(self):
+ return str(cuda.from_device(self.data, self.shape, self.dtype))
+
+struct_arr = cuda.mem_alloc(2 * DoubleOpStruct.mem_size)
+do2_ptr = int(struct_arr) + DoubleOpStruct.mem_size
+
+array1 = DoubleOpStruct(numpy.array([1, 2, 3], dtype=numpy.float32), struct_arr)
+array2 = DoubleOpStruct(numpy.array([0, 4], dtype=numpy.float32), do2_ptr)
+print("original arrays", array1, array2)
+
+mod = cuda.SourceModule("""
+ struct DoubleOperation {
+ int datalen, __padding; // so 64-bit ptrs can be aligned
+ float *ptr;
+ };
+
+ __global__ void double_array(DoubleOperation *a) {
+ a = &a[blockIdx.x];
+ for (int idx = threadIdx.x; idx < a->datalen; idx += blockDim.x) {
+ a->ptr[idx] *= 2;
+ }
+ }
+ """)
+func = mod.get_function("double_array")
+func(struct_arr, block=(32, 1, 1), grid=(2, 1))
+print("doubled arrays", array1, array2)
+
+func(numpy.intp(do2_ptr), block=(32, 1, 1), grid=(1, 1))
+print("doubled second only", array1, array2, "\n")
+
+func.prepare("P", block=(32, 1, 1))
+func.prepared_call((2, 1), struct_arr)
+print("doubled again", array1, array2)
+func.prepared_call((1, 1), do2_ptr)
+print("doubled second only again", array1, array2, "\n")
# part 2 ----------------------------------------------------------------------
import pycuda.gpuarray as gpuarray
a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
-print a_doubled
-print a_gpu
+print("original array:", a_gpu, sep="\n")
+print("doubled with gpuarray:", a_doubled, sep="\n")
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net