Hi again,

Here is a working example of kernel concurrency (file attached).
Launching all of the kernels before placing a record event is what
enabled concurrency. Try it for yourself: if you comment out line 57,
the kernels will execute serially.
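In miniature, the difference between the two orderings is sketched
below, using the names from the attached file. (Only the second
variant appears in the attached file; the first is a sketch of the
ordering that, in my experiments, serializes execution.)

# Serialized (sketch): the end-event records are interleaved with
# the kernel launches.
for k in range(n):
    event[k]['kernel_begin'].record(stream[k])
    my_kernel(d_data[k], block=(N, 1, 1), stream=stream[k])
    event[k]['kernel_end'].record(stream[k])

# Concurrent (as in the attached file): launch all the kernels first,
# then place the end-event records in a second loop.
for k in range(n):
    event[k]['kernel_begin'].record(stream[k])
    my_kernel(d_data[k], block=(N, 1, 1), stream=stream[k])
for k in range(n):
    event[k]['kernel_end'].record(stream[k])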
Here is the timing info output from the program on my machine:

=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 40.617023468
kernel_end : 45.0715522766
Dataset 1
kernel_begin : 40.6191368103
kernel_end : 45.0742073059

Notice that the kernel executions overlap. I'm fairly confident the
kernels are running concurrently because, with line 57 commented out,
the runtimes essentially double and there is no overlap in the
timings. Here is that output:

=== Timing info (for last set of kernel launches)
Dataset 0
kernel_begin : 80.6352005005
kernel_end : 85.0898895264
Dataset 1
kernel_begin : 85.091293335
kernel_end : 89.5460510254

Finally, Andreas, should I post this on the wiki? Also, is there any
way to disable the CUDA API trace without reinstalling PyCUDA?

Thanks!
Jesse

On Tue, Apr 5, 2011 at 11:47 AM, Jesse Lu <jess...@stanford.edu> wrote:
> Thanks Andreas,
>
> I've checked with a longer-running kernel, and there is still no
> overlap. Now I'm going to reinstall PyCUDA and CUDA 4.0, enable API
> tracing, and go from there. Thanks!
>
> Jesse
>
> On Mon, Apr 4, 2011 at 9:12 PM, Andreas Kloeckner
> <li...@informa.tiker.net> wrote:
>> On Mon, 4 Apr 2011 16:35:22 -0700, Jesse Lu <jess...@stanford.edu> wrote:
>> > Hi everyone,
>> >
>> > I have been trying to create a kernel concurrency example for
>> > pycuda but have been unsuccessful. Can anyone point out what I'm
>> > doing wrong? Thanks in advance!
>>
>> Two suggestions:
>>
>> - Perhaps try using longer-running kernels. (A simple for loop in
>>   the kernel might do the trick.)
>>
>> - Enable tracing in PyCUDA to see if any allocations or deletions
>>   happen while the kernel loop is going. (Those are often implicit
>>   sync points in CUDA.) It doesn't look like it, but it can't hurt
>>   to check.
>>
>> Andreas
#!/usr/bin/env python
# A simple program to illustrate kernel concurrency with PyCUDA.
# Reference: Chapter 3.2.6.5 in the CUDA C Programming Guide Version 3.2.
# Jesse Lu, 2011-04-04

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

#
# Set up test scenario.
#

# Create a simple test kernel.
mod = SourceModule("""
__global__ void my_kernel(float *d) {
    const int i = threadIdx.x;
    for (int m = 0; m < 100; m++) {
        for (int k = 0; k < 100; k++)
            d[i] = d[i] * 2.0;
        for (int k = 0; k < 100; k++)
            d[i] = d[i] / 2.0;
    }
    d[i] = d[i] * 2.0;
}
""")
my_kernel = mod.get_function("my_kernel")

# Create the test data on the host.
N = 400  # Size of datasets.
n = 2  # Number of datasets (and concurrent operations) used.
data, data_check, d_data = [], [], []
for k in range(n):
    data.append(np.random.randn(N).astype(np.float32))  # Create random data.
    data_check.append(data[k].copy())  # For checking the result afterwards.
    d_data.append(drv.mem_alloc(data[k].nbytes))  # Allocate memory on device.

#
# Start concurrency test.
#

# Use this event as a reference point.
ref = drv.Event()
ref.record()

# Create the streams and events needed.
stream, event = [], []
marker_names = ['kernel_begin', 'kernel_end']
for k in range(n):
    stream.append(drv.Stream())
    event.append(dict([(marker_names[l], drv.Event())
                       for l in range(len(marker_names))]))

# Transfer to device.
for k in range(n):
    drv.memcpy_htod(d_data[k], data[k])

# Run kernels many times; we will only keep data from the last loop iteration.
for j in range(10):
    for k in range(n):
        event[k]['kernel_begin'].record(stream[k])
        my_kernel(d_data[k], block=(N, 1, 1), stream=stream[k])
    for k in range(n):
        event[k]['kernel_end'].record(stream[k])

# Transfer data back to host.
for k in range(n):
    drv.memcpy_dtoh(data[k], d_data[k])

#
# Output results.
#

print '\n=== Device attributes'
dev = pycuda.autoinit.device
print 'Name:', dev.name()
print 'Compute capability:', dev.compute_capability()
print 'Concurrent Kernels:', \
    bool(dev.get_attribute(drv.device_attribute.CONCURRENT_KERNELS))

print '\n=== Checking answers'
for k in range(n):
    print 'Dataset', k, ':',
    if np.linalg.norm((data_check[k] * 2**(j + 1)) - data[k]) == 0.0:
        print 'passed.'
    else:
        print 'FAILED!'

print '\n=== Timing info (for last set of kernel launches)'
for k in range(n):
    print 'Dataset', k
    for l in range(len(marker_names)):
        print marker_names[l], ':', ref.time_till(event[k][marker_names[l]])
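One more note: if you want per-kernel durations rather than timestamps
relative to ref, events can be compared to each other directly with
time_till. The following snippet is not in the attached file, but
something like it could be appended after the timing loop:

# Not in the attached file: per-kernel durations, measured between
# each stream's own begin and end events (in milliseconds).
print '\n=== Kernel durations'
for k in range(n):
    print 'Dataset', k, ':', \
        event[k]['kernel_begin'].time_till(event[k]['kernel_end']), 'ms'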