I'm running on a GeForce GTX 780, with driver version 319.60.
I've put my test code below. I get essentially the same runtime whether I use
a single queue or multiple queues, and in both cases it is much longer than
doing the whole computation with a single kernel call.
import time
import numpy as np
import pyopencl as cl
import pyopencl.array
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
k = 16
n = 1000
N = k * n
rng = np.random.RandomState()
nsin = 5000
X = rng.uniform(size=N).astype(np.float32)
Y = rng.uniform(size=N).astype(np.float32)
clX = cl.array.Array(queue, X.shape, X.dtype)
clY = cl.array.Array(queue, Y.shape, Y.dtype)
Xs = [X[n*i:n*i+n] for i in range(k)]
Ys = [Y[n*i:n*i+n] for i in range(k)]
clXs = [cl.array.Array(queue, x.shape, x.dtype) for x in Xs]
clYs = [cl.array.Array(queue, y.shape, y.dtype) for y in Ys]
source = """
__kernel void axpy(
    long n,
    __global const float *X,
    __global float *Y
)
{
    int i = get_global_id(0);
    if (i >= n)
        return;
    float result = X[i];
    for (int j = 0; j < %(nsin)s; j++)
        result = sin(result);
    Y[i] = result;
}
""" % dict(nsin=nsin)
kernel = cl.Program(ctx, source).build().axpy
t0 = time.time()
Y0 = np.array(X)
for i in range(nsin):
    np.sin(Y0, out=Y0)
t0 = time.time() - t0
print("Numpy in %0.3f" % t0)
################################################################################
clX.set(X)
clY.set(Y)
gsize = (N,)
lsize = None
t1 = time.time()
kernel(queue, gsize, lsize, np.int64(N), clX.data, clY.data)  # int64 matches the kernel's "long n"
queue.finish()
t1 = time.time() - t1
print("Combined kernel in %0.3f" % t1)
################################################################################
for x, y, clx, cly in zip(Xs, Ys, clXs, clYs):
    clx.set(x)
    cly.set(y)
gsize = (n,)
lsize = None
t2 = time.time()
for x, y in zip(clXs, clYs):
    kernel(queue, gsize, lsize, np.int64(n), x.data, y.data)
queue.finish()
t2 = time.time() - t2
print("Serial kernels in %0.3f" % t2)
################################################################################
for x, y, clx, cly in zip(Xs, Ys, clXs, clYs):
    clx.set(x)
    cly.set(y)
gsize = (n,)
lsize = None
queues = [cl.CommandQueue(ctx) for _ in range(k)]
t3 = time.time()
for i, (x, y) in enumerate(zip(clXs, clYs)):
    q = queues[i % len(queues)]
    kernel(q, gsize, lsize, np.int64(n), x.data, y.data)
for q in queues:
    q.flush()
for q in queues:
    q.finish()
t3 = time.time() - t3
print("Parallel kernels in %0.3f" % t3)
My output for the above program is:
Numpy in 0.301
Combined kernel in 0.004
Serial kernels in 0.037
Parallel kernels in 0.038
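One way to check whether the kernels actually overlap, rather than comparing
wall-clock totals, is to enable profiling on each queue and compare the
per-event timestamps. The sketch below is not part of my test program and makes
some assumptions: it assumes the driver reports CL_PROFILING_COMMAND_START/END
in nanoseconds, uses a hypothetical `burn` kernel and `total_pairwise_overlap`
helper of my own, and skips the device run entirely if no OpenCL platform is
available.

```python
# Sketch: measure actual kernel overlap via OpenCL profiling events.
# The helper and the "burn" kernel are hypothetical, for illustration only.

def total_pairwise_overlap(intervals):
    """Total time (input units) during which at least two intervals overlap.

    Sweep-line over (timestamp, +1/-1) boundary events; whenever two or
    more intervals are simultaneously active, the elapsed time counts.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal timestamps, -1 sorts first: touching != overlapping
    overlap = 0
    active = 0
    prev = None
    for t, delta in events:
        if active >= 2 and prev is not None:
            overlap += t - prev
        active += delta
        prev = t
    return overlap

if __name__ == "__main__":
    try:
        import pyopencl as cl

        ctx = cl.create_some_context()
        props = cl.command_queue_properties.PROFILING_ENABLE
        queues = [cl.CommandQueue(ctx, properties=props) for _ in range(4)]
        prog = cl.Program(ctx, """
            __kernel void burn(__global float *x) {
                float r = x[get_global_id(0)];
                for (int j = 0; j < 5000; j++)
                    r = sin(r);
                x[get_global_id(0)] = r;
            }
        """).build()
        bufs = [cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 1000)
                for _ in range(4)]
        # One small kernel per queue, then flush all before waiting.
        evts = [prog.burn(q, (1000,), None, b) for q, b in zip(queues, bufs)]
        for q in queues:
            q.flush()
        for q in queues:
            q.finish()
        spans = [(e.profile.start, e.profile.end) for e in evts]
        print("overlap (ns):", total_pairwise_overlap(spans))
    except Exception as exc:  # no pyopencl / no device: skip the GPU part
        print("device run skipped:", exc)
```

If the reported overlap is near zero even though the events cover similar
wall-clock windows, the driver is serializing the kernels no matter how many
queues are used.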
On 17 August 2015 at 14:07, Andreas Kloeckner <[email protected]>
wrote:
> Am 2015-08-17 12:57, schrieb Eric Hunsberger:
>
>> Does anyone know if concurrent kernels work on (newer) NVIDIA devices
>> in OpenCL? If so, can anyone provide some PyOpenCL code that runs a
>> minimal working example? As well as perhaps the driver version you're
>> using?
>>
>> For context, "concurrent kernels" just means multiple kernels running
>> at the same time. For example, if I have a bunch of kernels, each of
>> which only takes up 32 work groups, and my device has a max work group
>> size of 1024, then I should ideally be able to run 32 such kernels at
>> the same time (in parallel). From what I've read, earlier NVIDIA GPUs
>> didn't support this; they added support for up to 16 concurrent
>> kernels with the Fermi architecture.
>>
>> There was a lot of discussion about concurrent kernels four years ago
>> or so, based on the threads I've found, and at the time it wasn't
>> clear if NVIDIA's OpenCL drivers supported this or not. I still can't
>> find a conclusive answer as to whether it should work, and I can't get
>> it working in my own code. I've seen several places that multiple
>> queues are needed to do this, and even heard that it's necessary to
>> flush all the queues, but I still can't get anything to work. NVIDIA
>> devices can do this using CUDA:
>> http://wiki.tiker.net/PyCuda/Examples/KernelConcurrency [1].
>>
>
> What version of the nvidia driver are you using to try this? How do you judge
> whether what you are trying is working or not? Can you share some code that
> people can try on their own machines?
>
> My naive perception is that you should just create multiple queues and
> submit kernels to them, and things should just work. What happens if you
> try and do that?
>
> Andreas
>
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl