I'm running on a GeForce GTX 780, with driver version 319.60.

I've put my test code below. I get essentially the same runtime whether I use
a single queue or multiple queues, and both are much slower than doing the
whole computation with a single kernel call.

import time

import numpy as np
import pyopencl as cl
import pyopencl.array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

k = 16
n = 1000
N = k * n

rng = np.random.RandomState()

nsin = 5000
X = rng.uniform(size=N).astype(np.float32)
Y = rng.uniform(size=N).astype(np.float32)

clX = cl.array.Array(queue, X.shape, X.dtype)
clY = cl.array.Array(queue, Y.shape, Y.dtype)

Xs = [X[n*i:n*i+n] for i in range(k)]
Ys = [Y[n*i:n*i+n] for i in range(k)]
clXs = [cl.array.Array(queue, x.shape, x.dtype) for x in Xs]
clYs = [cl.array.Array(queue, y.shape, y.dtype) for y in Ys]

source = """
__kernel void axpy(
    long n,
    __global const float *X,
    __global float *Y
    )
{
    int i = get_global_id(0);
    if (i >= n)
        return;

    float result = X[i];
    for (int j = 0; j < %(nsin)s; j++)  /* don't shadow the global id `i` */
        result = sin(result);

    Y[i] = result;
}
""" % dict(nsin=nsin)

kernel = cl.Program(ctx, source).build().axpy

t0 = time.time()
Y0 = np.array(X)
for i in range(nsin):
    np.sin(Y0, out=Y0)
t0 = time.time() - t0
print("Numpy in %0.3f" % t0)

################################################################################
clX.set(X)
clY.set(Y)

gsize = (N,)
lsize = None

t1 = time.time()
kernel(queue, gsize, lsize, np.int64(N), clX.data, clY.data)
queue.finish()
t1 = time.time() - t1

print("Combined kernel in %0.3f" % t1)

################################################################################
for x, y, clx, cly in zip(Xs, Ys, clXs, clYs):
    clx.set(x)
    cly.set(y)

gsize = (n,)
lsize = None

t2 = time.time()
for x, y in zip(clXs, clYs):
    kernel(queue, gsize, lsize, np.int64(n), x.data, y.data)
queue.finish()
t2 = time.time() - t2

print("Serial kernels in %0.3f" % t2)

################################################################################
for x, y, clx, cly in zip(Xs, Ys, clXs, clYs):
    clx.set(x)
    cly.set(y)

gsize = (n,)
lsize = None

queues = [cl.CommandQueue(ctx) for _ in range(k)]

t3 = time.time()
for i, (x, y) in enumerate(zip(clXs, clYs)):
    q = queues[i % len(queues)]
    kernel(q, gsize, lsize, np.int64(n), x.data, y.data)
for q in queues:
    q.flush()
for q in queues:
    q.finish()
t3 = time.time() - t3

print("Parallel kernels in %0.3f" % t3)



My output for the above program is:

Numpy in 0.301
Combined kernel in 0.004
Serial kernels in 0.037
Parallel kernels in 0.038
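One way to judge overlap more directly than wall-clock timing is to create the
queues with properties=cl.command_queue_properties.PROFILING_ENABLE, keep the
event returned by each kernel call, and compare the device-side timestamps
(event.profile.start / event.profile.end, in nanoseconds). The timestamp check
itself is plain Python; a minimal sketch (the helper name any_overlap is mine,
not a PyOpenCL API):

```python
def any_overlap(intervals):
    """Return True if any two (start_ns, end_ns) intervals overlap.

    In PyOpenCL these pairs would come from kernel events as
    (evt.profile.start, evt.profile.end), on queues created with
    properties=cl.command_queue_properties.PROFILING_ENABLE.
    """
    latest_end = None
    # Sort by start time; an overlap exists iff some kernel starts
    # before a previously started kernel has finished.
    for start, end in sorted(intervals):
        if latest_end is not None and start < latest_end:
            return True
        latest_end = end if latest_end is None else max(latest_end, end)
    return False

# Back-to-back (serialized) executions: no overlap.
print(any_overlap([(0, 10), (10, 20), (20, 30)]))  # False
# Two kernels in flight at once: overlap.
print(any_overlap([(0, 10), (5, 15)]))             # True
```

If the "parallel" run shows no overlapping intervals, the kernels were
serialized by the driver regardless of how many queues were used.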


On 17 August 2015 at 14:07, Andreas Kloeckner <[email protected]>
wrote:

> Am 2015-08-17 12:57, schrieb Eric Hunsberger:
>
>> Does anyone know if concurrent kernels work on (newer) NVIDIA devices
>> in OpenCL? If so, can anyone provide some PyOpenCL code that runs a
>> minimal working example? As well as perhaps the driver version you're
>> using?
>>
>> For context, "concurrent kernels" just means multiple kernels running
>> at the same time. For example, if I have a bunch of kernels, each of
>> which only takes up 32 work groups, and my device can keep many more
>> work groups resident at once, then I should ideally be able to run
>> several such kernels at the same time (in parallel). From what I've
>> read, earlier NVIDIA GPUs didn't support this; support for up to 16
>> concurrent kernels was added with the Fermi architecture.
>>
>> There was a lot of discussion about concurrent kernels four years ago
>> or so, based on the threads I've found, and at the time it wasn't
>> clear if NVIDIA's OpenCL drivers supported this or not. I still can't
>> find a conclusive answer as to whether it should work, and I can't get
>> it working in my own code. I've seen several places that multiple
>> queues are needed to do this, and even heard that it's necessary to
>> flush all the queues, but I still can't get anything to work. NVIDIA
>> devices can do this using CUDA:
>> http://wiki.tiker.net/PyCuda/Examples/KernelConcurrency [1].
>>
>
> What version of the nvidia driver are you using to try this? How do you
> judge whether what you are trying is working or not? Can you share some
> code that people can try on their own machines?
> whether what you are trying is working or not? Can you share some code that
> people can try on their own machines?
>
> My naive perception is that you should just create multiple queues and
> submit kernels to them, and things should just work. What happens if you
> try and do that?
>
> Andreas
>
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
