Hi,

I’ve been using pyopencl for a while for various simulation/data processing 
tasks.  I recently upgraded to a new computer and noticed that things were 
considerably slower.

After some experimentation, I tracked this down to the version of pyopencl I 
was using.  The updated version (2015.2.4, the most recent on PyPI) takes 
significantly longer to enqueue a kernel call (~1.5 ms) than the old version 
(2015.1, ~0.03 ms).  Both timings come from the same machine*.  Profiling 
indicates that the newer version makes many function calls that the old 
version did not.  FYI, the code I used to test this is below (adapted from 
the documentation).
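When comparing the two versions, the profiler's own bookkeeping can inflate the timing it reports. A minimal sketch, using only the standard library, that times the calls in one unprofiled loop and counts Python-level function calls in a separate profiled loop; the helper name `per_call_overhead` is hypothetical and not part of pyopencl:

```python
import cProfile
import pstats
import time


def per_call_overhead(fn, n=100):
    """Return (mean wall time per call in ms, profiled function calls per call).

    Timing and profiling use separate loops so that cProfile's overhead
    does not distort the timing figure.
    """
    # Unprofiled loop: plain wall-clock timing.
    start = time.perf_counter()
    for _ in range(n):
        fn()
    ms_per_call = 1e3 * (time.perf_counter() - start) / n

    # Profiled loop: count every function call cProfile records.
    prof = cProfile.Profile()
    prof.enable()
    for _ in range(n):
        fn()
    prof.disable()
    calls_per_call = pstats.Stats(prof).total_calls / n
    return ms_per_call, calls_per_call


if __name__ == "__main__":
    # Stand-in workload; substitute the kernel call being measured.
    ms, calls = per_call_overhead(lambda: sorted(range(100)))
    print("%.4f ms per call, %.1f function calls per call" % (ms, calls))
```

Running something like `per_call_overhead(lambda: prg.sum(queue, a_np.shape, None, a_g, b_g, res_g))` under each installed pyopencl version would give directly comparable numbers for both the per-call time and the number of function calls behind each enqueue.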

For my purposes, this is slightly alarming: my code makes lots of kernel calls, 
so for small data sets, where this per-call overhead dominates, the new version 
is about 50x slower (1.5 ms vs. 0.03 ms per call)!
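As a rough back-of-the-envelope check of the 50x figure, the sketch below estimates how the slowdown shrinks as each kernel's own run time grows. It rests on an assumption not measured here: enqueueing is asynchronous, so host-side enqueue overhead and device execution overlap, and each call costs roughly whichever of the two is larger. The function name `estimated_slowdown` is hypothetical.

```python
def estimated_slowdown(new_overhead_ms, old_overhead_ms, kernel_ms):
    """Ratio of per-call cost, new pyopencl version vs. old, under a rough model.

    Assumption: host enqueue overhead and device execution overlap, so each
    call costs about max(enqueue overhead, kernel run time).
    """
    new_cost = max(new_overhead_ms, kernel_ms)
    old_cost = max(old_overhead_ms, kernel_ms)
    return new_cost / old_cost


if __name__ == "__main__":
    # Overheads of 1.5 ms (2015.2.4) and 0.03 ms (2015.1) per call.
    for kernel_ms in (0.01, 0.1, 1.0, 2.0):
        ratio = estimated_slowdown(1.5, 0.03, kernel_ms)
        print("kernel %.2f ms -> %.1fx slower" % (kernel_ms, ratio))
```

Under this model the ratio is the full 50x whenever a kernel runs for less than 0.03 ms, falls as 1.5 ms divided by the kernel time in between, and reaches 1x once a kernel takes 1.5 ms or longer, so only short kernels see the full penalty.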

Is this something that has been, or will be, fixed in newer versions of 
pyopencl?  Is there a workaround?  Of course, for the time being I can use the 
old version, but I’d rather not be stuck with it.

If needed, I can provide the profiler output.

Thanks,
Dustin Kleckner

*OS X with Python 3.5 installed via Anaconda, pyopencl installed via pip.  The 
code was tested on both GPU and CPU devices, with similar results.

Test Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import time
import cProfile

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)

# ctx = cl.create_some_context()
device = cl.get_platforms()[0].get_devices(cl.device_type.CPU)[0]
ctx = cl.Context([device])
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
__kernel void sum(__global const float *a_g, __global const float *b_g, 
__global float *res_g) {
int gid = get_global_id(0);
res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

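# cProfile.run executes this string in __main__'s namespace, so `el` is
# available to the print statements below.
cProfile.run('''
start = time.time()
for n in range(100):
    prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)
el = time.time() - start
''')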

print('cl version:', cl.VERSION)
# el is the total time in seconds for 100 enqueues; report ms per call.
print('kernel start time: %.3f ms per call' % (1E3*el/100))

queue.finish()  # wait for the queued kernels so they are not counted as copy time
start = time.time()
res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)
el = time.time() - start

print('copy time: %.3f ms' % (1E3*el))
_______________________________________________
PyOpenCL mailing list
https://lists.tiker.net/listinfo/pyopencl
