I am seeing some very odd behavior from either pyopencl or OpenCL. I have produced a boiled-down example and posted it below. Basically, the test kernel below takes some large arrays as arguments and initializes one of them with 1's. It also runs a for-loop that initializes some local data structures with 1's and does nothing else. In theory this for-loop could be omitted, since it does nothing relevant to the output, but... The strange thing is that when the for-loop has many iterations (e.g. 1000), things go horribly wrong: the screen flickers, and the output array is not initialized with 1's. If the loop has only a few iterations (e.g. 10), everything works fine. Also, if the variable "rows" in the outer Python code is lowered to e.g. 144 instead of 3344, it also works fine, even with 1000 loop iterations.
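(For concreteness: if the kernel ran correctly, every one of the rows * rows output elements would be 1, so the nonzero count printed at the end of the script should be 3344 * 3344 = 11182336. A quick numpy-only sketch of that expectation, independent of any OpenCL device:)

```python
import numpy

rows = 3344
# What the kernel is supposed to produce: an output array of all 1's.
expected = numpy.ones(rows * rows, numpy.int32)
# The nonzero count printed at the end of the script should then be:
print(len(numpy.nonzero(expected)[0]))  # 11182336
```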
Can anyone explain what is going on here?! (The code takes around 14 seconds to complete on my laptop.)

I'm on macOS 10.6.3 with pyopencl-0.91.4 and have just installed gpucomputingsdk_2.3a_macos_32 from http://developer.nvidia.com/object/opencl-download.html. However, I'm not sure how to tell whether pyopencl really uses this specific SDK. My machine is a MacBook Pro, and get_devices(cl.device_type.GPU) returns:

[<pyopencl.Device 'GeForce 9400M' at 0x2022600>,
 <pyopencl.Device 'GeForce 9600M GT' at 0x1022600>]

Here is the example code:

=============================================
import sys
import struct
import pyopencl as cl
import numpy

block_size = 16
matrixLength = 3101104
rows = 3344

row2width = numpy.zeros(rows, numpy.int32)
row2startIdx = numpy.zeros(rows, numpy.int32)
matrix = numpy.zeros(matrixLength, numpy.int32)

pl = cl.get_platforms()
devs = pl[0].get_devices(cl.device_type.GPU)

if block_size > devs[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE):
    print "Error: block_size is larger than MAX_WORK_GROUP_SIZE..."
    exit(1)

ctx = cl.Context(devs)
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
// Thread block size
#define BLOCK_SIZE 16

__kernel void matrixMul(__global int* C, int CSize, __global int* A,
                        __global int* rowWidths, __global int* rowStartIdxs)
{
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);

    int rowAIdx = bi * BLOCK_SIZE + ti;
    int rowBIdx = bj * BLOCK_SIZE + tj;

    int cOut = 1;
    for (int x = 0; x < 1000; x++) {
        __local int As[BLOCK_SIZE][BLOCK_SIZE];
        __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ti][tj] = 1;
        Bs[ti][tj] = 1;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[rowBIdx * CSize + rowAIdx] = cOut;
}
"""

prg = cl.Program(ctx, src).build()

matrix_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                       hostbuf=numpy.array(matrix).astype(numpy.int32))
row2width_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                          hostbuf=numpy.array(row2width).astype(numpy.int32))
row2startIdx_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                             hostbuf=numpy.array(row2startIdx).astype(numpy.int32))

o = numpy.zeros(rows * rows).astype(numpy.int32)
o_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=o)

w_o_buf = struct.pack("i", rows)

prg.matrixMul(queue, [rows, rows], o_buf, w_o_buf, matrix_buf,
              row2width_buf, row2startIdx_buf,
              local_size=(block_size, block_size))
cl.enqueue_read_buffer(queue, o_buf, o).wait()

i = numpy.nonzero(o)
print len(i[0])
=============================================

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
