I'm seeing some very odd behavior from either pyopencl or OpenCL. I have
boiled it down to the example posted below. The test kernel takes some large
arrays as arguments and fills one of them with 1's. It also runs a for-loop
that fills some local data structures with 1's and does nothing else. In
theory this for-loop could be omitted, since it contributes nothing to the
output, but it is where the trouble lies. The strange thing is that when the
for-loop has many iterations (e.g. 1000), things go horribly wrong: the
screen flickers, and the output array is not filled with 1's. If the loop has
only a few iterations (e.g. 10), everything works fine.
Also, if the variable "rows" in the outer Python code is lowered from 3344 to
e.g. 144, everything works fine, even with 1000 loop iterations.

Can anyone explain what is going on here?!
(The code takes around 14 seconds to complete on my laptop)

I use macOS 10.6.3 and pyopencl-0.91.4, and have just installed
gpucomputingsdk_2.3a_macos_32 from
http://developer.nvidia.com/object/opencl-download.html. However, I'm not sure
how to tell whether pyopencl really uses this specific SDK.
My machine is a MacBook Pro, and get_devices(cl.device_type.GPU) returns
   [<pyopencl.Device 'GeForce 9400M' at 0x2022600>,
    <pyopencl.Device 'GeForce 9600M GT' at 0x1022600>]

Here is the example code: =============================================
import sys
import struct
import pyopencl as cl
import numpy

block_size = 16
matrixLength = 3101104
rows = 3344

row2width = numpy.zeros(rows, numpy.int32)
row2startIdx = numpy.zeros(rows, numpy.int32)
matrix = numpy.zeros(matrixLength, numpy.int32)

pl = cl.get_platforms()
devs = pl[0].get_devices(cl.device_type.GPU)
# The work group is block_size x block_size threads, so the total group
# size must be compared against the device limit.
if block_size * block_size > devs[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE):
    print "Error: work group size is larger than MAX_WORK_GROUP_SIZE..."
    exit(1)
ctx = cl.Context(devs)
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
// Thread block size
#define BLOCK_SIZE 16
  
__kernel void matrixMul(__global int* C, int CSize, __global int* A,
                        __global int* rowWidths, __global int* rowStartIdxs)
{
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);

    int rowAIdx =  bi * BLOCK_SIZE + ti;
    int rowBIdx =  bj * BLOCK_SIZE + tj;

    int cOut = 1;
    for(int x=0; x<1000; x++) {
      __local int As[BLOCK_SIZE][BLOCK_SIZE];
      __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
      As[ti][tj] = 1;
      Bs[ti][tj] = 1;
      barrier(CLK_LOCAL_MEM_FENCE);
    }
    
    C[rowBIdx * CSize + rowAIdx] = cOut;
}
""";
prg = cl.Program(ctx, src).build();
matrix_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=matrix)
row2width_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=row2width)
row2startIdx_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                             hostbuf=row2startIdx)
o = numpy.zeros(rows * rows, numpy.int32)
o_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=o)
w_o_buf = struct.pack("i", rows)  # scalar kernel argument CSize, passed as packed bytes

prg.matrixMul(queue, [rows, rows], o_buf, w_o_buf, matrix_buf, row2width_buf,
              row2startIdx_buf, local_size=(block_size, block_size))
cl.enqueue_read_buffer(queue, o_buf, o).wait()

i = numpy.nonzero(o)
print len(i[0])
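For reference, when the kernel works correctly every element of the output array is 1, so the nonzero count printed at the end should equal rows * rows. A numpy-only sanity check of that expectation (no OpenCL involved):

```python
import numpy

rows = 3344
# Simulate a fully initialized output array, as the kernel should produce.
o = numpy.ones(rows * rows, numpy.int32)
nonzero_count = len(numpy.nonzero(o)[0])
print(nonzero_count)  # rows * rows == 11182336
```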



_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
