Now your code makes more sense... however:

1. It doesn't scale. You're limited to input data that is at most as large (in bytes) as min(local memory size, 4 * max work group size). These two parameters vary wildly between devices, so it's poorly portable.

2. It's buggy, because it doesn't consider that adding a carry can cause an overflow itself. Try summing [2^32 - 1, 2^32 - 1, 0] + [1, 0, 0]: the expected result is [0, 0, 1], but you'll get [0, 0, 0].
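To see the dropped carry concretely, here is the kernel's single-pass logic re-enacted on the host with numpy (just an illustration, not your actual code; uint64 limbs are used so the intermediate sums don't wrap before we can inspect them):

import numpy as np

x = np.array([2**32 - 1, 2**32 - 1, 0], dtype=np.uint64)
y = np.array([1, 0, 0], dtype=np.uint64)

z = (x + y) % 2**32    # per-limb sum, wrapping like a uint
c = (x + y) // 2**32   # per-limb carry-out

single = z.copy()
single[1:] = (single[1:] + c[:-1]) % 2**32  # one carry pass, as in the kernel
print(single)          # [0 0 0]: wrong, the wrap of limb 1 is silently lost

while c.any():         # repeating the pass until no carries remain fixes it
    cin = np.roll(c, 1)
    cin[0] = 0         # shift each carry up one limb
    c = (z + cin) // 2**32
    z = (z + cin) % 2**32
print(z)               # [0 0 1]: correct

In the kernel, that loop would be either a barrier loop inside one work group or, as suggested further down the thread, a second carry pass enqueued from the host.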
As for performance, I would look into using uint4s instead of uints. Since you'd have a quarter of the barriers that way, I wouldn't be surprised to see a considerable performance boost; there's a rough sketch at the bottom of this message, below the quoted thread.

On 23 Sep 2015 09:46, "Blair Azzopardi" <[email protected]> wrote:

> Actually, this isn't the best way as it's being forced to run
> sequentially. My original aim was to do the initial add, then place the
> carry array back into a queue to be added on a second pass. I'll see if
> that's possible and revert. Thanks again.
>
> On Wed, Sep 23, 2015 at 9:14 AM, Blair Azzopardi <[email protected]>
> wrote:
>
>> That's very helpful. Thanks.
>>
>> You're right that I simplified my code for the sake of brevity. I am
>> trying to write a simple multiple-precision adder, and I'm using local
>> memory to hold the carry. Since the carry is actually the same size as
>> the inputs, and hence the global size, I've now set the local size equal
>> to the global size (I hadn't really considered this until I read your
>> response). This appears to fix the problem, apparently with and without
>> the barrier.
>>
>> My only challenge now is improving performance. The complete kernel now
>> looks like:
>>
>> #define MAXINT 4294967295U
>> __kernel void sum_vector(__global const uint *x_g, __global const uint *y_g,
>>                          __global uint *z_g, __local uint *c_l) {
>>
>>     int i = get_local_id(0);
>>     uint z;
>>     uint c;
>>
>>     if (x_g[i] > MAXINT - y_g[i]) { // overflow
>>         z = x_g[i] - (MAXINT - y_g[i]) - 1;
>>         c = 1;
>>     }
>>     else {
>>         z = x_g[i] + y_g[i];
>>         c = 0;
>>     }
>>
>>     c_l[i] = c;
>>     barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local mem
>>
>>     if (i > 0) {
>>         if (c_l[i-1] > 0) {
>>             z += c_l[i-1];
>>         }
>>     }
>>
>>     z_g[i] = z;
>> }
>>
>> Called using the following:
>>
>> a = np.array([...], dtype=np.uint32)
>> b = np.array([...], dtype=np.uint32)
>> c = np.zeros(max(len(a), len(b)) + 1, dtype=np.uint32)
>> a.resize(c.shape)
>> b.resize(c.shape)
>>
>> mf = cl.mem_flags
>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>> b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
>> c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
>>
>> sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g,
>>            cl.LocalMemory(4*len(c)))
>> cl.enqueue_copy(queue, c, c_g)
>>
>> Any suggestions here? Should I try to split the kernel code to allow it
>> to divide the task across work groups? Perhaps my goal is a little
>> futile?
>>
>> On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <[email protected]>
>> wrote:
>>
>>> There are multiple problems with your code.
>>>
>>> First, I assume that your actual kernel is more complicated than that
>>> and that you stripped code away; otherwise, using local memory in this
>>> case is completely useless.
>>>
>>> Second, if you want to use local memory, you have to put
>>> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and
>>> the one that reads it, as they run in different threads, so there's no
>>> guarantee which will arrive first.
>>>
>>> Third, if you have more than one work group, in other words if
>>> get_global_id() != get_local_id(), then you're going to access memory
>>> that simply does not exist.
>>>
>>> Without local memory:
>>>
>>> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>>>     int i = get_global_id(0);
>>>     if (i > 0)
>>>         b_g[i] = a_g[i-1];
>>>     else
>>>         b_g[i] = 0;
>>> }
>>>
>>> With local memory; only useful if your code is actually more complex
>>> than this - that is, if it needs to access the same cell of a_g
>>> multiple times, or write to the same cell of b_g multiple times:
>>>
>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>                            __local uint *c_l) {
>>>     int gid = get_global_id(0);
>>>     int lid = get_local_id(0);
>>>
>>>     // c_l must be the same size as the work group
>>>     c_l[lid] = a_g[gid];
>>>
>>>     // wait for all threads of a work group to reach this point
>>>     barrier(CLK_LOCAL_MEM_FENCE);
>>>
>>>     if (lid > 0)
>>>         b_g[gid] = c_l[lid-1];
>>>     else if (gid > 0)
>>>         // can't use local memory here! Your work group size had better
>>>         // be much larger than the warp/wavefront size (32 on NVidia,
>>>         // 64 on AMD), or the whole benefit of local memory will vanish!
>>>         b_g[gid] = a_g[gid - 1];
>>>     else
>>>         b_g[gid] = 0;
>>> }
>>>
>>> HTH
>>>
>>> On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> I am trying to track down a particular issue where a local memory
>>>> parameter appears to vanish the first time I run a kernel, but on
>>>> every subsequent invocation works correctly. This is on my Nvidia
>>>> GeForce GT 555M.
>>>>
>>>> While trying to trace the above issue and switching to my Intel CPU,
>>>> I came across even more unusual behaviour.
>>>>
>>>> With the following kernel:
>>>>
>>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>>                            __local uint *c_l) {
>>>>     int i = get_global_id(0);
>>>>     c_l[i] = a_g[i];
>>>>     b_g[i] = 0;
>>>>     if (i > 0)
>>>>         b_g[i] = c_l[i-1];
>>>> }
>>>>
>>>> I invoke the following code:
>>>>
>>>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>>>> b = np.zeros(10, dtype=np.uint32)
>>>>
>>>> mf = cl.mem_flags
>>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>>>
>>>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>>>> cl.enqueue_copy(queue, b, b_g)
>>>> print(b)
>>>>
>>>> Then the output for each platform is:
>>>>
>>>> [0 1 2 3 4 5 6 7 8 9]  # <pyopencl.Platform 'NVIDIA CUDA' >
>>>>
>>>> [0 1 2 0 4 0 6 242802248 0 361062976]  # <pyopencl.Platform 'Intel(R) OpenCL' >
>>>>
>>>> What's more, each output can start with a different array, settling
>>>> down to the above after the first invocation.
>>>>
>>>> I have my suspicions it's something to do with how the local memory
>>>> is initialised.
>>>>
>>>> Can anyone see if I am doing something wrong?
>>>>
>>>> Thanks
>>>> Blair
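Finally, since I suggested uint4s above, here is a rough, untested sketch of what I had in mind. The name sum_vector4 and the argument layout are placeholders; it assumes the limb count is padded to a multiple of four, and it keeps the single-pass carry scheme, so the carry-of-carry problem from point 2 still applies:

__kernel void sum_vector4(__global const uint4 *x_g, __global const uint4 *y_g,
                          __global uint4 *z_g, __local uint *c_l) {
    int i = get_local_id(0);

    uint4 x = x_g[i];
    uint4 y = y_g[i];
    uint4 z = x + y;  // wraps modulo 2^32 per component

    // component-wise carry-out: an unsigned sum wrapped iff the result is < x
    uint4 co = (uint4)(z.x < x.x, z.y < x.y, z.z < x.z, z.w < x.w);

    // ripple the carries through the four limbs owned by this work-item;
    // adding a carry of at most 1 can generate at most one further carry
    z.y += co.x; co.y += (z.y < co.x) ? 1u : 0u;
    z.z += co.y; co.z += (z.z < co.y) ? 1u : 0u;
    z.w += co.z; co.w += (z.w < co.z) ? 1u : 0u;

    c_l[i] = co.w;                 // one carry per four limbs...
    barrier(CLK_LOCAL_MEM_FENCE);  // ...so one barrier per four limbs

    if (i > 0) {
        // ripple the neighbouring work-item's carry through our limbs
        uint cin = c_l[i - 1];
        z.x += cin; cin = (z.x < cin) ? 1u : 0u;
        z.y += cin; cin = (z.y < cin) ? 1u : 0u;
        z.z += cin; cin = (z.z < cin) ? 1u : 0u;
        z.w += cin;  // as in the original, a carry out of here is still lost
    }

    z_g[i] = z;
}

On the host side, the global and local sizes shrink to a quarter (one work-item per four limbs), and the local buffer only needs one uint per work-item.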
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
