Actually, this isn't the best approach, as it forces the addition to run sequentially. My original aim was to do the initial add, then place the carry array back into a queue to be added on a second pass. I'll see if that's possible and revert. Thanks again.
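[Editor's note: the two-pass idea above can be sketched on the host side with NumPy. This is a minimal illustration of the carry-propagation scheme being described, not the kernel itself; the function name and the use of 64-bit intermediates are my own choices.]

```python
import numpy as np

def mp_add(a, b):
    """Multiple-precision add of two uint32 limb arrays (least-significant
    limb first), illustrating the two-pass carry approach: add once, then
    repeatedly shift the carry array up one limb and add it back in."""
    n = max(len(a), len(b)) + 1          # one extra limb for the final carry
    x = np.zeros(n, dtype=np.uint64)
    y = np.zeros(n, dtype=np.uint64)
    x[:len(a)] = a
    y[:len(b)] = b

    # Pass 1: elementwise add; keep sums modulo 2**32 and a carry array.
    s = x + y
    z = s & 0xFFFFFFFF
    carry = s >> 32

    # Pass 2 (repeated until settled): carry[i] feeds limb i+1.
    while carry.any():
        s = z + np.roll(carry, 1)        # top carry is always 0 here
        z = s & 0xFFFFFFFF
        carry = s >> 32

    return z.astype(np.uint32)
```

With the extra output limb, (2**64 - 1) + 1 carries all the way up: `mp_add` of `[0xFFFFFFFF, 0xFFFFFFFF]` and `[1]` yields `[0, 0, 1]`.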
On Wed, Sep 23, 2015 at 9:14 AM, Blair Azzopardi <[email protected]> wrote:
> That's very helpful. Thanks.
>
> You're right that I simplified my code for the sake of brevity. I am trying
> to write a simple multiple-precision adder and I'm using local memory to
> hold the carry. Since the carry is actually the same size as the inputs,
> and hence the global size, I've now set the local size to the global size
> (I hadn't really considered this until I read your response). This appears
> to fix the problem, with or without the barrier.
>
> My only challenge now is improving performance. The complete kernel now
> looks like:
>
> #define MAXINT 4294967295U
> __kernel void sum_vector(__global const uint *x_g, __global const uint *y_g,
>                          __global uint *z_g, __local uint *c_l) {
>
>     int i = get_local_id(0);
>     uint z;
>     uint c;
>
>     if (x_g[i] > MAXINT - y_g[i]) { // overflow
>         z = x_g[i] - (MAXINT - y_g[i]) - 1;
>         c = 1;
>     }
>     else {
>         z = x_g[i] + y_g[i];
>         c = 0;
>     }
>
>     c_l[i] = c;
>     barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local mem
>
>     if (i > 0) {
>         if (c_l[i-1] > 0) {
>             z += c_l[i-1];
>         }
>     }
>
>     z_g[i] = z;
> }
>
> Called using the following:
>
> a = np.array([...], dtype=np.uint32)
> b = np.array([...], dtype=np.uint32)
> c = np.zeros(max(len(a), len(b)) + 1, dtype=np.uint32)
> a.resize(c.shape)
> b.resize(c.shape)
>
> mf = cl.mem_flags
> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
> b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
> c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
>
> sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g,
>            cl.LocalMemory(4*len(c)))
> cl.enqueue_copy(queue, c, c_g)
>
> Any suggestions here? Should I try to split the kernel code to allow it
> to divide the task across work groups? Perhaps my goal is a little futile?
>
>
> On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <[email protected]> wrote:
>
>> There are multiple problems with your code.
>> First, I assume that your actual kernel is more complicated than this and
>> that you stripped code away - otherwise, using local memory in this case
>> is completely useless.
>>
>> Second, if you want to use local memory, you have to put
>> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and
>> the one that reads it, since they run in different threads and there's no
>> guarantee which will arrive first.
>>
>> Third, if you have more than one workgroup - in other words, if
>> get_global_id() != get_local_id() - then you're going to access memory
>> that simply does not exist.
>>
>> Without local memory:
>>
>> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>>     int i = get_global_id(0);
>>     if (i > 0)
>>         b_g[i] = a_g[i-1];
>>     else
>>         b_g[i] = 0;
>> }
>>
>> With local memory; only useful if your code is actually more complex than
>> this - that is, if it accesses the same cell of a_g multiple times, or
>> writes to the same cell of b_g multiple times:
>>
>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>                            __local uint *c_l) {
>>     int gid = get_global_id(0);
>>     int lid = get_local_id(0);
>>
>>     // c_l must be the same size as the workgroup
>>     c_l[lid] = a_g[gid];
>>
>>     // wait for all threads of a workgroup to reach this point
>>     barrier(CLK_LOCAL_MEM_FENCE);
>>
>>     if (lid > 0)
>>         b_g[gid] = c_l[lid-1];
>>     else if (gid > 0)
>>         // can't use local memory here! Your work group size had better be
>>         // much larger than the warp/wavefront size (32 on NVIDIA, 64 on
>>         // AMD), or the whole benefit of local memory will vanish!
>>         b_g[gid] = a_g[gid - 1];
>>     else
>>         b_g[gid] = 0;
>> }
>>
>> HTH
>>
>> On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I am trying to track down a particular issue where a local memory
>>> parameter appears to vanish the first time I run a kernel but works
>>> correctly on every subsequent invocation. This is on my NVIDIA GeForce
>>> GT 555M.
>>> While trying to trace the above issue, I switched to my Intel CPU and
>>> came across even more unusual behaviour.
>>>
>>> With the following kernel:
>>>
>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>                            __local uint *c_l) {
>>>     int i = get_global_id(0);
>>>     c_l[i] = a_g[i];
>>>     b_g[i] = 0;
>>>     if (i > 0)
>>>         b_g[i] = c_l[i-1];
>>> }
>>>
>>> I invoke the following code:
>>>
>>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>>> b = np.zeros(10, dtype=np.uint32)
>>>
>>> mf = cl.mem_flags
>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>>
>>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>>> cl.enqueue_copy(queue, b, b_g)
>>> print(b)
>>>
>>> Then the output for each platform is:
>>>
>>> [0 1 2 3 4 5 6 7 8 9]               # <pyopencl.Platform 'NVIDIA CUDA'>
>>>
>>> [0 1 2 0 4 0 6 242802248 0 361062976]  # <pyopencl.Platform 'Intel(R) OpenCL'>
>>>
>>> What's more, each run can start with a different array, settling down
>>> to the above after the first invocation.
>>>
>>> I suspect it's something to do with how the local memory is initialised.
>>>
>>> Can anyone see if I am doing something wrong?
>>>
>>> Thanks
>>> Blair
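[Editor's note: the per-limb overflow test in the sum_vector kernel quoted above, `x > MAXINT - y`, can be checked with plain Python integers. This is a host-side sketch of the same carry logic, not the kernel; the function name is mine.]

```python
# Model the kernel's per-limb overflow test with plain Python ints.
MAXINT = 4294967295  # 2**32 - 1, the largest uint32 value

def add_limb(x, y):
    """Mirror the sum_vector kernel branch: return (sum mod 2**32, carry)."""
    if x > MAXINT - y:             # overflow: x + y would exceed MAXINT
        z = x - (MAXINT - y) - 1   # algebraically equal to (x + y) - 2**32
        c = 1
    else:
        z = x + y
        c = 0
    return z, c

# The branch is equivalent to computing the sum modulo 2**32 directly:
for x, y in [(0, 0), (MAXINT, 1), (MAXINT, MAXINT), (123, 456)]:
    z, c = add_limb(x, y)
    assert z == (x + y) % 2**32
    assert c == (x + y) // 2**32
```

The subtraction form avoids ever forming an intermediate larger than MAXINT, which is why it is safe in uint32 arithmetic inside the kernel.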
_______________________________________________ PyOpenCL mailing list [email protected] http://lists.tiker.net/listinfo/pyopencl
