there's multiple problems with your code.
First, I assume that your actual kernel is more complicated than that and
that you stripped code away - otherwise, using local memory in this case is
completely useless.
Second, if you want to use local memory, you gotta put
barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and the
one that reads it, as they are in different threads so there's no guarantee
which will arrive first.
Third, if you have more than 1 workgroup, in other words if get_global_id()
!= get_local_id(), then you're going to access memory that simply does not
exist.
Without local memory:
__kernel void memskip(__global uint *a_g, __global uint *b_g) {
int i = get_global_id(0);
if (i > 0)
b_g[i] = a_g[i-1];
else
b_g[i] = 0;
}
With local memory; only useful if your code is actually more complex than
this - that is, it needs to access the same cell of a_g multiple times, or
write to the same cell of b_g multiple times
__kernel void localmemskip(__global uint *a_g, __global uint *b_g, __local
uint *c_l) {
int gid = get_global_id(0);
int lid = get_local_id(0);
// c_l must be the same size as the workgroup
c_l[lid] = a_g[gid];
// wait for all threads of a workgroup to reach this point
barrier(CLK_LOCAL_MEM_FENCE)
if (lid > 0)
b_g[gid] = c_l[lid-1];
else if (gid > 0)
// can't use local memory!!! Your work group size better be much
larger than our wavelength (32 on NVidia or 64 on AMD), or the whole
benefit of local memory will vanish!
b_g[gid] = a_g[gid - 1];
else
b_g[gid] = 0;
}
HTH
On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
> Hi
>
> I am trying to trace down a particular issue where a local memory
> parameter appears to vanish the 1st time I run a kernel but on every
> subsequent invocation works correctly. This is on my Nvidia Geforce GT
> 555M.
>
> While trying to trace the above issue and switching to my Intel CPU I came
> across even more unusual behaviour.
>
> With the following Kernel
>
> __kernel void localmemskip(__global uint *a_g, __global uint *b_g, __local
> uint *c_l) {
> int i = get_global_id(0);
> c_l[i] = a_g[i];
> b_g[i] = 0;
> if (i > 0)
> b_g[i] = c_l[i-1];
> }
>
> I invoke the following code:
>
> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
> b = np.zeros(10, dtype=np.uint32)
>
> mf = cl.mem_flags
> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>
> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
> cl.enqueue_copy(queue, b, b_g)
> print(b)
>
> Then the output for each platform is:
>
> [0 1 2 3 4 5 6 7 8 9] # <pyopencl.Platform 'NVIDIA CUDA' >
>
> [0 1 2 0 4 0 6 242802248 0 361062976] # <pyopencl.Platform 'Intel(R) OpenCL' >
>
>
> What's more is each output can start with a different array settling down
> to the above after 1st invocation.
>
> I have my suspicions it's something to do with how the local memory is
> initialised.
>
> Can anyone see if I am doing something wrong?
>
> Thanks
> Blair
>
>
>
> _______________________________________________
> PyOpenCL mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pyopencl
>
>
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl