That's very helpful. Thanks.
You're right that I simplified my code for the sake of brevity. I am trying
to write a simple multiple-precision adder and I'm using local memory to
hold the carries. Since the carry array is actually the same size as the
inputs and hence the global size, I've now set the local size equal to the
global size (I hadn't really considered this until I read your response).
This appears to fix the problem, both with and without the barrier.
My only challenge now is improving performance. The complete kernel now
looks like:
#define MAXINT 4294967295U
__kernel void sum_vector(__global const uint *x_g, __global const uint *y_g,
                         __global uint *z_g, __local uint *c_l) {
    int i = get_local_id(0);
    uint z;
    uint c;
    if (x_g[i] > MAXINT - y_g[i]) { // overflow
        z = x_g[i] - (MAXINT - y_g[i]) - 1;
        c = 1;
    }
    else {
        z = x_g[i] + y_g[i];
        c = 0;
    }
    c_l[i] = c;
    barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local mem
    if (i > 0) {
        if (c_l[i-1] > 0) {
            z += c_l[i-1];
        }
    }
    z_g[i] = z;
}
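One thing I noticed while checking this: the kernel adds at most one level of
carry, so a limb that holds 0xFFFFFFFF when its neighbour's carry arrives will
overflow again, and that second carry is lost. As a reference to test against,
here is a serial Python model with full carry rippling (helper names are made
up, not PyOpenCL), checked against Python's arbitrary-precision integers:

```python
MASK = 0xFFFFFFFF  # same role as MAXINT in the kernel

def mp_add_ref(x, y):
    """Reference multiple-precision add over little-endian lists of
    32-bit limbs, rippling the carry serially; the extra final limb
    matches the +1 element in the host-side c array."""
    z, carry = [], 0
    for xi, yi in zip(x, y):
        s = xi + yi + carry
        z.append(s & MASK)
        carry = s >> 32
    z.append(carry)
    return z

def to_int(limbs):
    # little-endian: limbs[0] is least significant
    return sum(v << (32 * i) for i, v in enumerate(limbs))

# this input exercises exactly the ripple the single-pass kernel misses
a = [0xFFFFFFFF, 0xFFFFFFFF, 0]
b = [1, 0, 0]
assert to_int(mp_add_ref(a, b)) == to_int(a) + to_int(b)
```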
Called using the following:
a = np.array([...], dtype=np.uint32)
b = np.array([...], dtype=np.uint32)
c = np.zeros(max(len(a),len(b))+1, dtype=np.uint32)
a.resize(c.shape)
b.resize(c.shape)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g, cl.LocalMemory(4*len(c)))
cl.enqueue_copy(queue, c, c_g)
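Incidentally, one small simplification I spotted (just a sketch, not tested
in the kernel): since uint arithmetic wraps around in OpenCL C, the overflow
test can be written without MAXINT as z = x_g[i] + y_g[i]; c = (z < x_g[i]);.
A Python model of the idea, with the wraparound made explicit by masking:

```python
MASK = 0xFFFFFFFF

def add_limb(x, y):
    """Model of unsigned 32-bit addition: the sum wraps modulo 2^32,
    and overflow occurred exactly when the wrapped result is smaller
    than one of the operands."""
    z = (x + y) & MASK
    c = 1 if z < x else 0
    return z, c

assert add_limb(0xFFFFFFFF, 1) == (0, 1)  # wraps, carry set
assert add_limb(2, 3) == (5, 0)           # no overflow
```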
Any suggestions here? Should I try to split the kernel code to allow it to
divide the task across work-groups? Perhaps my goal is a little futile?
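For what it's worth, one direction I've been thinking about (just a sketch of
the standard generate/propagate idea from carry-lookahead adders, not a
drop-in kernel, and all names below are made up): carry propagation is an
exclusive prefix scan over per-limb flags, and a prefix scan is exactly the
kind of thing that can be split across work-groups (per-group scan, then a
second scan over the group totals). A serial Python model of the combine step:

```python
MASK = 0xFFFFFFFF

def flags(x, y):
    # per-limb carry flags: generate (overflows regardless of carry-in)
    # and propagate (overflows exactly when a carry comes in)
    s = x + y
    return (s > MASK, s == MASK)

def combine(lo, hi):
    # associative carry-lookahead combine of two adjacent limb groups
    g_lo, p_lo = lo
    g_hi, p_hi = hi
    return (g_hi or (p_hi and g_lo), p_hi and p_lo)

def mp_add_scan(x, y):
    """Add little-endian 32-bit limb lists via an exclusive prefix scan
    of (generate, propagate) flags; because combine() is associative,
    the scan can in principle be parallelised across work-groups."""
    gp = [flags(xi, yi) for xi, yi in zip(x, y)]
    z, acc = [], (False, False)
    for xi, yi, f in zip(x, y, gp):
        z.append((xi + yi + int(acc[0])) & MASK)  # acc[0] = carry into limb
        acc = combine(acc, f)
    z.append(int(acc[0]))  # final carry limb
    return z

def to_int(limbs):
    return sum(v << (32 * i) for i, v in enumerate(limbs))

a, b = [0xFFFFFFFF] * 3, [1, 0, 0]
assert to_int(mp_add_scan(a, b)) == to_int(a) + to_int(b)
```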
On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <[email protected]> wrote:
> There are multiple problems with your code.
> First, I assume that your actual kernel is more complicated than that and
> that you stripped code away - otherwise, using local memory in this case is
> completely useless.
>
> Second, if you want to use local memory, you gotta put
> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and the
> one that reads it, as they are in different threads so there's no guarantee
> which will arrive first.
>
> Third, if you have more than 1 workgroup, in other words if
> get_global_id() != get_local_id(), then you're going to access memory that
> simply does not exist.
>
> Without local memory:
> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>     int i = get_global_id(0);
>     if (i > 0)
>         b_g[i] = a_g[i-1];
>     else
>         b_g[i] = 0;
> }
>
> With local memory - only useful if your code is actually more complex than
> this, i.e. it needs to access the same cell of a_g multiple times, or
> write to the same cell of b_g multiple times:
>
> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>                            __local uint *c_l) {
>     int gid = get_global_id(0);
>     int lid = get_local_id(0);
>
>     // c_l must be the same size as the workgroup
>     c_l[lid] = a_g[gid];
>
>     // wait for all threads of a workgroup to reach this point
>     barrier(CLK_LOCAL_MEM_FENCE);
>
>     if (lid > 0)
>         b_g[gid] = c_l[lid-1];
>     else if (gid > 0)
>         // can't use local memory!!! Your work-group size had better be
>         // much larger than the warp/wavefront size (32 on NVidia, 64 on
>         // AMD), or the whole benefit of local memory will vanish!
>         b_g[gid] = a_g[gid - 1];
>     else
>         b_g[gid] = 0;
> }
>
> HTH
> On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
>
>> Hi
>>
>> I am trying to track down a particular issue where a local memory
>> parameter appears to vanish the 1st time I run a kernel, but works
>> correctly on every subsequent invocation. This is on my Nvidia GeForce
>> GT 555M.
>>
>> While trying to trace the above issue and switching to my Intel CPU I
>> came across even more unusual behaviour.
>>
>> With the following kernel:
>>
>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>                            __local uint *c_l) {
>>     int i = get_global_id(0);
>>     c_l[i] = a_g[i];
>>     b_g[i] = 0;
>>     if (i > 0)
>>         b_g[i] = c_l[i-1];
>> }
>>
>> I invoke the following code:
>>
>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>> b = np.zeros(10, dtype=np.uint32)
>>
>> mf = cl.mem_flags
>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>
>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>> cl.enqueue_copy(queue, b, b_g)
>> print(b)
>>
>> Then the output for each platform is:
>>
>> [0 1 2 3 4 5 6 7 8 9] # <pyopencl.Platform 'NVIDIA CUDA' >
>>
>> [0 1 2 0 4 0 6 242802248 0 361062976] # <pyopencl.Platform 'Intel(R) OpenCL'>
>>
>>
>> What's more, each run can start with a different array, settling down
>> to the above after the 1st invocation.
>>
>> I have my suspicions it's something to do with how the local memory is
>> initialised.
>>
>> Can anyone see if I am doing something wrong?
>>
>> Thanks
>> Blair
>>
>>
>>
>> _______________________________________________
>> PyOpenCL mailing list
>> [email protected]
>> http://lists.tiker.net/listinfo/pyopencl
>>
>>