Actually, this isn't the best way as it's being forced to run sequentially.
My original aim was to do the initial add then place the carry array back
into a queue to be added on a second pass. I'll see if that's possible and
revert. Thanks again.
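For what it's worth, a minimal host-side NumPy sketch of that two-pass idea
(per-limb add first, then push the carry array through further passes) might
look like the following. This is purely a hypothetical CPU model of the
kernel, not PyOpenCL code, and `two_pass_add` is a made-up name:

```python
import numpy as np

def two_pass_add(x, y):
    """Add two multi-precision numbers stored as little-endian uint32 limbs.

    Arrays are assumed padded with a spare top limb, as in the
    c = np.zeros(max(len(a), len(b)) + 1) setup later in this thread.
    Pass 1: per-limb sums plus a carry array.
    Pass 2 (repeated): shift carries up one limb and re-add until none
    remain - a carry can itself trigger a new carry.
    """
    s = x.astype(np.uint64) + y.astype(np.uint64)   # no overflow in 64-bit
    carry = s >> np.uint64(32)
    z = s & np.uint64(0xFFFFFFFF)
    while carry.any():
        carry = np.roll(carry, 1)                   # carry of limb i feeds limb i+1
        carry[0] = 0                                # nothing carries into limb 0
        s = z + carry
        carry = s >> np.uint64(32)
        z = s & np.uint64(0xFFFFFFFF)
    return z.astype(np.uint32)
```

For example, `two_pass_add(np.array([0xFFFFFFFF, 0], dtype=np.uint32),
np.array([1, 0], dtype=np.uint32))` gives `[0, 1]`.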

On Wed, Sep 23, 2015 at 9:14 AM, Blair Azzopardi <[email protected]> wrote:

> That's very helpful. Thanks.
>
> You're right that I simplified my code for sake of brevity. I am trying to
> write a simple multiple precision adder and I'm using local memory to hold
> the carry. Since the carry is actually the same size as the inputs and
> hence global size, I've now set the local size equal to the global size (I
> hadn't really considered this until I read your response). This appears to
> fix the problem, with or without the barrier.
>
> My only challenge now is improving performance. The complete kernel now
> looks like:
>
> #define MAXINT 4294967295U
> __kernel void sum_vector(__global const uint *x_g, __global const uint
> *y_g, __global uint *z_g, __local uint *c_l) {
>
>     int i = get_local_id(0);
>     uint z;
>     uint c;
>
>     if (x_g[i] > MAXINT - y_g[i]) { // overflow
>         z = x_g[i] - (MAXINT - y_g[i]) - 1;
>         c = 1;
>     }
>     else {
>         z = x_g[i] + y_g[i];
>         c = 0;
>     }
>
>     c_l[i] = c;
>     barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local
> mem
>
>     if (i > 0) {
>         if (c_l[i-1] > 0) {
>             z += c_l[i-1];
>         }
>     }
>
>     z_g[i] = z;
> }
>
> Called using the following:
>
> a = np.array([...], dtype=np.uint32)
> b = np.array([...], dtype=np.uint32)
> c = np.zeros(max(len(a),len(b))+1, dtype=np.uint32)
> a.resize(c.shape)
> b.resize(c.shape)
>
> mf = cl.mem_flags
> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
> b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
> c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
>
> sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g,
> cl.LocalMemory(4*len(c)))
> cl.enqueue_copy(queue, c, c_g)
>
> Any suggestions here? Should I try and split the kernel code to allow it
> to divide the task across work groups? Perhaps my goal is a little futile?
>
>
> On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <[email protected]>
> wrote:
>
>> There are multiple problems with your code.
>> First, I assume that your actual kernel is more complicated than that and
>> that you stripped code away - otherwise, using local memory in this case is
>> completely useless.
>>
>> Second, if you want to use local memory, you gotta put
>> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and the
>> one that reads it, as they are in different threads so there's no guarantee
>> which will arrive first.
>>
>> Third, if you have more than 1 workgroup, in other words if
>> get_global_id() != get_local_id(), then you're going to access memory that
>> simply does not exist.
>>
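>> To make the third point concrete, here's a tiny hypothetical Python model
>> of how OpenCL maps global IDs to (group, local ID) pairs, assuming a
>> global size of 8 and a workgroup size of 4 - indexing a 4-element __local
>> buffer with the global ID would run past its end for the whole second
>> group:

```python
GLOBAL_SIZE, LOCAL_SIZE = 8, 4   # hypothetical launch configuration

for gid in range(GLOBAL_SIZE):
    group = gid // LOCAL_SIZE     # get_group_id(0)
    lid = gid % LOCAL_SIZE        # get_local_id(0): the only valid __local index
    in_bounds = gid < LOCAL_SIZE  # would c_l[gid] stay inside local memory?
    print(f"gid={gid} group={group} lid={lid} c_l[gid] in bounds: {in_bounds}")
```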
>> Without local memory:
>> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>>     int i = get_global_id(0);
>>     if (i > 0)
>>         b_g[i] = a_g[i-1];
>>     else
>>         b_g[i] = 0;
>> }
>>
>> With local memory; only useful if your code is actually more complex than
>> this - that is, it needs to access the same cell of a_g multiple times, or
>> write to the same cell of b_g multiple times
>>
>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>> __local uint *c_l) {
>>     int gid = get_global_id(0);
>>     int lid = get_local_id(0);
>>
>>     // c_l must be the same size as the workgroup
>>     c_l[lid] = a_g[gid];
>>
>>     // wait for all threads of a workgroup to reach this point
>>     barrier(CLK_LOCAL_MEM_FENCE);
>>
>>     if (lid > 0)
>>         b_g[gid] = c_l[lid-1];
>>     else if (gid > 0)
>>         // can't use local memory here! Your work group size had better be
>> much larger than the hardware's warp/wavefront size (32 on NVIDIA or 64 on
>> AMD), or the whole benefit of local memory will vanish!
>>         b_g[gid] = a_g[gid - 1];
>>     else
>>         b_g[gid] = 0;
>> }
>>
>> HTH
>> On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I am trying to track down a particular issue where a local memory
>>> parameter appears to vanish the first time I run a kernel but works
>>> correctly on every subsequent invocation. This is on my NVIDIA GeForce
>>> GT 555M.
>>>
>>> While trying to trace the above issue and switching to my Intel CPU I
>>> came across even more unusual behaviour.
>>>
>>> With the following kernel:
>>>
>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>> __local uint *c_l) {
>>>     int i = get_global_id(0);
>>>     c_l[i] = a_g[i];
>>>     b_g[i] = 0;
>>>     if (i > 0)
>>>         b_g[i] = c_l[i-1];
>>> }
>>>
>>> I invoke the following code:
>>>
>>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>>> b = np.zeros(10, dtype=np.uint32)
>>>
>>> mf = cl.mem_flags
>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>>
>>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>>> cl.enqueue_copy(queue, b, b_g)
>>> print(b)
>>>
>>> Then the output for each platform is:
>>>
>>> [0 1 2 3 4 5 6 7 8 9] # <pyopencl.Platform 'NVIDIA CUDA' >
>>>
>>> [0 1 2 0 4 0 6 242802248 0 361062976] # <pyopencl.Platform 'Intel(R) OpenCL' >
>>>
>>>
>>> What's more, each run can start with a different array, settling down
>>> to the above after the first invocation.
>>>
>>> I have my suspicions it's something to do with how the local memory is
>>> initialised.
>>>
>>> Can anyone see if I am doing something wrong?
>>>
>>> Thanks
>>> Blair
>>>
>>>
>>>
>>> _______________________________________________
>>> PyOpenCL mailing list
>>> [email protected]
>>> http://lists.tiker.net/listinfo/pyopencl
>>>
>>>
>
