Yes, you're right - it is buggy. The carry really should be enqueued and then
added to the previous sum (as mentioned in my last email). I'll have to figure
out whether that's possible. It may be useful for something else.
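In plain Python the idea would look something like this - a host-side sketch only, nothing to do with the kernel itself, and the names add_pass/mp_add are just illustrative. Each pass does the word-wise add and produces a carry array, and that carry array is re-queued as a new addend (shifted up one word) until no carries remain:

```python
MAXINT = 2**32 - 1  # same constant as in the kernel

def add_pass(x, y):
    """One parallel-style pass: word-wise add producing sums and carries."""
    z, c = [], []
    for xi, yi in zip(x, y):
        s = xi + yi
        z.append(s & MAXINT)  # sum modulo 2^32
        c.append(s >> 32)     # 1 if this word overflowed, else 0
    return z, c

def mp_add(x, y):
    """Multi-precision add: re-enqueue the shifted carry array until it is all zero."""
    z, c = add_pass(x, y)
    while any(c):
        # carry out of word i becomes an addend for word i+1 on the next pass;
        # the top carry is dropped here (the host code allocates an extra word for it)
        shifted = [0] + c[:-1]
        z, c = add_pass(z, shifted)
    return z

# The failing example from the thread: [2^32-1, 2^32-1, 0] + [1, 0, 0]
print(mp_add([MAXINT, MAXINT, 0], [1, 0, 0]))  # -> [0, 0, 1]
```

Note the loop: a carry added on the second pass can itself overflow a word (exactly the bug in the kernel), so in the worst case the carry array has to be re-queued once per word before it settles.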

However, I've done some simple timings and noticed that simply passing over as
many work items as there are words, without doing anything at all, is slower
than running the GMP add until the number of words reaches around 250k+. So
this approach isn't worth pursuing, although thanks for your help.

On Wed, Sep 23, 2015 at 1:23 PM, CRV§ADER//KY <[email protected]> wrote:

> Now your code makes more sense... however
>
> 1. it doesn't scale. You're limited to input data that is at most as large
> (in bytes) as min(local memory, 4*max group size). These two parameters
> will vary wildly between devices, so it's very poorly portable.
> 2. it's buggy as it doesn't consider that a carry can cause an overflow
> itself.
> Try summing up [2^32 - 1, 2^32 - 1, 0] + [1, 0, 0].
> The expected result is [0, 0, 1] but you'll get [0, 0, 0].
>
> As for performance, I would look into using uint4s instead of uints. Since
> you'll have 1/4th of the barriers that way, I wouldn't be surprised to get
> a considerable performance boost.
> On 23 Sep 2015 09:46, "Blair Azzopardi" <[email protected]> wrote:
>
>> Actually, this isn't the best way as it's being forced to run
>> sequentially. My original aim was to do the initial add then place the
>> carry array back into a queue to be added on a second pass. I'll see if
>> that's possible and revert. Thanks again.
>>
>> On Wed, Sep 23, 2015 at 9:14 AM, Blair Azzopardi <[email protected]>
>> wrote:
>>
>>> That's very helpful. Thanks.
>>>
>>> You're right that I simplified my code for sake of brevity. I am trying
>>> to write a simple multiple precision adder and I'm using local memory to
>>> hold the carry. Since the carry is actually the same size as the inputs and
>>> hence global size, I've now set the local size to global size (I hadn't
>>> really considered this until I read your response). This appears to fix the
>>> problem, both with and without the barrier.
>>>
>>> My only challenge now is improving performance. The complete kernel now
>>> looks like:
>>>
>>> #define MAXINT 4294967295U
>>> __kernel void sum_vector(__global const uint *x_g, __global const uint *y_g,
>>>                          __global uint *z_g, __local uint *c_l) {
>>>
>>>     int i = get_local_id(0);
>>>     uint z;
>>>     uint c;
>>>
>>>     if (x_g[i] > MAXINT - y_g[i]) { // overflow
>>>         z = x_g[i] - (MAXINT - y_g[i]) - 1;
>>>         c = 1;
>>>     }
>>>     else {
>>>         z = x_g[i] + y_g[i];
>>>         c = 0;
>>>     }
>>>
>>>     c_l[i] = c;
>>>     barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local mem
>>>
>>>     if (i > 0) {
>>>         if (c_l[i-1] > 0) {
>>>             z += c_l[i-1];
>>>         }
>>>     }
>>>
>>>     z_g[i] = z;
>>> }
>>>
>>> Called using the following:
>>>
>>> a = np.array([...], dtype=np.uint32)
>>> b = np.array([...], dtype=np.uint32)
>>> c = np.zeros(max(len(a),len(b))+1, dtype=np.uint32)
>>> a.resize(c.shape)
>>> b.resize(c.shape)
>>>
>>> mf = cl.mem_flags
>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>> b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
>>> c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
>>>
>>> sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g, cl.LocalMemory(4*len(c)))
>>> cl.enqueue_copy(queue, c, c_g)
>>>
>>> Any suggestions here? Should I try and split the kernel code to allow it
>>> to divide the task across work groups? Perhaps my goal is a little futile?
>>>
>>>
>>> On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <[email protected]>
>>> wrote:
>>>
>>>> There are multiple problems with your code.
>>>> First, I assume that your actual kernel is more complicated than that
>>>> and that you stripped code away - otherwise, using local memory in this
>>>> case is completely useless.
>>>>
>>>> Second, if you want to use local memory, you gotta put
>>>> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and the
>>>> one that reads it, as they are in different threads so there's no guarantee
>>>> which will arrive first.
>>>>
>>>> Third, if you have more than 1 workgroup, in other words if
>>>> get_global_id() != get_local_id(), then you're going to access memory that
>>>> simply does not exist.
>>>>
>>>> Without local memory:
>>>> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>>>>     int i = get_global_id(0);
>>>>     if (i > 0)
>>>>         b_g[i] = a_g[i-1];
>>>>     else
>>>>         b_g[i] = 0;
>>>> }
>>>>
>>>> With local memory; only useful if your code is actually more complex
>>>> than this - that is, if it needs to access the same cell of a_g multiple
>>>> times, or write to the same cell of b_g multiple times:
>>>>
>>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>> __local uint *c_l) {
>>>>     int gid = get_global_id(0);
>>>>     int lid = get_local_id(0);
>>>>
>>>>     // c_l must be the same size as the workgroup
>>>>     c_l[lid] = a_g[gid];
>>>>
>>>>     // wait for all threads of a workgroup to reach this point
>>>>     barrier(CLK_LOCAL_MEM_FENCE);
>>>>
>>>>     if (lid > 0)
>>>>         b_g[gid] = c_l[lid-1];
>>>>     else if (gid > 0)
>>>>         // can't use local memory!!! Your work group size better be much
>>>>         // larger than your warp/wavefront size (32 on NVidia or 64 on
>>>>         // AMD), or the whole benefit of local memory will vanish!
>>>>         b_g[gid] = a_g[gid - 1];
>>>>     else
>>>>         b_g[gid] = 0;
>>>> }
>>>>
>>>> HTH
>>>> On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am trying to track down a particular issue where a local memory
>>>>> parameter appears to vanish the first time I run a kernel, but works
>>>>> correctly on every subsequent invocation. This is on my Nvidia GeForce
>>>>> GT 555M.
>>>>>
>>>>> While trying to trace the above issue and switching to my Intel CPU I
>>>>> came across even more unusual behaviour.
>>>>>
>>>>> With the following kernel:
>>>>>
>>>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>>> __local uint *c_l) {
>>>>>     int i = get_global_id(0);
>>>>>     c_l[i] = a_g[i];
>>>>>     b_g[i] = 0;
>>>>>     if (i > 0)
>>>>>         b_g[i] = c_l[i-1];
>>>>> }
>>>>>
>>>>> I invoke the following code:
>>>>>
>>>>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>>>>> b = np.zeros(10, dtype=np.uint32)
>>>>>
>>>>> mf = cl.mem_flags
>>>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>>>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>>>>
>>>>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>>>>> cl.enqueue_copy(queue, b, b_g)
>>>>> print(b)
>>>>>
>>>>> Then the output for each platform is:
>>>>>
>>>>> [0 1 2 3 4 5 6 7 8 9] # <pyopencl.Platform 'NVIDIA CUDA' >
>>>>>
>>>>> [0 1 2 0 4 0 6 242802248 0 361062976] # <pyopencl.Platform 'Intel(R) OpenCL' >
>>>>>
>>>>>
>>>>> What's more, each run can start with a different array, settling down to
>>>>> the above after the first invocation.
>>>>>
>>>>> I suspect it has something to do with how the local memory is
>>>>> initialised.
>>>>>
>>>>> Can anyone see if I am doing something wrong?
>>>>>
>>>>> Thanks
>>>>> Blair
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> PyOpenCL mailing list
>>>>> [email protected]
>>>>> http://lists.tiker.net/listinfo/pyopencl
>>>>>
>>>>>
>>>
>>