Now your code makes more sense... however

1. it doesn't scale. You're limited to input data that is at most
min(local memory size, 4 * max work-group size) bytes. Both parameters
vary wildly between devices, so portability is poor.
2. it's buggy, as it doesn't consider that adding a carry can cause an
overflow itself.
Try summing [2^32 - 1, 2^32 - 1, 0] + [1, 0, 0].
The expected result is [0, 0, 1], but you'll get [0, 0, 0].
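To see the failure concretely, here is a plain-Python model of the kernel's
single-pass carry handling (an editor's sketch - `add_single_pass` is a
hypothetical name, and Python ints stand in for uint32 limbs):

```python
MASK = 2**32 - 1  # largest value a uint32 limb can hold

def add_single_pass(x, y):
    # mirror of the kernel: compute per-limb sums and carries, then let
    # each limb absorb only the carry from the limb below it - carries
    # generated by the absorption step itself are never propagated
    z = [(a + b) & MASK for a, b in zip(x, y)]
    c = [1 if a + b > MASK else 0 for a, b in zip(x, y)]
    return [(zi + (c[i - 1] if i > 0 else 0)) & MASK
            for i, zi in enumerate(z)]

# limb 1 is already 2^32 - 1, so the incoming carry overflows it again,
# and that second carry is silently dropped
print(add_single_pass([2**32 - 1, 2**32 - 1, 0], [1, 0, 0]))  # -> [0, 0, 0]
```

The true sum is [0, 0, 1], so the dropped carry loses the top limb.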

As for performance, I would look into using uint4s instead of uints: you'd
have a quarter of the work-items, and hence a quarter of the barriers, so I
wouldn't be surprised if that gave a considerable performance boost.
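On the correctness side, whatever the launch scheme, carries have to be
propagated until none remain - in the worst case a single carry ripples
across every limb. A plain-Python sketch of an iterated scheme (editor's
illustration; `add_with_propagation` is a hypothetical name, and each pass
would correspond to one kernel launch in a multi-pass design):

```python
MASK = 2**32 - 1  # uint32 limb mask

def add_with_propagation(x, y):
    # repeat sum-and-carry passes until no pass generates a carry;
    # the carry out of limb i feeds limb i+1 on the next pass
    z = list(x)
    add = list(y)
    while any(add):
        s = [(a + b) & MASK for a, b in zip(z, add)]
        add = [0] + [1 if a + b > MASK else 0 for a, b in zip(z, add)][:-1]
        z = s
    return z

print(add_with_propagation([2**32 - 1, 2**32 - 1, 0], [1, 0, 0]))  # -> [0, 0, 1]
```

Each pass is embarrassingly parallel, which is what makes the "second pass"
idea workable on the GPU.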
On 23 Sep 2015 09:46, "Blair Azzopardi" <[email protected]> wrote:

> Actually, this isn't the best way as it's being forced to run
> sequentially. My original aim was to do the initial add then place the
> carry array back into a queue to be added on a second pass. I'll see if
> that's possible and revert. Thanks again.
>
> On Wed, Sep 23, 2015 at 9:14 AM, Blair Azzopardi <[email protected]>
> wrote:
>
>> That's very helpful. Thanks.
>>
>> You're right that I simplified my code for the sake of brevity. I am
>> trying to write a simple multiple-precision adder, and I'm using local
>> memory to hold the carries. Since the carry array is the same size as the
>> inputs and hence the global size, I've now set the local size equal to the
>> global size (I hadn't really considered this until I read your response).
>> This appears to fix the problem, with and without the barrier.
>>
>> My only challenge now is improving performance. The complete kernel now
>> looks like:
>>
>> #define MAXINT 4294967295U
>> __kernel void sum_vector(__global const uint *x_g, __global const uint
>> *y_g, __global uint *z_g, __local uint *c_l) {
>>
>>     int i = get_local_id(0);
>>     uint z;
>>     uint c;
>>
>>     if (x_g[i] > MAXINT - y_g[i]) { // overflow
>>         z = x_g[i] - (MAXINT - y_g[i]) - 1;
>>         c = 1;
>>     }
>>     else {
>>         z = x_g[i] + y_g[i];
>>         c = 0;
>>     }
>>
>>     c_l[i] = c;
>>     barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local
>> mem
>>
>>     if (i > 0) {
>>         if (c_l[i-1] > 0) {
>>             z += c_l[i-1];
>>         }
>>     }
>>
>>     z_g[i] = z;
>> }
>>
>> Called using the following:
>>
>> a = np.array([...], dtype=np.uint32)
>> b = np.array([...], dtype=np.uint32)
>> c = np.zeros(max(len(a),len(b))+1, dtype=np.uint32)
>> a.resize(c.shape)
>> b.resize(c.shape)
>>
>> mf = cl.mem_flags
>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>> b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
>> c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
>>
>> sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g,
>> cl.LocalMemory(4*len(c)))
>> cl.enqueue_copy(queue, c, c_g)
>>
>> Any suggestions here? Should I try and split the kernel code to allow it
>> to divide the task across work groups? Perhaps my goal is a little futile?
>>
>>
>> On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <[email protected]>
>> wrote:
>>
>>> there's multiple problems with your code.
>>> First, I assume that your actual kernel is more complicated than that
>>> and that you stripped code away - otherwise, using local memory in this
>>> case is completely useless.
>>>
>>> Second, if you want to use local memory, you gotta put
>>> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and the
>>> one that reads it, as they are in different threads so there's no guarantee
>>> which will arrive first.
>>>
>>> Third, if you have more than 1 workgroup, in other words if
>>> get_global_id(0) != get_local_id(0), then you're going to access memory
>>> that simply does not exist.
>>>
>>> Without local memory:
>>> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>>>     int i = get_global_id(0);
>>>     if (i > 0)
>>>         b_g[i] = a_g[i-1];
>>>     else
>>>         b_g[i] = 0;
>>> }
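(Editor's aside: a NumPy host-side reference for checking this kernel's
output against - `memskip_ref` is a hypothetical name, assuming `a` is a
uint32 array:)

```python
import numpy as np

def memskip_ref(a):
    # host-side reference for the kernel: b[0] = 0, b[i] = a[i-1] for i > 0
    b = np.zeros_like(a)
    b[1:] = a[:-1]
    return b

print(memskip_ref(np.arange(1, 11, dtype=np.uint32)))  # -> [0 1 2 3 4 5 6 7 8 9]
```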
>>>
>>> With local memory - only useful if your code is actually more complex
>>> than this, i.e. it needs to read the same cell of a_g multiple times, or
>>> write to the same cell of b_g multiple times:
>>>
>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>> __local uint *c_l) {
>>>     int gid = get_global_id(0);
>>>     int lid = get_local_id(0);
>>>
>>>     // c_l must be the same size as the workgroup
>>>     c_l[lid] = a_g[gid];
>>>
>>>     // wait for all threads of a workgroup to reach this point
>>>     barrier(CLK_LOCAL_MEM_FENCE);
>>>
>>>     if (lid > 0)
>>>         b_g[gid] = c_l[lid-1];
>>>     else if (gid > 0)
>>>         // can't use local memory here! Your work-group size had better
>>>         // be much larger than the warp/wavefront size (32 on NVidia, 64
>>>         // on AMD), or the whole benefit of local memory will vanish!
>>>         b_g[gid] = a_g[gid - 1];
>>>     else
>>>         b_g[gid] = 0;
>>> }
>>>
>>> HTH
>>> On 23 Sep 2015 00:08, "Blair Azzopardi" <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> I am trying to track down a particular issue where a local memory
>>>> parameter appears to vanish the first time I run a kernel but works
>>>> correctly on every subsequent invocation. This is on my Nvidia GeForce
>>>> GT 555M.
>>>>
>>>> While tracing the above issue, I switched to my Intel CPU and came
>>>> across even more unusual behaviour.
>>>>
>>>> With the following Kernel
>>>>
>>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>> __local uint *c_l) {
>>>>     int i = get_global_id(0);
>>>>     c_l[i] = a_g[i];
>>>>     b_g[i] = 0;
>>>>     if (i > 0)
>>>>         b_g[i] = c_l[i-1];
>>>> }
>>>>
>>>> I invoke the following code:
>>>>
>>>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>>>> b = np.zeros(10, dtype=np.uint32)
>>>>
>>>> mf = cl.mem_flags
>>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>>>
>>>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>>>> cl.enqueue_copy(queue, b, b_g)
>>>> print(b)
>>>>
>>>> Then the output for each platform is:
>>>>
>>>> [0 1 2 3 4 5 6 7 8 9] # <pyopencl.Platform 'NVIDIA CUDA' >
>>>>
>>>> [0 1 2 0 4 0 6 242802248 0 361062976] # <pyopencl.Platform 'Intel(R) OpenCL' >
>>>>
>>>>
>>>> What's more, each run can start with a different array, settling down
>>>> to the above after the first invocation.
>>>>
>>>> I have my suspicions it's something to do with how the local memory is
>>>> initialised.
>>>>
>>>> Can anyone see if I am doing something wrong?
>>>>
>>>> Thanks
>>>> Blair
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> PyOpenCL mailing list
>>>> [email protected]
>>>> http://lists.tiker.net/listinfo/pyopencl
>>>>
>>>>
>>
>