barrier only synchronizes the work-items inside a single workgroup. You
can't use it to synchronize different workgroups.
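To see why this matters for the kernel quoted below, here is a minimal sketch in plain Python (no OpenCL device needed). It assumes a 1-D launch with zero global offset, where the group id is gid // local_size; the helper names are mine, not part of PyOpenCL. In the quoted kernel, work-item gid reads a_local[gid], which is written by work-item gid - 1, so the question is whether those two sit in the same workgroup.

```python
def work_item_ids(gid, local_size):
    """Return (group_id, local_id) for a 1-D launch with zero global offset."""
    return gid // local_size, gid % local_size


def same_workgroup(gid_a, gid_b, local_size):
    """True if both work-items belong to the same workgroup.

    Only then can barrier() order one work-item's local-memory write
    before the other's read.
    """
    return gid_a // local_size == gid_b // local_size


# Hypothetical local size of 2 (an assumption for illustration only).
local_size = 2
for gid in range(1, 6):
    writer = gid - 1  # the work-item that wrote a_local[gid]
    verdict = "same group" if same_workgroup(gid, writer, local_size) else "different group"
    print(gid, work_item_ids(gid, local_size), verdict)
```

Every work-item whose writer falls in a different group is reading local memory that its own workgroup never filled, and no barrier can fix that, since local memory is private to each workgroup.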
You left the workgroup size set to None, which means you're asking the
scheduler to figure out the optimal one for you. That is almost
universally bad for performance, but in your case it also breaks functionality.
I see that a is populated correctly for:
a[2] = c[2] = 6
a[4] = c[4] = 12
a[6] = c[6] = 18
...which leads me to believe that your workgroup size is 2.
Also, what are N and N2?
Finally, nothing guarantees that get_global_id(0) won't overshoot N. If N =
10 (as I presume) and the group size is 32 (typical for NVIDIA) or 64
(typical for AMD GPUs), the global size has to be rounded up to a multiple
of the group size, and you'll have plenty of workers writing out of bounds.
You must pass N as a parameter and add
    if (gid >= N) {
        return;
    }
at the beginning of your kernel.
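As a rough sketch of the host-side arithmetic, assuming you pick the group size yourself: the global size is rounded up to the next multiple of the group size, and the guard then discards the surplus work-items. The function names here are hypothetical, and the launch is only emulated in plain Python rather than run on a device.

```python
def padded_global_size(n, local_size):
    """Smallest multiple of local_size that is >= n."""
    return ((n + local_size - 1) // local_size) * local_size


def emulate_guarded_launch(n, local_size):
    """Return the gids that get past an `if (gid >= N) return;` guard."""
    global_size = padded_global_size(n, local_size)
    # Every gid in range(global_size) is launched; only gid < n does work.
    return [gid for gid in range(global_size) if gid < n]


n, local_size = 10, 32
global_size = padded_global_size(n, local_size)
active = emulate_guarded_launch(n, local_size)
print(global_size)                 # 32 work-items launched
print(len(active))                 # 10 of them do real work
print(global_size - len(active))   # 22 would have written out of bounds
```

On the real device you would then pass (global_size,) and (local_size,) as the second and third arguments of the kernel call, plus N as a sized scalar such as np.int32(N). One caveat: if the kernel also calls barrier, every work-item of a group has to reach it, so in that case it is safer to wrap the memory accesses in if (gid < N) than to return early.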
In your case, I think the scheduler decided to "play safe", since it can't
know if you implemented such a safety check in your code, and set the
workgroup size to 2 (the largest power of 2 that evenly divides 10) so that
you wouldn't risk the above problem.
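For what it's worth, the guess above is easy to check numerically. This is only an illustration of the arithmetic; how a given driver actually picks the size when you pass None is implementation-defined, and the function name is mine.

```python
def largest_power_of_two_divisor(n):
    """Largest power of two that evenly divides the positive integer n."""
    # n & -n isolates the lowest set bit of n, which is exactly that power.
    return n & -n


for n in (10, 12, 16, 7):
    print(n, largest_power_of_two_divisor(n))
```

For N = 10 this gives 2, which matches the pattern in your output; for an odd N it gives 1, meaning each work-item would sit in its own group.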
Cheers
Guido
On 13 May 2014 20:09, Franco Nicolas Bellomo <[email protected]> wrote:
> Hi, I'm Franco. I study physics in Argentina and I'm new to the PyOpenCL world.
>
> My problem is with barrier. This is my kernel:
>
> b_dev = cl.array.arange(queue, N2, dtype=np.float32)
> b_new_dev = cl.array.zeros(queue, N2, dtype=np.float32)
> a_dev = cl.array.zeros(queue, N2, dtype=np.float32)
>
> #kernel
> prg = cl.Program(ctx, """
> __kernel void twice(__global float *a, __global float *b,
>                     __global float *c)
> {
>     int gid = get_global_id(0);
>
>     __local float a_local[17];
>
>     a_local[gid + 1] = b[gid] + b[gid + 1] + b[gid + 2];
>     c[gid + 1] = a_local[gid + 1];
>     barrier(CLK_LOCAL_MEM_FENCE);
>     a[gid] = a_local[gid];
> }
> """).build()
> twice = prg.twice
>
> twice(queue, (N,), None, a_dev.data, b_dev.data, b_new_dev.data)
>
> print " b\n %s\n\n c\n %s\n vector A donde copio\n %s" %(b_dev,
> b_new_dev, a_dev)
>
>
> And generated this output:
>
> b
> [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.]
>
> c
> [ 0. 3. 6. 9. 12. 15. 18. 21. 24. 27. 30. 0.]
> vector A donde copio
> [ 1.65503965e-24 4.59149455e-41 6.00000000e+00 4.59149455e-41
> 1.20000000e+01 0.00000000e+00 1.80000000e+01 0.00000000e+00
> 2.40000000e+01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
>
> Why is the last vector different from the second vector? What am I doing wrong?
>
> Thanks
>
> _______________________________________________
> PyOpenCL mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pyopencl
>