Hello Vincent,
It seems that the problem is the following:
for (unsigned int i = 0; i < nb; i += BLOCKSIZE)
{
    barrier(CLK_LOCAL_MEM_FENCE); // <--- need to synchronize here too
    a[tx] = va[i + tx];           // refill the local buffer a[]
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned int j = 0; j < BLOCKSIZE; j++)
    {
        s += native_cos(a[j] * b);
    }
}
Otherwise you start filling a[] with a new portion of values while
other threads may still be reading values from a[] to compute
native_cos(), and the test fails as soon as the block size is larger
than the warp size (i.e., as soon as the barrier() becomes necessary).
(By the way, and sorry for nitpicking, 'warp_size' is a misleading name
for that template parameter: it is really the block/work-group size,
while the warp size is fixed for the device.)
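In case it helps with your block-size sweep, here is a minimal PyOpenCL
sketch (the dummy kernel and names below are placeholders, not your test
code) that queries the work-group limits that matter here, including the
preferred work-group size multiple, which on NVIDIA/AMD GPUs corresponds
to the warp/wavefront width (OpenCL 1.1+):

import pyopencl as cl

ctx = cl.create_some_context()
device = ctx.devices[0]

# Placeholder kernel, only needed so per-kernel limits can be queried.
src = """
__kernel void dummy(__global float *out)
{
    out[get_global_id(0)] = 0.0f;
}
"""
knl = cl.Program(ctx, src).build().dummy

print("device max work-group size:", device.max_work_group_size)
print("kernel max work-group size:",
      knl.get_work_group_info(cl.kernel_work_group_info.WORK_GROUP_SIZE,
                              device))
# The hardware SIMD width (warp/wavefront on GPUs, often 1 or a small
# number on CPU implementations); below this size the missing barrier()
# happens to go unnoticed on NVIDIA hardware.
print("preferred work-group size multiple:",
      knl.get_work_group_info(
          cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
          device))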
Best regards,
Bogdan
On Sun, Oct 2, 2011 at 12:20 AM, Vincent Favre-Nicolin
<[email protected]> wrote:
> Hi,
>
> I have run into a problem while converting a (py)cuda program to
> (py)opencl. Basically, it seems that only some local_size values work
> reliably, and I fail to understand why.
> I have written a small program which tests the different platforms
> available, with different local_size, up to max_work_group_size.
>
> The program compares the OpenCL result to the numpy calculation and
> reports whether the test passed or failed. Surprisingly, the test fails
> for local sizes > 1 on the CPU (using AMD's platform), with no problem
> on the GPU (nVidia).
>
> I even added a reqd_work_group_size kernel attribute, but it does not
> make any difference.
>
> Any idea?
>
> Vincent
>
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl