Hello Vincent,

It seems that the problem is the following:

   for (unsigned int i=0; i<nb; i+=BLOCKSIZE)
   {
      barrier(CLK_LOCAL_MEM_FENCE); // <--- need to synchronize here too,
                                    //      so a[] is not overwritten while
                                    //      others are still reading from it
      a[tx]=va[i+tx];
      barrier(CLK_LOCAL_MEM_FENCE); // wait until a[] is fully loaded
      for(unsigned int j=0; j<BLOCKSIZE; j++)  // 'j' instead of 'i', which
      {                                        // shadowed the outer index
         s+=native_cos(a[j]*b);
      }
   }

Otherwise you start filling a[] with the next portion of values while
other work-items may still be reading values from a[] to compute
native_cos(). The test then fails as soon as the block size is larger
than the warp size, i.e. as soon as the work-items no longer run in
lockstep and the barrier() actually becomes necessary. (By the way,
and sorry for nitpicking, 'warp_size' is a misleading name for that
parameter in your template: it is really the block/group size, while
the warp size is fixed for the device.)
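
In case it helps, below is a rough, self-contained PyOpenCL sketch of
the kind of test I imagine you are running, with both barriers in
place and a comparison against numpy. The kernel name, argument names
(cos_sum, va, b, out) and the sizes are just my guesses, not your
actual code:

   import numpy as np
   import pyopencl as cl

   BLOCKSIZE = 64          # work-group size to test (must divide nb)
   nb = 1024

   src = """
   #define BLOCKSIZE %d
   __kernel void cos_sum(__global const float *va, const float b,
                         __global float *out, const unsigned int nb)
   {
      __local float a[BLOCKSIZE];
      const unsigned int tx = get_local_id(0);
      float s = 0.0f;
      for (unsigned int i = 0; i < nb; i += BLOCKSIZE)
      {
         barrier(CLK_LOCAL_MEM_FENCE);  // don't overwrite a[] too early
         a[tx] = va[i + tx];
         barrier(CLK_LOCAL_MEM_FENCE);  // a[] fully loaded before reading
         for (unsigned int j = 0; j < BLOCKSIZE; j++)
            s += native_cos(a[j] * b);
      }
      out[get_global_id(0)] = s;
   }
   """ % BLOCKSIZE

   ctx = cl.create_some_context()
   queue = cl.CommandQueue(ctx)

   rng = np.random.default_rng(0)
   va = rng.random(nb).astype(np.float32)
   b = np.float32(0.5)

   mf = cl.mem_flags
   va_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=va)
   out = np.empty(BLOCKSIZE, dtype=np.float32)
   out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

   prg = cl.Program(ctx, src).build()
   # one work-group of BLOCKSIZE work-items; each computes the full sum
   prg.cos_sum(queue, (BLOCKSIZE,), (BLOCKSIZE,),
               va_buf, b, out_buf, np.uint32(nb))
   cl.enqueue_copy(queue, out, out_buf).wait()

   # every work-item should produce this value; native_cos is low
   # precision, so only expect rough agreement
   ref = np.sum(np.cos(va * b))
   print("max abs error vs numpy:", np.max(np.abs(out - ref)))

With both barriers the comparison should pass for any local size up to
max_work_group_size; if you drop the first barrier() you should be
able to reproduce the failures you saw for the larger group sizes.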

Best regards,
Bogdan

On Sun, Oct 2, 2011 at 12:20 AM, Vincent Favre-Nicolin
<[email protected]> wrote:
>        Hi,
>
>   I have run into a problem while converting a (py)cuda program to
> (py)opencl. Basically, it seems that only some local_size are working
> reliably, and I fail to understand why.
>   I have written a small program which tests the different platforms
> available, with different local_size, up to max_work_group_size.
>
>   The program compares the OpenCL to the numpy calculation and writes
> if the test passed or failed. Surprisingly the test fails for local
> sizes>1 on CPU (using AMD's platform), with no problem on GPU (nVidia).
>
>   I even put a reqd_work_group_size kernel option, but it does not make
> any difference.
>
>   Any idea ?
>
>     Vincent
>

_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
