Hello,

In your example the condition is necessary: if N is, say, a large prime
number, you cannot create a grid/block pair that contains exactly N
total threads, so you have to skip the excess threads somehow. Moreover,
the "if" statement is not expensive by itself; it only becomes expensive
when it causes execution to diverge within a single warp. Here it can
diverge execution in at most one warp (and only briefly), so it will not
affect performance significantly. It may even be optimized away
entirely by branch predication, but I am not sure about that.
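To make the "excess threads" point concrete, here is a small sketch in
plain Python (no GPU required) of the usual grid-sizing arithmetic. The
values N = 997 and a block size of 256 are hypothetical, chosen only to
illustrate the case where N does not divide evenly into blocks:

```python
import math

# Hypothetical problem size and launch configuration.
N = 997            # a prime number of array elements
BLOCK_SIZE = 256   # threads per block

# Round up so the grid covers all N elements.
num_blocks = math.ceil(N / BLOCK_SIZE)   # 4 blocks
total_threads = num_blocks * BLOCK_SIZE  # 1024 threads launched

# Threads whose idx fails the `if idx < N` guard: only the tail
# of the very last block, which touches at most one warp.
idle_threads = total_threads - N         # 27 idle threads
```

Since the grid size is a multiple of the block size, the launch always
overshoots N unless N happens to be such a multiple, and the guard is
what keeps those 27 trailing threads from writing past the array.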

Best regards,
Bogdan

On Wed, Sep 28, 2011 at 1:08 AM, ericyosho <ericyo...@gmail.com> wrote:
> I'm not sure if this is the right place to ask, but since the question
> is so elementary, I would appreciate some explanation.
> So in every CUDA tutorial example, e.g., to double each element in an
> array, in kernel function, we have the following lines:
>
> int idx = // calculate a unique value for each thread
> if (idx < N) // N is the number of elements of an array
>    a[idx] *= 2;
>
> An "if" branch is a rather expensive operation, so why do we want
> each thread to perform this check?
> Since only one kernel function is allowed to execute on each device
> at a time, why don't we let each thread double its own associated
> value, and afterwards simply copy N elements back to the host?
> Basically, we would omit the "if" check and execute the "double
> values" line unconditionally.
>
> This approach seems more straightforward.
> Am I missing anything?
>
> Best,
> Zhe Yao
> --------------
> Department of Electrical and Computer Engineering
> McGill University
> Montreal, QC, Canada
> H3A 2A7
>
> zhe....@mail.mcgill.ca
>
> _______________________________________________
> PyCUDA mailing list
> PyCUDA@tiker.net
> http://lists.tiker.net/listinfo/pycuda
>
