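Two things:
1.  What happens without the branch if N is not a multiple of blockDim?
The grid is launched in whole blocks, so it then contains more threads than
the array has elements.  Without the branch, the extra threads (idx >= N)
write past the end of the array.  That is an out-of-bounds access: at best
you get a launch failure (the GPU analogue of a segmentation fault), at
worst you silently corrupt other memory.
2.  Branches are not as expensive as you think.  Here only the one warp that
straddles N actually diverges.  Memory reads and writes are the most
expensive things.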
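
To make both points concrete, here is a small host-side sketch in plain
Python (no GPU needed). It only emulates the usual 1D index calculation
idx = blockIdx * blockDim + threadIdx; the function names are made up for
this illustration, and the warp size of 32 is an assumption that matches
current NVIDIA hardware.

```python
def launch_shape(n, block_dim):
    """Smallest 1D grid covering n elements: ceil(n / block_dim) blocks."""
    grid_dim = (n + block_dim - 1) // block_dim
    return grid_dim, grid_dim * block_dim


def out_of_range_indices(n, block_dim):
    """Thread indices that fall outside [0, n) for that launch."""
    grid_dim, _ = launch_shape(n, block_dim)
    return [b * block_dim + t
            for b in range(grid_dim)
            for t in range(block_dim)
            if b * block_dim + t >= n]


def divergent_warps(n, block_dim, warp_size=32):
    """Count warps in which some lanes pass `idx < n` and some fail it.

    warp_size=32 is an assumption, not something queried from a device.
    """
    grid_dim, _ = launch_shape(n, block_dim)
    count = 0
    for b in range(grid_dim):
        for w_start in range(0, block_dim, warp_size):
            lanes = range(w_start, min(w_start + warp_size, block_dim))
            in_range = [b * block_dim + t < n for t in lanes]
            if any(in_range) and not all(in_range):
                count += 1
    return count


def double_guarded(a, block_dim):
    """Emulate the guarded kernel on a host list, one iteration per thread."""
    n = len(a)
    grid_dim, _ = launch_shape(n, block_dim)
    for b in range(grid_dim):
        for t in range(block_dim):
            idx = b * block_dim + t
            if idx < n:          # the guard under discussion
                a[idx] *= 2
    return a
```

With N = 1000 and blockDim = 256, launch_shape gives 4 blocks (1024 threads),
so 24 threads would touch indices 1000..1023 without the guard, and
divergent_warps reports a single divergent warp for the whole launch.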

- bryan

On Tue, Sep 27, 2011 at 8:08 AM, ericyosho <ericyo...@gmail.com> wrote:

> I'm not sure if this is the right place, but since the question is so
> elementary, I would just appreciate some explanation.
> So in every CUDA tutorial example, e.g., to double each element of an
> array, the kernel function contains the following lines:
>
> int idx = blockIdx.x * blockDim.x + threadIdx.x; // unique index per thread (1D launch assumed)
> if (idx < N) // N is the number of elements in the array
>    a[idx] *= 2;
>
> An "if" branch is a rather expensive operation, so why do we want each
> thread to perform this check?
> Since on each device, only one kernel function is allowed to evaluate
> at a time, why don't we let each thread double its own associated
> value, and afterwards we simply copy N elements back to the host.
> Basically, we just omit the "if" check, and go for the "double values"
> line unconditionally.
>
> It seems this approach is more straightforward.
> Am I missing anything?
>
> Best,
> Zhe Yao
> --------------
> Department of Electrical and Computer Engineering
> McGill University
> Montreal, QC, Canada
> H3A 2A7
>
> zhe....@mail.mcgill.ca
>
> _______________________________________________
> PyCUDA mailing list
> PyCUDA@tiker.net
> http://lists.tiker.net/listinfo/pycuda
>