OK, so blockDim.x*gridDim.x gives the total number of threads launched? I had assumed that for small arrays it would just be 1, in which case the for loop would end up looping over the whole array.
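(Working through the arithmetic with a made-up launch of, say, 4 blocks of 256 threads: blockDim.x*gridDim.x = 4*256 = 1024 in every thread, independent of the array size -- so it seems it could only be 1 for a launch of a single block with a single thread.)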
Can you elaborate on why it is said that this approach is slower than one where you can guarantee that size < max_threads? In that case the for loop should only run one iteration.

Thomas

On Tue, May 29, 2012 at 6:48 PM, Andreas Kloeckner <li...@informa.tiker.net> wrote:
> On Tue, 29 May 2012 18:16:52 -0400, Thomas Wiecki <thomas_wie...@brown.edu> wrote:
>> Hi,
>>
>> I saw a couple of times the following idiom being used:
>>
>> const int tidx = blockIdx.x*blockDim.x + threadIdx.x;
>> const int delta = blockDim.x*gridDim.x;
>>
>> curandState local_state = global_state[tidx];
>>
>> for (int idx = tidx; idx < n; idx += delta)
>> {
>>     out[idx] = compute_sth(in[idx]);
>> }
>>
>> I'm not sure I 100% understand what's going on, but it is looping over
>> parts of the array spread delta apart. I think, however, that in the
>> case where there are enough threads available (n < max_threads), only
>> one thread would be doing all the work -- is that correct?
>>
>> Wouldn't a better idiom do something along the lines of:
>>
>> for (int idx = tidx; idx < n; idx += max_threads)
>>
>> thus if n < max_threads it would loop only once per thread and scale
>> up seamlessly. Am I missing something?
>
> These two look exactly the same to me, except you called "delta"
> "max_threads". I'm really squinting hard, I can't find a difference...
>
> Andreas
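To make sure I'm reading the idiom right, here is a minimal self-contained version of it (the kernel name, the doubling computation, and the 4x256 launch configuration are all made up for illustration, and I've dropped the curand state to keep it short):

#include <cstdio>
#include <cuda_runtime.h>

// Sketch of the grid-stride idiom from the thread; compute_sth is
// replaced by a simple doubling so the example is runnable.
__global__ void scale_kernel(const float *in, float *out, int n)
{
    const int tidx  = blockIdx.x*blockDim.x + threadIdx.x; // global thread index
    const int delta = blockDim.x*gridDim.x;                // total threads in the grid

    // Each thread starts at its own index and strides by the total
    // thread count; if n <= delta every participating thread runs the
    // body at most once, otherwise threads sweep the rest of the array.
    for (int idx = tidx; idx < n; idx += delta)
        out[idx] = 2.0f*in[idx];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n*sizeof(float));
    cudaMallocManaged(&out, n*sizeof(float));
    for (int i = 0; i < n; ++i)
        in[i] = (float)i;

    // 4 blocks of 256 threads => delta = 1024 in every thread; the
    // kernel stays correct for any n, it just loops more when n > 1024.
    scale_kernel<<<4, 256>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[100] = %f\n", out[100]); // expect 200.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}

Unless I'm misreading it, delta here is 1024 in every thread no matter how small n is, so for n < 1024 each thread executes at most one iteration already -- which sounds like exactly what I wanted max_threads to do.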