OK, so blockDim.x*gridDim.x gives the total number of threads launched
in the grid? I had assumed that for small arrays it would just be 1, in
which case a single thread's for loop would walk the whole array.

Can you elaborate on why this approach is said to be slower than the
case where you can guarantee that size < max_threads? In that case the
for loop should run at most one iteration per thread.
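
To check my understanding, here is a minimal host-side sketch in plain
Python (not PyCUDA, and nothing here runs on the GPU). The function name
grid_stride_indices and the launch sizes are made up for illustration;
it only mimics which indices each thread of the loop above would visit:

```python
def grid_stride_indices(n, block_dim, grid_dim):
    """Simulate the grid-stride loop on the host.

    Returns one list per simulated thread, holding the array indices
    that thread would visit. Hypothetical helper, for illustration only.
    """
    delta = block_dim * grid_dim  # total number of threads in the grid
    visited = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            tidx = block_idx * block_dim + thread_idx
            # Same pattern as: for (idx = tidx; idx < n; idx += delta)
            visited.append(list(range(tidx, n, delta)))
    return visited


# 16 threads, 10 elements: every thread does at most one iteration.
small = grid_stride_indices(n=10, block_dim=4, grid_dim=4)
print(max(len(v) for v in small))

# 16 threads, 40 elements: threads loop, but the work stays spread out.
large = grid_stride_indices(n=40, block_dim=4, grid_dim=4)
print(max(len(v) for v in large))
```

If this sketch is right, no single thread ends up walking the whole
array unless the kernel is launched with one block of one thread.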

Thomas

On Tue, May 29, 2012 at 6:48 PM, Andreas Kloeckner
<li...@informa.tiker.net> wrote:
> On Tue, 29 May 2012 18:16:52 -0400, Thomas Wiecki <thomas_wie...@brown.edu> wrote:
>> Hi,
>>
>> I saw a couple of times the following idiom being used:
>>
>>         const int tidx = blockIdx.x*blockDim.x + threadIdx.x;
>>         const int delta = blockDim.x*gridDim.x;
>>
>>         curandState local_state = global_state[tidx];
>>
>>         for (int idx = tidx; idx < n; idx += delta)
>>         {
>>              out[idx] = compute_sth(in[idx]);
>>         }
>>
>> I'm not sure I 100% understand what's going on, but it is looping over
>> parts of the array spaced delta apart. However, I think that when there
>> are enough threads available (n < max_threads), only one thread would
>> be doing all the work. Is that correct?
>>
>> Wouldn't a better idiom do something along the lines of:
>>
>> for (int idx = tidx; idx < n; idx += max_threads)
>>
>> thus if n < max_threads it would loop only once per thread and scale
>> up seamlessly. Am I missing something?
>
> These two look exactly the same to me, except you called "delta"
> "max_threads". I'm really squinting hard, I can't find a difference...
>
> Andreas
>

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
