On Wed, May 30, 2012 at 7:28 PM, Andreas Kloeckner <kloeck...@cims.nyu.edu>
wrote:
> On Wed, 30 May 2012 08:10:03 -0400, Thomas Wiecki <thomas_wie...@brown.edu>
> wrote:
>> OK, so blockDim.x*gridDim.x gives the max number of threads?
>
> 'max number of threads' is also bad terminology. Those would be device
> properties. 'Total number of threads in current launch' is correct,
> assuming it's a 1D launch.
>
>> I assumed
>> for small arrays it would just be 1 in which case the for loop would
>> be looping over the whole array.
>>
>> Can you elaborate on why it is said that this approach is slower than
>> when you can guarantee that size < max_threads? In that case the for
>> loop should only go 1 iteration.
>
> Sorry, I'm sure I'm being dense here--but I really don't understand what
> the difference between 'delta' and 'max_threads' in your opinion is.

Yes, you are absolutely correct. I was confused about what
blockDim.x*gridDim.x gives, but it makes sense now: 'delta' and
'max_threads' are in fact equivalent.
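
Just to confirm my understanding, the pattern under discussion is something
like the following sketch (names made up by me), where the stride of the for
loop is blockDim.x*gridDim.x, i.e. the total number of threads in the launch:

import numpy as np
import pycuda.autoinit  # sets up a CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void times_two(float *out, int n)
{
    // Stride over the array by the total number of threads in the launch.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] *= 2.0f;
}
""")
times_two = mod.get_function("times_two")

a = gpuarray.to_gpu(np.ones(2048, dtype=np.float32))
times_two(a, np.int32(a.size), block=(256, 1, 1), grid=(4, 1))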

> If
> you're asking about the maximal number of threads the device can
> support (see above), there are good reasons to do smaller launches, as
> long as they still fill the machine. (and PyCUDA makes sure of that)

What are those good reasons?

Assuming these good reasons exist, what functionality does PyCUDA offer for
doing smaller launches that still fill the machine? I assume you are
referring to the block and grid parameters. So instead of the above I would
write a kernel without the for loop and launch it like this (assuming my
device supports 512 threads per block):

size_out = 2048
out = gpuarray.zeros(size_out, np.float32)
block_size = min(512, size_out)
my_kernel(out, np.int32(size_out), block=(block_size, 1, 1),
          grid=((size_out + block_size - 1) // block_size, 1))
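
Here my_kernel stands for something like this sketch, where the bounds guard
replaces the for loop and the body is a placeholder for the real
per-element work:

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void my_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // the last block may be only partially used
        out[i] = 1.0f;   // placeholder for the real per-element work
}
""")
my_kernel = mod.get_function("my_kernel")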

However, in my actual case I think I can't use this pattern, as I am passing

    pycuda.curandom.XORWOWRandomNumberGenerator().state

to the kernel. I think this stores the generators in shared memory, so using
a grid size > 1 would try to access generators that were never initialized.
Could I instead initialize generators for multiple blocks (i.e., in device
memory) and use the grid approach without the for loop? Would that be more
efficient?
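
Concretely, is something like the following sketch the intended pattern? (I
am guessing at the pycuda.curandom internals here: the generators_per_block
and block_count attributes are my reading of its source for how many states
get initialized, and the for loop stays because the launch size is capped by
the number of generators -- please correct me if that's wrong.)

import numpy as np
import pycuda.autoinit  # sets up a CUDA context
import pycuda.gpuarray as gpuarray
import pycuda.curandom
from pycuda.compiler import SourceModule

# curand_kernel.h cannot sit inside extern "C", hence no_extern_c=True
# plus an explicit extern "C" on the kernel itself.
mod = SourceModule("""
#include <curand_kernel.h>

extern "C" __global__ void fill_normal(curandState *state, float *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = state[id];   // one generator per thread
    for (int i = id; i < n; i += blockDim.x * gridDim.x)
        out[i] = curand_normal(&local);
    state[id] = local;               // save the state for later launches
}
""", no_extern_c=True)
fill_normal = mod.get_function("fill_normal")

gen = pycuda.curandom.XORWOWRandomNumberGenerator()
out = gpuarray.empty(2048, np.float32)
# Assumption: the launch must not use more threads than there are
# initialized generator states, i.e. generators_per_block * block_count.
fill_normal(gen.state, out, np.int32(out.size),
            block=(gen.generators_per_block, 1, 1),
            grid=(gen.block_count, 1))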

I obviously haven't grasped all the concepts completely, so any
clarification would be much appreciated.

Thanks,
Thomas