Thomas Wiecki <thomas_wie...@brown.edu> writes:

> On Thu, Jun 7, 2012 at 11:50 AM, Andreas Kloeckner
> <li...@informa.tiker.net>wrote:
>
>> >> If
>> >> you're asking about the maximal number of threads the device can
>> >> support (see above), there are good reasons to do smaller launches, as
>> >> long as they still fill the machine. (and PyCUDA makes sure of that)
>> >
>> > What are those good reasons?
>>
>> There's some (small) overhead for switching thread blocks compared to
>> just executing code within a block. So more blocks launched -> more of
>> that overhead. The point is that CUDA pretends that there's an
>> 'infinite' number of cores, and it's up to you to choose how many of
>> those to use. Because of the (very slight) penalty, it's best not to
>> stretch the illusion of 'infinitely many cores' too far if it's not
>> necessary. (In fact, much of the overhead is in address computations and
>> such, which can be amortized if there's just a single long for loop.)
>>
>
> I see. In my case each item takes quite a while to compute so taking the
> performance hit that comes with switching thread blocks is probably well
> worth it.

Measure, don't guess.
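[Editor's note: the "single long for loop" Andreas mentions is the grid-stride pattern. A minimal plain-Python sketch of the index math, with made-up launch dimensions (no GPU required; the CUDA equivalent is shown in a comment):]

```python
# Plain-Python simulation of a grid-stride loop: a fixed number of
# threads (grid_size * block_size) covers an array of arbitrary length n.
# All sizes here are made up for illustration.
def grid_stride_indices(grid_size, block_size, n):
    """Yield, for each thread, the element indices that thread touches."""
    stride = grid_size * block_size
    for block in range(grid_size):
        for thread in range(block_size):
            tid = block * block_size + thread
            # In CUDA this would be: for (i = tid; i < n; i += stride)
            yield [i for i in range(tid, n, stride)]

# 32 threads cover 100 elements; every element is touched exactly once.
covered = sorted(i for idxs in grid_stride_indices(4, 8, 100) for i in idxs)
assert covered == list(range(100))
```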

>> Check the code in pycuda.curandom for how it's used there. I'm certain
>> this uses grid_size > 1, otherwise most of the machine would go unused.
>>
>
> I think this is the relevant call, in ```XORWOWRandomNumberGenerator```:
>
>     p.prepared_call((self.block_count, 1),
>                     (self.generators_per_block, 1, 1),
>                     self.state,
>                     self.block_count * self.generators_per_block,
>                     seed.gpudata, offset)
>
> So if I read that correctly, it initializes
> block_count * generators_per_block generators, i.e. the maximum number
> available.
>
> It seems that calling a kernel on an array larger than
> threads_per_block * block_count is in general safe, provided the kernel
> strides over the array: idx scales with the launch configuration, and
> the hardware serializes block execution so that all elements are
> eventually processed while using the maximum number of threads.
>
> However, if I supply generator.state and launch more threads than there
> are initialized generators, this serialization will not help: idx will
> index generator states outside of what was initialized. I think this is
> what caused my problems before.
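[Editor's note: the hazard is easy to see with a naive one-element-per-thread index. With hypothetical sizes, launching more threads than there are initialized generator states produces out-of-range state indices:]

```python
# Hypothetical sizes: 64 initialized generator states, but a launch of
# 128 threads with the naive idx = blockIdx.x * blockDim.x + threadIdx.x.
n_generators = 64
block_size, grid_size = 32, 4          # 128 threads in total
thread_ids = [b * block_size + t
              for b in range(grid_size) for t in range(block_size)]

# Threads 64..127 would read generator.state out of bounds:
out_of_range = [tid for tid in thread_ids if tid >= n_generators]
assert len(out_of_range) == 64
assert min(out_of_range) == n_generators
```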
>
> The solution it seems is to use the for loop approach and then always call
> the kernel like this:
>
> my_kernel(generator.state, out,
>           block=(generator.generators_per_block, 1, 1),
>           grid=(generator.block_count, 1))
>
>
> That way I am sure I will never try to access uninitialized generators and
> only use the for loop if I have to.
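[Editor's note: the invariant this launch configuration buys can be checked directly. With block=(generators_per_block, 1, 1) and grid=(block_count, 1), every thread id maps to exactly one initialized generator; the sizes below are hypothetical:]

```python
# Hypothetical sizes mirroring generator.generators_per_block and
# generator.block_count: the launch uses exactly one thread per
# initialized generator state.
generators_per_block, block_count = 64, 12
n_generators = generators_per_block * block_count

thread_ids = [b * generators_per_block + t
              for b in range(block_count)
              for t in range(generators_per_block)]

# One thread per generator: no out-of-range access, no idle generator.
assert sorted(thread_ids) == list(range(n_generators))
```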
>
> Does that make sense?

Yes.

Andreas


_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
