(Apologies if threading is broken; for some reason I didn't receive
this reply directly!)

On Mon Oct 26 14:26:34 GMT 2020, Jakub Jelinek wrote:
> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation
> > -- we observed an average 11% or so performance improvement with
> > this patch across a set of accelerated GPU benchmarks on one
> > machine (results vary according to individual benchmark and with
> > hardware used).
> > 
> > A given kernel launch will reuse the stack block from the previous
> > launch if it is large enough, else it is freed and reallocated. A
> > slight caveat is that memory will not be freed until the device is
> > closed, so e.g. if code is using highly variable launch geometries
> > and large amounts of GPU RAM, you might run out of resources
> > slightly quicker with this patch.
> > 
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch --
> > it's safe for the GPU and CPU to continue executing in parallel at
> > that point, because e.g. copies-back from the device will be
> > synchronised properly with kernel completion anyway.
> > 
> > In turn, the last part necessitates a change to the way "(perhaps
> > abort was called)" errors are detected and reported.
> > 
> > Tested with offloading to NVPTX. OK for mainline?
> 
> I'm afraid I don't know the plugin nor CUDA well enough to review this
> properly (therefore I'd like to hear from Thomas, Tom and/or
> Alexander). Anyway, just two questions. Wouldn't it make sense to add
> some upper bound limit over which it wouldn't cache the stacks, so
> that it would cache most of the time for normal programs, but one
> really excessive kernel followed by many normal ones wouldn't result
> in memory allocation failures?

Yes, that might work -- another idea is to free the cached stacks and
retry if a memory allocation fails, though that could perhaps lead to
worse fragmentation. For the upper-bound idea we'd need to pick a
sensible maximum limit: something like 16MB, maybe? Or it could be
user-controllable, or some fraction of the GPU's total memory?
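
To make the upper-bound idea concrete, I'd imagine something roughly
along these lines (just a sketch -- the struct and function names are
made up for illustration, not taken from the patch, and the 16MB cap is
only an example value):

  #include <cuda.h>

  #define STACK_CACHE_LIMIT (16 * 1024 * 1024)  /* Example cap only.  */

  struct ptx_dev_stacks
  {
    CUdeviceptr block;  /* Cached stack block, or 0 if none.  */
    size_t size;        /* Size of the cached block in bytes.  */
  };

  /* Return a stack block of at least SIZE bytes, reusing the cached
     block when it is large enough.  The caller is assumed to hold the
     per-device lock.  */

  static CUresult
  get_stack_block (struct ptx_dev_stacks *dev, size_t size,
		   CUdeviceptr *out)
  {
    if (dev->block && dev->size >= size)
      {
	/* Cached block is large enough: reuse it.  */
	*out = dev->block;
	return CUDA_SUCCESS;
      }

    if (dev->block)
      {
	/* Cached block is too small: free and reallocate.  */
	cuMemFree (dev->block);
	dev->block = 0;
	dev->size = 0;
      }

    CUresult r = cuMemAlloc (&dev->block, size);
    if (r != CUDA_SUCCESS)
      return r;

    dev->size = size;
    *out = dev->block;
    return CUDA_SUCCESS;
  }

  /* After the launch has completed: only keep blocks below the cap
     cached, so a single huge launch doesn't pin GPU memory forever.  */

  static void
  release_stack_block (struct ptx_dev_stacks *dev)
  {
    if (dev->block && dev->size > STACK_CACHE_LIMIT)
      {
	cuMemFree (dev->block);
	dev->block = 0;
	dev->size = 0;
      }
  }

The free-and-retry fallback could then sit in the allocation path
instead of (or as well as) the cap, at the cost of the fragmentation
concern above.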

> And, in which context are cuStreamAddCallback-registered callbacks
> run? E.g. if it is inside of an asynchronous interrupt, using locking
> in there might not be the best thing to do.

The cuStreamAddCallback API is documented here:

https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__STREAM.html#group__CUDA__STREAM_1g613d97a277d7640f4cb1c03bd51c2483

We're quite limited in what we can do in the callback function, since
"Callbacks must not make any CUDA API calls". So what *can* a callback
function do? The documentation mentions that the callback's execution
will "pause" the stream it is logically running on. So could we get a
deadlock, e.g. if multiple host threads are launching offload kernels
simultaneously? I don't think so, but I don't know how to prove it!
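
For reference, the kind of thing the callback ends up doing is roughly
the following (again only a sketch with made-up names, not the actual
plugin code): it records the kernel's completion status -- with no CUDA
API calls -- and the host thread picks that status up later, e.g. when
deciding whether to report a "perhaps abort was called" error.

  #include <cuda.h>
  #include <pthread.h>

  struct kernel_completion
  {
    pthread_mutex_t lock;
    int done;
    CUresult status;  /* E.g. CUDA_ERROR_LAUNCH_FAILED after abort.  */
  };

  /* Stream callback: must not make any CUDA API calls, so it only
     records the result for the host side.  */

  static void CUDA_CB
  kernel_finished (CUstream stream, CUresult status, void *data)
  {
    struct kernel_completion *c = data;
    (void) stream;
    pthread_mutex_lock (&c->lock);
    c->done = 1;
    c->status = status;
    pthread_mutex_unlock (&c->lock);
  }

  /* Registration, immediately after cuLaunchKernel on the same stream:

       cuStreamAddCallback (stream, kernel_finished, &completion, 0);

     The host thread can then continue without cuStreamSynchronize and
     inspect completion.status when it next needs the result.  */

Whether taking the mutex there is acceptable is exactly Jakub's
question, of course -- it depends on the context the callback runs in.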

Thanks,

Julian
