On Mon, 26 Oct 2020, Jakub Jelinek wrote:

> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation --
> > we observed an average 11% or so performance improvement with this patch
> > across a set of accelerated GPU benchmarks on one machine (results vary
> > according to individual benchmark and with hardware used).

In this patch you're folding two changes together: reuse of allocated stacks
and removing one host-device synchronization.  Why is that?  Can you report
performance change separately for each change (and split out the patches)?

> > A given kernel launch will reuse the stack block from the previous launch
> > if it is large enough, else it is freed and reallocated. A slight caveat
> > is that memory will not be freed until the device is closed, so e.g. if
> > code is using highly variable launch geometries and large amounts of
> > GPU RAM, you might run out of resources slightly quicker with this patch.
> > 
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch -- it's
> > safe for the GPU and CPU to continue executing in parallel at that point,
> > because e.g. copies-back from the device will be synchronised properly
> > with kernel completion anyway.

I don't think this explanation is sufficient. My understanding is that OpenMP
forbids the host from proceeding asynchronously after a target construct unless
it is a 'target nowait' construct. This may be observable if, for example, there
is a printf in the target region (or if it accesses memory via host pointers).
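
For instance, with a synchronous construct like the following (an illustrative
test case I made up, not something from the patch), the host-side printf must
not run before the device-side printf and the update of x have completed:

#include <stdio.h>

int
main (void)
{
  int x = 0;

#pragma omp target map(tofrom: x)
  {
    x = 42;
    printf ("on device\n");
  }

  /* No 'nowait' clause, so the target construct is synchronous: the device
     printf and the new value of x must be visible before this line runs.  */
  printf ("on host: x = %d\n", x);
  return 0;
}

If the plugin lets the host continue while the kernel is still running, the
two printfs can appear out of order even though the program is conforming.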

So this really needs to be a separate patch with more explanation of why this
is okay (if it is okay).

> > In turn, the last part necessitates a change to the way "(perhaps abort
> > was called)" errors are detected and reported.

As already mentioned, using callbacks is problematic. Plus, I'm sure the way
you lock out other threads is a performance loss when multiple threads have
target regions: even though they will not run concurrently on the GPU, you
still want to allow host threads to submit GPU jobs while the GPU is occupied.

I would suggest having a small pool (up to 3 entries, perhaps) of stacks. Then
you can arrange reuse without totally serializing host threads on target
regions.
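
Roughly something like this (an untested sketch, not based on the actual
plugin code; stack_acquire/stack_release and the field names are made up):

#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

#define STACK_POOL_SIZE 3

struct stack_entry
{
  CUdeviceptr base;  /* Device address of the stack block, 0 if none.  */
  size_t size;       /* Allocated size in bytes.  */
  bool in_use;       /* Set while a kernel launch is using this block.  */
};

static struct stack_entry stack_pool[STACK_POOL_SIZE];
static pthread_mutex_t stack_pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return E to the pool once the launch (and anything depending on its
   stack) has completed.  */

static void
stack_release (struct stack_entry *e)
{
  pthread_mutex_lock (&stack_pool_lock);
  e->in_use = false;
  pthread_mutex_unlock (&stack_pool_lock);
}

/* Acquire a cached stack block of at least SIZE bytes, growing an entry if
   it is too small.  Returns NULL if every entry is busy, in which case the
   caller can fall back to a one-off cuMemAlloc/cuMemFree for this launch
   rather than blocking behind other host threads.  */

static struct stack_entry *
stack_acquire (size_t size)
{
  struct stack_entry *e = NULL;

  pthread_mutex_lock (&stack_pool_lock);
  for (int i = 0; i < STACK_POOL_SIZE; i++)
    if (!stack_pool[i].in_use)
      {
        e = &stack_pool[i];
        e->in_use = true;
        break;
      }
  pthread_mutex_unlock (&stack_pool_lock);

  if (e && e->size < size)
    {
      /* The entry is reserved for this thread, so it is safe to reallocate
         it outside the lock.  */
      if (e->base)
        cuMemFree (e->base);
      if (cuMemAlloc (&e->base, size) != CUDA_SUCCESS)
        {
          e->base = 0;
          e->size = 0;
          stack_release (e);
          return NULL;
        }
      e->size = size;
    }

  return e;
}

With three entries, up to three host threads can have a launch in flight
without waiting on each other, and a fourth thread just pays for a private
allocation for that one launch instead of serializing on the cache.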

Alexander
