On Mon, 26 Oct 2020, Jakub Jelinek wrote:

> On Mon, Oct 26, 2020 at 07:14:48AM -0700, Julian Brown wrote:
> > This patch adds caching for the stack block allocated for offloaded
> > OpenMP kernel launches on NVPTX. This is a performance optimisation --
> > we observed an average 11% or so performance improvement with this
> > patch across a set of accelerated GPU benchmarks on one machine
> > (results vary according to individual benchmark and with hardware
> > used).
In this patch you're folding two changes together: reuse of allocated
stacks and removing one host-device synchronization. Why is that? Can
you report the performance change separately for each change (and split
out the patches)?

> > A given kernel launch will reuse the stack block from the previous
> > launch if it is large enough, else it is freed and reallocated. A
> > slight caveat is that memory will not be freed until the device is
> > closed, so e.g. if code is using highly variable launch geometries
> > and large amounts of GPU RAM, you might run out of resources slightly
> > quicker with this patch.
> >
> > Another way this patch gains performance is by omitting the
> > synchronisation at the end of an OpenMP offload kernel launch -- it's
> > safe for the GPU and CPU to continue executing in parallel at that
> > point, because e.g. copies-back from the device will be synchronised
> > properly with kernel completion anyway.

I don't think this explanation is sufficient. My understanding is that
OpenMP forbids the host from proceeding asynchronously after a target
construct unless it is a 'target nowait' construct. The difference may
be observable if there's a printf in the target region, for example, or
if the region accesses memory via host pointers.

So this really needs to be a separate patch with more explanation of why
it is okay (if it is okay).

> > In turn, the last part necessitates a change to the way "(perhaps
> > abort was called)" errors are detected and reported.

As already mentioned, using callbacks is problematic. Plus, I'm sure the
way you lock out other threads is a performance loss when multiple
threads have target regions: even though they will not run concurrently
on the GPU, you still want to allow host threads to submit GPU jobs
while the GPU is occupied.

I would suggest having a small pool (up to 3 entries, perhaps) of
stacks. Then you can arrange reuse without totally serializing host
threads on target regions.

Alexander
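
P.S. To make the ordering point concrete, here is a minimal example of
the kind of program I have in mind (illustrative only, not taken from
the patch or the testsuite):

#include <stdio.h>

int
main (void)
{
  /* A non-nowait target region: OpenMP requires it to be complete
     before the host continues past this point.  */
  #pragma omp target
    printf ("device: inside target region\n");

  /* If the plugin no longer waits for the kernel, nothing obviously
     guarantees the device line above is emitted before this one.  */
  printf ("host: after target region\n");
  return 0;
}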
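
P.P.S. Roughly what I mean by a small pool of stacks -- this is just a
sketch with invented names (stack_acquire/stack_release, stack_pool and
so on), not the plugin's actual data structures, and with error checking
of the CUDA calls omitted:

#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>
#include <cuda.h>

#define STACK_POOL_SIZE 3

struct stack_entry
{
  CUdeviceptr base;  /* Device address of the stack block, 0 if none.  */
  size_t size;       /* Allocated size in bytes.  */
  bool in_use;       /* Claimed by a host thread for a target region.  */
};

static struct stack_entry stack_pool[STACK_POOL_SIZE];
static pthread_mutex_t stack_pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take a free entry, growing its block if it is too small.  Only the
   bookkeeping is done under the lock; the kernel launch itself happens
   outside it, so other host threads can still submit work while the
   GPU is busy.  */

static CUdeviceptr
stack_acquire (size_t size, int *slot)
{
  CUdeviceptr result = 0;
  pthread_mutex_lock (&stack_pool_lock);
  for (int i = 0; i < STACK_POOL_SIZE; i++)
    if (!stack_pool[i].in_use)
      {
        if (stack_pool[i].size < size)
          {
            if (stack_pool[i].base)
              cuMemFree (stack_pool[i].base);
            cuMemAlloc (&stack_pool[i].base, size);
            stack_pool[i].size = size;
          }
        stack_pool[i].in_use = true;
        *slot = i;
        result = stack_pool[i].base;
        break;
      }
  pthread_mutex_unlock (&stack_pool_lock);
  /* A zero return means all entries are busy; the caller would fall
     back to a one-off allocation in that case.  */
  return result;
}

static void
stack_release (int slot)
{
  pthread_mutex_lock (&stack_pool_lock);
  stack_pool[slot].in_use = false;
  pthread_mutex_unlock (&stack_pool_lock);
}

The blocks would then be freed when the device is closed, same as in
your patch.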