Hi!

On 2021-04-16T17:05:24+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> On 15/04/2021 18:26, Thomas Schwinge wrote:
>>> and optimisation, since shared memory might be faster than
>>> the main memory on a GPU.
>>
>> Do we potentially have a problem that making more use of (scarce)
>> gang-private memory may negatively affect performance, because
>> potentially fewer OpenACC gangs may then be launched to the GPU
>> hardware in parallel?  (Of course, OpenACC semantics conformance
>> firstly is more important than performance, but there may be ways to
>> be conformant and performant; "quality of implementation".)  Have you
>> run any such performance testing with the benchmarking codes that
>> we've got set up?
>>
>> (As I'm more familiar with that, I'm using nvptx offloading examples
>> in the following, whilst assuming that similar discussion may apply
>> for GCN offloading, which uses similar hardware concepts, as far as I
>> remember.)

> Yes, that could happen.

Thanks for sharing the GCN perspective.

> However, there's space for quite a lot of scalars before performance
> is affected: 64KB of LDS memory shared by a hardware-defined maximum
> of 40 threads

(Instead of threads, something like thread blocks, I suppose?)

> gives about 1.5KB of space for worker-reduction variables and
> gang-private variables.

PTX, as I understand this, may generally have a lot of Thread Blocks in
flight: all for the same GPU kernel as well as any GPU kernels running
asynchronously/generally concurrently (system-wide), and libgomp does
try launching a high number of Thread Blocks ('num_gangs') (for purposes
of hiding memory access latency?).  Random example:

    nvptx_exec: kernel t0_r$_omp_fn$0: launch gangs=1920, workers=32, vectors=32

With that, PTX's 48 KiB of '.shared' memory per SM (processor) then
don't go very far anymore: just '48 * 1024 / 1920 = 25' bytes of
gang-private memory available for each of the 1920 gangs:
'double x, y, z'?  (... for the simple case where just one GPU kernel is
executing.)

(I suppose that calculation is valid for a GPU hardware variant where
there is just one SM.  If there are several (typically in the order of a
few dozens?), I suppose the Thread Blocks launched will be distributed
over all these, thus improving the situation correspondingly.)

(And of course, there are certainly other factors that also limit the
number of Thread Blocks that are actually executing in parallel.)
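To make that 'double x, y, z' case concrete, here is a minimal sketch of
my own (not taken from the patch under discussion; 'num_gangs(1920)'
merely mirrors the example launch above): with 'private' on a gang loop,
each gang gets its own copy of 'x', 'y', 'z', that is, exactly the kind
of data being discussed for placement in PTX '.shared' respectively GCN
LDS memory.

    #include <stdio.h>

    #define N 1024

    int
    main (void)
    {
      static double a[N], b[N], c[N];
      for (int i = 0; i < N; i++)
        {
          a[i] = i;
          b[i] = 2 * i;
        }

      double x, y, z;
      /* One private copy of 'x', 'y', 'z' per gang; 'num_gangs(1920)'
         is for illustration only and matches the example launch above.  */
    #pragma acc parallel loop gang num_gangs(1920) private(x, y, z) \
      copyin(a, b) copyout(c)
      for (int i = 0; i < N; i++)
        {
          x = a[i];
          y = b[i];
          z = x + y;
          c[i] = z;
        }

      printf ("c[10] = %f\n", c[10]);
      return 0;
    }

Per the calculation above, those three gang-private doubles alone would
already use 24 of the ~25 bytes available per gang.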
> We might have a problem if there are large private arrays.

Yes, that's understood.

Also directly related: the problem that comes with supporting
worker-private memory, which basically amounts to the gang-private
memory requirement multiplied by the number of workers.  (Out of scope
at present.)

> I believe we have a "good enough" solution for the usual case

So you believe that.  ;-)  It's certainly what I'd hope, too!  But we
don't know yet whether there's any noticeable performance impact if we
run with (potentially) lesser parallelism, hence my question whether
this patch has been run through performance testing.

> and a v2.0 full solution is going to be big and hairy enough for a
> whole patch of its own (requiring per-gang dynamic allocation, a
> different memory address space and possibly different instruction
> selection too).

Agreed that a fully dynamic allocation scheme is likely going to be
ugly, so I'd certainly like to avoid that.

Before attempting that, we'd first try to optimize gang-private memory
allocation: so that it's function-local (and thus GPU kernel-local)
instead of device-global (assuming that's indeed possible), and try not
to use gang-private memory in cases where it's not actually necessary
(semantically not observable, and not needed for performance reasons).


Regards
 Thomas


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf