On 15/04/2021 18:26, Thomas Schwinge wrote:
>> and optimisation, since shared memory might be faster than
>> the main memory on a GPU.
>
> Do we potentially have a problem that making more use of (scarce)
> gang-private memory may negatively affect performance, because potentially
> fewer OpenACC gangs may then be launched to the GPU hardware in parallel?
> (Of course, OpenACC semantics conformance firstly is more important than
> performance, but there may be ways to be conformant and performant;
> "quality of implementation".)  Have you run any such performance testing
> with the benchmarking codes that we've got set up?
>
> (As I'm more familiar with that, I'm using nvptx offloading examples in
> the following, whilst assuming that similar discussion may apply for GCN
> offloading, which uses similar hardware concepts, as far as I remember.)

Yes, that could happen.  However, there's space for quite a lot of
scalars before performance is affected: 64KB of LDS memory shared by a
hardware-defined maximum of 40 threads gives about 1.5KB of space for
worker-reduction variables and gang-private variables.  We might have a
problem if there are large private arrays.
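
To make that concrete, here's a minimal illustrative example (just a
sketch, not one of the patch's testcases): a gang-private scalar needs
only a few bytes of LDS per gang, whereas a gang-private array the size
of "buf" below (4KB) would on its own exceed the roughly
1.5KB-per-gang budget mentioned above.

  /* Hypothetical sketch: "tmp" is cheap gang-private state; "buf" is a
     large gang-private array whose LDS footprint could reduce how many
     gangs the hardware can keep resident at once.  */
  void
  scale (int n, const float *restrict a, float *restrict b)
  {
    float tmp;
    float buf[1024];  /* 4KB per gang.  */
  #pragma acc parallel loop gang private(tmp, buf) \
              copyin(a[0:n]) copyout(b[0:n])
    for (int i = 0; i < n; i++)
      {
        tmp = 2.0f * a[i];
        buf[i % 1024] = tmp;
        b[i] = buf[i % 1024];
      }
  }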

I believe we have a "good enough" solution for the usual case, and a
v2.0 full solution is going to be big and hairy enough for a whole
patch of its own (requiring per-gang dynamic allocation, a different
memory address space, and possibly different instruction selection
too).

Andrew
