https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345
            Bug ID: 104345
           Summary: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: openacc
          Severity: minor
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: sayle at gcc dot gnu.org, vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

First, I do acknowledge that commit beed3f8f60492289ca6211d86c54a2254a642035 "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" generally does improve nvptx code generation -- thanks!

I've however run into one case where it causes a regression:

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 execution test
    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 execution test

    libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.

Same for C++.

That's with a Nvidia Tesla K20c, Driver Version: 346.46 -- so, rather old.

By the way: the subsequent commit 659f8161f61d3f75c3a47cf646147e8f7b4dcb34 "nvptx: Add support for PTX's cnot instruction" is not helpful or even relevant here; there are no 'cnot's appearing in the PTX code loaded to the GPU (per 'GOMP_DEBUG=1' execution).

Per 'diff' of 'GOMP_DEBUG=1' execution we indeed see *more* registers used after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" than before. For '-O0':

    [...]
     Link log info    : 4 bytes gmem
     info    : Function properties for 'gang$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
     info    : Function properties for 'worker$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
     info    : Function properties for 'vector$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
       GOMP_OFFLOAD_openacc_exec: prepare mappings
       nvptx_exec: kernel vector$_omp_fn$0: launch gangs=1, workers=1, vectors=32
       nvptx_exec: kernel vector$_omp_fn$0: finished
    -GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb30, kinds=0x60bb48
    +GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7fff99653530, size=0x60bad0, kinds=0x60bae8
       GOMP_OFFLOAD_openacc_exec: prepare mappings
    +
    +libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.
    -  nvptx_exec: kernel worker$_omp_fn$0: launch gangs=1, workers=32, vectors=32
    -  nvptx_exec: kernel worker$_omp_fn$0: finished
    -GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb50, kinds=0x60bb68
    -  GOMP_OFFLOAD_openacc_exec: prepare mappings
    -  nvptx_exec: kernel gang$_omp_fn$0: launch gangs=32, workers=1, vectors=32
    -  nvptx_exec: kernel gang$_omp_fn$0: finished

Similar for '-O2', just with less stack usage.

Cross-checking with a more recent Driver Version: 450.119.03, I'm only seeing slightly increased register usage; 52 registers after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" compared to 51 registers before:

    [...]
    Link log info    : 4 bytes gmem
    info    : Function properties for 'vector$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    info    : Function properties for 'worker$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    info    : Function properties for 'gang$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    [...]

This suggests that compared to before, after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" GCC is generating certain PTX code sequences that the Driver/JIT fails to understand/optimize?

While not ideal, the code still executes fine (with newish Driver/JIT), and I'm thus OK if we classify that as not worth looking into -- but I at least wanted to report my findings: maybe there's a way to tune the GCC/nvptx code generation to the PTX -> SASS compiler's liking?

Possibly (but that's just guessing!), the reason might be around the following PTX code change:

    [...]
    -setp.leu.f64 %r82,%r25,0d7fefffffffffffff;
    -@ ! %r82 bra $L3;
    +@ %r78 bra $L20;
    +setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
    +bra $L3;
    +$L20:
     .loc 2 1976 21
    -setp.leu.f64 %r83,%r57,0d7fefffffffffffff;
    -@ %r83 bra $L19;
    +setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
    +@ %r138 bra $L19;
     $L3:
    [...]

From a quick look, I read this to mean that the originally ("before") unconditional 'setp.leu.f64 %r82,%r25,0d7fefffffffffffff;' is now ("after") done conditionally.

Maybe related, maybe not: when curiously 'diff'ing the before vs. after nvptx-none target libraries, I noticed amongst all the "noise" (improved code generation):

'nvptx-none/libatomic/gcas.o':

    [...]
     atom.cas.b32 %r137,[%r34],%r136,%r139;
     setp.eq.u32 %r140,%r137,%r136;
     selp.u32 %r138,1,0,%r140;
    -setp.ne.u32 %r141,%r138,0;
    -@ %r141 bra $L21;
    +@ %r140 bra $L18;
     st.u32 [%r201],%r137;
    -bra $L19;
    +$L18:
    +setp.eq.u32 %r142,%r138,0;
    +@ %r142 bra $L19;
     $L21:
    [...]

... which again looks like a pattern where an originally ("before") unconditional 'setp.ne.u32 %r141,%r138,0;' is now ("after") done conditionally. Similar in other files -- but I certainly didn't look in detail, and I'm certainly not claiming this to be/cause any actual problem.
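
A rough sanity check of the 'x * y <= 896' limit against the register counts reported above can be sketched as follows. This is only a back-of-the-envelope model, not what libgomp or the driver actually computes: the hardware constants (65536 registers per thread block, per-warp allocation in units of 256 registers, at most 1024 threads per block) are assumptions for a Kepler-class GPU such as the Tesla K20c, and 'max_threads' is a hypothetical helper name.

```python
# Assumed hardware constants for a Kepler-class GPU (e.g. Tesla K20c).
# These are NOT taken from the report, from libgomp, or from the driver.
REGS_PER_BLOCK = 65536        # assumption: 32-bit registers available to one block
WARP_SIZE = 32                # threads per warp
REG_ALLOC_UNIT = 256          # assumption: per-warp register allocation granularity
MAX_THREADS_PER_BLOCK = 1024  # assumption: hardware limit on block size


def max_threads(regs_per_thread):
    """Estimate how many threads of a kernel fit into one block (hypothetical helper)."""
    # Registers needed by one warp, rounded up to the allocation unit.
    per_warp = regs_per_thread * WARP_SIZE
    per_warp = -(-per_warp // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    # Whole warps that fit into the register file, capped by the block-size limit.
    warps = REGS_PER_BLOCK // per_warp
    return min(warps * WARP_SIZE, MAX_THREADS_PER_BLOCK)


if __name__ == "__main__":
    for regs in (51, 52, 68):
        print(regs, "registers ->", max_threads(regs), "threads")
```

Under these assumptions, 68 registers per thread leaves room for 896 threads, which is less than the 32 * 32 = 1024 requested for 'worker$_omp_fn$0', whereas 51 or 52 registers still allow the full 1024.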

And, I've spotted a few cases where we're generating "maybe worse" code:

'nvptx-none/libgomp/openacc.o' (complete 'diff'):

    @@ -25,6 +25,7 @@
     .reg .u64 %r28;
     .reg .u32 %r29;
     .reg .u32 %r30;
    +.reg .pred %r31;
     mov.u64 %r27,%ar0;
     st.u64 [%frame+16],%r27;
     ld.u64 %r28,[%frame+16];
    @@ -38,8 +39,8 @@
     ld.param.u32 %r30,[%value_in];
     }
     mov.u32 %r23,%r30;
    -set.u32.ne.u32 %r24,%r23,0;
    -neg.s32 %r24,%r24;
    +setp.ne.u32 %r31,%r23,0;
    +selp.u32 %r24,1,0,%r31;
     st.u32 [%frame],%r24;
     ld.u32 %r25,[%frame];
     mov.u32 %r26,%r25;

'nvptx-none/newlib/libc/reent/lib_a-renamer.o' (complete 'diff'):

    @@ -28,6 +28,8 @@
     .reg .u32 %r32;
     .reg .pred %r33;
     .reg .u32 %r36;
    +.reg .u32 %r38;
    +.reg .pred %r39;
     mov.u64 %r26,%ar0;
     mov.u64 %r27,%ar1;
     mov.u64 %r28,%ar2;
    @@ -58,7 +60,9 @@
     ld.param.u32 %r36,[%value_in];
     }
     .loc 2 57 6
    -set.u32.eq.u32 %r25,%r36,-1;
    +setp.eq.u32 %r39,%r36,-1;
    +selp.u32 %r38,1,0,%r39;
    +neg.s32 %r25,%r38;
     $L1:
     .loc 2 64 1
     mov.u32 %value,%r25;

'nvptx-none/newlib/libc/stdio/lib_a-remove.o' (complete 'diff'):

    @@ -24,6 +24,8 @@
     .reg .u64 %r26;
     .reg .u64 %r27;
     .reg .u32 %r30;
    +.reg .u32 %r33;
    +.reg .pred %r34;
     mov.u64 %r26,%ar0;
     mov.u64 %r27,%ar1;
     .loc 2 65 7
    @@ -37,7 +39,9 @@
     ld.param.u32 %r30,[%value_in];
     }
     .loc 2 65 6
    -set.u32.eq.u32 %value,%r30,-1;
    +setp.eq.u32 %r34,%r30,-1;
    +selp.u32 %r33,1,0,%r34;
    +neg.s32 %value,%r33;
     .loc 2 69 1
     st.param.u32 [%value_out],%value;
     ret;
    @@ -51,6 +55,8 @@
     .reg .u64 %r27;
     .reg .u64 %r30;
     .reg .u32 %r31;
    +.reg .u32 %r34;
    +.reg .pred %r35;
     mov.u64 %r27,%ar0;
     .loc 2 65 7
     ld.global.u64 %r30,[_impure_ptr];
    @@ -64,7 +70,9 @@
     ld.param.u32 %r31,[%value_in];
     }
     .loc 2 65 6
    -set.u32.eq.u32 %value,%r31,-1;
    +setp.eq.u32 %r35,%r31,-1;
    +selp.u32 %r34,1,0,%r35;
    +neg.s32 %value,%r34;
     .loc 2 77 1
     st.param.u32 [%value_out],%value;
     ret;

'nvptx-none/newlib/libm/common/lib_a-s_rint.o' (complete 'diff'):

    @@ -80,6 +80,7 @@
     .reg .u32 %r119;
     .reg .pred %r120;
     .reg .u32 %r122;
    +.reg .pred %r123;
     .reg .u32 %r124;
     .reg .u32 %r125;
     .reg .u32 %r126;
    @@ -197,7 +198,8 @@
     setp.eq.u32 %r120,%r41,0;
     @ %r120 bra $L5;
     .loc 2 114 9
    -set.u32.eq.u32 %r122,%r58,19;
    +setp.eq.u32 %r123,%r58,19;
    +selp.u32 %r122,1,0,%r123;
     shl.b32 %r41,%r122,31;
     .loc 2 115 17
     not.b32 %r124,%r64;

I haven't looked if that's "actually worse" in SASS, or just "maybe worse" in the intermediate PTX representation. (... and is most certainly not related to the regression mentioned before.)

It seems unlikely, but I'll report in case anything here changes due to Tom's several GCC/nvptx back end commits yesterday.

So, please close this PR as "won't fix" unless you see something here that you'd like to look into.
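
The before/after sequences in the "maybe worse" cases above can be checked for equivalence with a small model. This is a sketch in Python rather than PTX; all helper names are hypothetical. It assumes the documented PTX behaviour that an integer-typed 'set' writes 0xffffffff for true and 0 for false, whereas 'setp' followed by 'selp ...,1,0,...' yields 1 or 0.

```python
MASK = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values


def set_u32(cond):
    """Model of 'set.u32.<cmp>.u32': all ones if the comparison holds, else 0."""
    return MASK if cond else 0


def selp_u32(a, b, pred):
    """Model of 'selp.u32 d,a,b,p': a if the predicate is true, else b."""
    return (a if pred else b) & MASK


def neg_s32(x):
    """Model of 'neg.s32' with 32-bit wraparound."""
    return (-x) & MASK


def shl_b32(x, n):
    """Model of 'shl.b32' for shift amounts below 32."""
    return (x << n) & MASK


# openacc.o: set + neg (before) vs. setp + selp (after)
def openacc_before(r23):
    return neg_s32(set_u32(r23 != 0))


def openacc_after(r23):
    return selp_u32(1, 0, r23 != 0)


# lib_a-renamer.o / lib_a-remove.o: set (before) vs. setp + selp + neg (after)
def remove_before(r30):
    return set_u32(r30 == MASK)  # MASK is -1 as an unsigned 32-bit value


def remove_after(r30):
    return neg_s32(selp_u32(1, 0, r30 == MASK))


# lib_a-s_rint.o: set + shl (before) vs. setp + selp + shl (after)
def rint_before(r58):
    return shl_b32(set_u32(r58 == 19), 31)


def rint_after(r58):
    return shl_b32(selp_u32(1, 0, r58 == 19), 31)


if __name__ == "__main__":
    for v in (0, 1, 19, MASK):
        print(hex(v), openacc_before(v) == openacc_after(v),
              remove_before(v) == remove_after(v),
              rint_before(v) == rint_after(v))
```

In this model the results agree for every input, so the difference is only in instruction count at the PTX level: two instructions before and after in 'openacc.o', one versus three in 'lib_a-renamer.o' and 'lib_a-remove.o', and one versus two (not counting the shift) in 'lib_a-s_rint.o'. Whether that survives into SASS is not something the model can show.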