[Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"

tschwinge at gcc dot gnu.org via Gcc-bugs Wed, 02 Feb 2022 06:10:16 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345


            Bug ID: 104345
           Summary: nvptx: "regression" after "nvptx: Transition nvptx
                    backend to STORE_FLAG_VALUE = 1"
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: openacc
          Severity: minor
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: sayle at gcc dot gnu.org, vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

First, I do acknowledge that commit beed3f8f60492289ca6211d86c54a2254a642035
"nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" generally does
improve nvptx code generation -- thanks!

I've however run into one case where it causes a regression:

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  execution test
    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  execution test

    libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.

Same for C++.

That's with an Nvidia Tesla K20c, Driver Version: 346.46 -- so, rather old.

By the way: the subsequent commit 659f8161f61d3f75c3a47cf646147e8f7b4dcb34
"nvptx: Add support for PTX's cnot instruction" is not helpful or even relevant
here; there are no 'cnot's appearing in the PTX code loaded to the GPU (per
'GOMP_DEBUG=1' execution).

Per a 'diff' of the 'GOMP_DEBUG=1' output, we indeed see *more* registers used
after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" than before.
For '-O0':

    [...]
     Link log info    : 4 bytes gmem
     info    : Function properties for 'gang$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
     info    : Function properties for 'worker$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
     info    : Function properties for 'vector$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
       GOMP_OFFLOAD_openacc_exec: prepare mappings
       nvptx_exec: kernel vector$_omp_fn$0: launch gangs=1, workers=1, vectors=32
       nvptx_exec: kernel vector$_omp_fn$0: finished
    -GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb30, kinds=0x60bb48
    +GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7fff99653530, size=0x60bad0, kinds=0x60bae8
       GOMP_OFFLOAD_openacc_exec: prepare mappings
    +
    +libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.
    -  nvptx_exec: kernel worker$_omp_fn$0: launch gangs=1, workers=32, vectors=32
    -  nvptx_exec: kernel worker$_omp_fn$0: finished
    -GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb50, kinds=0x60bb68
    -  GOMP_OFFLOAD_openacc_exec: prepare mappings
    -  nvptx_exec: kernel gang$_omp_fn$0: launch gangs=32, workers=1, vectors=32
    -  nvptx_exec: kernel gang$_omp_fn$0: finished

Similar for '-O2', just with less stack usage.
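
To compare register usage between builds without reading the full 'diff', the
'used N registers' lines can be pulled out of the 'GOMP_DEBUG=1' output with a
few lines of Python.  This is only a sketch: the name 'registers_per_kernel' is
made up for illustration, and it assumes the link log keeps the
"Function properties for '...'" / "used N registers" layout shown above.

```python
import re

# Patterns for the two link-log lines of interest (layout assumed from the
# GOMP_DEBUG=1 output quoted in this report).
FUNC_RE = re.compile(r"Function properties for '([^']+)'")
REGS_RE = re.compile(r"used (\d+) registers")


def registers_per_kernel(log_text):
    """Map each kernel name to the register count reported right after it."""
    result = {}
    current = None
    for line in log_text.splitlines():
        func = FUNC_RE.search(line)
        if func:
            current = func.group(1)
            continue
        regs = REGS_RE.search(line)
        if regs and current is not None:
            result[current] = int(regs.group(1))
            current = None
    return result
```

Running this over the "before" and "after" logs and comparing the two
dictionaries gives the per-kernel register delta directly.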

Cross-checking with a more recent Driver Version: 450.119.03, I'm only seeing
slightly increased register usage; 52 registers after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1" compared to 51 registers before:

    [...]
    Link log info    : 4 bytes gmem
    info    : Function properties for 'vector$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    info    : Function properties for 'worker$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    info    : Function properties for 'gang$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
    [...]
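
The register counts reported by the two drivers are at least consistent with
the 'x * y <= 896' limit in the error message.  The following back-of-the-envelope
Python sketch rests on assumptions that have not been verified against the
driver: 65536 registers available per thread block, registers allocated per
warp of 32 threads in units of 256, and a cap of 1024 threads per block.
'max_threads' is a hypothetical helper, not something libgomp or CUDA provides,
and this is not how libgomp itself derives the limit.

```python
WARP_SIZE = 32
REGS_PER_BLOCK = 65536        # assumption: register file available to one block
REG_ALLOC_UNIT = 256          # assumption: per-warp allocation granularity
MAX_THREADS_PER_BLOCK = 1024  # assumption: hardware cap on block size


def max_threads(regs_per_thread):
    """Estimate how many threads of one block fit into the register file."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    # Round the per-warp demand up to the next multiple of the allocation unit.
    units = -(-regs_per_warp // REG_ALLOC_UNIT)
    regs_per_warp = units * REG_ALLOC_UNIT
    warps = REGS_PER_BLOCK // regs_per_warp
    return min(warps * WARP_SIZE, MAX_THREADS_PER_BLOCK)


for regs in (51, 52, 68):
    print(regs, "registers ->", max_threads(regs), "threads")
```

Under these assumptions, 68 registers per thread leave room for only 896
threads, which is too few for num_workers = 32 times vector_length = 32 (1024
threads), whereas 51 and 52 registers both still allow the full 1024.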

This suggests that compared to before, after "nvptx: Transition nvptx backend
to STORE_FLAG_VALUE = 1" GCC is generating certain PTX code sequences that the
Driver/JIT fails to understand/optimize?  While not ideal, the code still
executes fine (with newish Driver/JIT), and I'm thus OK if we classify that as
not worth looking into -- but I at least wanted to report my findings: maybe
there's a way to tune the GCC/nvptx code generation to the PTX -> SASS
compiler's liking?

Possibly (but that's just guessing!), the reason might be around the following
PTX code change:

    [...]
    -setp.leu.f64 %r82,%r25,0d7fefffffffffffff;
    -@ ! %r82 bra $L3;
    +@ %r78 bra $L20;
    +setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
    +bra $L3;
    +$L20:
     .loc 2 1976 21
    -setp.leu.f64 %r83,%r57,0d7fefffffffffffff;
    -@ %r83 bra $L19;
    +setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
    +@ %r138 bra $L19;
     $L3:
    [...]

From a quick look, I read this to mean that the originally ("before")
unconditional 'setp.leu.f64 %r82,%r25,0d7fefffffffffffff;' is now ("after")
done conditionally.

Maybe related, maybe not: when 'diff'ing the before vs. after nvptx-none
target libraries out of curiosity, I noticed amongst all the "noise" (improved
code generation):

'nvptx-none/libatomic/gcas.o':

    [...]
     atom.cas.b32 %r137,[%r34],%r136,%r139;
     setp.eq.u32 %r140,%r137,%r136;
     selp.u32 %r138,1,0,%r140;
    -setp.ne.u32 %r141,%r138,0;
    -@ %r141 bra $L21;
    +@ %r140 bra $L18;
     st.u32 [%r201],%r137;
    -bra $L19;
    +$L18:
    +setp.eq.u32 %r142,%r138,0;
    +@ %r142 bra $L19;
     $L21:
    [...]

... which again looks like a pattern where an originally ("before")
unconditional 'setp.ne.u32 %r141,%r138,0;' is now ("after") done conditionally.

Similar in other files -- but I certainly didn't look in detail, and I'm
certainly not claiming this to be/cause any actual problem.


And, I've spotted a few cases where we're generating "maybe worse" code:

'nvptx-none/libgomp/openacc.o' (complete 'diff'):

    @@ -25,6 +25,7 @@
     .reg .u64 %r28;
     .reg .u32 %r29;
     .reg .u32 %r30;
    +.reg .pred %r31;
     mov.u64 %r27,%ar0;
     st.u64 [%frame+16],%r27;
     ld.u64 %r28,[%frame+16];
    @@ -38,8 +39,8 @@
     ld.param.u32 %r30,[%value_in];
     }
     mov.u32 %r23,%r30;
    -set.u32.ne.u32 %r24,%r23,0;
    -neg.s32 %r24,%r24;
    +setp.ne.u32 %r31,%r23,0;
    +selp.u32 %r24,1,0,%r31;
     st.u32 [%frame],%r24;
     ld.u32 %r25,[%frame];
     mov.u32 %r26,%r25;

'nvptx-none/newlib/libc/reent/lib_a-renamer.o' (complete 'diff'):

    @@ -28,6 +28,8 @@
     .reg .u32 %r32;
     .reg .pred %r33;
     .reg .u32 %r36;
    +.reg .u32 %r38;
    +.reg .pred %r39;
     mov.u64 %r26,%ar0;
     mov.u64 %r27,%ar1;
     mov.u64 %r28,%ar2;
    @@ -58,7 +60,9 @@
     ld.param.u32 %r36,[%value_in];
     }
     .loc 2 57 6
    -set.u32.eq.u32 %r25,%r36,-1;
    +setp.eq.u32 %r39,%r36,-1;
    +selp.u32 %r38,1,0,%r39;
    +neg.s32 %r25,%r38;
     $L1:
     .loc 2 64 1
     mov.u32 %value,%r25;

'nvptx-none/newlib/libc/stdio/lib_a-remove.o' (complete 'diff'):

    @@ -24,6 +24,8 @@
     .reg .u64 %r26;
     .reg .u64 %r27;
     .reg .u32 %r30;
    +.reg .u32 %r33;
    +.reg .pred %r34;
     mov.u64 %r26,%ar0;
     mov.u64 %r27,%ar1;
     .loc 2 65 7
    @@ -37,7 +39,9 @@
     ld.param.u32 %r30,[%value_in];
     }
     .loc 2 65 6
    -set.u32.eq.u32 %value,%r30,-1;
    +setp.eq.u32 %r34,%r30,-1;
    +selp.u32 %r33,1,0,%r34;
    +neg.s32 %value,%r33;
     .loc 2 69 1
     st.param.u32 [%value_out],%value;
     ret;
    @@ -51,6 +55,8 @@
     .reg .u64 %r27;
     .reg .u64 %r30;
     .reg .u32 %r31;
    +.reg .u32 %r34;
    +.reg .pred %r35;
     mov.u64 %r27,%ar0;
     .loc 2 65 7
     ld.global.u64 %r30,[_impure_ptr];
    @@ -64,7 +70,9 @@
     ld.param.u32 %r31,[%value_in];
     }
     .loc 2 65 6
    -set.u32.eq.u32 %value,%r31,-1;
    +setp.eq.u32 %r35,%r31,-1;
    +selp.u32 %r34,1,0,%r35;
    +neg.s32 %value,%r34;
     .loc 2 77 1
     st.param.u32 [%value_out],%value;
     ret;

'nvptx-none/newlib/libm/common/lib_a-s_rint.o' (complete 'diff'):

    @@ -80,6 +80,7 @@
     .reg .u32 %r119;
     .reg .pred %r120;
     .reg .u32 %r122;
    +.reg .pred %r123;
     .reg .u32 %r124;
     .reg .u32 %r125;
     .reg .u32 %r126;
    @@ -197,7 +198,8 @@
     setp.eq.u32 %r120,%r41,0;
     @ %r120 bra $L5;
     .loc 2 114 9
    -set.u32.eq.u32 %r122,%r58,19;
    +setp.eq.u32 %r123,%r58,19;
    +selp.u32 %r122,1,0,%r123;
     shl.b32 %r41,%r122,31;
     .loc 2 115 17
     not.b32 %r124,%r64;

I haven't checked whether that's "actually worse" in SASS, or just "maybe
worse" in the intermediate PTX representation.  (... and it is most certainly
not related to the regression mentioned before.)
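
For reference, the before/after sequences in these 'diff's compute the same
values; only the instruction count differs.  The small Python model below
illustrates that.  It relies on the PTX semantics that 'set' with a '.u32'
destination writes 0xffffffff for true and 0 for false, and that
'selp d,a,b,p' selects 'a' if 'p' is set and 'b' otherwise.  All function names
are made up for this sketch, and each comparison is modeled as a Python boolean
rather than a real predicate register.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers


def set_u32(pred):
    """set.u32.<cmp>.u32: all-ones for true, zero for false."""
    return MASK if pred else 0


def selp_u32(a, b, pred):
    """selp.u32 d,a,b,p: a if p else b."""
    return a if pred else b


def neg_s32(x):
    return (-x) & MASK


def shl_b32(x, n):
    return (x << n) & MASK


# openacc.o: x != 0 as 0/1 (2 instructions before, 2 after)
def before_openacc(x): return neg_s32(set_u32(x != 0))
def after_openacc(x): return selp_u32(1, 0, x != 0)

# lib_a-remove.o / lib_a-renamer.o: -(x == -1) (1 instruction before, 3 after)
def before_remove(x): return set_u32(x == MASK)
def after_remove(x): return neg_s32(selp_u32(1, 0, x == MASK))

# lib_a-s_rint.o: (x == 19) << 31 (1 instruction before the shift, 2 after)
def before_rint(x): return shl_b32(set_u32(x == 19), 31)
def after_rint(x): return shl_b32(selp_u32(1, 0, x == 19), 31)
```

In the 's_rint' case the shift by 31 discards all but the lowest bit, so
feeding it 1 instead of 0xffffffff gives the same result.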


It seems unlikely, but I'll report back in case anything here changes due to
Tom's several GCC/nvptx back end commits yesterday.


So, please close this PR as "won't fix" unless you see something here that
you'd like to look into.
