[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #18 from Andrew Stubbs --- That should fix the broken validation check. All V32 permutations should work now on RDNA GPUs, I think. V16 and smaller were already working fine.
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #16 from Andrew Stubbs --- On 26/06/2024 14:41, rguenther at suse dot de wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #15 from rguenther at suse dot de --- >>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting >>> two vectors src0 and src1 into one 32 element dst vector (it's no longer >>> required that src0 and src1 line up with the dst vector size btw, they >>> might have different nelt). So the loop would reject interleaving >>> the low parts of two 32 element vectors, a permute that would look like >>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes" >>> mean you can never mix the two vector inputs? Or does GCN not have >>> a two-to-one vector permute instruction? >> >> GCN does not have two-to-one vector permute in hardware, so we do two >> permutes and a vec_merge to get the same effect. >> >> GFX9 can permute all the elements within a 64 lane vector arbitrarily. >> >> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but >> no value may cross the boundary. AFAIK there's no way to do that via any >> vector instruction (i.e. without writing to memory, or extracting values >> element-wise). > > I see - so it cannot even swap low-32 and high-32? I'm thinking of > what sub-part of permutes would be possible by extending the two-to-one > vec_merge trick. No(?) The 64-lane compatibility mode works, under the hood, by allocating double the number of 32-lane registers and then executing each instruction twice. Mostly this is invisible, but it gets exposed for permutations and the like. Logically, the microarchitecture could do a vec_merge to DTRT, but I've not found a way to express that. It's possible I missed something when RTFM. > OTOH we restrict GFX10/11 to 32 lane vectors so in practice this > restriction should be fine. Yes, with the "31" fixed it should work. Andrew
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #14 from Andrew Stubbs --- On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #13 from Richard Biener --- > (In reply to Richard Biener from comment #12) >> (In reply to Andrew Stubbs from comment #10) >>> GFX10 has more limited permutation capabilities than GFX9 because it >>> only has 32-lane vectors natively, even though we're using the 64-lane >>> "compatibility" mode. >>> >>> However, in theory, the permutation capabilities on V32 and below should >>> be the same, and some permutations on V64 are allowed, so I don't know >>> why it doesn't use it. It's possible I broke the logic in >>> gcn_vectorize_vec_perm_const: >>> >>> /* RDNA devices can only do permutations within each group of 32-lanes. >>>Reject permutations that cross the boundary. */ >>> if (TARGET_RDNA2_PLUS) >>> for (unsigned int i = 0; i < nelt; i++) >>> if (i < 31 ? perm[i] > 31 : perm[i] < 32) >>> return false; >>> >>> It looks right to me though? >> >> nelt == 32 so I think the last element has the wrong check applied? >> >> It should be >> >>> if (i < 32 ? perm[i] > 31 : perm[i] < 32) >> >> I think. With that the vectorization happens in a similar way but the >> failure still doesn't reproduce (without the patch, of course). Oops, I think you're right. > Btw, the above looks quite odd for nelt == 32 anyway - we are permuting > two vectors src0 and src1 into one 32 element dst vector (it's no longer > required that src0 and src1 line up with the dst vector size btw, they > might have different nelt). So the loop would reject interleaving > the low parts of two 32 element vectors, a permute that would look like > { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes" > mean you can never mix the two vector inputs? Or does GCN not have > a two-to-one vector permute instruction? 
GCN does not have two-to-one vector permute in hardware, so we do two permutes and a vec_merge to get the same effect.

GFX9 can permute all the elements within a 64 lane vector arbitrarily.

GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but no value may cross the boundary. AFAIK there's no way to do that via any vector instruction (i.e. without writing to memory, or extracting values element-wise).

In theory, we could implement permutes with different sized inputs and outputs, but right now those are rejected early.

The interleave example wouldn't work in hardware, for GFX10, but we could have it for GFX9.

However, I think you might be right about the numbering of the "perm" array; we probably need to be testing "(perm[i] % nelt) > 31" if we are to support two-to-one permutations.

Thanks for looking at this.

Andrew
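The off-by-one discussed in this comment can be reproduced outside the compiler. The following is a minimal sketch, not the code in gcn_vectorize_vec_perm_const: the function names are hypothetical, the first function copies the check as quoted in the thread, and the second is only one possible reading of the "(perm[i] % nelt)" suggestion. Whether that reading matches what the hardware accepts for two-input permutes is not established here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-alone model of the lane-group check.  perm[i] names
   the source lane for output lane i; values >= nelt would select the
   second input vector.  */

/* The check as quoted in the thread, including the "31" slip.  */
bool
perm_ok_as_quoted (const unsigned *perm, unsigned nelt)
{
  for (unsigned i = 0; i < nelt; i++)
    if (i < 31 ? perm[i] > 31 : perm[i] < 32)
      return false;
  return true;
}

/* Assumed variant: boundary at 32, and the index reduced modulo nelt so
   that each input is judged by the lane it occupies in its own register.  */
bool
perm_ok_sketch (const unsigned *perm, unsigned nelt)
{
  assert (nelt > 0 && nelt <= 64);
  for (unsigned i = 0; i < nelt; i++)
    {
      unsigned src_lane = perm[i] % nelt;
      /* Reject when output lane and source lane sit in different
         32-lane groups.  */
      if ((i < 32) != (src_lane < 32))
        return false;
    }
  return true;
}
```

With nelt == 32, the permute { 0 0 1 1 ... 15 15 } from the vectorizer dump is rejected by the quoted check, because lane 31 falls into the "perm[i] < 32" arm, while the sketch accepts it. A 64-lane swap of the two halves is rejected by the sketch, as the thread expects for RDNA.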
[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #10 from Andrew Stubbs --- On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 > > --- Comment #8 from Richard Biener --- > (In reply to Richard Biener from comment #7) >> I will have a look (and for run validation try to reproduce with gfx1036). > > OK, so with gfx1036 we end up using 16 byte vectors and the testcase > passes. The difference with gfx908 is > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: ==> examining statement: _14 = aa[_13]; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: vect_model_load_cost: aligned. > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: vect_model_load_cost: inside_cost = 2, prologue_cost = 0 . > > vs. > > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: ==> examining statement: _14 = aa[_13]; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > missed: unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 > 10 > 10 11 11 12 12 13 13 14 14 15 15 } > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > missed: unsupported load permutation > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72: > missed: not vectorized: relevant stmt not supported: _14 = aa[_13]; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: removing SLP instance operations starting from: REALPART_EXPR > <(*hadcur_24(D))[_2]> = _86; > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > missed: unsupported SLP instances > /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12: > note: re-trying with SLP disabled > > 
> so gfx1036 cannot do such permutes but gfx908 can?

GFX10 has more limited permutation capabilities than GFX9 because it only has 32-lane vectors natively, even though we're using the 64-lane "compatibility" mode.

However, in theory, the permutation capabilities on V32 and below should be the same, and some permutations on V64 are allowed, so I don't know why it doesn't use it. It's possible I broke the logic in gcn_vectorize_vec_perm_const:

  /* RDNA devices can only do permutations within each group of 32-lanes.
     Reject permutations that cross the boundary.  */
  if (TARGET_RDNA2_PLUS)
    for (unsigned int i = 0; i < nelt; i++)
      if (i < 31 ? perm[i] > 31 : perm[i] < 32)
        return false;

It looks right to me though?

The vec_extract patterns that also use permutations are likewise supposedly still enabled for V32 and below.

Andrew
[Bug target/115640] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640 --- Comment #3 from Andrew Stubbs --- (In reply to Richard Biener from comment #2) > If you force GCN to use fixed length vectors (how?), does it work? How's > it behaving on aarch64 with SVE? (the CI was happy, but maybe doesn't > enable SVE) I believe "--param vect-partial-vector-usage=0" will disable the use of WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked the value is inherited from the host in the offload toolchain; the default for x86_64 was "1", meaning approximately "only use partial vectors in epilogue loops".
[Bug target/115631] [15 Regression] GCN: FAIL: c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631 --- Comment #1 from Andrew Stubbs --- It was writing 0 to s12 (scalar register) and then moving the zero to lane zero of v0 (vector register). Now it's writing the 0 directly to v0, of which all but lane zero is masked. These should be identical (unless s12 was also live). The problem must be elsewhere.
[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304 --- Comment #11 from Andrew Stubbs --- (In reply to rguent...@suse.de from comment #10) > On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote: > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304 > > > > --- Comment #9 from Andrew Stubbs --- > > (In reply to Richard Biener from comment #6) > > > The best strathegy for GCN would be to gather V4QImode aka SImode into the > > > V64QImode (or V16SImode) vector. For pix2 we have a gap of 28 elements, > > > doing consecutive loads isn't a good strategy here. > > > > I don't fully understand what you're trying to say here, so apologies if you > > knew all this already and I missed the point. > > > > In general, on GCN V4QImode is not in any way equivalent to SImode (when the > > values are in registers). The vector registers are not one single string of > > re-interpretable bits. > > > > For the same reason, you can't load a value as V64QImode and then try to > > interpret it as V16SImode. GCN vector registers just don't work like > > SSE/Neon/etc. > > > > When you load a V64QImode vector, each lane is extended to 32 bits, so what > > you > > actually get in hardware is a V64SImode vector. > > > > Likewise, when you load a V4QImode vector the hardware representation is > > actually V4SImode (which in itself is just V64SImode with undefined values > > in > > the unused lanes). > > I see. I wonder if there's not one or two latent wrong-code because of > this and the vectorizers assumptions ;) I suppose modes_tieable_p > will tell us whether a VIEW_CONVERT_EXPR will do the right thing? > Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw? > And V64QImode really V64PSImode? The mode size says how big it will be when written to memory, so no they're not the same. I believe this matches the scalar QImode behaviour. We don't use any PSI modes. There are (some) machine instructions for V64QImode (and V64HImode) so we don't want to lose that information. 
There may well be some bugs, but we have handling for conversions in a number of places. There are truncate and extend patterns that operate lane-wise, and vec_extract can take a subset of a vector, IIRC. > Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33], > c[34], c[35], ... } it's probably best to use a single V64QImode gather > with GCN then rather than four "consecutive" V64QImode loads and then > element swizzling. Fewer loads are always better, and permutations are expensive operations (and don't work with 64-lane vectors on RDNA devices because they're actually two 32-lane vectors stuck together) so it can certainly make sense to use gather with a vector of permuted offsets (although it can be expensive to generate that vector in the first place).
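To make the "gather with a vector of permuted offsets" idea concrete, here is a scalar model of the access pattern { c[0], c[1], c[2], c[3], c[32], c[33], c[34], c[35], ... } mentioned above. It is a rough sketch only: the function names, the constants, and the choice of zero extension are assumptions for illustration, and nothing here is GCC or GCN code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants: 64 lanes, groups of 4 consecutive bytes, and a
   new group every 32 elements (so a gap of 28 between groups).  */
enum { LANES = 64, GROUP = 4, STRIDE = 32 };

/* Element offset that a given lane reads from: 0..3, 32..35, 64..67, ...  */
size_t
gather_offset (unsigned lane)
{
  return (size_t) (lane / GROUP) * STRIDE + lane % GROUP;
}

/* Scalar model of a single 64-lane byte gather.  Each lane ends up as a
   32-bit value, mirroring the point that QImode lanes are extended in
   registers.  Zero extension is an arbitrary choice for this sketch.  */
void
gather_v64qi (const uint8_t *c, uint32_t dst[LANES])
{
  for (unsigned lane = 0; lane < LANES; lane++)
    dst[lane] = c[gather_offset (lane)];
}
```

One gather covers all 64 lanes here, at the cost of first materialising the offsets that gather_offset computes per lane.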
[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304 --- Comment #9 from Andrew Stubbs --- (In reply to Richard Biener from comment #6) > The best strathegy for GCN would be to gather V4QImode aka SImode into the > V64QImode (or V16SImode) vector. For pix2 we have a gap of 28 elements, > doing consecutive loads isn't a good strategy here. I don't fully understand what you're trying to say here, so apologies if you knew all this already and I missed the point. In general, on GCN V4QImode is not in any way equivalent to SImode (when the values are in registers). The vector registers are not one single string of re-interpretable bits. For the same reason, you can't load a value as V64QImode and then try to interpret it as V16SImode. GCN vector registers just don't work like SSE/Neon/etc. When you load a V64QImode vector, each lane is extended to 32 bits, so what you actually get in hardware is a V64SImode vector. Likewise, when you load a V4QImode vector the hardware representation is actually V4SImode (which in itself is just V64SImode with undefined values in the unused lanes).
[Bug driver/114717] '-fcf-protection' vs. offloading compilation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114717 --- Comment #3 from Andrew Stubbs --- Can this be filtered (safely) in mkoffload? That tool is offload-target-specific, so no problem with "if offload target were to support it".
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #4 from Andrew Stubbs --- Yes, that's what the simd-math-3* tests do. The simd-math-5* tests are explicitly supposed to be doing this in the context of the autovectorizer. If these tests are being compiled as (newly) intended then we should change the expected results.

So, questions:

1. Are the new results actually correct? (So far I only know that being different is expected.)

2. Is there some other testcase form that would exercise the previously intended routines?

3. Is the new behaviour configurable?

I don't think the 16-bit shift bug ever existed on GCN (in which "short" vectors actually have excess bits in each lane, much like scalar registers do).
[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302 --- Comment #2 from Andrew Stubbs --- The execution test checks that each of the libgcc routines work correctly, and the scan assembler tests make sure that we're getting coverage of all of them. In this case, the failure indicates that we're not testing the routine we were aiming for (but I think it does execute correctly and give a good result).
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #8 from Andrew Stubbs --- (In reply to seurer from comment #7) > On the BE machine: > > seurer@nilram:~/gcc/git/build/gcc-test$ ulimit -a > real-time non-blocking time (microseconds, -R) unlimited > ... > max locked memory (kbytes, -l) 529679232 > ... That's a suspiciously large number, but OK. > seurer@nilram:~/gcc/git/build/gcc-test$ getconf PAGESIZE > 65536 > > > There were no messages. Running it in gdb I get: > > (gdb) where > #0 0x0fce3340 in ?? () from /lib32/libc.so.6 > #1 0x0fc851e4 in raise () from /lib32/libc.so.6 > #2 0x0fc6a128 in abort () from /lib32/libc.so.6 > #3 0x1ae4 in set_pin_limit (size=size@entry=131072) at > /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:44 > #4 0x1754 in main () at > /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c: > 106 > > > if (getrlimit (RLIMIT_MEMLOCK, )) > abort (); // line 44 in alloc-pinned-4.c Why would that fail? Perhaps you can investigate the errno. You're probably best placed to submit a patch for whatever this issue is. > > This is a Debian Trixie machine and it too is using whatever the defaults > are. Good to know.
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #6 from Andrew Stubbs --- (In reply to seurer from comment #5) > I should note that pinned-2 also fails on powerpc64 LE. > > make -k check-target-libgomp RUNTESTFLAGS="c.exp=libgomp.c/alloc-pinned-*" > FAIL: libgomp.c/alloc-pinned-1.c execution test > FAIL: libgomp.c/alloc-pinned-2.c execution test > > > On powerpc64 BE pinned-3 and -4 fail (but not -1 and -2): > > make -k check-target-libgomp RUNTESTFLAGS="--target_board=unix'{-m32,-m64}' > c.exp=libgomp.c/alloc-pinned-*" > FAIL: libgomp.c/alloc-pinned-3.c execution test > FAIL: libgomp.c/alloc-pinned-4.c execution test Please show any messages in the libgomp.log file, and find out what the page sizes and locked memory limits are on both machines.
[Bug target/113615] internal compiler error: in extract_insn, at recog.cc:2812
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113615 --- Comment #3 from Andrew Stubbs --- I did see these, but I hadn't had time to chase them up. The proposed patch is exactly the sort of solution I was expecting to find, short term. Have you confirmed that it fixes all the cases? A proper solution is to find out how to implement reductions with the RDNA ISA, of course, but that's probably non-trivial (as in, I'm pretty sure it's more than renaming a few mnemonics), and low-priority as GCC does a reasonably good job without them.
[Bug middle-end/113199] [14 Regression][GCN] ICE (segfault) due to invalid 'loop_mask_46 = VEC_PERM_EXPR' when compiling Newlib's wcsftime.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113199 --- Comment #5 from Andrew Stubbs --- I can confirm that I can now build the amdgcn toolchain once more. :-) Thanks.
[Bug middle-end/113163] [14 Regression][GCN] ICE in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9420
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113163

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ams at gcc dot gnu.org

--- Comment #11 from Andrew Stubbs ---
(In reply to Tamar Christina from comment #7)
> This seems to happen because the vectorizer decides to use partial vectors
> to vectorize the loop and the target picks a nonlinear induction step which
> we can't support for early breaks.

In which hook is this selected? I'm not aware of this being a deliberate choice we made...
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #4 from Andrew Stubbs --- It's going to be difficult to make this test work when only one page of locked memory is available. :-( I will look at making it "unsupported".
[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085 --- Comment #1 from Andrew Stubbs --- That is a typo. I don't want to make it pass on machines that have insufficient memory configured because it will mask the case where it fails for another reason. However, the testcase was originally supposed to fit in 64kB. Is your page size larger than 4kB?
[Bug target/113022] GCN offloading bricked by "amdgcn: Work around XNACK register allocation problem"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113022 --- Comment #1 from Andrew Stubbs --- This is what I get for trying to get this done before vacation. :( Yes, there's probably something in mkoffload that has to match the default change from -mxnack=any to -mxnack=off on the older ISAs.
[Bug target/112937] [14 Regression] GCN: FAILs due to unconditional 'f->use_flat_addressing = true;'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112937 --- Comment #2 from Andrew Stubbs --- Flat addressing *should* be the safe option that always works (although using "global" address space permits slightly more efficient offset options).
[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #13 from Andrew Stubbs ---
This should be fixed now.
[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481 --- Comment #7 from Andrew Stubbs --- Simply changing to OPTAB_WIDEN solves the ICE, but I don't know if it does so in a sensible way, for RISC-V.

@@ -7489,7 +7489,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
            if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
              tmp = expand_binop (mode, and_optab, tmp,
                                  GEN_INT ((1 << nunits) - 1), target,
-                                 true, OPTAB_DIRECT);
+                                 true, OPTAB_WIDEN);
            if (tmp != target)
              emit_move_insn (target, tmp);
            break;

Here are the instructions it generates:

(set (reg:DI 165)
    (and:DI (subreg:DI (reg:SI 164) 0)
        (const_int 1 [0x1])))
(set (reg:SI 154)
    (subreg:SI (reg:DI 165) 0))

Should I use that patch? I think it's harmless on targets where OPTAB_DIRECT would work.
[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2023-11-13
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org

--- Comment #4 from Andrew Stubbs ---
It fails because optab_handler fails to find an instruction for "and_optab" in SImode. I didn't consider handling that case; seems so unlikely.

I guess architectures that can't "and" masks don't get to have safe masks? ... I'll work on a fix.
[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #2 from Andrew Stubbs ---
This should be fixed now.
[Bug target/112313] [14 Regression] GCN target 'gcc.dg/pr111082.c' ICE, 'during RTL pass: vregs': 'error: unrecognizable insn'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112313

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED
           Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org

--- Comment #2 from Andrew Stubbs ---
This is now fixed.
[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2023-11-09
     Ever confirmed|0                           |1
           Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org
[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #3 from Andrew Stubbs ---
The patch should fix the bug.
[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-10-27
           Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org

--- Comment #1 from Andrew Stubbs ---
I'm testing a fix for this.
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #5 from Andrew Stubbs --- One thing that is unusual about the GCN stack pointer is that it's actually two registers. Could this be breaking some cprop assumptions? GCN can't fit an address in one (SImode) register so all (DImode) pointers require a pair of registers. We had to rework the dwarf stack representation code for this architecture, so I'm pretty sure no other port does this.
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #3 from Andrew Stubbs --- It's curious that this affects the Fiji target only, and not the newer targets at all. There are some additional register options for multiply instructions, some differences to atomics, but mostly the difference is that Fiji's "flat" load and store instructions can't have offsets.
[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313 --- Comment #1 from Andrew Stubbs --- This ICE also affects the following standalone test failures (raw amdgcn, no offloading): gfortran.dg/assumed_rank_21.f90 gfortran.dg/finalize_38.f90 gfortran.dg/finalize_38a.f90
[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898 --- Comment #4 from Andrew Stubbs --- I did not know there was a way to do that! I'll add this to my to-do list.
[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898 --- Comment #1 from Andrew Stubbs --- I tested it on i686-pc-linux-gnu before I posted the patch, and it was working then. Can you be more specific what configuration you were testing, please?
[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #4 from Andrew Stubbs ---
Fixed.
[Bug other/89863] [meta-bug] Issues in gcc that other static analyzers (cppcheck, clang-static-analyzer, PVS-studio) find that gcc misses
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89863
Bug 89863 depends on bug 107510, which changed state.

Bug 107510 Summary: gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org |ams at gcc dot gnu.org
             Status|NEW                         |ASSIGNED

--- Comment #2 from Andrew Stubbs ---
Oops, I thought I fixed that. :(
[Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096 --- Comment #4 from Andrew Stubbs --- I don't understand rgroups, but I can say that GCN masks are very simply one-bit-one-lane. There are always 64 lanes, regardless of the type, so V64QImode has fewer bytes and bits than V64DImode (when written to memory). This is different to most other architectures where the bit-size remains the same and number of lanes varies with the inner type, and has caused us some issues with invalid assumptions in GCC (e.g. "there's no need for sign-extends in vector registers" is not true for GCN). However, I think it's the same as you're describing for AVX512, at least in this respect.

Incidentally I'm on the cusp of adding multiple "virtual" vector sizes in the GCN backend (in lieu of implementing full mask support everywhere in the middle-end and fixing all the cost assumptions), so these VIEW_CONVERT_EXPR issues are getting worse. I have a bunch of vec_extract patterns that fix up some of it. Within the backend, the V32, V16, V8, V4 and V2 vectors are all really just 64-lane vectors with the mask preset, so the mask has to remain DImode or register allocation becomes tricky.
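The "one-bit-one-lane" mask and the idea of smaller vectors being 64-lane vectors with a preset mask can be modelled in a few lines. This is only a sketch under assumptions: lane_mask and masked_add are hypothetical helper names, and the model says nothing about how the backend actually allocates or sets the mask register.

```c
#include <assert.h>
#include <stdint.h>

/* One mask bit per lane.  The mask stays a 64-bit (DImode-sized) value
   however many lanes are active.  */
uint64_t
lane_mask (unsigned active_lanes)
{
  assert (active_lanes >= 1 && active_lanes <= 64);
  /* Shifting a 64-bit value by 64 is undefined in C, so treat it apart.  */
  return active_lanes == 64 ? UINT64_MAX
                            : (UINT64_C (1) << active_lanes) - 1;
}

/* Lane-wise add under a mask: inactive lanes keep their old contents.  */
void
masked_add (uint32_t dst[64], const uint32_t a[64], const uint32_t b[64],
            uint64_t mask)
{
  for (unsigned lane = 0; lane < 64; lane++)
    if ((mask >> lane) & 1)
      dst[lane] = a[lane] + b[lane];
}
```

In this model a "V4" operation is masked_add called with lane_mask (4): the storage is still 64 lanes wide, and only the low four are written.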
[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088 --- Comment #9 from Andrew Stubbs --- I can confirm that the patch fixes the amdgcn build.
[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088

Andrew Stubbs changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|ia64-*-*                    |ia64-*-*, amdgcn-*-*
                 CC|                            |ams at gcc dot gnu.org

--- Comment #7 from Andrew Stubbs ---
I get the same failure on amdgcn building newlib/libm/math/kf_rem_pio2.c
[Bug tree-optimization/106476] New: ICE generating FOLD_EXTRACT_LAST
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106476

            Bug ID: 106476
           Summary: ICE generating FOLD_EXTRACT_LAST
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ams at gcc dot gnu.org
                CC: rguenther at suse dot de
  Target Milestone: ---
            Target: amdgcn-amdhsa

Commit 8f4d9c1deda "amdgcn: 64-bit not" exposed an ICE in tree-vect-stmts.cc when compiling gcc.dg/torture/pr67470.c at -O2 for amdgcn. The newly implemented op is not the problem, but it allows this optimization (and many others) to proceed, and the error is no longer hidden.

amdgcn has masked vectors and fold_extract_last, which leads to a code path through tree-vect-stmts.cc that has

          vec_then_clause = vec_oprnds2[i];
          if (reduction_type != EXTRACT_LAST_REDUCTION)
            vec_else_clause = vec_oprnds3[i];

and then

          /* Instead of doing ~x ? y : z do x ? z : y.  */
          vec_compare = new_temp;
          std::swap (vec_then_clause, vec_else_clause);

and finally

          new_stmt = gimple_build_call_internal
              (IFN_FOLD_EXTRACT_LAST, 3, else_clause, vec_compare,
               vec_then_clause);

in which vec_then_clause remains set to NULL_TREE. The dump shows

  e_lsm.16_32 = .FOLD_EXTRACT_LAST (e_lsm.16_8, _70, );

(note the last field is missing.)

I can fix the ICE if I add "else vec_else_clause = integer_zero_node", but I'm not sure that is the correct logical solution.

(CC Richi who touched this code last)
[Bug target/105873] [amdgcn][OpenMP] task reductions fail with "team master not responding; slave thread aborting"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105873 --- Comment #4 from Andrew Stubbs --- I think unused threads should be given a no-op function to run, not a null pointer. The GCN implementation cannot tell the difference between a null pointer and an unset pointer (which is what happens when the master thread dies). There's also a potential issue of what happens when barriers occur within an active thread when there are also inactive threads. GCN barrier instructions are unconditional, meaning that all the live threads must respond. The inactive threads can do so, in a harmless way, as long as they are allowed to spin, but we don't want them spinning forever when the master dies. I believe the current barrier implementation skips the barrier instruction when the team's thread count is 1. This is how we avoid issues with nested teams and tasks. I don't know why that doesn't help here?
[Bug target/105246] [amdgcn] Use library call for SQRT with -ffast-math + provide additional option to use single-precsion opcode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105246 --- Comment #2 from Andrew Stubbs --- When we first coded this we only had the GCN3 ISA manual, which says nothing about the accuracy.

Now I look in the Vega manual (GCN5) I see:

   Square root with perhaps not the accuracy you were hoping for --
   (2**29)ULP accuracy. On the upside, denormals are supported.

The most recent CDNA2 manual is a bit less verbose:

   Square root. Precision is (2**29) ULP, and supports denormals.

The compiler already emits Newton-Raphson iterations for division with -ffast-math, so I'm sure it can be done, but I'm not too clear on the mathematics myself.
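For reference, the mathematics is short. Applying Newton-Raphson to f(x) = x*x - a gives the update x' = (x + a/x) / 2 (Heron's method), and each step roughly doubles the number of correct bits once the estimate is close. The following is a minimal sketch of that refinement only; it is not what GCC emits, refine_sqrt is a hypothetical name, and x0 merely stands in for a low-precision hardware result.

```c
#include <assert.h>

/* Refine an approximate square root of A, starting from the estimate X0,
   using STEPS iterations of x' = (x + a/x) / 2.  */
double
refine_sqrt (double a, double x0, int steps)
{
  assert (a > 0.0 && x0 > 0.0);
  double x = x0;
  for (int i = 0; i < steps; i++)
    x = 0.5 * (x + a / x);
  return x;
}
```

This form needs a division per step. A commonly used alternative iterates on the reciprocal square root, y' = y * (1.5 - 0.5 * a * y * y), and multiplies by a at the end, which avoids the division; which form would suit the GCN instruction set is left open here.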
[Bug target/100181] hot-cold partitioned code doesn't assemble
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100181 --- Comment #13 from Andrew Stubbs --- I've updated the LLVM version documentation at https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN: It's LLVM 9 or 13.0.1 now (nothing in between), and will be 13.0.1+ for the next release (dropping LLVM 9 because we'll want to add newer device support to GCC soonish).
[Bug middle-end/104026] [12 Regression] ICE in wide_int_to_tree_1, at tree.c:1755 via tree-vect-loop-manip.c:673
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104026

Andrew Stubbs changed:

What|Removed|Added
CC||ams at gcc dot gnu.org

--- Comment #6 from Andrew Stubbs --- amdgcn always uses 64-lane vectors, regardless of type, and relies on masking to support anything smaller. The len_store pattern seems to have been introduced in July 2020 which is more recent than the last major work in the amdgcn backend.
[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396

Andrew Stubbs changed:

What|Removed|Added
Resolution|---|FIXED
Status|ASSIGNED|RESOLVED

--- Comment #6 from Andrew Stubbs --- This problem should be fixed now.
[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396

Andrew Stubbs changed:

What|Removed|Added
Last reconfirmed||2021-11-24
Status|UNCONFIRMED|ASSIGNED
Assignee|unassigned at gcc dot gnu.org|ams at gcc dot gnu.org
Ever confirmed|0|1

--- Comment #4 from Andrew Stubbs --- I think I have a fix for this. It happens when the link register has to be saved because it is used implicitly by a function call, but the register is never explicitly mentioned anywhere else in the function. I don't know why this hasn't been a problem before now?
[Bug target/103201] [12 Regression] trunk 20211111 ftbfs for amdgcn – libgomp/teams.c:49:6: error: 'struct gomp_thread' has no member named 'num_teams'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103201 --- Comment #3 from Andrew Stubbs --- I did some preliminary testing on your patch: the libgomp.c/target-teams-1.c testcase runs fine on amdgcn. I presume that that covers most of the existing features of those runtime calls?
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #8 from Andrew Stubbs --- Did you get the C version to return anything other than "-1"? (The expected result is "2".) I'm still trying to determine if the device is compatible, but the mapping problem looks like a different issue. Your code works fine on my device using a somewhat more recent GCC build. (I can't install that exact toolchain right now.)
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #5 from Andrew Stubbs --- Sorry, I should have said to compile with -fopenacc. If you did do that, please post the GCN_DEBUG output.
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #3 from Andrew Stubbs --- That output shows that we have the correct libgomp and rocm is installed and working. Libgomp initialized the GCN plugin, but did not attempt to initialize the device (the next message in the output should have been "Selected kernel arguments memory region", or at least a GCN error message). Instead we have a target-independent libgomp error. Presumably the kernel metadata is malformed, somehow?

I think we need a testcase to debug this further, preferably reduced to be as simple as possible. Perhaps it would be a good idea to start with a minimal toy example and see if that works on the device.

    #include <stdio.h>
    #include <openacc.h>

    int main ()
    {
      int v = 1;

    #pragma acc parallel copy(v)
      {
        if (acc_on_device(acc_device_host))
          v = -1; // error
        else
          {
            v = 2; // success
          }
      }
      printf ("v is %d\n", v);
      return v;
    }
[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544 --- Comment #1 from Andrew Stubbs --- Please set "export GCN_DEBUG=1", try it again, and post the output.
[Bug target/102260] amdgcn offload compiler fails to configure, not matching target directive's target id
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102260

Andrew Stubbs changed:

What|Removed|Added
Status|UNCONFIRMED|NEW
Last reconfirmed||2021-09-09
Ever confirmed|0|1
Assignee|unassigned at gcc dot gnu.org|ams at gcc dot gnu.org

--- Comment #1 from Andrew Stubbs --- In addition to changing the amdgcn_target syntax in LLVM 13, the LLVM GCN guys have also renamed the "sram-ecc" attribute to "sramecc" on the CLI, and have not provided any backwards compatibility for either change. These are not helpful decisions and will require configure tests in GCC to support all the variations. :-( I'm working on it.
[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544 --- Comment #5 from Andrew Stubbs ---
[Note: all of my comments refer to the amdgcn case. nvptx has somewhat different support in this area.]

(In reply to Jonathan Wakely from comment #4)
> But it's a waste of space in the .so to build lots of symbols that use the
> stubs.

DSOs are not supported. This is strictly for static linking only.

> There are other reasons it might be nice to be able to configure libstdc++
> for something in between a full hosted environment and a minimal
> freestanding one.

If it isn't a horrible hack, like libgfortran minimal mode, then fine.

> > I believe static constructors work (libgfortran uses some), but exception
> > handling does not. I'm not sure what other exotica C++ might need?
>
> Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local
> destructors, but we can survive without them (and have not-fully-conforming
> destruction ordering).

Offload kernels are just fragments of programs, so this is tricky in those cases. Libgomp explicitly calls _init_array and _fini_array as single-threaded kernel launches. Actually, it's not clear that destruction is in any way interesting, given that code running on the GPU has no external access and the resources are all released when the host program exits.

Similarly, C++ threads are not interesting in the GPU-offload case. There are a fixed number of threads launched on entry and they are managed by libgomp. In theory it would be possible to code gthreads/libstdc++ to use them in standalone mode, but really that mode only exists to facilitate compiler testing.
[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208

Andrew Stubbs changed:

What|Removed|Added
Status|UNCONFIRMED|RESOLVED
Resolution|---|FIXED

--- Comment #3 from Andrew Stubbs --- I think this issue should be resolved now. (Other reasons why GCC fails with LLVM 12 still exist.)
[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544 --- Comment #3 from Andrew Stubbs --- The standalone amdgcn configuration does not support C++. There are a number of technical reasons why it doesn't Just Work, but basically it comes down to no-one ever working on it. Our customers were primarily interested in Fortran with C second. C++ offloading works fine provided that there are no library calls or exceptions. Ignoring unsupported C++ language features, for now, I don't think there's any reason why libstdc++ would need to be cut down. We already build the full libgfortran for amdgcn. System calls that make no sense on the GPU were implemented as stubs in Newlib (mostly returning some reasonable errno value), and it would be straightforward to implement more in the same way. I believe static constructors work (libgfortran uses some), but exception handling does not. I'm not sure what other exotica C++ might need? As for exceptions, set-jump-long-jump is not implemented because there was no call for it and I didn't know how to handle the GCN register files properly. Not only are they variable-sized, they're also potentially very large: ranging from ~6KB up to ~65KB, I think (102 32-bit scalar, and 256 2048-bit vector registers, for single-threaded mode, but only 80 scalar and 24 vector registers in maximum occupancy mode, in which case per-thread stack space is also quite limited). I'm not sure how the other exception implementations work.
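The "~6KB up to ~65KB" estimate follows directly from the register counts and widths quoted in the comment. As a quick check of the arithmetic (the helper name `register_file_bytes` is made up for this illustration, and the 256-register cap is simply the largest count the comment mentions):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to save SCALAR_REGS 32-bit scalar registers and
   VECTOR_REGS 2048-bit (64 lanes x 32 bits) vector registers.  */
static size_t
register_file_bytes (size_t scalar_regs, size_t vector_regs)
{
  assert (vector_regs <= 256);

  const size_t scalar_bytes = 32 / 8;    /* 4 bytes each */
  const size_t vector_bytes = 2048 / 8;  /* 256 bytes each */
  return scalar_regs * scalar_bytes + vector_regs * vector_bytes;
}
```

With 102 scalar and 256 vector registers this comes to 408 + 65536 = 65944 bytes; with 80 scalar and 24 vector registers it comes to 320 + 6144 = 6464 bytes.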
[Bug target/101484] [12 Regression] trunk 20210717 ftbfs for amdgcn-amdhsa (gcn offload)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101484

Andrew Stubbs changed:

What|Removed|Added
Ever confirmed|0|1
Status|UNCONFIRMED|NEW
Last reconfirmed||2021-07-17

--- Comment #1 from Andrew Stubbs --- A new warning has been added that falsely identifies any access to a hardcoded constant address as bogus. This has affected a few targets, including GCN libgomp. See pr101374. There's some discussion about what to do about it. E.g. https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574880.html
[Bug target/97827] bootstrap error building the amdgcn-amdhsa offload compiler with LLVM 11
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97827

Andrew Stubbs changed:

What|Removed|Added
CC||xw111luoye at gmail dot com

--- Comment #17 from Andrew Stubbs --- *** Bug 95023 has been marked as a duplicate of this bug. ***
[Bug target/95023] Offloading AMD GCN wiki cannot be followed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95023

Andrew Stubbs changed:

What|Removed|Added
CC||ams at gcc dot gnu.org
Resolution|---|DUPLICATE
Status|UNCONFIRMED|RESOLVED

--- Comment #3 from Andrew Stubbs --- The second problem in this bug is reported in 97827. *** This bug has been marked as a duplicate of bug 97827 ***
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

Andrew Stubbs changed:

What|Removed|Added
Resolution|---|FIXED
Status|NEW|RESOLVED

--- Comment #17 from Andrew Stubbs --- This issue should be fixed now.
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 --- Comment #13 from Andrew Stubbs --- I found a lot more ICEs when testing my patch. They look to be unrelated (TImode coming back to haunt us), but it makes it hard to be sure.
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 --- Comment #9 from Andrew Stubbs --- I found a couple of other places to put force_operand and the full case works now. Running more tests.
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

Andrew Stubbs changed:

What|Removed|Added
Ever confirmed|0|1
Last reconfirmed||2021-05-05
Status|UNCONFIRMED|NEW

--- Comment #6 from Andrew Stubbs --- Using force_operand does fix Tobias's reduced testcase. I'll test it further and let you know.
[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418 --- Comment #4 from Andrew Stubbs --- Alexandre's patch has this: emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize)); Is that generally a valid thing to do? It seems like other places do similar things...
[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208 --- Comment #1 from Andrew Stubbs --- LLVM changed the default parameters, so we either have to change the expectations in the ".amdgcn_target" string (which is basically an assert), or set the attributes we want explicitly on the assembler command line. (Or port binutils to amdgcn, but there's no plan for that.)
[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521 --- Comment #22 from Andrew Stubbs ---
(In reply to Andrew Stubbs from comment #21)
> (In reply to Richard Biener from comment #19)
> > GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> > Andrew - are the bits in the mask dense? Thus for a V4SImode compare
> > would the mask occupy only the lowest 4 bits of the DImode mask?
>
> Yes, that's correct.

Or rather, I should say that that *will* be the case when I add partial vector support; right now it can only be done via masking V64SImode.

I have a patch set, but the last problem is that while_ult doesn't operate on partial integer masks, leading to wrong code. AArch64 doesn't have a problem with this because it uses VBI masks of the right size.

I have a patch that adds the vector size as an operand to while_ult; this seems to fix the problems on GCN, but I need to make corresponding changes for AArch64 also before I can submit those patches, and time is tight.
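To make the "dense mask" point concrete, here is a small model in plain C, with a 64-bit integer playing the role of the DImode mask. Both helpers are hypothetical illustrations, not backend code: `dense_lane_mask` builds the mask for an N-lane vector, and `while_ult_mask` models a while_ult that takes the vector size as an extra operand, as the patch described above is said to do.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low NLANES bits set, e.g. 0xF for a 4-lane compare.  */
static uint64_t
dense_lane_mask (unsigned nlanes)
{
  assert (nlanes <= 64);
  return nlanes == 64 ? UINT64_MAX : (UINT64_C (1) << nlanes) - 1;
}

/* Model of while_ult with an explicit lane count: bit I is set when
   START + I < END, and no bit at or above NLANES is ever set.
   (Wrap-around of START + I is ignored in this sketch.)  */
static uint64_t
while_ult_mask (uint64_t start, uint64_t end, unsigned nlanes)
{
  assert (nlanes <= 64);

  uint64_t mask = 0;
  for (unsigned i = 0; i < nlanes; i++)
    if (start + i < end)
      mask |= UINT64_C (1) << i;
  return mask;
}
```

Without the lane count, a loop bound of 10 would set ten mask bits even for a 4-lane vector; with it, the result is confined to the lowest 4 bits.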
[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521 --- Comment #21 from Andrew Stubbs ---
(In reply to Richard Biener from comment #19)
> GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> Andrew - are the bits in the mask dense? Thus for a V4SImode compare
> would the mask occupy only the lowest 4 bits of the DImode mask?

Yes, that's correct.
[Bug tree-optimization/84958] int loads not eliminated against larger stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84958 --- Comment #6 from Andrew Stubbs ---
(In reply to Tom de Vries from comment #5)
> I've removed the xfail for nvptx.
>
> The only remaining xfail is for gcn. Is that one still necessary?

The test still fails for gcn.
[Bug libgomp/97332] [gcn] GCN_NUM_GANGS/GCN_NUM_WORKERS override compile-time constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97332

Andrew Stubbs changed:

What|Removed|Added
Ever confirmed|0|1
Last reconfirmed||2020-10-08
Status|UNCONFIRMED|NEW

--- Comment #1 from Andrew Stubbs --- At the point the overrides are applied (run_kernel) the code only knows what dimensions were selected at runtime, not how those figures were arrived at. It then prints (with GCN_DEBUG set) the "launch attributes" and "launch actuals". To fix this the overrides will have to be applied much earlier, and independently for OpenACC (gcn_exec) and OpenMP (parse_target_attributes). That or the automatic balancing be applied later. Or perhaps the original attributes be stored for later inspection (but GOMP_kernel_launch_attributes is defined by libgomp). The "attributes" and "actuals" will need to be overhauled. Probably get_group_size can be removed. It ought to be doable though.
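One way to read the "store the original attributes" option is to keep, alongside each dimension, a record of whether it was fixed at compile time, so that a later override step can respect it. The sketch below assumes that reading; `launch_dim` and `resolve_dim` are hypothetical and do not correspond to anything in the libgomp GCN plugin, and an override value of 0 is taken to mean "environment variable not set".

```c
#include <assert.h>
#include <stdbool.h>

/* A requested launch dimension plus a note of where it came from.  */
struct launch_dim
{
  unsigned value;
  bool compile_time_constant;
};

/* Pick the dimension to launch with: a compile-time constant wins,
   then an environment override (0 meaning "not set"), then the
   default chosen by the runtime's automatic balancing.  */
static unsigned
resolve_dim (struct launch_dim requested, unsigned env_override,
             unsigned runtime_default)
{
  assert (runtime_default > 0);

  if (requested.compile_time_constant)
    return requested.value;
  if (env_override > 0)
    return env_override;
  return runtime_default;
}
```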