from:"ams at gcc dot gnu.org via Gcc\-bugs"

[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"

2024-07-30 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #9 from Andrew Stubbs  ---
The patch fixed ASHIFT, but we now have the exact same failure in LSHIFTRT. 

Looking at the code, there's probably a few other cases in that switch
statement. At least ASHIFTRT, and probably all the other uses of CONSTANT_P.

[Bug target/116103] [15 Regression] GCN vs. "Internal-fn: Only allow modes describe types for internal fn[PR115961]"

2024-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116103

--- Comment #8 from Andrew Stubbs  ---
(In reply to Thomas Schwinge from comment #4)
> (In reply to Richard Biener from comment #2)
> >   if (VECTOR_BOOLEAN_TYPE_P (type)
> >   && SCALAR_INT_MODE_P (TYPE_MODE (type)))
> > return true;
> 
> >   && TYPE_PRECISION (TREE_TYPE (type)) == 1
> 
> > Thomas, does that resolve the issue?
> 
> Thanks, it does: restores the original '*.s' exactly.  (Assuming that's the
> desired outcome, Andrew?)

It looks like while_ult is being rejected ... we really do want that!

[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"

2024-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104

--- Comment #4 from Andrew Stubbs  ---
The problem insn is this:

(insn 31 30 32 2 (set (reg:V2SI 711)
(ashift:V2SI (reg:V2SI 161 v1)
(const_vector:V2SI [
(const_int 3 [0x3]) repeated x2
])))
"/srv/data/ams/upB/gcn/src/gcc-mainline/libgcc/libgcc2.c":70:11 2043
{vashlv2si3}
 (expr_list:REG_DEAD (reg:V2SI 161 v1)
(nil)))

That is, it shifts each element in a 2-element vector by 3 bits. This looks
like valid GCN code to me.

ext_dce does not seem to be handling the vector case.

[Bug target/116104] [15 Regression] GCN vs. "[rtl-optimization/116037] Explicitly track if a destination was skipped in ext-dce"

2024-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116104

--- Comment #3 from Andrew Stubbs  ---
(In reply to Jeffrey A. Law from comment #1)
> So, how am I supposed to reproduce this?  I don't have an assembler/binutils
> for amdgcn and thus libgcc won't configure.  Thus I can't extract a testcase.
> 
> Alternately, if you could just attach a .i file, it'd be helpful.

In fact, you probably do have an assembler and linker already installed!

Just copy "llvm-mc" to amdgcn-amdhsa/bin/as and llvm's "lld" to
amdgcn-amdhsa/bin/ld. You might want ar, ranlib, and nm too. Full details here:
https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-28 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #18 from Andrew Stubbs  ---
That should fix the broken validation check. All V32 permutations should work
now on RDNA GPUs, I think. V16 and smaller were already working fine.

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #16 from Andrew Stubbs  ---
On 26/06/2024 14:41, rguenther at suse dot de wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #15 from rguenther at suse dot de  ---
>>> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
>>> two vectors src0 and src1 into one 32 element dst vector (it's no longer
>>> required that src0 and src1 line up with the dst vector size btw, they
>>> might have different nelt).  So the loop would reject interleaving
>>> the low parts of two 32 element vectors, a permute that would look like
>>> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
>>> mean you can never mix the two vector inputs?  Or does GCN not have
>>> a two-to-one vector permute instruction?
>>
>> GCN does not have two-to-one vector permute in hardware, so we do two
>> permutes and a vec_merge to get the same effect.
>>
>> GFX9 can permute all the elements within a 64 lane vector arbitrarily.
>>
>> GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but
>> no value may cross the boundary. AFAIK there's no way to do that via any
>> vector instruction (i.e. without writing to memory, or extracting values
>> element-wise).
> 
> I see - so it cannot even swap low-32 and high-32?  I'm thinking of
> what sub-part of permutes would be possible by extending the two-to-one
> vec_merge trick.

No(?)

The 64-lane compatibility mode works, under the hood, by allocating 
double the number of 32-lane registers and then executing each 
instruction twice. Mostly this is invisible, but it gets exposed for 
permutations and the like. Logically, the microarchitecture could do a 
vec_merge to DTRT, but I've not found a way to express that.

It's possible I missed something when RTFM.

> OTOH we restrict GFX10/11 to 32 lane vectors so in practice this
> restriction should be fine.

Yes, with the "31" fixed it should work.

Andrew

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #14 from Andrew Stubbs  ---
On 26/06/2024 13:34, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #13 from Richard Biener  ---
> (In reply to Richard Biener from comment #12)
>> (In reply to Andrew Stubbs from comment #10)
>>> GFX10 has more limited permutation capabilities than GFX9 because it
>>> only has 32-lane vectors natively, even though we're using the 64-lane
>>> "compatibility" mode.
>>>
>>> However, in theory, the permutation capabilities on V32 and below should
>>> be the same, and some permutations on V64 are allowed, so I don't know
>>> why it doesn't use it. It's possible I broke the logic in
>>> gcn_vectorize_vec_perm_const:
>>>
>>> /* RDNA devices can only do permutations within each group of 32-lanes.
>>>Reject permutations that cross the boundary.  */
>>> if (TARGET_RDNA2_PLUS)
>>>   for (unsigned int i = 0; i < nelt; i++)
>>> if (i < 31 ? perm[i] > 31 : perm[i] < 32)
>>>   return false;
>>>
>>> It looks right to me though?
>>
>> nelt == 32 so I think the last element has the wrong check applied?
>>
>> It should be
>>
>>> if (i < 32 ? perm[i] > 31 : perm[i] < 32)
>>
>> I think.  With that the vectorization happens in a similar way but the
>> failure still doesn't reproduce (without the patch, of course).

Oops, I think you're right.

> Btw, the above looks quite odd for nelt == 32 anyway - we are permuting
> two vectors src0 and src1 into one 32 element dst vector (it's no longer
> required that src0 and src1 line up with the dst vector size btw, they
> might have different nelt).  So the loop would reject interleaving
> the low parts of two 32 element vectors, a permute that would look like
> { 0, 32, 1, 33, 2, 34 ... } so does "within each group of 32-lanes"
> mean you can never mix the two vector inputs?  Or does GCN not have
> a two-to-one vector permute instruction?

GCN does not have two-to-one vector permute in hardware, so we do two 
permutes and a vec_merge to get the same effect.

GFX9 can permute all the elements within a 64 lane vector arbitrarily.

GFX10 and GFX11 can permute the low-32 and high-32 elements freely, but 
no value may cross the boundary. AFAIK there's no way to do that via any 
vector instruction (i.e. without writing to memory, or extracting values 
element-wise).

In theory, we could implement permutes with different sized inputs and 
outputs, but right now those are rejected early. The interleave example 
wouldn't work in hardware, for GFX10, but we could have it for GFX9.

However, I think you might be right about the numbering of the "perm" 
array; we probably need to be testing "(perm[i] % nelt) > 31" if we are 
to support two-to-one permutations.

Thanks for looking at this.

Andrew

[Bug target/115640] [15 Regression] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-26 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #10 from Andrew Stubbs  ---
On 26/06/2024 12:05, rguenth at gcc dot gnu.org wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640
> 
> --- Comment #8 from Richard Biener  ---
> (In reply to Richard Biener from comment #7)
>> I will have a look (and for run validation try to reproduce with gfx1036).
> 
> OK, so with gfx1036 we end up using 16 byte vectors and the testcase
> passes.  The difference with gfx908 is
> 
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   vect_model_load_cost: aligned.
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   vect_model_load_cost: inside_cost = 2, prologue_cost = 0 .
> 
> vs.
> 
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   ==> examining statement: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:   unsupported vect permute { 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 
> 10
> 10 11 11 12 12 13 13 14 14 15 15 }
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:   unsupported load permutation
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:19:72:
> missed:   not vectorized: relevant stmt not supported: _14 = aa[_13];
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:   removing SLP instance operations starting from: REALPART_EXPR
> <(*hadcur_24(D))[_2]> = _86;
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> missed:  unsupported SLP instances
> /space/rguenther/src/gcc-autopar_devel/gcc/testsuite/gfortran.dg/vect/pr115528.f:16:12:
> note:  re-trying with SLP disabled
> 
> so gfx1036 cannot do such permutes but gfx908 can?

GFX10 has more limited permutation capabilities than GFX9 because it 
only has 32-lane vectors natively, even though we're using the 64-lane 
"compatibility" mode.

However, in theory, the permutation capabilities on V32 and below should 
be the same, and some permutations on V64 are allowed, so I don't know 
why it doesn't use it. It's possible I broke the logic in 
gcn_vectorize_vec_perm_const:

   /* RDNA devices can only do permutations within each group of 32-lanes.
  Reject permutations that cross the boundary.  */
   if (TARGET_RDNA2_PLUS)
 for (unsigned int i = 0; i < nelt; i++)
   if (i < 31 ? perm[i] > 31 : perm[i] < 32)
 return false;

It looks right to me though?

The vec_extract patterns that also use permutations are likewise 
supposedly still enabled for V32 and below.

Andrew

[Bug target/115640] GCN: FAIL: gfortran.dg/vect/pr115528.f -O execution test

2024-06-25 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115640

--- Comment #3 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #2)
> If you force GCN to use fixed length vectors (how?), does it work?  How's
> it behaving on aarch64 with SVE?  (the CI was happy, but maybe doesn't
> enable SVE)

I believe "--param vect-partial-vector-usage=0" will disable the use of
WHILE_ULT? The default is "2" for the standalone toolchain, and last I checked
the value is inherited from the host in the offload toolchain; the default for
x86_64 was "1", meaning approximately "only use partial vectors in epilogue
loops".

[Bug target/115631] [15 Regression] GCN: [-PASS:-]{+FAIL:+} c-c++-common/torture/builtin-arith-overflow-6.c -O2 execution test

2024-06-25 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115631

--- Comment #1 from Andrew Stubbs  ---
It was writing 0 to s12 (scalar register) and then moving the zero to lane zero
of v0 (vector register).

Now it's writing the 0 directly to v0, of which all but lane zero is masked.

These should be identical (unless s12 was also live).

The problem must be elsewhere.

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs

2024-06-03 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #11 from Andrew Stubbs  ---
(In reply to rguent...@suse.de from comment #10)
> On Mon, 3 Jun 2024, ams at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304
> > 
> > --- Comment #9 from Andrew Stubbs  ---
> > (In reply to Richard Biener from comment #6)
> > > The best strathegy for GCN would be to gather V4QImode aka SImode into the
> > > V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> > > doing consecutive loads isn't a good strategy here.
> > 
> > I don't fully understand what you're trying to say here, so apologies if you
> > knew all this already and I missed the point.
> > 
> > In general, on GCN V4QImode is not in any way equivalent to SImode (when the
> > values are in registers). The vector registers are not one single string of
> > re-interpretable bits.
> > 
> > For the same reason, you can't load a value as V64QImode and then try to
> > interpret it as V16SImode. GCN vector registers just don't work like
> > SSE/Neon/etc.
> > 
> > When you load a V64QImode vector, each lane is extended to 32 bits, so what 
> > you
> > actually get in hardware is a V64SImode vector.
> > 
> > Likewise, when you load a V4QImode vector the hardware representation is
> > actually V4SImode (which in itself is just V64SImode with undefined values 
> > in
> > the unused lanes).
> 
> I see.  I wonder if there's not one or two latent wrong-code because of
> this and the vectorizers assumptions ;)  I suppose modes_tieable_p
> will tell us whether a VIEW_CONVERT_EXPR will do the right thing?
> Is GET_MODE_SIZE (V64QImode) == GET_MODE_SIZE (V64SImode) btw?
> And V64QImode really V64PSImode?

The mode size says how big it will be when written to memory, so no they're not
the same. I believe this matches the scalar QImode behaviour.

We don't use any PSI modes. There are (some) machine instructions for V64QImode
(and V64HImode) so we don't want to lose that information.

There may well be some bugs, but we have handling for conversions in a number
of places. There are truncate and extend patterns that operate lane-wise, and
vec_extract can take a subset of a vector, IIRC.

> Still for a V64QImode load on { c[0], c[1], c[2], c[3], c[32], c[33], 
> c[34], c[35], ... } it's probably best to use a single V64QImode gather 
> with GCN then rather than four "consecutive" V64QImode loads and then
> element swizzling.

Fewer loads are always better, and permutations are expensive operations (and
don't work with 64-lane vectors on RDNA devices because they're actually two
32-lane vectors stuck together) so it can certainly make sense to use gather
with a vector of permuted offsets (although it can be expensive to generate
that vector in the first place).

[Bug tree-optimization/115304] gcc.dg/vect/slp-gap-1.c FAILs

2024-06-03 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115304

--- Comment #9 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #6)
> The best strathegy for GCN would be to gather V4QImode aka SImode into the
> V64QImode (or V16SImode) vector.  For pix2 we have a gap of 28 elements,
> doing consecutive loads isn't a good strategy here.

I don't fully understand what you're trying to say here, so apologies if you
knew all this already and I missed the point.

In general, on GCN V4QImode is not in any way equivalent to SImode (when the
values are in registers). The vector registers are not one single string of
re-interpretable bits.

For the same reason, you can't load a value as V64QImode and then try to
interpret it as V16SImode. GCN vector registers just don't work like
SSE/Neon/etc.

When you load a V64QImode vector, each lane is extended to 32 bits, so what you
actually get in hardware is a V64SImode vector.

Likewise, when you load a V4QImode vector the hardware representation is
actually V4SImode (which in itself is just V64SImode with undefined values in
the unused lanes).

[Bug driver/114717] '-fcf-protection' vs. offloading compilation

2024-04-15 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114717

--- Comment #3 from Andrew Stubbs  ---
Can this be filtered (safely) in mkoffload? That tool is
offload-target-specific, so no problem with "if offload target were to support
it".

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #4 from Andrew Stubbs  ---
Yes, that's what the simd-math-3* tests do.

The simd-math-5* tests are explicitly supposed to be doing this in the context
of the autovectorizer.

If these tests are being compiled as (newly) intended then we should change the
expected results.

So, questions:

1. Are the new results actually correct? (So far I only know that being
different is expected.)

2. Is there some other testcase form that would exercise the previously
intended routines?

3. Is the new behaviour configurable? I don't think the 16-bit shift bug ever
existed on GCN (in which "short" vectors actually have excess bits in each
lane, much like scalar registers do).

[Bug target/114302] [14 Regression] GCN regressions after: vect: Tighten vect_determine_precisions_from_range [PR113281]

2024-03-27 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114302

--- Comment #2 from Andrew Stubbs  ---
The execution test checks that each of the libgcc routines work correctly, and
the scan assembler tests make sure that we're getting coverage of all of them.

In this case, the failure indicates that we're not testing the routine we were
aiming for (but I think it does execute correctly and give a good result).

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2024-02-12 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #8 from Andrew Stubbs  ---
(In reply to seurer from comment #7)
> On the BE machine:
> 
> seurer@nilram:~/gcc/git/build/gcc-test$ ulimit -a
> real-time non-blocking time  (microseconds, -R) unlimited
> ...
> max locked memory   (kbytes, -l) 529679232
> ...

That's a suspiciously large number, but OK.

> seurer@nilram:~/gcc/git/build/gcc-test$ getconf PAGESIZE
> 65536
> 
> 
> There were no messages.  Running it in gdb I get:
> 
> (gdb) where
> #0  0x0fce3340 in ?? () from /lib32/libc.so.6
> #1  0x0fc851e4 in raise () from /lib32/libc.so.6
> #2  0x0fc6a128 in abort () from /lib32/libc.so.6
> #3  0x1ae4 in set_pin_limit (size=size@entry=131072) at
> /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:44
> #4  0x1754 in main () at
> /home/seurer/gcc/git/gcc-test/libgomp/testsuite/libgomp.c/alloc-pinned-4.c:
> 106
> 
> 
>   if (getrlimit (RLIMIT_MEMLOCK, ))
> abort ();   // line 44 in alloc-pinned-4.c

Why would that fail? Perhaps you can investigate the errno. You're probably
best placed to submit a patch for whatever this issue is.

> 
> This is a Debian Trixie machine and it too is using whatever the defaults
> are.

Good to know.

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2024-02-08 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #6 from Andrew Stubbs  ---
(In reply to seurer from comment #5)
> I should note that pinned-2 also fails on powerpc64 LE.
> 
> make  -k check-target-libgomp RUNTESTFLAGS="c.exp=libgomp.c/alloc-pinned-*"
> FAIL: libgomp.c/alloc-pinned-1.c execution test
> FAIL: libgomp.c/alloc-pinned-2.c execution test
> 
> 
> On powerpc64 BE pinned-3 and -4 fail (but not -1 and -2):
> 
> make  -k check-target-libgomp RUNTESTFLAGS="--target_board=unix'{-m32,-m64}'
> c.exp=libgomp.c/alloc-pinned-*"
> FAIL: libgomp.c/alloc-pinned-3.c execution test
> FAIL: libgomp.c/alloc-pinned-4.c execution test

Please show any messages in the libgomp.log file, and find out what the page
sizes and locked memory limits are on both machines.

[Bug target/113615] internal compiler error: in extract_insn, at recog.cc:2812

2024-01-29 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113615

--- Comment #3 from Andrew Stubbs  ---
I did see these, but I hadn't had time to chase them up.

The proposed patch is exactly the sort of solution I was expecting to find,
short term. Have you confirmed that it fixes all the cases?

A proper solution is to find out how to implement reductions with the RDNA ISA,
of course, but that's probably non-trivial (as in, I'm pretty sure it's more
than renaming a few mnemonics), and low-priority as GCC does a reasonably good 
job without them.

[Bug middle-end/113199] [14 Regression][GCN] ICE (segfault) due to invalid 'loop_mask_46 = VEC_PERM_EXPR' when compiling Newlib's wcsftime.c

2024-01-09 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113199

--- Comment #5 from Andrew Stubbs  ---
I can confirm that I can now build the amdgcn toolchain once more. :-)

Thanks.

[Bug middle-end/113163] [14 Regression][GCN] ICE in vect_peel_nonlinear_iv_init, at tree-vect-loop.cc:9420

2024-01-02 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113163

Andrew Stubbs  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org

--- Comment #11 from Andrew Stubbs  ---
(In reply to Tamar Christina from comment #7)
> This seems to happen because the vectorizer decides to use partial vectors
> to vectorize the loop and the target picks a nonlinear induction step which
> we can't support for early breaks.

In which hook is this selected?

I'm not aware of this being a deliberate choice we made...

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2023-12-27 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #4 from Andrew Stubbs  ---
It's going to be difficult to make this test work when only one page of locked
memory is available. :-(

I will look at making it "unsupported".

[Bug testsuite/113085] New test case libgomp.c/alloc-pinned-1.c from r14-6499-g348874f0baac0f fails

2023-12-20 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113085

--- Comment #1 from Andrew Stubbs  ---
That is a typo.

I don't want to make it pass on machines that have insufficient memory
configured because it will mask the case where it fails for another reason.

However, the testcase was originally supposed to fit in 64kB. Is your page size
larger than 4kB?

[Bug target/113022] GCN offloading bricked by "amdgcn: Work around XNACK register allocation problem"

2023-12-15 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113022

--- Comment #1 from Andrew Stubbs  ---
This is what I get for trying to get this done before vacation. :(

Yes, there's probably something in mkoffload that has to match the default
change from -mxnack=any to -mxnack=off on the older ISAs.

[Bug target/112937] [14 Regression] GCN: FAILs due to unconditional 'f->use_flat_addressing = true;'

2023-12-11 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112937

--- Comment #2 from Andrew Stubbs  ---
Flat addressing *should* be the safe option that always works (although using
"global" address space permits slightly more efficient offset options).

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-14 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

Andrew Stubbs  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #13 from Andrew Stubbs  ---
This should be fixed now.

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-14 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

--- Comment #7 from Andrew Stubbs  ---
Simply changing to OPTAB_WIDEN solves the ICE, but I don't know if it does so
in a sensible way, for RISC V.

@@ -7489,7 +7489,7 @@ store_constructor (tree exp, rtx target, int cleared,
poly_int64 size,
if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
  tmp = expand_binop (mode, and_optab, tmp,
  GEN_INT ((1 << nunits) - 1), target,
- true, OPTAB_DIRECT);
+ true, OPTAB_WIDEN);
if (tmp != target)
  emit_move_insn (target, tmp);
break;

Here are the instructions it generates:

(set (reg:DI 165)
(and:DI (subreg:DI (reg:SI 164) 0)
(const_int 1 [0x1])))
(set (reg:SI 154)
(subreg:SI (reg:DI 165) 0))

Should I use that patch? I think it's harmless on targets where OPTAB_DIRECT
would work.

[Bug target/112481] [14 Regression] RISCV: ICE: Segmentation fault when compiling pr110817-3.c

2023-11-13 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112481

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-11-13
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #4 from Andrew Stubbs  ---
It fails because optab_handler fails to find an instruction for "and_optab" in
SImode.  I didn't consider handling that case; seems so unlikely.

I guess architectures that can't "and" masks don't get to have safe masks? ...
I'll work on a fix.

[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'

2023-11-10 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #2 from Andrew Stubbs  ---
This should be fixed now.

[Bug target/112313] [14 Regression] GCN target 'gcc.dg/pr111082.c' ICE, 'during RTL pass: vregs': 'error: unrecognizable insn'

2023-11-10 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112313

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #2 from Andrew Stubbs  ---
This is now fixed.

[Bug target/112308] [14 Regression] GCN: 'error: literal operands are not supported' for 'v_add_co_u32'

2023-11-09 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112308

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-11-09
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"

2023-10-27 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #3 from Andrew Stubbs  ---
The patch should fix the bug.

[Bug target/112088] [14 Regression] GCN target testing broken by "amdgcn: add -march=gfx1030 EXPERIMENTAL"

2023-10-27 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112088

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2023-10-27
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #1 from Andrew Stubbs  ---
I'm testing a fix for this.

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-20 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #5 from Andrew Stubbs  ---
One thing that is unusual about the GCN stack pointer is that it's actually two
registers. Could this be breaking some cprop assumptions?

GCN can't fit an address in one (SImode) register so all (DImode) pointers
require a pair of registers. We had to rework the dwarf stack representation
code for this architecture, so I'm pretty sure no other port does this.

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-20 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #3 from Andrew Stubbs  ---
It's curious that this affects the Fiji target only, and not the newer targets
at all.

There are some additional register options for multiply instructions, some
differences to atomics, but mostly the difference is that Fiji's "flat" load
and store instructions can't have offsets.

[Bug target/110313] [14 Regression] GCN Fiji reload ICE in 'process_alt_operands'

2023-06-20 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110313

--- Comment #1 from Andrew Stubbs  ---
This ICE also affect the following standalone test failures (raw amdgcn, no
offloading):

gfortran.dg/assumed_rank_21.f90
gfortran.dg/finalize_38.f90
gfortran.dg/finalize_38a.f90

[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386

2023-03-15 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898

--- Comment #4 from Andrew Stubbs  ---
I did not know there was a way to do that! I'll add this to my to-do list.

[Bug testsuite/108898] [13 Regression] Test introduced by r13-6278-g3da77f217c8b2089ecba3eb201e727c3fcdcd19d failed on i386

2023-02-23 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108898

--- Comment #1 from Andrew Stubbs  ---
I tested it on i686-pc-linux-gnu before I posted the patch, and it was working
then. Can you be more specific what configuration you were testing, please?

[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]

2022-11-03 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #4 from Andrew Stubbs  ---
Fixed.

[Bug other/89863] [meta-bug] Issues in gcc that other static analyzers (cppcheck, clang-static-analyzer, PVS-studio) find that gcc misses

2022-11-03 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89863
Bug 89863 depends on bug 107510, which changed state.

Bug 107510 Summary: gcc/config/gcn/gcn.cc:4930:9: style: Same expression on 
both sides of '||'. [duplicateExpression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug target/107510] gcc/config/gcn/gcn.cc:4930:9: style: Same expression on both sides of '||'. [duplicateExpression]

2022-11-03 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107510

Andrew Stubbs  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org
 Status|NEW |ASSIGNED

--- Comment #2 from Andrew Stubbs  ---
Oops, I thought I fixed that. :(

[Bug tree-optimization/107096] Fully masking vectorization with AVX512 ICEs gcc.dg/vect/vect-over-widen-*.c

2022-10-10 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

--- Comment #4 from Andrew Stubbs  ---
I don't understand rgroups, but I can say that GCN masks are very simply
one-bit-one-lane. There are always 64-lanes, regardless of the type, so V64QI
mode has fewer bytes and bits than V64DImode (when written to memory).

This is different to most other architectures where the bit-size remains the
same and number of lanes varies with the inner type, and has caused us some
issues with invalid assumptions in GCC (e.g. "there's no need for sign-extends
in vector registers" is not true for GCN).

However, I think it's the same as you're describing for AVX512, at least in
this respect.

Incidentally I'm on the cusp of adding multiple "virtual" vector sizes in the
GCN backend (in lieu of implementing full mask support everywhere in the
middle-end and fixing all the cost assumptions), so these VIEW_CONVERT_EXPR
issues are getting worse. I have a bunch of vec_extract patterns that fix up
some of it. Within the backed, the V32, V16, V8, V4 and V2 vectors are all
really just 64-lane vectors with the mask preset, so the mask has to remain
DImode or register allocation becomes tricky.

[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64

2022-09-30 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088

--- Comment #9 from Andrew Stubbs  ---
I can confirm that the patch fixes the amdgcn build.

[Bug middle-end/107088] [13 Regression] cselib ICE building __trunctfxf2 on ia64

2022-09-30 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107088

Andrew Stubbs  changed:

   What|Removed |Added

 Target|ia64-*-*|ia64-*-*, amdgcn-*-*
 CC||ams at gcc dot gnu.org

--- Comment #7 from Andrew Stubbs  ---
I get the same failure on amdgcn building newlib/libm/math/kf_rem_pio2.c

[Bug tree-optimization/106476] New: ICE generating FOLD_EXTRACT_LAST

2022-07-29 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106476

Bug ID: 106476
   Summary: ICE generating FOLD_EXTRACT_LAST
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ams at gcc dot gnu.org
CC: rguenther at suse dot de
  Target Milestone: ---
Target: amdgcn-amdhsa

Commit 8f4d9c1deda "amdgcn: 64-bit not" exposed an ICE in tree-vect_stmts.cc
when compiling gcc.dg/torture/pr67470.c at -O2 for amdgcn. The newly
implemented op is not the problem, but it allows this optimization (and many
others) to proceed, and the error is no longer hidden.

amdgcn has masked vectors and fold_extract_last, which leads to a code path
through tree-vect-stmts.cc that has

   vec_then_clause = vec_oprnds2[i];
   if (reduction_type != EXTRACT_LAST_REDUCTION)
 vec_else_clause = vec_oprnds3[i];

and then

   /* Instead of doing ~x ? y : z do x ? z : y.  */
   vec_compare = new_temp;
   std::swap (vec_then_clause, vec_else_clause);

and finally

   new_stmt = gimple_build_call_internal
   (IFN_FOLD_EXTRACT_LAST, 3, else_clause, vec_compare,
vec_then_clause);

in which vec_then_clause remains set to NULL_TREE.

The dump shows

   e_lsm.16_32 = .FOLD_EXTRACT_LAST (e_lsm.16_8, _70, );

(note the last field is missing.)

I can fix the ICE if I add "else vec_else_clause = integer_zero_node", but I'm
not sure that is the correct logical solution.

(CC Richi who touched this code last)

[Bug target/105873] [amdgcn][OpenMP] task reductions fail with "team master not responding; slave thread aborting"

2022-06-08 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105873

--- Comment #4 from Andrew Stubbs  ---
I think unused threads should be given a no-op function to run, not a null
pointer. The GCN implementation cannot tell the difference between a null
pointer and an unset pointer (which is what happens when the master thread
dies).

There's also a potential issue of what happens when barriers occur within an
active thread when there are also inactive threads. GCN barrier instructions
are unconditional, meaning that all the live threads must respond. The inactive
threads can do so, in a harmless way, as long as they allowed to spin, but we
don't want them spinning forever when the master dies.

I believe the current barrier implementation skips the barrier instruction when
the team's thread count is 1. This is how we avoid issues with nested teams and
tasks. I don't know why that doesn't help here?

[Bug target/105246] [amdgcn] Use library call for SQRT with -ffast-math + provide additional option to use single-precsion opcode

2022-04-13 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105246

--- Comment #2 from Andrew Stubbs  ---
When we first coded this we only had the GCN3 ISA manual, which says nothing
about the accuracy.

Now I look in the Vega manual (GCN5) I see:

  Square root with perhaps not the accuracy you were hoping for --
  (2**29)ULP accuracy. On the upside, denormals are supported.

The most recent CDNA2 manual is a bit less verbose:

  Square root. Precision is (2**29) ULP, and supports denormals.

The compiler already emits Newton Raphson iterations for division with
-ffast-math, so I'm sure it can be done, but I'm not too clear on the
mathematics myself.

[Bug target/100181] hot-cold partitioned code doesn't assemble

2022-02-11 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100181

--- Comment #13 from Andrew Stubbs  ---
I've updated the LLVM version documentation at
https://gcc.gnu.org/wiki/Offloading#For_AMD_GCN:

It's LLVM 9 or 13.0.1 now (nothing in between), and will be 13.0.1+ for the
next release (dropping LLVM 9 because we'll want to add newer device support to
GCC soonish).

[Bug middle-end/104026] [12 Regression] ICE in wide_int_to_tree_1, at tree.c:1755 via tree-vect-loop-manip.c:673

2022-01-14 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104026

Andrew Stubbs  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org

--- Comment #6 from Andrew Stubbs  ---
amdgcn always uses 64-lane vectors, regardless of type, and relies on masking
to support anything smaller.

The len_store pattern seems to have been introduced in July 2020 which is more
recent than the last major work in the amdgcn backend.

[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821

2021-11-25 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #6 from Andrew Stubbs  ---
This problem should be fixed now.

[Bug target/103396] [12 Regression][GCN][BUILD] ICE RTL check: access of elt 4 of vector with last elt 3 in move_callee_saved_registers, at config/gcn/gcn.c:2821

2021-11-24 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103396

Andrew Stubbs  changed:

   What|Removed |Added

   Last reconfirmed||2021-11-24
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #4 from Andrew Stubbs  ---
I think I have a fix for this. It happens when the link register has to be
saved because it is used implicitly by a function call, but the register is
never explicitly mentioned anywhere else in the function. I don't know why this
hasn't been a problem before now?

[Bug target/103201] [12 Regression] trunk 20211111 ftbfs for amdgcn – libgomp/teams.c:49:6: error: 'struct gomp_thread' has no member named 'num_teams'

2021-11-12 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103201

--- Comment #3 from Andrew Stubbs  ---
I did some preliminary testing on your patch: the libgomp.c/target-teams-1.c
testcase runs fine on amdgcn. I presume that that covers most of the existing
features of those runtime calls?

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-10-04 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #8 from Andrew Stubbs  ---
Did you get the C version to return anything other than "-1"? (The expected
result is "2".)

I'm still trying to determine if the device is compatible, but the mapping
problem looks like a different issue.

Your code works fine on my device using a somewhat more recent GCC build. (I
can't install that exact toolchain right now.)

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-10-01 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #5 from Andrew Stubbs  ---
Sorry, I should have said to compile with -fopenacc.

If you did do that, please post the GCN_DEBUG output.

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-10-01 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #3 from Andrew Stubbs  ---
That output shows that we have the correct libgomp and rocm is installed and
working. Libgomp initialized the GCN plugin, but did not attempt to initialize
the device (the next message in the output should have been "Selected kernel
arguments memory region", or at least a GCN error message).

Instead we have a target-independent libgomp error. Presumably the kernel
metadata is malformed, somehow?

I think we need a testcase to debug this further, preferably reduced to be as
simple as possible.

Perhaps it would be a good idea to start with a minimal toy example and see if
that works on the device.

#include 
#include 

int main ()
{
  int v = 1;

#pragma acc parallel copy(v)
  {
if (acc_on_device(acc_device_host))
  v = -1; // error
else {
  v = 2; // success
}
  }

  printf ("v is %d\n", v);
  return v;
}

[Bug target/102544] GCN offloading not working for 'amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-'

2021-09-30 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102544

--- Comment #1 from Andrew Stubbs  ---
Please set "export GCN_DEBUG=1", try it again, and post the output.

[Bug target/102260] amdgcn offload compiler fails to configure, not matching target directive's target id

2021-09-09 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102260

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-09-09
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |ams at gcc dot gnu.org

--- Comment #1 from Andrew Stubbs  ---
In addition to changing the amdgcn_target syntax in LLVM 13, the LLVM GCN guys
have also renamed the "sram-ecc" attribute to "sramecc" on the CLI, and have
not provided any backwards compatibility for either change.

These are not helpful decisions and will require configure tests in GCC to
support all the variations. :-(

I'm working on it.

[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"

2021-07-21 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544

--- Comment #5 from Andrew Stubbs  ---
[Note: all of my comments refer to the amdgcn case. nvptx has somewhat
different support in this area.]

(In reply to Jonathan Wakely from comment #4)
> But it's a waste of space in the .so to build lots of symbols that use the
> stubs.

DSOs are not supported. This is strictly for static linking only.

> There are other reasons it might be nice to be able to configure libstdc++
> for something in between a full hosted environment and a minimal
> freestanding one.

If it isn't a horrible hack, like libgfortran minimal mode, then fine.

> > I believe static constructors work (libgfortran uses some), but exception
> > handling does not. I'm not sure what other exotica C++ might need?
> 
> Ideally, __cxa_atexit and __cxa_thread_atexit for static and thread-local
> destructors, but we can survive without them (and have not-fully-conforming
> destruction ordering).

Offload kernels are just fragments of programs, so this is tricky in those
cases. Libgomp explicitly calls _init_array and _fini_array as single-threaded
kernel launches. Actually, it's not clear that deconstruction is in any way
interesting, given that code running on the GPU has no external access and the
resources are all released when the host program exits.

Similarly, C++ threads are not interesting in the GPU-offload case. There are a
fixed number or threads launched on entry and they are managed by libgomp. In
theory it would be possible to code gthreads/libstdc++ to use them in
standalone mode, but really that mode only exists to facilitate compiler
testing.

[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12

2021-07-21 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208

Andrew Stubbs  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Andrew Stubbs  ---
I think this issue should be resolved now.

(Other reasons why GCC fails with LLVM 12 still exist).

[Bug target/101544] [OpenMP][AMDGCN][nvptx] C++ offloading: unresolved _Znwm = "operator new(unsigned long)"

2021-07-21 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101544

--- Comment #3 from Andrew Stubbs  ---
The standalone amdgcn configuration does not support C++. There are a number of
technical reasons why it doesn't Just Work, but basically it comes down to
no-one ever working on it. Our customers were primarily interested in Fortran
with C second.

C++ offloading works fine provided that there are no library calls or
exceptions.

Ignoring unsupported C++ language features, for now, I don't think there's any
reason why libstdc++ would need to be cut down. We already build the full
libgfortran for amdgcn. System calls that make no sense on the GPU were
implemented as stubs in Newlib (mostly returning some reasonable errno value),
and it would be straight-forward to implement more the same way.

I believe static constructors work (libgfortran uses some), but exception
handling does not. I'm not sure what other exotica C++ might need?

As for exceptions, set-jump-long-jump is not implemented because there was no
call for it and I didn't know how to handle the GCN register files properly.
Not only are they variable-sized, they're also potentially very large: ranging
from ~6KB up to ~65KB, I think (102 32-bit scalar, and 256 2048-bit vector
registers, for single-threaded mode, but only 80 scalar and 24 vector registers
in maximum occupancy mode, in which case per-thread stack space is also quite
limited). I'm not sure now the other exception implementations work.

[Bug target/101484] [12 Regression] trunk 20210717 ftbfs for amdgcn-amdhsa (gcn offload)

2021-07-17 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101484

Andrew Stubbs  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2021-07-17

--- Comment #1 from Andrew Stubbs  ---
A new warning has been added that falsely identifies any access to a hardcoded
constant address as bogus. This has affected a few targets, including GCN
libgomp. See pr101374.

There's some discussion what to do about it. E.g.
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574880.html

[Bug target/97827] bootstrap error building the amdgcn-amdhsa offload compiler with LLVM 11

2021-07-02 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97827

Andrew Stubbs  changed:

   What|Removed |Added

 CC||xw111luoye at gmail dot com

--- Comment #17 from Andrew Stubbs  ---
*** Bug 95023 has been marked as a duplicate of this bug. ***

[Bug target/95023] Offloading AMD GCN wiki cannot be followed

2021-07-02 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95023

Andrew Stubbs  changed:

   What|Removed |Added

 CC||ams at gcc dot gnu.org
 Resolution|--- |DUPLICATE
 Status|UNCONFIRMED |RESOLVED

--- Comment #3 from Andrew Stubbs  ---
The second problem in this bug is reported in 97827.

*** This bug has been marked as a duplicate of bug 97827 ***

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-14 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

Andrew Stubbs  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #17 from Andrew Stubbs  ---
This issue should be fixed now.

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-06 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

--- Comment #13 from Andrew Stubbs  ---
I found a lot more ICEs when testing my patch. They look to be unrelated
(TImode come back to haunt us), but it makes it hard to be sure.

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-05 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

--- Comment #9 from Andrew Stubbs  ---
I found a couple of other places to put force_operand and the full case works
now.

Running more tests

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-05 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

Andrew Stubbs  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2021-05-05
 Status|UNCONFIRMED |NEW

--- Comment #6 from Andrew Stubbs  ---
Using force_operand does fix Tobias's reduced testcase. I'll test it further
and let you know.

[Bug target/100418] [12 Regression][gcn] since r12-397 bootstrap fails: error: unrecognizable insn: in extract_insn, at recog.c:2770

2021-05-05 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100418

--- Comment #4 from Andrew Stubbs  ---
Alexandre's patch has this:

emit_move_insn (rem, plus_constant (ptr_mode, rem, -blksize));

Is that generally a valid thing to do? It seems like other places do similar
things...

[Bug target/100208] amdgcn fails to build with llvm-mc from llvm12

2021-04-22 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100208

--- Comment #1 from Andrew Stubbs  ---
LLVM changed the default parameters, so we either have to change the
expectations in the ".amdgcn_target" string (which is basically an assert), or
set the attributes be want explicitly on the assembler command line.

(Or port binutils to amdgcn, but there's no plan for that.)

[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394

2020-10-23 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521

--- Comment #22 from Andrew Stubbs  ---
(In reply to Andrew Stubbs from comment #21)
> (In reply to Richard Biener from comment #19)
> > GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> > Andrew - are the bits in the mask dense?  Thus for a V4SImode compare
> > would the mask occupy only the lowest 4 bits of the DImode mask?
> 
> Yes, that's correct.

Or rather, I should say that that *will* be the case when I add partial vector
support; right now it can only be done via masking V64SImode.

A have a patch set, but the last problem is that while_ult doesn't operate on
partial integer masks, leading to wrong code. AArch64 doesn't have a problem
with this because it uses VBI masks of the right size. I have a patch that adds
the vector size as an operand to while_ult; this seems to fix the problems on
GCN, but I need to make corresponding changes for AArch64 also before I can
submit those patches, and time is tight.

[Bug target/97521] [11 Regression] wrong code with -mno-sse2 since r11-3394

2020-10-23 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97521

--- Comment #21 from Andrew Stubbs  ---
(In reply to Richard Biener from comment #19)
> GCN also uses MODE_INT for the mask mode and thus may be similarly affected.
> Andrew - are the bits in the mask dense?  Thus for a V4SImode compare
> would the mask occupy only the lowest 4 bits of the DImode mask?

Yes, that's correct.

[Bug tree-optimization/84958] int loads not eliminated against larger stores

2020-10-15 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84958

--- Comment #6 from Andrew Stubbs  ---
(In reply to Tom de Vries from comment #5)
> I've removed the xfail for nvptx.
> 
> The only remaining xfail is for gcn.  Is that one still necessary?

The test still fails for gcn.

[Bug libgomp/97332] [gcn] GCN_NUM_GANGS/GCN_NUM_WORKERS override compile-time constants

2020-10-08 Thread ams at gcc dot gnu.org via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97332

Andrew Stubbs  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2020-10-08
 Status|UNCONFIRMED |NEW

--- Comment #1 from Andrew Stubbs  ---
At the point the overrides are applied (run_kernel) the code only knows what
dimensions were selected at runtime, not how those figures were arrived at. It
then prints (with GCN_DEBUG set) the "launch attributes" and "launch actuals".

To fix this the overrides will have to applied much earlier, and independently
for OpenACC (gcn_exec) and OpenMP (parse_target_attributes). That or the
automatic balancing be applied later. Or perhaps the original attributes be
stored for later inspection (but GOMP_kernel_launch_attributes is defined by
libgomp). The "attributes" and "actuals" will need to be overhauled. Probably
get_group_size can be removed.

It ought to be doable though.

72 matches

Mail list logo