pr58955-2.c is miscompiled by RTL scheduling after reload

rguenther at suse dot de via Gcc-bugs Mon, 26 Jun 2023 04:14:33 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #17 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #16 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguent...@suse.de from comment #14)
> > vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is
> > 
> > int a[2];
> > 
> > int foo(int *p)
> > {
> >   a[0] = 1;
> >   *(v4si *)p = (v4si){0,0,0,0};
> >   return a[0];
> > }
> > 
> > because the V4SI vector store is too large for the a[] object.  That
> > doesn't even use TBAA (it works with -fno-strict-aliasing just fine).
> 
> Thank you for the example. If we do the same for vector loads, that's a
> footgun for users who use vector loads to access small objects:
> 
> // alignment to 16 is ensured externally
> extern int a[2];
> 
> int foo()
> {
>   a[0] = 1;
> 
>   __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
>   // mask out extra elements in v and continue
>  ...
> }
> 
> This is a benign data race on data that follows 'a' in the address space, but
> otherwise should be a valid and useful technique.

Yes, we do the same for loads.  I hope that's not a common technique,
though I have to admit the vectorizer itself assesses whether it's
safe to access "gaps" by looking at alignment, so its code generation
is prone to this same "mistake".
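
To illustrate the alignment argument (this is not the vectorizer's actual
code): a naturally aligned power-of-two access whose size divides the page
size cannot cross a page boundary, so it cannot fault if any byte of it is
mapped.  A minimal sketch, where ASSUMED_PAGE_SIZE, same_page and
is_aligned_to are hypothetical names and 4096 is only an assumed page size:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages.  Real code would ask the target or the OS. */
#define ASSUMED_PAGE_SIZE ((uintptr_t) 4096)

/* True if addr is a multiple of align; align must be a power of two. */
static bool is_aligned_to(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* True if the byte range [addr, addr + size) lies within a single page. */
static bool same_page(uintptr_t addr, uintptr_t size)
{
    assert(size != 0);
    return addr / ASSUMED_PAGE_SIZE == (addr + size - 1) / ASSUMED_PAGE_SIZE;
}
```

A 16-byte access at a 16-byte-aligned address always satisfies same_page,
while the same access at an address 8 bytes before a page boundary does
not.  This only speaks to trapping; it says nothing about what the alias
oracle may assume about the size of the underlying object.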

Now, is "alignment to 16 is ensured externally" good enough here?
If we consider

static int a[2];

and code doing

 if (is_aligned (a))
   {
     __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
   }

then we cannot even use an insufficient DECL_ALIGN to rule out the
access, not even for decls that bind locally.
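
For concreteness, a sketch of the load-and-mask technique under discussion,
using GCC vector extensions.  The names v4si_alias, is_aligned and
sum_first_two are hypothetical, and the typedef spelling is an assumption
about how one would write the may_alias vector type.  To stay well defined
the sketch must be called on storage that really is 16 bytes long; the
contested case in this thread is the same load from an object declared as
int[2].

```c
#include <stdint.h>

/* 16-byte vector of four ints.  */
typedef int v4si __attribute__((vector_size(16)));
/* Same layout, but accesses through it may alias any type.  */
typedef int v4si_alias __attribute__((vector_size(16), may_alias));

/* Hypothetical helper: nonzero if p is 16-byte aligned.  */
static int is_aligned(const void *p)
{
    return ((uintptr_t) p & 15) == 0;
}

/* Sum the first two ints at p with one 16-byte load, masking out
   lanes 2 and 3.  The caller must guarantee 16-byte alignment.  */
static int sum_first_two(const int *p)
{
    v4si v = *(const v4si_alias *) p;
    const v4si mask = { -1, -1, 0, 0 };
    v &= mask;                      /* discard the extra elements */
    return v[0] + v[1] + v[2] + v[3];
}
```

Called on a 16-byte-aligned buffer of four ints, sum_first_two ignores
whatever the last two elements hold.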

Note we have similar arguments with aggregate type sizes (and TBAA):
when we infer a dynamic type from one access, we check whether the
other access would fit.  Wouldn't the above then extend to that as
well, given we could also do aggregate copies of "padding" and ignore
the bits, provided we had ensured the larger access wouldn't trap?

So supporting the above might be a bit of a stretch (though I think
we have to fix the vectorizer here).

> > If the v4si store is masked we cannot do this anymore, but the IL
> > we seed the alias oracle with doesn't know the store is partial.
> > The only way to "fix" it is to take away all of the information from it.
> 
> But that won't fix the trapping issue? I think we need a distinct RTX for
> memory accesses where hardware does fault suppression for masked-out elements.

Right, it doesn't fix that part.  Using BLKmode instead of a vector
mode for the MEMs would, I guess, together with specifying MEM_SIZE
as not known.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload
