[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-11-29 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #28 from Richard Biener  ---
Fixed.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-11-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #27 from GCC Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:b09b879e4e9cc24a5d2b0344c1930020c218a104

commit r14-5964-gb09b879e4e9cc24a5d2b0344c1930020c218a104
Author: Richard Biener 
Date:   Wed Jun 21 09:31:30 2023 +0200

middle-end/110237 - wrong MEM_ATTRs for partial loads/stores

The following addresses a miscompilation by RTL scheduling related
to the representation of masked stores.  For that we have

(insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ]
[90])
(const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
)
(const_int -4 [0xfffc] [1 MEM
 [(int *)vectp_b.12_28]+0 S64 A32])
(vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
(mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
(const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
)
(const_int -4 [0xfffc] [1 MEM
 [(int *)vectp_b.12_28]+0 S64 A32])

and specifically the memory attributes

  [1 MEM  [(int *)vectp_b.12_28]+0 S64 A32]

are problematic.  They tell us the instruction stores and reads a full
vector which it if course does not.  There isn't any good MEM_EXPR
we can use here (we lack a way to just specify a pointer and restrict
info for example), and since the MEMs have a vector mode it's
difficult in general as passes do not need to look at the memory
attributes at all.

The easiest way to avoid running into the alias analysis problem is
to scrap the MEM_EXPR when we expand the internal functions for
partial loads/stores.  That avoids the disambiguation we run into
which is realizing that we store to an object of less size as
the size of the mode we appear to store.

After the patch we see just

  [1  S64 A32]

so we preserve the alias set, the alignment and the size (the size
is redundant if the MEM insn't BLKmode).  That's still not good
in case the RTL alias oracle would implement the same
disambiguation but it fends off the gimple one.

This fixes gcc.dg/torture/pr58955-2.c when built with AVX512
and --param=vect-partial-vector-usage=1.

PR middle-end/110237
* internal-fn.cc (expand_partial_load_optab_fn): Clear
MEM_EXPR and MEM_OFFSET.
(expand_partial_store_optab_fn): Likewise.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-11-27 Thread rdapp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

Robin Dapp  changed:

   What|Removed |Added

 CC||rdapp at gcc dot gnu.org
 Target|x86_64-*-*  |x86_64-*-* riscv*-*-*

--- Comment #26 from Robin Dapp  ---
We are seeing the same problem on riscv and Richi's RFC patch helps (our
patterns are already unspec).

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #25 from CVS Commits  ---
The releases/gcc-13 branch has been updated by hongtao Liu
:

https://gcc.gnu.org/g:8b059560146a93e5174262fef25e8b1aa39bb879

commit r13-7500-g8b059560146a93e5174262fef25e8b1aa39bb879
Author: liuhongt 
Date:   Mon Jun 26 21:07:09 2023 +0800

Refine maskstore patterns with UNSPEC_MASKMOV.

Similar like r14-2070-gc79476da46728e

If mem_addr points to a memory region with less than whole vector size
bytes of accessible memory and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
it to be transformed to any other whole memory access instructions.

gcc/ChangeLog:

PR rtl-optimization/110237
* config/i386/sse.md (_store_mask): Refine with
UNSPEC_MASKMOV.
(maskstore_store_mask): New define_insn, it's renamed
from original _store_mask.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #24 from CVS Commits  ---
The releases/gcc-12 branch has been updated by hongtao Liu
:

https://gcc.gnu.org/g:a435939ba7e5e489a422071014f943c1a577bfe6

commit r12-9743-ga435939ba7e5e489a422071014f943c1a577bfe6
Author: liuhongt 
Date:   Mon Jun 26 21:07:09 2023 +0800

Refine maskstore patterns with UNSPEC_MASKMOV.

Similar like r14-2070-gc79476da46728e

If mem_addr points to a memory region with less than whole vector size
bytes of accessible memory and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
it to be transformed to any other whole memory access instructions.

gcc/ChangeLog:

PR rtl-optimization/110237
* config/i386/sse.md (_store_mask): Refine with
UNSPEC_MASKMOV.
(maskstore_store_mask): New define_insn, it's renamed
from original _store_mask.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #23 from CVS Commits  ---
The releases/gcc-11 branch has been updated by hongtao Liu
:

https://gcc.gnu.org/g:ad1a3da97f7c4a85778f91b235a8d936bb1c829b

commit r11-10884-gad1a3da97f7c4a85778f91b235a8d936bb1c829b
Author: liuhongt 
Date:   Mon Jun 26 21:07:09 2023 +0800

Refine maskstore patterns with UNSPEC_MASKMOV.

Similar like r14-2070-gc79476da46728e

If mem_addr points to a memory region with less than whole vector size
bytes of accessible memory and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
it to be transformed to any other whole memory access instructions.

gcc/ChangeLog:

PR rtl-optimization/110237
* config/i386/sse.md (_store_mask): Refine with
UNSPEC_MASKMOV.
(maskstore_store_mask): New define_insn, it's renamed
from original _store_mask.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-27 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #22 from rguenther at suse dot de  ---
On Tue, 27 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #21 from Alexander Monakov  ---
> (In reply to rguent...@suse.de from comment #19)
> > But the size argument doesn't have anything to do with TBAA (and
> > may_alias is about TBAA).  I don't think we have any way to circumvent
> > C object access rules.  That is, for example, with -fno-strict-aliasing
> > the following isn't going to work.
> > 
> > int a;
> > int b;
> > 
> > int main()
> > {
> >   a = 1;
> >   b = 2;
> >   if (&a + 1 == &b) // equality compare of unrelated pointers OK
> > {
> >   long x = *(long *)&a; // access outside of 'a' not OK
> >   if (x != 0x00010002)
> > abort ();
> > }
> > }
> > 
> > there's no command-line flag or attribute to form a pointer
> > to an object composing 'a' and 'b' besides changing how the
> > storage is declared.
> 
> But store-merging and SLP can introduce a wide long-sized access where on
> source level you had two adjacent loads or even memcpy's, so we really seem to
> have a problem here and might need to be able to annotate types or individual
> accesses as "may-alias-with-oob-ok" in the IR: PR 110431.

But above 'a' and 'b' are not adjacent, they are only verified to be
at runtime.  The only thing we do IIRC is use wider loads to access
properly aligned storage as we know the load wouldn't trap.  That
can lead us to the case you pointed out originally - we load stuff
we will ignore but might cause alias disambiguation to disambiguate
against a store of the original non-widened size.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #21 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #19)
> But the size argument doesn't have anything to do with TBAA (and
> may_alias is about TBAA).  I don't think we have any way to circumvent
> C object access rules.  That is, for example, with -fno-strict-aliasing
> the following isn't going to work.
> 
> int a;
> int b;
> 
> int main()
> {
>   a = 1;
>   b = 2;
>   if (&a + 1 == &b) // equality compare of unrelated pointers OK
> {
>   long x = *(long *)&a; // access outside of 'a' not OK
>   if (x != 0x00010002)
> abort ();
> }
> }
> 
> there's no command-line flag or attribute to form a pointer
> to an object composing 'a' and 'b' besides changing how the
> storage is declared.

But store-merging and SLP can introduce a wide long-sized access where on
source level you had two adjacent loads or even memcpy's, so we really seem to
have a problem here and might need to be able to annotate types or individual
accesses as "may-alias-with-oob-ok" in the IR: PR 110431.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-27 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #20 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:dbf8ab449417aa24669f6ccf50be8c17f8c1278e

commit r14-2116-gdbf8ab449417aa24669f6ccf50be8c17f8c1278e
Author: liuhongt 
Date:   Mon Jun 26 21:07:09 2023 +0800

Refine maskstore patterns with UNSPEC_MASKMOV.

Similar like r14-2070-gc79476da46728e

If mem_addr points to a memory region with less than whole vector size
bytes of accessible memory and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
it to be transformed to any other whole memory access instructions.

gcc/ChangeLog:

PR rtl-optimization/110237
* config/i386/sse.md (_store_mask): Refine with
UNSPEC_MASKMOV.
(maskstore_store_mask): New define_insn, it's renamed
from original _store_mask.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #19 from rguenther at suse dot de  ---
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #18 from Alexander Monakov  ---
> (In reply to rguent...@suse.de from comment #17)
> > Yes, we do the same to loads.  I hope that's not a common technique
> > though but I have to admit the vectorizer itself assesses whether it's
> > safe to access "gaps" by looking at alignment so its code generation
> > is prone to this same "mistake".
> > 
> > Now, is "alignment to 16 is ensured externally" good enough here?
> > If we consider
> > 
> > static int a[2];
> > 
> > and code doing
> > 
> >  if (is_aligned (a))
> >{
> >  __v4si v = (__attribute__((may_alias)) __v4si *) &a;
> >}
> > 
> > then we cannot even use a DECL_ALIGN that's insufficient for decls
> > that bind locally.
> 
> I agree. I went with the 'extern' example because there it should be more
> obvious the construction ought to work.
> 
> 
> > Note we have similar arguments with aggregate type sizes (and TBAA)
> > where when we infer a dynamic type from one access we check if
> > the other access would fit.  Wouldn't the above then extend to that
> > as well given we could also do aggregate copies of "padding" and
> > ignore the bits if we'd have ensured the larger access wouldn't trap?
> 
> I think a read via a may_alias type just tells you that N bytes are accessible
> for reading, not necessarily for writing. So I don't see a problem, but maybe 
> I
> didn't quite catch what you are saying.

I wasn't sure how to phrase, what I was saying is we have this
"the access is too large for the object in consideration, so it cannot
alias it" in places where we just work with types within the TBAA
framework.  So I wondered if one can construct a similar case to
support that we should not do this.  (tree-ssa-alias.cc:
aliasing_component_refs_p)

> 
> > So supporting the above might be a bit of a stretch (though I think
> > we have to fix the vectorizer here).
> 
> What would the solution be? Using a may_alias type for such accesses?

But the size argument doesn't have anything to do with TBAA (and
may_alias is about TBAA).  I don't think we have any way to circumvent
C object access rules.  That is, for example, with -fno-strict-aliasing
the following isn't going to work.

int a;
int b;

int main()
{
  a = 1;
  b = 2;
  if (&a + 1 == &b) // equality compare of unrelated pointers OK
{
  long x = *(long *)&a; // access outside of 'a' not OK
  if (x != 0x00010002)
abort ();
}
}

there's no command-line flag or attribute to form a pointer
to an object composing 'a' and 'b' besides changing how the
storage is declared.

I don't think we should make an exception for "padding" after
an object and I don't see any sensible way how to constrain
the size of the supported "padding" either?  Pad to the
largest possible alignment of the object?  That would be
MAX_OFILE_ALIGNMENT ...

> 
> > > > If the v4si store is masked we cannot do this anymore, but the IL
> > > > we seed the alias oracle with doesn't know the store is partial.
> > > > The only way to "fix" it is to take away all of the information from it.
> > > 
> > > But that won't fix the trapping issue? I think we need a distinct RTX for
> > > memory accesses where hardware does fault suppression for masked-out 
> > > elements.
> > 
> > Yes, it doesn't fix that part.  The idea of using BLKmode instead of
> > a vector mode for the MEMs would, I guess, together with specifying
> > MEM_SIZE as not known.
> 
> Unfortunate if that works for the trapping side, but not for the 
> aliasing side.

It should work for both I think, but MEM_EXPR would need changing
as well - we do have a perfectly working representation there, it
would just be the first CALL_EXPR in such context ...

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #18 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #17)
> Yes, we do the same to loads.  I hope that's not a common technique
> though but I have to admit the vectorizer itself assesses whether it's
> safe to access "gaps" by looking at alignment so its code generation
> is prone to this same "mistake".
> 
> Now, is "alignment to 16 is ensured externally" good enough here?
> If we consider
> 
> static int a[2];
> 
> and code doing
> 
>  if (is_aligned (a))
>{
>  __v4si v = (__attribute__((may_alias)) __v4si *) &a;
>}
> 
> then we cannot even use a DECL_ALIGN that's insufficient for decls
> that bind locally.

I agree. I went with the 'extern' example because there it should be more
obvious the construction ought to work.


> Note we have similar arguments with aggregate type sizes (and TBAA)
> where when we infer a dynamic type from one access we check if
> the other access would fit.  Wouldn't the above then extend to that
> as well given we could also do aggregate copies of "padding" and
> ignore the bits if we'd have ensured the larger access wouldn't trap?

I think a read via a may_alias type just tells you that N bytes are accessible
for reading, not necessarily for writing. So I don't see a problem, but maybe I
didn't quite catch what you are saying.


> So supporting the above might be a bit of a stretch (though I think
> we have to fix the vectorizer here).

What would the solution be? Using a may_alias type for such accesses?


> > > If the v4si store is masked we cannot do this anymore, but the IL
> > > we seed the alias oracle with doesn't know the store is partial.
> > > The only way to "fix" it is to take away all of the information from it.
> > 
> > But that won't fix the trapping issue? I think we need a distinct RTX for
> > memory accesses where hardware does fault suppression for masked-out 
> > elements.
> 
> Yes, it doesn't fix that part.  The idea of using BLKmode instead of
> a vector mode for the MEMs would, I guess, together with specifying
> MEM_SIZE as not known.

Unfortunate if that works for the trapping side, but not for the aliasing side.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #17 from rguenther at suse dot de  ---
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #16 from Alexander Monakov  ---
> (In reply to rguent...@suse.de from comment #14)
> > vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is
> > 
> > int a[2];
> > 
> > int foo(int *p)
> > {
> >   a[0] = 1;
> >   *(v4si *)p = {0,0,0,0};
> >   return a[0];
> > }
> > 
> > because the V4SI vector store is too large for the a[] object.  That
> > doesn't even use TBAA (it works with -fno-strict-aliasing just fine).
> 
> Thank you for the example. If we do the same for vector loads, that's a 
> footgun
> for users who use vector loads to access small objects:
> 
> // alignment to 16 is ensured externally
> extern int a[2];
> 
> int foo()
> {
>   a[0] = 1;
> 
>   __v4si v = (__attribute__((may_alias)) __v4si *) &a;
>   // mask out extra elements in v and continue
>  ...
> }
> 
> This is a benign data race on data that follows 'a' in the address space, but
> otherwise should be a valid and useful technique.

Yes, we do the same to loads.  I hope that's not a common technique
though but I have to admit the vectorizer itself assesses whether it's
safe to access "gaps" by looking at alignment so its code generation
is prone to this same "mistake".

Now, is "alignment to 16 is ensured externally" good enough here?
If we consider

static int a[2];

and code doing

 if (is_aligned (a))
   {
 __v4si v = (__attribute__((may_alias)) __v4si *) &a;
   }

then we cannot even use a DECL_ALIGN that's insufficient for decls
that bind locally.

Note we have similar arguments with aggregate type sizes (and TBAA)
where when we infer a dynamic type from one access we check if
the other access would fit.  Wouldn't the above then extend to that
as well given we could also do aggregate copies of "padding" and
ignore the bits if we'd have ensured the larger access wouldn't trap?

So supporting the above might be a bit of a stretch (though I think
we have to fix the vectorizer here).

> > If the v4si store is masked we cannot do this anymore, but the IL
> > we seed the alias oracle with doesn't know the store is partial.
> > The only way to "fix" it is to take away all of the information from it.
> 
> But that won't fix the trapping issue? I think we need a distinct RTX for
> memory accesses where hardware does fault suppression for masked-out elements.

Yes, it doesn't fix that part.  The idea of using BLKmode instead of
a vector mode for the MEMs would, I guess, together with specifying
MEM_SIZE as not known.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #16 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #14)
> vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is
> 
> int a[2];
> 
> int foo(int *p)
> {
>   a[0] = 1;
>   *(v4si *)p = {0,0,0,0};
>   return a[0];
> }
> 
> because the V4SI vector store is too large for the a[] object.  That
> doesn't even use TBAA (it works with -fno-strict-aliasing just fine).

Thank you for the example. If we do the same for vector loads, that's a footgun
for users who use vector loads to access small objects:

// alignment to 16 is ensured externally
extern int a[2];

int foo()
{
  a[0] = 1;

  __v4si v = (__attribute__((may_alias)) __v4si *) &a;
  // mask out extra elements in v and continue
 ...
}

This is a benign data race on data that follows 'a' in the address space, but
otherwise should be a valid and useful technique.

> If the v4si store is masked we cannot do this anymore, but the IL
> we seed the alias oracle with doesn't know the store is partial.
> The only way to "fix" it is to take away all of the information from it.

But that won't fix the trapping issue? I think we need a distinct RTX for
memory accesses where hardware does fault suppression for masked-out elements.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #15 from Hongtao.liu  ---
(In reply to rguent...@suse.de from comment #10)
> On Sun, 25 Jun 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> > 
> > --- Comment #9 from Hongtao.liu  ---
> > 
> > > So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
> > > problematic part of alias analysis.  Together with UNSPEC this might be
> > > enough to fix things.
> > > 
> > Note maskstore won't optimized to vpblendd since it doesn't support memory
> > dest, so I guess no need to use UNSPEC for maskstore?
> 
> A maskstore now looks like
> 
> (insn 31 30 32 5 (set (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
> (reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
> (vec_merge:V8DF (reg:V8DF 115 [ vect__9.14 ])
> (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
> (reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
> (reg:QI 100 [ loop_mask_57 ]))) "t.c":6:14 1957 
> {avx512f_storev8df_mask}
> 
> that appears as a full read and a full store which means earlier
> stores to the masked part could be elided rightfully by RTL DSE
> at least.  If there's any RTL analysis about whether a conditional
> load could trap then for example a full V8DF load from
> the same address that's currently conditional but after the above
> could be analyzed as safe to be scheduled speculatively and
> unconditional while it would not be safe as it could trap.
> 

In my understanding, use whole size for trap analysis is suboptimal but safe,
if whole size access is safe, mask_load/store must be safe. But it could be
suboptimal when whole size access can trap but mask_load/store won't, but we
should accept that suboptimal since mask is not always known.

I didn't find such rule in rtx_addr_can_trap_p, but only known_subrange_p.

  /* If the pointer based access is bigger than the variable they cannot
 alias.  This is similar to the check below where we use TBAA to
 increase the size of the pointer based access based on the dynamic
 type of a containing object we can infer from it.  */
  poly_int64 dsize2;
  if (known_size_p (size1)
  && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
  && known_lt (dsize2, size1))
return false;

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #14 from rguenther at suse dot de  ---
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #13 from Alexander Monakov  ---
> (In reply to rguent...@suse.de from comment #12)
> > As explained in comment#3 the issue is related to the tree alias oracle
> > part that gets invoked on the MEM_EXPR for the load where there is
> > no information that the load could be partial so it gets disambiguated
> > against a decl that's off less size than the full vector.
> 
> With my example I'm trying to say that types in the IR are wrong if we
> disambiguate like that. People writing C need to attach may_alias to vector
> types for plain load/stores to validly overlap with scalar accesses, and when
> vectorizer introduces vector accesses it needs to do something like that, or
> else intermixed scalar accesses may be incorrectly disambiguated against new
> vector ones.

vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is

int a[2];

int foo(int *p)
{
  a[0] = 1;
  *(v4si *)p = {0,0,0,0};
  return a[0];
}

because the V4SI vector store is too large for the a[] object.  That
doesn't even use TBAA (it works with -fno-strict-aliasing just fine).

If the v4si store is masked we cannot do this anymore, but the IL
we seed the alias oracle with doesn't know the store is partial.
The only way to "fix" it is to take away all of the information from it.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #13 from Alexander Monakov  ---
(In reply to rguent...@suse.de from comment #12)
> As explained in comment#3 the issue is related to the tree alias oracle
> part that gets invoked on the MEM_EXPR for the load where there is
> no information that the load could be partial so it gets disambiguated
> against a decl that's off less size than the full vector.

With my example I'm trying to say that types in the IR are wrong if we
disambiguate like that. People writing C need to attach may_alias to vector
types for plain load/stores to validly overlap with scalar accesses, and when
vectorizer introduces vector accesses it needs to do something like that, or
else intermixed scalar accesses may be incorrectly disambiguated against new
vector ones.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #12 from rguenther at suse dot de  ---
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #11 from Alexander Monakov  ---
> The trapping angle seems valid, but I have a really hard time understanding 
> the
> DSE issue, and the preceding issue about disambiguation based on RTL aliasing.
> 
> How would DSE optimize out 'd[5] = 1' in your example when the mask_store 
> reads
> it? Isn't that a data dependency?

Ah, yes - I completely missed that.

> How is the initial issue different from
> 
> int f(__m128i *p, __m128i v, int *q)
> {
>   *q = 0;
>   *p = v;
>   return *q;
> }
> 
> that we cannot optimize to 'return 0' due to __attribute__((may_alias))
> attached to __m128i?

As explained in comment#3 the issue is related to the tree alias oracle
part that gets invoked on the MEM_EXPR for the load where there is
no information that the load could be partial so it gets disambiguated
against a decl that's off less size than the full vector.

I've proposed a patch for that issue at hand (clear MEM_EXPR for
all partial load/store MEMs), but didn't yet get an approval.  The
question of what RTL analysis can derive from the RTL is also still
open (but it might be we're "safe" because it just doesn't do any
analysis/transform that would be mislead)

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #11 from Alexander Monakov  ---
The trapping angle seems valid, but I have a really hard time understanding the
DSE issue, and the preceding issue about disambiguation based on RTL aliasing.

How would DSE optimize out 'd[5] = 1' in your example when the mask_store reads
it? Isn't that a data dependency?

How is the initial issue different from

int f(__m128i *p, __m128i v, int *q)
{
  *q = 0;
  *p = v;
  return *q;
}

that we cannot optimize to 'return 0' due to __attribute__((may_alias))
attached to __m128i?

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-26 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #10 from rguenther at suse dot de  ---
On Sun, 25 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #9 from Hongtao.liu  ---
> 
> > So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
> > problematic part of alias analysis.  Together with UNSPEC this might be
> > enough to fix things.
> > 
> Note maskstore won't optimized to vpblendd since it doesn't support memory
> dest, so I guess no need to use UNSPEC for maskstore?

A maskstore now looks like

(insn 31 30 32 5 (set (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
(reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
(vec_merge:V8DF (reg:V8DF 115 [ vect__9.14 ])
(mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
(reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
(reg:QI 100 [ loop_mask_57 ]))) "t.c":6:14 1957 
{avx512f_storev8df_mask}

that appears as a full read and a full store which means earlier
stores to the masked part could be elided rightfully by RTL DSE
at least.  If there's any RTL analysis about whether a conditional
load could trap then for example a full V8DF load from
the same address that's currently conditional but after the above
could be analyzed as safe to be scheduled speculatively and
unconditional while it would not be safe as it could trap.

I think the DSE issue is real and it should be easy to create
a testcase like

void foo (double *d, int mask)
{
  d[5] = 1.;
  _intrinsic_for_mask_store (d, some-val, mask);
}

call that with lane 5 masked and check for d[5] = 1. preserved.
I think RTL DSE will remove that store.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-24 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #9 from Hongtao.liu  ---

> So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
> problematic part of alias analysis.  Together with UNSPEC this might be
> enough to fix things.
> 
Note maskstore won't optimized to vpblendd since it doesn't support memory
dest, so I guess no need to use UNSPEC for maskstore?

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #8 from Richard Biener  ---
(In reply to Hongtao.liu from comment #7)
> > So it looks like a generic problem and better to be handled in
> > expand_partial_{load, store}_optab_fn?
> 
> There're many other places with assumption MEM_SIZE is equal to MODE_SIZE
> even !MEM_SIZE_KNOWN_P, .i.e. ao_ref_base will rewrite size to MODE_SIZE.

Yes, that's because MEM_EXPR isn't really correct ... I guess we could
work around that in ao_ref_from_mem but expand_partial_store_optab_fn
is wrong in setting that.

There's no "partial memory" MEM, and AFAIK the memory attributes are
only additional info and ignoring them is valid, so indeed a pass
could at least interpret inputs and outputs in mode size even when
UNSPECs are involved.  But it must not(?) interpret them as must uses
or kills?

I also think that MEM_SIZE is really only relevant for BLKmode MEMs,
MEM_OFFSET is only relevant for interpreting MEM_EXPR.

Clearing MEM_EXPR and MEM_SIZE results in an ICE:

#0  fancy_abort (file=0x31ce278
"/space/rguenther/src/gcc11queue/gcc/rtlanal.cc", line=469, 
function=0x31d0560 , poly_int<1u, long>, machine_mode, bool)::__FUNCTION__>
"rtx_addr_can_trap_p_1") at
/space/rguenther/src/gcc11queue/gcc/diagnostic.cc:2232
#1  0x0158b62b in rtx_addr_can_trap_p_1 (x=0x76d51798, offset=...,
size=..., mode=E_V16SImode, unaligned_mems=false)
at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:467
#2  0x01591a2d in may_trap_p_1 (x=0x76d51780, flags=0) at
/space/rguenther/src/gcc11queue/gcc/rtlanal.cc:3160
#3  0x01591f64 in may_trap_p (x=0x76d51780) at
/space/rguenther/src/gcc11queue/gcc/rtlanal.cc:3283
#4  0x015f32e6 in simplify_context::simplify_ternary_operation
(this=0x7fffcd28, code=VEC_MERGE, mode=E_V16SImode, 
op0_mode=E_V16SImode, op0=0x76e21d90, op1=0x76d51780,
op2=0x76d51108)
at /space/rguenther/src/gcc11queue/gcc/simplify-rtx.cc:7040
#5  0x00f590c7 in simplify_ternary_operation (code=VEC_MERGE,
mode=E_V16SImode, op0_mode=E_V16SImode, op0=0x76e21d90, 
op1=0x76d51780, op2=0x76d51108) at
/space/rguenther/src/gcc11queue/gcc/rtl.h:3498

(gdb) p debug_rtx (op1)
(mem:V16SI (plus:DI (reg/f:DI 113)
(reg:DI 90 [ _22 ])) [1  A32])

where I have only preserved MEM_ALIAS_SET and MEM_ALIGN.  It insists
on knowing the MEMs size if it's not BLKmode or VOIDmode.  Indeed
from the mode we can derived its size.

So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
problematic part of alias analysis.  Together with UNSPEC this might be
enough to fix things.

I'm going to test such a patch and seek for feedback on the mailing list.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #7 from Hongtao.liu  ---

> So it looks like a generic problem and better to be handled in
> expand_partial_{load, store}_optab_fn?

There're many other places with assumption MEM_SIZE is equal to MODE_SIZE even
!MEM_SIZE_KNOWN_P, .i.e. ao_ref_base will rewrite size to MODE_SIZE.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #6 from Hongtao.liu  ---
(In reply to rguent...@suse.de from comment #5)
> On Tue, 20 Jun 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> > 
> > --- Comment #4 from Hongtao.liu  ---
> > (In reply to Richard Biener from comment #3)
> > > This looks like the same issue as PR110309.  We have
> > > 
> > > (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] 
> > > [90])
> > > (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > > )
> > > (const_int -4 [0xfffc] [1 MEM
> > >  [(int *)vectp_b.12_28]+0 S64 A32])
> > > (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> > > (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
> > > (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > > )
> > > (const_int -4 [0xfffc] [1 MEM
> > >  [(int *)vectp_b.12_28]+0 S64 A32])
> > > 
> > > so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> > > based on the vectp_b.12_28 pointer that has full size but the load of b[1]
> > > we try disambiguate against refers to int b[10] which is too small for
> > > a load of 64 bytes so we disambiguate based on that.
> > 
> > 
> >   /* If the pointer based access is bigger than the variable they cannot
> >  alias.  This is similar to the check below where we use TBAA to
> >  increase the size of the pointer based access based on the dynamic
> >  type of a containing object we can infer from it.  */
> >   poly_int64 dsize2;
> >   if (known_size_p (size1) --- should be unknown??
> >   && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
> >   && known_lt (dsize2, size1))
> > return false;
> > 
> > Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
> > It seems to me maxsize should be 64bytes, but real size should be unknown.
> 
> Yes, you shouldn't have MEM_ATTRs that indicate the size is known.

So it looks like a generic problem and better to be handled in
expand_partial_{load, store}_optab_fn?

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-20 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #5 from rguenther at suse dot de  ---
On Tue, 20 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #4 from Hongtao.liu  ---
> (In reply to Richard Biener from comment #3)
> > This looks like the same issue as PR110309.  We have
> > 
> > (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] 
> > [90])
> > (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > )
> > (const_int -4 [0xfffc] [1 MEM
> >  [(int *)vectp_b.12_28]+0 S64 A32])
> > (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> > (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
> > (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > )
> > (const_int -4 [0xfffc] [1 MEM
> >  [(int *)vectp_b.12_28]+0 S64 A32])
> > 
> > so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> > based on the vectp_b.12_28 pointer that has full size but the load of b[1]
> > we try disambiguate against refers to int b[10] which is too small for
> > a load of 64 bytes so we disambiguate based on that.
> 
> 
>   /* If the pointer based access is bigger than the variable they cannot
>  alias.  This is similar to the check below where we use TBAA to
>  increase the size of the pointer based access based on the dynamic
>  type of a containing object we can infer from it.  */
>   poly_int64 dsize2;
>   if (known_size_p (size1) --- should be unknown??
>   && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
>   && known_lt (dsize2, size1))
> return false;
> 
> Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
> It seems to me maxsize should be 64bytes, but real size should be unknown.

Yes, you shouldn't have MEM_ATTRs that indicate the size is known.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #4 from Hongtao.liu  ---
(In reply to Richard Biener from comment #3)
> This looks like the same issue as PR110309.  We have
> 
> (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> )
> (const_int -4 [0xfffc] [1 MEM
>  [(int *)vectp_b.12_28]+0 S64 A32])
> (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
> (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> )
> (const_int -4 [0xfffc] [1 MEM
>  [(int *)vectp_b.12_28]+0 S64 A32])
> 
> so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> based on the vectp_b.12_28 pointer that has full size but the load of b[1]
> we try disambiguate against refers to int b[10] which is too small for
> a load of 64 bytes so we disambiguate based on that.


  /* If the pointer based access is bigger than the variable they cannot
 alias.  This is similar to the check below where we use TBAA to
 increase the size of the pointer based access based on the dynamic
 type of a containing object we can infer from it.  */
  poly_int64 dsize2;
  if (known_size_p (size1) --- should be unknown??
  && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
  && known_lt (dsize2, size1))
return false;

Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
It seems to me maxsize should be 64bytes, but real size should be unknown.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-20 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

Richard Biener  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=110309
 Ever confirmed|0   |1
   Last reconfirmed||2023-06-20

--- Comment #3 from Richard Biener  ---
This looks like the same issue as PR110309.  We have

(insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
(const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  )
(const_int -4 [0xfffc] [1 MEM
 [(int *)vectp_b.12_28]+0 S64 A32])
(vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
(mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
(const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
)
(const_int -4 [0xfffc] [1 MEM
 [(int *)vectp_b.12_28]+0 S64 A32])

so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
based on the vectp_b.12_28 pointer that has full size but the load of b[1]
we try disambiguate against refers to int b[10] which is too small for
a load of 64 bytes so we disambiguate based on that.

So quite likely a duplicate.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

--- Comment #2 from Richard Biener  ---
(In reply to Alexander Monakov from comment #1)
> Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc
> tackling a different issue?

Yes, the issue in this bug persists after the above fix.

[Bug rtl-optimization/110237] gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload

2023-06-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc
tackling a different issue?