[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #47 from Linus Torvalds  ---
(In reply to Richard Biener from comment #45)
> For user code
> 
> volatile long long x;
> void foo () { x++; }
> 
> emitting inc + adc with memory operands is only "incorrect" in re-ordering
> the subword reads with the subword writes, the reads and writes still happen
> architecturally ...

But the thing is, the ordering *is* very much defined for volatile accesses.
"volatile" is not a "the access happens architecturally", it's very much
defined "the access is _visible_ architecturally, and ordering matters".

So with the "volatile long long x" code, I think any language lawyer will say
that generating it as

add $1,mem
adc $0,mem+4

is unquestionably a compiler bug.

It may be what the user *wants* (and it's obviously what the gcov code would
like), but it's simply not a valid volatile access to 'x'.

So the gcov code would really want something slightly weaker than 'volatile'.
Something that just does 'guaranteed access' and disallows combining stores or
doing re-loads, without the ordering constraints.

Side note: we would use such a "weaker volatile" in the kernel too. We already
have that concept in the form of READ_ONCE() and WRITE_ONCE(), and it uses
"volatile" internally, and it works fine for us. But if we had another way to
just describe "guaranteed access", that could be useful.

I suspect the memory ordering primitives would be a better model than
'volatile' for this.  What are the rules for doing it as load/store with
'memory_order_relaxed'? That should at least guarantee that the load is never
re-done (getting two different values for anybody who does a load), but maybe
the stores can be combined?

And gcc should already have all that infrastructure in place. Hmm?
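
[Editor's note: a minimal sketch, with illustrative names, of what such a memory_order_relaxed counter update could look like in C11. This is not GCC's actual gcov implementation.]

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical "guaranteed access" counter: a relaxed atomic load and
   store.  Unlike volatile, this imposes no ordering on surrounding
   accesses, but it still forbids re-reading the location (so nobody
   can observe two different values from one logical load) and forbids
   inventing extra stores.  It does NOT make the read-modify-write
   atomic: concurrent increments can still be lost, which is exactly
   the data race gcov already tolerates.  */
static _Atomic uint64_t counter;

void counter_inc (void)
{
  uint64_t v = atomic_load_explicit (&counter, memory_order_relaxed);
  atomic_store_explicit (&counter, v + 1, memory_order_relaxed);
}

uint64_t counter_read (void)
{
  return atomic_load_explicit (&counter, memory_order_relaxed);
}
```

Caveat: on 32-bit x86 even a relaxed 64-bit atomic load or store is typically not free (the compiler has to keep the access single-copy atomic), so this would sidestep volatile's ordering constraints but not the cost of 64-bit atomicity.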

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #46 from rguenther at suse dot de  ---
On Mon, 30 Jan 2023, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552
> 
> --- Comment #44 from Jakub Jelinek  ---
> I guess we should try and see.
> For volatile,
> --- gcc/coverage.cc 2023-01-02 09:32:37.078072992 +0100
> +++ gcc/coverage.cc 2023-01-30 09:24:45.219951352 +0100
> @@ -774,6 +774,7 @@ build_var (tree fn_decl, tree type, int
>TREE_STATIC (var) = 1;
>TREE_ADDRESSABLE (var) = 1;
>DECL_NONALIASED (var) = 1;
> +  TREE_THIS_VOLATILE (var) = 1;
>SET_DECL_ALIGN (var, TYPE_ALIGN (type));
> 
>return var;
> 
> would do it I think (but it should be conditional on new -fupdate-profile
> modes, single-volatile and prefer-atomic-volatile or something similar).
> Or perhaps insert asm volatile ("" : "+g" (tmp)); in between the load and store
> and see how that compares to the volatile vars? Or adding another flag on the
> gcov vars next to DECL_NONALIASED and just avoid specific optimizations on it
> that somebody runs into (not as reliable but could be faster) - for now
> hoisting in LIM and sinking.

We could put an __attribute__(("semi atomic")) on them ... it all somewhat
feels like a hack.  We could make half of the update volatile only,
like only make the store volatile, not the read?
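
[Editor's note: a user-code sketch of the "store-only volatile" idea, with hypothetical names; gcov's real counters are compiler-generated. The read stays an ordinary load that may be cached or combined, while the store is forced out through a volatile lvalue.]

```c
#include <stdint.h>

static uint64_t counter;

void counter_inc (void)
{
  /* Plain load: free to be hoisted, cached in a register, combined.  */
  uint64_t v = counter;
  /* Volatile store: must happen architecturally and may not be
     dropped, duplicated, or reordered against other volatile
     accesses.  */
  *(volatile uint64_t *) &counter = v + 1;
}

uint64_t counter_read (void)
{
  return counter;
}
```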

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #45 from Richard Biener  ---
(In reply to Linus Torvalds from comment #43)
> (In reply to Richard Biener from comment #42)
> > 
> > I think if we want to avoid doing optimizations on gcov counters we should
> > make them volatile. 
> 
> Honestly, that sounds like the cleanest and safest option to me.
> 
> That said, with the gcov counters apparently also being 64-bit, I suspect it
> will create some truly horrid code generation.
> 
> Presumably you'd end up getting a lot of load-load-add-adc-store-store
> instruction patterns, which is not just six instructions when just two
> should do - it also uses up two registers.
> 
> So while it sounds like the simplest and safest model, maybe it just makes
> code generation too unbearably bad?
> 
> Maybe nobody who uses gcov would care. But I suspect it might be quite the
> big performance regression, to the point where even people who thought they
> don't care will go "that's a bit much".
> 
> I wonder if there is some half-way solution that would allow at least a
> load-add-store-load-adc-store instruction sequence, which would then mean
> (a) one less register wasted and (b) potentially allow some peephole
> optimization turning it into just a addmem-adcmem instruction pair.
> 
> Turning just one of the memops into a volatile access might be enough
> (e.g. just the load, but not the store?)

It might be possible to introduce something like a __volatile_inc () which
implements a somewhat relaxed "volatile".

For user code

volatile long long x;
void foo () { x++; }

emitting inc + adc with memory operands is only "incorrect" in re-ordering
the subword reads with the subword writes, the reads and writes still happen
architecturally ...

That said, the coverage code could make this re-ordering explicit for
32bit with some conditional code (add-with-overflow) that eventually
combines back nicely even with volatile ...
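
[Editor's note: an illustrative sketch of that idea, not the actual coverage code. The 64-bit counter is kept as two explicit 32-bit words, with __builtin_add_overflow providing the carry, so each half is a single word-sized memory access that a 32-bit target can handle directly.]

```c
#include <stdint.h>

/* Split 64-bit counter: two word-sized halves, explicit carry.  */
struct counter64 { uint32_t lo, hi; };

void counter64_inc (struct counter64 *c)
{
  uint32_t lo;
  /* __builtin_add_overflow returns true if the add wrapped.  */
  if (__builtin_add_overflow (c->lo, 1u, &lo))
    c->hi += 1;   /* carry into the high word */
  c->lo = lo;
}
```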

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #44 from Jakub Jelinek  ---
I guess we should try and see.
For volatile,
--- gcc/coverage.cc 2023-01-02 09:32:37.078072992 +0100
+++ gcc/coverage.cc 2023-01-30 09:24:45.219951352 +0100
@@ -774,6 +774,7 @@ build_var (tree fn_decl, tree type, int
   TREE_STATIC (var) = 1;
   TREE_ADDRESSABLE (var) = 1;
   DECL_NONALIASED (var) = 1;
+  TREE_THIS_VOLATILE (var) = 1;
   SET_DECL_ALIGN (var, TYPE_ALIGN (type));

   return var;

would do it I think (but it should be conditional on new -fupdate-profile
modes, single-volatile and prefer-atomic-volatile or something similar).
Or perhaps insert asm volatile ("" : "+g" (tmp)); in between the load and store
and see how that compares to the volatile vars? Or adding another flag on the
gcov vars next to DECL_NONALIASED and just avoid specific optimizations on it
that somebody runs into (not as reliable but could be faster) - for now
hoisting in LIM and sinking.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-30 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #43 from Linus Torvalds  ---
(In reply to Richard Biener from comment #42)
> 
> I think if we want to avoid doing optimizations on gcov counters we should
> make them volatile. 

Honestly, that sounds like the cleanest and safest option to me.

That said, with the gcov counters apparently also being 64-bit, I suspect it
will create some truly horrid code generation.

Presumably you'd end up getting a lot of load-load-add-adc-store-store
instruction patterns, which is not just six instructions when just two should
do - it also uses up two registers.

So while it sounds like the simplest and safest model, maybe it just makes code
generation too unbearably bad?

Maybe nobody who uses gcov would care. But I suspect it might be quite the big
performance regression, to the point where even people who thought they don't
care will go "that's a bit much".

I wonder if there is some half-way solution that would allow at least a
load-add-store-load-adc-store instruction sequence, which would then mean (a)
one less register wasted and (b) potentially allow some peephole optimization
turning it into just a addmem-adcmem instruction pair.

Turning just one of the memops into a volatile access might be enough (e.g.
just the load, but not the store?)

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-29 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #42 from Richard Biener  ---
(In reply to Jakub Jelinek from comment #40)
> (In reply to Jan Hubicka from comment #39)
> > I was wondering if we should not provide a flag to turn all counts
> > volatile.   That way we will still have race conditions on their updates
> > (and it would be cheaper than atomic) but we won't run into such wrong
> > code issues nor large profile mismatches.
> 
> Yes, see above.  Or a mode in which we would just avoid hoisting and sinking
> the gcov vars but keep them non-volatile.  Or both.
> But I guess it would be nice to get Vlad's patch into trunk and release
> branches for now (perhaps with an extra check for startswith "__gcov" on
> DECL_NAME, so that we don't do it for the Fortran tokens).
> 
> As for the patch, just small nits: I think get_base_address always returns
> non-NULL, so it could be
>   if (tree expr = MEM_EXPR (res))
> {
>   expr = get_base_address (expr);
>   if (VAR_P (expr)
>   && DECL_NONALIASED (expr)
>   && DECL_NAME (expr))
> {
>   const char *name = IDENTIFIER_POINTER (DECL_NAME (expr));
>   /* Don't reread coverage counters from memory, if single
>  update model is used in threaded code, other threads
>  could change the counters concurrently.  See PR108552.  */
>   if (startswith (name, "__gcov"))
> return x;
> }
> }

Note that this isn't exactly reliable but a heuristic workaround since
MEM_EXPRs are optional and dropping them is valid (and done in some places).

I think if we want to avoid doing optimizations on gcov counters we should
make them volatile.  I suppose kernel folks would have a way to assess
any "catastrophic consequences" on optimization?  (I have a hard time
imagining them, sure that RMW will not allow add with memory operand,
but that's it?)

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-29 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #41 from Richard Biener  ---
(In reply to Linus Torvalds from comment #31)
> (In reply to Richard Biener from comment #26)
> > 
> > Now, in principle we should have applied store-motion and not only PRE which
> > would have avoided the issue, not tricking the RA into reloading the value
> > from where we store it in the loop, but the kernel uses -fno-tree-loop-im,
> > preventing that.  If you enable that you'd get
> 
> Note that we use -fno-tree-loop-im only for the GCOV case, and because of
> another problem with code generation with gcov. See
> 
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702
> 
> and the fix for the excessive stack use was to disable that compiler option.
> See
> 
>  
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=c87bf431448b404a6ef5fbabd74c0e3e42157a7f
> 
> for the kernel commit message.

Yes, I remember.  So another option would be to add -fno-tree-pre to that
mix which should avoid hoisting the load out of the loop.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-29 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #40 from Jakub Jelinek  ---
(In reply to Jan Hubicka from comment #39)
> I was wondering if we should not provide a flag to turn all counts
> volatile.   That way we will still have race conditions on their updates
> (and it would be cheaper than atomic) but we won't run into such wrong
> code issues nor large profile mismatches.

Yes, see above.  Or a mode in which we would just avoid hoisting and sinking
the gcov vars but keep them non-volatile.  Or both.
But I guess it would be nice to get Vlad's patch into trunk and release
branches for now (perhaps with an extra check for startswith "__gcov" on
DECL_NAME, so that we don't do it for the Fortran tokens).

As for the patch, just small nits: I think get_base_address always returns
non-NULL, so it could be
  if (tree expr = MEM_EXPR (res))
{
  expr = get_base_address (expr);
  if (VAR_P (expr)
  && DECL_NONALIASED (expr)
  && DECL_NAME (expr))
{
  const char *name = IDENTIFIER_POINTER (DECL_NAME (expr));
  /* Don't reread coverage counters from memory, if single
 update model is used in threaded code, other threads
 could change the counters concurrently.  See PR108552.  */
  if (startswith (name, "__gcov"))
return x;
}
}

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-28 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #39 from Jan Hubicka  ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552
> 
> --- Comment #35 from Vladimir Makarov  ---
> (In reply to Jakub Jelinek from comment #34)
> > Seems right now DECL_NONALIASED is only used on these coverage vars and on
> > Fortran caf tokens, so perhaps a quick workaround would be on the LRA side
> > never reread stuff from MEMs with VAR_P && DECL_NONALIASED MEM_EXPRs.  CCing
> > Vlad on that.
> 
> The following patch can do this:

Note that with threads we often get large profile mismatches when the
load/stores are hoisted out of the loop.  I.e. 

for (...)
  gcov_count++;

becomes

i = gcov_count;
for (...)
  i++;
gcov_count = i;

If the second loop is run in parallel, a lot of increments may be lost.

I was wondering if we should not provide a flag to turn all counts
volatile.   That way we will still have race conditions on their updates
(and it would be cheaper than atomic) but we won't run into such wrong
code issues nor large profile mismatches.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-28 Thread feng.tang at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #38 from Tang, Feng  ---
Created attachment 54368
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54368&action=edit
objdump of  prep_compound_page() with patch in comment 35

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-28 Thread feng.tang at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #37 from Tang, Feng  ---
Created attachment 54367
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54367&action=edit
page_alloc.i with patch in comment 35

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-28 Thread feng.tang at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #36 from Tang, Feng  ---
(In reply to Vladimir Makarov from comment #35)
> (In reply to Jakub Jelinek from comment #34)
> > Seems right now DECL_NONALIASED is only used on these coverage vars and on
> > Fortran caf tokens, so perhaps a quick workaround would be on the LRA side
> > never reread stuff from MEMs with VAR_P && DECL_NONALIASED MEM_EXPRs.  CCing
> > Vlad on that.
> 
> The following patch can do this:
> 
> diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc

Thanks for the patch!

As the bug is against 11.3, I cloned the gcc git repository, checked out the
origin/releases/gcc-11 branch, and built gcc (TBH, it's my first time).

* built gcc-11, compiled the i386 kernel, and ran my local reproducer (QEMU
loop-booting that kernel); the error reproduced at a rate of about once every
20 boots.

* manually applied Vladimir's patch (the original patch seems to be against
the 'master' branch)

* rebuilt gcc, ran make clean, and re-compiled the i386 kernel; the error was
NOT seen in 350 runs so far.

I will also attach the page_alloc.i and the objdump of prep_compound_page()
built with the newly patched gcc-11.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread vmakarov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #35 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #34)
> Seems right now DECL_NONALIASED is only used on these coverage vars and on
> Fortran caf tokens, so perhaps a quick workaround would be on the LRA side
> never reread stuff from MEMs with VAR_P && DECL_NONALIASED MEM_EXPRs.  CCing
> Vlad on that.

The following patch can do this:

diff --git a/gcc/lra-constraints.cc b/gcc/lra-constraints.cc
index 7bffbc07ee2..d80a6a9f41d 100644   
--- a/gcc/lra-constraints.cc
+++ b/gcc/lra-constraints.cc
@@ -515,6 +515,7 @@ get_equiv (rtx x)   
 {  
   int regno;   
   rtx res; 
+  tree expr;   

   if (! REG_P (x) || (regno = REGNO (x)) < FIRST_PSEUDO_REGISTER   
   || ! ira_reg_equiv[regno].defined_p  
@@ -525,6 +526,10 @@ get_equiv (rtx x)  
 {  
   if (targetm.cannot_substitute_mem_equiv_p (res)) 
return x;   
+  if ((expr = MEM_EXPR (res)) != NULL  
+ && (expr = get_base_address (expr)) != NULL   
+ && VAR_P (expr) && DECL_NONALIASED (expr))
+   return x;   
   return res;  
 }  
   if ((res = ira_reg_equiv[regno].constant) != NULL_RTX)

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

Jakub Jelinek  changed:

   What|Removed |Added

 CC||vmakarov at gcc dot gnu.org

--- Comment #34 from Jakub Jelinek  ---
Seems right now DECL_NONALIASED is only used on these coverage vars and on
Fortran caf tokens, so perhaps a quick workaround would be on the LRA side
never reread stuff from MEMs with VAR_P && DECL_NONALIASED MEM_EXPRs.  CCing
Vlad on that.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #33 from Jakub Jelinek  ---
It is the default unless -pthread is specified:
 %{fprofile-arcs|fprofile-generate*|coverage:\
   %{!fprofile-update=single:\
 %{pthread:-fprofile-update=prefer-atomic}}}
So, when one uses -fprofile-arcs, -fprofile-generate* or -coverage together
with -pthread and doesn't use -fprofile-update=single, then
-fprofile-update=prefer-atomic is added (which is -fprofile-update=atomic if
the architecture supports it).

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #32 from Linus Torvalds  ---
Btw, where does the -fprofile-update=single/atomic come from?

The kernel just uses 

  CFLAGS_GCOV:= -fprofile-arcs -ftest-coverage

for this case. So I guess 'single' is just the default value?

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #31 from Linus Torvalds  ---
(In reply to Richard Biener from comment #26)
> 
> Now, in principle we should have applied store-motion and not only PRE which
> would have avoided the issue, not tricking the RA into reloading the value
> from where we store it in the loop, but the kernel uses -fno-tree-loop-im,
> preventing that.  If you enable that you'd get

Note that we use -fno-tree-loop-im only for the GCOV case, and because of
another problem with code generation with gcov. See

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69702

and the fix for the excessive stack use was to disable that compiler option.
See

 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c87bf431448b404a6ef5fbabd74c0e3e42157a7f

for the kernel commit message.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread torvalds--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #30 from Linus Torvalds  ---
(In reply to Richard Biener from comment #26)
> And yes, to IV optimization the gcov counter for the loop body is just
> another IV candidate that can be used, and in this case it allows to elide
> the otherwise
> unused original IV.

Ouch.

So we really don't mind the data race - the gcov data is obviously not primary
- but I don't think anybody expected the data race on the gcov data that isn't
"semantically visible" to then affect actual semantics.

And yeah, atomic updates would be too expensive even on 64-bit architectures,
so we pretty much *depend* on the data race being there. And on 32-bit
architectures (at least i386), atomic 64-bit ones go from "expensive" to
"ludicrously complicated" (ie to get a 64-bit atomic update you'd need to start
doing cmpxchg8b loops or something).

So I think the data race is not just what we expected, it's fundamental. Just
the "mix it with semantics" ends up being less than optimal. 

Having the gcov data be treated as 'volatile' would be one option, but probably
cause horrendous code generation issues as Jakub says.

Although I have several times hit the case of "I want to just update a
volatile in memory, I wish gcc would just be happy to combine a
read-modify-write into a single instruction". So in a perfect world, that
would be fixed too.

I guess from a kernel perspective, we might need to really document that GCOV
has these issues, and you can't use it for any real work. We have just been
lucky this hasn't hit us (admittedly because it's fairly odd that an expected
end gcov value would end up being used in that secondary way as a loop
variable).
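
[Editor's note: for reference, the "cmpxchg8b loop" mentioned above is essentially what an atomic 64-bit increment lowers to on i386. A portable sketch using GCC's __atomic builtins, with a hypothetical helper name:]

```c
#include <stdint.h>

/* Atomic 64-bit increment on a target that only has a double-word
   compare-and-swap (cmpxchg8b on i386): read the old value, try to
   install old+1, and retry if another thread got in between.  */
void counter64_atomic_inc (uint64_t *p)
{
  uint64_t old = __atomic_load_n (p, __ATOMIC_RELAXED);
  /* On failure, 'old' is refreshed with the current value, so the
     next attempt uses old+1 recomputed from it.  */
  while (!__atomic_compare_exchange_n (p, &old, old + 1,
                                       /*weak=*/1,
                                       __ATOMIC_RELAXED,
                                       __ATOMIC_RELAXED))
    ;
}
```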

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #29 from Jakub Jelinek  ---
(In reply to Jakub Jelinek from comment #27)
> Well, we could in -fprofile-update=single (or perhaps in a new single-like
> mode) mark the gcov artificial vars volatile or with some flag that would at
> least cause reload not to reread values from memory.  The profiling would be
> still racy, but at the expense of somewhat slower code (with volatile more,
> with special flag less so) slightly less so (as it would e.g. prevent the
> compiler from avoiding the rereads).

Though, e.g. volatile would then prevent say on x86 using inc directly on the
memory
location, which is used in the -fprofile-update=atomic mode (with additional
lock).

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #28 from Richard Biener  ---
(In reply to Jakub Jelinek from comment #27)
> (In reply to Richard Biener from comment #25)
> > Ah, reading more comments, no - it probably doesn't.  Jakub correctly says
> > that there seems to be a data race necessary to trigger this, so it doesn't
> > seem to be a GCC issue?
> 
> Well, we could in -fprofile-update=single (or perhaps in a new single-like
> mode) mark the gcov artificial vars volatile or with some flag that would at
> least cause reload not to reread values from memory.  The profiling would be
> still racy, but at the expense of somewhat slower code (with volatile more,
> with special flag less so) slightly less so (as it would e.g. prevent the
> compiler from avoiding the rereads).

-fprofile-update=volatile?  Huh, sure, we could do that.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #27 from Jakub Jelinek  ---
(In reply to Richard Biener from comment #25)
> Ah, reading more comments, no - it probably doesn't.  Jakub correctly says
> that there seems to be a data race necessary to trigger this, so it doesn't
> seem to be a GCC issue?

Well, we could in -fprofile-update=single (or perhaps in a new single-like
mode) mark the gcov artificial vars volatile or with some flag that would at
least cause reload not to reread values from memory.  The profiling would be
still racy, but at the expense of somewhat slower code (with volatile more,
with special flag less so) slightly less so (as it would e.g. prevent the
compiler from avoiding the rereads).

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #26 from Richard Biener  ---
And yes, to IV optimization the gcov counter for the loop body is just another
IV candidate that can be used, and in this case it allows to elide the
otherwise
unused original IV.

Now, in principle we should have applied store-motion and not only PRE which
would have avoided the issue, not tricking the RA into reloading the value
from where we store it in the loop, but the kernel uses -fno-tree-loop-im,
preventing that.  If you enable that you'd get

   [local count: 105119324]:
  __gcov0.prep_compound_page_I_lsm.1755_4 = __gcov0.prep_compound_page[7];
  _92 = (long unsigned int) page_12(D);
  _57 = _92 + 1;
  _119 = page_12(D) + 40;
  ivtmp.1762_136 = (unsigned int) _119;

   [local count: 955630225]:
  # i_66 = PHI 
  # ivtmp.1762_6 = PHI 
  p_15 = (struct page *) ivtmp.1762_6;
  MEM  [(union  *)p_15 + 12B] = 1024B;
  MEM[(volatile long unsigned int *)p_15 + 4B] ={v} _57;
  i_17 = i_66 + 1;
  ivtmp.1762_46 = ivtmp.1762_6 + 40;
  if (nr_pages_11 != i_17)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 105119324]:
  _73 = (unsigned int) nr_pages_11;
  _163 = _73 + 4294967294;
  _159 = (long long int) _163;
  _1 = __gcov0.prep_compound_page_I_lsm.1755_4 + 1;
  PROF_edge_counter_74 = _1 + _159;
  __gcov0.prep_compound_page[7] = PROF_edge_counter_74;

which is the desired optimization, handling the counter in the loop like
an induction variable instead of going through memory.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #25 from Richard Biener  ---
(In reply to Richard Biener from comment #24)
> Does
> 
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index 0dd47910f97..f780c0ce08c 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -2241,7 +2241,7 @@ may_be_nonaddressable_p (tree expr)
>  {
>  case VAR_DECL:
>/* Check if it's a register variable.  */
> -  return DECL_HARD_REGISTER (expr);
> +  return DECL_HARD_REGISTER (expr) || DECL_NONALIASED (expr);
>  
>  case TARGET_MEM_REF:
>/* TARGET_MEM_REFs are translated directly to valid MEMs on the
> 
> fix it?

Ah, reading more comments, no - it probably doesn't.  Jakub correctly says
that there seems to be a data race necessary to trigger this, so it doesn't
seem to be a GCC issue?

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #24 from Richard Biener  ---
Does

diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 0dd47910f97..f780c0ce08c 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -2241,7 +2241,7 @@ may_be_nonaddressable_p (tree expr)
 {
 case VAR_DECL:
   /* Check if it's a register variable.  */
-  return DECL_HARD_REGISTER (expr);
+  return DECL_HARD_REGISTER (expr) || DECL_NONALIASED (expr);

 case TARGET_MEM_REF:
   /* TARGET_MEM_REFs are translated directly to valid MEMs on the

fix it?

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #23 from Jakub Jelinek  ---
We could mark the __gcov* artificial vars with some flag (unless they are
already) and try to avoid using IVs loaded from those in IVOPTs, but as can be
seen above, the chosen IV really isn't that memory but an SSA_NAME that is
initialized with something loaded from that and in other cases it could be even
not that simple (say multiple copies of the same loop in sequence with a load
from __gcov* only at the beginning and then the loops just using a PRE IV
temporary for all the stores).  I bet the RA does it for similar reasons: the
var isn't volatile and is updated many times without any atomic barriers in
between, so if some other thread modifies it in between, it would be a data
race.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #22 from Uroš Bizjak  ---
BTW: It is the reload pass that duplicates the read from
__gcov0.prep_compound_page[7].

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

Jakub Jelinek  changed:

   What|Removed |Added

 CC||hubicka at gcc dot gnu.org,
   ||jakub at gcc dot gnu.org

--- Comment #21 from Jakub Jelinek  ---
I'd say using the (default unless -pthread is used) -fprofile-update=single is
wrong for the kernel: it can't work correctly in the multi-threaded case,
which is the kernel's case.
In -fprofile-update=single mode (as opposed to -fprofile-update=atomic) the
updates to the counters aren't atomic and the arrays aren't marked volatile or
anything similar; it is really meant for single-threaded coverage.

Anyway, before ivopts we have:
  pretmp_93 = __gcov0.prep_compound_page[7];

   [local count: 955630225]:
  # i_66 = PHI 
  # prephitmp_92 = PHI 
  i.144_1 = (unsigned int) i_66;
  _2 = i.144_1 * 40;
  p_15 = page_12(D) + _2;
  p_15->D.13727.D.13672.mapping = 1024B;
  MEM[(volatile long unsigned int *)p_15 + 4B] ={v} _159;
  i_17 = i_66 + 1;
  PROF_edge_counter_46 = prephitmp_92 + 1;
  __gcov0.prep_compound_page[7] = PROF_edge_counter_46;
  if (nr_pages_11 > i_17)
goto ; [89.00%]
  else
goto ; [11.00%]

   [local count: 850510901]:
  goto ; [100.00%]
which, given the non-volatile, non-atomically updated arrays, is to be
expected: instead of re-reading __gcov0.prep_compound_page[7] in every
iteration it just reads it once and stores in each iteration, which is
possible because another thread changing it concurrently would mean a data
race anyway.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #20 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #19)
> __gcov0.prep_compound_page. But as shown in Comment #15, we have two

Comment #16, actually.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

--- Comment #19 from Uroš Bizjak  ---
Some further analysis:

  pretmp_94 = __gcov0.prep_compound_page[7];  <--
  _179 = pretmp_94 + 1;   <--
  ivtmp.1725_211 = (unsigned long long) _179;

  _135 = (unsigned int) nr_pages_11;
  _134 = _135 + 4294967294; <--
  _132 = (unsigned long long) _134; <--
  _89 = (unsigned long long) pretmp_94; <--
  _76 = _89 + 2;   <-
  _19 = _76 + _132;<-


And the induction variable update is:

  # ivtmp.1725_77 = PHI 
  ...
  ivtmp.1725_69 = ivtmp.1725_77 + 1;

And the loop exit condition is:

  if (_19 != ivtmp.1725_69)


So, both ivtmp and _19 are calculated from the value at
__gcov0.prep_compound_page. But as shown in Comment #15, we have two separate
reads from that location; the compiler assumes that the value there is
invariant, which is probably not the case.

[Bug tree-optimization/108552] Linux i386 kernel 5.14 memory corruption for pre_compound_page() when gcov is enabled

2023-01-27 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108552

Uroš Bizjak  changed:

   What|Removed |Added

  Component|target  |tree-optimization
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed|2023-01-26 00:00:00 |2023-01-27
 CC||rguenth at gcc dot gnu.org

--- Comment #18 from Uroš Bizjak  ---
Confirmed, the trail goes into the tree optimization area.