Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation

2018-11-22 Thread Andi Kleen
On Thu, Nov 22, 2018 at 02:53:11PM +0100, Richard Biener wrote:
> > the optimizer moving it around over function calls etc.?
> > The instrumentation should still be useful when the program
> > crashes, so we don't want to delay logging too much.
> 
> You can't avoid moving it, yes.  But it can be even moved now, effectively
> detaching it from the "interesting" $pc ranges we have debug location lists 
> for?
> 
> > > Maybe even instead pass it a number of bytes so it models how atomics 
> > > work.
> >
> > How could that reject float?
> 
> Why does it need to reject floats?  Note you are already rejecting floats
> in the instrumentation pass.

The backend doesn't know how to generate the code for ptwrite %xmm0.
So it would either need to learn about all these conversions, or reject
them in the hook so that the middle end can eventually generate
them. Or maybe hard-code the knowledge in the middle end?

> > Mode seems better for now.
> >
> > Eventually might support float/double through memory, but not in the
> > first version.
> 
> Why does movq %xmm0, %rax; ptwrite %rax not work?

It works, of course; the question is just who is in charge
of figuring out that the movq needs to be generated
(and that it's a bit cast, not a rounding conversion).

> Fair enough, the instrumentation would need to pad out smaller values
> and/or split larger values.  I think 1 and 2 byte values would be interesting
> so you can support char and shorts.  Eventually 16byte values for __int128
> or vectors.

Right. And for copies/clears too. But I was hoping to just get
the basics solid and then do these more advanced features later.

> > > btw, I wonder whether the vartrace attribute should have an argument like
> > > vartrace(locals,reads,...)?
> >
> > I was hoping this could be delayed until actually needed. It'll need
> > some changes because I don't want to do the full parsing for every decl
> > all the time, so would need to store a bitmap of options somewhere
> > in tree.
> 
> Hmm, indeed.  I wonder if you need the function attribute at all then?  Maybe

I really would like an opt-in per function. I think that's an important
use case. Just instrument a few functions that contribute to the bug
you're debugging to cut down the bandwidth.

The idea was that __attribute__((vartrace)) for a function would
log everything in that function, including locals.

I could see a use case for opting in to a function without
locals (mainly because local tracking is more likely to cause
trace overflows if there is a lot of execution). But I think I can do
without that in v1 and might add it later.

> That is, on types and decls you'd interpret it as if locals tracing is on
> then locals of type are traced but otherwise locals of type are not traced?

When the opt-in is on the type or the variable then the variable
should be tracked, even if it is a local (or maybe even a cast).

The locals check is mainly for the command line.

> That is, if I just want to trace variable i and do
> 
>  int i __attribute__((vartrace));
> 
> then what options do I enable to make that work and how can I avoid
> tracing anything else?  Similar for

Just enable -mptwrite (or nothing if it's implied with -mcpu=...) 

> 
>  typedef int traced_int __attribute_((vartrace));

Same.

> 
> ?  I guess we'd need -fvar-trace=locals to get the described effect for
> the type attribute and then -fvar-trace=locals,all to have _all_ locals
> traced?  Or -fvar-trace=locals,only-marked? ... or forgo with the idea
> of marking types?

You want a command line override to not trace if the attribute
is set?

-mno-ptwrite would work. Could also add something separate
in -fvartrace (perhaps implied in =off) but not sure if it's worth it.

I suppose it could make sense if someone wants to use the _ptwrite 
builtin separately still. I'll add it to =off.

> so there's no 'i' anymore.  If you enable debug-info you

Hmm, I guess that's OK for now. I suppose it'll work with -Og.

> and instrumenting late.  (but ptwrite is only
> interesting for optimized code, no?)

It's very interesting for non-optimized code too. In fact
that would be a common use case during debugging.


-Andi



Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation

2018-11-22 Thread Richard Biener
On Tue, Nov 20, 2018 at 7:27 PM Andi Kleen  wrote:
>
> On Tue, Nov 20, 2018 at 01:04:19PM +0100, Richard Biener wrote:
> > Since your builtin clobbers memory
>
> Hmm, maybe we could get rid of that, but then how to avoid
> the optimizer moving it around over function calls etc.?
> The instrumentation should still be useful when the program
> crashes, so we don't want to delay logging too much.

You can't avoid moving it, yes.  But it can be even moved now, effectively
detaching it from the "interesting" $pc ranges we have debug location lists for?

> > Maybe even instead pass it a number of bytes so it models how atomics work.
>
> How could that reject float?

Why does it need to reject floats?  Note you are already rejecting floats
in the instrumentation pass.

> Mode seems better for now.
>
> Eventually might support float/double through memory, but not in the
> first version.

Why does movq %xmm0, %rax; ptwrite %rax not work?

>
> > >NEXT_PASS (pass_tsan_O0);
> > >NEXT_PASS (pass_sanopt);
> > > +  NEXT_PASS (pass_vartrace);
> >
> > I'd move it one lower, after pass_cleanup_eh.  Further enhancement
> > would make it a
> > RTL pass ...
>
> It's after pass_nrv now.
>
> > So in reality the restriction is on the size of the object, correct?
>
> The instruction accepts a 32- or 64-bit memory or register operand.
>
> In principle everything could be logged through this, but I was trying
> to limit the cases to integers and pointers for now to simplify
> the problem.
>
> Right now the backend fails when something other than 4 or 8 bytes
> is passed.

Fair enough, the instrumentation would need to pad out smaller values
and/or split larger values.  I think 1 and 2 byte values would be interesting
so you can support char and shorts.  Eventually 16byte values for __int128
or vectors.

> >
> > > +{
> > > +  if (!supported_type (t))
> >
> > You handle some nested cases below via recursion,
> > like a.b.c (but not a[i][j]).  But then this check will
> > fire.  I think it would be better to restructure the
> > function to look at the outermost level for whether
> > the op is of supported type, thus we can log it
> > at all and then get all the way down to the base via
> > sth like
> >
> >   if (!supported_type (t))
> > return false;
> >   enum attrstate s = ;
> >   do
> > {
> >s = supported_op (t, s);
> >if (s == force_off)
> >  return false;
> > }
> >   while (handled_component_p (t) && (t = TREE_OPERAND (t, 0)))
> >
> > Now t is either an SSA_NAME, a DECL (you fail to handle PARM_DECL
>
> Incoming arguments and returns are handled separately.
>
> > and RESULT_DECL below) or a [TARGET_]MEM_REF.  To get rid
> > of non-pointer indirections do then
> >
> >   t = get_base_address (t);
> >   if (DECL_P (t) && is_local (t))
> >   
> >
> > because...
> >
> > > +return false;
> > > +
> > > +  enum attrstate s = supported_op (t, neutral);
> > > +  if (s == force_off)
> > > +return false;
> > > +  if (s == force_on)
> > > +force = true;
> > > +
> > > +  switch (TREE_CODE (t))
> > > +{
> > > +case VAR_DECL:
> > > +  if (DECL_ARTIFICIAL (t))
> > > +   return false;
> > > +  if (is_local (t))
> > > +   return true;
> > > +  return s == force_on || force;
> > > +
> > > +case ARRAY_REF:
> > > +  t = TREE_OPERAND (t, 0);
> > > +  s = supported_op (t, s);
> > > +  if (s == force_off)
> > > +   return false;
> > > +  return supported_type (TREE_TYPE (t));
> >
> > Your supported_type is said to take a DECL.  And you
> > already asked for this type - it's the type of the original t
> > (well, the type of this type given TREE_TYPE (t) is an array type).
> > But you'll reject a[i][j] where the type of this type is an array type as 
> > well.
>
> Just to be clear, after your changes above I only need
> to handle VAR_DECL and SSA_NAME here then, correct?

Yes (and PARM_DECL and RESULT_DECL).

> So one of the reasons I handled ARRAY_REF this way is to
> trace the index as a local if needed. If I can assume it was
> always in a MEM with its own ASSIGN earlier when the local was
> user visible, that wouldn't be needed (and neither would some
> similar code elsewhere).

If the index lives in memory it has a corresponding load.  For SSA uses
see my comment about instrumenting them at all (together with the
suggestion on how to handle them in an easier way).

>
> But when I look at a simple test case like vartrace-6
>
> void
> f (void)
> {
>   int i;
>   for (i = 0; i < 10; i++)
>f2 ();
> }
>
> i appears to be an SSA name only that is referenced everywhere without
> a MEM. And if the user wants to track the value of i I would need
> to explicitly handle all these cases. Am I missing something here?

You handle those in different places I think.

> I'm starting to think I should perhaps drop locals support to simplify
> everything. But that might limit usability for debugging somewhat.

I think you confuse "locals" a bit.  In GIMPLE an automatic 

Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation

2018-11-20 Thread Andi Kleen
On Tue, Nov 20, 2018 at 01:04:19PM +0100, Richard Biener wrote:
> Since your builtin clobbers memory

Hmm, maybe we could get rid of that, but then how to avoid
the optimizer moving it around over function calls etc.?
The instrumentation should still be useful when the program
crashes, so we don't want to delay logging too much.

> Maybe even instead pass it a number of bytes so it models how atomics work.

How could that reject float?

Mode seems better for now.

Eventually might support float/double through memory, but not in the
first version.


> >NEXT_PASS (pass_tsan_O0);
> >NEXT_PASS (pass_sanopt);
> > +  NEXT_PASS (pass_vartrace);
> 
> I'd move it one lower, after pass_cleanup_eh.  Further enhancement
> would make it a
> RTL pass ...

It's after pass_nrv now.

> So in reality the restriction is on the size of the object, correct?

The instruction accepts a 32- or 64-bit memory or register operand.

In principle everything could be logged through this, but I was trying
to limit the cases to integers and pointers for now to simplify
the problem.

Right now the backend fails when something other than 4 or 8 bytes
is passed.

> 
> > +{
> > +  if (!supported_type (t))
> 
> You handle some nested cases below via recursion,
> like a.b.c (but not a[i][j]).  But then this check will
> fire.  I think it would be better to restructure the
> function to look at the outermost level for whether
> the op is of supported type, thus we can log it
> at all and then get all the way down to the base via
> sth like
> 
>   if (!supported_type (t))
> return false;
>   enum attrstate s = ;
>   do
> {
>s = supported_op (t, s);
>if (s == force_off)
>  return false;
> }
>   while (handled_component_p (t) && (t = TREE_OPERAND (t, 0)))
> 
> Now t is either an SSA_NAME, a DECL (you fail to handle PARM_DECL

Incoming arguments and returns are handled separately.

> and RESULT_DECL below) or a [TARGET_]MEM_REF.  To get rid
> of non-pointer indirections do then
> 
>   t = get_base_address (t);
>   if (DECL_P (t) && is_local (t))
>   
> 
> because...
> 
> > +return false;
> > +
> > +  enum attrstate s = supported_op (t, neutral);
> > +  if (s == force_off)
> > +return false;
> > +  if (s == force_on)
> > +force = true;
> > +
> > +  switch (TREE_CODE (t))
> > +{
> > +case VAR_DECL:
> > +  if (DECL_ARTIFICIAL (t))
> > +   return false;
> > +  if (is_local (t))
> > +   return true;
> > +  return s == force_on || force;
> > +
> > +case ARRAY_REF:
> > +  t = TREE_OPERAND (t, 0);
> > +  s = supported_op (t, s);
> > +  if (s == force_off)
> > +   return false;
> > +  return supported_type (TREE_TYPE (t));
> 
> Your supported_type is said to take a DECL.  And you
> already asked for this type - it's the type of the original t
> (well, the type of this type given TREE_TYPE (t) is an array type).
> But you'll reject a[i][j] where the type of this type is an array type as 
> well.

Just to be clear, after your changes above I only need
to handle VAR_DECL and SSA_NAME here then, correct?

So one of the reasons I handled ARRAY_REF this way is to
trace the index as a local if needed. If I can assume it was
always in a MEM with its own ASSIGN earlier when the local was
user visible, that wouldn't be needed (and neither would some
similar code elsewhere).

But when I look at a simple test case like vartrace-6 

void
f (void)
{
  int i;
  for (i = 0; i < 10; i++)
    f2 ();
}

i appears to be an SSA name only that is referenced everywhere without
a MEM. And if the user wants to track the value of i I would need
to explicitly handle all these cases. Am I missing something here?

I'm starting to think I should perhaps drop locals support to simplify
everything. But that might limit usability for debugging somewhat.


> gsi_insert_* does update_stmt already.  Btw, if you allow any
> SImode or DImode size value you can use a VIEW_CONVERT_EXPR

Just add them unconditionally? 

> > +bool
> > +instrument_args (function *fun)
> > +{
> > +  gimple_stmt_iterator gi;
> > +  bool changed = false;
> > +
> > +  /* Local tracing usually takes care of the argument too, when
> > + they are read. This avoids redundant trace instructions.  */
> 
> But only when instrumenting reads?

Yes, I will add the check.

> 
> Hmm, but then this requires the target instruction to have a memory operand?

Yes, that's right for now. Eventually it will be fixed and x86 would
benefit too.

> That's going to be unlikely for RISCy cases?  On x86 does it work if
> combine later does not synthesize a ptwrite with memory operand?
> I also wonder how this survives RTL CSE since you are basically doing
> 
>   mem = val;  // orig stmt
>   val' = mem;
>   ptwrite (val');
> 
> that probably means when CSE removes the load there ends up a debug-insn
> reflecting what you want?

I'll check.

> > +  /* Handle operators in case they read locals.  */
> 
> Does it make sense at all to instrument SSA 

Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation

2018-11-20 Thread Richard Biener
On Fri, Nov 16, 2018 at 4:57 AM Andi Kleen  wrote:
>
> From: Andi Kleen 
>
> Add a new pass to automatically instrument changes to variables
> with the new PTWRITE instruction on x86. PTWRITE writes a 4 or 8 byte
> field into a Processor Trace log, which allows low-overhead
> logging of information. Essentially it's a hardware-accelerated
> printf.
>
> This allows reconstructing values later from the log,
> which can be useful for debugging or other analysis of the program's
> behavior. With the compiler support this can be done without
> having to manually add instrumentation to the code.
>
> Using DWARF information this can later be mapped back to the variables.
> The decoder decodes the PTWRITE instructions using IP information
> in the log, and then looks up the argument in the debug information.
> Then this can be used to reconstruct the original variable
> name to display a value history for the variable.
>
> There are new options to enable instrumentation for different types,
> and also a new attribute to control instrumentation at a fine-grained,
> per-function or per-variable level. The attributes can be set at both
> the variable and the type level, and also on structure fields.
> This allows enabling tracing only for specific code in large
> programs in a flexible manner.
>
> The pass is generic, but only the x86 backend enables the necessary
> hooks. When the backend enables the necessary hooks (with -mptwrite)
> there is an additional pass that looks through the code for
> attribute vartrace enabled functions or variables.
>
> The -fvartrace=locals option is experimental: it works, but it
> generates redundant ptwrites because the pass doesn't use
> the SSA information to minimize instrumentation.
> This could be optimized later.
>
> Currently the code can be tested with SDE, or on an Intel
> Gemini Lake system with a new enough Linux kernel (v4.10+)
> that supports PTWRITE for PT. Gemini Lake is used in low
> end laptops ("Intel Pentium Silver J5.. / Celeron N4... /
> Celeron J4...")
>
> Linux perf can be used to record the values
>
> perf record -e intel_pt/ptw=1,branch=0/ program
> perf script --itrace=crw -F +synth ...
>
> I have an experimental version of perf that can also use
> dwarf information to symbolize many[1] values back to their variable
> names. So far it is not in standard perf, but available at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=perf/var-resolve-4
>
> It is currently not able to decode all variable locations to names,
> but a large subset.
>
> Longer term hopefully gdb will support this information too.
>
> The CPU can potentially generate very high data bandwidths when
> code doing a lot of computation is heavily instrumented.
> This can cause some data loss in both the CPU and also in perf
> logging the data when the disk cannot keep up.
>
> Running some larger workloads, most do not cause
> CPU-level overflows, but I've seen it with -fvartrace
> with crafty, and with more workloads with -fvartrace-locals.
>
> The recommendation is not to fully instrument programs,
> but only areas of interest either at the file level or using
> the attributes.
>
> The other thing is that perf and the disk often cannot keep up
> with the data bandwidth for longer computations. In this case
> it's possible to use perf snapshot mode (add --snapshot
> to the command line above). The data will then only be logged to
> a memory ring buffer, and the buffers are only dumped on events
> of interest by sending SIGUSR2 to the perf binary.
>
> In the future this will hopefully be better supported with
> core files and gdb.
>
> Passes bootstrap and test suite on x86_64-linux, also
> bootstrapped and tested gcc itself with full -fvartrace
> and -fvartrace=locals instrumentation.

In the cover mail you mentioned you didn't get rid of SSA update.
That is because your instrumentation does not set the calls'
virtual operands.  Since your builtin clobbers memory and you
instrument non-memory ops that's only possible if you'd track
the active virtual operand during the walk over the function.  I
suppose using SSA update is OK for now.

More comments inline

> gcc/:
>
> 2018-11-15  Andi Kleen  
>
> * Makefile.in: Add tree-vartrace.o.
> * common.opt: Add -fvartrace
> * opts.c (parse_vartrace_options): Add.
> (common_handle_option): Call parse_vartrace_options.
> * config/i386/i386.c (ix86_vartrace_func): Add.
> (TARGET_VARTRACE_FUNC): Add.
> * doc/extend.texi: Document vartrace/no_vartrace
> attributes.
> * doc/invoke.texi: Document -fvartrace.
> * doc/tm.texi (TARGET_VARTRACE_FUNC): Add.
> * passes.def: Add vartrace pass.
> * target.def (vartrace_func): Add.
> * tree-pass.h (make_pass_vartrace): Add.
> * tree-vartrace.c: New file to implement vartrace pass.
>
> gcc/c-family/:
>
> 2018-11-15  Andi Kleen  
>
> * c-attribs.c (handle_vartrace_attribute,
>

[PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation

2018-11-15 Thread Andi Kleen
From: Andi Kleen 

Add a new pass to automatically instrument changes to variables
with the new PTWRITE instruction on x86. PTWRITE writes a 4 or 8 byte
field into a Processor Trace log, which allows low-overhead
logging of information. Essentially it's a hardware-accelerated
printf.

This allows reconstructing values later from the log,
which can be useful for debugging or other analysis of the program's
behavior. With the compiler support this can be done without
having to manually add instrumentation to the code.

Using DWARF information this can later be mapped back to the variables.
The decoder decodes the PTWRITE instructions using IP information
in the log, and then looks up the argument in the debug information.
Then this can be used to reconstruct the original variable
name to display a value history for the variable.

There are new options to enable instrumentation for different types,
and also a new attribute to control instrumentation at a fine-grained,
per-function or per-variable level. The attributes can be set at both
the variable and the type level, and also on structure fields.
This allows enabling tracing only for specific code in large
programs in a flexible manner.

The pass is generic, but only the x86 backend enables the necessary
hooks. When the backend enables the necessary hooks (with -mptwrite)
there is an additional pass that looks through the code for
attribute vartrace enabled functions or variables.

The -fvartrace=locals option is experimental: it works, but it
generates redundant ptwrites because the pass doesn't use
the SSA information to minimize instrumentation.
This could be optimized later.

Currently the code can be tested with SDE, or on an Intel
Gemini Lake system with a new enough Linux kernel (v4.10+)
that supports PTWRITE for PT. Gemini Lake is used in low
end laptops ("Intel Pentium Silver J5.. / Celeron N4... /
Celeron J4...")

Linux perf can be used to record the values

perf record -e intel_pt/ptw=1,branch=0/ program
perf script --itrace=crw -F +synth ...

I have an experimental version of perf that can also use
dwarf information to symbolize many[1] values back to their variable
names. So far it is not in standard perf, but available at

https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=perf/var-resolve-4

It is currently not able to decode all variable locations to names,
but a large subset.

Longer term hopefully gdb will support this information too.

The CPU can potentially generate very high data bandwidths when
code doing a lot of computation is heavily instrumented.
This can cause some data loss in both the CPU and also in perf
logging the data when the disk cannot keep up.

Running some larger workloads, most do not cause
CPU-level overflows, but I've seen it with -fvartrace
with crafty, and with more workloads with -fvartrace-locals.

The recommendation is not to fully instrument programs,
but only areas of interest either at the file level or using
the attributes.

The other thing is that perf and the disk often cannot keep up
with the data bandwidth for longer computations. In this case
it's possible to use perf snapshot mode (add --snapshot
to the command line above). The data will then only be logged to
a memory ring buffer, and the buffers are only dumped on events
of interest by sending SIGUSR2 to the perf binary.
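The snapshot workflow described above could look roughly like this (a sketch of the recipe, not a tested script: it requires a PTWRITE-capable CPU and a perf build with Intel PT support):

```shell
# Record into perf's in-memory AUX ring buffer only (snapshot mode).
perf record --snapshot -e intel_pt/ptw=1,branch=0/ ./program &
PERF_PID=$!

# ... when the event of interest happens, dump the ring buffer:
kill -USR2 "$PERF_PID"

# Afterwards decode the PTWRITE payloads as before:
perf script --itrace=crw -F +synth
```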

In the future this will hopefully be better supported with
core files and gdb.

Passes bootstrap and test suite on x86_64-linux, also
bootstrapped and tested gcc itself with full -fvartrace
and -fvartrace=locals instrumentation.

gcc/:

2018-11-15  Andi Kleen  

* Makefile.in: Add tree-vartrace.o.
* common.opt: Add -fvartrace
* opts.c (parse_vartrace_options): Add.
(common_handle_option): Call parse_vartrace_options.
* config/i386/i386.c (ix86_vartrace_func): Add.
(TARGET_VARTRACE_FUNC): Add.
* doc/extend.texi: Document vartrace/no_vartrace
attributes.
* doc/invoke.texi: Document -fvartrace.
* doc/tm.texi (TARGET_VARTRACE_FUNC): Add.
* passes.def: Add vartrace pass.
* target.def (vartrace_func): Add.
* tree-pass.h (make_pass_vartrace): Add.
* tree-vartrace.c: New file to implement vartrace pass.

gcc/c-family/:

2018-11-15  Andi Kleen  

* c-attribs.c (handle_vartrace_attribute,
  handle_no_vartrace_attribute): New functions.
  (attr_vartrace_exclusions): Add.

config/:

2018-11-03  Andi Kleen  

* bootstrap-vartrace.mk: New.
* bootstrap-vartrace-locals.mk: New.
---
 config/bootstrap-vartrace-locals.mk |   3 +
 config/bootstrap-vartrace.mk|   3 +
 gcc/Makefile.in |   1 +
 gcc/c-family/c-attribs.c|  77 +
 gcc/common.opt  |   8 +
 gcc/config/i386/i386.c  |  32 ++
 gcc/doc/extend.texi |  32 ++
 gcc/doc/invoke.texi |  27 ++
 gcc/doc/tm.texi