Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation
On Thu, Nov 22, 2018 at 02:53:11PM +0100, Richard Biener wrote:
> > the optimizer moving it around over function calls etc.?
> > The instrumentation should still be useful when the program
> > crashes, so we don't want to delay logging too much.
>
> You can't avoid moving it, yes. But it can be even moved now, effectively
> detaching it from the "interesting" $pc ranges we have debug location
> lists for?
>
> > > Maybe even instead pass it a number of bytes so it models how atomics
> > > work.
> >
> > How could that reject float?
>
> Why does it need to reject floats? Note you are already rejecting floats
> in the instrumentation pass.

The backend doesn't know how to generate the code for ptwrite %xmm0. So it
would either need to learn about all these conversions, or reject them in
the hook so that the middle end eventually can generate them. Or maybe
hard code the knowledge in the middle end?

> > Mode seems better for now.
> >
> > Eventually might support float/double through memory, but not in the
> > first version.
>
> Why does movq %xmm0, %rax; ptwrite %rax not work?

It works of course; the question is just who is in charge of figuring out
that the movq needs to be generated (and that it's not a round, but a bit
cast).

> Fair enough, the instrumentation would need to pad out smaller values
> and/or split larger values. I think 1 and 2 byte values would be
> interesting so you can support char and shorts. Eventually 16-byte
> values for __int128 or vectors.

Right. And for copies/clears too. But I was hoping to just get the basics
solid and then do these more advanced features later.

> > > btw, I wonder whether the vartrace attribute should have an argument
> > > like vartrace(locals,reads,...)?
> >
> > I was hoping this could be delayed until actually needed. It'll need
> > some changes because I don't want to do the full parsing for every
> > decl all the time, so would need to store a bitmap of options
> > somewhere in the tree.
>
> Hmm, indeed.
> I wonder if you need the function attribute at all then? Maybe

I really would like an opt-in per function. I think that's an important
use case: just instrument a few functions that contribute to the bug
you're debugging, to cut down the bandwidth.

The idea was that __attribute__((vartrace)) for a function would log
everything in that function, including locals. I could see a use case for
opting in to a function without its locals (mainly because local tracking
is more likely to cause trace overflows if there is a lot of execution).
But I think I can do without that in v1; might add it later.

> That is, on types and decls you'd interpret it as if locals tracing is
> on then locals of type are traced but otherwise locals of type are not
> traced?

When the opt-in is on the type or the variable, then the variable should
be tracked, even if it is a local (or maybe even a cast). The locals
check is mainly for the command line.

> That is, if I just want to trace variable i and do
>
> int i __attribute__((vartrace));
>
> then what options do I enable to make that work and how can I avoid
> tracing anything else? Similar for

Just enable -mptwrite (or nothing if it's implied with -mcpu=...).

> typedef int traced_int __attribute__((vartrace));

Same.

> ? I guess we'd need -fvar-trace=locals to get the described effect for
> the type attribute and then -fvar-trace=locals,all to have _all_ locals
> traced? Or -fvar-trace=locals,only-marked? ... or forgo with the idea
> of marking types?

You want a command line override to not trace if the attribute is set?
-mno-ptwrite would work. Could also add something separate in -fvartrace
(perhaps implied in =off), but I'm not sure if it's worth it. I suppose
it could make sense if someone wants to use the _ptwrite builtin
separately still. I'll add it to =off.

> so there's no 'i' anymore. If you enable debug-info you

Hmm, I guess that's ok for now. I suppose it'll work with -Og.

> and instrumenting late. (but ptwrite is only
> interesting for optimized code, no?)

It's very interesting for non-optimized code too. In fact that would be a
common use case during debugging.

-Andi
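Since the exchange above keeps coming back to how a float would reach PTWRITE, here is a minimal source-level sketch of the bit cast that `movq %xmm0, %rax; ptwrite %rax` performs. The `double_bits` helper is hypothetical; the commented-out `_ptwrite64` call assumes the immintrin.h PTWRITE intrinsic with -mptwrite and is left disabled because it only executes on PTWRITE-capable hardware.

```c
#include <stdint.h>
#include <string.h>
/* #include <immintrin.h>   for _ptwrite64, with -mptwrite */

/* Bit-cast a double to the 64-bit payload PTWRITE expects.  This is
   what "movq %xmm0, %rax" does: no rounding or conversion, just a
   reinterpretation of the same bits.  */
static uint64_t
double_bits (double d)
{
  uint64_t u;
  memcpy (&u, &d, sizeof u);   /* a bit cast, not a float->int round */
  return u;
}

void
trace_double (double d)
{
  uint64_t payload = double_bits (d);
  /* _ptwrite64 (payload);    would emit ptwrite %rax on real hardware */
  (void) payload;
}
```

The point of the sketch is that either the middle end or the backend has to materialize this reinterpretation before the PTWRITE operand exists in an integer register.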
Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation
On Tue, Nov 20, 2018 at 7:27 PM Andi Kleen wrote:
>
> On Tue, Nov 20, 2018 at 01:04:19PM +0100, Richard Biener wrote:
> > Since your builtin clobbers memory
>
> Hmm, maybe we could get rid of that, but then how to avoid
> the optimizer moving it around over function calls etc.?
> The instrumentation should still be useful when the program
> crashes, so we don't want to delay logging too much.

You can't avoid moving it, yes. But it can be even moved now, effectively
detaching it from the "interesting" $pc ranges we have debug location
lists for?

> > Maybe even instead pass it a number of bytes so it models how atomics
> > work.
>
> How could that reject float?

Why does it need to reject floats? Note you are already rejecting floats
in the instrumentation pass.

> Mode seems better for now.
>
> Eventually might support float/double through memory, but not in the
> first version.

Why does movq %xmm0, %rax; ptwrite %rax not work?

> > >    NEXT_PASS (pass_tsan_O0);
> > >    NEXT_PASS (pass_sanopt);
> > > +  NEXT_PASS (pass_vartrace);
> >
> > I'd move it one lower, after pass_cleanup_eh. Further enhancement
> > would make it a RTL pass ...
>
> It's after pass_nrv now.
>
> > So in reality the restriction is on the size of the object, correct?
>
> The instruction accepts 32 or 64bit memory or register.
>
> In principle everything could be logged through this, but I was trying
> to limit the cases to integers and pointers for now to simplify
> the problem.
>
> Right now the backend fails when something else than 4 or 8 bytes
> is passed.

Fair enough, the instrumentation would need to pad out smaller values
and/or split larger values. I think 1 and 2 byte values would be
interesting so you can support char and shorts. Eventually 16-byte
values for __int128 or vectors.

> > > +{
> > > +  if (!supported_type (t))
> >
> > You handle some nested cases below via recursion,
> > like a.b.c (but not a[i][j]). But then this check will
> > fire. I think it would be better to restructure the
> > function to look at the outermost level for whether
> > the op is of supported type, thus we can log it
> > at all and then get all the way down to the base via
> > sth like
> >
> > if (!supported_type (t))
> >   return false;
> > enum attrstate s = ;
> > do
> >   {
> >     s = supported_op (t, s);
> >     if (s == force_off)
> >       return false;
> >   }
> > while (handled_component_p (t) && (t = TREE_OPERAND (t, 0)))
> >
> > Now t is either an SSA_NAME, a DECL (you fail to handle PARM_DECL
>
> Incoming arguments and returns are handled separately.
>
> > and RESULT_DECL below) or a [TARGET_]MEM_REF. To get rid
> > of non-pointer indirections do then
> >
> > t = get_base_address (t);
> > if (DECL_P (t) && is_local (t))
> >
> > because...
> >
> > > +    return false;
> > > +
> > > +  enum attrstate s = supported_op (t, neutral);
> > > +  if (s == force_off)
> > > +    return false;
> > > +  if (s == force_on)
> > > +    force = true;
> > > +
> > > +  switch (TREE_CODE (t))
> > > +    {
> > > +    case VAR_DECL:
> > > +      if (DECL_ARTIFICIAL (t))
> > > +        return false;
> > > +      if (is_local (t))
> > > +        return true;
> > > +      return s == force_on || force;
> > > +
> > > +    case ARRAY_REF:
> > > +      t = TREE_OPERAND (t, 0);
> > > +      s = supported_op (t, s);
> > > +      if (s == force_off)
> > > +        return false;
> > > +      return supported_type (TREE_TYPE (t));
> >
> > Your supported_type is said to take a DECL. And you
> > already asked for this type - it's the type of the original t
> > (well, the type of this type given TREE_TYPE (t) is an array type).
> > But you'll reject a[i][j] where the type of this type is an array
> > type as well.
>
> Just to be clear, after your changes above I only need
> to handle VAR_DECL and SSA_NAME here then, correct?

Yes (and PARM_DECL and RESULT_DECL).

> So one of the reasons I handled ARRAY_REF this way is to
> trace the index as a local if needed. If I can assume
> it was always in a MEM with its own ASSIGN earlier if the local
> was user visible that wouldn't be needed (and also some other similar
> code elsewhere)

If the index lives in memory it has a corresponding load. For SSA uses
see my comment about instrumenting them at all (together with the
suggestion on how to handle them in an easier way).

> But when I look at a simple test case like vartrace-6
>
> void
> f (void)
> {
>   int i;
>   for (i = 0; i < 10; i++)
>     f2 ();
> }
>
> i appears to be an SSA name only that is referenced everywhere without
> MEM. And if the user wants to track the value of i I would need
> to explicitly handle all these cases. Do I miss something here?

You handle those in different places I think.

> I'm starting to think I should perhaps drop locals support to simplify
> everything? But that might limit usability for debugging somewhat.

I think you confuse "locals" a bit. In GIMPLE an automatic
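To make the "locals live only as SSA names" point concrete, here is a hand-instrumented version of the vartrace-6 test case quoted above. `trace_value` is a hypothetical stand-in for the PTWRITE builtin; the sketch shows the natural instrumentation point, each definition of `i`, which is what -fvartrace=locals would conceptually have to target even though `i` never lives in memory at -O2.

```c
#include <stdint.h>

static uint64_t trace_log[64];   /* simulates the Processor Trace log   */
static int trace_len;

/* Hypothetical stand-in for the PTWRITE builtin.  */
static void
trace_value (uint64_t v)
{
  trace_log[trace_len++] = v;
}

static void f2 (void) {}

void
f (void)
{
  int i;
  for (i = 0; i < 10; i++)
    {
      /* What the pass would emit at each SSA definition of i.  */
      trace_value ((uint64_t) (uint32_t) i);
      f2 ();
    }
}
```

Because each iteration produces a new SSA value for `i`, instrumenting every definition yields one logged value per loop iteration.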
Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation
On Tue, Nov 20, 2018 at 01:04:19PM +0100, Richard Biener wrote:
> Since your builtin clobbers memory

Hmm, maybe we could get rid of that, but then how to avoid
the optimizer moving it around over function calls etc.?
The instrumentation should still be useful when the program
crashes, so we don't want to delay logging too much.

> Maybe even instead pass it a number of bytes so it models how atomics
> work.

How could that reject float?

Mode seems better for now.

Eventually might support float/double through memory, but not in the
first version.

> >    NEXT_PASS (pass_tsan_O0);
> >    NEXT_PASS (pass_sanopt);
> > +  NEXT_PASS (pass_vartrace);
>
> I'd move it one lower, after pass_cleanup_eh. Further enhancement
> would make it a RTL pass ...

It's after pass_nrv now.

> So in reality the restriction is on the size of the object, correct?

The instruction accepts 32 or 64bit memory or register.

In principle everything could be logged through this, but I was trying
to limit the cases to integers and pointers for now to simplify
the problem.

Right now the backend fails when something else than 4 or 8 bytes
is passed.

> > +{
> > +  if (!supported_type (t))
>
> You handle some nested cases below via recursion,
> like a.b.c (but not a[i][j]). But then this check will
> fire. I think it would be better to restructure the
> function to look at the outermost level for whether
> the op is of supported type, thus we can log it
> at all and then get all the way down to the base via
> sth like
>
> if (!supported_type (t))
>   return false;
> enum attrstate s = ;
> do
>   {
>     s = supported_op (t, s);
>     if (s == force_off)
>       return false;
>   }
> while (handled_component_p (t) && (t = TREE_OPERAND (t, 0)))
>
> Now t is either an SSA_NAME, a DECL (you fail to handle PARM_DECL

Incoming arguments and returns are handled separately.

> and RESULT_DECL below) or a [TARGET_]MEM_REF. To get rid
> of non-pointer indirections do then
>
> t = get_base_address (t);
> if (DECL_P (t) && is_local (t))
>
> because...
>
> > +    return false;
> > +
> > +  enum attrstate s = supported_op (t, neutral);
> > +  if (s == force_off)
> > +    return false;
> > +  if (s == force_on)
> > +    force = true;
> > +
> > +  switch (TREE_CODE (t))
> > +    {
> > +    case VAR_DECL:
> > +      if (DECL_ARTIFICIAL (t))
> > +        return false;
> > +      if (is_local (t))
> > +        return true;
> > +      return s == force_on || force;
> > +
> > +    case ARRAY_REF:
> > +      t = TREE_OPERAND (t, 0);
> > +      s = supported_op (t, s);
> > +      if (s == force_off)
> > +        return false;
> > +      return supported_type (TREE_TYPE (t));
>
> Your supported_type is said to take a DECL. And you
> already asked for this type - it's the type of the original t
> (well, the type of this type given TREE_TYPE (t) is an array type).
> But you'll reject a[i][j] where the type of this type is an array type
> as well.

Just to be clear, after your changes above I only need
to handle VAR_DECL and SSA_NAME here then, correct?

So one of the reasons I handled ARRAY_REF this way is to
trace the index as a local if needed. If I can assume
it was always in a MEM with its own ASSIGN earlier if the local
was user visible that wouldn't be needed (and also some other similar
code elsewhere)

But when I look at a simple test case like vartrace-6

void
f (void)
{
  int i;
  for (i = 0; i < 10; i++)
    f2 ();
}

i appears to be an SSA name only that is referenced everywhere without
MEM. And if the user wants to track the value of i I would need
to explicitly handle all these cases. Do I miss something here?

I'm starting to think I should perhaps drop locals support to simplify
everything? But that might limit usability for debugging somewhat.

> gsi_insert_* does update_stmt already. Btw, if you allow any
> SImode or DImode size value you can use a VIEW_CONVERT_EXPR

Just add them unconditionally?

> > +bool
> > +instrument_args (function *fun)
> > +{
> > +  gimple_stmt_iterator gi;
> > +  bool changed = false;
> > +
> > +  /* Local tracing usually takes care of the argument too, when
> > +     they are read. This avoids redundant trace instructions. */
>
> But only when instrumenting reads?

Yes, will add the check.

> Hmm, but then this requires the target instruction to have a memory
> operand?

Yes, that's right for now. Eventually it will be fixed and x86 would
benefit too.

> That's going to be unlikely for RISCy cases? On x86 does it work if
> combine later does not synthesize a ptwrite with memory operand?
> I also wonder how this survives RTL CSE since you are basically doing
>
> mem = val; // orig stmt
> val' = mem;
> ptwrite (val');
>
> that probably means when CSE removes the load there ends up a debug-insn
> reflecting what you want?

I'll check.

> > +  /* Handle operators in case they read locals. */
>
> Does it make sense at all to instrument SSA
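The `mem = val; val' = mem; ptwrite (val')` pattern quoted in this exchange can be sketched at the source level like this. `trace_value` is a hypothetical stand-in for the PTWRITE builtin taking a memory operand; as the discussion notes, RTL CSE may later forward `val` directly into the reload and eliminate the extra load.

```c
#include <stdint.h>

static uint64_t last_traced;   /* simulates the trace log */

/* Hypothetical stand-in for ptwrite with a memory operand.  */
static void
trace_value (uint64_t v)
{
  last_traced = v;
}

void
store_and_trace (int *mem, int val)
{
  *mem = val;                          /* mem = val;   orig stmt    */
  int reloaded = *mem;                 /* val' = mem;  reload       */
  trace_value ((uint32_t) reloaded);   /* ptwrite (val');           */
}
```

The sketch only illustrates the shape of the instrumented sequence; whether the reload survives to the final assembly depends on the RTL optimizers, which is exactly the question raised above.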
Re: [PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation
On Fri, Nov 16, 2018 at 4:57 AM Andi Kleen wrote:
>
> From: Andi Kleen
>
> Add a new pass to automatically instrument changes to variables
> with the new PTWRITE instruction on x86.
>
> [rest of the patch description snipped; it is quoted in full in the
> original posting below]

In the cover mail you mentioned you didn't get rid of SSA update. That is
because your instrumentation does not set the call's virtual operands.
Since your builtin clobbers memory and you instrument non-memory ops,
that's only possible if you'd track the active virtual operand during the
walk over the function. I suppose using SSA update is OK for now.

More comments inline

> gcc/:
>
> 2018-11-15  Andi Kleen
>
> 	* Makefile.in: Add tree-vartrace.o.
> 	* common.opt: Add -fvartrace.
> 	* opts.c (parse_vartrace_options): Add.
> 	(common_handle_option): Call parse_vartrace_options.
> 	* config/i386/i386.c (ix86_vartrace_func): Add.
> 	(TARGET_VARTRACE_FUNC): Add.
> 	* doc/extend.texi: Document vartrace/no_vartrace attributes.
> 	* doc/invoke.texi: Document -fvartrace.
> 	* doc/tm.texi (TARGET_VARTRACE_FUNC): Add.
> 	* passes.def: Add vartrace pass.
> 	* target.def (vartrace_func): Add.
> 	* tree-pass.h (make_pass_vartrace): Add.
> 	* tree-vartrace.c: New file to implement vartrace pass.
>
> gcc/c-family/:
>
> 2018-11-15  Andi Kleen
>
> 	* c-attribs.c (handle_vartrace_attribute,
>
[PATCH v2 2/3] Add a pass to automatically add ptwrite instrumentation
From: Andi Kleen

Add a new pass to automatically instrument changes to variables with the
new PTWRITE instruction on x86. PTWRITE writes a 4 or 8 byte field into a
Processor Trace log, which allows low-overhead logging of information.
Essentially it's a hardware accelerated printf.

This allows values to be reconstructed later from the log, which can be
useful for debugging or other analysis of the program behavior. With the
compiler support this can be done without having to manually add
instrumentation to the code.

Using dwarf information this can be later mapped back to the variables.
The decoder decodes the PTWRITE instructions using IP information in the
log, and then looks up the argument in the debug information. Then this
can be used to reconstruct the original variable name to display a value
history for the variable.

There are new options to enable instrumentation for different types, and
also a new attribute to control the analysis at a fine-grained
per-function or per-variable level. The attributes can be set on both the
variable and the type level, and also on structure fields. This allows
enabling tracing only for specific code in large programs in a flexible
manner.

The pass is generic, but only the x86 backend enables the necessary
hooks. When the backend enables the necessary hooks (with -mptwrite)
there is an additional pass that looks through the code for attribute
vartrace enabled functions or variables.

The -fvartrace=locals option is experimental: it works, but it generates
redundant ptwrites because the pass doesn't use the SSA information to
minimize instrumentation. This could be optimized later.

Currently the code can be tested with SDE, or on an Intel Gemini Lake
system with a new enough Linux kernel (v4.10+) that supports PTWRITE for
PT. Gemini Lake is used in low end laptops ("Intel Pentium Silver J5.. /
Celeron N4... / Celeron J4...")

Linux perf can be used to record the values:

perf record -e intel_pt/ptw=1,branch=0/ program
perf script --itrace=crw -F +synth ...

I have an experimental version of perf that can also use dwarf
information to symbolize many[1] values back to their variable names. So
far it is not in standard perf, but available at

https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=perf/var-resolve-4

It is currently not able to decode all variable locations to names, but a
large subset.

Longer term hopefully gdb will support this information too.

The CPU can potentially generate very high data bandwidths when code
doing a lot of computation is heavily instrumented. This can cause some
data loss, both in the CPU and also in perf logging the data when the
disk cannot keep up.

Running some larger workloads, most do not cause CPU level overflows, but
I've seen it with -fvartrace with crafty, and with more workloads with
-fvartrace-locals.

The recommendation is to not fully instrument programs, but only areas of
interest, either at the file level or using the attributes.

The other thing is that perf and the disk often cannot keep up with the
data bandwidth for longer computations. In this case it's possible to use
perf snapshot mode (add --snapshot to the command line above). The data
will then only be logged to a memory ring buffer, and the buffers are
only dumped on events of interest, by sending SIGUSR2 to the perf binary.

In the future this will hopefully be better supported with core files and
gdb.

Passes bootstrap and test suite on x86_64-linux; also bootstrapped and
tested gcc itself with full -fvartrace and -fvartrace=locals
instrumentation.

gcc/:

2018-11-15  Andi Kleen

	* Makefile.in: Add tree-vartrace.o.
	* common.opt: Add -fvartrace.
	* opts.c (parse_vartrace_options): Add.
	(common_handle_option): Call parse_vartrace_options.
	* config/i386/i386.c (ix86_vartrace_func): Add.
	(TARGET_VARTRACE_FUNC): Add.
	* doc/extend.texi: Document vartrace/no_vartrace attributes.
	* doc/invoke.texi: Document -fvartrace.
	* doc/tm.texi (TARGET_VARTRACE_FUNC): Add.
	* passes.def: Add vartrace pass.
	* target.def (vartrace_func): Add.
	* tree-pass.h (make_pass_vartrace): Add.
	* tree-vartrace.c: New file to implement vartrace pass.

gcc/c-family/:

2018-11-15  Andi Kleen

	* c-attribs.c (handle_vartrace_attribute,
	handle_no_vartrace_attribute): New functions.
	(attr_vartrace_exclusions): Add.

config/:

2018-11-03  Andi Kleen

	* bootstrap-vartrace.mk: New.
	* bootstrap-vartrace-locals.mk: New.
---
 config/bootstrap-vartrace-locals.mk |  3 +
 config/bootstrap-vartrace.mk        |  3 +
 gcc/Makefile.in                     |  1 +
 gcc/c-family/c-attribs.c            | 77 +
 gcc/common.opt                      |  8 +
 gcc/config/i386/i386.c              | 32 ++
 gcc/doc/extend.texi                 | 32 ++
 gcc/doc/invoke.texi                 | 27 ++
 gcc/doc/tm.texi
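As a usage illustration of the attributes and options described in the patch, here is a hypothetical example of attribute placement on a function, a typedef, and structure fields. These attributes only have an effect with this patch applied and -mptwrite; an unpatched compiler merely warns that the unknown attribute is ignored, so the code still builds and runs without instrumentation.

```c
/* Hypothetical vartrace usage per the patch description above.  */
typedef int traced_int __attribute__((vartrace));   /* trace uses of this type */

struct counter {
  long hits __attribute__((vartrace));        /* writes to this field traced   */
  long scratch __attribute__((no_vartrace));  /* explicitly never traced       */
};

__attribute__((vartrace))        /* trace everything in this function */
long
bump (struct counter *c, traced_int delta)
{
  c->hits += delta;   /* store would be instrumented with PTWRITE */
  c->scratch = 0;     /* opted out via no_vartrace */
  return c->hits;
}
```

With the patch, compiling this with -mptwrite would emit ptwrite instructions for the marked stores; without it, the function behaves as plain C.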