Re: [RFC] perf tool improvement requests
On Tue, Sep 04, 2018 at 08:50:07AM -0700, Stephane Eranian wrote:

> > > When we get an exact IP (using PEBS) and were sampling a data related
> > > event (say L1 misses), we can get the data type from the instruction
> > > itself; that is, through DWARF. We _know_ what type (structure::member)
> > > is read/written to.
>
> > I have been asking this from the compiler people for a long time!
>
> I don't think it is there. I'd like each load/store to be annotated
> with a data type + offset within the type. It would allow data type
> profiling. This would not be bulletproof though because of the
> accessor function problem:
>
>   void incr(int *v) { (*v)++; }
>   struct foo { int a, b; } bar;
>   incr(&bar.a);

Cute, yes. Also, array accesses are tricky. But I think even with those
caveats it would be _very_ useful.

> There are concerns with the volume of data that this would generate.
> But my argument is that this is just debug binaries; it does not make
> the stripped binary any bigger.

Right; the alternative is that we build an asm interpreter and follow
the data types throughout the function, because DWARF can tell us about
the types at a number of places, like function call arguments etc..
That is, of course, a terrible lot of work :/
Re: [RFC] perf tool improvement requests
Arnaldo,

On Tue, Sep 4, 2018 at 6:42 AM Arnaldo Carvalho de Melo wrote:
>
> Em Tue, Sep 04, 2018 at 09:10:49AM +0200, Peter Zijlstra escreveu:
> > On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> > > A few weeks ago, you had asked if I had more requests for the perf tool.
>
> > I have one long standing one; that is IP based data structure
> > annotation.
>
> > When we get an exact IP (using PEBS) and were sampling a data related
> > event (say L1 misses), we can get the data type from the instruction
> > itself; that is, through DWARF. We _know_ what type (structure::member)
> > is read/written to.

I have been asking this from the compiler people for a long time! I
don't think it is there. I'd like each load/store to be annotated with
a data type + offset within the type. It would allow data type
profiling. This would not be bulletproof though because of the accessor
function problem:

  void incr(int *v) { (*v)++; }
  struct foo { int a, b; } bar;
  incr(&bar.a);

Here the load/store in incr() would see an int pointer, not an int
inside struct foo at offset 0, which is what we want.

There are concerns with the volume of data that this would generate.
But my argument is that this is just debug binaries; it does not make
the stripped binary any bigger.

> > I would love to get that in a pahole style output.

Yes, me too!

> > Better yet, when you measure both hits and misses, you can get a
> > structure usage overview, and see what lines are used lots and what
> > members inside that line are rarely used. Ideal information for data
> > structure layout optimization.
>
> > 1000x more useful than that c2c crap.

c2c is about something else: more about NUMA issues and false sharing.

> > Can we please get that?
> So, use 'c2c record' to get the samples:
>
> [root@jouet ~]# perf c2c record
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 5.152 MB perf.data (4555 samples) ]
>
> Events collected:
>
> [root@jouet ~]# perf evlist -v
> cpu/mem-loads,ldlat=30/P: type: 4, size: 112, config: 0x1cd,
> { sample_period, sample_freq }: 4000, sample_type:
> IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR,
> read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1,
> task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, mmap2: 1,
> comm_exec: 1, { bp_addr, config1 }: 0x1f
> cpu/mem-stores/P: type: 4, size: 112, config: 0x82d0,
> { sample_period, sample_freq }: 4000, sample_type:
> IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR,
> read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3,
> sample_id_all: 1
>
> Then we'll get an 'annotate --hits' option (just cooked up, will
> polish) that will show the name of the function, info about it
> globally, i.e. what annotate already produced; we may get this in CSV
> for better post processing consumption:
>
> [root@jouet ~]# perf annotate --hits kmem_cache_alloc
> Samples: 20 of event 'cpu/mem-loads,ldlat=30/P', 4000 Hz, Event count
> (approx.): 875, [percent: local period]
> kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
>   4.91  15: mov    gfp_allowed_mask,%ebx
>   2.51  51: mov    (%r15),%r8
>  17.14  54: mov    %gs:0x8(%r8),%rdx
>   6.51  61: cmpq   $0x0,0x10(%r8)
>  17.14  66: mov    (%r8),%r14
>   6.29  78: mov    0x20(%r15),%ebx
>   5.71  7c: mov    (%r15),%rdi
>  29.49  85: xor    0x138(%r15),%rbx
>   2.86  9d: lea    (%rdi),%rsi
>   3.43  d7: pop    %rbx
>   2.29  dc: pop    %r12
>   1.71  ed: testb  $0x4,0xb(%rbp)
> [root@jouet ~]#

How does this relate to what Peter was asking? It has nothing about
data types. What I'd like is a true data type profiler showing you the
most accessed data types,
and then an annotate mode showing you which fields inside the types are
mostly read or written, with their sizes and alignment. The goal is to
improve layout based on accesses, to minimize the number of cachelines
moved. You need DLA sampling on all loads and stores and then type
annotation. As I said, I have prototyped this for self-sampling
programs, but not in the perf tool. It is harder there because you need
type information and heap information. I think DWARF is one way to go,
assuming it is extended to support the right kind of load/store
annotations. Another way is to track allocations and correlate them to
data types.

> Then I need to get the DW_AT_location stuff parsed in pahole, so
> that with those offsets (second column, ending with :) with hits (first
> column, there its local period, but we can ask for some specific metric
> [1]), I'll be able to figure out what DW_TAG_variable or
> DW_TAG_formal_parameter is living there at that time, get the offset
> from the decoded instruction, say that xor, 0x138 offset from the type
> for %r15 at that offset (85) from kmem_cache_alloc, right?
>
> In a first milestone we'd have something like:
>
>   perf annotate --hits function | pahole
Re: [RFC] perf tool improvement requests
Em Tue, Sep 04, 2018 at 04:17:24PM +0200, Peter Zijlstra escreveu:
> On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> > Then I need to get the DW_AT_location stuff parsed in pahole, so
> > that with those offsets (second column, ending with :) with hits (first
> > column, there its local period, but we can ask for some specific metric
> > [1]), I'll be able to figure out what DW_TAG_variable or
> > DW_TAG_formal_parameter is living there at that time, get the offset
> > from the decoded instruction, say that xor, 0x138 offset from the type
> > for %r15 at that offset (85) from kmem_cache_alloc, right?
>
> I'm not sure how the DWARF location stuff works; it could be it already
> includes the offset and decoding the instruction is not needed.
>
> But yes, that's the basic idea; get DWARF to tell you what variable is
> used at a certain IP.
>
> > In a first milestone we'd have something like:
> >
> >   perf annotate --hits function | pahole --annotate -C task_struct
> >
> >   perf annotate --hits | pahole --annotate
>
> Not sure keeping it two proglets makes sense, but whatever :-)

This is just a start, trying to take advantage of existing codebases.

> The alternative I suppose is making perf do the IP->struct::member
> mapping and feed that to pahole, which then only uses it to annotate
> the output.

So, what I'm trying to do now is to make perf get the samples associated
with functions/offsets + decoded instructions. Pahole, which already
touches DWARF info, will just use the DW_AT_location; look at its
description, from
https://blog.tartanllama.xyz/writing-a-linux-debugger-variables/:

---
Simple location descriptions describe the location of one contiguous
piece (usually all) of an object. A simple location description may
describe a location in addressable memory, or in a register, or the lack
of a location (with or without a known value).

Example: DW_OP_fbreg -32

A variable which is entirely stored -32 bytes from the stack frame base.
Composite location descriptions describe an object in terms of pieces,
each of which may be contained in part of a register or stored in a
memory location unrelated to other pieces.

Example: DW_OP_reg3 DW_OP_piece 4 DW_OP_reg10 DW_OP_piece 2

A variable whose first four bytes reside in register 3 and whose next
two bytes reside in register 10.

Location lists describe objects which have a limited lifetime or change
location during their lifetime.

Example:

  [ 0] DW_OP_reg0
  [ 1] DW_OP_reg3
  [ 2] DW_OP_reg2

A variable whose location moves between registers depending on the
current value of the program counter.
---

So I have a list of DW_TAG_formal_parameter (function parameters) and
DW_TAG_variable, and the above location lists/descriptions, stating in
what registers and what IP ranges the variables are in. In the
DW_TAG_{formal_parameter,variable} I have DW_AT_type, which points to
the type of that variable. Couple that with the offset taken from the
decoded instruction we get from 'perf annotate --hits' and we should
have all we need, no?

Then pahole can have all this painted on structs (like 'perf annotate')
for the whole workload, or for specific callchains, etc.

> Or, munge the entirety of pahole into perf..

That may be interesting at some point, yes.

- Arnaldo
Re: [RFC] perf tool improvement requests
On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> Then I need to get the DW_AT_location stuff parsed in pahole, so
> that with those offsets (second column, ending with :) with hits (first
> column, there its local period, but we can ask for some specific metric
> [1]), I'll be able to figure out what DW_TAG_variable or
> DW_TAG_formal_parameter is living there at that time, get the offset
> from the decoded instruction, say that xor, 0x138 offset from the type
> for %r15 at that offset (85) from kmem_cache_alloc, right?

I'm not sure how the DWARF location stuff works; it could be it already
includes the offset and decoding the instruction is not needed.

But yes, that's the basic idea; get DWARF to tell you what variable is
used at a certain IP.

> In a first milestone we'd have something like:
>
>   perf annotate --hits function | pahole --annotate -C task_struct
>
>   perf annotate --hits | pahole --annotate

Not sure keeping it two proglets makes sense, but whatever :-)

The alternative I suppose is making perf do the IP->struct::member
mapping and feed that to pahole, which then only uses it to annotate
the output.

Or, munge the entirety of pahole into perf..
Re: [RFC] perf tool improvement requests
Em Tue, Sep 04, 2018 at 03:58:35PM +0200, Jiri Olsa escreveu:
> On Tue, Sep 04, 2018 at 03:53:25PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> > > So, use 'c2c record' to get the samples:
>
> > IIRC that uses numa events and is completely useless.
>
> I guess perf record on any other event would work
> in Arnaldo's workflow

Right. I should've avoided useless events ;-)

- Arnaldo
Re: [RFC] perf tool improvement requests
On Tue, Sep 04, 2018 at 03:53:25PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> > So, use 'c2c record' to get the samples:
>
> IIRC that uses numa events and is completely useless.

I guess perf record on any other event would work in Arnaldo's workflow.

jirka
Re: [RFC] perf tool improvement requests
On Tue, Sep 04, 2018 at 10:42:18AM -0300, Arnaldo Carvalho de Melo wrote:
> So, use 'c2c record' to get the samples:

IIRC that uses numa events and is completely useless.
Re: [RFC] perf tool improvement requests
Em Tue, Sep 04, 2018 at 09:10:49AM +0200, Peter Zijlstra escreveu:
> On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> > A few weeks ago, you had asked if I had more requests for the perf tool.
>
> I have one long standing one; that is IP based data structure
> annotation.
>
> When we get an exact IP (using PEBS) and were sampling a data related
> event (say L1 misses), we can get the data type from the instruction
> itself; that is, through DWARF. We _know_ what type (structure::member)
> is read/written to.
>
> I would love to get that in a pahole style output.
>
> Better yet, when you measure both hits and misses, you can get a
> structure usage overview, and see what lines are used lots and what
> members inside that line are rarely used. Ideal information for data
> structure layout optimization.
>
> 1000x more useful than that c2c crap.
>
> Can we please get that?

So, use 'c2c record' to get the samples:

[root@jouet ~]# perf c2c record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 5.152 MB perf.data (4555 samples) ]

Events collected:

[root@jouet ~]# perf evlist -v
cpu/mem-loads,ldlat=30/P: type: 4, size: 112, config: 0x1cd,
{ sample_period, sample_freq }: 4000, sample_type:
IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR,
read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1,
task: 1, precise_ip: 3, mmap_data: 1, sample_id_all: 1, mmap2: 1,
comm_exec: 1, { bp_addr, config1 }: 0x1f
cpu/mem-stores/P: type: 4, size: 112, config: 0x82d0,
{ sample_period, sample_freq }: 4000, sample_type:
IP|TID|TIME|ADDR|ID|CPU|PERIOD|DATA_SRC|WEIGHT|PHYS_ADDR,
read_format: ID, disabled: 1, inherit: 1, freq: 1, precise_ip: 3,
sample_id_all: 1

Then we'll get an 'annotate --hits' option (just cooked up, will
polish) that will show the name of the function, info about it
globally, i.e.
what annotate already produced; we may get this in CSV for better
post processing consumption:

[root@jouet ~]# perf annotate --hits kmem_cache_alloc
Samples: 20 of event 'cpu/mem-loads,ldlat=30/P', 4000 Hz, Event count
(approx.): 875, [percent: local period]
kmem_cache_alloc() /usr/lib/debug/lib/modules/4.17.17-100.fc27.x86_64/vmlinux
  4.91  15: mov    gfp_allowed_mask,%ebx
  2.51  51: mov    (%r15),%r8
 17.14  54: mov    %gs:0x8(%r8),%rdx
  6.51  61: cmpq   $0x0,0x10(%r8)
 17.14  66: mov    (%r8),%r14
  6.29  78: mov    0x20(%r15),%ebx
  5.71  7c: mov    (%r15),%rdi
 29.49  85: xor    0x138(%r15),%rbx
  2.86  9d: lea    (%rdi),%rsi
  3.43  d7: pop    %rbx
  2.29  dc: pop    %r12
  1.71  ed: testb  $0x4,0xb(%rbp)
[root@jouet ~]#

Then I need to get the DW_AT_location stuff parsed in pahole, so that
with those offsets (second column, ending with :) with hits (first
column, there it's the local period, but we can ask for some specific
metric [1]), I'll be able to figure out what DW_TAG_variable or
DW_TAG_formal_parameter is living there at that time, get the offset
from the decoded instruction, say that xor, 0x138 offset from the type
for %r15 at that offset (85) from kmem_cache_alloc, right?

In a first milestone we'd have something like:

  perf annotate --hits function | pahole --annotate -C task_struct

  perf annotate --hits | pahole --annotate

That would show all structs with hits, for all functions with hits.
Other options would show which struct has more hits, etc.

[1]

[root@jouet ~]# perf annotate -h local

 Usage: perf annotate [<options>]

        --percent-type
                          Set percent type local/global-period/hits

[root@jouet ~]#

- Arnaldo
Re: [RFC] perf tool improvement requests
On Mon, Sep 03, 2018 at 07:45:48PM -0700, Stephane Eranian wrote:
> Hi Arnaldo, Jiri,
>
> A few weeks ago, you had asked if I had more requests for the perf tool.

I have one long standing one; that is IP based data structure
annotation.

When we get an exact IP (using PEBS) and were sampling a data related
event (say L1 misses), we can get the data type from the instruction
itself; that is, through DWARF. We _know_ what type (structure::member)
is read/written to.

I would love to get that in a pahole style output.

Better yet, when you measure both hits and misses, you can get a
structure usage overview, and see what lines are used lots and what
members inside that line are rarely used. Ideal information for data
structure layout optimization.

1000x more useful than that c2c crap.

Can we please get that?
[RFC] perf tool improvement requests
Hi Arnaldo, Jiri,

A few weeks ago, you had asked if I had more requests for the perf
tool. I have put together the following list to improve the usability
of the perf tool, at least for our usage. Nothing is very big, just
small improvements here and there.

1/ perf stat interval printing

Today, the timestamp printed via perf stat -I is relative to the start
of the measurements. It would be beneficial to also support a mode
where it uses a time source which can be synchronized with other
traces or profiles, for instance gettimeofday() or
clock_gettime(CLOCK_MONOTONIC).

2/ perf report event grouping

If you do:

  $ perf record -e '{cycles,instructions,branches}'
  $ perf report

it will show the 3 profiles together, which is VERY useful. However
the output is confusing because it is hard to tell which % corresponds
to which event. I know it is command-line order, but it would be good
to have a header on the columns pointing to the events, instead of
guessing. A few times, I had to revert to perf report --header-only to
figure out the event order. I discovered the 'i' key on the function
profile, but it is still hard to find the events, especially if you
passed many of them.

3/ annotate output of loops

Percent│401f00:   xor    %eax,%eax
       │401f02:   test   %edi,%edi
       │401f04: ↓ jle    401f2b
       │401f06:   nopw   %cs:0x0(%rax,%rax,1)
 34.20 │401f1┌─→  movsd  (%rcx,%rax,8),%xmm1
 14.60 │401f1│:   mulsd  %xmm0,%xmm1
 33.24 │401f1│:   addsd  (%rdx,%rax,8),%xmm1
  9.98 │401f1│:   movsd  %xmm1,(%rsi,%rax,8)
  0.10 │401f2│:   add    $0x1,%rax
  0.03 │401f2├──  cmp    %eax,%edi
  7.84 │401f2└──↑ jg     401f10
       │401f2b:   mov    $0x18,%eax
       │401f30: ← retq

The loop arrows cut through the code addresses. That is annoying!

4/ sorting and event groups

If I do:

  $ perf record -e '{cycles,instructions}'
  $ perf report

it will sort the samples based on the first (leader) event of the
group. Yet here all events are sampling events; you could just as well
sort on the second event. But I don't think perf report supports a
sort order on multiple events. Both are from the same category: syms
(or ip). Right now, I would have to collect another profile:

  $ perf record -e '{instructions,cycles}'
  $ perf report

5/ cgroups

Today, to measure multiple events in the same cgroup, you need to do:

  $ perf stat -e cycles,branches,instructions -G foo,foo,foo

i.e., you need to specify the cgroup N times for N events. It would be
good to support a mode where you'd have to specify the cgroup only
once:

  $ perf stat -e cycles,branches,instructions --cgroup-all foo,bar

would measure cycles, branches and instructions for both cgroups foo
and bar.

6/ perf script ip vs. callchain

I already submitted this request separately. It is about providing a
way to generate the callchain separately from the ip in perf script.
Right now, they are lumped together, which is not always useful. Also,
right now the callchain is a multi-line output, which is not useful;
perf script should stick with one line per sample, at least when
symbolization is off. We have examples of that with brstack.

I may have more requests but I wanted to start with these for now.

Thanks for your efforts.
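The one-line-per-sample formatting asked for in request 6/ above is essentially a join over the resolved frames, the way brstack output stays on one line. A sketch of that, with invented field names and separator (the real perf script fields and ordering would differ):

```python
def one_line_sample(ip, callchain, sep=";"):
    """Join the sample ip with its callers, innermost first, into one field.

    ip        -- the sampled instruction pointer (or symbol), as a string
    callchain -- list of caller frames, innermost caller first
    """
    return sep.join([ip] + callchain)

# e.g. a sample whose callchain was already symbolized; values are
# illustrative, not real perf output
print(one_line_sample("ffffffff81234567",
                      ["kmem_cache_alloc", "getname_flags", "do_sys_open"]))
```

Keeping the separator out of the symbol alphabet (no `;` in mangled names for the common cases) is what makes this trivially machine-parsable downstream, which is the point of the request.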