Re: [PATCH 1/9] perf, tools: Support handling complete branch stacks as histograms v6

2014-05-25 Thread Namhyung Kim
Hi Andi,

On Fri, 23 May 2014 14:35:03 -0700, Andi Kleen wrote:
> On Mon, May 19, 2014 at 05:21:15PM +0900, Namhyung Kim wrote:
>> This is gone with 540476de74c9 ("perf tools: Remove
>> symbol_conf.use_callchain check").
>
> The patchkit applies to tip/perf/core.

The commit 540476de74c9 is also in the tip/perf/core.  Please check
machine_resolve_callchain_sample().

>
>> > +				 * Check for overlap into the callchain.
>> > +				 * The return address is one off compared to
>> > +				 * the branch entry. To adjust for this
>> > +				 * assume the calling instruction is not longer
>> > +				 * than 8 bytes.
>> > +				 */
>> > +				if (be[i].from < chain->ips[first_call] &&
>> > +				    be[i].from >= chain->ips[first_call] - 8)
>> > +					first_call++;
>> 
>> It seems that you need to check chain->ips[first_call] is greater than
>> PERF_CONTEXT_MAX and use such value as the cpumode...
>
> I don't understand the comment. The only IP that gets resolved is the from/to.
> And add_callchain_ip does its own resolution.
>
> Wouldn't make any sense to get it from first_call

Okay, let me explain it this way..

You're checking the branch stack against the normal callchain to find the
overlap by comparing the 'from' address with the addresses in chain->ips[].
But chain->ips[0] doesn't contain a valid address; it holds a
PERF_CONTEXT_XXX marker giving the cpumode of the entries that follow.  So
with first_call at 0 the check won't do anything meaningful, and the result
would still contain overlapped callchains.

  $ perf --version
  perf version 3.15.rc4.g816bf8
  
  $ perf record -b -g ./tcall
  
  $ perf report -D | grep -A35 SAMPLE
  4748858059190923 0x3608 [0x240]: PERF_RECORD_SAMPLE(IP, 0x1): 31914/31914: 0x81043ffa period: 1 addr: 0
  ... chain: nr:17
  .  0: ff80
  .  1: 81043ffa
  .  2: 81029d40
  .  3: 81025554
  .  4: 811246c7
  .  5: 81125d69
  .  6: 811280dc
  .  7: 811a4266
  .  8: 811a4bb1
  .  9: 811f1a4f
  . 10: 811a344c
  . 11: 811a49bb
  . 12: 811a4ac8
  . 13: 811a4d3d
  . 14: 81664689
  . 15: fe00
  . 16: 003153ebca47
  ... branch stack: nr:16
  .  0: 81029d3b -> 81043ff0
  .  1: 810280c9 -> 81029d18
  .  2: 81043ffd -> 810280be
  .  3:  -> 
  .  4:  -> 
  .  5:  -> 
  .  6:  -> 
  .  7:  -> 
  .  8:  -> 
  .  9:  -> 
  . 10:  -> 
  . 11:  -> 
  . 12:  -> 
  . 13:  -> 
  . 14:  -> 
  . 15:  -> 

As you can see, chain->ips[0] is ff80 (= -128), which is defined as
PERF_CONTEXT_KERNEL.  Also, the branch stack reports nr as 16 here, but
only 3 entries are actually filled in.  I guess you need to ignore the
zero entries.

Also, perf report seems to fail to resolve symbols/srclines in the branch
stack (possibly due to the missing cpumode) and to detect loops.

  $ perf report --branch-history --stdio
  ...
   0.00%  native_writ  [k] native_write_msr_safe  [kernel.kallsyms]
  |
  ---0x81043ff0
 0x81029d3b
 0x81029d18
 0x810280c9
 0x810280be
 0x81043ffd
 |  
 |--99.77%-- 0x81043ff0
 |  0x81029d3b
 |  0x81029d18
 |  0x810280c9
 |  0x810280be
 |  0x81043ffd
 |  0
 |  0
 |  |  
 |  |--91.43%-- native_write_msr_safe +10
 |  |  intel_pmu_enable_all +80
 |  |  x86_pmu_enable +628
 |  |  perf_pmu_enable +39
 |  |  perf_event_context_sched_in +121
 |  |  perf_event_comm +364
 |  |  set_task_comm +102
 |  |  setup_new_exec +129
 |  |  load_elf_binary +1007
 |  |  
 |   --8.57%-- 0
 |

Re: [PATCH 1/9] perf, tools: Support handling complete branch stacks as histograms v6

2014-05-23 Thread Andi Kleen
On Mon, May 19, 2014 at 05:21:15PM +0900, Namhyung Kim wrote:
> This is gone with 540476de74c9 ("perf tools: Remove
> symbol_conf.use_callchain check").

The patchkit applies to tip/perf/core.

> > +				 * Check for overlap into the callchain.
> > +				 * The return address is one off compared to
> > +				 * the branch entry. To adjust for this
> > +				 * assume the calling instruction is not longer
> > +				 * than 8 bytes.
> > +				 */
> > +				if (be[i].from < chain->ips[first_call] &&
> > +				    be[i].from >= chain->ips[first_call] - 8)
> > +					first_call++;
> 
> It seems that you need to check chain->ips[first_call] is greater than
> PERF_CONTEXT_MAX and use such value as the cpumode...

I don't understand the comment. The only IP that gets resolved is the from/to.
And add_callchain_ip does its own resolution.

Wouldn't make any sense to get it from first_call

> 
> 
> > +			} else
> > +				be[i] = branch->entries[branch->nr - i - 1];
> > +		}
> > +
> > +		nr = remove_loops(be, nr);
> > +
> > +		for (i = 0; i < nr; i++) {
> > +			err = add_callchain_ip(machine, thread, parent,
> > +					       root_al,
> > +					       -1, be[i].to);
> > +			if (!err)
> > +				err = add_callchain_ip(machine, thread,
> > +						       parent, root_al,
> > +						       -1, be[i].from);
> 
> ... for here.
> 
> 
> > +			if (err == -EINVAL)
> > +				break;
> > +			if (err)
> > +				return err;
> > +		}
> > +		chain_nr -= nr;
> 
> It seems it could make some callchain nodes being ignored.  What if a
> case like small callchains with matches to only 2 nodes in the LBR?
> 
>   nr = 16, chain_nr = 10 and first_call = 2

The chain_nr variable is just to handle it when the user
specified a max_stack value. nr is always capped to max_stack too.
If lbr size is >= max_stack it will end up being 0 or negative and the 
following loop to add normal call stack entries will do nothing.

I think that's the correct behavior.

-Andi



Re: [PATCH 1/9] perf, tools: Support handling complete branch stacks as histograms v6

2014-05-19 Thread Namhyung Kim
Hi Andi,

On Fri, 16 May 2014 10:05:30 -0700, Andi Kleen wrote:
> From: Andi Kleen 
>
> Currently branch stacks can be only shown as edge histograms for
> individual branches. I never found this display particularly useful.
>
> This implements an alternative mode that creates histograms over complete
> branch traces, instead of individual branches, similar to how normal
> callgraphs are handled. This is done by putting it in
> front of the normal callgraph and then using the normal callgraph
> histogram infrastructure to unify them.
>
> This way in complex functions we can understand the control flow
> that led to a particular sample, and may even see some control
> flow in the caller for short functions.

[SNIP]
> +static int add_callchain_ip(struct machine *machine,
> +			    struct thread *thread,
> +			    struct symbol **parent,
> +			    struct addr_location *root_al,
> +			    int cpumode,
> +			    u64 ip)
> +{
> +	struct addr_location al;
> +
> +	al.filtered = 0;
> +	al.sym = NULL;
> +	thread__find_addr_location(thread, machine, cpumode, MAP__FUNCTION,
> +				   ip, &al);
> +	if (al.sym != NULL) {
> +		if (sort__has_parent && !*parent &&
> +		    symbol__match_regex(al.sym, &parent_regex))
> +			*parent = al.sym;
> +		else if (have_ignore_callees && root_al &&
> +		  symbol__match_regex(al.sym, &ignore_callees_regex)) {
> +			/* Treat this symbol as the root,
> +			   forgetting its callees. */
> +			*root_al = al;
> +			callchain_cursor_reset(&callchain_cursor);
> +		}
> +		if (!symbol_conf.use_callchain)
> +			return -EINVAL;

This is gone with 540476de74c9 ("perf tools: Remove
symbol_conf.use_callchain check").


> +	}
> +
> +	return callchain_cursor_append(&callchain_cursor, ip, al.map, al.sym);
> +}
> +
> +#define CHASHSZ 127
> +#define CHASHBITS 7
> +#define NO_ENTRY 0xff
> +
> +#define PERF_MAX_BRANCH_DEPTH 127
> +
> +/* Remove loops. */
> +static int remove_loops(struct branch_entry *l, int nr)
> +{
> +	int i, j, off;
> +	unsigned char chash[CHASHSZ];
> +	memset(chash, -1, sizeof(chash));

s/-1/NO_ENTRY/ ?

> +
> +	BUG_ON(nr >= 256);
> +	for (i = 0; i < nr; i++) {
> +		int h = hash_64(l[i].from, CHASHBITS) % CHASHSZ;
> +
> +		/* no collision handling for now */
> +		if (chash[h] == NO_ENTRY) {
> +			chash[h] = i;
> +		} else if (l[chash[h]].from == l[i].from) {
> +			bool is_loop = true;
> +			/* check if it is a real loop */
> +			off = 0;
> +			for (j = chash[h]; j < i && i + off < nr; j++, off++)
> +				if (l[j].from != l[i + off].from) {
> +					is_loop = false;
> +					break;
> +				}
> +			if (is_loop) {
> +				memmove(l + i, l + i + off,
> +					(nr - (i + off))
> +					* sizeof(struct branch_entry));
> +				nr -= off;
> +			}
> +		}
> +	}
> +	return nr;
> +}
> +
>  static int machine__resolve_callchain_sample(struct machine *machine,
>  					     struct thread *thread,
>  					     struct ip_callchain *chain,
> +					     struct branch_stack *branch,
>  					     struct symbol **parent,
>  					     struct addr_location *root_al,
>  					     int max_stack)
> @@ -1290,17 +1363,73 @@ static int machine__resolve_callchain_sample(struct machine *machine,
>  	int chain_nr = min(max_stack, (int)chain->nr);
>  	int i;
>  	int err;
> +	int first_call = 0;
>  
>  	callchain_cursor_reset(&callchain_cursor);
>  
> +	/*
> +	 * Add branches to call stack for easier browsing. This gives
> +	 * more context for a sample than just the callers.
> +	 *
> +	 * This uses individual histograms of paths compared to the
> +	 * aggregated histograms the normal LBR mode uses.
> +	 *
> +	 * Limitations for now:
> +	 * - No extra filters
> +	 * - No annotations (should annotate somehow)
> +	 */
> +
> +	if (branch->nr > PERF_MAX_BRANCH_DEPTH) {
> +		pr_warning("corrupted branch chain. skipping...\n");
> +		return 0;
> +	}
> +
> +	if (callchain_param.branch_callstack) {
> +		int nr = min(max_stack, (int)branch->nr);
> +		struct branch_entry be[nr];
> +
> +		for (i = 0; i < nr; i++) {
> +			if (callchain_param.order == ORDER_CALLEE) {

[PATCH 1/9] perf, tools: Support handling complete branch stacks as histograms v6

2014-05-16 Thread Andi Kleen
From: Andi Kleen 

Currently branch stacks can be only shown as edge histograms for
individual branches. I never found this display particularly useful.

This implements an alternative mode that creates histograms over complete
branch traces, instead of individual branches, similar to how normal
callgraphs are handled. This is done by putting it in
front of the normal callgraph and then using the normal callgraph
histogram infrastructure to unify them.

This way in complex functions we can understand the control flow
that led to a particular sample, and may even see some control
flow in the caller for short functions.

Example (simplified, of course for such simple code this
is usually not needed):

tcall.c:

volatile int a = 1, b = 10, c;

__attribute__((noinline)) void f2(void)
{
	c = a / b;
}

__attribute__((noinline)) void f1(void)
{
	f2();
	f2();
}
int main(void)
{
	int i;
	for (i = 0; i < 100; i++)
		f1();
}

% perf record -b -g ./tsrc/tcall
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.044 MB perf.data (~1923 samples) ]
% perf report --branch-history
...
54.91%  tcall.c:6  [.] f2  tcall
|
|--65.53%-- f2 tcall.c:5
|  |
|  |--70.83%-- f1 tcall.c:11
|  |  f1 tcall.c:10
|  |  main tcall.c:18
|  |  main tcall.c:18
|  |  main tcall.c:17
|  |  main tcall.c:17
|  |  f1 tcall.c:13
|  |  f1 tcall.c:13
|  |  f2 tcall.c:7
|  |  f2 tcall.c:5
|  |  f1 tcall.c:12
|  |  f1 tcall.c:12
|  |  f2 tcall.c:7
|  |  f2 tcall.c:5
|  |  f1 tcall.c:11
|  |
|   --29.17%-- f1 tcall.c:12
| f1 tcall.c:12
| f2 tcall.c:7
| f2 tcall.c:5
| f1 tcall.c:11
| f1 tcall.c:10
| main tcall.c:18
| main tcall.c:18
| main tcall.c:17
| main tcall.c:17
| f1 tcall.c:13
| f1 tcall.c:13
| f2 tcall.c:7
| f2 tcall.c:5
| f1 tcall.c:12

The default output is unchanged.

This is only implemented in perf report, no change to record
or anywhere else.

This adds the basic code to report:
- add a new "branch" option to the -g option parser to enable this mode
- when the flag is set include the LBR into the callstack in machine.c.
The rest of the history code is unchanged and doesn't know the difference
between LBR entry and normal call entry.
- detect overlaps with the callchain
- remove small loop duplicates in the LBR

Current limitations:
- The LBR flags (mispredict etc.) are not shown in the history
and LBR entries have no special marker.
- It would be nice if annotate marked the LBR entries somehow
(e.g. with arrows)

v2: Various fixes.
v3: Merge further patches into this one. Fix white space.
v4: Improve manpage. Address review feedback.
v5: Rename functions. Better error message without -g. Fix crash without
-b.
v6: Rebase
Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-report.txt |   7 +-
 tools/perf/builtin-report.c              |   4 +-
 tools/perf/util/callchain.c              |  11 ++-
 tools/perf/util/callchain.h              |   1 +
 tools/perf/util/machine.c                | 159 +++
 tools/perf/util/symbol.h                 |   3 +-
 6 files changed, 158 insertions(+), 27 deletions(-)

diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 09af662..4f0f3d9 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -124,7 +124,7 @@ OPTIONS
 --dump-raw-trace::
 Dump raw trace in ASCII.
 
--g [type,min[,limit],order[,key]]::
+-g [type,min[,limit],order[,key][,branch]]::
 --call-graph::
 Display call chains using type, min percent threshold, optional print
limit and order.
@@ -142,6 +142,11 @@ OPTIONS
- function: compare on functions
- address: compare on individual code addresses
 
+   branch can be:
+   - branch: include last branch information in callgraph
+   when available. Usually more convenient to use --branch-history
+   for this.
+
Default: fractal,0.5,callee,function.
 
 --max-stack::
diff --git 
