Hi Peter and all,

Did you get a chance to review these patches?
Zheng is away. Should I re-send the patches?  

Thanks,
Kan

> 
> For many profiling tasks we need the callgraph. For example, we often need
> to see the caller of a lock, a memcpy, or another library function to
> actually tune the program. Frame pointer unwinding is efficient and works
> well, but frame pointers are off by default on 64-bit code (and on modern
> 32-bit gccs), so there are many binaries around that do not use frame
> pointers. Profiling unchanged production code is very useful in practice.
> On some CPUs frame pointers also have a high cost. Dwarf2 unwinding also
> does not always work and is extremely slow (up to 20% overhead).
> 
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function calls
> are collected as normal, but as return instructions are executed the last
> captured branch record is popped from the on-chip LBR registers. The LBR
> call stack facility provides an alternative way to get the callgraph. It
> has some limitations too, but should work in most cases and is
> significantly faster than dwarf. Frame pointer unwinding is still the best
> default, but LBR call stack is a good alternative when nothing else works.
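
For readers new to the feature, the push-on-call / pop-on-return behaviour
described above can be modelled in a few lines. This is only a toy sketch, not
the hardware or the patch code: the class name, the `depth` parameter, and the
`helper` function are hypothetical, and the other function names are borrowed
from the bc example purely for illustration.

```python
from collections import deque


class LbrCallStack:
    """Toy model of LBR call-stack mode: calls push a record, returns pop one."""

    def __init__(self, depth=16):
        # `depth` is an illustrative parameter. deque(maxlen=...) silently
        # drops the oldest record when a push would exceed the limit.
        self.entries = deque(maxlen=depth)

    def call(self, caller, callee):
        # A call instruction records a (from, to) branch as usual.
        self.entries.append((caller, callee))

    def ret(self):
        # A return instruction pops the most recently captured record.
        if self.entries:
            self.entries.pop()

    def callchain(self):
        # Innermost function first, like a perf callchain.
        return [callee for _, callee in reversed(self.entries)]


lbr = LbrCallStack(depth=4)
lbr.call("_start", "main")
lbr.call("main", "yyparse")
lbr.call("yyparse", "helper")    # hypothetical function that returns at once
lbr.ret()                        # its record is popped again
lbr.call("yyparse", "run_code")
print(lbr.callchain())
```

Because finished calls are popped, what remains in the registers at sample
time is the chain of still-active callers rather than a plain branch history.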
> 
> When profiling bc(1) on Fedora 19:
>  echo 'scale=2000; 4*a(1)' > cmd; perf record -g fp bc -l < cmd
> 
> If this feature is enabled, perf report output looks like:
>     50.36%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
> 
>     33.66%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>                      bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
> 
>      7.62%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                     |
>                     |--99.89%-- 0x2000186a8
>                      --0.11%-- [...]
> 
>      6.83%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>                     |
>                     |--99.94%-- bc_add
>                     |          execute
>                     |          run_code
>                     |          yyparse
>                     |          main
>                     |          __libc_start_main
>                     |          _start
>                      --0.06%-- [...]
> 
>      0.46%       bc  libc-2.17.so       [.] __memset_sse2
>                  |
>                  --- __memset_sse2
>                     |
>                     |--54.13%-- bc_new_num
>                     |          |
>                     |          |--51.00%-- bc_divide
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |          |--30.46%-- _bc_do_sub
>                     |          |          bc_add
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |           --18.55%-- _bc_do_add
>                     |                     bc_add
>                     |                     execute
>                     |                     run_code
>                     |                     yyparse
>                     |                     main
>                     |                     __libc_start_main
>                     |                     _start
>                     |
>                      --45.87%-- bc_divide
>                                execute
>                                run_code
>                                yyparse
>                                main
>                                __libc_start_main
>                                _start
> 
> If this feature is disabled, perf report output looks like:
>     50.49%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
> 
>     33.57%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
> 
>      7.61%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                      0x2000186a8
> 
>      6.88%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
> 
>      0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
>                  |
>                  --- __memcpy_ssse3_back
> 
> The LBR call stack has the following known limitations:
>  - Zero length calls are not filtered out by hardware
>  - Exception handling such as setjmp/longjmp leaves calls and returns
>    mismatched
>  - Pushing a different return address onto the stack leaves calls and
>    returns mismatched
>  - If the call stack is deeper than the LBR, only the last entries are
>    captured
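
Two of the limitations above (a call stack deeper than the LBR, and
longjmp-style unwinding) can be illustrated by replaying a stream of events
against both the real stack and a bounded record list. This is a hedged
sketch only: `replay`, the event tuples, and the single-letter function names
are made up for the example and do not correspond to anything in the patches.

```python
def replay(events, depth):
    """Replay (kind, arg) events; return (real_stack, lbr_records)."""
    real, lbr = [], []
    for kind, arg in events:
        if kind == "call":
            real.append(arg)
            lbr.append(arg)
            del lbr[:-depth]          # keep only the newest `depth` records
        elif kind == "ret":
            real.pop()
            if lbr:
                lbr.pop()
        elif kind == "longjmp":
            # `arg` frames are abandoned without executing any return
            # instruction, so the recorded entries are left untouched.
            del real[len(real) - arg:]
    return real, lbr


# Deeper than the record list: the outermost callers are lost.
deep = replay([("call", f) for f in "abcde"], depth=3)

# longjmp past two frames: stale entries for f and g remain recorded.
jumped = replay(
    [("call", "main"), ("call", "f"), ("call", "g"),
     ("longjmp", 2), ("call", "h")],
    depth=16,
)
print(deep)
print(jumped)
```

In the first case the chain is truncated at the outer end; in the second the
reported chain would show h called from g and f, although both frames are
already gone.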
> 
> Changes since v1
>  - split change into more patches
>  - introduce context switch callback and use it to flush LBR
>  - use the context switch callback to save/restore LBR
>  - dynamically allocate the memory area for storing the LBR stack, always
>    switch the memory area during context switch
>  - disable this feature by default
>  - more description in change logs
> 
> Changes since v2
>  - don't use xchg to switch PMU specific data
>  - remove nr_branch_stack from struct perf_event_context
>  - simplify the save/restore LBR stack logic
>  - remove unnecessary 'has_branch_stack -> needs_branch_stack'
>    conversion
>  - more description in change logs
> 
> Changes since v3
>  - remove the sysfs attribute file that disables this feature
> 
> Changes since v4
>  - re-organize the code that saves/restores the LBR stack
>  - allocate pmu specific data when it's needed
>  - update code comments
> 
> These patches are also available at:
>  https://github.com/ukernel/linux.git perf-lbr-callstack
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/