Hi Ingo,

On Thu, Mar 12, 2015 at 11:41:19AM +0100, Ingo Molnar wrote:
> * Namhyung Kim <namhy...@kernel.org> wrote:
> 
> > Hello,
> > 
> > Currently the perf kmem command only analyzes SLAB memory allocation,
> > and I'd like to introduce page allocation analysis as well.  Users can
> > use the --slab and/or --page options to select it.  If neither option
> > is used, it does slab allocation analysis for backward compatibility.
> > 
> > Patches 1-3 are bugfixes and cleanups.  Patch 4 implements basic
> > support for page allocation analysis, patch 5 deals with the callsite
> > and finally patch 6 implements sorting.
> > 
> > In this patchset, I used two kmem events: kmem:mm_page_alloc and
> > kmem:mm_page_free for analysis as they can track every memory
> > allocation/free path AFAIK.  However, unlike slab tracepoint events,
> > those page allocation events don't provide callsite info directly.  So
> > I recorded callchains and extracted callsites like below:
> 
> Really cool features!

Thanks!


> 
> I have a couple of output typography observations:
> 
> > Normal page allocation callchains look like this:
> > 
> >   360a7e __alloc_pages_nodemask
> >   3a711c alloc_pages_current
> >   357bc7 __page_cache_alloc   <-- callsite
> >   357cf6 pagecache_get_page
> >    48b0a prepare_pages
> >    494d3 __btrfs_buffered_write
> >    49cdf btrfs_file_write_iter
> >   3ceb6e new_sync_write
> >   3cf447 vfs_write
> >   3cff99 sys_write
> >   7556e9 system_call
> >     f880 __write_nocancel
> >    33eb9 cmd_record
> >    4b38e cmd_kmem
> >    7aa23 run_builtin
> >    27a9a main
> >    20800 __libc_start_main
> > 
> > But the first two are internal page allocation functions, so they should
> > be skipped.  To determine such allocation functions, I used the following regex:
> > 
> >   ^_?_?(alloc|get_free|get_zeroed)_pages?
> > 
> > This gave me the following list of functions (you can see this with -v):
> > 
> >   alloc func: __get_free_pages
> >   alloc func: get_zeroed_page
> >   alloc func: alloc_pages_exact
> >   alloc func: __alloc_pages_direct_compact
> >   alloc func: __alloc_pages_nodemask
> >   alloc func: alloc_page_interleave
> >   alloc func: alloc_pages_current
> >   alloc func: alloc_pages_vma
> >   alloc func: alloc_page_buffers
> >   alloc func: alloc_pages_exact_nid
> > 
> > After skipping those functions, the callsite is '__page_cache_alloc'.
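
For reference, the skipping step quoted above can be sketched as below.  This is only an illustration in Python, not the perf code: find_callsite() is a hypothetical helper, and I assume the callchain is already resolved to symbol names with the innermost frame first.

```python
import re

# Regex quoted above; it matches the internal page allocator entry points.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")

def find_callsite(callchain):
    """Return the first symbol that is not an internal allocation function.

    `callchain` is a list of symbol names, innermost frame first
    (an assumed input shape, chosen for illustration).
    """
    for symbol in callchain:
        if not ALLOC_FUNC_RE.match(symbol):
            return symbol
    return None  # every frame looked like an allocator function

# First few frames of the callchain shown above.
chain = ["__alloc_pages_nodemask", "alloc_pages_current",
         "__page_cache_alloc", "pagecache_get_page", "prepare_pages"]
print(find_callsite(chain))
```
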
> > 
> > Other information such as allocation order, migration type and gfp
> > flags are provided by tracepoint events.
> > 
> > Basically the output will be sorted by total allocation bytes, but you
> > can change it by using -s/--sort option.  The following sort keys are
> > added to support page analysis: page, order, mtype, gfp.  Existing
> > 'callsite', 'bytes' and 'hit' sort keys also can be used.
> > 
> > An example follows:
> > 
> >   # perf kmem record --slab --page sleep 1
> >   [ perf record: Woken up 0 times to write data ]
> >   [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
> > 
> >   # perf kmem stat --page --caller -l 10 -s order,hit
> > 
> >   --------------------------------------------------------------------------------------------
> >    Total_alloc/Per | Hit      | Order | Migrate type | GFP flag | Callsite
> 
> s/Per/Size
> s/Hit/Hits
> s/Migrate type/Migration type
> s/GFP flag/GFP flags
> 
> ?

OK, will change.  (They'll take up a bit more column space, though.)


> 
> >   --------------------------------------------------------------------------------------------
> >        65536/16384 |        4 |     2 |  RECLAIMABLE | 00285250 | new_slab
> >     51347456/4096  |    12536 |     0 |      MOVABLE | 0102005a | __page_cache_alloc
> >        53248/4096  |       13 |     0 |    UNMOVABLE | 002084d0 | pte_alloc_one
> >        40960/4096  |       10 |     0 |      MOVABLE | 000280da | handle_mm_fault
> >        28672/4096  |        7 |     0 |    UNMOVABLE | 000000d0 | __pollwait
> >        20480/4096  |        5 |     0 |      MOVABLE | 000200da | do_wp_page
> >        20480/4096  |        5 |     0 |      MOVABLE | 000200da | do_cow_fault
> >        16384/4096  |        4 |     0 |    UNMOVABLE | 00000200 | __tlb_remove_page
> >        16384/4096  |        4 |     0 |    UNMOVABLE | 000084d0 | __pmd_alloc
> >         8192/4096  |        2 |     0 |    UNMOVABLE | 000084d0 | __pud_alloc
> >    ...             | ...      | ...   | ...          | ...      | ...
> >   --------------------------------------------------------------------------------------------
> > 
> >   SUMMARY (page allocator)
> >   ========================
> >   Total alloc requested: 12593
> >   Total alloc failure  : 0
> >   Total bytes allocated: 51630080
> >   Total free  requested: 115
> >   Total free  unmatched: 67
> >   Total bytes freed    : 471040
> 
> I'd suggest the following changes to the format:
> 
>   - Collapse stats into 3 groups: 'allocated+freed', 'allocated only', 
>     'freed only', depending on how much of their lifetime we've 
>     managed to trace. These groups are really distinct and it makes 
>     little sense to mix up their stats.

Good idea.  Actually I'm thinking about a new option that shows only
live allocations (excluding freed pages) in the table.  FYI, the current
numbers are total allocated memory (including freed pages).
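
A rough sketch of how the three groups could be computed, in Python rather than the C used in perf.  The (kind, pfn, nbytes) event tuples and the group_requests() name are assumptions for illustration only; the real code works on the tracepoint samples.

```python
def group_requests(events):
    """Split traced requests by how much of their lifetime was seen.

    `events` is an iterable of (kind, pfn, nbytes) tuples in time order,
    where kind is "alloc" or "free" (a hypothetical shape).
    Returns {group: [request count, bytes]}.
    """
    live = {}  # pfn -> nbytes for allocations not freed so far
    stats = {"allocated+freed": [0, 0],
             "allocated-only": [0, 0],
             "freed-only": [0, 0]}

    def bump(group, nbytes):
        stats[group][0] += 1
        stats[group][1] += nbytes

    for kind, pfn, nbytes in events:
        if kind == "alloc":
            if pfn in live:
                # The free in between was not traced; count the old one as live.
                bump("allocated-only", live[pfn])
            live[pfn] = nbytes
        elif pfn in live:
            bump("allocated+freed", live.pop(pfn))
        else:
            bump("freed-only", nbytes)  # allocated before tracing started

    for nbytes in live.values():
        bump("allocated-only", nbytes)
    return stats
```

The "allocated-only" group is what a live-allocation option would show.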


> 
>   - Add commas to the numbers, to make it easier to read and compare 
>     larger numbers.

OK

> 
>   - Right-align the numbers, to make them easy to compare when they
>     are placed under each other.

OK

> 
>   - Merge the 'count' and 'bytes' stats into a single line, so that 
>     it's more compact, easier to navigate, but also only comparable 
>     type numbers are placed under each other.

OK

> 
> I.e. something like this (mockup) output:
> 
>    SUMMARY (page allocator)
>    ========================
> 
>    Pages allocated+freed:       12,593   [     51,630,080 bytes ]
> 
>    Pages allocated-only:         2,342   [      1,235,010 bytes ]
>    Pages freed-only:                67   [        135,311 bytes ]
> 
>    Page allocation failures :        0

Looks a lot better!

One thing to note is that those numbers count requests, not pages.
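
Just to illustrate the formatting (commas, right alignment, count and bytes merged into one line), here is a small Python sketch.  The summary_line() helper, the column widths and the "Requests" labels are my assumptions, not the final output.

```python
def summary_line(label, count, nbytes=None):
    """Format one summary row with comma-separated, right-aligned numbers."""
    line = f"{label:<26}{count:>10,}"
    if nbytes is not None:
        # Bytes go on the same line, in their own right-aligned column.
        line += f"   [ {nbytes:>14,} bytes ]"
    return line

# Numbers taken from the mockup above, relabeled as requests.
print(summary_line("Requests allocated+freed:", 12593, 51630080))
print(summary_line("Requests allocated-only:", 2342, 1235010))
print(summary_line("Requests freed-only:", 67, 135311))
print(summary_line("Allocation failures:", 0))
```
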


> 
> 
> >   Order     UNMOVABLE   RECLAIMABLE       MOVABLE      RESERVED   CMA/ISOLATE
> >   -----  ------------  ------------  ------------  ------------  ------------
> >       0            32             0         12557             0             0
> >       1             0             0             0             0             0
> >       2             0             4             0             0             0
> >       3             0             0             0             0             0
> >       4             0             0             0             0             0
> >       5             0             0             0             0             0
> >       6             0             0             0             0             0
> >       7             0             0             0             0             0
> >       8             0             0             0             0             0
> >       9             0             0             0             0             0
> >      10             0             0             0             0             0
> 
> Here I'd suggest the following refinements:
> 
>  - Use '.' instead of '0', to make actual nonzero values stand out 
>    visually, while still keeping a tabular format

OK

> 
>  - Merge the 'Reserved', 'CMA/Isolate' columns into a single 'Special' 
>    column: this will be zero in 99.9% of the cases, as those pages 
>    mostly deal with driver interfaces, mostly used during init/deinit.

I'm not sure about the CMA pages, though.

> 
>  - Capitalize less.

OK

> 
>  - Use comma-separated numbers for better readability.

OK

> 
> So something like this:
> 
> 
>    Order     Unmovable   Reclaimable       Movable       Special
>    -----  ------------  ------------  ------------  ------------
>        0            32             .        12,557             .
>        1             .             .             .             .
>        2             .             4             .             .
>        3             .             .             .             .
>        4             .             .             .             .
>        5             .             .             .             .
>        6             .             .             .             .
>        7             .             .             .             .
>        8             .             .             .             .
>        9             .             .             .             .
>       10             .             .             .             .
> 
> 
> Look for example how easily noticeable the '4' value is now, while it 
> was pretty easy to miss in the original table.

Indeed!
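
A minimal Python sketch of that table style ('.' for zero, commas, right-aligned columns).  The cell() and order_table() helpers and the dict input are hypothetical and only meant to show the idea.

```python
def cell(value, width=12):
    """Right-align a count, showing '.' for zero so nonzero values stand out."""
    text = "." if value == 0 else f"{value:,}"
    return text.rjust(width)

def order_table(counts, columns):
    """Render the per-order table.

    `counts` maps order -> list of per-column counts (an assumed shape).
    """
    lines = ["Order  " + "  ".join(c.rjust(12) for c in columns),
             "-----  " + "  ".join("-" * 12 for _ in columns)]
    for order in sorted(counts):
        lines.append(f"{order:>5}  " + "  ".join(cell(v) for v in counts[order]))
    return "\n".join(lines)

print(order_table({0: [32, 0, 12557, 0], 1: [0, 0, 0, 0], 2: [0, 4, 0, 0]},
                  ["Unmovable", "Reclaimable", "Movable", "Special"]))
```
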

> 
> > I have some ideas on how to improve it, but I'd also like to hear other
> > ideas, suggestions, feedback and so on.
> 
> So there's one thing that would be useful: to track pages allocated on 
> one node, but freed on another. Those kinds of allocation/free 
> patterns are especially expensive and might make sense to visualize.

I think it can be done easily as slab analysis already contains the info.

Thanks for your useful feedback!
Namhyung