On 10/13/2017 08:31 AM, Aaron Lu wrote:
> __rmqueue(), __rmqueue_fallback(), __rmqueue_smallest() and
> __rmqueue_cma_fallback() are all in the page allocator's hot path and
> better be finished as soon as possible. One way to make them faster
> is by making them inline. But as Andrew Morton and Andi Kleen pointed
> out:
> https://lkml.org/lkml/2017/10/10/1252
> https://lkml.org/lkml/2017/10/10/1279
> To make sure they are inlined, we should use __always_inline for them.
> 
> With the will-it-scale/page_fault1/process benchmark, when using nr_cpu
> processes to stress buddy, the results for will-it-scale.processes with
> and without the patch are:
> 
> On a 2-socket Intel-Skylake machine:
> 
>  compiler     base        head
>  gcc-4.4.7    6496131     6911823    +6.4%
>  gcc-4.9.4    7225110     7731072    +7.0%
>  gcc-5.4.1    7054224     7688146    +9.0%
>  gcc-6.2.0    7059794     7651675    +8.4%
> 
> On a 4-socket Intel-Skylake machine:
> 
>  compiler     base        head
>  gcc-4.4.7    13162890    13508193   +2.6%
>  gcc-4.9.4    14997463    15484353   +3.2%
>  gcc-5.4.1    14708711    15449805   +5.0%
>  gcc-6.2.0    14574099    15349204   +5.3%
> 
> The above 4 compilers are used because I've done the tests through Intel's
> Linux Kernel Performance (LKP) infrastructure and they are the compilers
> available there.
> 
> The benefit is smaller on the 4-socket machine because lock contention
> there (perf-profile/native_queued_spin_lock_slowpath=81%) is less severe
> than on the 2-socket machine (85%).
> 
> What the benchmark does is: it forks nr_cpu processes and then each
> process does the following:
> 1 mmap() 128M anonymous space;
> 2 writes to each page there to trigger actual page allocation;
> 3 munmap() it.
> in a loop.
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
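
So each iteration boils down to roughly this (my own untested sketch, not
the actual benchmark code - see the link above for that):

	#include <sys/mman.h>
	#include <unistd.h>

	#define MEMSIZE (128UL << 20)	/* 128M anonymous mapping */

	static void one_iteration(void)
	{
		long pagesize = sysconf(_SC_PAGESIZE);
		char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return;

		/* touch every page so each one is actually allocated */
		for (unsigned long off = 0; off < MEMSIZE; off += pagesize)
			p[off] = 1;

		munmap(p, MEMSIZE);
	}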
Are transparent hugepages enabled? If yes, __rmqueue() is called from
rmqueue(), and there's only one page fault (and __rmqueue()) per 512
"writes to each page". If not, __rmqueue() is called from rmqueue_bulk()
in bursts once the pcplists are depleted. I guess it's the latter,
otherwise I wouldn't expect a function call to have such visible overhead.

I guess what would help much more would be a bulk __rmqueue_smallest()
that grabs multiple pages from the freelists in one go (rough sketch
below, after the size numbers). But I can't argue with your numbers
against this patch.

> Binary size wise, I have locally built them with different compilers:
> 
> [aaron@aaronlu obj]$ size */*/mm/page_alloc.o
>    text    data    bss     dec     hex    filename
>   37409    9904    8524    55837   da1d   gcc-4.9.4/base/mm/page_alloc.o
>   38273    9904    8524    56701   dd7d   gcc-4.9.4/head/mm/page_alloc.o
>   37465    9840    8428    55733   d9b5   gcc-5.5.0/base/mm/page_alloc.o
>   38169    9840    8428    56437   dc75   gcc-5.5.0/head/mm/page_alloc.o
>   37573    9840    8428    55841   da21   gcc-6.4.0/base/mm/page_alloc.o
>   38261    9840    8428    56529   dcd1   gcc-6.4.0/head/mm/page_alloc.o
>   36863    9840    8428    55131   d75b   gcc-7.2.0/base/mm/page_alloc.o
>   37711    9840    8428    55979   daab   gcc-7.2.0/head/mm/page_alloc.o
> 
> Text size increased about 800 bytes for mm/page_alloc.o.

BTW, do you know about ./scripts/bloat-o-meter? :)

With gcc 7.2.1:
> ./scripts/bloat-o-meter base.o mm/page_alloc.o
add/remove: 1/2 grow/shrink: 2/0 up/down: 2493/-1649 (844)
function                        old     new   delta
get_page_from_freelist         2898    4937   +2039
steal_suitable_fallback           -     365    +365
find_suitable_fallback           31     120     +89
find_suitable_fallback.part     115       -    -115
__rmqueue                      1534       -   -1534

> [aaron@aaronlu obj]$ size */*/vmlinux
>       text      data       bss        dec       hex      filename
>   10342757   5903208   17723392   33969357   20654cd   gcc-4.9.4/base/vmlinux
>   10342757   5903208   17723392   33969357   20654cd   gcc-4.9.4/head/vmlinux
>   10332448   5836608   17715200   33884256   2050860   gcc-5.5.0/base/vmlinux
>   10332448   5836608   17715200   33884256   2050860   gcc-5.5.0/head/vmlinux
>   10094546   5836696   17715200   33646442   201676a   gcc-6.4.0/base/vmlinux
>   10094546   5836696   17715200   33646442   201676a   gcc-6.4.0/head/vmlinux
>   10018775   5828732   17715200   33562707   2002053   gcc-7.2.0/base/vmlinux
>   10018775   5828732   17715200   33562707   2002053   gcc-7.2.0/head/vmlinux
> 
> Text size for vmlinux has no change though, probably due to function
> alignment.

Yep, that's useless to show. These differences do add up though, until
they eventually cross the alignment boundary.
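
Back to the bulk idea above - something along these lines is what I had in
mind (completely untested sketch, order-0 only, no fallback/CMA handling,
no freepage accounting, names taken from the current mm/page_alloc.c):

	static unsigned int __rmqueue_smallest_bulk(struct zone *zone,
				int migratetype, unsigned int count,
				struct list_head *list)
	{
		struct free_area *area = &zone->free_area[0];
		struct page *page, *tmp;
		unsigned int taken = 0;

		/* grab up to @count order-0 pages in one freelist walk */
		list_for_each_entry_safe(page, tmp,
				&area->free_list[migratetype], lru) {
			if (taken == count)
				break;
			list_move_tail(&page->lru, list);
			rmv_page_order(page);
			area->nr_free--;
			taken++;
		}

		return taken;
	}

rmqueue_bulk() could then call this once under zone->lock instead of
looping over the full __rmqueue() path per page.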
Thanks,
Vlastimil

> 
> Signed-off-by: Aaron Lu <aaron...@intel.com>
> ---
>  mm/page_alloc.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0e309ce4a44a..0fe3e2095268 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1794,7 +1794,7 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
>   * Go through the free lists for the given migratetype and remove
>   * the smallest available page from the freelists
>   */
> -static inline
> +static __always_inline
>  struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  						int migratetype)
>  {
> @@ -1838,7 +1838,7 @@ static int fallbacks[MIGRATE_TYPES][4] = {
>  };
>  
>  #ifdef CONFIG_CMA
> -static struct page *__rmqueue_cma_fallback(struct zone *zone,
> +static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>  					unsigned int order)
>  {
>  	return __rmqueue_smallest(zone, order, MIGRATE_CMA);
> @@ -2219,7 +2219,7 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>   * deviation from the rest of this file, to make the for loop
>   * condition simpler.
>   */
> -static inline bool
> +static __always_inline bool
>  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>  {
>  	struct free_area *area;
> @@ -2291,8 +2291,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>   * Do the hard work of removing an element from the buddy allocator.
>   * Call me with the zone->lock already held.
>   */
> -static struct page *__rmqueue(struct zone *zone, unsigned int order,
> -				int migratetype)
> +static __always_inline struct page *
> +__rmqueue(struct zone *zone, unsigned int order, int migratetype)
>  {
>  	struct page *page;
> 
> 