Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Tue, Mar 23, 2021 at 09:39:24PM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 23, 2021 at 12:39:13PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> > > > I suspect the vast majority of the time is spent calling
> > > > alloc_pages_node() 1024 times.  Have you looked at Mel's patch
> > > > to do ... well, exactly what vmalloc() wants?
> > > >
> > > - __vmalloc_node_range
> > >    - 45.25% __alloc_pages_nodemask
> > >       - 37.59% get_page_from_freelist
> > [...]
> > >    - 44.61% 0xc047348d
> > >       - __vunmap
> > >          - 35.56% free_unref_page
> >
> > Hmm!  I hadn't been thinking about the free side of things.
> > Does this make a difference?
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4f5f8c907897..61d5b769fea0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
> >  	vm_remove_mappings(area, deallocate_pages);
> >  
> >  	if (deallocate_pages) {
> > -		int i;
> > -
> > -		for (i = 0; i < area->nr_pages; i++) {
> > -			struct page *page = area->pages[i];
> > -
> > -			BUG_ON(!page);
> > -			__free_pages(page, 0);
> > -		}
> > +		release_pages(area->pages, area->nr_pages);
> >  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > -
> >  		kvfree(area->pages);
> >  	}
> >
> Same test. 4MB allocation on a single CPU:
>
> default: loops: 100 avg: 93601889 usec
> patch:   loops: 100 avg: 98217904 usec
>
> - __vunmap
>    - 41.17% free_unref_page
>       - 28.42% free_pcppages_bulk
>          - 6.38% __mod_zone_page_state
>               4.79% check_preemption_disabled
>            2.63% __list_del_entry_valid
>            2.63% __list_add_valid
>       - 7.50% free_unref_page_commit
>            2.15% check_preemption_disabled
>            2.01% __list_add_valid
>         2.31% free_unref_page_prepare.part.86
>         0.70% free_pcp_prepare
>
> - __vunmap
>    - 45.36% release_pages
>       - 37.70% free_unref_page_list
>          - 24.70% free_pcppages_bulk
>             - 5.42% __mod_zone_page_state
>                  4.23% check_preemption_disabled
>               2.31% __list_add_valid
>               2.07% __list_del_entry_valid
>          - 7.58% free_unref_page_commit
>               2.47% check_preemption_disabled
>               1.75% __list_add_valid
>            3.43% free_unref_page_prepare.part.86
>       - 2.39% mem_cgroup_uncharge_list
>            uncharge_page
>
> It is obvious that the default version is slightly better. It requires
> fewer things to be done compared with the release_pages() variant.
>
> > release_pages does a bunch of checks that are unnecessary ... we could
> > probably just do:
> >
> > 	LIST_HEAD(pages_to_free);
> >
> > 	for (i = 0; i < area->nr_pages; i++) {
> > 		struct page *page = area->pages[i];
> > 		if (put_page_testzero(page))
> > 			list_add(&page->lru, &pages_to_free);
> > 	}
> > 	free_unref_page_list(&pages_to_free);
> >
> > but let's see if the provided interface gets us the performance we want.
> >
> I will test it tomorrow. From the first glance it looks like a lighter
> version :)
>
Here we go:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f5f8c907897..349024768ba6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2254,6 +2254,7 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 static void __vunmap(const void *addr, int deallocate_pages)
 {
 	struct vm_struct *area;
+	LIST_HEAD(pages_to_free);
 
 	if (!addr)
 		return;
@@ -2282,11 +2283,12 @@ static void __vunmap(const void *addr, int deallocate_pages)
 		for (i = 0; i < area->nr_pages; i++) {
 			struct page *page = area->pages[i];
 
-			BUG_ON(!page);
-			__free_pages(page, 0);
+			if (put_page_testzero(page))
+				list_add(&page->lru, &pages_to_free);
 		}
-		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
 
+		free_unref_page_list(&pages_to_free);
+		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
 		kvfree(area->pages);
 	}

# patch

4MB allocation, single cpu, loops: 100 avg: 89065758 usec
4MB allocation, single cpu, loops: 100 avg: 90258523 usec
4MB allocation, single cpu, loops:
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Tue, Mar 23, 2021 at 02:07:22PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 02:39:48PM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 23, 2021 at 12:39:13PM +0000, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> > > > > I suspect the vast majority of the time is spent calling
> > > > > alloc_pages_node() 1024 times.  Have you looked at Mel's patch
> > > > > to do ... well, exactly what vmalloc() wants?
> > > > >
> > > > - __vmalloc_node_range
> > > >    - 45.25% __alloc_pages_nodemask
> > > >       - 37.59% get_page_from_freelist
> > > [...]
> > > >    - 44.61% 0xc047348d
> > > >       - __vunmap
> > > >          - 35.56% free_unref_page
> > >
> > > Hmm!  I hadn't been thinking about the free side of things.
> > > Does this make a difference?
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 4f5f8c907897..61d5b769fea0 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
> > >  	vm_remove_mappings(area, deallocate_pages);
> > >  
> > >  	if (deallocate_pages) {
> > > -		int i;
> > > -
> > > -		for (i = 0; i < area->nr_pages; i++) {
> > > -			struct page *page = area->pages[i];
> > > -
> > > -			BUG_ON(!page);
> > > -			__free_pages(page, 0);
> > > -		}
> > > +		release_pages(area->pages, area->nr_pages);
> > >  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > > -
> > >  		kvfree(area->pages);
> > >  	}
> > >
> > Will check it today!
> >
> > > release_pages does a bunch of checks that are unnecessary ... we could
> > > probably just do:
> > >
> > > 	LIST_HEAD(pages_to_free);
> > >
> > > 	for (i = 0; i < area->nr_pages; i++) {
> > > 		struct page *page = area->pages[i];
> > > 		if (put_page_testzero(page))
> > > 			list_add(&page->lru, &pages_to_free);
> > > 	}
> > > 	free_unref_page_list(&pages_to_free);
> > >
> > > but let's see if the provided interface gets us the performance we want.
> > >
> > > > Reviewed-by: Uladzislau Rezki (Sony)
> > > >
> > > > Thanks!
> > >
> > > Thank you!
> > You are welcome. A small nit:
> >
> >   CC      mm/vmalloc.o
> > mm/vmalloc.c: In function ‘__vmalloc_area_node’:
> > mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ makes integer from pointer without a cast [-Wint-conversion]
> >    area->caller);
> >    ^~~~
> > In file included from mm/vmalloc.c:12:
> > ./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument is of type ‘const void *’
> >  void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,
>
> Oh, thank you!  I confused myself by changing the type halfway through.
> vmalloc() uses void * to match __builtin_return_address while most
> of the rest of the kernel uses unsigned long to match _RET_IP_.
> I'll submit another patch to convert vmalloc to use _RET_IP_.
>
Thanks!

> > As for the bulk-array interface. I have checked the:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
> > mm-bulk-rebase-v6r2
> >
> > applied the patch that is in question + below one:
> >
> > @@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  	area->pages = pages;
> >  	area->nr_pages = nr_pages;
> >  
> > -	for (i = 0; i < area->nr_pages; i++) {
> > -		struct page *page;
> > -
> > -		if (node == NUMA_NO_NODE)
> > -			page = alloc_page(gfp_mask);
> > -		else
> > -			page = alloc_pages_node(node, gfp_mask, 0);
> > -
> > -		if (unlikely(!page)) {
> > -			/* Successfully allocated i pages, free them in __vfree() */
> > -			area->nr_pages = i;
> > -			atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > -			goto fail;
> > -		}
> > -		area->pages[i] = page;
> > -		if (gfpflags_allow_blocking(gfp_mask))
> > -			cond_resched();
> > +	ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
> > +	if (ret == nr_pages)
> > +		atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > +	else {
> > +		area->nr_pages = ret;
> > +		goto fail;
> >  	}
> > -	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> >
> > single CPU, 4MB allocation, 100 avg: 70639437 usec
> > single CPU, 4MB allocation, 100 avg: 89218654 usec
> >
> > and now we get ~21% delta. That is very good :)
>
> Amazing!  That's great news for
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Tue, Mar 23, 2021 at 12:39:13PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> > > I suspect the vast majority of the time is spent calling
> > > alloc_pages_node() 1024 times.  Have you looked at Mel's patch
> > > to do ... well, exactly what vmalloc() wants?
> > >
> > - __vmalloc_node_range
> >    - 45.25% __alloc_pages_nodemask
> >       - 37.59% get_page_from_freelist
> [...]
> >    - 44.61% 0xc047348d
> >       - __vunmap
> >          - 35.56% free_unref_page
>
> Hmm!  I hadn't been thinking about the free side of things.
> Does this make a difference?
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4f5f8c907897..61d5b769fea0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
>  	vm_remove_mappings(area, deallocate_pages);
>  
>  	if (deallocate_pages) {
> -		int i;
> -
> -		for (i = 0; i < area->nr_pages; i++) {
> -			struct page *page = area->pages[i];
> -
> -			BUG_ON(!page);
> -			__free_pages(page, 0);
> -		}
> +		release_pages(area->pages, area->nr_pages);
>  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> -
>  		kvfree(area->pages);
>  	}
>
Same test. 4MB allocation on a single CPU:

default: loops: 100 avg: 93601889 usec
patch:   loops: 100 avg: 98217904 usec

- __vunmap
   - 41.17% free_unref_page
      - 28.42% free_pcppages_bulk
         - 6.38% __mod_zone_page_state
              4.79% check_preemption_disabled
           2.63% __list_del_entry_valid
           2.63% __list_add_valid
      - 7.50% free_unref_page_commit
           2.15% check_preemption_disabled
           2.01% __list_add_valid
        2.31% free_unref_page_prepare.part.86
        0.70% free_pcp_prepare

- __vunmap
   - 45.36% release_pages
      - 37.70% free_unref_page_list
         - 24.70% free_pcppages_bulk
            - 5.42% __mod_zone_page_state
                 4.23% check_preemption_disabled
              2.31% __list_add_valid
              2.07% __list_del_entry_valid
         - 7.58% free_unref_page_commit
              2.47% check_preemption_disabled
              1.75% __list_add_valid
           3.43% free_unref_page_prepare.part.86
      - 2.39% mem_cgroup_uncharge_list
           uncharge_page

It is obvious that the default version is slightly better. It requires
fewer things to be done compared with the release_pages() variant.

> release_pages does a bunch of checks that are unnecessary ... we could
> probably just do:
>
> 	LIST_HEAD(pages_to_free);
>
> 	for (i = 0; i < area->nr_pages; i++) {
> 		struct page *page = area->pages[i];
> 		if (put_page_testzero(page))
> 			list_add(&page->lru, &pages_to_free);
> 	}
> 	free_unref_page_list(&pages_to_free);
>
> but let's see if the provided interface gets us the performance we want.
>
I will test it tomorrow. From the first glance it looks like a lighter
version :)

--
Vlad Rezki
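The batched-free pattern quoted above can be illustrated with a small userspace analogue (an illustration only, not kernel code; `struct obj`, `put_obj_testzero` and `free_obj_list` are made-up names standing in for `struct page`, `put_page_testzero` and `free_unref_page_list`): drop a reference on each object, thread the ones that hit zero onto an intrusive list, then free the whole list in one pass instead of freeing one object at a time.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for struct page: a refcount plus an intrusive
 * list link (playing the role of page->lru). */
struct obj {
	int refcount;
	struct obj *next;
};

/* Analogue of put_page_testzero(): drop one reference and report
 * whether it was the last one. */
static int put_obj_testzero(struct obj *o)
{
	return --o->refcount == 0;
}

/* Analogue of free_unref_page_list(): free every object on the list
 * in a single pass; returns how many were freed. */
static size_t free_obj_list(struct obj *head)
{
	size_t freed = 0;
	while (head) {
		struct obj *next = head->next;
		free(head);
		head = next;
		freed++;
	}
	return freed;
}

/* The pattern from the sketch above: collect zero-refcount objects on
 * a local list, then do one bulk free instead of n individual frees. */
size_t batched_free(struct obj **objs, size_t n)
{
	struct obj *to_free = NULL;

	for (size_t i = 0; i < n; i++) {
		if (put_obj_testzero(objs[i])) {
			objs[i]->next = to_free;	/* list_add() analogue */
			to_free = objs[i];
		}
	}
	return free_obj_list(to_free);
}
```

The point of the shape, as in the kernel sketch, is that per-object work is reduced to a refcount decrement and a list link; all the expensive release work happens once, over the whole batch.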
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Tue, Mar 23, 2021 at 02:39:48PM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 23, 2021 at 12:39:13PM +0000, Matthew Wilcox wrote:
> > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> > > > I suspect the vast majority of the time is spent calling
> > > > alloc_pages_node() 1024 times.  Have you looked at Mel's patch
> > > > to do ... well, exactly what vmalloc() wants?
> > > >
> > > - __vmalloc_node_range
> > >    - 45.25% __alloc_pages_nodemask
> > >       - 37.59% get_page_from_freelist
> > [...]
> > >    - 44.61% 0xc047348d
> > >       - __vunmap
> > >          - 35.56% free_unref_page
> >
> > Hmm!  I hadn't been thinking about the free side of things.
> > Does this make a difference?
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4f5f8c907897..61d5b769fea0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
> >  	vm_remove_mappings(area, deallocate_pages);
> >  
> >  	if (deallocate_pages) {
> > -		int i;
> > -
> > -		for (i = 0; i < area->nr_pages; i++) {
> > -			struct page *page = area->pages[i];
> > -
> > -			BUG_ON(!page);
> > -			__free_pages(page, 0);
> > -		}
> > +		release_pages(area->pages, area->nr_pages);
> >  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > -
> >  		kvfree(area->pages);
> >  	}
> >
> Will check it today!
>
> > release_pages does a bunch of checks that are unnecessary ... we could
> > probably just do:
> >
> > 	LIST_HEAD(pages_to_free);
> >
> > 	for (i = 0; i < area->nr_pages; i++) {
> > 		struct page *page = area->pages[i];
> > 		if (put_page_testzero(page))
> > 			list_add(&page->lru, &pages_to_free);
> > 	}
> > 	free_unref_page_list(&pages_to_free);
> >
> > but let's see if the provided interface gets us the performance we want.
> >
> > > Reviewed-by: Uladzislau Rezki (Sony)
> > >
> > > Thanks!
> >
> > Thank you!
> You are welcome. A small nit:
>
>   CC      mm/vmalloc.o
> mm/vmalloc.c: In function ‘__vmalloc_area_node’:
> mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ makes integer from pointer without a cast [-Wint-conversion]
>    area->caller);
>    ^~~~
> In file included from mm/vmalloc.c:12:
> ./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument is of type ‘const void *’
>  void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,

Oh, thank you!  I confused myself by changing the type halfway through.
vmalloc() uses void * to match __builtin_return_address while most
of the rest of the kernel uses unsigned long to match _RET_IP_.
I'll submit another patch to convert vmalloc to use _RET_IP_.

> As for the bulk-array interface. I have checked the:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
> mm-bulk-rebase-v6r2
>
> applied the patch that is in question + below one:
>
> @@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	area->pages = pages;
>  	area->nr_pages = nr_pages;
>  
> -	for (i = 0; i < area->nr_pages; i++) {
> -		struct page *page;
> -
> -		if (node == NUMA_NO_NODE)
> -			page = alloc_page(gfp_mask);
> -		else
> -			page = alloc_pages_node(node, gfp_mask, 0);
> -
> -		if (unlikely(!page)) {
> -			/* Successfully allocated i pages, free them in __vfree() */
> -			area->nr_pages = i;
> -			atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> -			goto fail;
> -		}
> -		area->pages[i] = page;
> -		if (gfpflags_allow_blocking(gfp_mask))
> -			cond_resched();
> +	ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
> +	if (ret == nr_pages)
> +		atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> +	else {
> +		area->nr_pages = ret;
> +		goto fail;
>  	}
> -	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
>
> single CPU, 4MB allocation, 100 avg: 70639437 usec
> single CPU, 4MB allocation, 100 avg: 89218654 usec
>
> and now we get ~21% delta. That is very good :)

Amazing!
That's great news for Mel's patch as well as the kvmalloc change. (there's an entirely separate issue that they really shouldn't be allocating 4MB of memory, but we can at least make what we have faster).
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Tue, Mar 23, 2021 at 12:39:13PM +0000, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> > > I suspect the vast majority of the time is spent calling
> > > alloc_pages_node() 1024 times.  Have you looked at Mel's patch
> > > to do ... well, exactly what vmalloc() wants?
> > >
> > - __vmalloc_node_range
> >    - 45.25% __alloc_pages_nodemask
> >       - 37.59% get_page_from_freelist
> [...]
> >    - 44.61% 0xc047348d
> >       - __vunmap
> >          - 35.56% free_unref_page
>
> Hmm!  I hadn't been thinking about the free side of things.
> Does this make a difference?
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4f5f8c907897..61d5b769fea0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
>  	vm_remove_mappings(area, deallocate_pages);
>  
>  	if (deallocate_pages) {
> -		int i;
> -
> -		for (i = 0; i < area->nr_pages; i++) {
> -			struct page *page = area->pages[i];
> -
> -			BUG_ON(!page);
> -			__free_pages(page, 0);
> -		}
> +		release_pages(area->pages, area->nr_pages);
>  		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> -
>  		kvfree(area->pages);
>  	}
>
Will check it today!

> release_pages does a bunch of checks that are unnecessary ... we could
> probably just do:
>
> 	LIST_HEAD(pages_to_free);
>
> 	for (i = 0; i < area->nr_pages; i++) {
> 		struct page *page = area->pages[i];
> 		if (put_page_testzero(page))
> 			list_add(&page->lru, &pages_to_free);
> 	}
> 	free_unref_page_list(&pages_to_free);
>
> but let's see if the provided interface gets us the performance we want.
>
> > Reviewed-by: Uladzislau Rezki (Sony)
> >
> > Thanks!
>
> Thank you!
You are welcome. A small nit:

  CC      mm/vmalloc.o
mm/vmalloc.c: In function ‘__vmalloc_area_node’:
mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ makes integer from pointer without a cast [-Wint-conversion]
   area->caller);
   ^~~~
In file included from mm/vmalloc.c:12:
./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument is of type ‘const void *’
 void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8a202ba263f6..ee6fa44983bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2489,7 +2489,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
 	/* Please note that the recursion is strictly bounded. */
 	pages = kvmalloc_node_caller(array_size, nested_gfp, node,
-					area->caller);
+					(unsigned long) area->caller);
 	if (!pages) {
 		free_vm_area(area);
 		return NULL;

As for the bulk-array interface. I have checked the:

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
mm-bulk-rebase-v6r2

applied the patch that is in question + below one:

@@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	area->pages = pages;
 	area->nr_pages = nr_pages;
 
-	for (i = 0; i < area->nr_pages; i++) {
-		struct page *page;
-
-		if (node == NUMA_NO_NODE)
-			page = alloc_page(gfp_mask);
-		else
-			page = alloc_pages_node(node, gfp_mask, 0);
-
-		if (unlikely(!page)) {
-			/* Successfully allocated i pages, free them in __vfree() */
-			area->nr_pages = i;
-			atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
-			goto fail;
-		}
-		area->pages[i] = page;
-		if (gfpflags_allow_blocking(gfp_mask))
-			cond_resched();
+	ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
+	if (ret == nr_pages)
+		atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
+	else {
+		area->nr_pages = ret;
+		goto fail;
 	}
-	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);

single CPU, 4MB allocation, 100 avg: 70639437 usec
single CPU, 4MB allocation, 100 avg: 89218654 usec

and now we get ~21% delta. That is very good :)

--
Vlad Rezki
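The ~21% figure follows directly from the two averages above; a quick arithmetic check (an illustration only, not part of the patch):

```c
#include <assert.h>
#include <stdio.h>

/* Relative improvement between two benchmark averages:
 * (default - patched) / default, in percent. */
static double improvement_pct(double patched_usec, double default_usec)
{
	return (default_usec - patched_usec) * 100.0 / default_usec;
}
```

Plugging in the numbers quoted above: (89218654 - 70639437) / 89218654 = 0.208..., i.e. roughly the 21% delta reported; the earlier kvmalloc-only numbers (85293854 vs. 89275857) give the ~4.5% mentioned upthread the same way.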
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> > I suspect the vast majority of the time is spent calling alloc_pages_node()
> > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > vmalloc() wants?
> >
> - __vmalloc_node_range
>    - 45.25% __alloc_pages_nodemask
>       - 37.59% get_page_from_freelist
[...]
>    - 44.61% 0xc047348d
>       - __vunmap
>          - 35.56% free_unref_page

Hmm!  I hadn't been thinking about the free side of things.
Does this make a difference?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f5f8c907897..61d5b769fea0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
 	vm_remove_mappings(area, deallocate_pages);
 
 	if (deallocate_pages) {
-		int i;
-
-		for (i = 0; i < area->nr_pages; i++) {
-			struct page *page = area->pages[i];
-
-			BUG_ON(!page);
-			__free_pages(page, 0);
-		}
+		release_pages(area->pages, area->nr_pages);
 		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
-
 		kvfree(area->pages);
 	}

release_pages does a bunch of checks that are unnecessary ... we could
probably just do:

	LIST_HEAD(pages_to_free);

	for (i = 0; i < area->nr_pages; i++) {
		struct page *page = area->pages[i];
		if (put_page_testzero(page))
			list_add(&page->lru, &pages_to_free);
	}
	free_unref_page_list(&pages_to_free);

but let's see if the provided interface gets us the performance we want.

> Reviewed-by: Uladzislau Rezki (Sony)
>
> Thanks!

Thank you!
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Mon, Mar 22, 2021 at 11:03:11PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 11:36:19PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 07:38:20PM +0000, Matthew Wilcox (Oracle) wrote:
> > > If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> > > (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> > > by a kmalloc (which is significantly faster). Instead of changing this
> > > open-coded implementation, just use kvmalloc().
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle)
> > > ---
> > >  mm/vmalloc.c | 7 +------
> > >  1 file changed, 1 insertion(+), 6 deletions(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 96444d64129a..32b640a84250 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >  		gfp_mask |= __GFP_HIGHMEM;
> > >  
> > >  	/* Please note that the recursion is strictly bounded. */
> > > -	if (array_size > PAGE_SIZE) {
> > > -		pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> > > +	pages = kvmalloc_node_caller(array_size, nested_gfp, node,
> > >  					area->caller);
> > > -	} else {
> > > -		pages = kmalloc_node(array_size, nested_gfp, node);
> > > -	}
> > > -
> > >  	if (!pages) {
> > >  		free_vm_area(area);
> > >  		return NULL;
> > > --
> > > 2.30.2
> > Makes sense to me. Though I expected a bigger difference:
> >
> > # patch
> > single CPU, 4MB allocation, loops: 100 avg: 85293854 usec
> >
> > # default
> > single CPU, 4MB allocation, loops: 100 avg: 89275857 usec
>
> Well, 4.5% isn't something to leave on the table ... but yeah, I was
> expecting more in the 10-20% range.  It may be more significant if
> there's contention on the spinlocks (like if this crazy ksmbd is calling
> vmalloc(4MB) on multiple nodes simultaneously).
>
Yep, it can be that simultaneous allocations will show even bigger
improvements because of lock contention. Anyway, there is an advantage
in switching to SLAB - 5% is also a win :)

> I suspect the vast majority of the time is spent calling alloc_pages_node()
> 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> vmalloc() wants?
>
- 97.37% 0.00% vmalloc_test/0 [kernel.vmlinux] [k] ret_from_fork
     ret_from_fork
     kthread
   - 0xc047373b
      - 52.67% 0xc047349f
           __vmalloc_node
         - __vmalloc_node_range
            - 45.25% __alloc_pages_nodemask
               - 37.59% get_page_from_freelist
                    4.34% __list_del_entry_valid
                    3.67% __list_add_valid
                    1.52% prep_new_page
                    1.20% check_preemption_disabled
              3.75% map_kernel_range_noflush
            - 0.64%
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Mon, Mar 22, 2021 at 11:36:19PM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 22, 2021 at 07:38:20PM +0000, Matthew Wilcox (Oracle) wrote:
> > If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> > (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> > by a kmalloc (which is significantly faster). Instead of changing this
> > open-coded implementation, just use kvmalloc().
> >
> > Signed-off-by: Matthew Wilcox (Oracle)
> > ---
> >  mm/vmalloc.c | 7 +------
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 96444d64129a..32b640a84250 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  		gfp_mask |= __GFP_HIGHMEM;
> >  
> >  	/* Please note that the recursion is strictly bounded. */
> > -	if (array_size > PAGE_SIZE) {
> > -		pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> > +	pages = kvmalloc_node_caller(array_size, nested_gfp, node,
> >  					area->caller);
> > -	} else {
> > -		pages = kmalloc_node(array_size, nested_gfp, node);
> > -	}
> > -
> >  	if (!pages) {
> >  		free_vm_area(area);
> >  		return NULL;
> > --
> > 2.30.2
> Makes sense to me. Though I expected a bigger difference:
>
> # patch
> single CPU, 4MB allocation, loops: 100 avg: 85293854 usec
>
> # default
> single CPU, 4MB allocation, loops: 100 avg: 89275857 usec

Well, 4.5% isn't something to leave on the table ... but yeah, I was
expecting more in the 10-20% range.  It may be more significant if
there's contention on the spinlocks (like if this crazy ksmbd is calling
vmalloc(4MB) on multiple nodes simultaneously).

I suspect the vast majority of the time is spent calling alloc_pages_node()
1024 times.  Have you looked at Mel's patch to do ... well, exactly what
vmalloc() wants?

https://lore.kernel.org/linux-mm/20210322091845.16437-1-mgor...@techsingularity.net/

> One question. Should we care much about fragmentation? I mean
> with the patch, allocations > 2MB will do a request to SLAB bigger
> than PAGE_SIZE.

We're pretty good about allocating memory in larger chunks these days.
Looking at my laptop's slabinfo,

kmalloc-8k           219    232   8192    4    8 : tunables    0    0    0 : slabdata     58     58      0

That's using 8 pages per slab, so that's order-3 allocations.  There's
a few more of those:

$ sudo grep '8 :' /proc/slabinfo | wc
     42     672    4508

so I have confidence that kvmalloc() will manage to use kmalloc up to
16MB vmalloc allocations, and after that it'll tend to fall back to
vmalloc.
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Mon, Mar 22, 2021 at 07:38:20PM +0000, Matthew Wilcox (Oracle) wrote:
> If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> by a kmalloc (which is significantly faster). Instead of changing this
> open-coded implementation, just use kvmalloc().
>
> Signed-off-by: Matthew Wilcox (Oracle)
> ---
>  mm/vmalloc.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 96444d64129a..32b640a84250 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  		gfp_mask |= __GFP_HIGHMEM;
>  
>  	/* Please note that the recursion is strictly bounded. */
> -	if (array_size > PAGE_SIZE) {
> -		pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> +	pages = kvmalloc_node_caller(array_size, nested_gfp, node,
>  					area->caller);
> -	} else {
> -		pages = kmalloc_node(array_size, nested_gfp, node);
> -	}
> -
>  	if (!pages) {
>  		free_vm_area(area);
>  		return NULL;
> --
> 2.30.2

Makes sense to me. Though I expected a bigger difference:

# patch
single CPU, 4MB allocation, loops: 100 avg: 85293854 usec

# default
single CPU, 4MB allocation, loops: 100 avg: 89275857 usec

One question. Should we care much about fragmentation? I mean with the
patch, allocations > 2MB will do a request to SLAB bigger than PAGE_SIZE.

Thanks!

--
Vlad Rezki
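The 8KiB figure from the commit message, and the 2MB threshold raised in the fragmentation question, both fall out of the same arithmetic; a small userspace check (an illustration only; 4KiB pages and 8-byte pointers assumed, `table_size` is a made-up helper, not a kernel function):

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Size of the page-pointer table vmalloc keeps for an allocation of
 * the given size: one 8-byte pointer per 4KiB page. */
static unsigned long table_size(unsigned long alloc_size)
{
	unsigned long nr_pages = alloc_size / PAGE_SIZE;

	return nr_pages * sizeof(void *);
}
```

A 4MB allocation needs 1024 pointers * 8 bytes = an 8KiB table, as the commit message says; at exactly 2MB the table is 512 * 8 = 4096 bytes, so any allocation above 2MB pushes the table past PAGE_SIZE, which is where the fragmentation question above kicks in.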