Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-24 Thread Uladzislau Rezki
On Tue, Mar 23, 2021 at 09:39:24PM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > > I suspect the vast majority of the time is spent calling 
> > > > alloc_pages_node()
> > > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > > vmalloc() wants?
> > > > 
> > > 
> > >  - __vmalloc_node_range
> > > - 45.25% __alloc_pages_nodemask
> > >- 37.59% get_page_from_freelist
> > [...]
> > >   - 44.61% 0xc047348d
> > >  - __vunmap
> > > - 35.56% free_unref_page
> > 
> > Hmm!  I hadn't been thinking about the free side of things.
> > Does this make a difference?
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4f5f8c907897..61d5b769fea0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> > deallocate_pages)
> > vm_remove_mappings(area, deallocate_pages);
> >  
> > if (deallocate_pages) {
> > -   int i;
> > -
> > -   for (i = 0; i < area->nr_pages; i++) {
> > -   struct page *page = area->pages[i];
> > -
> > -   BUG_ON(!page);
> > -   __free_pages(page, 0);
> > -   }
> > +   release_pages(area->pages, area->nr_pages);
> > atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > -
> > kvfree(area->pages);
> > }
> >
> Same test. 4MB allocation on a single CPU:
> 
> default: loops: 100 avg: 93601889 usec
> patch:   loops: 100 avg: 98217904 usec
> 
> 
> - __vunmap
>- 41.17% free_unref_page
>   - 28.42% free_pcppages_bulk
>  - 6.38% __mod_zone_page_state
>   4.79% check_preemption_disabled
>2.63% __list_del_entry_valid
>2.63% __list_add_valid
>   - 7.50% free_unref_page_commit
>2.15% check_preemption_disabled
>2.01% __list_add_valid
> 2.31% free_unref_page_prepare.part.86
> 0.70% free_pcp_prepare
> 
> 
> 
> - __vunmap
>- 45.36% release_pages
>   - 37.70% free_unref_page_list
>  - 24.70% free_pcppages_bulk
> - 5.42% __mod_zone_page_state
>  4.23% check_preemption_disabled
>   2.31% __list_add_valid
>   2.07% __list_del_entry_valid
>  - 7.58% free_unref_page_commit
>   2.47% check_preemption_disabled
>   1.75% __list_add_valid
>3.43% free_unref_page_prepare.part.86
>   - 2.39% mem_cgroup_uncharge_list
>uncharge_page
> 
> 
> It is obvious that the default version is slightly better: it requires
> less work compared with the release_pages() variant.
> 
> > 
> > release_pages does a bunch of checks that are unnecessary ... we could
> > probably just do:
> > 
> > LIST_HEAD(pages_to_free);
> > 
> > for (i = 0; i < area->nr_pages; i++) {
> > struct page *page = area->pages[i];
> > if (put_page_testzero(page))
> > list_add(&page->lru, &pages_to_free);
> > }
> > free_unref_page_list(&pages_to_free);
> > 
> > but let's see if the provided interface gets us the performance we want.
> >  
> I will test it tomorrow. At first glance it looks like a lighter
> version :)
> 
Here we go:


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f5f8c907897..349024768ba6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2254,6 +2254,7 @@ static void vm_remove_mappings(struct vm_struct *area, 
int deallocate_pages)
 static void __vunmap(const void *addr, int deallocate_pages)
 {
struct vm_struct *area;
+   LIST_HEAD(pages_to_free);
 
if (!addr)
return;
@@ -2282,11 +2283,12 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
for (i = 0; i < area->nr_pages; i++) {
struct page *page = area->pages[i];
 
-   BUG_ON(!page);
-   __free_pages(page, 0);
+   if (put_page_testzero(page))
+   list_add(&page->lru, &pages_to_free);
}
-   atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);

+   free_unref_page_list(&pages_to_free);
+   atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
kvfree(area->pages);
}


# patch
4MB allocation, single cpu, loops: 100 avg: 89065758 usec
4MB allocation, single cpu, loops: 100 avg: 90258523 usec
4MB allocation, single cpu, loops: 

Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
On Tue, Mar 23, 2021 at 02:07:22PM +, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 02:39:48PM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 23, 2021 at 12:39:13PM +, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > > > I suspect the vast majority of the time is spent calling 
> > > > > alloc_pages_node()
> > > > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly 
> > > > > what
> > > > > vmalloc() wants?
> > > > > 
> > > > 
> > > >  - __vmalloc_node_range
> > > > - 45.25% __alloc_pages_nodemask
> > > >- 37.59% get_page_from_freelist
> > > [...]
> > > >   - 44.61% 0xc047348d
> > > >  - __vunmap
> > > > - 35.56% free_unref_page
> > > 
> > > Hmm!  I hadn't been thinking about the free side of things.
> > > Does this make a difference?
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 4f5f8c907897..61d5b769fea0 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> > > deallocate_pages)
> > >   vm_remove_mappings(area, deallocate_pages);
> > >  
> > >   if (deallocate_pages) {
> > > - int i;
> > > -
> > > - for (i = 0; i < area->nr_pages; i++) {
> > > - struct page *page = area->pages[i];
> > > -
> > > - BUG_ON(!page);
> > > - __free_pages(page, 0);
> > > - }
> > > + release_pages(area->pages, area->nr_pages);
> > >   atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > > -
> > >   kvfree(area->pages);
> > >   }
> > > 
> > Will check it today!
> > 
> > > release_pages does a bunch of checks that are unnecessary ... we could
> > > probably just do:
> > > 
> > >   LIST_HEAD(pages_to_free);
> > > 
> > >   for (i = 0; i < area->nr_pages; i++) {
> > >   struct page *page = area->pages[i];
> > >   if (put_page_testzero(page))
> > >   list_add(&page->lru, &pages_to_free);
> > >   }
> > >   free_unref_page_list(&pages_to_free);
> > > 
> > > but let's see if the provided interface gets us the performance we want.
> > >  
> > > > Reviewed-by: Uladzislau Rezki (Sony) 
> > > > 
> > > > Thanks!
> > > 
> > > Thank you!
> > You are welcome. A small nit:
> > 
> >   CC  mm/vmalloc.o
> > mm/vmalloc.c: In function ‘__vmalloc_area_node’:
> > mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ 
> > makes integer from pointer without a cast [-Wint-conversion]
> >   area->caller);
> >   ^~~~
> > In file included from mm/vmalloc.c:12:
> > ./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument 
> > is of type ‘const void *’
> >  void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,
> 
> Oh, thank you!  I confused myself by changing the type halfway through.
> vmalloc() uses void * to match __builtin_return_address while most
> of the rest of the kernel uses unsigned long to match _RET_IP_.
> I'll submit another patch to convert vmalloc to use _RET_IP_.
> 
Thanks!

> > As for the bulk-array interface. I have checked the:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> > mm-bulk-rebase-v6r2
> > 
> > applied the patch that is in question + below one:
> > 
> > 
> > @@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct 
> > *area, gfp_t gfp_mask,
> > area->pages = pages;
> > area->nr_pages = nr_pages;
> >  
> > -   for (i = 0; i < area->nr_pages; i++) {
> > -   struct page *page;
> > -
> > -   if (node == NUMA_NO_NODE)
> > -   page = alloc_page(gfp_mask);
> > -   else
> > -   page = alloc_pages_node(node, gfp_mask, 0);
> > -
> > -   if (unlikely(!page)) {
> > -   /* Successfully allocated i pages, free them in 
> > __vfree() */
> > -   area->nr_pages = i;
> > -   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > -   goto fail;
> > -   }
> > -   area->pages[i] = page;
> > -   if (gfpflags_allow_blocking(gfp_mask))
> > -   cond_resched();
> > +   ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
> > +   if (ret == nr_pages)
> > +   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > +   else {
> > +   area->nr_pages = ret;
> > +   goto fail;
> > }
> > -   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > 
> > 
> > single CPU, 4MB allocation, 100 avg: 70639437 usec
> > single CPU, 4MB allocation, 100 avg: 89218654 usec
> > 
> > and now we get ~21% delta. That is very good :)
> 
> Amazing!  That's great news for Mel's patch as well as the kvmalloc
> change.

Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
> On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > I suspect the vast majority of the time is spent calling 
> > > alloc_pages_node()
> > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > vmalloc() wants?
> > > 
> > 
> >  - __vmalloc_node_range
> > - 45.25% __alloc_pages_nodemask
> >- 37.59% get_page_from_freelist
> [...]
> >   - 44.61% 0xc047348d
> >  - __vunmap
> > - 35.56% free_unref_page
> 
> Hmm!  I hadn't been thinking about the free side of things.
> Does this make a difference?
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4f5f8c907897..61d5b769fea0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> deallocate_pages)
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> - int i;
> -
> - for (i = 0; i < area->nr_pages; i++) {
> - struct page *page = area->pages[i];
> -
> - BUG_ON(!page);
> - __free_pages(page, 0);
> - }
> + release_pages(area->pages, area->nr_pages);
>   atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> -
>   kvfree(area->pages);
>   }
>
Same test. 4MB allocation on a single CPU:

default: loops: 100 avg: 93601889 usec
patch:   loops: 100 avg: 98217904 usec


- __vunmap
   - 41.17% free_unref_page
      - 28.42% free_pcppages_bulk
         - 6.38% __mod_zone_page_state
              4.79% check_preemption_disabled
           2.63% __list_del_entry_valid
           2.63% __list_add_valid
      - 7.50% free_unref_page_commit
           2.15% check_preemption_disabled
           2.01% __list_add_valid
        2.31% free_unref_page_prepare.part.86
        0.70% free_pcp_prepare



- __vunmap
   - 45.36% release_pages
      - 37.70% free_unref_page_list
         - 24.70% free_pcppages_bulk
            - 5.42% __mod_zone_page_state
                 4.23% check_preemption_disabled
              2.31% __list_add_valid
              2.07% __list_del_entry_valid
         - 7.58% free_unref_page_commit
              2.47% check_preemption_disabled
              1.75% __list_add_valid
           3.43% free_unref_page_prepare.part.86
         - 2.39% mem_cgroup_uncharge_list
              uncharge_page


It is obvious that the default version is slightly better: it requires
less work compared with the release_pages() variant.

> 
> release_pages does a bunch of checks that are unnecessary ... we could
> probably just do:
> 
>   LIST_HEAD(pages_to_free);
> 
>   for (i = 0; i < area->nr_pages; i++) {
>   struct page *page = area->pages[i];
>   if (put_page_testzero(page))
>   list_add(&page->lru, &pages_to_free);
>   }
>   free_unref_page_list(&pages_to_free);
> 
> but let's see if the provided interface gets us the performance we want.
>  
I will test it tomorrow. At first glance it looks like a lighter
version :)

--
Vlad Rezki


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Matthew Wilcox
On Tue, Mar 23, 2021 at 02:39:48PM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 23, 2021 at 12:39:13PM +, Matthew Wilcox wrote:
> > On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > > I suspect the vast majority of the time is spent calling 
> > > > alloc_pages_node()
> > > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > > vmalloc() wants?
> > > > 
> > > 
> > >  - __vmalloc_node_range
> > > - 45.25% __alloc_pages_nodemask
> > >- 37.59% get_page_from_freelist
> > [...]
> > >   - 44.61% 0xc047348d
> > >  - __vunmap
> > > - 35.56% free_unref_page
> > 
> > Hmm!  I hadn't been thinking about the free side of things.
> > Does this make a difference?
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4f5f8c907897..61d5b769fea0 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> > deallocate_pages)
> > vm_remove_mappings(area, deallocate_pages);
> >  
> > if (deallocate_pages) {
> > -   int i;
> > -
> > -   for (i = 0; i < area->nr_pages; i++) {
> > -   struct page *page = area->pages[i];
> > -
> > -   BUG_ON(!page);
> > -   __free_pages(page, 0);
> > -   }
> > +   release_pages(area->pages, area->nr_pages);
> > atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> > -
> > kvfree(area->pages);
> > }
> > 
> Will check it today!
> 
> > release_pages does a bunch of checks that are unnecessary ... we could
> > probably just do:
> > 
> > LIST_HEAD(pages_to_free);
> > 
> > for (i = 0; i < area->nr_pages; i++) {
> > struct page *page = area->pages[i];
> > if (put_page_testzero(page))
> > list_add(&page->lru, &pages_to_free);
> > }
> > free_unref_page_list(&pages_to_free);
> > 
> > but let's see if the provided interface gets us the performance we want.
> >  
> > > Reviewed-by: Uladzislau Rezki (Sony) 
> > > 
> > > Thanks!
> > 
> > Thank you!
> You are welcome. A small nit:
> 
>   CC  mm/vmalloc.o
> mm/vmalloc.c: In function ‘__vmalloc_area_node’:
> mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ 
> makes integer from pointer without a cast [-Wint-conversion]
>   area->caller);
>   ^~~~
> In file included from mm/vmalloc.c:12:
> ./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument 
> is of type ‘const void *’
>  void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,

Oh, thank you!  I confused myself by changing the type halfway through.
vmalloc() uses void * to match __builtin_return_address while most
of the rest of the kernel uses unsigned long to match _RET_IP_.
I'll submit another patch to convert vmalloc to use _RET_IP_.
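
For reference, a minimal sketch of the two caller-tracking conventions in
question; this is illustration only (made-up function name), not code from
the patch:

	static void show_caller_conventions(void)
	{
		/* vmalloc-style: the caller is recorded as a void * */
		const void *vmalloc_style = __builtin_return_address(0);

		/*
		 * Most of the kernel records it as an unsigned long;
		 * _RET_IP_ is (unsigned long)__builtin_return_address(0).
		 */
		unsigned long most_of_kernel = _RET_IP_;

		/*
		 * Passing vmalloc_style where most_of_kernel is expected is
		 * what triggers the -Wint-conversion warning quoted above;
		 * the fix is a simple cast (or switching vmalloc to _RET_IP_).
		 */
		(void)vmalloc_style;
		(void)most_of_kernel;
	}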

> As for the bulk-array interface. I have checked the:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 
> mm-bulk-rebase-v6r2
> 
> applied the patch that is in question + below one:
> 
> 
> @@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct 
> *area, gfp_t gfp_mask,
> area->pages = pages;
> area->nr_pages = nr_pages;
>  
> -   for (i = 0; i < area->nr_pages; i++) {
> -   struct page *page;
> -
> -   if (node == NUMA_NO_NODE)
> -   page = alloc_page(gfp_mask);
> -   else
> -   page = alloc_pages_node(node, gfp_mask, 0);
> -
> -   if (unlikely(!page)) {
> -   /* Successfully allocated i pages, free them in 
> __vfree() */
> -   area->nr_pages = i;
> -   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> -   goto fail;
> -   }
> -   area->pages[i] = page;
> -   if (gfpflags_allow_blocking(gfp_mask))
> -   cond_resched();
> +   ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
> +   if (ret == nr_pages)
> +   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> +   else {
> +   area->nr_pages = ret;
> +   goto fail;
> }
> -   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> 
> 
> single CPU, 4MB allocation, 100 avg: 70639437 usec
> single CPU, 4MB allocation, 100 avg: 89218654 usec
> 
> and now we get ~21% delta. That is very good :)

Amazing!  That's great news for Mel's patch as well as the kvmalloc
change.

(there's an entirely separate issue that they really shouldn't be
allocating 4MB of memory, but we can at least make what we have faster).


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
On Tue, Mar 23, 2021 at 12:39:13PM +, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > > I suspect the vast majority of the time is spent calling 
> > > alloc_pages_node()
> > > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > > vmalloc() wants?
> > > 
> > 
> >  - __vmalloc_node_range
> > - 45.25% __alloc_pages_nodemask
> >- 37.59% get_page_from_freelist
> [...]
> >   - 44.61% 0xc047348d
> >  - __vunmap
> > - 35.56% free_unref_page
> 
> Hmm!  I hadn't been thinking about the free side of things.
> Does this make a difference?
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4f5f8c907897..61d5b769fea0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
> deallocate_pages)
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> - int i;
> -
> - for (i = 0; i < area->nr_pages; i++) {
> - struct page *page = area->pages[i];
> -
> - BUG_ON(!page);
> - __free_pages(page, 0);
> - }
> + release_pages(area->pages, area->nr_pages);
>   atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> -
>   kvfree(area->pages);
>   }
> 
Will check it today!

> release_pages does a bunch of checks that are unnecessary ... we could
> probably just do:
> 
>   LIST_HEAD(pages_to_free);
> 
>   for (i = 0; i < area->nr_pages; i++) {
>   struct page *page = area->pages[i];
>   if (put_page_testzero(page))
>   list_add(&page->lru, &pages_to_free);
>   }
>   free_unref_page_list(&pages_to_free);
> 
> but let's see if the provided interface gets us the performance we want.
>  
> > Reviewed-by: Uladzislau Rezki (Sony) 
> > 
> > Thanks!
> 
> Thank you!
You are welcome. A small nit:

  CC  mm/vmalloc.o
mm/vmalloc.c: In function ‘__vmalloc_area_node’:
mm/vmalloc.c:2492:14: warning: passing argument 4 of ‘kvmalloc_node_caller’ 
makes integer from pointer without a cast [-Wint-conversion]
  area->caller);
  ^~~~
In file included from mm/vmalloc.c:12:
./include/linux/mm.h:782:7: note: expected ‘long unsigned int’ but argument is 
of type ‘const void *’
 void *kvmalloc_node_caller(size_t size, gfp_t flags, int node,


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8a202ba263f6..ee6fa44983bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2489,7 +2489,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, 
gfp_t gfp_mask,
 
/* Please note that the recursion is strictly bounded. */
pages = kvmalloc_node_caller(array_size, nested_gfp, node,
-area->caller);
+   (unsigned long) area->caller);
if (!pages) {
free_vm_area(area);
return NULL;


As for the bulk-array interface. I have checked the:

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2

applied the patch that is in question + below one:


@@ -2503,25 +2498,13 @@ static void *__vmalloc_area_node(struct vm_struct 
*area, gfp_t gfp_mask,
area->pages = pages;
area->nr_pages = nr_pages;
 
-   for (i = 0; i < area->nr_pages; i++) {
-   struct page *page;
-
-   if (node == NUMA_NO_NODE)
-   page = alloc_page(gfp_mask);
-   else
-   page = alloc_pages_node(node, gfp_mask, 0);
-
-   if (unlikely(!page)) {
-   /* Successfully allocated i pages, free them in 
__vfree() */
-   area->nr_pages = i;
-   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
-   goto fail;
-   }
-   area->pages[i] = page;
-   if (gfpflags_allow_blocking(gfp_mask))
-   cond_resched();
+   ret = alloc_pages_bulk_array(gfp_mask, area->nr_pages, area->pages);
+   if (ret == nr_pages)
+   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
+   else {
+   area->nr_pages = ret;
+   goto fail;
}
-   atomic_long_add(area->nr_pages, &nr_vmalloc_pages);


single CPU, 4MB allocation, 100 avg: 70639437 usec
single CPU, 4MB allocation, 100 avg: 89218654 usec

and now we get ~21% delta. That is very good :)

--
Vlad Rezki


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Matthew Wilcox
On Tue, Mar 23, 2021 at 01:04:36PM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> > I suspect the vast majority of the time is spent calling alloc_pages_node()
> > 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> > vmalloc() wants?
> > 
> 
>  - __vmalloc_node_range
> - 45.25% __alloc_pages_nodemask
>- 37.59% get_page_from_freelist
[...]
>   - 44.61% 0xc047348d
>  - __vunmap
> - 35.56% free_unref_page

Hmm!  I hadn't been thinking about the free side of things.
Does this make a difference?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4f5f8c907897..61d5b769fea0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2277,16 +2277,8 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
vm_remove_mappings(area, deallocate_pages);
 
if (deallocate_pages) {
-   int i;
-
-   for (i = 0; i < area->nr_pages; i++) {
-   struct page *page = area->pages[i];
-
-   BUG_ON(!page);
-   __free_pages(page, 0);
-   }
+   release_pages(area->pages, area->nr_pages);
atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
-
kvfree(area->pages);
}

release_pages does a bunch of checks that are unnecessary ... we could
probably just do:

LIST_HEAD(pages_to_free);

for (i = 0; i < area->nr_pages; i++) {
struct page *page = area->pages[i];
if (put_page_testzero(page))
list_add(&page->lru, &pages_to_free);
}
free_unref_page_list(&pages_to_free);

but let's see if the provided interface gets us the performance we want.
 
> Reviewed-by: Uladzislau Rezki (Sony) 
> 
> Thanks!

Thank you!


Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-23 Thread Uladzislau Rezki
On Mon, Mar 22, 2021 at 11:03:11PM +, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 11:36:19PM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote:
> > > If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> > > (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> > > by a kmalloc (which is significantly faster).  Instead of changing this
> > > open-coded implementation, just use kvmalloc().
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > ---
> > >  mm/vmalloc.c | 7 +--
> > >  1 file changed, 1 insertion(+), 6 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 96444d64129a..32b640a84250 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct 
> > > *area, gfp_t gfp_mask,
> > >   gfp_mask |= __GFP_HIGHMEM;
> > >  
> > >   /* Please note that the recursion is strictly bounded. */
> > > - if (array_size > PAGE_SIZE) {
> > > - pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> > > + pages = kvmalloc_node_caller(array_size, nested_gfp, node,
> > >   area->caller);
> > > - } else {
> > > - pages = kmalloc_node(array_size, nested_gfp, node);
> > > - }
> > > -
> > >   if (!pages) {
> > >   free_vm_area(area);
> > >   return NULL;
> > > -- 
> > > 2.30.2
> > Makes sense to me. Though I expected a bigger difference:
> > 
> > # patch
> > single CPU, 4MB allocation, loops: 100 avg: 85293854 usec
> > 
> > # default
> > single CPU, 4MB allocation, loops: 100 avg: 89275857 usec
> 
> Well, 4.5% isn't something to leave on the table ... but yeah, I was
> expecting more in the 10-20% range.  It may be more significant if
> there's contention on the spinlocks (like if this crazy ksmbd is calling
> vmalloc(4MB) on multiple nodes simultaneously).
> 
Yep, it could be that simultaneous allocations show even bigger
improvements because of lock contention. Anyway, there is an advantage
in switching to SLAB: 5% is also a win :)

>
> I suspect the vast majority of the time is spent calling alloc_pages_node()
> 1024 times.  Have you looked at Mel's patch to do ... well, exactly what
> vmalloc() wants?
> 

-   97.37% 0.00%  vmalloc_test/0   [kernel.vmlinux]  [k] ret_from_fork
     ret_from_fork
     kthread
   - 0xc047373b
      - 52.67% 0xc047349f
           __vmalloc_node
         - __vmalloc_node_range
            - 45.25% __alloc_pages_nodemask
               - 37.59% get_page_from_freelist
                    4.34% __list_del_entry_valid
                    3.67% __list_add_valid
                    1.52% prep_new_page
                    1.20% check_preemption_disabled
              3.75% map_kernel_range_noflush
            - 0.64%

Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-22 Thread Matthew Wilcox
On Mon, Mar 22, 2021 at 11:36:19PM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote:
> > If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> > (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> > by a kmalloc (which is significantly faster).  Instead of changing this
> > open-coded implementation, just use kvmalloc().
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > ---
> >  mm/vmalloc.c | 7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 96444d64129a..32b640a84250 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct 
> > *area, gfp_t gfp_mask,
> > gfp_mask |= __GFP_HIGHMEM;
> >  
> > /* Please note that the recursion is strictly bounded. */
> > -   if (array_size > PAGE_SIZE) {
> > -   pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> > +   pages = kvmalloc_node_caller(array_size, nested_gfp, node,
> > area->caller);
> > -   } else {
> > -   pages = kmalloc_node(array_size, nested_gfp, node);
> > -   }
> > -
> > if (!pages) {
> > free_vm_area(area);
> > return NULL;
> > -- 
> > 2.30.2
> Makes sense to me. Though I expected a bigger difference:
> 
> # patch
> single CPU, 4MB allocation, loops: 100 avg: 85293854 usec
> 
> # default
> single CPU, 4MB allocation, loops: 100 avg: 89275857 usec

Well, 4.5% isn't something to leave on the table ... but yeah, I was
expecting more in the 10-20% range.  It may be more significant if
there's contention on the spinlocks (like if this crazy ksmbd is calling
vmalloc(4MB) on multiple nodes simultaneously).

I suspect the vast majority of the time is spent calling alloc_pages_node()
1024 times.  Have you looked at Mel's patch to do ... well, exactly what
vmalloc() wants?

https://lore.kernel.org/linux-mm/20210322091845.16437-1-mgor...@techsingularity.net/

> One question. Should we care much about fragmentation? I mean, with
> the patch, allocations > 2MB will issue requests to SLAB bigger than
> PAGE_SIZE.

We're pretty good about allocating memory in larger chunks these days.
Looking at my laptop's slabinfo,
kmalloc-8k           219    232   8192    4    8 : tunables    0    0    0 : slabdata     58     58      0

That's using 8 pages per slab, so that's order-3 allocations.  There's a
few more of those:

$ sudo grep '8 :' /proc/slabinfo |wc
     42     672    4508

so I have confidence that kvmalloc() will manage to use kmalloc up to 16MB
vmalloc allocations, and after that it'll tend to fall back to vmalloc.
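
Roughly, the fallback works like this; a simplified sketch only, not the
exact mm/util.c implementation (GFP handling and edge cases trimmed):

	static void *kvmalloc_node_sketch(size_t size, gfp_t flags, int node)
	{
		void *ret;

		/*
		 * Try the slab allocator first.  For sizes above a page, skip
		 * costly retries and warnings on failure, since we can always
		 * fall back to vmalloc().
		 */
		ret = kmalloc_node(size, flags | __GFP_NOWARN | __GFP_NORETRY, node);
		if (ret || size <= PAGE_SIZE)
			return ret;

		/* Physically contiguous allocation failed: fall back to vmalloc. */
		return __vmalloc_node(size, 1, flags, node,
				__builtin_return_address(0));
	}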



Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages

2021-03-22 Thread Uladzislau Rezki
On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote:
> If we're trying to allocate 4MB of memory, the table will be 8KiB in size
> (1024 pointers * 8 bytes per pointer), which can usually be satisfied
> by a kmalloc (which is significantly faster).  Instead of changing this
> open-coded implementation, just use kvmalloc().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/vmalloc.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 96444d64129a..32b640a84250 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct 
> *area, gfp_t gfp_mask,
>   gfp_mask |= __GFP_HIGHMEM;
>  
>   /* Please note that the recursion is strictly bounded. */
> - if (array_size > PAGE_SIZE) {
> - pages = __vmalloc_node(array_size, 1, nested_gfp, node,
> + pages = kvmalloc_node_caller(array_size, nested_gfp, node,
>   area->caller);
> - } else {
> - pages = kmalloc_node(array_size, nested_gfp, node);
> - }
> -
>   if (!pages) {
>   free_vm_area(area);
>   return NULL;
> -- 
> 2.30.2
Makes sense to me. Though I expected a bigger difference:

# patch
single CPU, 4MB allocation, loops: 100 avg: 85293854 usec

# default
single CPU, 4MB allocation, loops: 100 avg: 89275857 usec
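
For context, a rough sketch of the kind of timing loop behind these numbers
(the real driver is presumably lib/test_vmalloc.c; the inner iteration count
below is an assumption, illustration only):

	static void run_4mb_alloc_test(void)
	{
		unsigned long nr_runs = 100, nr_iters = 1000000;
		u64 total_usec = 0;
		unsigned long i, j;

		for (i = 0; i < nr_runs; i++) {
			u64 start = ktime_get_ns();

			/* one "loop": many vmalloc(4MB)/vfree() pairs */
			for (j = 0; j < nr_iters; j++) {
				void *p = vmalloc(4 * 1024 * 1024);

				if (p)
					vfree(p);
			}

			total_usec += (ktime_get_ns() - start) / NSEC_PER_USEC;
		}

		pr_info("4MB allocation, single cpu, loops: %lu avg: %llu usec\n",
			nr_runs, total_usec / nr_runs);
	}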

One question. Should we care much about fragmentation? I mean, with the
patch, allocations > 2MB will issue requests to SLAB bigger than PAGE_SIZE.

Thanks!

--
Vlad Rezki