Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Michal Hocko
On Mon 23-11-20 12:45:13, Matthew Wilcox wrote:
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > > 
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > 
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that 
> > 
> > struct page *page = pfn_to_page();
> > // remapping happens here
> > // page content is no longer valid because its backing memory can be
> > // reused for whatever purpose.
> 
> pfn_to_page() returns you a virtual address.  That virtual address
> remains a valid pointer to exactly the same contents, it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

You are right. I have managed to completely confuse myself. Sorry about
the noise!

-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Matthew Wilcox
On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > No I really mean that pfn_to_page will give you a struct page pointer
> > > from pages which you release from the vmemmap page tables. Those pages
> > > might get reused as soon as they are freed to the page allocator.
> > 
> > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > frame 1. And then we free page frame 2-7 to the buddy allocator.
> 
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that 
> 
> struct page *page = pfn_to_page();
> // remapping happens here
> // page content is no longer valid because its backing memory can be
> // reused for whatever purpose.

pfn_to_page() returns you a virtual address.  That virtual address
remains a valid pointer to exactly the same contents, it's just that
the page tables change to point to a different struct page which has
the same compound_head().


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Michal Hocko
On Mon 23-11-20 20:40:40, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko  wrote:
> >
> > On Mon 23-11-20 20:07:23, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko  wrote:
> > [...]
> > > > > > > > No I really mean that pfn_to_page will give you a struct page 
> > > > > > > > pointer
> > > > > > > > from pages which you release from the vmemmap page tables. 
> > > > > > > > Those pages
> > > > > > > > might get reused as soon as they are freed to the page allocator.
> > > > > > >
> > > > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > > > >
> > > > > > And this doesn't really happen in an atomic fashion from the pfn 
> > > > > > walker
> > > > > > POV, right? So it is very well possible that
> > > > >
> > > > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > > > Because in most cases, we only read the tail struct page and get the
> > > > > head struct page through compound_head() when the pfn is within
> > > > > a HugeTLB range. Right?
> > > >
> > > > Many pfn walkers would encounter the head page first and then skip over
> > > > the rest. Those should be reasonably safe. But there is no guarantee and
> > > > the fact that you need a valid page->compound_head which might get
> > > > scribbled over once you have the struct page makes this extremely
> > > > subtle.
> > >
> > > In this patch series, we can guarantee that the page->compound_head
> > > is always valid. Because we reuse the first tail page. Maybe you need to
> > > look closer at this series. Thanks.
> >
> > I must be really terrible at explaining my concern. Let me try one last
> > time. It is really _irrelevant_ what you do with tail pages. The
> > underlying problem is that you are changing struct pages under users
> > without any synchronization. What used to be a valid struct page will
> > turn into garbage as soon as you remap vmemmap page tables.
> 
> Thank you very much for your patient explanation. So if the pfn walkers
> always try to get the head struct page through compound_head() when they
> encounter a tail struct page, there will be no concerns. Do you agree?

No, I do not agree. Please read again. The content of the struct page
might be complete garbage at any time after pfn_to_page returns a
struct page. So there is no valid compound_head anymore.
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Muchun Song
On Mon, Nov 23, 2020 at 8:45 PM Matthew Wilcox  wrote:
>
> On Mon, Nov 23, 2020 at 11:42:58AM +0100, Michal Hocko wrote:
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > >
> > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> >
> > And this doesn't really happen in an atomic fashion from the pfn walker
> > POV, right? So it is very well possible that
> >
> > struct page *page = pfn_to_page();
> > // remapping happens here
> > // page content is no longer valid because its backing memory can be
> > // reused for whatever purpose.
>
> pfn_to_page() returns you a virtual address.  That virtual address
> remains a valid pointer to exactly the same contents, it's just that
> the page tables change to point to a different struct page which has
> the same compound_head().

I agree with you.

Hi Michal,

Maybe you need to look at this.

-- 
Yours,
Muchun


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Muchun Song
On Mon, Nov 23, 2020 at 8:18 PM Michal Hocko  wrote:
>
> On Mon 23-11-20 20:07:23, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko  wrote:
> [...]
> > > > > > > No I really mean that pfn_to_page will give you a struct page 
> > > > > > > pointer
> > > > > > > from pages which you release from the vmemmap page tables. Those 
> > > > > > > pages
> > > > > > > might get reused as soon as they are freed to the page allocator.
> > > > > >
> > > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > > >
> > > > > And this doesn't really happen in an atomic fashion from the pfn 
> > > > > walker
> > > > > POV, right? So it is very well possible that
> > > >
> > > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > > Because in most cases, we only read the tail struct page and get the
> > > > head struct page through compound_head() when the pfn is within
> > > > a HugeTLB range. Right?
> > >
> > > Many pfn walkers would encounter the head page first and then skip over
> > > the rest. Those should be reasonably safe. But there is no guarantee and
> > > the fact that you need a valid page->compound_head which might get
> > > scribbled over once you have the struct page makes this extremely
> > > subtle.
> >
> > In this patch series, we can guarantee that the page->compound_head
> > is always valid. Because we reuse the first tail page. Maybe you need to
> > look closer at this series. Thanks.
>
> I must be really terrible at explaining my concern. Let me try one last
> time. It is really _irrelevant_ what you do with tail pages. The
> underlying problem is that you are changing struct pages under users
> without any synchronization. What used to be a valid struct page will
> turn into garbage as soon as you remap vmemmap page tables.

Thank you very much for your patient explanation. So if the pfn walkers
always try to get the head struct page through compound_head() when they
encounter a tail struct page, there will be no concerns. Do you agree?

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Michal Hocko
On Mon 23-11-20 20:07:23, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko  wrote:
[...]
> > > > > > No I really mean that pfn_to_page will give you a struct page 
> > > > > > pointer
> > > > > > from pages which you release from the vmemmap page tables. Those 
> > > > > > pages
> > > > > > might get reused as soon as they are freed to the page allocator.
> > > > >
> > > > > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > > > > frame 1. And then we free page frame 2-7 to the buddy allocator.
> > > >
> > > > And this doesn't really happen in an atomic fashion from the pfn walker
> > > > POV, right? So it is very well possible that
> > >
> > > Yeah, you are right. But it may not be a problem for HugeTLB pages.
> > > Because in most cases, we only read the tail struct page and get the
> > > head struct page through compound_head() when the pfn is within
> > > a HugeTLB range. Right?
> >
> > Many pfn walkers would encounter the head page first and then skip over
> > the rest. Those should be reasonably safe. But there is no guarantee and
> > the fact that you need a valid page->compound_head which might get
> > scribbled over once you have the struct page makes this extremely
> > subtle.
> 
> In this patch series, we can guarantee that the page->compound_head
> is always valid. Because we reuse the first tail page. Maybe you need to
> look closer at this series. Thanks.

I must be really terrible at explaining my concern. Let me try one last
time. It is really _irrelevant_ what you do with tail pages. The
underlying problem is that you are changing struct pages under users
without any synchronization. What used to be a valid struct page will
turn into garbage as soon as you remap vmemmap page tables.
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Muchun Song
On Mon, Nov 23, 2020 at 7:32 PM Michal Hocko  wrote:
>
> On Mon 23-11-20 19:16:18, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko  wrote:
> > >
> > > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko  wrote:
> > > > >
> > > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > >
> > > > > > > > > > > Thanks for improving the cover letter and providing some 
> > > > > > > > > > > numbers. I have
> > > > > > > > > > > only glanced through the patchset because I didn't really 
> > > > > > > > > > > have more time
> > > > > > > > > > > to dive deeply into them.
> > > > > > > > > > >
> > > > > > > > > > > Overall it looks promising. To summarize. I would prefer
> > > > > > > > > > > to not have
> > > > > > > > > > > the feature enablement controlled by compile time option 
> > > > > > > > > > > and the kernel
> > > > > > > > > > > command line option should be opt-in. I also do not like 
> > > > > > > > > > > that freeing
> > > > > > > > > > > the pool can trigger the oom killer or even shut the 
> > > > > > > > > > > system down if no
> > > > > > > > > > > oom victim is eligible.
> > > > > > > > > >
> > > > > > > > > > Hi Michal,
> > > > > > > > > >
> > > > > > > > > > I have replied to you about those questions on the other 
> > > > > > > > > > mail thread.
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > One thing that I didn't really get to think hard about is 
> > > > > > > > > > > what is the
> > > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. 
> > > > > > > > > > > pfn_to_page can be
> > > > > > > > > > > invalid when racing with the split. How do we enforce 
> > > > > > > > > > > that this won't
> > > > > > > > > > > blow up?
> > > > > > > > > >
> > > > > > > > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > > > > > > > in this case, pfn_to_page can work. The return value of
> > > > > > > > > > pfn_to_page is actually the address of its struct page.
> > > > > > > > > > I cannot figure out where the problem is. Can you describe
> > > > > > > > > > the problem in detail please? Thanks.
> > > > > > > > >
> > > > > > > > > struct page returned by pfn_to_page might get invalid right 
> > > > > > > > > when it is
> > > > > > > > > returned because vmemmap could get freed up and the 
> > > > > > > > > respective memory
> > > > > > > > > released to the page allocator and reused for something else. 
> > > > > > > > > See?
> > > > > > > >
> > > > > > > > If the HugeTLB page is already allocated from the buddy 
> > > > > > > > allocator,
> > > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > > >
> > > > > > > Nope, struct pages only ever get deallocated when the respective 
> > > > > > > memory
> > > > > > > (they describe) is hotremoved via hotplug.
> > > > > > >
> > > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > > (cannot access the struct page)?
> > > > > > >
> > > > > > > But I do not follow how that relates to my concern above.
> > > > > >
> > > > > > Sorry, I didn't understand your concerns.
> > > > > >
> > > > > > vmemmap pages                 page frame
> > > > > > +---+   mapping to   +---+
> > > > > > |   |       ->       | 0 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 1 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 2 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 3 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 4 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 5 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 6 |
> > > > > > +---+                +---+
> > > > > > |   |       ->       | 7 |
> > > > > > +---+                +---+
> > > > > >
> > > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > > value when the 

Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Michal Hocko
On Mon 23-11-20 19:16:18, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko  wrote:
> >
> > On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko  wrote:
> > > >
> > > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > Thanks for improving the cover letter and providing some 
> > > > > > > > > > numbers. I have
> > > > > > > > > > only glanced through the patchset because I didn't really 
> > > > > > > > > > have more time
> > > > > > > > > > to dive deeply into them.
> > > > > > > > > >
> > > > > > > > > > Overall it looks promising. To summarize. I would prefer
> > > > > > > > > > to not have
> > > > > > > > > > the feature enablement controlled by compile time option 
> > > > > > > > > > and the kernel
> > > > > > > > > > command line option should be opt-in. I also do not like 
> > > > > > > > > > that freeing
> > > > > > > > > > the pool can trigger the oom killer or even shut the system 
> > > > > > > > > > down if no
> > > > > > > > > > oom victim is eligible.
> > > > > > > > >
> > > > > > > > > Hi Michal,
> > > > > > > > >
> > > > > > > > > I have replied to you about those questions on the other mail 
> > > > > > > > > thread.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > One thing that I didn't really get to think hard about is 
> > > > > > > > > > what is the
> > > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page 
> > > > > > > > > > can be
> > > > > > > > > > invalid when racing with the split. How do we enforce that 
> > > > > > > > > > this won't
> > > > > > > > > > blow up?
> > > > > > > > >
> > > > > > > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > > > > > > in this case, pfn_to_page can work. The return value of
> > > > > > > > > pfn_to_page is actually the address of its struct page.
> > > > > > > > > I cannot figure out where the problem is. Can you describe
> > > > > > > > > the problem in detail please? Thanks.
> > > > > > > >
> > > > > > > > struct page returned by pfn_to_page might get invalid right 
> > > > > > > > when it is
> > > > > > > > returned because vmemmap could get freed up and the respective 
> > > > > > > > memory
> > > > > > > > released to the page allocator and reused for something else. 
> > > > > > > > See?
> > > > > > >
> > > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > > >
> > > > > > Nope, struct pages only ever get deallocated when the respective 
> > > > > > memory
> > > > > > (they describe) is hotremoved via hotplug.
> > > > > >
> > > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > > (cannot access the struct page)?
> > > > > >
> > > > > > But I do not follow how that relates to my concern above.
> > > > >
> > > > > Sorry, I didn't understand your concerns.
> > > > >
> > > > > vmemmap pages                 page frame
> > > > > +---+   mapping to   +---+
> > > > > |   |       ->       | 0 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 1 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 2 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 3 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 4 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 5 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 6 |
> > > > > +---+                +---+
> > > > > |   |       ->       | 7 |
> > > > > +---+                +---+
> > > > >
> > > > > In this patch series, we will free the page frame 2-7 to the
> > > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > > value when the pfn is the page frame 2-7? Thanks.
> > > >
> > > > No I really mean that pfn_to_page will give you a struct page pointer
> > > > from pages which you release from the vmemmap page tables. Those pages
> > > > might get reused as soon as they are freed to the page allocator.
> > >
> > > We 

Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Muchun Song
On Mon, Nov 23, 2020 at 6:43 PM Michal Hocko  wrote:
>
> On Mon 23-11-20 18:36:33, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko  wrote:
> > >
> > > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  wrote:
> > > > >
> > > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > Thanks for improving the cover letter and providing some 
> > > > > > > > > numbers. I have
> > > > > > > > > only glanced through the patchset because I didn't really 
> > > > > > > > > have more time
> > > > > > > > > to dive deeply into them.
> > > > > > > > >
> > > > > > > > > Overall it looks promising. To summarize. I would prefer to
> > > > > > > > > not have
> > > > > > > > > the feature enablement controlled by compile time option and 
> > > > > > > > > the kernel
> > > > > > > > > command line option should be opt-in. I also do not like that 
> > > > > > > > > freeing
> > > > > > > > > the pool can trigger the oom killer or even shut the system 
> > > > > > > > > down if no
> > > > > > > > > oom victim is eligible.
> > > > > > > >
> > > > > > > > Hi Michal,
> > > > > > > >
> > > > > > > > I have replied to you about those questions on the other mail 
> > > > > > > > thread.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > One thing that I didn't really get to think hard about is 
> > > > > > > > > what is the
> > > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page 
> > > > > > > > > can be
> > > > > > > > > invalid when racing with the split. How do we enforce that 
> > > > > > > > > this won't
> > > > > > > > > blow up?
> > > > > > > >
> > > > > > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > > > > > in this case, pfn_to_page can work. The return value of
> > > > > > > > pfn_to_page is actually the address of its struct page.
> > > > > > > > I cannot figure out where the problem is. Can you describe the
> > > > > > > > problem in detail please? Thanks.
> > > > > > >
> > > > > > > struct page returned by pfn_to_page might get invalid right when 
> > > > > > > it is
> > > > > > > returned because vmemmap could get freed up and the respective 
> > > > > > > memory
> > > > > > > released to the page allocator and reused for something else. See?
> > > > > >
> > > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > > >
> > > > > Nope, struct pages only ever get deallocated when the respective 
> > > > > memory
> > > > > (they describe) is hotremoved via hotplug.
> > > > >
> > > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > > (cannot access the struct page)?
> > > > >
> > > > > But I do not follow how that relates to my concern above.
> > > >
> > > > Sorry, I didn't understand your concerns.
> > > >
> > > > vmemmap pages                 page frame
> > > > +---+   mapping to   +---+
> > > > |   |       ->       | 0 |
> > > > +---+                +---+
> > > > |   |       ->       | 1 |
> > > > +---+                +---+
> > > > |   |       ->       | 2 |
> > > > +---+                +---+
> > > > |   |       ->       | 3 |
> > > > +---+                +---+
> > > > |   |       ->       | 4 |
> > > > +---+                +---+
> > > > |   |       ->       | 5 |
> > > > +---+                +---+
> > > > |   |       ->       | 6 |
> > > > +---+                +---+
> > > > |   |       ->       | 7 |
> > > > +---+                +---+
> > > >
> > > > In this patch series, we will free the page frame 2-7 to the
> > > > buddy allocator. You mean that pfn_to_page can return invalid
> > > > value when the pfn is the page frame 2-7? Thanks.
> > >
> > > No I really mean that pfn_to_page will give you a struct page pointer
> > > from pages which you release from the vmemmap page tables. Those pages
> > > might get reused as soon as they are freed to the page allocator.
> >
> > We will remap vmemmap pages 2-7 (virtual addresses) to page
> > frame 1. And then we free page frame 2-7 to the buddy allocator.
>
> And this doesn't really happen in an atomic fashion from the pfn walker
> POV, right? So it is very well possible that

Yeah, you are right. But it may not be a problem for HugeTLB pages.
Because in 

Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Michal Hocko
On Mon 23-11-20 18:36:33, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko  wrote:
> >
> > On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  wrote:
> > > >
> > > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > Thanks for improving the cover letter and providing some 
> > > > > > > > numbers. I have
> > > > > > > > only glanced through the patchset because I didn't really have 
> > > > > > > > more time
> > > > > > > > to dive deeply into them.
> > > > > > > >
> > > > > > > > Overall it looks promising. To summarize. I would prefer to
> > > > > > > > not have
> > > > > > > > the feature enablement controlled by compile time option and 
> > > > > > > > the kernel
> > > > > > > > command line option should be opt-in. I also do not like that 
> > > > > > > > freeing
> > > > > > > > the pool can trigger the oom killer or even shut the system 
> > > > > > > > down if no
> > > > > > > > oom victim is eligible.
> > > > > > >
> > > > > > > Hi Michal,
> > > > > > >
> > > > > > > I have replied to you about those questions on the other mail 
> > > > > > > thread.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > >
> > > > > > > > One thing that I didn't really get to think hard about is what 
> > > > > > > > is the
> > > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can 
> > > > > > > > be
> > > > > > > > invalid when racing with the split. How do we enforce that this 
> > > > > > > > won't
> > > > > > > > blow up?
> > > > > > >
> > > > > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > > > > in this case, pfn_to_page can work. The return value of
> > > > > > > pfn_to_page is actually the address of its struct page.
> > > > > > > I cannot figure out where the problem is. Can you describe the
> > > > > > > problem in detail please? Thanks.
> > > > > >
> > > > > > struct page returned by pfn_to_page might get invalid right when it 
> > > > > > is
> > > > > > returned because vmemmap could get freed up and the respective 
> > > > > > memory
> > > > > > released to the page allocator and reused for something else. See?
> > > > >
> > > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > > the struct page of the HugeTLB can be freed? Does this exist?
> > > >
> > > > Nope, struct pages only ever get deallocated when the respective memory
> > > > (they describe) is hotremoved via hotplug.
> > > >
> > > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > > (cannot access the struct page)?
> > > >
> > > > But I do not follow how that relates to my concern above.
> > >
> > > Sorry, I didn't understand your concerns.
> > >
> > > vmemmap pages                 page frame
> > > +---+   mapping to   +---+
> > > |   |       ->       | 0 |
> > > +---+                +---+
> > > |   |       ->       | 1 |
> > > +---+                +---+
> > > |   |       ->       | 2 |
> > > +---+                +---+
> > > |   |       ->       | 3 |
> > > +---+                +---+
> > > |   |       ->       | 4 |
> > > +---+                +---+
> > > |   |       ->       | 5 |
> > > +---+                +---+
> > > |   |       ->       | 6 |
> > > +---+                +---+
> > > |   |       ->       | 7 |
> > > +---+                +---+
> > >
> > > In this patch series, we will free the page frame 2-7 to the
> > > buddy allocator. You mean that pfn_to_page can return invalid
> > > value when the pfn is the page frame 2-7? Thanks.
> >
> > No I really mean that pfn_to_page will give you a struct page pointer
> > from pages which you release from the vmemmap page tables. Those pages
> > might get reused as soon as they are freed to the page allocator.
> 
> We will remap vmemmap pages 2-7 (virtual addresses) to page
> frame 1. And then we free page frame 2-7 to the buddy allocator.

And this doesn't really happen in an atomic fashion from the pfn walker
POV, right? So it is very well possible that 

struct page *page = pfn_to_page();
// remapping happens here
// page content is no longer valid because its backing memory can be
// reused for whatever purpose.
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Muchun Song
On Mon, Nov 23, 2020 at 5:43 PM Michal Hocko  wrote:
>
> On Mon 23-11-20 16:53:53, Muchun Song wrote:
> > On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  wrote:
> > >
> > > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  wrote:
> > > > >
> > > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > > [...]
> > > > > > >
> > > > > > > Thanks for improving the cover letter and providing some numbers. 
> > > > > > > I have
> > > > > > > only glanced through the patchset because I didn't really have 
> > > > > > > more time
> > > > > > > to dive deeply into them.
> > > > > > >
> > > > > > > Overall it looks promising. To summarize. I would prefer to not
> > > > > > > have
> > > > > > > the feature enablement controlled by compile time option and the 
> > > > > > > kernel
> > > > > > > command line option should be opt-in. I also do not like that 
> > > > > > > freeing
> > > > > > > the pool can trigger the oom killer or even shut the system down 
> > > > > > > if no
> > > > > > > oom victim is eligible.
> > > > > >
> > > > > > Hi Michal,
> > > > > >
> > > > > > I have replied to you about those questions on the other mail 
> > > > > > thread.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > >
> > > > > > > One thing that I didn't really get to think hard about is what is 
> > > > > > > the
> > > > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > > > invalid when racing with the split. How do we enforce that this 
> > > > > > > won't
> > > > > > > blow up?
> > > > > >
> > > > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > > > in this case, pfn_to_page can work. The return value of the
> > > > > > pfn_to_page is actually the address of its struct page.
> > > > > > I cannot figure out where the problem is. Can you describe the
> > > > > > problem in detail please? Thanks.
> > > > >
> > > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > > returned because vmemmap could get freed up and the respective memory
> > > > > released to the page allocator and reused for something else. See?
> > > >
> > > > If the HugeTLB page is already allocated from the buddy allocator,
> > > > the struct page of the HugeTLB can be freed? Does this exist?
> > >
> > > Nope, struct pages only ever get deallocated when the respective memory
> > > (they describe) is hotremoved via hotplug.
> > >
> > > > If yes, how to free the HugeTLB page to the buddy allocator
> > > > (cannot access the struct page)?
> > >
> > > But I do not follow how that relates to my concern above.
> >
> > Sorry, I didn't understand your concerns.
> >
> > vmemmap pages                 page frame
> > +---+   mapping to   +---+
> > |   |       ->       | 0 |
> > +---+                +---+
> > |   |       ->       | 1 |
> > +---+                +---+
> > |   |       ->       | 2 |
> > +---+                +---+
> > |   |       ->       | 3 |
> > +---+                +---+
> > |   |       ->       | 4 |
> > +---+                +---+
> > |   |       ->       | 5 |
> > +---+                +---+
> > |   |       ->       | 6 |
> > +---+                +---+
> > |   |       ->       | 7 |
> > +---+                +---+
> >
> > In this patch series, we will free page frames 2-7 to the
> > buddy allocator. Do you mean that pfn_to_page can return an
> > invalid value when the pfn falls in page frames 2-7? Thanks.
>
> No, I really mean that pfn_to_page will give you a struct page pointer
> from pages which you release from the vmemmap page tables. Those pages
> might get reused as soon as they are freed to the page allocator.

We will remap vmemmap pages 2-7 (virtual addresses) to page
frame 1, and then free page frames 2-7 to the buddy allocator.
Any access through the struct page pointer returned by pfn_to_page
will then be reflected on page frame 1. I think there is no problem here.

Thanks.

> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Michal Hocko
On Mon 23-11-20 16:53:53, Muchun Song wrote:
> On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  wrote:
> >
> > On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  wrote:
> > > >
> > > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > > [...]
> > > > > >
> > > > > > Thanks for improving the cover letter and providing some numbers.
> > > > > > I have only glanced through the patchset because I didn't really
> > > > > > have more time to dive deeply into them.
> > > > > >
> > > > > > Overall it looks promising. To summarize: I would prefer not to
> > > > > > have the feature enablement controlled by a compile-time option,
> > > > > > and the kernel command-line option should be opt-in. I also do not
> > > > > > like that freeing the pool can trigger the oom killer or even shut
> > > > > > the system down if no oom victim is eligible.
> > > > >
> > > > > Hi Michal,
> > > > >
> > > > > I have replied to you about those questions on the other mail thread.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > One thing that I didn't really get to think hard about is what is
> > > > > > the effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can
> > > > > > be invalid when racing with the split. How do we enforce that this
> > > > > > won't blow up?
> > > > >
> > > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > > in this case, pfn_to_page can work. The return value of
> > > > > pfn_to_page is actually the address of its struct page structure.
> > > > > I cannot figure out where the problem is. Can you describe the
> > > > > problem in detail, please? Thanks.
> > > >
> > > > struct page returned by pfn_to_page might get invalid right when it is
> > > > returned because vmemmap could get freed up and the respective memory
> > > > released to the page allocator and reused for something else. See?
> > >
> > > If the HugeTLB page is already allocated from the buddy allocator,
> > > the struct page of the HugeTLB can be freed? Does this exist?
> >
> > Nope, struct pages only ever get deallocated when the respective memory
> > (they describe) is hotremoved via hotplug.
> >
> > > If yes, how to free the HugeTLB page to the buddy allocator
> > > (cannot access the struct page)?
> >
> > But I do not follow how that relates to my concern above.
> 
> Sorry, I didn't understand your concerns.
> 
> vmemmap pages        page frame
> +---+   mapping to   +---+
> |   | -------------> | 0 |
> +---+                +---+
> |   | -------------> | 1 |
> +---+                +---+
> |   | -------------> | 2 |
> +---+                +---+
> |   | -------------> | 3 |
> +---+                +---+
> |   | -------------> | 4 |
> +---+                +---+
> |   | -------------> | 5 |
> +---+                +---+
> |   | -------------> | 6 |
> +---+                +---+
> |   | -------------> | 7 |
> +---+                +---+
> 
> In this patch series, we will free page frames 2-7 to the
> buddy allocator. Do you mean that pfn_to_page can return an
> invalid value when the pfn falls in page frames 2-7? Thanks.

No, I really mean that pfn_to_page will give you a struct page pointer
from pages which you release from the vmemmap page tables. Those pages
might get reused as soon as they are freed to the page allocator.
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-23 Thread Muchun Song
On Mon, Nov 23, 2020 at 3:40 PM Michal Hocko  wrote:
>
> On Fri 20-11-20 23:44:26, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  wrote:
> > >
> > > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  wrote:
> > > > >
> > > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > > [...]
> > > > >
> > > > > Thanks for improving the cover letter and providing some numbers. I
> > > > > have only glanced through the patchset because I didn't really have
> > > > > more time to dive deeply into them.
> > > > >
> > > > > Overall it looks promising. To summarize: I would prefer not to have
> > > > > the feature enablement controlled by a compile-time option and the
> > > > > kernel command-line option should be opt-in. I also do not like that
> > > > > freeing the pool can trigger the oom killer or even shut the system
> > > > > down if no oom victim is eligible.
> > > >
> > > > Hi Michal,
> > > >
> > > > I have replied to you about those questions on the other mail thread.
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > One thing that I didn't really get to think hard about is what is the
> > > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > > invalid when racing with the split. How do we enforce that this won't
> > > > > blow up?
> > > >
> > > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > > in this case, pfn_to_page can work. The return value of
> > > > pfn_to_page is actually the address of its struct page structure.
> > > > I cannot figure out where the problem is. Can you describe the
> > > > problem in detail, please? Thanks.
> > >
> > > struct page returned by pfn_to_page might get invalid right when it is
> > > returned because vmemmap could get freed up and the respective memory
> > > released to the page allocator and reused for something else. See?
> >
> > If the HugeTLB page is already allocated from the buddy allocator,
> > the struct page of the HugeTLB can be freed? Does this exist?
>
> Nope, struct pages only ever get deallocated when the respective memory
> (they describe) is hotremoved via hotplug.
>
> > If yes, how to free the HugeTLB page to the buddy allocator
> > (cannot access the struct page)?
>
> But I do not follow how that relates to my concern above.

Sorry, I didn't understand your concerns.

vmemmap pages        page frame
+---+   mapping to   +---+
|   | -------------> | 0 |
+---+                +---+
|   | -------------> | 1 |
+---+                +---+
|   | -------------> | 2 |
+---+                +---+
|   | -------------> | 3 |
+---+                +---+
|   | -------------> | 4 |
+---+                +---+
|   | -------------> | 5 |
+---+                +---+
|   | -------------> | 6 |
+---+                +---+
|   | -------------> | 7 |
+---+                +---+

In this patch series, we will free page frames 2-7 to the
buddy allocator. Do you mean that pfn_to_page can return an
invalid value when the pfn falls in page frames 2-7? Thanks.

>
> --
> Michal Hocko
> SUSE Labs



--
Yours,
Muchun


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-22 Thread Michal Hocko
On Fri 20-11-20 23:44:26, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  wrote:
> >
> > On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  wrote:
> > > >
> > > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > > [...]
> > > >
> > > > Thanks for improving the cover letter and providing some numbers. I have
> > > > only glanced through the patchset because I didn't really have more time
> > > > to dive deeply into them.
> > > >
> > > > Overall it looks promising. To summarize: I would prefer not to have
> > > > the feature enablement controlled by a compile-time option and the kernel
> > > > command-line option should be opt-in. I also do not like that freeing
> > > > the pool can trigger the oom killer or even shut the system down if no
> > > > oom victim is eligible.
> > >
> > > Hi Michal,
> > >
> > > I have replied to you about those questions on the other mail thread.
> > >
> > > Thanks.
> > >
> > > >
> > > > One thing that I didn't really get to think hard about is what is the
> > > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > > invalid when racing with the split. How do we enforce that this won't
> > > > blow up?
> > >
> > > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > > in this case, pfn_to_page can work. The return value of
> > > pfn_to_page is actually the address of its struct page structure.
> > > I cannot figure out where the problem is. Can you describe the
> > > problem in detail, please? Thanks.
> >
> > struct page returned by pfn_to_page might get invalid right when it is
> > returned because vmemmap could get freed up and the respective memory
> > released to the page allocator and reused for something else. See?
> 
> If the HugeTLB page is already allocated from the buddy allocator,
> the struct page of the HugeTLB can be freed? Does this exist?

Nope, struct pages only ever get deallocated when the respective memory
(they describe) is hotremoved via hotplug.

> If yes, how to free the HugeTLB page to the buddy allocator
> (cannot access the struct page)?

But I do not follow how that relates to my concern above.

-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-21 Thread Muchun Song
On Sat, Nov 21, 2020 at 1:47 AM Mike Kravetz  wrote:
>
> On 11/20/20 1:43 AM, David Hildenbrand wrote:
> > On 20.11.20 10:39, Michal Hocko wrote:
> >> On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
> >>> On 20.11.20 09:42, Michal Hocko wrote:
>  On Fri 20-11-20 14:43:04, Muchun Song wrote:
>  [...]
> 
 Thanks for improving the cover letter and providing some numbers. I have
 only glanced through the patchset because I didn't really have more time
 to dive deeply into them.

 Overall it looks promising. To summarize: I would prefer not to have
 the feature enablement controlled by a compile-time option and the kernel
 command-line option should be opt-in. I also do not like that freeing
 the pool can trigger the oom killer or even shut the system down if no
 oom victim is eligible.
> 
>  One thing that I didn't really get to think hard about is what is the
>  effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
>  invalid when racing with the split. How do we enforce that this won't
>  blow up?
> >>>
> >>> I have the same concerns - the sections are online the whole time and
> >>> anybody with pfn_to_online_page() can grab them.
> >>>
> >>> I think we have similar issues with memory offlining when removing the
> >>> vmemmap, it's just very hard to trigger and we can easily protect by
> >>> grabbing the memhotplug lock.
> >>
> >> I am not sure we can/want to span memory hotplug locking out to all pfn
> >> walkers. But you are right that the underlying problem is similar but
> >> much harder to trigger because vmemmaps are only removed when the
> >> physical memory is hotremoved and that happens very seldom. Maybe it
> >> will happen more with virtualization usecases. But this work makes it
> >> even more tricky. If a pfn walker races with a hotremove then it would
> >> just blow up when accessing the unmapped physical address space. For
> >> this feature a pfn walker would just grab a real struct page re-used for
> >> some unpredictable use under its feet. Any failure would be silent and
> >> hard to debug.
> >
> > Right, we don't want the memory hotplug locking, thus discussions regarding 
> > rcu. Luckily, for now I never saw a BUG report regarding this - maybe 
> > because the time between memory offlining (offline_pages()) and 
> > memory/vmemmap getting removed (try_remove_memory()) is just too long. 
> > Someone would have to sleep after pfn_to_online_page() for quite a while to 
> > trigger it.
> >
> >>
> >> [...]
> >>> To keep things easy, maybe simply never allow to free these hugetlb pages
> >>> again for now? If they were reserved during boot and the vmemmap
> >>> condensed, then just let them stick around for all eternity.
> >>
> >> Not sure I understand. Do you propose to only free those vmemmap pages
> >> when the pool is initialized during boot time and never allow to free
> >> them up? That would certainly make it safer and maybe even simpler wrt
> >> implementation.
> >
> > Exactly, let's keep it simple for now. I guess most use cases of this 
> > (virtualization, databases, ...) will allocate hugepages during boot and 
> > never free them.
>
> Not sure if I agree with that last statement.  Database and virtualization
> use cases from my employer allocate hugetlb pages after boot.  It
> is shortly after boot, but still not from boot/kernel command line.
>
> Somewhat related, but not exactly addressing this issue ...
>
> One idea discussed in a previous patch set was to disable PMD/huge page
> mapping of vmemmap if this feature was enabled.  This would eliminate a bunch
> of the complex code doing page table manipulation.  It does not address
> the issue of struct page pages going away which is being discussed here,
> but it could be a way to simplify the first version of this code.  If this
> is going to be an 'opt in' feature as previously suggested, then eliminating
> the PMD/huge page vmemmap mapping may be acceptable.  My guess is that
> sysadmins would only 'opt in' if they expect most of system memory to be used
> by hugetlb pages.  We certainly have database and virtualization use cases
> where this is true.

Hi Mike,

Yeah, I agree with you that the first version of this feature should be
simple. I can do that (disable PMD/huge page mapping of vmemmap)
in the next version of the patch. But I have another question: what is
the problem when struct page pages go away? I have not understood
the issues discussed here; I hope you can answer that for me. Thanks.

> --
> Mike Kravetz



-- 
Yours,
Muchun


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-20 Thread Muchun Song
On Fri, Nov 20, 2020 at 9:11 PM Michal Hocko  wrote:
>
> On Fri 20-11-20 20:40:46, Muchun Song wrote:
> > On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  wrote:
> > >
> > > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > > [...]
> > >
> > > Thanks for improving the cover letter and providing some numbers. I have
> > > only glanced through the patchset because I didn't really have more time
> > > to dive deeply into them.
> > >
> > > Overall it looks promising. To summarize: I would prefer not to have
> > > the feature enablement controlled by a compile-time option and the kernel
> > > command-line option should be opt-in. I also do not like that freeing
> > > the pool can trigger the oom killer or even shut the system down if no
> > > oom victim is eligible.
> >
> > Hi Michal,
> >
> > I have replied to you about those questions on the other mail thread.
> >
> > Thanks.
> >
> > >
> > > One thing that I didn't really get to think hard about is what is the
> > > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > > invalid when racing with the split. How do we enforce that this won't
> > > blow up?
> >
> > This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> > in this case, pfn_to_page can work. The return value of
> > pfn_to_page is actually the address of its struct page structure.
> > I cannot figure out where the problem is. Can you describe the
> > problem in detail, please? Thanks.
>
> struct page returned by pfn_to_page might get invalid right when it is
> returned because vmemmap could get freed up and the respective memory
> released to the page allocator and reused for something else. See?

If the HugeTLB page is already allocated from the buddy allocator,
can the struct page of the HugeTLB page be freed? Does this case exist?
If yes, how can we free the HugeTLB page back to the buddy allocator
(since we cannot access its struct page)?

>
> > > I have also asked in a previous version whether the vmemmap manipulation
> > > should be really unconditional. E.g. shortlived hugetlb pages allocated
> > > from the buddy allocator directly rather than for a pool. Maybe it
> > > should be restricted for the pool allocation as those are considered
> > > long term and therefore the overhead will be amortized and freeing path
> > > restrictions better understandable.
> >
> > Yeah, I agree with you. This can be an optimization. And we can
> > add it to the todo list and implement it in the future. Now the patch
> > series is already huge.
>
> Yes the patchset is large and the primary aim should be reducing
> functionality to make it smaller in the first incarnation. Especially
> when it is tricky to implement. Releasing vmemmap sparse hugepages is
> one of those things. Do you really need it for your usecase?
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-20 Thread Michal Hocko
On Fri 20-11-20 20:40:46, Muchun Song wrote:
> On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  wrote:
> >
> > On Fri 20-11-20 14:43:04, Muchun Song wrote:
> > [...]
> >
> > Thanks for improving the cover letter and providing some numbers. I have
> > only glanced through the patchset because I didn't really have more time
> > to dive deeply into them.
> >
> > Overall it looks promising. To summarize: I would prefer not to have
> > the feature enablement controlled by a compile-time option and the kernel
> > command-line option should be opt-in. I also do not like that freeing
> > the pool can trigger the oom killer or even shut the system down if no
> > oom victim is eligible.
> 
> Hi Michal,
> 
> I have replied to you about those questions on the other mail thread.
> 
> Thanks.
> 
> >
> > One thing that I didn't really get to think hard about is what is the
> > effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> > invalid when racing with the split. How do we enforce that this won't
> > blow up?
> 
> This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
> in this case, pfn_to_page can work. The return value of
> pfn_to_page is actually the address of its struct page structure.
> I cannot figure out where the problem is. Can you describe the
> problem in detail, please? Thanks.

The struct page returned by pfn_to_page might become invalid right when
it is returned, because the vmemmap could get freed up and the respective
memory released to the page allocator and reused for something else. See?

> > I have also asked in a previous version whether the vmemmap manipulation
> > should be really unconditional. E.g. shortlived hugetlb pages allocated
> > from the buddy allocator directly rather than for a pool. Maybe it
> > should be restricted for the pool allocation as those are considered
> > long term and therefore the overhead will be amortized and freeing path
> > restrictions better understandable.
> 
> Yeah, I agree with you. This can be an optimization. And we can
> add it to the todo list and implement it in the future. Now the patch
> series is already huge.

Yes, the patchset is large and the primary aim should be reducing
functionality to make it smaller in the first incarnation. Especially
when it is tricky to implement. Releasing vmemmap sparse hugepages is
one of those things. Do you really need it for your usecase?
-- 
Michal Hocko
SUSE Labs


Re: [External] Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

2020-11-20 Thread Muchun Song
On Fri, Nov 20, 2020 at 4:42 PM Michal Hocko  wrote:
>
> On Fri 20-11-20 14:43:04, Muchun Song wrote:
> [...]
>
> Thanks for improving the cover letter and providing some numbers. I have
> only glanced through the patchset because I didn't really have more time
> to dive deeply into them.
>
> Overall it looks promising. To summarize: I would prefer not to have
> the feature enablement controlled by a compile-time option and the kernel
> command-line option should be opt-in. I also do not like that freeing
> the pool can trigger the oom killer or even shut the system down if no
> oom victim is eligible.

Hi Michal,

I have replied to you about those questions on the other mail thread.

Thanks.

>
> One thing that I didn't really get to think hard about is what is the
> effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
> invalid when racing with the split. How do we enforce that this won't
> blow up?

This feature depends on CONFIG_SPARSEMEM_VMEMMAP;
in this case, pfn_to_page can work. The return value of
pfn_to_page is actually the address of its struct page structure.
I cannot figure out where the problem is. Can you describe the
problem in detail, please? Thanks.

>
> I have also asked in a previous version whether the vmemmap manipulation
> should be really unconditional. E.g. shortlived hugetlb pages allocated
> from the buddy allocator directly rather than for a pool. Maybe it
> should be restricted for the pool allocation as those are considered
> long term and therefore the overhead will be amortized and freeing path
> restrictions better understandable.

Yeah, I agree with you. This can be an optimization, and we can
add it to the todo list and implement it in the future. For now, the
patch series is already huge.

>
> >  Documentation/admin-guide/kernel-parameters.txt |   9 +
> >  Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
> >  arch/x86/include/asm/hugetlb.h                  |  17 +
> >  arch/x86/include/asm/pgtable_64_types.h         |   8 +
> >  arch/x86/mm/init_64.c                           |   7 +-
> >  fs/Kconfig                                      |  14 +
> >  include/linux/bootmem_info.h                    |  78 +++
> >  include/linux/hugetlb.h                         |  19 +
> >  include/linux/hugetlb_cgroup.h                  |  15 +-
> >  include/linux/memory_hotplug.h                  |  27 -
> >  mm/Makefile                                     |   2 +
> >  mm/bootmem_info.c                               | 124 
> >  mm/hugetlb.c                                    | 163 -
> >  mm/hugetlb_vmemmap.c                            | 765 
> >  mm/hugetlb_vmemmap.h                            | 103 
>
> I will need to look closer but I suspect that a non-trivial part of the
> vmemmap manipulation really belongs to mm/sparse-vmemmap.c because the
> split and remapping shouldn't really be hugetlb specific. Sure hugetlb
> knows how to split but all the splitting should be implemented in
> vmemmap proper.
>
> >  mm/memory_hotplug.c                             | 116 
> >  mm/sparse.c                                     |   5 +-
> >  17 files changed, 1295 insertions(+), 180 deletions(-)
> >  create mode 100644 include/linux/bootmem_info.h
> >  create mode 100644 mm/bootmem_info.c
> >  create mode 100644 mm/hugetlb_vmemmap.c
> >  create mode 100644 mm/hugetlb_vmemmap.h
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs



-- 
Yours,
Muchun