On Fri, Feb 12, 2021 at 11:32 PM Michal Hocko <mho...@suse.com> wrote:
>
> On Mon 08-02-21 16:50:09, Muchun Song wrote:
> > When we free a HugeTLB page to the buddy allocator, we should allocate
> > the vmemmap pages associated with it. But we may not be able to
> > allocate vmemmap pages when the system is under memory pressure; in
> > this case, we just refuse to free the HugeTLB page instead of looping
> > forever trying to allocate the pages.
>
> Thanks for simplifying the implementation from your early proposal!
>
> This will not be looping forever. The allocation will usually trigger
> the OOM killer, and sooner or later there will be memory to allocate
> from, or the system panics when there are no eligible tasks to kill.
> This is just a side note.
>
> I think the changelog could benefit from more explicit documentation
> of those failure cases. There are different cases when the hugetlb
> page is freed. It can be due to an admin intervention (decreasing the
> pool), overcommit, migration, dissolving and likely some others. Most
> of them should be fine to leave the page in the pool, which would just
> increase the surplus pages in the pool. I am not so sure about the
> dissolving path.
Thanks. I will update the changelog.

> [...]
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 0209b736e0b4..3d85e3ab7caa 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -169,6 +169,8 @@
> >   * (last) level. So this type of HugeTLB page can be optimized only when its
> >   * size of the struct page structs is greater than 2 pages.
> >   */
> > +#define pr_fmt(fmt)	"HugeTLB: " fmt
> > +
> >  #include "hugetlb_vmemmap.h"
> >
> >  /*
> > @@ -198,6 +200,34 @@ static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h)
> >  	return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT;
> >  }
> >
> > +int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
> > +{
> > +	int ret;
> > +	unsigned long vmemmap_addr = (unsigned long)head;
> > +	unsigned long vmemmap_end, vmemmap_reuse;
> > +
> > +	if (!free_vmemmap_pages_per_hpage(h))
> > +		return 0;
> > +
> > +	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> > +	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> > +	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> > +
> > +	/*
> > +	 * The pages which the vmemmap virtual address range [@vmemmap_addr,
> > +	 * @vmemmap_end) are mapped to are freed to the buddy allocator, and
> > +	 * the range is mapped to the page which @vmemmap_reuse is mapped to.
> > +	 * When a HugeTLB page is freed to the buddy allocator, previously
> > +	 * discarded vmemmap pages must be allocated and remapped.
> > +	 */
> > +	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
> > +				  GFP_ATOMIC | __GFP_NOWARN | __GFP_THISNODE);
>
> I do not think that this is a good allocation mode. GFP_ATOMIC is a
> non-sleeping allocation and a medium memory pressure might cause it to
> fail prematurely. I do not think this is really an atomic context which
> couldn't afford memory reclaim. I also do not think granting

Because alloc_huge_page_vmemmap() is now called under hugetlb_lock,
using GFP_ATOMIC indeed keeps the code simpler.

From the kernel documentation, I learned that __GFP_NOMEMALLOC can be
used to explicitly forbid access to the emergency reserves. So if we do
not want to touch the reserves, how about replacing the mask with

	GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE

Thanks.

> access to the memory reserves is reasonable. Just think of a huge
> number of hugetlb pages being freed, which can deplete the memory
> reserves for atomic allocations. I think that you want
> 	GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN | __GFP_THISNODE
>
> for an initial implementation. The justification being that the
> allocation should at least try to reclaim, but it shouldn't cause any
> major disruption because the failure is not fatal. If the failure rate
> would be impractically high, then just drop the NORETRY part. You can
> replace it by __GFP_RETRY_MAYFAIL, but that shouldn't be strictly
> necessary because __GFP_THISNODE on its own implies no OOM killer,
> but that is kinda ugly to rely on.
> --
> Michal Hocko
> SUSE Labs
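
As context for "we just refuse to free the HugeTLB page" in the
changelog quoted at the top: a minimal sketch of what that failure
handling can look like in the free path. update_and_free_page() and the
surplus counters are real names from mm/hugetlb.c, but the body below is
illustrative, not the exact patch:

static void update_and_free_page(struct hstate *h, struct page *page)
{
	if (alloc_huge_page_vmemmap(h, page)) {
		/*
		 * Could not allocate the previously discarded vmemmap
		 * pages, so the page cannot be returned to the buddy
		 * allocator. Keep it in the pool as a surplus page.
		 */
		h->surplus_huge_pages++;
		h->surplus_huge_pages_node[page_to_nid(page)]++;
		return;
	}

	/* ... existing path that frees the page to the buddy allocator ... */
}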
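
And a sketch contrasting the two allocation modes discussed in this
thread, reusing the variables set up in alloc_huge_page_vmemmap() above;
only the gfp mask differs between the two calls:

	/*
	 * Muchun's variant: stays non-sleeping so it can run under
	 * hugetlb_lock, with __GFP_NOMEMALLOC to explicitly forbid
	 * access to the emergency reserves.
	 */
	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
				  GFP_ATOMIC | __GFP_NOMEMALLOC |
				  __GFP_NOWARN | __GFP_THISNODE);

	/*
	 * Michal's suggested initial mode: may sleep and reclaim, gives
	 * up quietly without OOM disruption, and stays on the local
	 * node. This assumes the call can be moved outside hugetlb_lock,
	 * since GFP_KERNEL may sleep.
	 */
	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
				  GFP_KERNEL | __GFP_NORETRY |
				  __GFP_NOWARN | __GFP_THISNODE);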