On 24 Sep 2025, at 19:45, Alistair Popple wrote:

> On 2025-09-25 at 03:49 +1000, Zi Yan <[email protected]> wrote...
>> On 24 Sep 2025, at 7:04, David Hildenbrand wrote:
>>
>>> On 23.09.25 05:47, Balbir Singh wrote:
>>>> On 9/19/25 23:26, Zi Yan wrote:
>>>>> On 19 Sep 2025, at 1:01, Balbir Singh wrote:
>>>>>
>>>>>> On 9/18/25 12:49, Zi Yan wrote:
>>>>>>> On 16 Sep 2025, at 8:21, Balbir Singh wrote:
>>>>>>>
>>>>>>>> Add routines to support allocation of large order zone device folios
>>>>>>>> and helper functions for zone device folios, to check if a folio is
>>>>>>>> device private and helpers for setting zone device data.
>>>>>>>>
>>>>>>>> When large folios are used, the existing page_free() callback in
>>>>>>>> pgmap is called when the folio is freed, this is true for both
>>>>>>>> PAGE_SIZE and higher order pages.
>>>>>>>>
>>>>>>>> Zone device private large folios do not support deferred split and
>>>>>>>> scan like normal THP folios.
>>>>>>>>
>>>>>>>> Signed-off-by: Balbir Singh <[email protected]>
>>>>>>>> Cc: David Hildenbrand <[email protected]>
>>>>>>>> Cc: Zi Yan <[email protected]>
>>>>>>>> Cc: Joshua Hahn <[email protected]>
>>>>>>>> Cc: Rakie Kim <[email protected]>
>>>>>>>> Cc: Byungchul Park <[email protected]>
>>>>>>>> Cc: Gregory Price <[email protected]>
>>>>>>>> Cc: Ying Huang <[email protected]>
>>>>>>>> Cc: Alistair Popple <[email protected]>
>>>>>>>> Cc: Oscar Salvador <[email protected]>
>>>>>>>> Cc: Lorenzo Stoakes <[email protected]>
>>>>>>>> Cc: Baolin Wang <[email protected]>
>>>>>>>> Cc: "Liam R. Howlett" <[email protected]>
>>>>>>>> Cc: Nico Pache <[email protected]>
>>>>>>>> Cc: Ryan Roberts <[email protected]>
>>>>>>>> Cc: Dev Jain <[email protected]>
>>>>>>>> Cc: Barry Song <[email protected]>
>>>>>>>> Cc: Lyude Paul <[email protected]>
>>>>>>>> Cc: Danilo Krummrich <[email protected]>
>>>>>>>> Cc: David Airlie <[email protected]>
>>>>>>>> Cc: Simona Vetter <[email protected]>
>>>>>>>> Cc: Ralph Campbell <[email protected]>
>>>>>>>> Cc: Mika Penttilä <[email protected]>
>>>>>>>> Cc: Matthew Brost <[email protected]>
>>>>>>>> Cc: Francois Dugast <[email protected]>
>>>>>>>> ---
>>>>>>>> include/linux/memremap.h | 10 +++++++++-
>>>>>>>> mm/memremap.c | 34 +++++++++++++++++++++-------------
>>>>>>>> mm/rmap.c | 6 +++++-
>>>>>>>> 3 files changed, 35 insertions(+), 15 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index e5951ba12a28..9c20327c2be5 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
>>>>>>>> }
>>>>>>>>
>>>>>>>> #ifdef CONFIG_ZONE_DEVICE
>>>>>>>> -void zone_device_page_init(struct page *page);
>>>>>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>>>>>>>> void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>>>>>>> void memunmap_pages(struct dev_pagemap *pgmap);
>>>>>>>> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>>>>>>>> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
>>>>>>>> bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>>>>>>>
>>>>>>>> unsigned long memremap_compat_align(void);
>>>>>>>> +
>>>>>>>> +static inline void zone_device_page_init(struct page *page)
>>>>>>>> +{
>>>>>>>> + struct folio *folio = page_folio(page);
>>>>>>>> +
>>>>>>>> + zone_device_folio_init(folio, 0);
>>>>>>>
>>>>>>> I assume it is for legacy code, where only non-compound page exists?
>>>>>>>
>>>>>>> It seems that you assume @page is always order-0, but there is no check
>>>>>>> for it. Adding VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio)
>>>>>>> above it would be useful to detect misuse.
>>>>>>>
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> #else
>>>>>>>> static inline void *devm_memremap_pages(struct device *dev,
>>>>>>>> struct dev_pagemap *pgmap)
>>>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>>>> index 46cb1b0b6f72..a8481ebf94cc 100644
>>>>>>>> --- a/mm/memremap.c
>>>>>>>> +++ b/mm/memremap.c
>>>>>>>> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>>>>>>> void free_zone_device_folio(struct folio *folio)
>>>>>>>> {
>>>>>>>> struct dev_pagemap *pgmap = folio->pgmap;
>>>>>>>> + unsigned long nr = folio_nr_pages(folio);
>>>>>>>> + int i;
>>>>>>>>
>>>>>>>> if (WARN_ON_ONCE(!pgmap))
>>>>>>>> return;
>>>>>>>>
>>>>>>>> mem_cgroup_uncharge(folio);
>>>>>>>>
>>>>>>>> - /*
>>>>>>>> - * Note: we don't expect anonymous compound pages yet. Once supported
>>>>>>>> - * and we could PTE-map them similar to THP, we'd have to clear
>>>>>>>> - * PG_anon_exclusive on all tail pages.
>>>>>>>> - */
>>>>>>>> if (folio_test_anon(folio)) {
>>>>>>>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>> - __ClearPageAnonExclusive(folio_page(folio, 0));
>>>>>>>> + for (i = 0; i < nr; i++)
>>>>>>>> + __ClearPageAnonExclusive(folio_page(folio, i));
>>>>>>>> + } else {
>>>>>>>> + VM_WARN_ON_ONCE(folio_test_large(folio));
>>>>>>>> }
>>>>>>>>
>>>>>>>> /*
>>>>>>>> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
>>>>>>>> case MEMORY_DEVICE_COHERENT:
>>>>>>>> if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>>>>>>>> break;
>>>>>>>> - pgmap->ops->page_free(folio_page(folio, 0));
>>>>>>>> - put_dev_pagemap(pgmap);
>>>>>>>> + pgmap->ops->page_free(&folio->page);
>>>>>>>> + percpu_ref_put_many(&folio->pgmap->ref, nr);
>>>>>>>> break;
>>>>>>>>
>>>>>>>> case MEMORY_DEVICE_GENERIC:
>>>>>>>> @@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> -void zone_device_page_init(struct page *page)
>>>>>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>>>>>>>> {
>>>>>>>> + struct page *page = folio_page(folio, 0);
>>>>>>>
>>>>>>> It is strange to see a folio is converted back to page in
>>>>>>> a function called zone_device_folio_init().
>>>>>>>
>>>>>>>> +
>>>>>>>> + VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>>>>>>> +
>>>>>>>> /*
>>>>>>>> * Drivers shouldn't be allocating pages after calling
>>>>>>>> * memunmap_pages().
>>>>>>>> */
>>>>>>>> - WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>>>>>>>> - set_page_count(page, 1);
>>>>>>>> + WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>>>>>>>> + folio_set_count(folio, 1);
>>>>>>>> lock_page(page);
>>>>>>>> +
>>>>>>>> + if (order > 1) {
>
> Why is this only called for order > 1 rather than order > 0 ?
>
>>>>>>>> + prep_compound_page(page, order);
>>>>>>>> + folio_set_large_rmappable(folio);
>>>>>>>> + }
>>>>>>>
>>>>>>> OK, so basically, @folio is not a compound page yet when
>>>>>>> zone_device_folio_init() is called.
>>>>>>>
>>>>>>> I feel that your zone_device_page_init() and zone_device_folio_init()
>>>>>>> implementations are inverse. They should follow the same pattern
>>>>>>> as __alloc_pages_noprof() and __folio_alloc_noprof(), where
>>>>>>> zone_device_page_init() does the actual initialization and
>>>>>>> zone_device_folio_init() just convert a page to folio.
>>>>>>>
>>>>>>> Something like:
>>>>>>>
>>>>>>> void zone_device_page_init(struct page *page, unsigned int order)
>>>>>>> {
>>>>>>> VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>>>>>>
>>>>>>> /*
>>>>>>> * Drivers shouldn't be allocating pages after calling
>>>>>>> * memunmap_pages().
>>>>>>> */
>>>>>>>
>>>>>>> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>>>>>>>
>>>>>>> /*
>>>>>>> * anonymous folio does not support order-1, high order file-backed folio
>>>>>>> * is not supported at all.
>>>>>>> */
>
> I guess that answers my question :-)
>
>>>>>>> VM_WARN_ON_ONCE(order == 1);
>>>>>>>
>>>>>>> if (order > 1)
>>>>>>> prep_compound_page(page, order);
>>>>>>>
>>>>>>> /* page has to be compound head here */
>>>>>>> set_page_count(page, 1);
>>>>>>> lock_page(page);
>>>>>>> }
>>>>>>>
>>>>>>> void zone_device_folio_init(struct folio *folio, unsigned int order)
>>>>>>> {
>>>>>>> struct page *page = folio_page(folio, 0);
>>>>>>>
>>>>>>> zone_device_page_init(page, order);
>>>>>>> page_rmappable_folio(page);
>>>>>>> }
>>>>>>>
>>>>>>> Or
>>>>>>>
>>>>>>> struct folio *zone_device_folio_init(struct page *page, unsigned int order)
>>>>>>> {
>>>>>>> zone_device_page_init(page, order);
>>>>>>> return page_rmappable_folio(page);
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> Then, it comes to free_zone_device_folio() above,
>>>>>>> I feel that pgmap->ops->page_free() should take an additional order
>>>>>>> parameter to free a compound page like free_frozen_pages().
>>>>>>>
>>>>>>>
>>>>>>> This is my impression after reading the patch and zone device page code.
>>>>>>>
>>>>>>> Alistair and David can correct me if this is wrong, since I am new to
>>>>>>> zone device page code.
>>>>>>>
>>>>>>
>>>>>> Thanks, I did not want to change zone_device_page_init() for several
>>>>>> drivers (outside my test scope) that already assume it has an order size
>>>>>> of 0.
>
> It's a trivial change, so I don't think avoiding changes to other drivers
> should be a concern.
>
>>>>>
>>>>> But my proposed zone_device_page_init() should still work for order-0
>>>>> pages. You just need to change call site to add 0 as a new parameter.
>>>>>
>>>>
>>>> I did not want to change existing callers (increases testing impact)
>>>> without a strong reason.
>>>>
>>>>>
>>>>> One strange thing I found in the original zone_device_page_init() is
>>>>> the use of page_pgmap() in
>>>>> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order)).
>>>>> page_pgmap() calls page_folio() on the given page to access pgmap field.
>>>>> And pgmap field is only available in struct folio. The code initializes
>>>>> struct page, but in middle it suddenly finds the page is actually a folio,
>>>>> then treat it as a page afterwards. I wonder if it can be done better.
>>>>>
>>>>> This might be a question to Alistair, since he made the change.
>
> Hello! I might be him :)
>
> I think this situation is just historical - when I originally wrote
> zone_device_page_init() the pgmap was stored on the page rather than the
> folio. That only changed fairly recently with commit 82ba975e4c43 ("mm: allow
> compound zone device pages").
>
> The reason pgmap is now only available on the folio is described in the
> commit log. The TLDR is switching FS DAX to use compound pages required
> page->compound_head to be available for use, and that was being shared
> with page->pgmap. So the solution was to move pgmap to the folio freeing up
> page->compound_head for use on tail pages.
>
> The whole percpu pgmap->ref could actually now go away - I've debated removing
> it but haven't found the motivation as it provides a small advantage on driver
> tear down. Basically it just tracks how many pages are allocated in the pgmap
> so drivers could use that to determine if they need to trigger migrations
> before tearing down the pgmap.
>
> The alternative is just to loop over every page in the pgmap to ensure the
> folio/page refcounts are 0 before tear down.
>
>>>>>
>>>>
>>>> I'll let him answer it :)
>>>
>>> Not him, but I think this goes back to my question raised in my other
>>> reply: When would we allocate "struct folio" in the future.
>>>
>>> If it's "always" then actually most of the zone-device code would only ever
>>> operate on folios and never on pages in the future.
>>>
>>> I recall during a discussion at LSF/MM I raised that, and the answer was
>>> (IIRC) that we will allocate "struct folio" as we will initialize the
>>> memmap for dax.
>
> Sounds about right.
>
>>> So essentially, we'd always have folios and would never really have to
>>> operate on pages.
>
> Yeah, I think I mentioned to Matthew at LSF/MM that I thought ZONE_DEVICE (and
> in particular ZONE_DEVICE_PRIVATE) might be a good candidate to experiment
> with removing struct pages entirely and switching to memdesc's or whatever.
> Because we should, in theory at least, only need to operate on folio's. But
> I'm still a little vague on the details how that would actually work. It's
> been on my TODO list for a while, so maybe I will try and look at it for LPC
> as a healthy bit of conference driven development.
>
>> Hmm, then what is the point of having “struct folio”, which originally is
>> added to save compound_head() calls, where everything is a folio in device
>> private world? We might need DAX people to explain the rationale of
>> “always struct folio”.
>
> Longer term isn't there an aim to remove struct page? So I assumed moving to

Right. But my current impression, based on my code reading and this patchset,
is that every device private page is a folio. To form a high-order folio,
each device private folio is converted to a page, prep_compound*()’d, then
converted back to a folio. Based on what you said above, this weird
conversion might be temporary until the code is switched to memdesc.

I am looking forward to more details from you on how device private will be
switched to memdesc. :)

> folio's was part of that effort. As you say though many of the clean-ups thus
> far related to switching ZONE_DEVICE to folios have indeed just been about
> removing compound_head() calls.

Best Regards,
Yan, Zi
