On 9/25/25 09:58, Alistair Popple wrote:
> On 2025-09-25 at 03:36 +1000, Zi Yan <[email protected]> wrote...
>> On 24 Sep 2025, at 6:55, David Hildenbrand wrote:
>>
>>> On 18.09.25 04:49, Zi Yan wrote:
>>>> On 16 Sep 2025, at 8:21, Balbir Singh wrote:
>>>>
>>>>> Add routines to support allocation of large order zone device folios
>>>>> and helper functions for zone device folios, to check if a folio is
>>>>> device private and helpers for setting zone device data.
>>>>>
>>>>> When large folios are used, the existing page_free() callback in
>>>>> pgmap is called when the folio is freed, this is true for both
>>>>> PAGE_SIZE and higher order pages.
>>>>>
>>>>> Zone device private large folios do not support deferred split and
>>>>> scan like normal THP folios.
>>>>>
>>>>> Signed-off-by: Balbir Singh <[email protected]>
>>>>> Cc: David Hildenbrand <[email protected]>
>>>>> Cc: Zi Yan <[email protected]>
>>>>> Cc: Joshua Hahn <[email protected]>
>>>>> Cc: Rakie Kim <[email protected]>
>>>>> Cc: Byungchul Park <[email protected]>
>>>>> Cc: Gregory Price <[email protected]>
>>>>> Cc: Ying Huang <[email protected]>
>>>>> Cc: Alistair Popple <[email protected]>
>>>>> Cc: Oscar Salvador <[email protected]>
>>>>> Cc: Lorenzo Stoakes <[email protected]>
>>>>> Cc: Baolin Wang <[email protected]>
>>>>> Cc: "Liam R. Howlett" <[email protected]>
>>>>> Cc: Nico Pache <[email protected]>
>>>>> Cc: Ryan Roberts <[email protected]>
>>>>> Cc: Dev Jain <[email protected]>
>>>>> Cc: Barry Song <[email protected]>
>>>>> Cc: Lyude Paul <[email protected]>
>>>>> Cc: Danilo Krummrich <[email protected]>
>>>>> Cc: David Airlie <[email protected]>
>>>>> Cc: Simona Vetter <[email protected]>
>>>>> Cc: Ralph Campbell <[email protected]>
>>>>> Cc: Mika Penttilä <[email protected]>
>>>>> Cc: Matthew Brost <[email protected]>
>>>>> Cc: Francois Dugast <[email protected]>
>>>>> ---
>>>>>  include/linux/memremap.h | 10 +++++++++-
>>>>>  mm/memremap.c            | 34 +++++++++++++++++++++-------------
>>>>>  mm/rmap.c                |  6 +++++-
>>>>>  3 files changed, 35 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>> index e5951ba12a28..9c20327c2be5 100644
>>>>> --- a/include/linux/memremap.h
>>>>> +++ b/include/linux/memremap.h
>>>>> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
>>>>>  }
>>>>>
>>>>>  #ifdef CONFIG_ZONE_DEVICE
>>>>> -void zone_device_page_init(struct page *page);
>>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>>>>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>>>>  void memunmap_pages(struct dev_pagemap *pgmap);
>>>>>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>>>>> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
>>>>>  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>>>>
>>>>>  unsigned long memremap_compat_align(void);
>>>>> +
>>>>> +static inline void zone_device_page_init(struct page *page)
>>>>> +{
>>>>> +	struct folio *folio = page_folio(page);
>>>>> +
>>>>> +	zone_device_folio_init(folio, 0);
>>>>
>>>> I assume it is for legacy code, where only non-compound page exists?
>>>>
>>>> It seems that you assume @page is always order-0, but there is no check
>>>> for it. Adding VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio)
>>>> above it would be useful to detect misuse.
>>>>
>>>>> +}
>>>>> +
>>>>>  #else
>>>>>  static inline void *devm_memremap_pages(struct device *dev,
>>>>>  		struct dev_pagemap *pgmap)
>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>> index 46cb1b0b6f72..a8481ebf94cc 100644
>>>>> --- a/mm/memremap.c
>>>>> +++ b/mm/memremap.c
>>>>> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>>>>  void free_zone_device_folio(struct folio *folio)
>>>>>  {
>>>>>  	struct dev_pagemap *pgmap = folio->pgmap;
>>>>> +	unsigned long nr = folio_nr_pages(folio);
>>>>> +	int i;
>>>>>
>>>>>  	if (WARN_ON_ONCE(!pgmap))
>>>>>  		return;
>>>>>
>>>>>  	mem_cgroup_uncharge(folio);
>>>>>
>>>>> -	/*
>>>>> -	 * Note: we don't expect anonymous compound pages yet. Once supported
>>>>> -	 * and we could PTE-map them similar to THP, we'd have to clear
>>>>> -	 * PG_anon_exclusive on all tail pages.
>>>>> -	 */
>>>>>  	if (folio_test_anon(folio)) {
>>>>> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>> -		__ClearPageAnonExclusive(folio_page(folio, 0));
>>>>> +		for (i = 0; i < nr; i++)
>>>>> +			__ClearPageAnonExclusive(folio_page(folio, i));
>>>>> +	} else {
>>>>> +		VM_WARN_ON_ONCE(folio_test_large(folio));
>>>>>  	}
>>>>>
>>>>>  	/*
>>>>> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
>>>>>  	case MEMORY_DEVICE_COHERENT:
>>>>>  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>>>>>  			break;
>>>>> -		pgmap->ops->page_free(folio_page(folio, 0));
>>>>> -		put_dev_pagemap(pgmap);
>>>>> +		pgmap->ops->page_free(&folio->page);
>>>>> +		percpu_ref_put_many(&folio->pgmap->ref, nr);
>>>>>  		break;
>>>>>
>>>>>  	case MEMORY_DEVICE_GENERIC:
>>>>> @@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>>>>>  	}
>>>>>  }
>>>>>
>>>>> -void zone_device_page_init(struct page *page)
>>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>>>>>  {
>>>>> +	struct page *page = folio_page(folio, 0);
>>>>
>>>> It is strange to see a folio converted back to a page in
>>>> a function called zone_device_folio_init().
>>>>
>>>>> +
>>>>> +	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>>>> +
>>>>>  	/*
>>>>>  	 * Drivers shouldn't be allocating pages after calling
>>>>>  	 * memunmap_pages().
>>>>>  	 */
>>>>> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>>>>> -	set_page_count(page, 1);
>>>>> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>>>>> +	folio_set_count(folio, 1);
>>>>>  	lock_page(page);
>>>>> +
>>>>> +	if (order > 1) {
>>>>> +		prep_compound_page(page, order);
>>>>> +		folio_set_large_rmappable(folio);
>>>>> +	}
>>>>
>>>> OK, so basically, @folio is not a compound page yet when
>>>> zone_device_folio_init() is called.
>>>>
>>>> I feel that your zone_device_page_init() and zone_device_folio_init()
>>>> implementations are inverse. They should follow the same pattern
>>>> as __alloc_pages_noprof() and __folio_alloc_noprof(), where
>>>> zone_device_page_init() does the actual initialization and
>>>> zone_device_folio_init() just converts a page to a folio.
>>>>
>>>> Something like:
>>>>
>>>> void zone_device_page_init(struct page *page, unsigned int order)
>>>> {
>>>> 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>>>
>>>> 	/*
>>>> 	 * Drivers shouldn't be allocating pages after calling
>>>> 	 * memunmap_pages().
>>>> 	 */
>>>> 	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>>>>
>>>> 	/*
>>>> 	 * anonymous folio does not support order-1, high order file-backed folio
>>>> 	 * is not supported at all.
>>>> 	 */
>>>> 	VM_WARN_ON_ONCE(order == 1);
>>>>
>>>> 	if (order > 1)
>>>> 		prep_compound_page(page, order);
>>>>
>>>> 	/* page has to be compound head here */
>>>> 	set_page_count(page, 1);
>>>> 	lock_page(page);
>>>> }
>>>>
>>>> void zone_device_folio_init(struct folio *folio, unsigned int order)
>>>> {
>>>> 	struct page *page = folio_page(folio, 0);
>>>>
>>>> 	zone_device_page_init(page, order);
>>>> 	page_rmappable_folio(page);
>>>> }
>>>>
>>>> Or
>>>>
>>>> struct folio *zone_device_folio_init(struct page *page, unsigned int order)
>>>> {
>>>> 	zone_device_page_init(page, order);
>>>> 	return page_rmappable_folio(page);
>>>> }
>>>
>>> I think the problem is that it will all be weird once we dynamically
>>> allocate "struct folio".
>>>
>>> I have not yet a clear understanding on how that would really work.
>>>
>>> For example, should it be pgmap->ops->page_folio()?
>>>
>>> Who allocates the folio? Do we allocate all order-0 folios initially, to
>>> then merge them when constructing large folios? How do we manage the
>>> "struct folio" during such merging and splitting?
>>
>> Right. Either we would waste memory by simply concatenating all “struct folio”
>> and putting padding at the end, or we would free tail “struct folio” first,
>> then allocate tail “struct page”. Both are painful and do not match core mm’s
>> memdesc pattern, where “struct folio” is allocated when the caller is asking
>> for a folio. If “struct folio” is always allocated, there is no difference
>> between “struct folio” and “struct page”.
>
> As mentioned in my other reply I need to investigate this some more, but I
> don't think we _need_ to always allocate folios (or pages for that matter).
> The ZONE_DEVICE code just uses folios/pages for interacting with the core mm,
> not for managing the device memory itself, so we should be able to make it
> more closely match the memdesc pattern. It's just that I'm still a bit unsure
> what that pattern will actually look like.
>
>>>
>>> With that in mind, I don't really know what the proper interface should be
>>> today.
>>>
>>>
>>> zone_device_folio_init(struct page *page, unsigned int order)
>>>
>>> looks cleaner, agreed.
>
> Agreed.
>
>>>>
>>>>
>>>> Then, it comes to free_zone_device_folio() above:
>>>> I feel that pgmap->ops->page_free() should take an additional order
>>>> parameter to free a compound page, like free_frozen_pages().
>
> Where would the order parameter come from? Presumably
> folio_order(compound_head(page)), in which case shouldn't the op actually
> just be pgmap->ops->folio_free()?
>

->page_free() can detect whether the page it is handed is part of a large-order
folio. The patchset was designed to make folios an opt-in and to avoid
unnecessary changes to existing drivers, but I can revisit that thought process
if it helps with cleaner code.
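To illustrate what I mean, a driver callback could do something roughly like
the sketch below (just an illustration, my_drv_page_free() and
my_drv_free_device_mem() are made-up names):

	static void my_drv_page_free(struct page *page)
	{
		/* page is the head page of the folio being freed */
		struct folio *folio = page_folio(page);
		unsigned long nr = folio_nr_pages(folio);

		/*
		 * nr is 1 for the existing order-0 case, so drivers that
		 * never opt in to large folios see no behavioural change.
		 * my_drv_free_device_mem() stands in for whatever the
		 * driver uses to return its device memory.
		 */
		my_drv_free_device_mem(folio_pfn(folio), nr);
	}

Drivers that keep calling zone_device_page_init() continue to get order-0
folios, so their existing ->page_free() implementations work unchanged.
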
Balbir
