On 9/19/25 23:26, Zi Yan wrote:
> On 19 Sep 2025, at 1:01, Balbir Singh wrote:
>
>> On 9/18/25 12:49, Zi Yan wrote:
>>> On 16 Sep 2025, at 8:21, Balbir Singh wrote:
>>>
>>>> Add routines to support allocation of large order zone device folios
>>>> and helper functions for zone device folios, to check if a folio is
>>>> device private and helpers for setting zone device data.
>>>>
>>>> When large folios are used, the existing page_free() callback in
>>>> pgmap is called when the folio is freed; this is true for both
>>>> PAGE_SIZE and higher order pages.
>>>>
>>>> Zone device private large folios do not support deferred split and
>>>> scan like normal THP folios.
>>>>
>>>> Signed-off-by: Balbir Singh <[email protected]>
>>>> Cc: David Hildenbrand <[email protected]>
>>>> Cc: Zi Yan <[email protected]>
>>>> Cc: Joshua Hahn <[email protected]>
>>>> Cc: Rakie Kim <[email protected]>
>>>> Cc: Byungchul Park <[email protected]>
>>>> Cc: Gregory Price <[email protected]>
>>>> Cc: Ying Huang <[email protected]>
>>>> Cc: Alistair Popple <[email protected]>
>>>> Cc: Oscar Salvador <[email protected]>
>>>> Cc: Lorenzo Stoakes <[email protected]>
>>>> Cc: Baolin Wang <[email protected]>
>>>> Cc: "Liam R. Howlett" <[email protected]>
>>>> Cc: Nico Pache <[email protected]>
>>>> Cc: Ryan Roberts <[email protected]>
>>>> Cc: Dev Jain <[email protected]>
>>>> Cc: Barry Song <[email protected]>
>>>> Cc: Lyude Paul <[email protected]>
>>>> Cc: Danilo Krummrich <[email protected]>
>>>> Cc: David Airlie <[email protected]>
>>>> Cc: Simona Vetter <[email protected]>
>>>> Cc: Ralph Campbell <[email protected]>
>>>> Cc: Mika Penttilä <[email protected]>
>>>> Cc: Matthew Brost <[email protected]>
>>>> Cc: Francois Dugast <[email protected]>
>>>> ---
>>>>  include/linux/memremap.h | 10 +++++++++-
>>>>  mm/memremap.c            | 34 +++++++++++++++++++++-------------
>>>>  mm/rmap.c                |  6 +++++-
>>>>  3 files changed, 35 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>> index e5951ba12a28..9c20327c2be5 100644
>>>> --- a/include/linux/memremap.h
>>>> +++ b/include/linux/memremap.h
>>>> @@ -206,7 +206,7 @@ static inline bool is_fsdax_page(const struct page *page)
>>>>  }
>>>>
>>>>  #ifdef CONFIG_ZONE_DEVICE
>>>> -void zone_device_page_init(struct page *page);
>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order);
>>>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>>>  void memunmap_pages(struct dev_pagemap *pgmap);
>>>>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>>>> @@ -215,6 +215,14 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn);
>>>>  bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
>>>>
>>>>  unsigned long memremap_compat_align(void);
>>>> +
>>>> +static inline void zone_device_page_init(struct page *page)
>>>> +{
>>>> +	struct folio *folio = page_folio(page);
>>>> +
>>>> +	zone_device_folio_init(folio, 0);
>>>
>>> I assume this is for legacy code, where only non-compound pages exist?
>>>
>>> It seems that you assume @page is always order-0, but there is no check
>>> for it. Adding VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio)
>>> above it would be useful to detect misuse.
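For reference, with that check added, the order-0 wrapper would read as
below (untested sketch against this patch, using only helpers the patch
already introduces):

static inline void zone_device_page_init(struct page *page)
{
	struct folio *folio = page_folio(page);

	/* Legacy order-0 entry point; catch compound pages early. */
	VM_WARN_ON_ONCE_FOLIO(folio_order(folio) != 0, folio);
	zone_device_folio_init(folio, 0);
}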
>>>
>>>> +}
>>>> +
>>>>  #else
>>>>  static inline void *devm_memremap_pages(struct device *dev,
>>>>  			struct dev_pagemap *pgmap)
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index 46cb1b0b6f72..a8481ebf94cc 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -416,20 +416,19 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>>>>  void free_zone_device_folio(struct folio *folio)
>>>>  {
>>>>  	struct dev_pagemap *pgmap = folio->pgmap;
>>>> +	unsigned long nr = folio_nr_pages(folio);
>>>> +	int i;
>>>>
>>>>  	if (WARN_ON_ONCE(!pgmap))
>>>>  		return;
>>>>
>>>>  	mem_cgroup_uncharge(folio);
>>>>
>>>> -	/*
>>>> -	 * Note: we don't expect anonymous compound pages yet. Once supported
>>>> -	 * and we could PTE-map them similar to THP, we'd have to clear
>>>> -	 * PG_anon_exclusive on all tail pages.
>>>> -	 */
>>>>  	if (folio_test_anon(folio)) {
>>>> -		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -		__ClearPageAnonExclusive(folio_page(folio, 0));
>>>> +		for (i = 0; i < nr; i++)
>>>> +			__ClearPageAnonExclusive(folio_page(folio, i));
>>>> +	} else {
>>>> +		VM_WARN_ON_ONCE(folio_test_large(folio));
>>>>  	}
>>>>
>>>>  	/*
>>>> @@ -456,8 +455,8 @@ void free_zone_device_folio(struct folio *folio)
>>>>  	case MEMORY_DEVICE_COHERENT:
>>>>  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->page_free))
>>>>  			break;
>>>> -		pgmap->ops->page_free(folio_page(folio, 0));
>>>> -		put_dev_pagemap(pgmap);
>>>> +		pgmap->ops->page_free(&folio->page);
>>>> +		percpu_ref_put_many(&folio->pgmap->ref, nr);
>>>>  		break;
>>>>
>>>>  	case MEMORY_DEVICE_GENERIC:
>>>> @@ -480,14 +479,23 @@ void free_zone_device_folio(struct folio *folio)
>>>>  	}
>>>>  }
>>>>
>>>> -void zone_device_page_init(struct page *page)
>>>> +void zone_device_folio_init(struct folio *folio, unsigned int order)
>>>>  {
>>>> +	struct page *page = folio_page(folio, 0);
>>>
>>> It is strange to see a folio converted back to a page in
>>> a function called zone_device_folio_init().
>>>
>>>> +
>>>> +	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>>> +
>>>>  	/*
>>>>  	 * Drivers shouldn't be allocating pages after calling
>>>>  	 * memunmap_pages().
>>>>  	 */
>>>> -	WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
>>>> -	set_page_count(page, 1);
>>>> +	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>>>> +	folio_set_count(folio, 1);
>>>>  	lock_page(page);
>>>> +
>>>> +	if (order > 1) {
>>>> +		prep_compound_page(page, order);
>>>> +		folio_set_large_rmappable(folio);
>>>> +	}
>>>
>>> OK, so basically, @folio is not a compound page yet when
>>> zone_device_folio_init() is called.
>>>
>>> I feel that your zone_device_page_init() and zone_device_folio_init()
>>> implementations are inverse. They should follow the same pattern
>>> as __alloc_pages_noprof() and __folio_alloc_noprof(), where
>>> zone_device_page_init() does the actual initialization and
>>> zone_device_folio_init() just converts a page to a folio.
>>>
>>> Something like:
>>>
>>> void zone_device_page_init(struct page *page, unsigned int order)
>>> {
>>> 	VM_WARN_ON_ONCE(order > MAX_ORDER_NR_PAGES);
>>>
>>> 	/*
>>> 	 * Drivers shouldn't be allocating pages after calling
>>> 	 * memunmap_pages().
>>> 	 */
>>> 	WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order));
>>>
>>> 	/*
>>> 	 * Anonymous folios do not support order-1; high-order file-backed
>>> 	 * folios are not supported at all.
>>> 	 */
>>> 	VM_WARN_ON_ONCE(order == 1);
>>>
>>> 	if (order > 1)
>>> 		prep_compound_page(page, order);
>>>
>>> 	/* page has to be compound head here */
>>> 	set_page_count(page, 1);
>>> 	lock_page(page);
>>> }
>>>
>>> void zone_device_folio_init(struct folio *folio, unsigned int order)
>>> {
>>> 	struct page *page = folio_page(folio, 0);
>>>
>>> 	zone_device_page_init(page, order);
>>> 	page_rmappable_folio(page);
>>> }
>>>
>>> Or
>>>
>>> struct folio *zone_device_folio_init(struct page *page, unsigned int order)
>>> {
>>> 	zone_device_page_init(page, order);
>>> 	return page_rmappable_folio(page);
>>> }
>>>
>>> Then, when it comes to free_zone_device_folio() above, I feel that
>>> pgmap->ops->page_free() should take an additional order parameter
>>> to free a compound page, like free_frozen_pages().
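For illustration, that suggestion would look roughly like the below
(hypothetical sketch only, not part of this patch; the other callbacks
would be unchanged):

/* Hypothetical: let a driver free a compound page in a single call. */
struct dev_pagemap_ops {
	void (*page_free)(struct page *page, unsigned int order);
	/* ... remaining ops as today ... */
};

/* free_zone_device_folio() would then call: */
pgmap->ops->page_free(folio_page(folio, 0), folio_order(folio));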
>>> This is my impression after reading the patch and the zone device
>>> page code.
>>>
>>> Alistair and David can correct me if this is wrong, since I am new
>>> to zone device page code.
>>>
>>
>> Thanks, I did not want to change zone_device_page_init() for several
>> drivers (outside my test scope) that already assume an order of 0.
>
> But my proposed zone_device_page_init() should still work for order-0
> pages. You just need to change the call sites to pass 0 as the new
> parameter.
>

I did not want to change existing callers (it increases the testing
impact) without a strong reason.

> One strange thing I found in the original zone_device_page_init() is
> the use of page_pgmap() in
> WARN_ON_ONCE(!percpu_ref_tryget_many(&page_pgmap(page)->ref, 1 << order)).
> page_pgmap() calls page_folio() on the given page to access the pgmap
> field, and the pgmap field is only available in struct folio. The code
> initializes a struct page, but in the middle it suddenly finds the
> page is actually a folio, then treats it as a page afterwards. I
> wonder if it can be done better.
>
> This might be a question to Alistair, since he made the change.

I'll let him answer it :)
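For context, page_pgmap() today is exactly that page -> folio hop
(quoting from memory, modulo the exact debug check):

static inline struct dev_pagemap *page_pgmap(const struct page *page)
{
	/* pgmap lives in struct folio, so hop there via page_folio(). */
	VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
	return page_folio(page)->pgmap;
}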
Thanks for the review,
Balbir