zone_device: Reinitialize large zone device private folios

Alistair Popple Sun, 18 Jan 2026 21:42:31 -0800

On 2026-01-17 at 11:19 +1100, Jason Gunthorpe <[email protected]> wrote...
> On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> > >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > >> +                /*
> > >> +                 * This pointer math looks odd, but new_page could have 
> > >> been
> > >> +                 * part of a previous higher order folio, which sets 
> > >> _nr_pages
> > >> +                 * in page + 1 (new_page). Therefore, we use pointer 
> > >> casting to
> > >> +                 * correctly locate the _nr_pages bits within new_page 
> > >> which
> > >> +                 * could have modified by previous higher order folio.
> > >> +                 */
> > >> +                ((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >> +#endif
> > > 
> > > This seems too weird, why is it in the loop?  There is only one
> > > _nr_pages per folio.


Yeah, I don't really know what the motivation is for going via the folio
field which needs the odd pointer math versus just setting page->memcg_data
= 0 directly which would work equally well and would have avoided a lot of
confusion.

> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> > and we can't know that at this point so we have to reset everything?
> 
> Er, did I miss something - who reads _nr_pages from a random tail
> page? Doesn't everything working with random tail pages read order,
> compute the head page, cast to folio and then access _nr_pages?
> 
> > Or maybe you mean that stray _nr_pages in some tail page from previous
> > lifetimes can't affect the current lifetime in a wrong way for something
> > looking at said page? I don't know immediately.
> 
> Yes, exactly.
> 
> Basically, what bytes exactly need to be set to what in tail pages for
> the system to work? Those should be set.
> 
> And if we want to have things set on free that's fine too, but there
> should be reasons for doing stuff, and this weird thing above makes
> zero sense.

You can't think of these as tail pages or head pages. They are just random
struct pages, possibly order-0 or PageHead or PageTail, with fields in a
"random" state based on what they were last used for.

All this function should be trying to do is initialising this random state to
something sane as defined by the core-mm for it to consume. Yes, some might
later end up being tail (or head) pages if order > 0 and prep_compound_page()
is called. But the point of this function and the loop is to initialise the
struct page as an order-0 page with "sane" fields to pass to core-mm or call
prep_compound_page() on.

This could for example just use memset(new_page, 0, sizeof(struct page)) and
then refill all the fields correctly (although Vlastimil pointed out some page
flags need preservation). But a big part of the problem is there is no single
definition (AFAIK) of what state a struct page should be in before handing it to
the core-mm via either vm_insert_page()/pages()/etc. or migrate_vma_*() nor what
state the kernel leaves it in once freed.

I would like to see this addressed because it leads to all sorts of weirdness -
for example vm_insert_page() and migrate_vma_*() both require the page refcount
to be 1 for no good reason (drivers usually have to drop it immediately after
the call and they implicitly own the ZONE_DEVICE page lifetimes anyway so why 
make them
hold a reference just to map the page). Yet only migrate_vma_*() requires the
page to be locked (so other ZONE_DEVICE users just have to immediately unlock).

And I presume page->memcg_data must be set to zero, or Matthew wouldn't have
run into problems prompting him to reinit it. But I don't really know what other
requirements there are for setting page fields, they all sort of come implicitly
from the vm_insert_page/migrate_vma APIs.

 - Alistair

> Jason

Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios

Reply via email to