On 2026-01-17 at 11:19 +1100, Jason Gunthorpe <[email protected]> wrote...
> On Fri, Jan 16, 2026 at 08:17:22PM +0100, Vlastimil Babka wrote:
> > >> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > >> +	/*
> > >> +	 * This pointer math looks odd, but new_page could have been
> > >> +	 * part of a previous higher order folio, which sets _nr_pages
> > >> +	 * in page + 1 (new_page). Therefore, we use pointer casting to
> > >> +	 * correctly locate the _nr_pages bits within new_page which
> > >> +	 * could have modified by previous higher order folio.
> > >> +	 */
> > >> +	((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >> +#endif
> > >
> > > This seems too weird, why is it in the loop? There is only one
> > > _nr_pages per folio.

Yeah, I don't really know what the motivation is for going via the folio
field, which needs the odd pointer math, versus just setting
page->memcg_data = 0 directly, which would work equally well and would
have avoided a lot of confusion.

> > I suppose we could be getting say an order-9 folio that was previously used
> > as two order-8 folios? And each of them had their _nr_pages in their head
> > and we can't know that at this point so we have to reset everything?
>
> Er, did I miss something - who reads _nr_pages from a random tail
> page? Doesn't everything working with random tail pages read order,
> compute the head page, cast to folio and then access _nr_pages?
>
> > Or maybe you mean that stray _nr_pages in some tail page from previous
> > lifetimes can't affect the current lifetime in a wrong way for something
> > looking at said page? I don't know immediately.
>
> Yes, exactly.
>
> Basically, what bytes exactly need to be set to what in tail pages for
> the system to work? Those should be set.
>
> And if we want to have things set on free that's fine too, but there
> should be reasons for doing stuff, and this weird thing above makes
> zero sense.

You can't think of these as tail pages or head pages. They are just random
struct pages, possibly order-0 or PageHead or PageTail, with fields in a
"random" state based on what they were last used for. All this function
should be trying to do is initialise this random state to something sane,
as defined by the core-mm, for it to consume. Yes, some might later end up
being tail (or head) pages if order > 0 and prep_compound_page() is called.

But the point of this function and the loop is to initialise the struct
page as an order-0 page with "sane" fields to pass to core-mm or call
prep_compound_page() on. This could for example just use
memset(new_page, 0, sizeof(struct page)) and then refill all the fields
correctly (although Vlastimil pointed out some page flags need
preservation).

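To make the memset-and-refill idea concrete, here is a rough userspace
sketch. It is not kernel code: struct toy_page, TOY_PRESERVED_FLAGS and
both helper names are made up for illustration, the field layout is not
the real struct page, and which flag bits actually need preserving is an
open question here, so the mask is purely a placeholder. The refcount of 1
only mirrors what the insert/migrate APIs currently expect.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for struct page. NOT the kernel layout; fields are illustrative. */
struct toy_page {
	unsigned long flags;
	unsigned long memcg_data;	/* may hold stale bits from an earlier folio */
	int refcount;
};

/* Hypothetical mask of flag bits that must survive reinitialisation. */
#define TOY_PRESERVED_FLAGS 0xff00UL

/* Reset one page to a sane order-0 state, whatever it was last used for. */
static void toy_reinit_page(struct toy_page *page)
{
	unsigned long keep = page->flags & TOY_PRESERVED_FLAGS;

	memset(page, 0, sizeof(*page));	/* wipe every stale field at once */
	page->flags = keep;		/* refill: preserved flag bits */
	page->refcount = 1;		/* refill: refcount the caller expects */
}

/* Reinitialise each page in the range; none is treated as head or tail. */
static void toy_reinit_range(struct toy_page *pages, unsigned long nr)
{
	unsigned long i;

	assert(pages != NULL);
	for (i = 0; i < nr; i++)
		toy_reinit_page(&pages[i]);
}
```

Calling toy_reinit_range() over a range whose second page still carries a
stale value in memcg_data leaves that field zero without any pointer math
relative to a neighbouring page, which is the property being argued for.
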
But a big part of the problem is there is no single definition (AFAIK) of
what state a struct page should be in before handing it to the core-mm via
either vm_insert_page()/pages()/etc. or migrate_vma_*(), nor what state the
kernel leaves it in once freed.

I would like to see this addressed because it leads to all sorts of
weirdness - for example vm_insert_page() and migrate_vma_*() both require
the page refcount to be 1 for no good reason (drivers usually have to drop
it immediately after the call, and they implicitly own the ZONE_DEVICE page
lifetimes anyway, so why make them hold a reference just to map the page).
Yet only migrate_vma_*() requires the page to be locked (so other
ZONE_DEVICE users just have to immediately unlock).

And I presume page->memcg_data must be set to zero, or Matthew wouldn't
have run into problems prompting him to reinit it. But I don't really know
what other requirements there are for setting page fields; they all sort of
come implicitly from the vm_insert_page/migrate_vma APIs.

 - Alistair

> Jason
