zone_device: Add order argument to folio_free callback

Zi Yan Mon, 12 Jan 2026 08:32:18 -0800

On 12 Jan 2026, at 8:45, Jason Gunthorpe wrote:

> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
>>
>>> On 1/12/26 08:35, Matthew Wilcox wrote:
>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
>>>>> The core MM splits the folio before calling folio_free, restoring the
>>>>> zone pages associated with the folio to an initialized state (e.g.,
>>>>> non-compound, pgmap valid, etc...). The order argument represents the
>>>>> folio’s order prior to the split which can be used driver side to know
>>>>> how many pages are being freed.
>>>>
>>>> This really feels like the wrong way to fix this problem.
>>>>
>>
>> Hi Matthew,
>>
>> I think the wording is confusing, since the actual issue is that:
>>
>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
>> 2. but free_zone_device_folio() never reverses course,
>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
>>    be done before driver callback ->folio_free(), since once ->folio_free()
>>    is called, the folio can be reallocated immediately,
>> 4. after the undo of prep_compound_page(), folio_order() can no longer
>>    provide the original order information, thus, folio_free() needs that
>>    for proper device side ref manipulation.
>
> There is something wrong with the driver if the "folio can be
> reallocated immediately".
>
> The flow generally expects there to be a driver allocator linked to
> folio_free():
>
> 1) Allocator finds free memory
> 2) zone_device_page_init() allocates the memory and makes refcount=1
> 3) __folio_put() knows the refcount is 0.
> 4) free_zone_device_folio() calls folio_free(), but it doesn't
>    actually need to undo prep_compound_page() because *NOTHING* can
>    use the page pointer at this point.
> 5) Driver puts the memory back into the allocator and now #1 can
>    happen. It knows how much memory to put back because folio->order
>    is valid from #2
> 6) #1 happens again, then #2 happens again and the folio is in the
>    right state for use. The successor #2 fully undoes the work of the
>    predecessor #2.


But how can a successor #2 undo the work if the second #1 only allocates
half of the original folio? For example, an order-9 at PFN 0 is
allocated and freed, then an order-8 at PFN 0 is allocated and another
order-8 at PFN 256 is allocated. How can two #2s undo the same order-9
without corrupting each other’s data?
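
To make the question concrete, here is a toy userspace model of that order-9 / order-8 example. It is only a sketch: struct toy_page and the toy_*() helpers are hypothetical stand-ins I made up for illustration, not the kernel's struct page, prep_compound_page() or zone_device_page_init(). The only state modelled is which PFN each page believes is its compound head.

```c
#include <assert.h>

#define NR_PAGES 512

/* Toy stand-in for struct page (hypothetical, not the kernel layout). */
struct toy_page {
	int head;	/* PFN of the compound head, -1 if not compound */
	int order;	/* meaningful on the head page only */
};

static struct toy_page pages[NR_PAGES];

static void toy_reset_all(void)
{
	for (int i = 0; i < NR_PAGES; i++) {
		pages[i].head = -1;
		pages[i].order = 0;
	}
}

/* Models step #2: form a folio, touching only the pages it covers. */
static void toy_page_init(int pfn, int order)
{
	int nr = 1 << order;

	assert(pfn >= 0 && pfn + nr <= NR_PAGES);
	pages[pfn].order = order;
	for (int i = 0; i < nr; i++)
		pages[pfn + i].head = pfn;
}

/* Models the undo: return every page of the old folio to non-compound. */
static void toy_split(int pfn, int order)
{
	for (int i = 0; i < (1 << order); i++) {
		pages[pfn + i].head = -1;
		pages[pfn + i].order = 0;
	}
}

/* Count pages in [start, start + nr) whose head equals 'head'. */
static int toy_count_head(int start, int nr, int head)
{
	int n = 0;

	for (int i = 0; i < nr; i++)
		if (pages[start + i].head == head)
			n++;
	return n;
}

/* Free without undo, then reallocate only the first half as order-8. */
static int scenario_no_split(void)
{
	toy_reset_all();
	toy_page_init(0, 9);	/* order-9 at PFN 0, later freed as-is */
	toy_page_init(0, 8);	/* successor #2 covers PFNs 0..255 only */
	/* PFNs 256..511 that still claim PFN 0 as their head */
	return toy_count_head(256, 256, 0);
}

/* The undo happens at free time, before the memory can be reused. */
static int scenario_split_on_free(void)
{
	toy_reset_all();
	toy_page_init(0, 9);
	toy_split(0, 9);	/* undo before the driver callback */
	toy_page_init(0, 8);
	return toy_count_head(256, 256, 0);
}

/* Successor #2 at PFN 0 undoes the whole old order-9 after PFN 256 is live. */
static int scenario_successor_undo_clobbers(void)
{
	toy_reset_all();
	toy_page_init(0, 9);
	toy_page_init(256, 8);	/* second half reallocated first */
	toy_split(0, 9);	/* late undo of the stale order-9 */
	toy_page_init(0, 8);
	/* PFNs 256..511 that lost their valid head (should be 256) */
	return 256 - toy_count_head(256, 256, 256);
}
```

In this model, if nothing is undone at free time, the second half keeps pointing at PFN 0 until someone reinitializes it. If the successor at PFN 0 undoes the full old order instead, it wipes the state of the order-8 folio already living at PFN 256. Undoing at free time avoids both, but then the original order has to be handed to the driver some other way.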


>
> If you have races where #1 can happen immediately after #3 then the
> driver design is fundamentally broken and passing around order isn't
> going to help anything.
>
> If the allocator is using the struct page memory then step #5 should
> also clean up the struct page with the allocator data before returning
> it to the allocator.

Do you mean ->folio_free() callback should undo prep_compound_page()
instead?

>
> I vaguely remember talking about this before in the context of the Xe
> driver.. You can't just take an existing VRAM allocator and layer it
> on top of the folios and have it broadly ignore the folio_free
> callback.


Best Regards,
Yan, Zi

Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
