On 9 Jan 2026, at 17:11, Balbir Singh wrote:

> On 1/10/26 07:43, Zi Yan wrote:
>> On 9 Jan 2026, at 16:34, Balbir Singh wrote:
>>
>>> On 1/10/26 06:15, Zi Yan wrote:
>>>> On 9 Jan 2026, at 15:03, Matthew Brost wrote:
>>>>
>>>>> On Fri, Jan 09, 2026 at 02:23:49PM -0500, Zi Yan wrote:
>>>>>> On 9 Jan 2026, at 14:08, Matthew Brost wrote:
>>>>>>
>>>>>>> On Fri, Jan 09, 2026 at 01:53:33PM -0500, Zi Yan wrote:
>>>>>>>> On 9 Jan 2026, at 13:26, Matthew Brost wrote:
>>>>>>>>
>>>>>>>>> On Fri, Jan 09, 2026 at 12:28:22PM -0500, Zi Yan wrote:
>>>>>>>>>> On 9 Jan 2026, at 6:09, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 1/9/26 10:54, Francois Dugast wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From: Matthew Brost <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>> Split device-private and coherent folios into individual pages
>>>>>>>>>>>> before
>>>>>>>>>>>> freeing so that any order folio can be formed upon the next use of
>>>>>>>>>>>> the
>>>>>>>>>>>> pages.
>>>>>>>>>>>>
>>>>>>>>>>>> Cc: Balbir Singh <[email protected]>
>>>>>>>>>>>> Cc: Alistair Popple <[email protected]>
>>>>>>>>>>>> Cc: Zi Yan <[email protected]>
>>>>>>>>>>>> Cc: David Hildenbrand <[email protected]>
>>>>>>>>>>>> Cc: Oscar Salvador <[email protected]>
>>>>>>>>>>>> Cc: Andrew Morton <[email protected]>
>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>> Cc: [email protected]
>>>>>>>>>>>> Signed-off-by: Matthew Brost <[email protected]>
>>>>>>>>>>>> Signed-off-by: Francois Dugast <[email protected]>
>>>>>>>>>>>> ---
>>>>>>>>>>>> mm/memremap.c | 2 ++
>>>>>>>>>>>> 1 file changed, 2 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>>>>>>>>> index 63c6ab4fdf08..7289cdd6862f 100644
>>>>>>>>>>>> --- a/mm/memremap.c
>>>>>>>>>>>> +++ b/mm/memremap.c
>>>>>>>>>>>> @@ -453,6 +453,8 @@ void free_zone_device_folio(struct folio
>>>>>>>>>>>> *folio)
>>>>>>>>>>>> case MEMORY_DEVICE_COHERENT:
>>>>>>>>>>>> if (WARN_ON_ONCE(!pgmap->ops ||
>>>>>>>>>>>> !pgmap->ops->folio_free))
>>>>>>>>>>>> break;
>>>>>>>>>>>> +
>>>>>>>>>>>> + folio_split_unref(folio);
>>>>>>>>>>>> pgmap->ops->folio_free(folio);
>>>>>>>>>>>> percpu_ref_put_many(&folio->pgmap->ref, nr);
>>>>>>>>>>>> break;
>>>>>>>>>>>
>>>>>>>>>>> This breaks folio_free implementations like nouveau_dmem_folio_free
>>>>>>>>>>> which checks the folio order and act upon that.
>>>>>>>>>>> Maybe add an order parameter to folio_free or let the driver handle
>>>>>>>>>>> the split?
>>>>>>>>>
>>>>>>>>> 'let the driver handle the split?' - I had consisder this as an
>>>>>>>>> option.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Passing an order parameter might be better to avoid exposing core MM
>>>>>>>>>> internals
>>>>>>>>>> by asking drivers to undo compound pages.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It looks like Nouveau tracks free folios and free pages—something Xe’s
>>>>>>>>> device memory allocator (DRM Buddy) cannot do. I guess this answers my
>>>>>>>>> earlier question of how Nouveau avoids hitting the same bug as Xe /
>>>>>>>>> GPU
>>>>>>>>> SVM with respect to reusing folios. It appears Nouveau prefers not to
>>>>>>>>> split the folio, so I’m leaning toward moving this call into the
>>>>>>>>> driver’s folio_free function.
>>>>>>>>
>>>>>>>> No, that creates asymmetric page handling and is error prone.
>>>>>>>>
>>>>>>>
>>>>>>> I agree it is asymmetric and symmetric is likely better.
>>>>>>>
>>>>>>>> In addition, looking at nouveau’s implementation in
>>>>>>>> nouveau_dmem_page_alloc_locked(), it gets a folio from
>>>>>>>> drm->dmem->free_folios,
>>>>>>>> which is never split, and passes it to zone_device_folio_init(). This
>>>>>>>> is wrong, since if the folio is large, it will go through
>>>>>>>> prep_compound_page()
>>>>>>>> again. The bug has not manifested because there is only order-9 large
>>>>>>>> folios.
>>>>>>>> Once mTHP support is added, how is nouveau going to allocate a order-4
>>>>>>>> folio
>>>>>>>> from a free order-9 folio? Maintain a per-order free folio list and
>>>>>>>> reimplement a buddy allocator? Nevertheless, nouveau’s implementation
>>>>>>>
>>>>>>> The way Nouveau handles memory allocations here looks wrong to me—it
>>>>>>> should probably use DRM Buddy and convert a block buddy to pages rather
>>>>>>> than tracking a free folio list and free page list. But this is not my
>>>>>>> driver.
>>>>>>>
>>>>>>>> is wrong by calling prep_compound_page() on a folio (already compound
>>>>>>>> page).
>>>>>>>>
>>>>>>>
>>>>>>> I don’t disagree that this implementation is questionable.
>>>>>>>
>>>>>>> So what’s the suggestion here—add folio order to folio_free just to
>>>>>>> accommodate Nouveau’s rather odd memory allocation algorithm? That
>>>>>>> doesn’t seem right to me either.
>>>>>>
>>>>>> Splitting the folio in free_zone_device_folio() and passing folio order
>>>>>> to folio_free() make sense to me, since after the split, the folio passed
>>>>>
>>>>> If this is concensous / direction - I can do this but a tree wide
>>>>> change.
>>>>>
>>>>> I do have another question for everyone here - do we think this spliting
>>>>> implementation should be considered a Fixes so this can go into 6.19?
>>>>
>>>> IMHO, this should be a fix, since it is wrong to call prep_compound_page()
>>>> on a large folio. IIUC this seems to only affect nouveau now, I will let
>>>> them to decide.
>>>>
>>>
>>> Agreed, free_zone_device_folio() needs to split the folio on put.
>>>
>>>
>>>>>
>>>>>> to folio_free() contains no order information, but just the used-to-be
>>>>>> head page and the remaining 511 pages are free. How does Intel Xe driver
>>>>>> handle it without knowing folio order?
>>>>>>
>>>>>
>>>>> It’s a bit convoluted, but folio/page->zone_device_data points to a
>>>>> reference-counted object in GPU SVM. When the object’s reference count
>>>>> drops to zero, we callback into the driver layer to release the memory.
>>>>> In Xe, this is a TTM BO that resolves to a DRM Buddy allocation, which
>>>>> is then released. If it’s not clear, our original allocation size
>>>>> determines the granularity at which we free the backing store.
>>>>>
>>>>>> Do we really need the order info in ->folio_free() if the folio is split
>>>>>> in free_zone_device_folio()? free_zone_device_folio() should just call
>>>>>> ->folio_free() 2^order times to free individual page.
>>>>>>
>>>>>
>>>>> No. If it’s a higher-order folio—let’s say a 2MB folio—we have one
>>>>> reference to our GPU SVM object, so we can free the backing in a single
>>>>> ->folio_free call.
>>>>>
>>>>> Now, if that folio gets split at some point into 4KB pages, then we’d
>>>>> have 512 references to this object set up in the ->folio_split calls.
>>>>> We’d then expect 512 ->folio_free() calls. Same case here: if, for
>>>>> whatever reason, we can’t create a 2MB device page during a 2MB
>>>>> migration and need to create 512 4KB pages so we'd have 512 references
>>>>> to our GPU SVM object.
>>>>
>>>
>>> I still don't follow why the folio_order does not capture the order of the
>>> folio.
>>> If the folio is split, we should now have 512 split folios for THP
>>
>> folio_order() should return 0 after the folio is split.
>>
>> In terms of the number of after-split folios, it is 512 for current code base
>> since THP is only 2MB in zone devices, but not future proof if mTHP support
>> is added. It also causes confusion in core MM, where folio can have
>> all kinds of orders.
>>
>>
>
> I see that folio_split_unref() to see that there is no driver
> callback during the split. Patch 3 controls the order of
>
> + folio_split_unref(folio);
> pgmap->ops->folio_free(folio);
>
> @Matthew, is there a reason to do the split prior to free?
> pgmap->ops->folio_free(folio)
> shouldn't impact the folio itself, the backing memory can be freed and then
> the
> folio split?

Quoting Matthew from [1]:

... this step must be done before calling folio_free and include
a barrier, as the page can be immediately reallocated.

[1] https://lore.kernel.org/all/[email protected]/

Best Regards,
Yan, Zi
