On 30.05.25 18:54, Vlastimil Babka wrote:
On 5/20/25 18:42, David Hildenbrand wrote:
A long-term goal is supporting frozen PageOffline pages, and later
PageOffline pages that don't have a refcount at all. Some more work for

Looking forward to that :)

:) It's definitely ... challenging.


that is needed -- in particular around non-folio page migration and
memory ballooning drivers -- but let's start by handling PageOffline pages
that can be skipped during memory offlining differently.

Note that PageOffline is used to mark pages that are logically offline
in an otherwise online memory block (e.g., 128 MiB). If a memory
block is offline, the memmap is considered completely uninitialized
and stale (see pfn_to_online_page()).

Let's introduce a PageOffline specific page flag (PG_offline_skippable)
that for now reuses PG_owner_2. In the memdesc future, it will be one of
a small number of per-memdesc flags stored alongside the type.

By setting PG_offline_skippable, a driver indicates that it can
restore the PageOffline state of these specific pages when re-onlining a
memory block: it knows that these pages are supposed to be PageOffline()
without the information in the vmemmap, so it can filter them out and
not expose them to the buddy -> they stay PageOffline().
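
For illustration, a driver fake-offlining pages might do something like the
following (a rough sketch only, not the actual patch; the set/clear helpers
are assumed to mirror the PageOfflineSkippable() check used further below):

/* Hypothetical driver helper, assuming SetPageOfflineSkippable() exists. */
static void drv_fake_offline_pages(struct page *page, unsigned int nr_pages)
{
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		/* mark the page as logically offline ... */
		__SetPageOffline(page + i);
		/* ... and as skippable during memory offlining */
		SetPageOfflineSkippable(page + i);
	}
}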

While PG_offline_offlineable might be clearer, it is also super
confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
So let's use "skippable" for now.

The flag is not supposed to be used for movable PageOffline pages as
used for balloon compaction; movable PageOffline() pages can simply be
migrated during the memory offlining stage, turning the migration
destination page PageOffline() and turning the migration source page
into a free buddy page.

Let's convert the single user from our MEM_GOING_OFFLINE approach
to the new PG_offline_skippable approach: virtio-mem. Fortunately,
this simplifies the code quite a lot. The only corner case we have to
take care of is when force-unloading the virtio-mem driver: we have to
prevent partially-plugged memory blocks from getting offlined by
clearing PG_offline_skippable again.
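
The force-unload handling boils down to something like this (a simplified
sketch, not the actual virtio-mem code; ClearPageOfflineSkippable() is
assumed as the counterpart of the check in the diff below):

/*
 * Hypothetical helper: walk the fake-offline pages of a partially-plugged
 * memory block and clear the marker, so that memory offlining will no
 * longer skip them (and consequently refuse to offline the block).
 */
static void drv_unmark_fake_offline(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long i;

	for (i = 0; i < nr_pages; i++)
		ClearPageOfflineSkippable(pfn_to_page(pfn + i));
}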

What if someone decides to grab a reference on these pages although they
really shouldn't? After all, we'll now keep the refcount at 1 (until we
can properly stop using the refcount completely).

Well, the fallout will be less severe than it currently would be: today,
if someone were to grab a reference to these pages, in MEM_GOING_OFFLINE
we would run into the
                if (WARN_ON(!page_ref_dec_and_test(page)))
                        dump_page(page, "fake-offline page referenced");

And once that unexpected reference would get dropped, we would end up
freeing that page to the buddy: ouch.

Now, we'll allow for offlining that memory, and when that unexpected
reference would get dropped, we would not end up freeing that page to
the buddy. Once we have frozen PageOffline() pages, it will all get a
lot cleaner.
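
Purely to illustrate (not code from the patch): with the refcount now
staying at 1 across memory offlining, a racing pfn walker would do

	if (get_page_unless_zero(page)) {	/* refcount 1 -> 2 */
		/* the page turns out to be PageOffline(), of no interest */
		put_page(page);			/* 2 -> 1, never reaches 0 */
	}

and the fake-offline page can no longer end up in the buddy; with the old
scheme, such a walker could end up holding the last reference, and its
put_page() would have freed the page to the buddy.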

Hmm, a question on that later in the code (assuming I identified the right
place).

Note that we haven't seen the existing WARN_ON trigger so far, because
nobody should ever be referencing such pages.

It's mostly a speculative refcount increase from a pfn walker, such as the
compaction scanner, that can happen due to its inherent raciness.

PFN walkers are in general careful to only try grabbing a reference if the target page is a possible candidate.

E.g., compaction checks for LRU or movable non-folio pages before calling folio_get_nontail_page().

KSM uses MEM_GOING_OFFLINE to pause scanning (and thereby folio_try_get()) while memory is getting offlined, so it won't try accessing any page in the trees that might now be stale.
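
As a sketch of that "careful pfn walker" pattern (simplified and hypothetical,
not taken from compaction or KSM verbatim):

static void walk_range(unsigned long start_pfn, unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_online_page(pfn);
		struct folio *folio;

		/* racy candidate checks, done without holding a reference */
		if (!page || PageOffline(page))
			continue;

		/* speculative reference; fails for tail pages or refcount 0 */
		folio = folio_get_nontail_page(page);
		if (!folio)
			continue;

		/* ... re-check and process the candidate under the reference ... */
		folio_put(folio);
	}
}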


An alternative might be to have another callback chain from memory hotplug
code, where a driver that owns that page could agree to skip the
PageOffline() page. However, we would have to repeatedly issue these
callbacks for individual PageOffline() pages, which does not sound
compelling. As we have spare bits, let's use this simpler approach for
now.

Acked-by: Zi Yan <z...@nvidia.com>
Signed-off-by: David Hildenbrand <da...@redhat.com>

Acked-by: Vlastimil Babka <vba...@suse.cz> # page allocator

I'll leave hotplug to the experts :)

<snip>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6482223e28a2..7e4c41e46a911 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
                        continue;
                }
                /*
-                * At this point all remaining PageOffline() pages have a
-                * reference count of 0 and can simply be skipped.
+                * At this point all remaining PageOffline() pages must be
+                * "skippable" and have exactly one reference.
                 */
                if (PageOffline(page)) {
-                       BUG_ON(page_count(page));
-                       BUG_ON(PageBuddy(page));
+                       WARN_ON_ONCE(!PageOfflineSkippable(page));
+                       WARN_ON_ONCE(page_count(page) != 1);

So is this the part where an unexpected speculative refcount might be
detected? Should be harmless then, as it will just decrease the refcount
from e.g. 2 to 1 and nothing will happen, right?

Yes, unless the reference is held for a longer time, which is not what usually happens: speculative references are dropped immediately if the target page is of no use.

That's assuming that once we pass __offline_isolated_pages(), the following
actions won't modify the refcount and the struct page won't be zeroed or
removed completely (vmemmap). Probably something already prevents that...

Right, we'd have similar problems if the refcount were 0 and someone did a folio_try_get() while we are tearing down or poisoning the vmemmap.

The idea here is that these pages were already PageOffline() for a while (since before memory offlining even started), so any speculative references from pre-PageOffline times should be long gone.

It's still racy, but not more racy than our existing pfn_to_online_page() race, or the race we already had in the old code.

Thanks!

--
Cheers,

David / dhildenb

