On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
> On 22.01.2026 18:38, Roger Pau Monne wrote:
> > Physmap population needs to use pages as big as possible to reduce
> > p2m shattering. However that triggers issues when big enough pages are not
> > yet scrubbed, and so scrubbing must be done at allocation time. In some
> > scenarios with added contention the watchdog can trigger:
> >
> > Watchdog timer detects that CPU55 is stuck!
> > ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
> > CPU: 55
> > RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> > RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
> > [...]
> > Xen call trace:
> > [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
> > [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
> > [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
> > [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
> > [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
> > [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> >
> > Introduce a mechanism to preempt page scrubbing in populate_physmap(). It
> > relies on stashing the dirty page in the domain struct temporarily to
> > preempt to guest context, so the scrubbing can resume when the domain
> > re-enters the hypercall. The added deferral mechanism will only be used for
> > domain construction, and is designed to be used with a single-threaded
> > domain builder. If the toolstack makes concurrent calls to
> > XENMEM_populate_physmap for the same target domain they will trash each
> > other's stashed pages, resulting in slow domain physmap population.
> >
> > Note a similar issue is present in increase_reservation. However that
> > hypercall is likely to only be used once the domain is already running, and
> > the known implementations use 4K pages. It will be dealt with in a separate
> > patch using a different approach, which will also take care of the
> > allocation in populate_physmap() once the domain is running.
> >
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <[email protected]>
> > ---
> > Changes since v2:
> > - Introduce FREE_DOMHEAP_PAGE{,S}().
> > - Remove j local counter.
> > - Free page pending scrub in domain_kill() also.
>
> Yet still not right in domain_unpause_by_systemcontroller() as well. I.e. a
> toolstack action is still needed after the crash to make the memory usable
> again. If you made ...
Oh, I had misread your previous reply; it seemed to me that your
preference was to do it in domain_kill().
> > @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
> >          rspin_barrier(&d->domain_lock);
> >          argo_destroy(d);
> >          vnuma_destroy(d->vnuma);
> > +        /*
> > +         * Attempt to free any pages pending scrub early. Toolstack can still
> > +         * trigger populate_physmap() operations at this point, and hence a
> > +         * final cleanup must be done in _domain_destroy().
> > +         */
> > +        rspin_lock(&d->page_alloc_lock);
> > +        if ( d->pending_scrub )
> > +        {
> > +            FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
> > +            d->pending_scrub_order = 0;
> > +            d->pending_scrub_index = 0;
> > +        }
> > +        rspin_unlock(&d->page_alloc_lock);
>
> ... this into a small helper function (usable even from _domain_destroy(),
> as locking being used doesn't matter there), it would have negligible
> footprint there.
>
> As to the comment, not being a native speaker it still feels to me as if
> moving "early" earlier (after "free") might help parsing of the 1st sentence.
I could also drop "early" from the sentence entirely. I've moved the
comment to the top of the newly introduced helper and reworded it as:
/*
 * Called multiple times during domain destruction to attempt to free early
 * any stashed pages to be scrubbed. The call from _domain_destroy() is done
 * when the toolstack can no longer stash any pages.
 */
Let me know if that's OK.
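
For reference, the resulting helper would look like this (the
free_pending_scrub() name is provisional):

static void free_pending_scrub(struct domain *d)
{
    /* (Comment as reworded above.) */
    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);
}

Both domain_kill() and _domain_destroy() would then call
free_pending_scrub(d), with the locking being harmless in the latter.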
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
> >      a->nr_done = i;
> >  }
> >
> > +/*
> > + * Temporary storage for a domain assigned page that's not been fully
> > + * scrubbed. Stored pages must be domheap ones.
> > + *
> > + * The stashed page can be freed at any time by Xen; the caller must pass
> > + * the order and NUMA node requirement to the fetch function to ensure the
> > + * currently stashed page matches its requirements.
> > + */
> > +static void stash_allocation(struct domain *d, struct page_info *page,
> > +                             unsigned int order, unsigned int scrub_index)
> > +{
> > +    rspin_lock(&d->page_alloc_lock);
> > +
> > +    /*
> > +     * Drop any stashed allocation to accommodate the current one. This
> > +     * interface is designed to be used for single-threaded domain
> > +     * creation.
> > +     */
> > +    if ( d->pending_scrub )
> > +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
>
> Didn't you indicate you'd move the freeing ...
>
> > +    d->pending_scrub_index = scrub_index;
> > +    d->pending_scrub_order = order;
> > +    d->pending_scrub = page;
> > +
> > +    rspin_unlock(&d->page_alloc_lock);
> > +}
> > +
> > +static struct page_info *get_stashed_allocation(struct domain *d,
> > +                                                unsigned int order,
> > +                                                nodeid_t node,
> > +                                                unsigned int *scrub_index)
> > +{
>
> ... into this function?
I could add freeing to get_stashed_allocation(), but it seems
pointless, because the freeing in stash_allocation() will have to stay
in order to deal with concurrent callers. Even if a context frees the
stashed page in get_stashed_allocation(), there's no guarantee the
field will still be free when stash_allocation() is called, as another
concurrent thread might have stashed a page in the meantime.

I think it's best to consistently do it only in stash_allocation(), as
that's clearer.
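
To make the intended shape concrete, this is roughly the fetch side I
have in mind (just a sketch, not verbatim patch code; the exact node
check in particular is illustrative):

static struct page_info *get_stashed_allocation(struct domain *d,
                                                unsigned int order,
                                                nodeid_t node,
                                                unsigned int *scrub_index)
{
    struct page_info *page;

    rspin_lock(&d->page_alloc_lock);

    page = d->pending_scrub;
    if ( page &&
         (d->pending_scrub_order != order ||
          (node != NUMA_NO_NODE &&
           phys_to_nid(page_to_maddr(page)) != node)) )
        /* Mismatch: leave the page stashed, don't free it here. */
        page = NULL;

    if ( page )
    {
        *scrub_index = d->pending_scrub_index;
        d->pending_scrub = NULL;
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);

    return page;
}

IOW a mismatch simply leaves the stash alone, and it's the next
stash_allocation() call that replaces (and frees) it.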
Thanks, Roger.