On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
> On 22.01.2026 18:38, Roger Pau Monne wrote:
> > Physmap population needs to use pages as big as possible to reduce p2m
> > shattering.  However, that triggers issues when big enough pages are not
> > yet scrubbed, and so scrubbing must be done at allocation time.  In some
> > scenarios with added contention the watchdog can trigger:
> > 
> > Watchdog timer detects that CPU55 is stuck!
> > ----[ Xen-4.17.5-21  x86_64  debug=n  Not tainted ]----
> > CPU:    55
> > RIP:    e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
> > RFLAGS: 0000000000000202   CONTEXT: hypervisor (d0v12)
> > [...]
> > Xen call trace:
> >    [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
> >    [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
> >    [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
> >    [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
> >    [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
> >    [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
> > 
> > Introduce a mechanism to preempt page scrubbing in populate_physmap().  It
> > relies on temporarily stashing the dirty page in the domain struct while
> > preempting to guest context, so that scrubbing can resume when the domain
> > re-enters the hypercall.  The added deferral mechanism will only be used
> > for domain construction, and is designed for a single-threaded domain
> > builder.  If the toolstack makes concurrent calls to
> > XENMEM_populate_physmap for the same target domain, they will trash the
> > stashed pages, resulting in slow domain physmap population.
> > 
> > Note a similar issue is present in increase_reservation.  However, that
> > hypercall is likely to only be used once the domain is already running, and
> > the known implementations use 4K pages.  It will be dealt with in a separate
> > patch using a different approach that will also take care of the
> > allocation in populate_physmap() once the domain is running.
> > 
> > Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
> > Signed-off-by: Roger Pau Monné <[email protected]>
> > ---
> > Changes since v2:
> >  - Introduce FREE_DOMHEAP_PAGE{,S}().
> >  - Remove j local counter.
> >  - Free page pending scrub in domain_kill() also.
> 
> Yet still not right in domain_unpause_by_systemcontroller() as well. I.e. a
> toolstack action is still needed after the crash to make the memory usable
> again. If you made ...

Oh, I misread your previous reply; it seemed to me your preference was
to do it in domain_kill().

> > @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
> >          rspin_barrier(&d->domain_lock);
> >          argo_destroy(d);
> >          vnuma_destroy(d->vnuma);
> > +        /*
> > +         * Attempt to free any pages pending scrub early.  Toolstack can still
> > +         * trigger populate_physmap() operations at this point, and hence a
> > +         * final cleanup must be done in _domain_destroy().
> > +         */
> > +        rspin_lock(&d->page_alloc_lock);
> > +        if ( d->pending_scrub )
> > +        {
> > +            FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
> > +            d->pending_scrub_order = 0;
> > +            d->pending_scrub_index = 0;
> > +        }
> > +        rspin_unlock(&d->page_alloc_lock);
> 
> ... this into a small helper function (usable even from _domain_destroy(),
> as the locking used doesn't matter there), it would have negligible
> footprint there.
> 
> As to the comment, not being a native speaker it still feels to me as if
> moving "early" earlier (after "free") might help parsing of the 1st sentence.

I could also drop "early" completely from the sentence.  I've moved
the comment to the top of the newly introduced helper and reworded it
as:

/*
 * Called multiple times during domain destruction, to attempt an early free
 * of any stashed pages pending scrub.  The call from _domain_destroy() is
 * done when the toolstack can no longer stash any pages.
 */

Let me know if that's OK.
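
FWIW, with that comment at its top, the helper would look along the lines
of the below (free_pending_scrub() is just a placeholder name here, and I'm
assuming FREE_DOMHEAP_PAGES(), like the other FREE_* helpers, NULLs the
passed pointer):

static void free_pending_scrub(struct domain *d)
{
    rspin_lock(&d->page_alloc_lock);
    if ( d->pending_scrub )
    {
        /* FREE_DOMHEAP_PAGES() is expected to clear d->pending_scrub. */
        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }
    rspin_unlock(&d->page_alloc_lock);
}

domain_kill() and _domain_destroy() would then both call it, covering both
the point where the toolstack can still stash pages and the final teardown.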

> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
> >      a->nr_done = i;
> >  }
> >  
> > +/*
> > + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> > + * Stored pages must be domheap ones.
> > + *
> > + * The stashed page can be freed at any time by Xen; the caller must pass the
> > + * order and NUMA node requirement to the fetch function to ensure the
> > + * currently stashed page matches its requirements.
> > + */
> > +static void stash_allocation(struct domain *d, struct page_info *page,
> > +                             unsigned int order, unsigned int scrub_index)
> > +{
> > +    rspin_lock(&d->page_alloc_lock);
> > +
> > +    /*
> > +     * Drop any stashed allocation to accommodate the current one.  This
> > +     * interface is designed to be used for single-threaded domain creation.
> > +     */
> > +    if ( d->pending_scrub )
> > +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> 
> Didn't you indicate you'd move the freeing ...
> 
> > +    d->pending_scrub_index = scrub_index;
> > +    d->pending_scrub_order = order;
> > +    d->pending_scrub = page;
> > +
> > +    rspin_unlock(&d->page_alloc_lock);
> > +}
> > +
> > +static struct page_info *get_stashed_allocation(struct domain *d,
> > +                                                unsigned int order,
> > +                                                nodeid_t node,
> > +                                                unsigned int *scrub_index)
> > +{
> 
> ... into this function?

I could add freeing to get_stashed_allocation(), but it seems
pointless, because the freeing in stash_allocation() will have to stay
to deal with concurrent callers.  Even if a context frees the stashed
page in get_stashed_allocation(), there's no guarantee the field will
still be free when stash_allocation() is called, as another concurrent
thread might have stashed a page in the meantime.

I think it's best to do the freeing consistently in stash_allocation()
only, as that's clearer.
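
To illustrate, keeping all the freeing in stash_allocation() leaves the
fetch side as a pure match-and-take operation, roughly like the below (the
NUMA check via phys_to_nid() is only a sketch of the matching logic):

static struct page_info *get_stashed_allocation(struct domain *d,
                                                unsigned int order,
                                                nodeid_t node,
                                                unsigned int *scrub_index)
{
    struct page_info *page = NULL;

    rspin_lock(&d->page_alloc_lock);

    /* Only hand the stash back if it matches the current request. */
    if ( d->pending_scrub && d->pending_scrub_order == order &&
         (node == NUMA_NO_NODE ||
          phys_to_nid(page_to_maddr(d->pending_scrub)) == node) )
    {
        page = d->pending_scrub;
        *scrub_index = d->pending_scrub_index;
        d->pending_scrub = NULL;
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);

    return page;
}

A mismatch simply leaves the stash in place; it then gets replaced (and the
old page freed) by the next stash_allocation() call, so all the freeing
stays in a single place.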

Thanks, Roger.
