On 27.01.2026 11:40, Roger Pau Monné wrote:
> On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
>> On 22.01.2026 18:38, Roger Pau Monne wrote:
>>> Physmap population needs to use pages as big as possible to reduce
>>> p2m shattering. However, that triggers issues when big enough pages are not
>>> yet scrubbed, and so scrubbing must be done at allocation time. In some
>>> scenarios with added contention the watchdog can trigger:
>>>
>>> Watchdog timer detects that CPU55 is stuck!
>>> ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
>>> CPU: 55
>>> RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
>>> RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
>>> [...]
>>> Xen call trace:
>>> [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>>> [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>>> [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>>> [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>>> [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>>> [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
>>>
>>> Introduce a mechanism to preempt page scrubbing in populate_physmap(). It
>>> relies on temporarily stashing the dirty page in the domain struct while
>>> preempting to guest context, so that scrubbing can resume when the domain
>>> re-enters the hypercall. The added deferral mechanism will only be used for
>>> domain construction, and is designed to be used with a single-threaded
>>> domain builder. If the toolstack makes concurrent calls to
>>> XENMEM_populate_physmap for the same target domain, the calls will trash
>>> each other's stashed pages, resulting in slow domain physmap population.
>>>
>>> Note a similar issue is present in increase_reservation(). However, that
>>> hypercall is likely to only be used once the domain is already running, and
>>> the known implementations use 4K pages. It will be dealt with in a separate
>>> patch using a different approach, which will also take care of the
>>> allocation in populate_physmap() once the domain is running.
>>>
>>> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
>>> Signed-off-by: Roger Pau Monné <[email protected]>
>>> ---
>>> Changes since v2:
>>> - Introduce FREE_DOMHEAP_PAGE{,S}().
>>> - Remove j local counter.
>>> - Free page pending scrub in domain_kill() also.
>>
>> Yet it's still not done in domain_unpause_by_systemcontroller() as well. I.e. a
>> toolstack action is still needed after the crash to make the memory usable
>> again. If you made ...
>
> Oh, I've misread your previous reply; it seemed to me that your
> preference was to do it in domain_kill().
I meant for it to (possibly) be kept there, and for the freeing to also be done even earlier.
>>> @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
>>> rspin_barrier(&d->domain_lock);
>>> argo_destroy(d);
>>> vnuma_destroy(d->vnuma);
>>> + /*
>>> + * Attempt to free any pages pending scrub early. Toolstack can still
>>> + * trigger populate_physmap() operations at this point, and hence a
>>> + * final cleanup must be done in _domain_destroy().
>>> + */
>>> + rspin_lock(&d->page_alloc_lock);
>>> + if ( d->pending_scrub )
>>> + {
>>> + FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
>>> + d->pending_scrub_order = 0;
>>> + d->pending_scrub_index = 0;
>>> + }
>>> + rspin_unlock(&d->page_alloc_lock);
>>
>> ... this into a small helper function (usable even from _domain_destroy(),
>> as the locking used doesn't matter there), it would have negligible
>> footprint there.
>>
>> As to the comment, not being a native speaker, it still feels to me as if
>> moving "early" earlier (after "free") might help parsing of the 1st sentence.
>
> I could also drop "early" completely from the sentence. I've moved
> the comment to the top of the newly introduced helper and reworded it
> as:
>
> /*
> * Called multiple times during domain destruction, to attempt an early free
> * of any stashed pages pending scrub. The call from _domain_destroy() is done
> * when the toolstack can no longer stash any pages.
> */
>
> Let me know if that's OK.
Fine with me.
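
To make sure we mean the same shape, a minimal sketch of such a helper
(the name free_pending_scrub() is a guess of mine, not taken from the
patch):

/*
 * Called multiple times during domain destruction, to attempt an early free
 * of any stashed pages pending scrub. The call from _domain_destroy() is
 * done when the toolstack can no longer stash any pages.
 */
static void free_pending_scrub(struct domain *d)
{
    /* Locking is unnecessary for the _domain_destroy() caller, but harmless. */
    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        /*
         * Assuming FREE_DOMHEAP_PAGES(), like FREE_XENHEAP_PAGES(), also
         * clears the passed pointer.
         */
        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);
}

Such a helper could then also be called from
domain_unpause_by_systemcontroller(), as per above.
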
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
>>> a->nr_done = i;
>>> }
>>>
>>> +/*
>>> + * Temporary storage for a domain-assigned page that's not been fully scrubbed.
>>> + * Stored pages must be domheap ones.
>>> + *
>>> + * The stashed page can be freed at any time by Xen; the caller must pass the
>>> + * order and NUMA node requirement to the fetch function to ensure the
>>> + * currently stashed page matches its requirements.
>>> + */
>>> +static void stash_allocation(struct domain *d, struct page_info *page,
>>> + unsigned int order, unsigned int scrub_index)
>>> +{
>>> + rspin_lock(&d->page_alloc_lock);
>>> +
>>> + /*
>>> + * Drop any stashed allocation to accommodate the current one. This
>>> + * interface is designed to be used for single-threaded domain creation.
>>> + */
>>> + if ( d->pending_scrub )
>>> + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
>>
>> Didn't you indicate you'd move the freeing ...
>>
>>> + d->pending_scrub_index = scrub_index;
>>> + d->pending_scrub_order = order;
>>> + d->pending_scrub = page;
>>> +
>>> + rspin_unlock(&d->page_alloc_lock);
>>> +}
>>> +
>>> +static struct page_info *get_stashed_allocation(struct domain *d,
>>> + unsigned int order,
>>> + nodeid_t node,
>>> + unsigned int *scrub_index)
>>> +{
>>
>> ... into this function?
>
> I could add freeing to get_stashed_allocation(), but it seems
> pointless, because the freeing in stash_allocation() will have to stay
> to deal with concurrent callers. Even if a context frees the stashed
> page in get_stashed_allocation(), there's no guarantee the field will
> still be free when stash_allocation() is called, as another concurrent
> thread might have stashed a page in the meantime.
Hmm, yes, yet still ...
> I think it's best to consistently do it only in stash_allocation(), as
> that's clearer.
... no, as (to me) "clearer" is only a secondary criterion here. What I'm
worried about is potentially holding back a 1GB page when the new request is,
say, a 2MB one, and then not having enough memory available just because
of that detained huge page.
In fact, if stash_allocation() finds the field re-populated despite
get_stashed_allocation() having cleared it, it's not quite clear which
of the two allocations should actually be undone. The other vCPU may be
quicker in retrying, so to avoid ping-pong, freeing the new (local)
allocation rather than stashing it might be better. Thoughts?
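
To put that in code (purely illustrative, re-using the patch's naming;
phys_to_nid() is merely one way of expressing the node check):

static struct page_info *get_stashed_allocation(struct domain *d,
                                                unsigned int order,
                                                nodeid_t node,
                                                unsigned int *scrub_index)
{
    struct page_info *pg = NULL;

    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        if ( d->pending_scrub_order == order &&
             (node == NUMA_NO_NODE ||
              phys_to_nid(page_to_maddr(d->pending_scrub)) == node) )
        {
            pg = d->pending_scrub;
            *scrub_index = d->pending_scrub_index;
        }
        else
            /* Don't detain e.g. a 1GB page when a 2MB one is wanted. */
            free_domheap_pages(d->pending_scrub, d->pending_scrub_order);

        d->pending_scrub = NULL;
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);

    return pg;
}

static void stash_allocation(struct domain *d, struct page_info *page,
                             unsigned int order, unsigned int scrub_index)
{
    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        /*
         * Another vCPU stashed a page between our get_stashed_allocation()
         * and here. Free the new (local) allocation rather than the stashed
         * one, to avoid ping-pong.
         */
        rspin_unlock(&d->page_alloc_lock);
        free_domheap_pages(page, order);
        return;
    }

    d->pending_scrub_index = scrub_index;
    d->pending_scrub_order = order;
    d->pending_scrub = page;

    rspin_unlock(&d->page_alloc_lock);
}

The populate_physmap() side would then presumably retry the allocation
upon re-entry when the local page was dropped, with
hypercall_preempt_check() still bounding the scrub work done per
iteration.
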
Jan