On 27.01.2026 11:40, Roger Pau Monné wrote:
> On Mon, Jan 26, 2026 at 12:14:35PM +0100, Jan Beulich wrote:
>> On 22.01.2026 18:38, Roger Pau Monne wrote:
>>> Physmap population needs to use pages as big as possible to reduce
>>> p2m shattering. However, that triggers issues when big enough pages are not
>>> yet scrubbed, and so scrubbing must be done at allocation time. In some
>>> scenarios with added contention the watchdog can trigger:
>>>
>>> Watchdog timer detects that CPU55 is stuck!
>>> ----[ Xen-4.17.5-21 x86_64 debug=n Not tainted ]----
>>> CPU: 55
>>> RIP: e008:[<ffff82d040204c4a>] clear_page_sse2+0x1a/0x30
>>> RFLAGS: 0000000000000202 CONTEXT: hypervisor (d0v12)
>>> [...]
>>> Xen call trace:
>>> [<ffff82d040204c4a>] R clear_page_sse2+0x1a/0x30
>>> [<ffff82d04022a121>] S clear_domain_page+0x11/0x20
>>> [<ffff82d04022c170>] S common/page_alloc.c#alloc_heap_pages+0x400/0x5a0
>>> [<ffff82d04022d4a7>] S alloc_domheap_pages+0x67/0x180
>>> [<ffff82d040226f9f>] S common/memory.c#populate_physmap+0x22f/0x3b0
>>> [<ffff82d040228ec8>] S do_memory_op+0x728/0x1970
>>>
>>> Introduce a mechanism to preempt page scrubbing in populate_physmap(). It
>>> relies on temporarily stashing the dirty page in the domain struct while
>>> preempting to guest context, so that scrubbing can resume when the domain
>>> re-enters the hypercall. The added deferral mechanism will only be used for
>>> domain construction, and is designed to be used with a single-threaded
>>> domain builder. If the toolstack makes concurrent calls to
>>> XENMEM_populate_physmap for the same target domain, the calls will trash
>>> each other's stashed pages, resulting in slow domain physmap population.
>>>
>>> Note a similar issue is present in increase_reservation(). However, that
>>> hypercall is likely to only be used once the domain is already running, and
>>> the known implementations use 4K pages. It will be dealt with in a separate
>>> patch using a different approach, which will also take care of the
>>> allocation in populate_physmap() once the domain is running.
>>>
>>> Fixes: 74d2e11ccfd2 ("mm: Scrub pages in alloc_heap_pages() if needed")
>>> Signed-off-by: Roger Pau Monné <[email protected]>
>>> ---
>>> Changes since v2:
>>> - Introduce FREE_DOMHEAP_PAGE{,S}().
>>> - Remove j local counter.
>>> - Free page pending scrub in domain_kill() also.
>>
>> Yet it's still not done in domain_unpause_by_systemcontroller() as well. I.e. a
>> toolstack action is still needed after the crash to make the memory usable
>> again. If you made ...
>
> Oh, I've misread your previous reply; it seemed to me that your
> preference was to do it in domain_kill().
I meant for it to (possibly) be kept there, and for the freeing to also be done even earlier.
>>> @@ -1286,6 +1293,19 @@ int domain_kill(struct domain *d)
>>> rspin_barrier(&d->domain_lock);
>>> argo_destroy(d);
>>> vnuma_destroy(d->vnuma);
>>> + /*
>>> + * Attempt to free any pages pending scrub early. Toolstack can still
>>> + * trigger populate_physmap() operations at this point, and hence a
>>> + * final cleanup must be done in _domain_destroy().
>>> + */
>>> + rspin_lock(&d->page_alloc_lock);
>>> + if ( d->pending_scrub )
>>> + {
>>> + FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
>>> + d->pending_scrub_order = 0;
>>> + d->pending_scrub_index = 0;
>>> + }
>>> + rspin_unlock(&d->page_alloc_lock);
>>
>> ... this into a small helper function (usable even from _domain_destroy(),
>> as the locking used doesn't matter there), it would have negligible
>> footprint there.
>>
>> As to the comment, not being a native speaker, it still feels to me as if
>> moving "early" earlier (after "free") might help parsing of the 1st sentence.
>
> I could also drop "early" completely from the sentence. I've moved
> the comment to the top of the newly introduced helper and reworded it
> as:
>
> /*
> * Called multiple times during domain destruction, to attempt an early free
> * of any stashed pages pending scrub. The call from _domain_destroy() is done
> * when the toolstack can no longer stash any pages.
> */
>
> Let me know if that's OK.
Fine with me.
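
To make sure we mean the same shape, a minimal sketch of such a helper
(the name free_pending_scrub() is a guess of mine, not taken from the
patch):

/*
 * Called multiple times during domain destruction, to attempt an early free
 * of any stashed pages pending scrub. The call from _domain_destroy() is
 * done when the toolstack can no longer stash any pages.
 */
static void free_pending_scrub(struct domain *d)
{
    /* Locking is unnecessary for the _domain_destroy() caller, but harmless. */
    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        /*
         * Assuming FREE_DOMHEAP_PAGES(), like FREE_XENHEAP_PAGES(), also
         * clears the passed pointer.
         */
        FREE_DOMHEAP_PAGES(d->pending_scrub, d->pending_scrub_order);
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);
}

Such a helper could then also be called from
domain_unpause_by_systemcontroller(), as per above.
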
>>> --- a/xen/common/memory.c
>>> +++ b/xen/common/memory.c
>>> @@ -159,6 +159,66 @@ static void increase_reservation(struct memop_args *a)
>>> a->nr_done = i;
>>> }
>>>
>>> +/*
>>> + * Temporary storage for a domain-assigned page that's not been fully scrubbed.
>>> + * Stored pages must be domheap ones.
>>> + *
>>> + * The stashed page can be freed at any time by Xen; the caller must pass the
>>> + * order and NUMA node requirement to the fetch function to ensure the
>>> + * currently stashed page matches its requirements.
>>> + */
>>> +static void stash_allocation(struct domain *d, struct page_info *page,
>>> + unsigned int order, unsigned int scrub_index)
>>> +{
>>> + rspin_lock(&d->page_alloc_lock);
>>> +
>>> + /*
>>> + * Drop any stashed allocation to accommodate the current one. This
>>> + * interface is designed to be used for single-threaded domain creation.
>>> + */
>>> + if ( d->pending_scrub )
>>> + free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
>>
>> Didn't you indicate you'd move the freeing ...
>>
>>> + d->pending_scrub_index = scrub_index;
>>> + d->pending_scrub_order = order;
>>> + d->pending_scrub = page;
>>> +
>>> + rspin_unlock(&d->page_alloc_lock);
>>> +}
>>> +
>>> +static struct page_info *get_stashed_allocation(struct domain *d,
>>> + unsigned int order,
>>> + nodeid_t node,
>>> + unsigned int *scrub_index)
>>> +{
>>
>> ... into this function?
>
> I could add freeing to get_stashed_allocation(), but it seems
> pointless, because the freeing in stash_allocation() will have to stay
> to deal with concurrent callers. Even if a context frees the stashed
> page in get_stashed_allocation(), there's no guarantee the field will
> still be free when stash_allocation() is called, as another concurrent
> thread might have stashed a page in the meantime.
Hmm, yes, yet still ...
> I think it's best to consistently do it only in stash_allocation(), as
> that's clearer.
... no, as (to me) "clearer" is only a secondary criterion here. What I'm
worried about is potentially holding back a 1GB page when the new request is,
say, a 2MB one, and then not having enough memory available just because
of that detained huge page.
In fact, if stash_allocation() finds the field re-populated despite
get_stashed_allocation() having cleared it, it's not quite clear which
of the two allocations should actually be undone. The other vCPU may be
quicker in retrying, so to avoid ping-pong, freeing the new (local)
allocation rather than stashing it might be better. Thoughts?
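
To put that in code (purely illustrative, re-using the patch's naming;
phys_to_nid() is merely one way of expressing the node check):

static struct page_info *get_stashed_allocation(struct domain *d,
                                                unsigned int order,
                                                nodeid_t node,
                                                unsigned int *scrub_index)
{
    struct page_info *pg = NULL;

    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        if ( d->pending_scrub_order == order &&
             (node == NUMA_NO_NODE ||
              phys_to_nid(page_to_maddr(d->pending_scrub)) == node) )
        {
            pg = d->pending_scrub;
            *scrub_index = d->pending_scrub_index;
        }
        else
            /* Don't detain e.g. a 1GB page when a 2MB one is wanted. */
            free_domheap_pages(d->pending_scrub, d->pending_scrub_order);

        d->pending_scrub = NULL;
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    rspin_unlock(&d->page_alloc_lock);

    return pg;
}

static void stash_allocation(struct domain *d, struct page_info *page,
                             unsigned int order, unsigned int scrub_index)
{
    rspin_lock(&d->page_alloc_lock);

    if ( d->pending_scrub )
    {
        /*
         * Another vCPU stashed a page between our get_stashed_allocation()
         * and here. Free the new (local) allocation rather than the stashed
         * one, to avoid ping-pong.
         */
        rspin_unlock(&d->page_alloc_lock);
        free_domheap_pages(page, order);
        return;
    }

    d->pending_scrub_index = scrub_index;
    d->pending_scrub_order = order;
    d->pending_scrub = page;

    rspin_unlock(&d->page_alloc_lock);
}

The populate_physmap() side would then presumably retry the allocation
upon re-entry when the local page was dropped, with
hypercall_preempt_check() still bounding the scrub work done per
iteration.
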
Jan