On 09.01.2026 11:29, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> In XenServer we have seen the watchdog occasionally triggering during
>>> domain creation if 1GB pages are scrubbed in-place during physmap
>>> population.
>> That's pretty extreme - writing to 1Gb of memory can't really take over 5s,
>> can it?
> 
> Sure it can.

Under what unusual circumstances, or on what extremely slow hardware? (Of
course improperly set MTRRs could cause such, for example.)

>> Is there lock contention involved?
> 
> Almost certainly, and it's probably the more relevant aspect in this case.

Thing is - the scrubbing happens after alloc_heap_pages() has already
dropped the heap lock. And I can't spot the XENMEM_populate_physmap path
to take any locks outward from alloc_heap_pages(). And the domain's
page alloc lock (which in principle should be uncontended anyway unless
the toolstack tries to race with itself) is acquired only later.

If it was a lock contention problem, the first goal ought to be to move
the scrubbing outside of any (potentially contended) locks.

>> Or is this when very many CPUs
>> try to do the same in parallel?
> 
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.
> 
> In this case, even with sufficient scrubbed RAM on other nodes, you need
> to take from the node you claimed on which might need scrubbing.

Much like if there was an exact-node request without involving claims.

> The underlying problem is the need to do a long running operation in a
> context where you cannot continue, and cannot (reasonably) fail.

Right.

Jan
