On 09.01.2026 11:29, Andrew Cooper wrote:
> On 09/01/2026 10:15 am, Jan Beulich wrote:
>> On 08.01.2026 18:55, Roger Pau Monne wrote:
>>> In XenServer we have seen the watchdog occasionally triggering during
>>> domain creation if 1GB pages are scrubbed in-place during physmap
>>> population.
>> That's pretty extreme - writing to 1Gb of memory can't really take over 5s,
>> can it?
>
> Sure it can.

Under what unusual circumstances, or on what extremely slow hardware? (Of
course improperly set MTRRs could cause such, for example.)

>> Is there lock contention involved?
>
> Almost certainly, and it's probably the more relevant aspect in this case.

Thing is - the scrubbing happens after alloc_heap_pages() has already dropped
the heap lock. And I can't spot the XENMEM_populate_physmap path to take any
locks outward from alloc_heap_pages(). And the domain's page alloc lock (which
in principle should be uncontended anyway unless the toolstack tries to race
with itself) is acquired only later.

If it was a lock contention problem, the first goal ought to be to move the
scrubbing outside of any (potentially contended) locks.

>> Or is this when very many CPUs
>> try to do the same in parallel?
>
> The scenario is reboot of a VM when Xapi is doing NUMA placement using
> per-node claims.
>
> In this case, even with sufficient scrubbed RAM on other nodes, you need
> to take from the node you claimed on which might need scrubbing.

Much like if there was an exact-node request without involving claims.

> The underlying problem is the need to do a long running operation in a
> context where you cannot continue, and cannot (reasonably) fail.

Right.

Jan
