> On Nov 3, 2025, at 4:14 PM, Daniel P. Berrangé <[email protected]> wrote:
> 
> On Mon, Nov 03, 2025 at 11:57:50AM -0700, Jon Kohler wrote:
>> Increase MAX_MEM_PREALLOC_THREAD_COUNT from 16 to 32. This was last
>> touched in 2017 [1] and, since then, physical machine sizes and VMs
>> therein have continued to get even bigger, both on average and on the
>> extremes.
>> 
>> For very large VMs, using 16 threads to preallocate memory can be a
>> non-trivial bottleneck during VM start-up and migration. Increasing
>> this limit to 32 threads reduces the time taken for these operations.
>> 
>> Test results from a quad socket Intel 8490H (4x 60 cores) show roughly
>> linear scaling: a ~50% reduction in start-up time from the 2x thread
>> count increase.
>> 
>> ---------------------------------------------
>> Idle Guest w/ 2M HugePages   | Start-up time
>> ---------------------------------------------
>> 240 vCPU, 7.5TB (16 threads) | 2m41.955s
>> ---------------------------------------------
>> 240 vCPU, 7.5TB (32 threads) | 1m19.404s
>> ---------------------------------------------
> 
> If we're configuring a guest with 240 vCPUs, then this implies the admin
> is expecting that the guest will consume up to 240 host CPUs' worth of
> compute time.
> 
> What is the purpose of limiting the number of prealloc threads to a
> value that is an order of magnitude less than the number of vCPUs the
> guest has been given ?

Daniel - thanks for the quick review and thoughts here.

I looked back through the original commits that led up to the current 16
thread max, and it wasn’t immediately clear to me why we clamped it at
16. Perhaps there was some other contention at the time.
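
For reference, the cap in question is just a compile-time ceiling on the
thread count, and the effective clamp boils down to something like the
sketch below (simplified and standalone, with the helper name
prealloc_threads() being mine, not the literal QEMU code):

/*
 * Simplified, standalone sketch of the prealloc thread-count clamp
 * being discussed; not the literal QEMU implementation.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_MEM_PREALLOC_THREAD_COUNT 32   /* was 16 before this change */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Bound the prealloc thread count by the online host CPUs, the
 * caller's requested maximum, and the compile-time ceiling above. */
static int prealloc_threads(int requested_max)
{
    long host_cpus = sysconf(_SC_NPROCESSORS_ONLN);

    if (host_cpus < 1) {
        return 1;
    }
    return MIN(MIN((int)host_cpus, requested_max),
               MAX_MEM_PREALLOC_THREAD_COUNT);
}

int main(void)
{
    /* On the 240-core host above, everything except the ceiling is
     * well past 16, so the constant is the only limiting factor. */
    printf("prealloc threads: %d\n", prealloc_threads(240));
    return 0;
}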

> Have you measured what startup time would look like with 240 prealloc
> threads ? Do we hit some scaling limit before that point making more
> prealloc threads counter-productive ?

I have, and it isn’t wildly better: it comes down to about 50-ish seconds,
as you start running into practical limits on memory bandwidth, as well as
context switching if you’re doing other things on the host at the same
time.
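
If anyone wants to poke at that ceiling outside of QEMU, a throwaway
first-touch benchmark along these lines (my own toy sketch, anonymous
memory rather than 2M hugetlbfs pages, so absolute numbers will differ) is
a quick way to see where the curve flattens:

/*
 * Toy first-touch benchmark, NOT QEMU's prealloc path: mmap an anonymous
 * region, split it across N threads, and time how long it takes to write
 * one byte per page so the kernel has to allocate and zero every page.
 * Build: cc -O2 -pthread -o touch touch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

struct chunk {
    unsigned char *base;
    size_t len;
    size_t pagesz;
};

static void *touch_chunk(void *arg)
{
    struct chunk *c = arg;

    /* One store per page is enough to force allocation + zeroing. */
    for (size_t off = 0; off < c->len; off += c->pagesz) {
        c->base[off] = 1;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    size_t gib = argc > 1 ? strtoull(argv[1], NULL, 0) : 8;
    int nthreads = argc > 2 ? atoi(argv[2]) : 16;
    size_t total = gib << 30;
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

    unsigned char *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    pthread_t tids[nthreads];
    struct chunk chunks[nthreads];
    size_t per = (total / nthreads) & ~(pagesz - 1);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < nthreads; i++) {
        chunks[i].base = mem + (size_t)i * per;
        chunks[i].len = (i == nthreads - 1) ? total - (size_t)i * per : per;
        chunks[i].pagesz = pagesz;
        pthread_create(&tids[i], NULL, touch_chunk, &chunks[i]);
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads: %.2f s to fault in %zu GiB\n", nthreads, secs, gib);

    munmap(mem, total);
    return 0;
}

Run as e.g. "./touch 64 32" for 64 GiB split across 32 threads; once the
memory bandwidth saturates, extra threads mostly just add context switches.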

In playing around with some other values, here’s how they shake out:
32 threads: 1m19s
48 threads: 1m4s
64 threads: 59s
…
240 threads: 50s

This also looks much less exciting when the amount of memory is smaller.
I’m testing with 7.5TB; anything smaller than that sees progressively less
benefit from the extra threads.

Putting that all together, 32 seemed like a sane number with a solid
speedup on fairly modern hardware.

For posterity, I am testing with kernel 6.12 LTS, but could also try
newer kernels if you’re curious.

Most of the time is spent in clear_pages_erms, and outside of an
experimental series on LKML [1] there really aren’t any improvements on
this state of the art.

Also adding Ankur into the mix as the author of that series, as this is
something they’ve been looking at for a while, I believe.

[1] 
https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
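
To spell out the connection: preallocation asks the kernel to allocate and
zero every page up front, which is how all that time lands in page
clearing. A minimal sketch of the populate step (assuming
MADV_POPULATE_WRITE from Linux 5.14+; illustrative only, not the exact
QEMU code path) looks like:

/*
 * Illustrative sketch only, not QEMU's actual prealloc code: populating
 * a mapping up front forces the kernel to allocate and zero every page,
 * which is where the page-clearing time in the profile comes from.
 */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

static int prefault(void *addr, size_t len)
{
#ifdef MADV_POPULATE_WRITE
    /* Linux 5.14+: populate (and zero) all pages in one call. */
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0) {
        return 0;
    }
    perror("madvise(MADV_POPULATE_WRITE)");
#endif
    /* Older kernels/headers would fall back to touching pages by hand. */
    return -1;
}

int main(void)
{
    size_t len = 64UL << 20;   /* 64 MiB demo mapping */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    return prefault(mem, len) ? 1 : 0;
}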

> I guess there could be different impact for hotadd vs cold add. With
> cold startup the vCPU threads are not yet consuming CPU time, so we
> can reasonably consume that resource for prealloc, where as for
> hot-add any prealloc is on top of what vCPUs are already consuming.

