Hi,

On Mon, Mar 09, 2026 at 08:11:45PM +0000, Michael Kelly wrote:
> During sbuilds of haskell packages there are dependent packages installed
> that have a large installed size (ghc-doc for example is ~700M). Often
> during the write of this data, the system seems to enter a blocked
> state. Normal page allocation is suspended and so non-vm privileged tasks,
> including ext2fs servers, soon get blocked if they require more memory. Any
> process accessing file storage is also likely to block on pagein from the
> stalled servers so even the console becomes unresponsive.
> 
> The system is not actually totally stuck. Pageout processing continues at a
> low level. There is no default pager running so only external pages can
> be considered for pageout. Appropriate memory_object_data_return requests are
> issued to external pagers at the rate of approximately 100 per second. The
> CPU load is so low that the virtual machine 'CPU usage' graph superficially
> looks like it is zero. None of these m_o_d_r messages can be handled and
> actually free pages steadily decline.
> 
> I added some debugging to log every 100th pageout attempt from when
> vm_page_alloc_paused becomes set. In one example, free pages steadily drop
> from ~67500 to ~32000 over a period of ~22 minutes. Then suddenly the
> pageout processing comes across a large series of pages (~38000) that can be
> trivially reclaimed which are sufficient to terminate the pageout activity
> and resume normal page allocation. The system becomes usable again.

Wow, cool. What exact patch did you use?

> Might it be that boralus is also behaving this way without it being noticed?
> The use of sync=5 might reduce the likelihood of this occurring, I'd guess,
> but I have also seen this scenario occur using sync=5 myself.

As a data point, the 64-bit Postgres buildfarm animal VM I am running is
also running without mach-defpager and with sync=5. Normal operation is
pretty stable, but when I try to run the TAP tests (which create and
destroy Postgres server instances at high frequency with lots of
I/O), it gets stuck pretty quickly as well. I never had the patience to
let it recover by itself (I assumed it was stuck for good), but I could
try to reproduce it with your debugging code added.


Michael
