> Date: Wed, 31 Dec 2025 06:42:00 +0000
> From: Miod Vallat <[email protected]>
>
> I can still reproduce landisk stalling during a cvs update, after having
> baked a full build and xenocara. This time with the kernel-side
> traceback of the process stuck in pmrwait.
>
> ddb> ps
>    PID     TID   PPID    UID  S       FLAGS  WAIT       COMMAND
>  22735  236842  22266   1500  3    0x100003  pmrwait    ssh
>  22266   24911   8155   1500  3    0x100003  biowait    cvs
>  25498  116159  90374   1500  3    0x100083  ttyin      ksh
>   8155  467717  90374   1500  3    0x10008b  sigsusp    ksh
>  90374  138597  98462   1500  3        0x10  biowait    sshd-session
>  98462  191897  15001      0  3        0x92  kqread     sshd-session
>  57416  152946      1      0  3    0x100003  biowait    getty
>  42265  403899      1      0  3    0x100098  kqread     cron
>  99827  455058      1      0  3    0x100090  kqread     inetd
>  70128  215727   4854     95  3   0x1100092  kqread     smtpd
>   8281  217519   4854    103  3   0x1100092  kqread     smtpd
>  76435   92448   4854     95  3   0x1100092  kqread     smtpd
>   1915  373935   4854     95  3    0x100092  kqread     smtpd
>  91271  501146   4854     95  3   0x1100092  kqread     smtpd
>  36815  115371   4854     95  3   0x1100092  kqread     smtpd
>   4854  296662      1      0  3    0x100080  kqread     smtpd
>  15001  426794      1      0  3        0x88  kqread     sshd
>   5586  268874      0      0  3     0x14280  nfsidl     nfsio
>  86082   74082      0      0  3     0x14280  nfsidl     nfsio
>   4201  354622      0      0  3     0x14280  nfsidl     nfsio
>   9170    1439      0      0  3     0x14280  nfsidl     nfsio
>   7090  276569      1      0  3           0  biowait    ypbind
>  47046  449110      1     28  3   0x1100010  biowait    portmap
>  95989   36830      1      0  3    0x100080  kqread     ntpd
>  53651  165033  76253     83  3    0x100092  kqread     ntpd
>  76253  352661      1     83  3   0x1100092  kqread     ntpd
>  39789  280571   4078     74  3   0x1100092  bpf        pflogd
>   4078   55888      1      0  3        0x80  sbwait     pflogd
>  84079  106662  16372     73  3   0x1100090  kqread     syslogd
>  16372   35927      1      0  3    0x100082  sbwait     syslogd
>  78962  437988  31797    115  3    0x100092  kqread     slaacd
>  85919  478332  31797    115  3    0x100092  kqread     slaacd
>  31797  480049      1      0  3    0x100080  kqread     slaacd
>  27846  437548      0      0  3     0x14200  bored      smr
>  41958  344917      0      0  3     0x14200  pgzero     zerothread
>   6076  129835      0      0  3     0x14200  aiodoned   aiodoned
>  84866  363355      0      0  3     0x14200  syncer     update
>  87134  418449      0      0  3     0x14200  cleaner    cleaner
>  53321  503280      0      0  3     0x14200  reaper     reaper
>  88581   91939      0      0  3     0x14200  pgdaemon   pagedaemon
>  45312  521165      0      0  3     0x14200  usbtsk     usbtask
>  37631  261918      0      0  3     0x14200  usbatsk    usbatsk
>  18147  310057      0      0  3     0x14200  bored      softnet0
>  37766  422281      0      0  3     0x14200  bored      systqmp
>  83868   57728      0      0  3     0x14200  bored      systq
>  62182   91648      0      0  3  0x40014200  tmoslp     softclock
> *  973  455824      0      0  7  0x40014200             idle0
>      1  152093      0      0  3        0x82  wait       init
>      0       0     -1      0  3     0x10200  scheduler  swapper
> ddb> show uvmexp
> Current UVM status:
>   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>   14864 VM pages: 3 active, 1 inactive, 1 wired, 10118 free (1280 zero)
>   freemin=495, free-target=660, inactive-target=661, wired-max=4954
>   faults=145296324, traps=74992724, intrs=26647938, ctxswitch=14048753 fpuswitch=0
>   softint=12776668, syscalls=74992722, kmapent=8
>   fault counts:
>     noram=996715, noanon=0, noamap=0, pgwait=23949, pgrele=0
>     relocks=1867838(14104), upgrades=0(0) anget(retries)=57776611(585752), amapcopy=23259327
>     neighbor anon/obj pg=39185700/89679236, gets(lock/unlock)=26907893/1286410
>     cases: anon=47576129, anoncow=10190414, obj=23737048, prcopy=3166521, przero=60563995
>   daemon and swap counts:
>     woke=1566238, revs=284861851, scans=2683067, obscans=506886, anscans=1811998
>     busy=0, freed=1421072, reactivate=363889, deactivate=8877746
>     pageouts=179395, pending=144550, nswget=539823
>     nswapdev=1
>     swpages=4194415, swpginuse=4625, swpgonly=4621 paging=3
>   kernel pointers:
>     objs(kern)=0x8c3cccfc
> ddb> show bcstats
> Current Buffer Cache status:
> numbufs 183 busymapped 1, delwri 0
> kvaslots 185 avail kva slots 184
> bufpages 678, dmapages 678, dirtypages 0
> pendingreads 1, pendingwrites 1
> highflips 0, highflops 0, dmaflips 0
> ddb> tr /t 0t236842
> mi_switch() at mi_switch+0x8a
> sleep_finish() at sleep_finish+0xb8
> msleep_nsec() at msleep_nsec+0xdc
> uvm_wait_pla() at uvm_wait_pla+0x90
> uvm_pmr_getpages() at uvm_pmr_getpages+0x87a
> km_alloc() at km_alloc+0x24c
> pool_multi_alloc() at pool_multi_alloc+0x7a
> m_pool_alloc() at m_pool_alloc+0x36
> pool_allocator_alloc() at pool_allocator_alloc+0x18
> pool_p_alloc() at pool_p_alloc+0x3e
> pool_do_get() at pool_do_get+0x174
> pool_get() at pool_get+0xba
> m_clget() at m_clget+0x38
> m_getuio() at m_getuio+0xa6
> sosend() at sosend+0x218
> soo_write() at soo_write+0x28
> dofilewritev() at dofilewritev+0x7e
> sys_write() at sys_write+0x40
> syscall() at syscall+0x2ca
> (EXPEVT 160; SSR=00000001) at 0x39dd9a30
So you're allocating a (potentially) largish mbuf cluster. The mbuf pools use &kp_dma_contig, which means they ask for physically contiguous memory. Even with the smallest (2k) clusters I believe the pool allocates "pages" of at least 16k, and larger cluster sizes will ask for much larger pool pages.

I think this means that physical memory got fragmented to the point where it is no longer possible to allocate one of the larger physically contiguous pool pages. This has always been a problem, and there is no easy solution. It is very well possible that the allocation patterns in the kernel changed over time such that fragmentation became more likely, but it is unfair to blame mpi@ for that.
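To make the fragmentation point concrete: the uvmexp output above shows roughly 10000 pages free, yet the thread sleeps in uvm_wait_pla() because uvm_pmr_getpages() cannot find a contiguous run of pages for the pool page. The toy userland program below only illustrates that arithmetic; it is not kernel code, and the page count and the "every other page in use" pattern are invented for the example.

/*
 * Toy illustration (not kernel code): a physically contiguous
 * allocation can fail even though plenty of individual pages are free.
 * Physical memory is modelled as a bitmap of 4k pages with every
 * second page free, and we then look for a run of 4 contiguous free
 * pages (a 16k "pool page").
 */
#include <stdio.h>
#include <stdbool.h>

#define NPAGES	1024		/* pretend physmem: 1024 x 4k = 4M */

static bool page_free[NPAGES];

/* Return the length of the longest run of contiguous free pages. */
static int
longest_free_run(void)
{
	int run = 0, best = 0;

	for (int i = 0; i < NPAGES; i++) {
		run = page_free[i] ? run + 1 : 0;
		if (run > best)
			best = run;
	}
	return best;
}

int
main(void)
{
	int nfree = 0;

	/* Fragment memory: every second page is free, the rest in use. */
	for (int i = 0; i < NPAGES; i++)
		page_free[i] = (i & 1);

	for (int i = 0; i < NPAGES; i++)
		nfree += page_free[i];

	printf("free pages: %d\n", nfree);
	printf("longest contiguous free run: %d pages\n", longest_free_run());
	printf("16k (4-page) contiguous allocation %s\n",
	    longest_free_run() >= 4 ? "would succeed" : "would fail");
	return 0;
}

Built and run, this reports 512 free pages but a longest free run of 1, so the 4-page contiguous request fails; that is essentially the situation uvm_pmr_getpages() is waiting on in the traceback above.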
