Re: High vm scan rate and dropped keystrokes thru X?

2021-07-27 Thread Paul Ripke
On Tue, Jul 27, 2021 at 06:28:39PM +1200, Lloyd Parkes wrote:
> 
> 
> On 27/07/21 12:19 am, Paul Ripke wrote:
> > On Mon, Jul 26, 2021 at 05:53:19PM +1200, Lloyd Parkes wrote:
> > > That's 12GB of RAM in use and 86MB of RAM free. Sounds pretty awful to me.
> > 
> > Sounds normal to me - I don't expect to see any free RAM unless I've just
> > - exited a large process
> > - deleted a large file with large cache footprint
> > - released a large chunk of RAM by other means (mmap, madvise, semctl, etc).
> 
> I haven't run NetBSD on a desktop for a while now, but I still think 12GB is
> a lot of memory in use. Maybe I'll get a new MacBook when they start
> shipping 32GB Apple CPU ones and then put NetBSD on my current MacBook.

There's a bunch of junk running. 3 java processes for 3GiB, mongodb,
postgres, apache, firefox, prusa slicer, and it runs as the local network
router/proxy with all the usual junk running. I also run pkgsrc builds and
netbsd builds, and it handles all that fine.

> > A big chunk of it is in file cache, which is unsurprising when reading
> > thru a 400GiB file...
> 
> Page activity lasts 20s and at 30MB/s that means you should have 600MB of
> file data active. Add 50% for inactive pages and that's still only 900MB.
> I'm willing to bet money that zstd only reads each block of data once
> (sequentially in fact) and so it doesn't need any file data cache at all.
> File metadata is a different matter, but that probably stays active and
> there won't be much of it.

Yes, it's just cache churn due to sequential read I/O. I can cat the file
thru zstd with the same effect. I can even cat the file to /dev/null with
the same issue. Yes, the file data cache is pure cost in this case.
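
For a strictly one-pass read like this, the kernel can at least be asked to
drop pages behind the reader: NetBSD has posix_fadvise(2), and
POSIX_FADV_DONTNEED hints that a range won't be needed again. A rough Python
sketch of a cache-neutral sequential read (illustrative only; this is not
what zstd or cat actually do, and the advice is only a hint the kernel may
ignore):

```python
# Sketch: one-pass sequential read that hints the kernel to evict each
# chunk from the file cache after use (posix_fadvise POSIX_FADV_DONTNEED).
# The call is applied best-effort since it is absent on some platforms.
import os

CHUNK = 1 << 20  # read in 1 MiB chunks

def drain(path):
    """Read `path` start to finish, returning the byte count, while
    advising the kernel that the data will not be re-read."""
    fadvise = getattr(os, "posix_fadvise", None)
    dontneed = getattr(os, "POSIX_FADV_DONTNEED", None)
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            buf = os.read(fd, CHUNK)
            if not buf:
                break
            total += len(buf)
            if fadvise is not None:
                # Evict the range just read from the page cache.
                fadvise(fd, total - len(buf), len(buf), dontneed)
    finally:
        os.close(fd)
    return total
```

If the kernel honours the advice, the file cache stays flat instead of
churning through 400GiB of pages.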

> I suspect that your vm.filemax is set to more memory than you have available
> for the file cache and once that happens anonymous pages start to get
> swapped out. My experience is that while anonymous pages sound unimportant,
> they are in fact the most important pages to keep in RAM. Thinking about it,
> they are the irreplaceable bits of all our running software.
> 
> Try setting vm.filemin=5 and vm.filemax=10. Really. I did it when processing
> vast amounts of files in CVS and it worked for me.

I would agree, except there's basically zero paging activity for the entire
duration. I tried this anyway, and there's no change in behaviour
whatsoever.

> Out of curiosity, what are you doing with zstd? You mentioned backups. Is
> this dump or restore? dump implements its own file cache, which won't help
> with the memory burden.

I just do compressed dumps to an external drive. Doing the dump is fine,
but just reading it back leads to bad performance when the page daemon
goes nuts.

> "top -ores" will tell you what programs are using the most anonymous pages,
> which might help identify where all this memory pressure is coming from.

I know these, but there is no real memory pressure. It's just that normally
the page daemon scans and frees the same number of pages, but for some
reason, at some point, it starts scanning 1M+ pages without freeing any.

-- 
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.


Re: High vm scan rate and dropped keystrokes thru X?

2021-07-26 Thread Lloyd Parkes




On 27/07/21 12:19 am, Paul Ripke wrote:

On Mon, Jul 26, 2021 at 05:53:19PM +1200, Lloyd Parkes wrote:

That's 12GB of RAM in use and 86MB of RAM free. Sounds pretty awful to me.


Sounds normal to me - I don't expect to see any free RAM unless I've just
- exited a large process
- deleted a large file with large cache footprint
- released a large chunk of RAM by other means (mmap, madvise, semctl, etc).


I haven't run NetBSD on a desktop for a while now, but I still think 
12GB is a lot of memory in use. Maybe I'll get a new MacBook when they 
start shipping 32GB Apple CPU ones and then put NetBSD on my current 
MacBook.



A big chunk of it is in file cache, which is unsurprising when reading
thru a 400GiB file...


Page activity lasts 20s and at 30MB/s that means you should have 600MB 
of file data active. Add 50% for inactive pages and that's still only 
900MB. I'm willing to bet money that zstd only reads each block of data 
once (sequentially in fact) and so it doesn't need any file data cache 
at all. File metadata is a different matter, but that probably stays 
active and there won't be much of it.


I suspect that your vm.filemax is set to more memory than you have 
available for the file cache and once that happens anonymous pages start 
to get swapped out. My experience is that while anonymous pages sound 
unimportant, they are in fact the most important pages to keep in RAM. 
Thinking about it, they are the irreplaceable bits of all our running 
software.


Try setting vm.filemin=5 and vm.filemax=10. Really. I did it when 
processing vast amounts of files in CVS and it worked for me.
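
For anyone wanting to try the same: these can be set at runtime with
sysctl -w, or made persistent in /etc/sysctl.conf. The values below are just
the suggestion above, not a tuned recommendation:

```
# /etc/sysctl.conf -- bias the page daemon away from the file cache
vm.filemin=5
vm.filemax=10
```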


Out of curiosity, what are you doing with zstd? You mentioned backups.
Is this dump or restore? dump implements its own file cache, which won't
help with the memory burden.


"top -ores" will tell you what programs are using the most anonymous 
pages, which might help identify where all this memory pressure is 
coming from.


Cheers,
Lloyd


Re: High vm scan rate and dropped keystrokes thru X?

2021-07-26 Thread Paul Ripke
On Mon, Jul 26, 2021 at 11:56:13PM +0900, Izumi Tsutsui wrote:
> > NetBSD 9.2, amd64, 16GiB RAM, quad core + hyperthreading.
> > 
> > I've repeatedly noticed an issue where a large amount of disk reads can
> > result in lost keystrokes, jerky mouse behaviour and other weirdness.
>  :
> > "vmstat 1" during these events shows climbing runqueue, falling free
> > memory, high reclaim rate, very high scan rate, and 8 CPUs worth of
> > system time - and I hear the BIOS spinning up the CPU fan.
> 
> What "vmstat -m" shows?
> 
> if kmem-160 (or kmem-192) has a large number, maybe caused by
> radeondrmkms(4) leaks.
>  https://mail-index.netbsd.org/netbsd-bugs/2021/07/12/msg072460.html

No, no radeon here. To be clear, I don't believe this is a leak. It's
just some intermittently poor behaviour during high cache churn.

ksh$ vmstat -m | sort -k 8nr | head
vcachepl 336 615620540 60809350 2819365 2720482 98883 258691 0  inf0
buf2k   2048  27108120  2543840 1224274 1136951 87323 116413 110
ffsdino2 256 609337370 60181639 1907605 1823135 84470 193820 0  inf0
ffsino   256 608793480 60127246 1906802 1823012 83790 193820 0  inf0
anonpl32 383061100 35856325 43412  3062 40350 42502 0   inf0
ncache   192 124854900 11813042 39948   122 39826 39828 0   inf0
mutex 64 598124920 58941710 200286 164419 35867 51528   0   inf0
bufpl296  24502520  2210437 136128 114245 21883 24008   0   inf  381
buf16k  16384 10464170   983052 143815 126570 17245 22064   1 10
kmem-2048   2048   1866432   169723 35112 26265  8847 12270 0   inf1

'systat vm' shows the system mostly stalled with high sys CPU%, doing
page scans:

   18 usersLoad  6.98  3.39  2.35  Tue Jul 27 09:21:44

Proc:r  d  sCsw  Traps SysCal  Intr   Soft  Fault PAGING   SWAPPING
18  1230   1215 58   4060   978953 58 in  out   in  out
ops  
  82.7% Sy   0.5% Us   0.0% Ni   4.4% In  12.4% Idpages
|||||||||||
=>%%  forks
  fkppw
Anon  8904872  54%   zero 8928  1362 Interrupts   fksvm
Exec   457148   2%   wired  450292   284 TLB shootdownpwait
File  3286872  20%   inact 2338880   100 cpu0 timer   relck
Meta  1371982%   bufs   234364 4 ioapic0 pin 18   rlkok
 (kB)real   swaponly  free   906 ioapic0 pin 16   noram
Active98590041745392 10128   ioapic0 pin 23 3 ndcpy
Namei Sys-cache Proc-cache68 msi1 vec 0 1 fltcp
Calls hits% hits % 22 zfod
  236  234   99   cow
 2048 fmin
  Disks:   seeks   xfers   bytes   %busy 2730 ftarg
 wd0  20253K14.3  itarg
 wd1  18242K13.1  flnan
 cd0   29 pdfre
 cd1  1180584 pdscn
 sd0   1 91K 0.2
   raid0  21273K33.4


-- 
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.


Re: High vm scan rate and dropped keystrokes thru X?

2021-07-26 Thread Izumi Tsutsui
> NetBSD 9.2, amd64, 16GiB RAM, quad core + hyperthreading.
> 
> I've repeatedly noticed an issue where a large amount of disk reads can
> result in lost keystrokes, jerky mouse behaviour and other weirdness.
 :
> "vmstat 1" during these events shows climbing runqueue, falling free
> memory, high reclaim rate, very high scan rate, and 8 CPUs worth of
> system time - and I hear the BIOS spinning up the CPU fan.

What "vmstat -m" shows?

if kmem-160 (or kmem-192) has a large number, maybe caused by
radeondrmkms(4) leaks.
 https://mail-index.netbsd.org/netbsd-bugs/2021/07/12/msg072460.html

---
Izumi Tsutsui


Re: High vm scan rate and dropped keystrokes thru X?

2021-07-26 Thread Paul Ripke
On Mon, Jul 26, 2021 at 05:53:19PM +1200, Lloyd Parkes wrote:
> It has been a very long time since I had to look at UVM stuff, but luckily
> past me posted to
> https://mail-index.netbsd.org/tech-repository/2010/02/01/msg000364.html.
> Well done past me.
> 
> Copying from that post, I was using
>   vm.anonmin = 10
>   vm.filemin = 5
>   vm.execmin = 5
>   vm.anonmax = 90
>   vm.filemax = 10
>   vm.execmax = 30
> 
> 
> On 25/07/21 5:37 pm, Paul Ripke wrote:
> > NetBSD 9.2, amd64, 16GiB RAM, quad core + hyperthreading.
> 
> Sounds normal enough.
> 
> >   procsmemory  page   disks   faults  cpu
> >   r b  avmfre  flt  re  pi   po   fr   sr w0 w1   in   sy  cs us sy id
> >   0 2 12214336  86564 4043   0   0000 66 66 2415 9142 4588 0  3 97
> 
> That's 12GB of RAM in use and 86MB of RAM free. Sounds pretty awful to me.

Sounds normal to me - I don't expect to see any free RAM unless I've just
- exited a large process
- deleted a large file with large cache footprint
- released a large chunk of RAM by other means (mmap, madvise, semctl, etc).

> What does top or vmstat -s say about pages active/inactive and
> anonymous/cached file/cached executable pages? This might give you a hint
> about where all your memory has gone and what it is being used for.

A big chunk of it is in file cache, which is unsurprising when reading
thru a 400GiB file...

From top, around the time things go south - note that these firefox processes
aren't actually that busy, their percentages are normally <2%, but the
percentages spike up during periods of high scan rate. I'm pretty sure this
is just a monitoring artifact.

load averages:  3.08,  2.79,  1.87;   up 41+10:40:47
130 processes: 1 runnable, 124 sleeping, 1 stopped, 4 on CPU
CPU states:  0.2% user,  0.0% nice, 36.8% system,  1.9% interrupt, 60.9% idle
Memory: 8545M Act, 3257M Inact, 441M Wired, 446M Exec, 2842M File, 8120K Free
Swap: 10G Total, 3074M Used, 7166M Free

  PID USERNAME PRI NICE   SIZE   RES STATE  TIME   WCPUCPU COMMAND
0 root 1260 0K   41M CPU/7 25.7H 53.27% 53.27% [system]
 5629 stix  430  3512M  641M parked/4  58:59 29.69% 29.69% firefox
10881 stix  430  3806M  920M parked/5 304:50 14.55% 14.55% firefox
12200 stix  430  3276M  555M parked/2 921:49 12.99% 12.99% firefox
 6583 stix 223033M 6604K CPU/2  0:59 10.74% 10.74% zstd
19767 stix  410  3708M  906M CPU/6274:46  6.05%  6.05% firefox
 1110 root  850   231M   29M select/1 313:02  4.49%  4.49% X
 4227 stix  85060M   32M ttyraw/0  26.6H  4.05%  4.05% systat
28842 stix  850  2791M 1211M psem/5   700:48  1.17%  1.17% java
  981 stix  85078M 4952K select/6  23:28  0.88%  0.88% xterm

Looking at 'vmstat -s' around the time of badness, I don't see anything
obvious standing out - apart from the fact that we apparently have a bug
causing several counters to either go negative or sit up around int64_max...

       4096 bytes per page
          8 page colors
    4055762 pages managed
       2485 pages free
    2031945 pages active
     991997 pages inactive
          0 pages paging
     112918 pages wired
       1999 zero pages
          1 reserve pagedaemon pages
         40 reserve kernel pages
     118829 boot kernel pages
     873053 kernel pool pages
    2280040 anonymous pages
     742869 cached file pages
     114192 cached executable pages
       2048 minimum free pages
       2730 target free pages
    1351920 maximum wired pages
          1 swap devices
    2621439 swap pages
     786650 swap pages in use
    6124373 swap allocations
12709593653 total faults taken
11622482264 traps
 1587775504 device interrupts
10242360957 CPU context switches
 3331649724 software interrupts
51409703491 system calls
    6103141 pagein requests
    1446005 pageout requests
          0 pages swapped in
   11581719 pages swapped out
   19154795 forks total
    8291070 forks blocked parent
    8291070 forks shared address space w
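
(The "negative or around int64_max" values are exactly what 64-bit unsigned
counters look like after wraparound, or when read back through a signed
conversion. A toy Python sketch of the effect - illustrative only, not the
actual kernel or vmstat(1) code:)

```python
# Toy illustration: uint64_t-style counter arithmetic wraps on underflow,
# and reinterpreting the wrapped value as signed makes it print negative.
import struct

U64_MASK = (1 << 64) - 1

def u64_sub(counter, delta):
    """Subtract with 64-bit unsigned wraparound, like C uint64_t."""
    return (counter - delta) & U64_MASK

def as_int64(value):
    """Reinterpret an unsigned 64-bit value as signed (as if a uint64_t
    were printed with a signed format)."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]

wrapped = u64_sub(0, 3)   # a zero counter decremented by 3
print(wrapped)            # 18446744073709551613 -- "up around int64_max"
print(as_int64(wrapped))  # -3 -- "negative"
```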

Re: High vm scan rate and dropped keystrokes thru X?

2021-07-25 Thread Lloyd Parkes
It has been a very long time since I had to look at UVM stuff, but
luckily past me posted to
https://mail-index.netbsd.org/tech-repository/2010/02/01/msg000364.html. 
Well done past me.


Copying from that post, I was using
  vm.anonmin = 10
  vm.filemin = 5
  vm.execmin = 5
  vm.anonmax = 90
  vm.filemax = 10
  vm.execmax = 30


On 25/07/21 5:37 pm, Paul Ripke wrote:

NetBSD 9.2, amd64, 16GiB RAM, quad core + hyperthreading.


Sounds normal enough.


  procsmemory  page   disks   faults  cpu
  r b  avmfre  flt  re  pi   po   fr   sr w0 w1   in   sy  cs us sy id
  0 2 12214336  86564 4043   0   0000 66 66 2415 9142 4588 0  3 97


That's 12GB of RAM in use and 86MB of RAM free. Sounds pretty awful to me.

What does top or vmstat -s say about pages active/inactive and
anonymous/cached file/cached executable pages? This might give you a
hint about where all your memory has gone and what it is being used for.


Cheers,
Lloyd



High vm scan rate and dropped keystrokes thru X?

2021-07-24 Thread Paul Ripke
NetBSD 9.2, amd64, 16GiB RAM, quad core + hyperthreading.

I've repeatedly noticed an issue where a large amount of disk reads can
result in lost keystrokes, jerky mouse behaviour and other weirdness.

On this occasion, I was trying to "zstd -vt" a 400GiB backup archive
from an ffsv2 USB attached HDD. Normally, it hums along at 30MiB/s, and
the system is perfectly capable of other tasks. But occasionally (maybe
even once or twice a minute), the system partially wedges, drops
keystrokes (logged in via X), jerky mouse, and largely unresponsive.

"vmstat 1" during these events shows climbing runqueue, falling free
memory, high reclaim rate, very high scan rate, and 8 CPUs worth of
system time - and I hear the BIOS spinning up the CPU fan.

 procsmemory  page   disks   faults  cpu
 r b  avmfre  flt  re  pi   po   fr   sr w0 w1   in   sy  cs us sy id
 0 2 12214336  86564 4043   0   0000 66 66 2415 9142 4588 0  3 97
 0 2 12246244  54100 4171   0   0000  1  0 2040 16832 4405 1 1 98
 4 2 12277980  21652 4103   0   0000 13  6 2075 13280 4222 1 3 96
 1 2 12264920  36772 3593 730   00 10950 44492 0  0 1934 9351 3595 1  3 96
 1 2 12297212   8880 4043   0   00 1080 1301  0  0 1994 8011 3396 0  1 99
 2 2 12275644  26016 3612 3942  00 11516 52959 7  7 2011 8851 3658 1  2 97
 1 2 12264288  37536 4370 2238  10 10020 44666 1  0 1975 11566 3643 1 3 96
 1 2 12296692  34392 4182   0   00 7347 7534 29 14 2333 16088 4706 1 2 97
 1 3 14096876  43680 1895 40019 00 6029 697167 10 10 1192 9816 2831 1 12 87
 3 2 12283292  18412 3240   0   0000 54 54 2004 8961 3589 1  5 94
 3 2 12260528  42140 2571 11816 00 11055 233685 0 0 1559 7628 2929 0 11 88
 0 2 12292320  14028 4106   0   10  893  928  1  0 2134 11051 3810 1 1 98
 1 2 12277084  26908 2753 4339  00 8147 231620 19 8 1567 17264 3798 1 14 85
 9 1 12274492  28764 2441 22971 00 5247 1078197 10 13 1792 15713 4132 1 16 83
 3 1 16027172  11588 2250 18831 00  242 1078430 195 82 1665 7999 4199 1 17 82
 5 2 13761124  11968  161 6880  00  406 1078429 39 33 734 6303 1888 1 53 46
 14 3 12292180 10920  276 17300 00   17 1078429 10 127 580 9782 1833 0 76 23
 0 3 13678700   8304  543 13567 00   21 1424324 30 13 713 6863 1786 1 43 56
 5 2 16608812   8212   44 17185 00   39 1811100 7 4  553 5491 1479 1 18 82
 7 1 15149532   8196   22 5954  00   11 1078667 35 33 605 4981 1571 0 62 38
 6 2 12453392   80609 9646  001 1078680 100 106 559 5284 1446 0 60 40
 3 2 13072560   8084   23 8158  00   13 1273217 1 1  433 6344 1389 0 79 21
 8 1 16595948   80609 15398 100 1959533 1 1  507 4450 1207 0 44 56
 3 3 15631436   7888   43 9199  00   47 1082027 4 2  339 4153 1121 0 61 39
 4 3 14373172   7940   10 9538  00   15 1078707 0 0  408 4972 1172 0 60 40
 1 6 12424060   7000  206 9925  008 1078463 20 24 628 7176 1881 0 47 53
 8 4 14632192   5888  413 13300 00   43 1662082 13 8 427 8161 1813 1 67 33
 8 2 16117916   6468  405 17990 00   53 1450183 3 1  523 10788 1787 0 69 31
 9 4 16260308   7500   31 10739 00   30 1201978 17 15 541 4683 1284 1 62 38
 10 4 14928620  9044   27 7773  00   14 1078812 5 4  384 4760 1284 1 83 16
 7 3 13699356  10464   30 6783  00   17 1079299 3 2  392 5252 1109 0 82 18
 1 3 13023464   9600  651 9757  00   14 1260070 75 61 920 9891 2786 0 35 64
 0 6 16074608   9512   42 13714 00   32 1841906 2 11 463 5624 1379 0 37 63
 13 1 16224928  9676   33 10763 00   88 1214395 8 12 512 4572 1360 0 32 67
 28 3 14692152  9060   28 4717  000 1079161 1 0  524 3802 967  0 66 34
 3 9 12545240   8108   31 12217 005 1079165 4 2  637 6603 1341 0 44 55
 12 3 13690532  7692   23 12431 008 1427075 48 48 571 4545 1474 0 55 45
 10 3 15619384  7432   15 8081  00   10 1561463 4 2  400 4818 1289 0 66 34
 25 2 16226876  73005 10344 000 1231134 0 0  425 4729 1140 0 98  2
 14 2 16615924  72365 7563  001 1176532 6 6  452 4660 1232 0 99  1
 38 2 15239700  71688 8346  000 1079291 8 7  625 4333 1127 0 99  1
 0 2 12275088  26032 3857   0   10 11140 11261 95 89 2127 131045 5501 3 40 58
 1 2 12265968  37412 3728   0   00 10115 10523 17 15 1770 16068 3847 1 4 95
 1 2 12259936  43424 3906   0   00 8966 9136  9  9 1864 12628 3535 1 2 97
 0 2 12292244  11120 4063   0   0000  3  1 2087 9112 3538 1  2 97
 0 2 12287804  15964 4414   0   00 8524 8609  8  2 1956 8249 3440 1  3 96
 1 1 12282612  20756 3674   0   00 8529 8614 42 29 2126 11998 4235 1 4 95
 1 1 12265308  37156 3602   0   00 11299 11826 0  0 1831 7291 3287 0  3 97
 0 2 12280260  43680 3784   0   00 9141 9345  7  7 1941 11775 3904 0 2 97
 1 2 12290916  12452 3905   0   0000  0  0 1933 5999 3066 0  2 98

I'm wondering if my tweaked vm sysctls might be to blame?

vm.anonmin=30
vm.filemax=20

But they're not a huge departure from default