Re: recurring tstile hangs on -current
On Fri, Jul 05, 2019 at 10:29:24PM +0200, Thomas Klausner wrote:
> From some debugging so far, the cause for the hang seems to be that
> the nvme driver is waiting for an interrupt that doesn't come.
>
> At least once I got it to get unstuck by calling "nvme_intr()" on the
> softc address from ddb.
>  Thomas

Obvious fix.

Index: wizd.8
===================================================================
RCS file: /cvsroot/src/share/man/man8/wizd.8,v
retrieving revision 1.11
diff -u -r1.11 wizd.8
--- wizd.8	12 Jun 2019 09:03:32 -	1.11
+++ wizd.8	5 Jul 2019 21:10:27 -
@@ -48,6 +48,9 @@
 also performs periodic sanity checks on the distribution set lists.
 E-mail alerts will be triggered if the build installs files that
 are marked as obsolete and therefore get automatically removed.
+.Pp
+.Nm
+provides interrupts when the kernel drops them.
 .Sh SIGNALS
 .Nm
 responds to the following signals:
Re: recurring tstile hangs on -current
On Mon, Jul 01, 2019 at 07:02:06AM -, Michael van Elst wrote:
> t...@giga.or.at (Thomas Klausner) writes:
>
> >So it looks like it could be a very extreme slowness instead of a
> >complete deadlock.
>
> When it stops, try to reduce kern.maxvnodes to something low (like 100);
> you can restore it if the machine wakes up.
>
> If this is a memory shortage instead of a deadlock, this will easily
> free kernel memory. You may need to wait 1-2 minutes to let the
> page daemon collect free pages.
>
> If you cannot issue a sysctl command, you can patch the kernel variable
> "desiredvnodes" with DDB instead.

From some debugging so far, the cause for the hang seems to be that
the nvme driver is waiting for an interrupt that doesn't come.

At least once I got it to get unstuck by calling "nvme_intr()" on the
softc address from ddb.
 Thomas
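[For reference, the manual unstick described above amounts to a DDB
session roughly like the following. The softc address here is made up
for illustration; on a real machine it has to be read off from the
nvme0 attach messages or from the stuck LWP's backtrace, and whether a
single call suffices depends on which queue is stuck.]

```
db{0}> call nvme_intr(0xffffe4012be38000)
```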
Re: recurring tstile hangs on -current
hello.

If I were looking at this issue, I'd be looking at the perl process
stuck in bioloc, to see what it's doing. As I understand it, processes
stuck in tstile are a symptom rather than a cause. That is, any
process that is waiting for access to some subsystem in an indirect
manner will show as waiting in tstile, rather than on the actual thing
it's waiting for. Perl, on the other hand, is in bioloc (short for
biolock, I assume), and my question is why? If you can clear that
process, I'm thinking everything else will spring to life and begin
working again.

Just my 2 cents.

-thanks
-Brian

On Jun 28, 9:42pm, Thomas Klausner wrote:
} 28391 pbulk    117    0    33M  6240K bioloc/0   0:00  0.00%  0.00% perl
Re: recurring tstile hangs on -current
On Fri, Jun 28, 2019 at 09:42:08PM +0200, Thomas Klausner wrote:
> To reduce the bug surface, I've disconnected the wd0 device which was
> attached at ahcisata. This also removed the swap device, but the
> machine is far from needing to swap.
>
> After ~5 hours the machine is currently hanging in tstile again. I
> noticed the bulk build wasn't progressing (in a perl module) and
> stopped it, then tried removing the work directory to get pbulk to
> break off; that hung the rm process.

I notice you also have a perl process in biowait; this is probably the
one that is really hung. I've seen this occasionally, on various
hardware, but I can't easily reproduce it. It looks like an I/O
request is either never processed, or never biodone()'ed. But I
couldn't spot any race condition in this path ...

--
Manuel Bouyer
     NetBSD: 26 ans d'experience feront toujours la difference
--
Re: recurring tstile hangs on -current
t...@giga.or.at (Thomas Klausner) writes:

>So it looks like it could be a very extreme slowness instead of a
>complete deadlock.

When it stops, try to reduce kern.maxvnodes to something low (like 100);
you can restore it if the machine wakes up.

If this is a memory shortage instead of a deadlock, this will easily
free kernel memory. You may need to wait 1-2 minutes to let the
page daemon collect free pages.

If you cannot issue a sysctl command, you can patch the kernel variable
"desiredvnodes" with DDB instead.

--
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
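[The two variants of the advice above would look roughly like this;
the value 100 is the example from the mail, and the DDB line is a
sketch of ddb(4)'s "write" command — check the modifier for the
variable's width on the actual kernel before relying on it.]

```
# From a shell, if sysctl still responds (note the old value first,
# so it can be restored afterwards):
sysctl kern.maxvnodes
sysctl -w kern.maxvnodes=100

# From DDB, if the shell is gone:
db{0}> write/l desiredvnodes 100
```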
Re: recurring tstile hangs on -current
To reduce the bug surface, I've disconnected the wd0 device which was
attached at ahcisata. This also removed the swap device, but the
machine is far from needing to swap.

After ~5 hours the machine is currently hanging in tstile again. I
noticed the bulk build wasn't progressing (in a perl module) and
stopped it, then tried removing the work directory to get pbulk to
break off; that hung the rm process.

top:

load averages:  0.12,  0.09,  0.08;               up 0+09:10:01    21:39:56
42 processes: 40 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.5% idle
Memory: 8005M Act, 15M Inact, 16M Wired, 23M Exec, 7904M File, 113G Free
Swap:

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
 9833 wiz       85    0    76M   51M ttyraw/3    1:12 11.08% 11.08% systat
    0 root       0    0     0K   94M CPU/31    305:54  0.00%  0.00% [system]
  191 root      85    0    32M 2652K kqueue/1    2:13  0.00%  0.00% syslogd
24125 wiz       85    0    86M 4984K select/1    0:13  0.00%  0.00% sshd
10315 wiz       43    0    24M 2836K CPU/1       0:04  0.00%  0.00% top
  374 root      85    0    30M   16M pause/1     0:01  0.00%  0.00% ntpd
28391 pbulk    117    0    33M 6240K bioloc/0    0:00  0.00%  0.00% perl
 8152 pbulk    117    0    18M 2300K tstile/1    0:00  0.00%  0.00% make
 4194 root     117    0    15M 1612K tstile/0    0:00  0.00%  0.00% rm
12818 pbulk    117    0    15M 1596K tstile/2    0:00  0.00%  0.00% chmod
28169 pbulk    110    0    36M 7376K tstile/1    0:00  0.00%  0.00% perl
 8873 pbulk    110    0    16M 1608K tstile/2    0:00  0.00%  0.00% chmod
 2843 root      85    0    86M 6824K select/3    0:00  0.00%  0.00% sshd
22714 root      85    0    86M 6816K select/5    0:00  0.00%  0.00% sshd
  571 root      85    0    70M 5060K wait/0      0:00  0.00%  0.00% login
13087 wiz       85    0    86M 5028K select/0    0:00  0.00%  0.00% sshd
14025 pbulk     85    0    21M 4812K wait/29     0:00  0.00%  0.00% make
17393 pbulk     85    0    20M 4696K wait/14     0:00  0.00%  0.00% make
  541 postfix   85    0    57M 4568K kqueue/6    0:00  0.00%  0.00% qmgr
24142 postfix   85    0    57M 4548K kqueue/5    0:00  0.00%  0.00% pickup
  404 root      85    0    72M 3424K select/3    0:00  0.00%  0.00% sshd
  386 root      85    0    57M 3316K kqueue/1    0:00  0.00%  0.00% master
 8973 root      85    0    20M 2092K wait/0      0:00  0.00%  0.00% sh
  516 root      85    0    23M 2080K nanosl/6    0:00  0.00%  0.00% cron
18983 pbulk     85    0    21M 1996K wait/14     0:00  0.00%  0.00% sh
  451 root      85    0    16M 1952K pause/0     0:00  0.00%  0.00% ksh
19700 wiz       85    0    19M 1948K pause/3     0:00  0.00%  0.00% ksh
11499 wiz       85    0    17M 1944K pause/31    0:00  0.00%  0.00% ksh
  562 root      85    0    23M 1836K ttyraw/2    0:00  0.00%  0.00% getty
  604 root      85    0    20M 1832K ttyraw/1    0:00  0.00%  0.00% getty
  467 root      85    0    21M 1828K ttyraw/3    0:00  0.00%  0.00% getty
    1 root      85    0    20M 1828K wait/4      0:00  0.00%  0.00% init
  599 root      85    0    26M 1648K kqueue/1    0:00  0.00%  0.00% inetd
  362 root      85    0    20M 1516K kqueue/3    0:00  0.00%  0.00% powerd
11168 pbulk     78    0    37M 6240K bioloc/1    0:00  0.00%  0.00% perl
 8831 pbulk     78    0    33M 6240K biowai/1    0:00  0.00%  0.00% perl
 3270 pbulk     78    0    23M 1960K wait/18     0:00  0.00%  0.00% sh
 8456 pbulk     78    0    20M 1960K wait/3      0:00  0.00%  0.00% sh
15353 pbulk     78    0    20M 1960K wait/0      0:00  0.00%  0.00% sh
 9212 pbulk     78    0    20M 1960K wait/29     0:00  0.00%  0.00% sh
 8360 pbulk     78    0    20M 1960K wait/1      0:00  0.00%  0.00% sh
 2790 pbulk     78    0    20M 1960K wait/29     0:00  0.00%  0.00% sh

systat vm:

    3 users    Load  0.16  0.10  0.09          Fri Jun 28 21:40:13

Proc:r  d  s    Csw  Traps  SysCal  Intr  Soft  Fault     PAGING   SWAPPING
     8  6 25   14566212                                   in  out   in  out
                                                          ops
 0.4% Sy  0.0% Us  0.0% Ni  0.0% In 99.6% Id   pages
|||||||||||
                                                                 forks
                                                                 fkppw
Anon   111464%    zero   64998924    167 Interrupts              fksvm
Exec    23444%    wired     16308        TLB shootdown           pwait
File  8093520 6%  inact     15064    100 cpu0 timer              relck
Meta  2016665%    bufs     406873      3 msix2 vec 0             rlkok
      (kB)real   swaponly    free      3 msix2 vec 1             noram
Active  8197056         118749328        msix2 vec 2             ndcpy
Re: recurring tstile hangs on -current
This time it recovered! It took an hour or so, but the tstile-blocked
processes are now gone (finished) and I got my console shell (with the
rm) back. Nothing in dmesg or /var/log/messages.

So it looks like it could be a very extreme slowness instead of a
complete deadlock.
 Thomas

On Fri, Jun 28, 2019 at 09:42:08PM +0200, Thomas Klausner wrote:
> To reduce the bug surface, I've disconnected the wd0 device which was
> attached at ahcisata. This also removed the swap device, but the
> machine is far from needing to swap.
>
> After ~5 hours the machine is currently hanging in tstile again. I
> noticed the bulk build wasn't progressing (in a perl module) and
> stopped it, then tried removing the work directory to get pbulk to
> break off; that hung the rm process.
>
> top:
>
> load averages:  0.12,  0.09,  0.08;            up 0+09:10:01    21:39:56
> 42 processes: 40 sleeping, 2 on CPU
> CPU states:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.5% idle
> Memory: 8005M Act, 15M Inact, 16M Wired, 23M Exec, 7904M File, 113G Free
> Swap:
>
>   PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
>  9833 wiz       85    0    76M   51M ttyraw/3    1:12 11.08% 11.08% systat
>     0 root       0    0     0K   94M CPU/31    305:54  0.00%  0.00% [system]
>   191 root      85    0    32M 2652K kqueue/1    2:13  0.00%  0.00% syslogd
> 24125 wiz       85    0    86M 4984K select/1    0:13  0.00%  0.00% sshd
> 10315 wiz       43    0    24M 2836K CPU/1       0:04  0.00%  0.00% top
>   374 root      85    0    30M   16M pause/1     0:01  0.00%  0.00% ntpd
> 28391 pbulk    117    0    33M 6240K bioloc/0    0:00  0.00%  0.00% perl
>  8152 pbulk    117    0    18M 2300K tstile/1    0:00  0.00%  0.00% make
>  4194 root     117    0    15M 1612K tstile/0    0:00  0.00%  0.00% rm
> 12818 pbulk    117    0    15M 1596K tstile/2    0:00  0.00%  0.00% chmod
> 28169 pbulk    110    0    36M 7376K tstile/1    0:00  0.00%  0.00% perl
>  8873 pbulk    110    0    16M 1608K tstile/2    0:00  0.00%  0.00% chmod
>  2843 root      85    0    86M 6824K select/3    0:00  0.00%  0.00% sshd
> 22714 root      85    0    86M 6816K select/5    0:00  0.00%  0.00% sshd
>   571 root      85    0    70M 5060K wait/0      0:00  0.00%  0.00% login
> 13087 wiz       85    0    86M 5028K select/0    0:00  0.00%  0.00% sshd
> 14025 pbulk     85    0    21M 4812K wait/29     0:00  0.00%  0.00% make
> 17393 pbulk     85    0    20M 4696K wait/14     0:00  0.00%  0.00% make
>   541 postfix   85    0    57M 4568K kqueue/6    0:00  0.00%  0.00% qmgr
> 24142 postfix   85    0    57M 4548K kqueue/5    0:00  0.00%  0.00% pickup
>   404 root      85    0    72M 3424K select/3    0:00  0.00%  0.00% sshd
>   386 root      85    0    57M 3316K kqueue/1    0:00  0.00%  0.00% master
>  8973 root      85    0    20M 2092K wait/0      0:00  0.00%  0.00% sh
>   516 root      85    0    23M 2080K nanosl/6    0:00  0.00%  0.00% cron
> 18983 pbulk     85    0    21M 1996K wait/14     0:00  0.00%  0.00% sh
>   451 root      85    0    16M 1952K pause/0     0:00  0.00%  0.00% ksh
> 19700 wiz       85    0    19M 1948K pause/3     0:00  0.00%  0.00% ksh
> 11499 wiz       85    0    17M 1944K pause/31    0:00  0.00%  0.00% ksh
>   562 root      85    0    23M 1836K ttyraw/2    0:00  0.00%  0.00% getty
>   604 root      85    0    20M 1832K ttyraw/1    0:00  0.00%  0.00% getty
>   467 root      85    0    21M 1828K ttyraw/3    0:00  0.00%  0.00% getty
>     1 root      85    0    20M 1828K wait/4      0:00  0.00%  0.00% init
>   599 root      85    0    26M 1648K kqueue/1    0:00  0.00%  0.00% inetd
>   362 root      85    0    20M 1516K kqueue/3    0:00  0.00%  0.00% powerd
> 11168 pbulk     78    0    37M 6240K bioloc/1    0:00  0.00%  0.00% perl
>  8831 pbulk     78    0    33M 6240K biowai/1    0:00  0.00%  0.00% perl
>  3270 pbulk     78    0    23M 1960K wait/18     0:00  0.00%  0.00% sh
>  8456 pbulk     78    0    20M 1960K wait/3      0:00  0.00%  0.00% sh
> 15353 pbulk     78    0    20M 1960K wait/0      0:00  0.00%  0.00% sh
>  9212 pbulk     78    0    20M 1960K wait/29     0:00  0.00%  0.00% sh
>  8360 pbulk     78    0    20M 1960K wait/1      0:00  0.00%  0.00% sh
>  2790 pbulk     78    0    20M 1960K wait/29     0:00  0.00%  0.00% sh
>
> systat vm:
>
>     3 users    Load  0.16  0.10  0.09          Fri Jun 28 21:40:13
>
> Proc:r  d  s    Csw  Traps  SysCal  Intr  Soft  Fault    PAGING   SWAPPING
>      8  6 25   14566212                                  in  out   in  out
>                                                          ops
>  0.4% Sy  0.0% Us  0.0% Ni  0.0% In 99.6% Id   pages
> |||||||||||
>
>                                                                forks
>
Re: recurring tstile hangs on -current
With dmesg this time.

On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
>
> I checked some process states in ddb.
>
> "master", the 2 "bjam" and at least one "cp" hanging in tstile have:
> sleepq_block()
> turnstile_block()
> rw_vector_enter()
> genfs_lock()
> VOP_LOCK()
> vn_lock()
> namei_tryemulroot()
> namei()
> check_exec()
> execve_loadvm()
> execve1()
> syscall()
>
> These look quite similar to your backtraces.
>
> The "cp" hanging in biolock has:
> sleepq_block
> cv_timedwait
> bbusy
> getblk
> bio_doread
> ffs_init_vnode
> ffs_newvnode
> vcache_new
> ufs_makeinode
> ufs_create
> VOP_CREATE
> vn_open
> do_open
> do_sys_openat
> sys_open
> syscall
>
> I can't agree with the statement that it's a general -current problem
> -- my current working machine does not have this issue. It "only" has
> 32GB and 12 cores though, and no nvme. dmesg attached.
>
> Do you see the issue on machines without nvme? Just to eliminate that.
> (I wanted to try replacing the nvme boot disk next.)
>  Thomas
>
>
> On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> > Hi Thomas,
> >
> > glad that this is observed elsewhere.
> >
> > Maybe the following bugs could resonate with your observations:
> >
> > kern/54207 [serious/high]:
> >     -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
> >     looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> >
> > kern/54210 [serious/high]:
> >     NetBSD-8 processes presumably not exiting
> >     not tested with -current, but may be there too. (Intel(R) Xeon(R)
> >     Gold 6134 CPU @ 3.20GHz, ~380Gb)
> >
> > At this time I am not too confident that -current is reliably able to do a
> > pkgsrc build, though I have seen occasionally bulk builds that did finish.
> > Most of the time I run into hard lockups with no information about the
> > system state available (no console, no X, no network, no DDB).
> >
> > Frank
> >
> >
> > On 06/28/19 10:46, Thomas Klausner wrote:
> > > Hi!
> > >
> > > I've set up a new machine for bulk building. I have tried various
> > > things, but in the end it always hangs in tstile.
> > >
> > > First try was what I currently use: tmpfs sandboxes with nullfs
> > > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > > nullfs' fault. (The same setup works fine on my current machine.)
> > >
> > > The second try was tmpfs with copied-in /bin, /lib, ... and
> > > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > > hung. So the suspicion was that tmpfs or NFS are broken.
> > >
> > > The last try was building in the root file system, i.e. not even a
> > > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > > finished one pkgsrc build (where some packages didn't build because of
> > > missing distfiles, or because they randomly break like rust). When I
> > > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> > >
> > > I have the following systat output:
> > >
> > >  2 users    Load  8.78  7.19  3.62        Fri Jun 28 04:27:32
> > >
> > > Proc:r  d  s    Csw   Traps  SysCal  Intr  Soft   Fault   PAGING   SWAPPING
> > >     24  1  0   7548  265849  157956  3504  2399  265476   in  out   in  out
> > >                                                           ops
> > > 56.2% Sy  1.2% Us  0.0% Ni  0.0% In 42.5% Id   pages
> > > |||||||||||
> > >                                                         >  670 forks
> > >                                                                 fkppw
> > > Anon   294104%    zero   62161268   5572 Interrupts             fksvm
> > > Exec    14116%    wired     16296   1968 TLB shootdown          pwait
> > > File  24587740 18% inact    43756    100 cpu0 timer             relck
> > > Meta   2606694%   bufs     495676        msi1 vec 0             rlkok
> > >       (kB)real   swaponly    free      9 msix2 vec 0            noram
> > > Active  24835908        100033996      9 msix2 vec 1    57262   ndcpy
> > > Namei    Sys-cache   Proc-cache         msix2 vec 2     27906   fltcp
> > > Calls     hits %      hits %       3427 ioapic1 pin 12  87178   zfod
> > > 125076  122834 98       80 0         59 ioapic2 pin 0   35775   cow
> > >                                         msix7 vec 0      8192   fmin
> > > Disks:  seeks  xfers  bytes  %busy                      10922   ftarg
> > > ld0             1969  16130K  34.8                              itarg
> > > dk0             1969  16130K  34.8                              flnan
> > > wd0
Re: recurring tstile hangs on -current
When I tried to get a core, I saw:

> reboot 0x104
dumping to dev 168,2 (offset=73677660, size=33524130)
dump ahcisata0 port 5: clearing WDCTL_RST failed for drive 0
wddump: device timed out
i/o error
rebooting...

 Thomas

On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
>
> I checked some process states in ddb.
>
> "master", the 2 "bjam" and at least one "cp" hanging in tstile have:
> sleepq_block()
> turnstile_block()
> rw_vector_enter()
> genfs_lock()
> VOP_LOCK()
> vn_lock()
> namei_tryemulroot()
> namei()
> check_exec()
> execve_loadvm()
> execve1()
> syscall()
>
> These look quite similar to your backtraces.
>
> The "cp" hanging in biolock has:
> sleepq_block
> cv_timedwait
> bbusy
> getblk
> bio_doread
> ffs_init_vnode
> ffs_newvnode
> vcache_new
> ufs_makeinode
> ufs_create
> VOP_CREATE
> vn_open
> do_open
> do_sys_openat
> sys_open
> syscall
>
> I can't agree with the statement that it's a general -current problem
> -- my current working machine does not have this issue. It "only" has
> 32GB and 12 cores though, and no nvme. dmesg attached.
>
> Do you see the issue on machines without nvme? Just to eliminate that.
> (I wanted to try replacing the nvme boot disk next.)
>  Thomas
>
>
> On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> > Hi Thomas,
> >
> > glad that this is observed elsewhere.
> >
> > Maybe the following bugs could resonate with your observations:
> >
> > kern/54207 [serious/high]:
> >     -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
> >     looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> >
> > kern/54210 [serious/high]:
> >     NetBSD-8 processes presumably not exiting
> >     not tested with -current, but may be there too. (Intel(R) Xeon(R)
> >     Gold 6134 CPU @ 3.20GHz, ~380Gb)
> >
> > At this time I am not too confident that -current is reliably able to do a
> > pkgsrc build, though I have seen occasionally bulk builds that did finish.
> > Most of the time I run into hard lockups with no information about the
> > system state available (no console, no X, no network, no DDB).
> >
> > Frank
> >
> >
> > On 06/28/19 10:46, Thomas Klausner wrote:
> > > Hi!
> > >
> > > I've set up a new machine for bulk building. I have tried various
> > > things, but in the end it always hangs in tstile.
> > >
> > > First try was what I currently use: tmpfs sandboxes with nullfs
> > > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > > nullfs' fault. (The same setup works fine on my current machine.)
> > >
> > > The second try was tmpfs with copied-in /bin, /lib, ... and
> > > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > > hung. So the suspicion was that tmpfs or NFS are broken.
> > >
> > > The last try was building in the root file system, i.e. not even a
> > > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > > finished one pkgsrc build (where some packages didn't build because of
> > > missing distfiles, or because they randomly break like rust). When I
> > > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> > >
> > > I have the following systat output:
> > >
> > >  2 users    Load  8.78  7.19  3.62        Fri Jun 28 04:27:32
> > >
> > > Proc:r  d  s    Csw   Traps  SysCal  Intr  Soft   Fault   PAGING   SWAPPING
> > >     24  1  0   7548  265849  157956  3504  2399  265476   in  out   in  out
> > >                                                           ops
> > > 56.2% Sy  1.2% Us  0.0% Ni  0.0% In 42.5% Id   pages
> > > |||||||||||
> > >                                                         >  670 forks
> > >                                                                 fkppw
> > > Anon   294104%    zero   62161268   5572 Interrupts             fksvm
> > > Exec    14116%    wired     16296   1968 TLB shootdown          pwait
> > > File  24587740 18% inact    43756    100 cpu0 timer             relck
> > > Meta   2606694%   bufs     495676        msi1 vec 0             rlkok
> > >       (kB)real   swaponly    free      9 msix2 vec 0            noram
> > > Active  24835908        100033996      9 msix2 vec 1    57262   ndcpy
> > > Namei    Sys-cache   Proc-cache         msix2 vec 2     27906   fltcp
> > > Calls     hits %      hits %       3427 ioapic1 pin 12  87178   zfod
> > > 125076  122834 98       80 0         59 ioapic2 pin 0   35775   cow
> > >                                         msix7 vec 0      8192   fmin
> > > Disks:  seeks  xfers  bytes  %busy                      10922   ftarg
> > >
Re: recurring tstile hangs on -current
On Fri, Jun 28, 2019 at 11:44:37AM +0100, Robert Swindells wrote:
>
> Thomas Klausner wrote:
> >I've set up a new machine for bulk building. I have tried various
> >things, but in the end it always hangs in tstile.
> >
> >First try was what I currently use: tmpfs sandboxes with nullfs
> >mounted /bin, /lib, ... When it hung, the suspicion was that it's
> >nullfs' fault. (The same setup works fine on my current machine.)
> >
> >The second try was tmpfs with copied-in /bin, /lib, ... and
> >NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> >hung. So the suspicion was that tmpfs or NFS are broken.
> >
> >The last try was building in the root file system, i.e. not even a
> >sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> >are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> >finished one pkgsrc build (where some packages didn't build because of
> >missing distfiles, or because they randomly break like rust). When I
> >restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
>
> Is a bulk build trying to make packages in parallel ?

No.

> If so, then I'm guessing that the multiple instances of cp(1) are
> putting the package tarball into the same directory somewhere.

I guess this is just the installation of the boost-headers.

Also, I think that multiple programs writing different files in the
same directory should work ;-)
 Thomas
Re: recurring tstile hangs on -current
Thomas Klausner wrote:
>I've set up a new machine for bulk building. I have tried various
>things, but in the end it always hangs in tstile.
>
>First try was what I currently use: tmpfs sandboxes with nullfs
>mounted /bin, /lib, ... When it hung, the suspicion was that it's
>nullfs' fault. (The same setup works fine on my current machine.)
>
>The second try was tmpfs with copied-in /bin, /lib, ... and
>NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
>hung. So the suspicion was that tmpfs or NFS are broken.
>
>The last try was building in the root file system, i.e. not even a
>sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
>are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
>finished one pkgsrc build (where some packages didn't build because of
>missing distfiles, or because they randomly break like rust). When I
>restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.

Is a bulk build trying to make packages in parallel ?

If so, then I'm guessing that the multiple instances of cp(1) are
putting the package tarball into the same directory somewhere.
Re: recurring tstile hangs on -current
Hi Frank!

I checked some process states in ddb.

"master", the 2 "bjam" and at least one "cp" hanging in tstile have:

sleepq_block()
turnstile_block()
rw_vector_enter()
genfs_lock()
VOP_LOCK()
vn_lock()
namei_tryemulroot()
namei()
check_exec()
execve_loadvm()
execve1()
syscall()

These look quite similar to your backtraces.

The "cp" hanging in biolock has:

sleepq_block
cv_timedwait
bbusy
getblk
bio_doread
ffs_init_vnode
ffs_newvnode
vcache_new
ufs_makeinode
ufs_create
VOP_CREATE
vn_open
do_open
do_sys_openat
sys_open
syscall

I can't agree with the statement that it's a general -current problem
-- my current working machine does not have this issue. It "only" has
32GB and 12 cores though, and no nvme. dmesg attached.

Do you see the issue on machines without nvme? Just to eliminate that.
(I wanted to try replacing the nvme boot disk next.)
 Thomas


On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> Hi Thomas,
>
> glad that this is observed elsewhere.
>
> Maybe the following bugs could resonate with your observations:
>
> kern/54207 [serious/high]:
>     -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
>     looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
>
> kern/54210 [serious/high]:
>     NetBSD-8 processes presumably not exiting
>     not tested with -current, but may be there too. (Intel(R) Xeon(R)
>     Gold 6134 CPU @ 3.20GHz, ~380Gb)
>
> At this time I am not too confident that -current is reliably able to do a
> pkgsrc build, though I have seen occasionally bulk builds that did finish.
> Most of the time I run into hard lockups with no information about the
> system state available (no console, no X, no network, no DDB).
>
> Frank
>
>
> On 06/28/19 10:46, Thomas Klausner wrote:
> > Hi!
> >
> > I've set up a new machine for bulk building. I have tried various
> > things, but in the end it always hangs in tstile.
> >
> > First try was what I currently use: tmpfs sandboxes with nullfs
> > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > nullfs' fault. (The same setup works fine on my current machine.)
> >
> > The second try was tmpfs with copied-in /bin, /lib, ... and
> > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > hung. So the suspicion was that tmpfs or NFS are broken.
> >
> > The last try was building in the root file system, i.e. not even a
> > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > finished one pkgsrc build (where some packages didn't build because of
> > missing distfiles, or because they randomly break like rust). When I
> > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> >
> > I have the following systat output:
> >
> >  2 users    Load  8.78  7.19  3.62        Fri Jun 28 04:27:32
> >
> > Proc:r  d  s    Csw   Traps  SysCal  Intr  Soft   Fault   PAGING   SWAPPING
> >     24  1  0   7548  265849  157956  3504  2399  265476   in  out   in  out
> >                                                           ops
> > 56.2% Sy  1.2% Us  0.0% Ni  0.0% In 42.5% Id   pages
> > |||||||||||
> >                                                         >  670 forks
> >                                                                 fkppw
> > Anon   294104%    zero   62161268   5572 Interrupts             fksvm
> > Exec    14116%    wired     16296   1968 TLB shootdown          pwait
> > File  24587740 18% inact    43756    100 cpu0 timer             relck
> > Meta   2606694%   bufs     495676        msi1 vec 0             rlkok
> >       (kB)real   swaponly    free      9 msix2 vec 0            noram
> > Active  24835908        100033996      9 msix2 vec 1    57262   ndcpy
> > Namei    Sys-cache   Proc-cache         msix2 vec 2     27906   fltcp
> > Calls     hits %      hits %       3427 ioapic1 pin 12  87178   zfod
> > 125076  122834 98       80 0         59 ioapic2 pin 0   35775   cow
> >                                         msix7 vec 0      8192   fmin
> > Disks:  seeks  xfers  bytes  %busy                      10922   ftarg
> > ld0             1969  16130K  34.8                              itarg
> > dk0             1969  16130K  34.8                              flnan
> > wd0                                                             pdfre
> > dk1                                                             pdscn
> > dk2
> >
> > and this from top:
> >
> > load averages:  5.13,  6.53,  3.56;               up 1+16:08:05
> >
Re: recurring tstile hangs on -current
Hi Thomas,

glad that this is observed elsewhere.

Maybe the following bugs could resonate with your observations:

kern/54207 [serious/high]:
    -current locks up solidly when pkgsrc building adapta-gtk-theme-3.95.0.11
    looks like a locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)

kern/54210 [serious/high]:
    NetBSD-8 processes presumably not exiting
    not tested with -current, but may be there too. (Intel(R) Xeon(R)
    Gold 6134 CPU @ 3.20GHz, ~380Gb)

At this time I am not too confident that -current is reliably able to do a
pkgsrc build, though I have seen occasionally bulk builds that did finish.
Most of the time I run into hard lockups with no information about the
system state available (no console, no X, no network, no DDB).

Frank


On 06/28/19 10:46, Thomas Klausner wrote:

Hi!

I've set up a new machine for bulk building. I have tried various
things, but in the end it always hangs in tstile.

First try was what I currently use: tmpfs sandboxes with nullfs
mounted /bin, /lib, ... When it hung, the suspicion was that it's
nullfs' fault. (The same setup works fine on my current machine.)

The second try was tmpfs with copied-in /bin, /lib, ... and
NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
hung. So the suspicion was that tmpfs or NFS are broken.

The last try was building in the root file system, i.e. not even a
sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
finished one pkgsrc build (where some packages didn't build because of
missing distfiles, or because they randomly break like rust). When I
restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.

I have the following systat output:

 2 users    Load  8.78  7.19  3.62        Fri Jun 28 04:27:32

Proc:r  d  s    Csw   Traps  SysCal  Intr  Soft   Fault   PAGING   SWAPPING
    24  1  0   7548  265849  157956  3504  2399  265476   in  out   in  out
                                                          ops
56.2% Sy  1.2% Us  0.0% Ni  0.0% In 42.5% Id   pages
|||||||||||
                                                        >  670 forks
                                                                fkppw
Anon   294104%    zero   62161268   5572 Interrupts             fksvm
Exec    14116%    wired     16296   1968 TLB shootdown          pwait
File  24587740 18% inact    43756    100 cpu0 timer             relck
Meta   2606694%   bufs     495676        msi1 vec 0             rlkok
      (kB)real   swaponly    free      9 msix2 vec 0            noram
Active  24835908        100033996      9 msix2 vec 1    57262   ndcpy
Namei    Sys-cache   Proc-cache         msix2 vec 2     27906   fltcp
Calls     hits %      hits %       3427 ioapic1 pin 12  87178   zfod
125076  122834 98       80 0         59 ioapic2 pin 0   35775   cow
                                        msix7 vec 0      8192   fmin
Disks:  seeks  xfers  bytes  %busy                      10922   ftarg
ld0             1969  16130K  34.8                              itarg
dk0             1969  16130K  34.8                              flnan
wd0                                                             pdfre
dk1                                                             pdscn
dk2

and this from top:

load averages:  5.13,  6.53,  3.56;               up 1+16:08:05    04:28:13
59 processes: 2 runnable, 55 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
Swap: 163G Total, 163G Free

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
10353 pbulk     77    0   185M  172M select/0    0:13  4.74%  4.54% bjam
12120 wiz      109    0    83M   59M tstile/1  165:46  1.46%  1.46% systat
    0 root       0    0     0K   93M CPU/31     35:39  0.00%  0.00% [system]
  219 root      85    0    32M 2676K kqueue/4    7:34  0.00%  0.00% syslogd
13354 wiz       85    0    89M 4948K select/0    0:52  0.00%  0.00% sshd
  380 root      85    0    30M   16M pause/4     0:04  0.00%  0.00% ntpd
10918 wiz       43    0    25M 2872K CPU/3       0:01  0.00%  0.00% top
    1 root      85    0    20M 1756K wait/29     0:01  0.00%  0.00% init
 5594 pbulk      0    0     0K    0K RUN/0       0:00  0.00%  0.00% bjam
22861 pbulk      0    0     0K    0K RUN/0       0:00  0.00%  0.00% bjam
  747 root     117    0    20M 2080K tstile/8    0:00  0.00%  0.00% cron
16473 pbulk    117    0    18M 1564K tstile/2    0:00  0.00%  0.00% cp
 9705 pbulk    117    0
recurring tstile hangs on -current
Hi!

I've set up a new machine for bulk building. I have tried various
things, but in the end it always hangs in tstile.

First try was what I currently use: tmpfs sandboxes with nullfs
mounted /bin, /lib, ... When it hung, the suspicion was that it's
nullfs' fault. (The same setup works fine on my current machine.)

The second try was tmpfs with copied-in /bin, /lib, ... and
NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
hung. So the suspicion was that tmpfs or NFS are broken.

The last try was building in the root file system, i.e. not even a
sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
finished one pkgsrc build (where some packages didn't build because of
missing distfiles, or because they randomly break like rust). When I
restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.

I have the following systat output:

 2 users    Load  8.78  7.19  3.62        Fri Jun 28 04:27:32

Proc:r  d  s    Csw   Traps  SysCal  Intr  Soft   Fault   PAGING   SWAPPING
    24  1  0   7548  265849  157956  3504  2399  265476   in  out   in  out
                                                          ops
56.2% Sy  1.2% Us  0.0% Ni  0.0% In 42.5% Id   pages
|||||||||||
                                                        >  670 forks
                                                                fkppw
Anon   294104%    zero   62161268   5572 Interrupts             fksvm
Exec    14116%    wired     16296   1968 TLB shootdown          pwait
File  24587740 18% inact    43756    100 cpu0 timer             relck
Meta   2606694%   bufs     495676        msi1 vec 0             rlkok
      (kB)real   swaponly    free      9 msix2 vec 0            noram
Active  24835908        100033996      9 msix2 vec 1    57262   ndcpy
Namei    Sys-cache   Proc-cache         msix2 vec 2     27906   fltcp
Calls     hits %      hits %       3427 ioapic1 pin 12  87178   zfod
125076  122834 98       80 0         59 ioapic2 pin 0   35775   cow
                                        msix7 vec 0      8192   fmin
Disks:  seeks  xfers  bytes  %busy                      10922   ftarg
ld0             1969  16130K  34.8                              itarg
dk0             1969  16130K  34.8                              flnan
wd0                                                             pdfre
dk1                                                             pdscn
dk2

and this from top:

load averages:  5.13,  6.53,  3.56;               up 1+16:08:05    04:28:13
59 processes: 2 runnable, 55 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
Swap: 163G Total, 163G Free

  PID USERNAME PRI NICE   SIZE   RES STATE       TIME   WCPU    CPU COMMAND
10353 pbulk     77    0   185M  172M select/0    0:13  4.74%  4.54% bjam
12120 wiz      109    0    83M   59M tstile/1  165:46  1.46%  1.46% systat
    0 root       0    0     0K   93M CPU/31     35:39  0.00%  0.00% [system]
  219 root      85    0    32M 2676K kqueue/4    7:34  0.00%  0.00% syslogd
13354 wiz       85    0    89M 4948K select/0    0:52  0.00%  0.00% sshd
  380 root      85    0    30M   16M pause/4     0:04  0.00%  0.00% ntpd
10918 wiz       43    0    25M 2872K CPU/3       0:01  0.00%  0.00% top
    1 root      85    0    20M 1756K wait/29     0:01  0.00%  0.00% init
 5594 pbulk      0    0     0K    0K RUN/0       0:00  0.00%  0.00% bjam
22861 pbulk      0    0     0K    0K RUN/0       0:00  0.00%  0.00% bjam
  747 root     117    0    20M 2080K tstile/8    0:00  0.00%  0.00% cron
16473 pbulk    117    0    18M 1564K tstile/2    0:00  0.00%  0.00% cp
 9705 pbulk    117    0    15M 1564K bioloc/5    0:00  0.00%  0.00% cp
 7301 pbulk    117    0    15M 1560K tstile/2    0:00  0.00%  0.00% cp
22971 pbulk    117    0    19M 1520K tstile/1    0:00  0.00%  0.00% cp
10013 pbulk    117    0    15M 1520K tstile/1    0:00  0.00%  0.00% cp
 3411 pbulk    117    0    15M 1520K tstile/3    0:00  0.00%  0.00% cp
 5212 pbulk    117    0    15M 1520K tstile/2    0:00  0.00%  0.00% cp
 7072 pbulk    117    0    18M 1516K tstile/2    0:00  0.00%  0.00% cp
 8880 pbulk    117    0    15M 1516K tstile/2    0:00  0.00%  0.00% cp
 5869 pbulk    117    0    15M 1516K tstile/0    0:00  0.00%  0.00% cp
10159 pbulk    117    0    15M 1516K tstile/1    0:00  0.00%  0.00% cp
11783 pbulk    117    0    15M 1516K tstile/7    0:00  0.00%  0.00% cp
 7205 pbulk    117    0    15M 1512K tstile/1    0:00  0.00%  0.00% cp
18676 pbulk