Re: recurring tstile hangs on -current

2019-07-05 Thread maya
On Fri, Jul 05, 2019 at 10:29:24PM +0200, Thomas Klausner wrote:
> From some debugging so far, the cause of the hang seems to be that
> the nvme driver is waiting for an interrupt that never comes.
> 
> At least once I got it unstuck by calling "nvme_intr()" on the
> softc address from ddb.
>  Thomas

Obvious fix.

Index: wizd.8
===================================================================
RCS file: /cvsroot/src/share/man/man8/wizd.8,v
retrieving revision 1.11
diff -u -r1.11 wizd.8
--- wizd.8  12 Jun 2019 09:03:32 -  1.11
+++ wizd.8  5 Jul 2019 21:10:27 -
@@ -48,6 +48,9 @@
 also performs periodic sanity checks on the distribution set lists.
 E-mail alerts will be triggered if the build installs files that are marked
 as obsolete and therefore get automatically removed.
+.Pp
+.Nm
+provides interrupts when the kernel drops them.
 .Sh SIGNALS
 .Nm
 responds to the following signals:



Re: recurring tstile hangs on -current

2019-07-05 Thread Thomas Klausner
On Mon, Jul 01, 2019 at 07:02:06AM -, Michael van Elst wrote:
> t...@giga.or.at (Thomas Klausner) writes:
> 
> >So it looks like it could be a very extreme slowness instead of a
> >complete deadlock.
> 
> When it stops, try reducing kern.maxvnodes to something low (like 100);
> you can restore it if the machine wakes up.
> 
> If this is a memory shortage instead of a deadlock, this will easily
> free kernel memory. You may need to wait 1-2 minutes for the
> page daemon to collect free pages.
> 
> If you cannot issue a sysctl command, you can patch the kernel variable
> "desiredvnodes" with DDB instead.

From some debugging so far, the cause of the hang seems to be that
the nvme driver is waiting for an interrupt that never comes.

At least once I got it unstuck by calling "nvme_intr()" on the
softc address from ddb.
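For the record, the unstick sequence from DDB was roughly the following (a sketch only; the softc address shown is a placeholder -- the real one has to be read out of the autoconf/dmesg output for nvme0 first):

```
db{0}> call nvme_intr(<address of nvme0's softc>)
```

If the driver really lost an interrupt, this runs the handler by hand and lets the stalled I/O complete.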
 Thomas


Re: recurring tstile hangs on -current

2019-07-01 Thread Brian Buhrow
Hello.  If I were looking at this issue, I'd be looking at the perl
process stuck in bioloc, to see what it's doing.  As I understand it,
processes stuck in tstile are a symptom rather than a cause.  That is, any
process waiting for access to some subsystem in an indirect manner
will show as waiting in tstile, rather than on the actual thing it's waiting
for.  Perl, on the other hand, is in bioloc (short for biolock, I assume), and
my question is why.  If you can clear that process, I'm thinking everything
else will spring to life and begin working again.

Just my 2 cents.
-thanks
-Brian

On Jun 28,  9:42pm, Thomas Klausner wrote:
} 28391 pbulk117033M 6240K bioloc/0   0:00  0.00%  0.00% perl


Re: recurring tstile hangs on -current

2019-07-01 Thread Manuel Bouyer
On Fri, Jun 28, 2019 at 09:42:08PM +0200, Thomas Klausner wrote:
> To reduce the bug surface, I've disconnected the wd0 device which was
> attached at ahcisata. This also removed the swap device, but the
> machine is far from needing to swap.
> 
> After ~5 hours the machine is currently hanging in tstile again. I
> noticed the bulk build wasn't progressing (in a perl module) and
> stopped it, then tried removing the work directory to get pbulk to
> break off, that hung the rm process.

I notice you also have a perl process in biowait, this is probably
the one that is really hung. I've seen this occasionally, on various
hardware, but I can't easily reproduce it.
It looks like an I/O request is either never processed, or never biodone()'ed.
But I couldn't spot any race condition in this path ...

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference


Re: recurring tstile hangs on -current

2019-07-01 Thread Michael van Elst
t...@giga.or.at (Thomas Klausner) writes:

>So it looks like it could be a very extreme slowness instead of a
>complete deadlock.

When it stops, try reducing kern.maxvnodes to something low (like 100);
you can restore it if the machine wakes up.

If this is a memory shortage instead of a deadlock, this will easily
free kernel memory. You may need to wait 1-2 minutes for the
page daemon to collect free pages.

If you cannot issue a sysctl command, you can patch the kernel variable
"desiredvnodes" with DDB instead.
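Spelled out, the two recovery paths look roughly like this (a sketch: the restore value is only an example, and the DDB write syntax should be double-checked against ddb(4) on the running version):

```
# note the current value, then shrink the vnode cache
sysctl kern.maxvnodes
sysctl -w kern.maxvnodes=100
# wait 1-2 minutes for the page daemon, then restore the old value, e.g.:
sysctl -w kern.maxvnodes=1228800

# same thing from DDB when sysctl no longer runs (100 == 0x64):
db{0}> write desiredvnodes 0x64
```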

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: recurring tstile hangs on -current

2019-06-30 Thread Thomas Klausner
To reduce the bug surface, I've disconnected the wd0 device which was
attached at ahcisata. This also removed the swap device, but the
machine is far from needing to swap.

After ~5 hours the machine is currently hanging in tstile again. I
noticed the bulk build wasn't progressing (in a perl module) and
stopped it, then tried removing the work directory to get pbulk to
break off, that hung the rm process.

top:

load averages:  0.12,  0.09,  0.08;   up 0+09:10:01  21:39:56
42 processes: 40 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.5% idle
Memory: 8005M Act, 15M Inact, 16M Wired, 23M Exec, 7904M File, 113G Free
Swap:

  PID USERNAME PRI NICE   SIZE   RES STATE  TIME   WCPUCPU COMMAND
 9833 wiz   85076M   51M ttyraw/3   1:12 11.08% 11.08% systat
0 root   00 0K   94M CPU/31   305:54  0.00%  0.00% [system]
  191 root  85032M 2652K kqueue/1   2:13  0.00%  0.00% syslogd
24125 wiz   85086M 4984K select/1   0:13  0.00%  0.00% sshd
10315 wiz   43024M 2836K CPU/1  0:04  0.00%  0.00% top
  374 root  85030M   16M pause/10:01  0.00%  0.00% ntpd
28391 pbulk117033M 6240K bioloc/0   0:00  0.00%  0.00% perl
 8152 pbulk117018M 2300K tstile/1   0:00  0.00%  0.00% make
 4194 root 117015M 1612K tstile/0   0:00  0.00%  0.00% rm
12818 pbulk117015M 1596K tstile/2   0:00  0.00%  0.00% chmod
28169 pbulk110036M 7376K tstile/1   0:00  0.00%  0.00% perl
 8873 pbulk110016M 1608K tstile/2   0:00  0.00%  0.00% chmod
 2843 root  85086M 6824K select/3   0:00  0.00%  0.00% sshd
22714 root  85086M 6816K select/5   0:00  0.00%  0.00% sshd
  571 root  85070M 5060K wait/0 0:00  0.00%  0.00% login
13087 wiz   85086M 5028K select/0   0:00  0.00%  0.00% sshd
14025 pbulk 85021M 4812K wait/290:00  0.00%  0.00% make
17393 pbulk 85020M 4696K wait/140:00  0.00%  0.00% make
  541 postfix   85057M 4568K kqueue/6   0:00  0.00%  0.00% qmgr
24142 postfix   85057M 4548K kqueue/5   0:00  0.00%  0.00% pickup
  404 root  85072M 3424K select/3   0:00  0.00%  0.00% sshd
  386 root  85057M 3316K kqueue/1   0:00  0.00%  0.00% master
 8973 root  85020M 2092K wait/0 0:00  0.00%  0.00% sh
  516 root  85023M 2080K nanosl/6   0:00  0.00%  0.00% cron
18983 pbulk 85021M 1996K wait/140:00  0.00%  0.00% sh
  451 root  85016M 1952K pause/00:00  0.00%  0.00% ksh
19700 wiz   85019M 1948K pause/30:00  0.00%  0.00% ksh
11499 wiz   85017M 1944K pause/31   0:00  0.00%  0.00% ksh
  562 root  85023M 1836K ttyraw/2   0:00  0.00%  0.00% getty
  604 root  85020M 1832K ttyraw/1   0:00  0.00%  0.00% getty
  467 root  85021M 1828K ttyraw/3   0:00  0.00%  0.00% getty
1 root  85020M 1828K wait/4 0:00  0.00%  0.00% init
  599 root  85026M 1648K kqueue/1   0:00  0.00%  0.00% inetd
  362 root  85020M 1516K kqueue/3   0:00  0.00%  0.00% powerd
11168 pbulk 78037M 6240K bioloc/1   0:00  0.00%  0.00% perl
 8831 pbulk 78033M 6240K biowai/1   0:00  0.00%  0.00% perl
 3270 pbulk 78023M 1960K wait/180:00  0.00%  0.00% sh
 8456 pbulk 78020M 1960K wait/3 0:00  0.00%  0.00% sh
15353 pbulk 78020M 1960K wait/0 0:00  0.00%  0.00% sh
 9212 pbulk 78020M 1960K wait/290:00  0.00%  0.00% sh
 8360 pbulk 78020M 1960K wait/1 0:00  0.00%  0.00% sh
 2790 pbulk 78020M 1960K wait/290:00  0.00%  0.00% sh

systat vm:

3 usersLoad  0.16  0.10  0.09  Fri Jun 28 21:40:13

Proc:r  d  sCsw  Traps SysCal  Intr   Soft  Fault PAGING   SWAPPING
8  6 25   14566212in  out   in  out
ops
   0.4% Sy   0.0% Us   0.0% Ni   0.0% In  99.6% Idpages
|||||||||||
  forks
  fkppw
Anon   111464%   zero 64998924   167 Interrupts   fksvm
Exec23444%   wired   16308   TLB shootdownpwait
File  8093520   6%   inact   15064   100 cpu0 timer   relck
Meta  2016665%   bufs   406873 3 msix2 vec 0  rlkok
 (kB)real   swaponly  free 3 msix2 vec 1  noram
Active8197056118749328   msix2 vec 2  ndcpy

Re: recurring tstile hangs on -current

2019-06-30 Thread Thomas Klausner
This time it recovered! It took an hour or so, but the tstile-blocked
processes are now gone (finished) and I got my console shell (with the
rm) back.

Nothing in dmesg or /var/log/messages.

So it looks like it could be a very extreme slowness instead of a
complete deadlock.
 Thomas

On Fri, Jun 28, 2019 at 09:42:08PM +0200, Thomas Klausner wrote:
> To reduce the bug surface, I've disconnected the wd0 device which was
> attached at ahcisata. This also removed the swap device, but the
> machine is far from needing to swap.
> 
> After ~5 hours the machine is currently hanging in tstile again. I
> noticed the bulk build wasn't progressing (in a perl module) and
> stopped it, then tried removing the work directory to get pbulk to
> break off, that hung the rm process.
> 
> [...]

Re: recurring tstile hangs on -current

2019-06-30 Thread Thomas Klausner
With dmesg this time.

On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
> 
> I checked some process states in ddb.
> 
> [...]

Re: recurring tstile hangs on -current

2019-06-28 Thread Thomas Klausner
When I tried to get a core, I saw:

> reboot 0x104

dumping to dev 168,2 (offset=73677660, size=33524130)
dump ahcisata0 port 5: clearing WDCTL_RST failed for drive 0
wddump: device timed out
i/o error


rebooting...

 Thomas

On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
> 
> I checked some process states in ddb.
> 
> [...]

Re: recurring tstile hangs on -current

2019-06-28 Thread Thomas Klausner
On Fri, Jun 28, 2019 at 11:44:37AM +0100, Robert Swindells wrote:
> 
> Thomas Klausner  wrote:
> >[...]
> 
> Is a bulk build trying to make packages in parallel?

No.

> If so, then I'm guessing that the multiple instances of cp(1) are
> putting the package tarball into the same directory somewhere.

I guess this is just the installation of the boost-headers.

Also, I think that multiple programs writing different files in the
same directory should work ;-)
 Thomas


Re: recurring tstile hangs on -current

2019-06-28 Thread Robert Swindells


Thomas Klausner  wrote:
>I've set up a new machine for bulk building. I have tried various
>things, but in the end it always hangs in tstile.
>
>First try was what I currently use: tmpfs sandboxes with nullfs
>mounted /bin, /lib, ... When it hung, the suspicion was that it's
>nullfs' fault. (The same setup works fine on my current machine.)
>
>The second try was tmpfs with copied-in /bin, /lib, ... and
>NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
>hung. So the suspicion was that tmpfs or NFS are broken.
>
>The last try was building in the root file system, i.e. not even a
>sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
>are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
>finished one pkgsrc build (where some packages didn't build because of
>missing distfiles, or because they randomly break like rust). When I
>restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.

Is a bulk build trying to make packages in parallel?

If so, then I'm guessing that the multiple instances of cp(1) are
putting the package tarball into the same directory somewhere.


Re: recurring tstile hangs on -current

2019-06-28 Thread Thomas Klausner
Hi Frank!

I checked some process states in ddb.

"master", the 2 "bjam" and at least one "cp" hanging in tstile have:
sleepq_block()
turnstile_block()
rw_vector_enter()
genfs_lock()
VOP_LOCK()
vn_lock()
namei_tryemulroot()
namei()
check_exec()
execve_loadvm()
execve1()
syscall()

These look quite similar to your backtraces.

The "cp" hanging in biolock has:
sleepq_block
cv_timedwait
bbusy
getblk
bio_doread
ffs_init_vnode
ffs_newvnode
vcache_new
ufs_makeinode
ufs_create
VOP_CREATE
vn_open
do_open
do_sys_openat
sys_open
syscall

I can't agree with the statement that it's a general -current problem
-- my current working machine does not have this issue. It "only" has
32GB and 12 cores though, and no nvme. dmesg attached.

Do you see the issue on machines without nvme? Just to eliminate that.
(I wanted to try replacing the nvme boot disk next.)
 Thomas


On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> Hi Thomas,
> 
> glad that this is observed elsewhere.
> 
> Maybe the following bugs resonate with your observations:
> 
> kern/54207 [serious/high]:
> -current locks up solidly when pkgsrc building 
> adapta-gtk-theme-3.95.0.11
> looks like locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> 
> kern/54210 [serious/high]:
> NetBSD-8 processes presumably not exiting
> not tested with -current, but may be there too. (Intel(R) Xeon(R) Gold 6134 
> CPU @ 3.20GHz, ~380Gb)
> 
> At this time I am not too confident that -current can reliably do a 
> pkgsrc build, though I have occasionally seen bulk builds finish.
> Most of the time I run into hard lockups with no information about the system 
> state available (no console, no X, no network, no DDB).
> 
> Frank
> 
> 
> On 06/28/19 10:46, Thomas Klausner wrote:
> > [...]

Re: recurring tstile hangs on -current

2019-06-28 Thread Frank Kardel

Hi Thomas,

glad that this is observed elsewhere.

Maybe the following bugs resonate with your observations:

kern/54207 [serious/high]:
-current locks up solidly when pkgsrc building 
adapta-gtk-theme-3.95.0.11
looks like locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)

kern/54210 [serious/high]:
NetBSD-8 processes presumably not exiting
not tested with -current, but may be there too. (Intel(R) Xeon(R) Gold 6134 CPU 
@ 3.20GHz, ~380Gb)

At this time I am not too confident that -current can reliably do a 
pkgsrc build, though I have occasionally seen bulk builds finish.
Most of the time I run into hard lockups with no information about the system 
state available (no console, no X, no network, no DDB).

Frank


On 06/28/19 10:46, Thomas Klausner wrote:

[...]

recurring tstile hangs on -current

2019-06-28 Thread Thomas Klausner
Hi!

I've set up a new machine for bulk building. I have tried various
things, but in the end it always hangs in tstile.

First try was what I currently use: tmpfs sandboxes with nullfs
mounted /bin, /lib, ... When it hung, the suspicion was that it's
nullfs' fault. (The same setup works fine on my current machine.)

The second try was tmpfs with copied-in /bin, /lib, ... and
NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
hung. So the suspicion was that tmpfs or NFS are broken.

The last try was building in the root file system, i.e. not even a
sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
finished one pkgsrc build (where some packages didn't build because of
missing distfiles, or because they randomly break like rust). When I
restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.

I have the following systat output:

2 usersLoad  8.78  7.19  3.62  Fri Jun 28 04:27:32

Proc:r  d  sCsw  Traps SysCal  Intr   Soft  Fault PAGING   SWAPPING
2410   7548 265849 157956  3504   2399 265476 in  out   in  out
ops
  56.2% Sy   1.2% Us   0.0% Ni   0.0% In  42.5% Idpages
|||||||||||
> 670 forks
  fkppw
Anon   294104%   zero 62161268  5572 Interrupts   fksvm
Exec14116%   wired   16296  1968 TLB shootdownpwait
File 24587740  18%   inact   43756   100 cpu0 timer   relck
Meta  2606694%   bufs   495676   msi1 vec 0   rlkok
 (kB)real   swaponly  free 9 msix2 vec 0  noram
Active   24835908100033996 9 msix2 vec 157262 ndcpy
Namei Sys-cache Proc-cache   msix2 vec 227906 fltcp
Calls hits% hits %  3427 ioapic1 pin 12 87178 zfod
   125076   122834   98   80 059 ioapic2 pin 0  35775 cow
 msix7 vec 0 8192 fmin
  Disks:   seeks   xfers   bytes   %busy10922 ftarg
 ld01969  16130K34.8  itarg
 dk01969  16130K34.8  flnan
 wd0  pdfre
 dk1  pdscn
 dk2

and this from top:

load averages:  5.13,  6.53,  3.56;   up 1+16:08:05  04:28:13
59 processes: 2 runnable, 55 sleeping, 2 on CPU
CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.9% idle
Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
Swap: 163G Total, 163G Free

  PID USERNAME PRI NICE   SIZE   RES STATE  TIME   WCPUCPU COMMAND
10353 pbulk 770   185M  172M select/0   0:13  4.74%  4.54% bjam
12120 wiz  109083M   59M tstile/1 165:46  1.46%  1.46% systat
0 root   00 0K   93M CPU/3135:39  0.00%  0.00% [system]
  219 root  85032M 2676K kqueue/4   7:34  0.00%  0.00% syslogd
13354 wiz   85089M 4948K select/0   0:52  0.00%  0.00% sshd
  380 root  85030M   16M pause/40:04  0.00%  0.00% ntpd
10918 wiz   43025M 2872K CPU/3  0:01  0.00%  0.00% top
1 root  85020M 1756K wait/290:01  0.00%  0.00% init
 5594 pbulk  00 0K0K RUN/0  0:00  0.00%  0.00% bjam
22861 pbulk  00 0K0K RUN/0  0:00  0.00%  0.00% bjam
  747 root 117020M 2080K tstile/8   0:00  0.00%  0.00% cron
16473 pbulk117018M 1564K tstile/2   0:00  0.00%  0.00% cp
 9705 pbulk117015M 1564K bioloc/5   0:00  0.00%  0.00% cp
 7301 pbulk117015M 1560K tstile/2   0:00  0.00%  0.00% cp
22971 pbulk117019M 1520K tstile/1   0:00  0.00%  0.00% cp
10013 pbulk117015M 1520K tstile/1   0:00  0.00%  0.00% cp
 3411 pbulk117015M 1520K tstile/3   0:00  0.00%  0.00% cp
 5212 pbulk117015M 1520K tstile/2   0:00  0.00%  0.00% cp
 7072 pbulk117018M 1516K tstile/2   0:00  0.00%  0.00% cp
 8880 pbulk117015M 1516K tstile/2   0:00  0.00%  0.00% cp
 5869 pbulk117015M 1516K tstile/0   0:00  0.00%  0.00% cp
10159 pbulk117015M 1516K tstile/1   0:00  0.00%  0.00% cp
11783 pbulk117015M 1516K tstile/7   0:00  0.00%  0.00% cp
 7205 pbulk117015M 1512K tstile/1   0:00  0.00%  0.00% cp
18676 pbulk