Re: panic: assertion "!cpu_softintr_p()" failed
Hi,

On Sun, Oct 01, 2023 at 10:12:47AM +0200, Thomas Klausner wrote:

> panic: kernel diagnostic assertion "!cpu_softintr_p()" failed: file
> "/usr/src/sys/kern/subr_kmem.c", line 451

Sorry about that. Should be fixed by:

/cvsroot/src/sys/kern/kern_mutex_obj.c,v  <--  kern_mutex_obj.c
new revision: 1.15; previous revision: 1.14

/cvsroot/src/sys/kern/kern_rwlock_obj.c,v  <--  kern_rwlock_obj.c
new revision: 1.13; previous revision: 1.12

Cheers,
Andrew
Re: Growth in pool usage between netbsd-9 and -10?
On Fri, Sep 08, 2023 at 12:27:57PM +1000, Paul Ripke wrote:
> I need to read more code to see when the pools decide to release idle
> pages - because this is remarkably wasteful considering my machine is
> also paging, and "only" has 16GiB RAM.
>
> Memory resource pool statistics
> Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg
> Idle
> buf2k 2048 8307800 617706 207284 85623 121661 172466 0 inf
> 6948
> vcachepl 640 440759840 43266390 5228141 5012754 215387 704585 0 inf
> 3547
> pcgnormal 320 23415927 32666190 23388721 1050250 1045247 5003 89205 0 inf
> 2425
> ffsdino2 264 433860980 42621716 1690136 1577576 112560 281810 0 inf
> 2185
> ffsino 280 433124090 42548027 1850682 1733882 116800 301939 0 inf
> 2143
> ractx 40 6770 6846710 41237 37592 3645 4679 0 inf
> 1780
> bufpl 280 5576110 239663 31776 559 31217 31296 0 inf
> 1362
> buf16k 16896 4312320 350158 52798 21935 30863 32706 0 inf
> 942
> mclpl 2048 1206000 119876 32540 31337 1203 5270 0 261333
> 841
> pvpage 4096 61134353585 40271 32334 7937 12454 0 inf
> 388
> mbpl 520 2055500 204337 13673 12993 680 3049 0 inf
> 362
> pcglarge 1088 8444730 825312 88124 81378 6746 9809 0 inf
> 359
> buf4k 4096 28406025833 5510 2654 2856 4001 0 inf
> 283
> ...
> Totals 1937415183 32666228 1927493251 11060670 10142826 917844
>
> In use 3585418K, total allocated 5569236K; utilization 64.4%

The targets are roughly 10% of RAM for vnodes and 15% for bufcache which
adds up to 25%. So this doesn't seem so totally off the mark.

DIAGNOSTIC does add a bit of overhead since it messes up all the kmem
cache sizes with debugging stuff. I have plans to reduce the overhead
from vnodes but not enough to claw back 100s of MB..

Sometimes I think the method for sizing these caches should be non-linear
or something - don't have a good idea yet.

Andrew
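As a rough sanity check of those targets on a 16 GiB machine (base-2
units assumed):

    vnode target:    0.10 * 16384 MiB ~= 1638 MiB
    bufcache target: 0.15 * 16384 MiB ~= 2458 MiB
    combined:        ~4096 MiB = 25% of RAM

The report above shows about 5.3 GiB total allocated and 3.4 GiB in use,
so the pools are in the expected ballpark once DIAGNOSTIC's extra
per-allocation overhead is accounted for.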
Re: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c" failed
Hi,

On Fri, Jun 12, 2020 at 11:17:30PM +0200, Thomas Klausner wrote:
> With a 9.99.63/amd64 kernel from May 19 I saw a panic:
>
> Jun 7 01:01:01 yt savecore: reboot after panic: [ 396809.5836453] panic:
> kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp ||
> c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c",
> line 322 running callout 0x835adaa87658: c_func (0x809f459f)
> c_flags (0x100) destroyed from 0x809f264b
> Jun 7 01:01:01 yt savecore: writing compressed core to
> /var/crash/netbsd.12.core.gz
>
> (gdb) bt
> #0  0x80226865 in cpu_reboot (howto=howto@entry=260,
>     bootstr=bootstr@entry=0x0) at /usr/src/sys/arch/amd64/amd64/machdep.c:713
> #1  0x80cd387f in kern_reboot (howto=howto@entry=260,
>     bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:73
> #2  0x80d1285e in vpanic (
>     fmt=0x8143e908 "kernel %sassertion \"%s\" failed: file \"%s\",
>     line %d running callout %p: c_func (%p) c_flags (%#x) destroyed from %p",
>     ap=ap@entry=0x850916323c88) at /usr/src/sys/kern/subr_prf.c:290
> #3  0x80e834b6 in kern_assert (
>     fmt=fmt@entry=0x8143e908 "kernel %sassertion \"%s\" failed: file
>     \"%s\", line %d running callout %p: c_func (%p) c_flags (%#x) destroyed from
>     %p")
>     at /usr/src/sys/lib/libkern/kern_assert.c:51
> #4  0x80cf1708 in callout_destroy (cs=cs@entry=0x835adaa87658) at
>     /usr/src/sys/kern/kern_timeout.c:323
> #5  0x809f264b in tcp_close (tp=tp@entry=0x835adaa874c8) at
>     /usr/src/sys/netinet/tcp_subr.c:1227
> #6  0x809ecb83 in tcp_input (m=0x83415e784b00, off=20,
>     proto=<optimized out>) at /usr/src/sys/netinet/tcp_input.c:2396
> #7  0x809dde5f in ip_input (m=<optimized out>) at
>     /usr/src/sys/netinet/ip_input.c:816
> #8  ipintr (arg=<optimized out>) at /usr/src/sys/netinet/ip_input.c:402
> #9  0x80ce18ae in softint_execute (s=4, l=0x835d46efb480) at
>     /usr/src/sys/kern/kern_softint.c:565
> #10 softint_dispatch (pinned=<optimized out>, s=4) at
>     /usr/src/sys/kern/kern_softint.c:814
> #11 0x80220eef in Xsoftintr ()
> (gdb)
>
> Is this probably already fixed or should I file a PR?
>  Thomas

I had a brief look at this before but I don't know the networking code too
well. Basically something is starting a TCP timer after/during its
destruction. It seems to be an old bug but whatever you're doing is
managing to trigger it. It's definitely worth a PR.

Cheers,
Andrew
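For readers unfamiliar with the invariant being asserted: a callout may
not be destroyed while it is still pending or running. A minimal sketch
of the safe teardown pattern, using the real callout_halt()/
callout_destroy() API but a hypothetical softc (this is illustrative
only, not the actual tcp_close() code):

    #include <sys/callout.h>
    #include <sys/mutex.h>

    struct softc {
    	kmutex_t	sc_lock;
    	callout_t	sc_timer;
    };

    static void
    softc_detach(struct softc *sc)
    {
    	/*
    	 * callout_halt() stops the callout and, if its handler is
    	 * currently running, waits for it to complete (releasing
    	 * sc_lock while waiting and reacquiring it afterwards).
    	 * Only then is callout_destroy() safe.
    	 */
    	mutex_enter(&sc->sc_lock);
    	callout_halt(&sc->sc_timer, &sc->sc_lock);
    	mutex_exit(&sc->sc_lock);
    	callout_destroy(&sc->sc_timer);
    }

The panic above is the other ordering: the timer was (re)armed around the
time of destruction, so callout_destroy() found it still active.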
Re: cmake hanging
On Mon, Jun 08, 2020 at 03:38:44PM +0100, Chavdar Ivanov wrote:
> On Sun, 7 Jun 2020 at 10:25, Chavdar Ivanov wrote:
> >
> > Hi,
> >
> > I just had another one, rebuilding gimp, running gegl. Again gdb -p
> > ... ; quit sorted it out.
> >
> > On Sat, 6 Jun 2020 at 20:36, Chavdar Ivanov wrote:
> > >
> > > On Sat, 6 Jun 2020 at 20:26, Joerg Sonnenberger wrote:
> > > >
> > > > On Sat, Jun 06, 2020 at 06:45:03PM +0100, Chavdar Ivanov wrote:
> > > > > On Sat, 6 Jun 2020 at 18:43, Chavdar Ivanov wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I got another cmake hang during pkg_rolling-replace today, while
> > > > > > building misc/kdepimlibs4, trace as follows:
> > > > >
> > > > > Just to mention that after I quit gdb and detached from the process it
> > > > > continued and completed the build . . .
> > > >
> > > > Right, that's the bug in the mutex wakeup handling.
> > >
> > > The second hung sample - with git - also completed OK after I quit gdb...
>
> I had another three cmake hangs just like this today, while rebuilding
> bits of kf5.
>
> Just to confirm - the moment one answers 'y' to the question whether
> to leave gdb the process continues and the build succeeds.
>
> This is somewhat annoying; although it does not stop the rebuild
> process, it makes it impossible to complete unattended.

I just made some more changes to libpthread today that may help. I'll try
building KDE soon.

Cheers,
Andrew
Re: cmake hanging
On Wed, Jun 10, 2020 at 01:30:22AM +0200, Joerg Sonnenberger wrote:
> On Tue, Jun 09, 2020 at 11:22:27PM +0000, Andrew Doran wrote:
> > On Sat, Jun 06, 2020 at 09:25:55PM +0200, Joerg Sonnenberger wrote:
> > > On Sat, Jun 06, 2020 at 06:45:03PM +0100, Chavdar Ivanov wrote:
> > > > On Sat, 6 Jun 2020 at 18:43, Chavdar Ivanov wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I got another cmake hang during pkg_rolling-replace today, while
> > > > > building misc/kdepimlibs4, trace as follows:
> > > >
> > > > Just to mention that after I quit gdb and detached from the process it
> > > > continued and completed the build . . .
> > >
> > > Right, that's the bug in the mutex wakeup handling.
> >
> > Hmm. From what I've seen I'm starting to suspect that it's a signal handler
> > screwing up a waiting thread underneath it, by playing with threads stuff.

I have made pthread_mutex and pthread_cond hopefully insensitive to that.
If something is badly behaved and calls pthread_cond_broadcast() from a
signal handler for example it shouldn't cause any damage.

> I've been wondering if we actually want to have SA_RESTART behavior for
> _lwp_park.

It relies on EINTR being returned at the moment. Consider for example if a
signal is taken, and during signal processing a wakeup is eaten by
_lwp_park() called by the dynamic linker for its locks. On the way back
out the parked thread needs to restart and check if it has actually been
awoken, because it might never see a wakeup.

Andrew
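To make the restart requirement concrete, here is a schematic of the
pattern (a hypothetical park() stands in for libpthread's actual use of
_lwp_park(); the real code differs considerably):

    /* Schematic only: park() stands in for the _lwp_park() syscall. */
    static volatile int got_wakeup;	/* set by the waker before unparking */

    static void
    park(void)
    {
    	/* Stand-in: may return spuriously, e.g. on EINTR. */
    }

    static void
    wait_for_wakeup(void)
    {
    	while (!got_wakeup) {
    		/*
    		 * A return from park() does not prove the wakeup
    		 * happened: a signal may have interrupted it, or the
    		 * wakeup may have been consumed by a park/unpark done
    		 * inside the signal handler (e.g. by the dynamic
    		 * linker's locks).  Only the flag is authoritative,
    		 * so re-check and park again if need be.
    		 */
    		park();
    	}
    }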
Re: cmake hanging
On Sat, Jun 06, 2020 at 09:25:55PM +0200, Joerg Sonnenberger wrote:
> On Sat, Jun 06, 2020 at 06:45:03PM +0100, Chavdar Ivanov wrote:
> > On Sat, 6 Jun 2020 at 18:43, Chavdar Ivanov wrote:
> > >
> > > Hi,
> > >
> > > I got another cmake hang during pkg_rolling-replace today, while
> > > building misc/kdepimlibs4, trace as follows:
> >
> > Just to mention that after I quit gdb and detached from the process it
> > continued and completed the build . . .
>
> Right, that's the bug in the mutex wakeup handling.

Hmm. From what I've seen I'm starting to suspect that it's a signal
handler screwing up a waiting thread underneath it, by playing with
threads stuff.

Andrew
Re: Automated report: NetBSD-current/i386 build failure
On Fri, May 22, 2020 at 12:07:52AM +0000, NetBSD Test Fixture wrote:
> This is an automatically generated notice of a NetBSD-current/i386
> build failure.
>
> The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host,
> using sources from CVS date 2020.05.21.21.12.31.
...
> --- lapic.o ---
> *** [lapic.o] Error code 1
> nbmake[2]: stopped in
> /tmp/bracket/build/2020.05.21.21.12.31-i386/obj/sys/arch/i386/compile/LEGACY
> --- lance.o ---

This one is fixed already with 1.82 of sys/arch/x86/x86/lapic.c.

Andrew
Re: NFS swap on current appears to deadlock
Hi,

Finally got around to trying this. Having beaten on it for a while with
real hardware I don't see any problem with swapping over NFS on 9.99.63.

On Sat, May 02, 2020 at 12:06:48PM +1000, Paul Ripke wrote:
> I have a qemu guest for experimenting with -current, 1 CPU & 64MiB RAM.

64 megs, I'm surprised it makes it to the login prompt.

Andrew

> I gave it an NFS swap space to cope with a few small builds, and it now
> locks up hard after touching that swap device.
>
> From ddb, stacks are like:
>
> db{0}> t
> sched_resched_cpu() at netbsd:sched_resched_cpu+0x3f
> sleepq_wake() at netbsd:sleepq_wake+0x6c
> uvm_wait() at netbsd:uvm_wait+0x47
> uvn_findpage() at netbsd:uvn_findpage+0x21b
> uvn_findpages() at netbsd:uvn_findpages+0xca
> genfs_getpages() at netbsd:genfs_getpages+0x107a
> VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x6b
> uvm_fault_internal() at netbsd:uvm_fault_internal+0x1aae
> trap() at netbsd:trap+0x521
> --- trap (number 6) ---
> c6c0bc1b:
>
> db{0}> t
> binuptime() at netbsd:binuptime+0x79
> mi_switch() at netbsd:mi_switch+0x61
> sleepq_block() at netbsd:sleepq_block+0xb4
> mtsleep() at netbsd:mtsleep+0x137
> uvm_pageout() at netbsd:uvm_pageout+0x69e
>
> They seem to change around, but userland in the OS seems pretty
> locked up. Running with the NFS swap space disabled, using a normal
> swap partition, is fine.
>
> This is 9.99.56 amd64, btw.
>
> Ideas?
>
> Cheers,
> --
> Paul Ripke
> "Great minds discuss ideas, average minds discuss events, small minds
> discuss people."
> -- Disputed: Often attributed to Eleanor Roosevelt. 1948.
Re: lang/rust build fails
On Thu, May 14, 2020 at 11:53:04AM -0500, Robert Nestor wrote:
> Ran into an interesting problem trying to build lang/rust from both -current
> and 2020Q1 pkgsrc. On a NetBSD installation of 9.99.45 kernel and user land,
> the builds succeed. Under 9.99.60 kernel and user land the builds fail.
>
> The failure doesn't give much of a clue about what's happened. The last
> lines in the build.log are:
>
> running: /pkg_comp/work/pkg/lang/rust/work/rust-bootstrap/bin/cargo build
> --manifest-path
> /pkg_comp/work/pkg/lang/rust/work/rustc-1.42.0-src/src/bootstrap/Cargo.toml
> --frozen
>    Compiling proc-macro2 v0.4.30
>
> At that point there's nothing consuming CPU time in the build and everything
> seems to be waiting on something to happen that never does. I've left the
> system in that state for about 24 hours and still no progress.
>
> Any clues? Could this be something related to some of the recent kernel
> changes?

I think it's likely a race condition with pthread_mutex. I found a decent
repro and will take a look, hopefully this weekend.

Andrew
Re: Panic: vrelel: bad ref count (9.99.54)
On Mon, May 04, 2020 at 03:54:57PM +0200, Leonardo Taccari wrote:
> Hello Yorick and Andrew,
>
> Yorick Hardy writes:
> > > > [...]
> > > > > Crash version 9.99.55, image version 9.99.55.
> > > > > crash: _kvm_kvatop(0)
> > > > > Kernel compiled without options LOCKDEBUG.
> > > > > System panicked: vrelel: bad ref count
> > > > > Backtrace from time of crash is available.
> > > > > crash> bt
> > > > > _KERNEL_OPT_NAGR() at 0
> > > > > ?() at 7f7ff7ecf000
> > > > > sys_reboot() at sys_reboot
> > > > > vpanic() at vpanic+0x181
> > > > > vtryrele() at vtryrele
> > > > > vcache_dealloc() at vcache_dealloc
> > > > > uvm_unmap_detach() at uvm_unmap_detach+0x76
> > > > > uvm_unmap1() at uvm_unmap1+0x4e
> > > > > uvm_mremap() at uvm_mremap+0x36b
> > > > > sys_mremap() at sys_mremap+0x68
> > > > > syscall() at syscall+0x227
> > > > > --- syscall (number 411) ---
> > > > > 797459842e9a:
> > > > > crash>
> > > >
> > > > The same just happened on 9.99.56 while fetching (POP) mail using
> > > > mail/fdm.
> > >
> > > Could you file a PR please? If this panics again could you please run the
> > > "dmesg" command in crash and find out what it printed about the vnode? That
> > > would be very useful.
> > >
> > > Thanks,
> > > Andrew
> >
> > I will do so (... perhaps only this weekend).
> > [...]
>
> I was able to reproduce it too with a yesterday evening NetBSD/amd64
> -current when using mail/fdm and I will try to prepare a minimal
> reproducer using mail/fdm and file a PR if noone beat me.
>
> In the meantime here the information from dmesg:
>
> [ 6107.6380323] vnode 0xa95219747d40 flags 0x418
> [ 6107.6380323]  tag VT_TMPFS(25) type VREG(1) mount 0xa951f6d89000
> typedata 0xa95255e32c90
> [ 6107.6380323]  usecount 1 writecount 1 holdcount 0
> [ 6107.6380323]  size 18000 writesize 18000 numoutput 0
> [ 6107.6380323]  data 0xa952583304a0 lock 0xa95219747f00
> [ 6107.6380323]  state LOADED key(0xa951f6d89000 8) a0 04 33 58 52 a9
> ff ff
> [ 6107.6380323]  lrulisthd 0x816b5ed0
> [ 6107.6380323]  tag VT_TMPFS, tmpfs_node 0xa952583304a0, flags 0x0,
> links 1
> [ 6107.6380323]  mode 0600, owner 1000, group 0, size 98304
> [ 6107.6380323] panic: vrelel: bad ref count
> [ 6107.6380323] cpu0: Begin traceback...
> [ 6107.6380323] vpanic() at netbsd:vpanic+0x178
> [ 6107.6480364] vnpanic() at netbsd:vnpanic+0x49
> [ 6107.6480364] vrelel() at netbsd:vrelel+0x5b6
> [ 6107.6480364] uvm_unmap_detach() at netbsd:uvm_unmap_detach+0x8e
> [ 6107.6480364] sys_munmap() at netbsd:sys_munmap+0x85
> [ 6107.6480364] syscall() at netbsd:syscall+0x2a0
> [ 6107.6480364] --- syscall (number 73) ---
> [ 6107.6480364] 7c1e5d18414a:
> [ 6107.6480364] cpu0: End traceback...
> [ 6107.6480364] fatal breakpoint trap in supervisor mode
> [ 6107.6480364] trap type 1 code 0 rip 0x802219fd cs 0x8 rflags 0x202
> cr2 0x7f7ff7ee5000 ilevel 0 rsp 0xc100c227ae20
> [ 6107.6480364] curlwp 0xa9521e1b1600 pid 20756.20756 lowest kstack
> 0xc100c22772c0
> [ 6107.6480364] dumping to dev 0,1 (offset=276847, size=2062664):
> [ 6107.6480364] dump
>
> If any possible further information is needed do not hesitate to
> contact me!
>
> Thanks!

Thank you. I opened PR 55237 to track so I don't forget.

Andrew
Re: NFS swap on current appears to deadlock
Hi Paul,

On Sat, May 02, 2020 at 12:06:48PM +1000, Paul Ripke wrote:
> I have a qemu guest for experimenting with -current, 1 CPU & 64MiB RAM.
> I gave it an NFS swap space to cope with a few small builds, and it now
> locks up hard after touching that swap device.
>
> From ddb, stacks are like:
>
> db{0}> t
> sched_resched_cpu() at netbsd:sched_resched_cpu+0x3f
> sleepq_wake() at netbsd:sleepq_wake+0x6c
> uvm_wait() at netbsd:uvm_wait+0x47
> uvn_findpage() at netbsd:uvn_findpage+0x21b
> uvn_findpages() at netbsd:uvn_findpages+0xca
> genfs_getpages() at netbsd:genfs_getpages+0x107a
> VOP_GETPAGES() at netbsd:VOP_GETPAGES+0x6b
> uvm_fault_internal() at netbsd:uvm_fault_internal+0x1aae
> trap() at netbsd:trap+0x521
> --- trap (number 6) ---
> c6c0bc1b:
>
> db{0}> t
> binuptime() at netbsd:binuptime+0x79
> mi_switch() at netbsd:mi_switch+0x61
> sleepq_block() at netbsd:sleepq_block+0xb4
> mtsleep() at netbsd:mtsleep+0x137
> uvm_pageout() at netbsd:uvm_pageout+0x69e
>
> They seem to change around, but userland in the OS seems pretty
> locked up. Running with the NFS swap space disabled, using a normal
> swap partition, is fine.
>
> This is 9.99.56 amd64, btw.
>
> Ideas?

No idea off hand, but I'll try to reproduce it some time this week.

Andrew
Re: firefox build broken
Hi,

On Tue, Apr 28, 2020 at 12:36:04PM +0200, Thomas Klausner wrote:
> It seems to me some recent change broke the firefox build.
>
> I've built all packages from scratch on 9.99.59/amd64 from 20200426.
>
> Firefox consistently fails with
>
> stack backtrace:
>    0: 0x490088e2 - core::fmt::Display>::fmt::hfc6622696269221b
>    1: 0x490244ad - core::fmt::write::hd3eed79e7afa73cf
>    2: 0x49007aa5 - std::io::Write::write_fmt::h60ea7d9604ed82bb
>    3: 0x48ffae90 - std::panicking::default_hook::{{closure}}::h2c07f5fe2b4f82c2
>    4: 0x48ffab82 - std::panicking::default_hook::h6dd6fafda250b80f
>    5: 0x48ffb4ed - std::panicking::rust_panic_with_hook::hd4a4901bf3898ce4
>    6: 0x48ffb0d0 - rust_begin_unwind
>    7: 0x48ffb04b - std::panicking::begin_panic_fmt::h08093aab619d21cf
>    8: 0x48e50244 - build_script_build::build_gecko::generate_structs::h05aae5803967982b
>    9: 0x48e51dc6 - build_script_build::main::hdd2fcab7190374c2
>   10: 0x48e38a53 - std::rt::lang_start::{{closure}}::h1cbbafc9225f7a3a
>   11: 0x48ffafb3 - std::panicking::try::do_call::hf264ae5e639af05c
>   12: 0x4900ec37 - __rust_maybe_catch_panic
>   13: 0x48ffe8eb - std::rt::lang_start_internal::h63883908f6b782bf
>   14: 0x48e52792 - main
>   15: 0x48e35cfb - ___start
>
> gmake[3]: ***
> [/scratch/www/firefox/work/firefox-75.0/config/makefiles/rust.mk:287:
> force-cargo-library-build] Error 101
>
> Anyone else seen this?
>
> firefox built fine for me on 20200423 with 9.99.57 from 20200422.

Does this still occur? Are you using the latest ld.so?

Andrew
Heads up: ubc_direct enabled by default
Hi,

This affects amd64, alpha and aarch64, but only 1 and 2 CPU systems so
far. Any more and it's still off by default. Only the default has changed
so the sysctl (vm.ubc_direct) still works for turning it on and off
manually.

This works great for me on amd64 but needs some tweaks to handle many
CPUs. I have some ideas on that one and hopefully will have something to
try soon.

Cheers,
Andrew
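For those who want to check the current setting programmatically rather
than with sysctl(8), a minimal sketch using sysctlbyname(3), assuming (as
is usual for boolean VM knobs) that the node is readable as an int:

    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <err.h>

    int
    main(void)
    {
    	int val;
    	size_t len = sizeof(val);

    	/* vm.ubc_direct is the knob mentioned above. */
    	if (sysctlbyname("vm.ubc_direct", &val, &len, NULL, 0) == -1)
    		err(1, "sysctlbyname");
    	printf("vm.ubc_direct = %d\n", val);
    	return 0;
    }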
Re: panic: LOCKDEBUG: Mutex error: mutex_vector_enter,514: spin lock held
Hi Paul,

On Wed, Apr 22, 2020 at 12:06:41PM +1000, Paul Ripke wrote:
> On -current as of ~yesterday, in a 1CPU amd64 qemu boot, I'm seeing:
>
> Waiting for duplicate address detection to finish...
> Starting dhcpcd.
> dhcpcd-9.0.1 starting
> unknown option:
> [ 17.0102686] wm0: link state UP (was UNKNOWN)
> wm0: carrier acquired
> unknown option:
> [ 17.1710186] pid 122 (dhcpcd), uid 35: exited on signal 11 (core not
> dumped, err = 1)
> dhcpcd_fork_cb: truncated read 0 (expected 4)
> /etc/rc.d/dhcpcd exited with code 1
> Building databases: dev[ 19.2211655] Mutex error: mutex_vector_enter,514:
> spin lock held
>
> [ 19.2211655] lock address : 0x81765a40 type : spin
> [ 19.2211655] initialized  : 0x80a2690f
> [ 19.2211655] shared holds : 0 exclusive: 1
> [ 19.2211655] shares wanted: 0 exclusive: 0
> [ 19.2211655] relevant cpu : 0 last held: 0
> [ 19.2211655] relevant lwp : 0x80a57f71c2c0 last held: 0x80a57f71c2c0
> [ 19.2211655] last locked* : 0x80a24843 unlocked : 0x80a268e7
> [ 19.2211655] owner field  : 0x00010600 wait/spin: 0/1
>
> [ 19.2211655] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,514: spin
> lock held
> [ 19.2211655] cpu0: Begin traceback...
> [ 19.2211655] vpanic() at netbsd:vpanic+0x178
> [ 19.2211655] snprintf() at netbsd:snprintf
> [ 19.2211655] lockdebug_more() at netbsd:lockdebug_more
> [ 19.2211655] mutex_enter() at netbsd:mutex_enter+0x3c7
> [ 19.2211655] vmem_rehash_all() at netbsd:vmem_rehash_all+0x13a
> [ 19.2211655] workqueue_worker() at netbsd:workqueue_worker+0xe1
> [ 19.2211655] cpu0: End traceback...
>
> Seems consistent between attempts.
>
> Known? Meanwhile, I'll re-sync and try again.

This one was fixed with src/sys/kern/subr_vmem.c 1.103.

Cheers,
Andrew
Re: Automated report: NetBSD-current/i386 build failure
Doesn't show up in Opengrok, maybe it dislikes rump. Already fixed.

Andrew

On Sun, Apr 19, 2020 at 09:56:55PM +0000, NetBSD Test Fixture wrote:
> This is an automatically generated notice of a NetBSD-current/i386
> build failure.
>
> The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host,
> using sources from CVS date 2020.04.19.20.35.29.
>
> An extract from the build.sh output follows:
>
> /tmp/bracket/build/2020.04.19.20.35.29-i386/tools/bin/nbctfconvert -g -L
> VERSION klock.o
>
> /tmp/bracket/build/2020.04.19.20.35.29-i386/tools/bin/i486--netbsdelf-objcopy
> -x klock.o
> --- etfs_wrap.o ---
>
> /tmp/bracket/build/2020.04.19.20.35.29-i386/tools/bin/i486--netbsdelf-objcopy
> -x etfs_wrap.o
> --- sleepq.o ---
>
> /tmp/bracket/build/2020.04.19.20.35.29-i386/src/lib/librump/../../sys/rump/librump/rumpkern/sleepq.c:65:1:
> error: conflicting types for 'sleepq_enqueue'
> sleepq_enqueue(sleepq_t *sq, wchan_t wc, const char *wmsg, syncobj_t
> *sob)
> ^~
> In file included from
> /tmp/bracket/build/2020.04.19.20.35.29-i386/src/lib/librump/../../sys/rump/librump/rumpkern/sleepq.c:36:
>
> /tmp/bracket/build/2020.04.19.20.35.29-i386/src/lib/librump/../../sys/rump/../sys/sleepq.h:63:6:
> note: previous declaration of 'sleepq_enqueue' was here
> void sleepq_enqueue(sleepq_t *, wchan_t, const char *, struct syncobj *,
> ^~
> --- rumpkern_syscalls.o ---
> --- locks.o ---
> --- hyperentropy.o ---
> /tmp/bracket/build/2020.04.19.20.35.29-i386/tools/bin/nbctfconvert -g -L
> VERSION hyperentropy.o
> --- rumpkern_syscalls.o ---
> # compile librump/rumpkern_syscalls.o
> --- hyperentropy.o ---
>
> /tmp/bracket/build/2020.04.19.20.35.29-i386/tools/bin/i486--netbsdelf-objcopy
> -x hyperentropy.o
> --- sleepq.o ---
> *** [sleepq.o] Error code 1
> nbmake[7]: stopped in
> /tmp/bracket/build/2020.04.19.20.35.29-i386/src/lib/librump
> --- rumpkern_syscalls.o ---
>
> The following commits were made between the last successful build and
> the failed build:
>
> 2020.04.19.18.47.40 jdolecek src/sys/arch/xen/include/xen_shm.h,v 1.11
> 2020.04.19.18.47.40 jdolecek src/sys/arch/xen/x86/xen_shm_machdep.c,v 1.15
> 2020.04.19.18.47.40 jdolecek src/sys/arch/xen/xen/hypervisor.c,v 1.75
> 2020.04.19.18.47.40 jdolecek src/sys/arch/xen/xen/xbdback_xenbus.c,v 1.79
> 2020.04.19.19.12.37 jmcneill
> src/sys/external/bsd/drm2/dist/drm/nouveau/nouveau_nv50_display.c,v 1.12
> 2020.04.19.19.12.37 jmcneill
> src/sys/external/bsd/drm2/dist/drm/nouveau/nvkm/subdev/mmu/nouveau_nvkm_subdev_mmu_nv44.c,v
> 1.4
> 2020.04.19.19.20.32 gutteridge src/share/man/man5/fstab.5,v 1.47
> 2020.04.19.19.36.49 kre src/sys/arch/arm/arm32/pmap.c,v 1.410
> 2020.04.19.19.37.06 christos src/sbin/fsck_ffs/pass1.c,v 1.59
> 2020.04.19.20.07.53 jdolecek src/sys/arch/xen/xen/privcmd.c,v 1.55
> 2020.04.19.20.31.59 thorpej src/sys/compat/linux/common/linux_misc.c,v 1.248
> 2020.04.19.20.31.59 thorpej src/sys/compat/linux/common/linux_sched.c,v 1.74
> 2020.04.19.20.31.59 thorpej
> src/sys/compat/linux32/common/linux32_sysinfo.c,v 1.11
> 2020.04.19.20.31.59 thorpej src/sys/compat/netbsd32/netbsd32_execve.c,v 1.42
> 2020.04.19.20.31.59 thorpej src/sys/kern/kern_exec.c,v 1.497
> 2020.04.19.20.31.59 thorpej src/sys/kern/kern_exit.c,v 1.288
> 2020.04.19.20.31.59 thorpej src/sys/kern/kern_proc.c,v 1.244
> 2020.04.19.20.31.59 thorpej src/sys/miscfs/procfs/procfs_linux.c,v 1.81
> 2020.04.19.20.31.59 thorpej src/sys/miscfs/procfs/procfs_vfsops.c,v 1.105
> 2020.04.19.20.32.00 thorpej src/sys/rump/librump/rumpkern/lwproc.c,v 1.45
> 2020.04.19.20.35.29 ad src/sys/kern/kern_condvar.c,v 1.47
> 2020.04.19.20.35.29 ad src/sys/kern/kern_sleepq.c,v 1.66
> 2020.04.19.20.35.29 ad src/sys/kern/kern_synch.c,v 1.347
> 2020.04.19.20.35.29 ad src/sys/kern/kern_timeout.c,v 1.61
> 2020.04.19.20.35.29 ad src/sys/kern/kern_turnstile.c,v 1.39
> 2020.04.19.20.35.29 ad src/sys/kern/sys_lwp.c,v 1.77
> 2020.04.19.20.35.29 ad src/sys/kern/sys_select.c,v 1.54
> 2020.04.19.20.35.29 ad src/sys/sys/sleepq.h,v 1.29
>
> Logs can be found at:
>
> http://releng.NetBSD.org/b5reports/i386/commits-2020.04.html#2020.04.19.20.35.29
Re: Panic: vrelel: bad ref count (9.99.54)
Hi Yorick.

On Sat, Apr 18, 2020 at 11:00:02AM +0200, Yorick Hardy wrote:
> > I just had the same panic with 9.99.55:
> >
> > Crash version 9.99.55, image version 9.99.55.
> > crash: _kvm_kvatop(0)
> > Kernel compiled without options LOCKDEBUG.
> > System panicked: vrelel: bad ref count
> > Backtrace from time of crash is available.
> > crash> bt
> > _KERNEL_OPT_NAGR() at 0
> > ?() at 7f7ff7ecf000
> > sys_reboot() at sys_reboot
> > vpanic() at vpanic+0x181
> > vtryrele() at vtryrele
> > vcache_dealloc() at vcache_dealloc
> > uvm_unmap_detach() at uvm_unmap_detach+0x76
> > uvm_unmap1() at uvm_unmap1+0x4e
> > uvm_mremap() at uvm_mremap+0x36b
> > sys_mremap() at sys_mremap+0x68
> > syscall() at syscall+0x227
> > --- syscall (number 411) ---
> > 797459842e9a:
> > crash>
>
> The same just happened on 9.99.56 while fetching (POP) mail using mail/fdm.

Could you file a PR please? If this panics again could you please run the
"dmesg" command in crash and find out what it printed about the vnode?
That would be very useful.

Thanks,
Andrew
Re: Panic: vrelel: bad ref count (9.99.54)
Hi Yorick.

On Mon, Apr 06, 2020 at 11:16:37PM +0200, Yorick Hardy wrote:
> Crash version 9.99.54, image version 9.99.54.
> crash: _kvm_kvatop(0)
> Kernel compiled without options LOCKDEBUG.
> System panicked: vrelel: bad ref count
> Backtrace from time of crash is available.
> crash> bt
> _KERNEL_OPT_NAGR() at 0
> ?() at 7f7ff7ecf000
> sys_reboot() at sys_reboot
> vpanic() at vpanic+0x181
> vtryrele() at vtryrele
> vcache_dealloc() at vcache_dealloc
> uvm_unmap_detach() at uvm_unmap_detach+0x76
> uvm_unmap1() at uvm_unmap1+0x4e
> uvm_mremap() at uvm_mremap+0x36b
> sys_mremap() at sys_mremap+0x68
> syscall() at syscall+0x227
> --- syscall (number 411) ---
> 7f0af7842e9a:

Were you running anything noteworthy at the time?

There is a very good chance that is fixed by revision 1.217 of
src/sys/kern/vfs_lookup.c.

Thanks,
Andrew
Re: Build time measurements
Hi Andreas,

On Fri, Mar 27, 2020 at 10:39:44AM +0200, Andreas Gustafsson wrote:
> On Wednesday, I said:
> > I will rerun the 24-core tests with these disabled for comparison.
>
> Done. To recap, with a stock GENERIC kernel, the numbers were:
>
>   2016.09.06.06.27.17    3321.55 real  9853.49 user  5156.92 sys
>   2019.10.18.17.16.50    3767.63 real 10376.15 user 16100.99 sys
>   2020.03.17.22.03.41    2910.76 real  9696.10 user 18367.58 sys
>   2020.03.22.19.56.07    2711.14 real  9729.10 user 12068.90 sys
>
> After disabling DIAGNOSTIC and acpicpu, they are:
>
>   2016.09.06.06.27.17    3319.87 real  9767.39 user  4184.24 sys
>   2019.10.18.17.16.50    3525.65 real 10309.00 user 11618.57 sys
>   2020.03.17.22.03.41    2419.52 real  9577.58 user  9602.81 sys
>   2020.03.22.19.56.07    2363.06 real  9482.36 user  7614.66 sys

Thanks for repeating the tests.

For the sys time to still be that high in relation to user, there's some
other limiting factor. Does that machine have tmpfs /tmp? Is NUMA enabled
in the BIOS? Different node number for CPUs in recent kernels in dmesg is
a good clue. Is it a really old source tree?

I would be interested to see lockstat output from a kernel build at some
point, if you're so inclined.

Cheers,
Andrew
Re: Automated report: NetBSD-current/i386 build failure
Fixed with 1.18 src/sys/rump/librump/rumpkern/sleepq.c. Apologies, forgot
to commit earlier.

Andrew

On Thu, Mar 26, 2020 at 10:36:27PM +0000, NetBSD Test Fixture wrote:
> This is an automatically generated notice of a NetBSD-current/i386
> build failure.
>
> The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host,
> using sources from CVS date 2020.03.26.21.25.26.
>
> An extract from the build.sh output follows:
>
> *(elm)->field.tqe_prev = (elm)->field.tqe_next; \
> ^~~~
>
> /tmp/bracket/build/2020.03.26.21.25.26-i386/src/lib/librump/../../sys/rump/librump/rumpkern/sleepq.c:131:2:
> note: in expansion of macro 'TAILQ_REMOVE'
> TAILQ_REMOVE(l->l_sleepq, l, l_sleepchain);
> ^~~~
>
> /tmp/bracket/build/2020.03.26.21.25.26-i386/src/lib/librump/../../sys/rump/../sys/queue.h:548:40:
> error: 'struct <anonymous>' has no member named 'tqe_next'; did you mean
> 'le_next'?
> *(elm)->field.tqe_prev = (elm)->field.tqe_next; \
> ^~~~
>
> /tmp/bracket/build/2020.03.26.21.25.26-i386/src/lib/librump/../../sys/rump/librump/rumpkern/sleepq.c:131:2:
> note: in expansion of macro 'TAILQ_REMOVE'
> TAILQ_REMOVE(l->l_sleepq, l, l_sleepchain);
> ^~~~
> *** [sleepq.o] Error code 1
> nbmake[7]: stopped in
> /tmp/bracket/build/2020.03.26.21.25.26-i386/src/lib/librump
> --- rumpkern_syscalls.o ---
>
> The following commits were made between the last successful build and
> the failed build:
>
> 2020.03.26.19.42.39 ad src/sys/kern/kern_idle.c,v 1.33
> 2020.03.26.19.42.39 ad src/sys/kern/kern_synch.c,v 1.345
> 2020.03.26.19.46.42 ad src/sys/kern/kern_condvar.c,v 1.44
> 2020.03.26.19.46.42 ad src/sys/kern/kern_sleepq.c,v 1.63
> 2020.03.26.19.46.42 ad src/sys/kern/kern_turnstile.c,v 1.37
> 2020.03.26.19.46.42 ad src/sys/kern/sys_select.c,v 1.53
> 2020.03.26.19.46.42 ad src/sys/sys/condvar.h,v 1.15
> 2020.03.26.19.46.42 ad src/sys/sys/lwp.h,v 1.203
> 2020.03.26.19.46.42 ad src/sys/sys/sleepq.h,v 1.28
> 2020.03.26.19.47.23 ad src/sys/sys/param.h,v 1.655
> 2020.03.26.20.19.06 ad src/sys/kern/kern_lwp.c,v 1.230
> 2020.03.26.20.19.06 ad src/sys/kern/kern_softint.c,v 1.63
> 2020.03.26.20.19.06 ad src/sys/sys/intr.h,v 1.20
> 2020.03.26.20.19.06 ad src/sys/sys/userret.h,v 1.33
> 2020.03.26.21.15.14 ad src/sys/sys/syncobj.h,v 1.13
> 2020.03.26.21.25.26 ad src/sys/kern/kern_sig.c,v 1.385
>
> Logs can be found at:
>
> http://releng.NetBSD.org/b5reports/i386/commits-2020.03.html#2020.03.26.21.25.26
Re: locking error using today's sources
Fixed as of src/sys/kern/kern_lwp.c 1.231.

Andrew

On Mon, Mar 23, 2020 at 01:00:11AM +0000, Andrew Doran wrote:
> Hi,
>
> I looked into this, it's quite an old bug and you were unlucky to run into
> it, there's a very small window of opportunity for it to occur. I'll see
> about fixing it.
>
> Thanks,
> Andrew
>
> On Thu, Mar 19, 2020 at 02:51:05PM -0700, David Hopper wrote:
> > I just got this using today's kernel+userland during a pkgsrc Rust
> > compilation while updating Firefox (pkgsrc also today's). System is a
> > ThinkPad P53, i9 9880H and 48GB RAM:
> >
> > [ 8734.5350365] panic: kernel diagnostic assertion "l->l_refcnt != 0"
> > failed: file "/usr/src/sys/kern/kern_lwp.c", line 1701
> > [ 8734.5350365] cpu2: begin traceback...
> > [ 8734.5350365] vpanic() at netbsd:vpanic+0x178
> > [ 8734.5350365] kern_assert() at netbsd:kern_assert+0x48
> > [ 8734.5350365] lwp_addref() at netbsd:lwp_addref+0xbc
> > [ 8734.5350365] procfs_rw() at netbsd:procfs_rw+0xeb
> > [ 8734.5350365] VOP_READ() at netbsd:VOP_READ+0x53
> > [ 8734.5350365] vn_read() at netbsd:vn_read+0x88
> > [ 8734.5350365] dofileread() at netbsd:dofileread+0x8c
> > [ 8734.5350365] sys_read() at netbsd:sys_read+0x49
> > [ 8734.5350365] syscall() at netbsd:syscall+0x299
> > [ 8734.5350365] --- syscall (number 3) ---
> > [ 8734.5350365] 7519d7a42baa:
> > [ 8734.5350365] cpu2: End traceback...
> > [ 8734.5350365] Mutex error: mutex_vector_enter,542: locking against myself
> > [ 8734.5350365] lock address : 0xced93d804b00
> > [ 8734.5350365] current cpu  : 2
> > [ 8734.5350365] current lwp  : 0xced7c1ce42c0
> > [ 8734.5350365] owner field  : 0xced7c1ce42c0 wait/spin: 0/0
> > [ 8734.5350365] Skipping crash dump on recursive panic
> > [ 8734.5350365] panic: lock error: Mutex: mutex_vector_enter,542: locking
> > against myself: lock 0xced93d804b00 cpu 2 lwp 0xced7c1ce42c0
> > [ 8734.5350365] cpu2: Begin traceback...
> > [ 8734.5350365] vpanic() at netbsd:vpanic+0x178
> > [ 8734.5350365] snprintf() at netbsd:snprintf
> > [ 8734.5350620] lockdebug_abort() at netbsd:lockdebug_abort+0xe6
> > [ 8734.5350620] mutex_vector_enter() at netbsd:mutex_vector_enter+0x3c1
> > [ 8734.5350620] suspendsched() at netbsd:suspendsched+0xf5
> > [ 8734.5350620] cpu_reboot() at netbsd:cpu_reboot+0x46
> > [ 8734.5350620] sys_reboot() at netbsd:sys_reboot
> > [ 8734.5350620] vpanic() at netbsd:vpanic+0x181
> > [ 8734.5350620] kern_assert() at netbsd:kern_assert+0x48
> > [ 8734.5350620] lwp_addref() at netbsd:lwp_addref+0xbc
> > [ 8734.5350620] procfs_rw() at netbsd:procfs_rw+0xeb
> > [ 8734.5350620] VOP_READ() at netbsd:VOP_READ+0x53
> > [ 8734.5350620] vn_read() at netbsd:vn_read+0x88
> > [ 8734.5350620] dofileread() at netbsd:dofileread+0x8c
> > [ 8734.5350620] sys_read() at netbsd:sys_read+0x49
> > [ 8734.5350620] syscall() at netbsd:syscall+0x299
> > [ 8734.5350620] --- syscall (number 3) ---
> > [ 8734.5350620] 7519d7a42baa:
> > [ 8734.5350620] cpu2: End traceback...
> >
> > Let me know if there's anything else needed.
> > David
Re: Build time measurements
On Wed, Mar 25, 2020 at 09:44:19PM +0000, Mike Pumford wrote:
> On 24/03/2020 21:47, Andrew Doran wrote:
> > DIAGNOSTIC and acpicpu are disabled in all kernels but they are otherwise
> > GENERIC. The 2020-04-?? kernel is HEAD plus the remaining changes from the
> > ad-namecache branch.
>
> Curious to know why acpicpu is a performance hit. Is it just that it
> downclocks the CPU if you don't have estd to ramp it up or something more
> fundamental?

It's a software problem right now. The ACPI idle loop doesn't currently
enter a low power sleep state because there are issues with interrupts to
solve first. Nevertheless it's very heavy on I/O port access, takes locks
and under certain conditions flushes the entire L1/L2/L3 cache (I haven't
verified that yet but the cache miss rate observed with tprof(8) is very
high).

Andrew
Re: Build time measurements
Hi Andreas.

On Mon, Mar 23, 2020 at 04:11:17PM +0200, Andreas Gustafsson wrote:
> In September and November, I reported some measurements of the amount
> of system time it takes to build a NetBSD-8/amd64 release on different
> versions of -current/amd64. I have now repeated the measurements with
> a couple of newer versions of -current on the same hardware, and here
> are the results. The left column is the source date of the -current
> system hosting the build.
>
> HP ProLiant DL360 G7, 2 x Xeon L5630, 8 cores, 32 GB, build.sh -j 8
>
>   2016.09.06.06.27.17    3930.86 real 15737.04 user  4245.26 sys
>   2019.10.18.17.16.50    4461.47 real 16687.37 user  9344.68 sys
>   2020.03.17.22.03.41    4723.81 real 16646.42 user  8928.72 sys
>   2020.03.22.19.56.07    4595.95 real 16592.80 user  8171.56 sys
>
> I also measured the same versions on a newer machine with more cores:
>
> Dell PowerEdge 630, 2 x Xeon E5-2678 v3, 24 cores, 32 GB, build.sh -j 24
>
>   2016.09.06.06.27.17    3321.55 real  9853.49 user  5156.92 sys
>   2019.10.18.17.16.50    3767.63 real 10376.15 user 16100.99 sys
>   2020.03.17.22.03.41    2910.76 real  9696.10 user 18367.58 sys
>   2020.03.22.19.56.07    2711.14 real  9729.10 user 12068.90 sys

Thank you for doing this, and for bisecting the performance losses over
time (I fixed the vnode regression you found BTW).

There are two options enabled in -current that spoil performance on multi
processor machines: DIAGNOSTIC and acpicpu. I'm guessing that you had
both enabled during your test runs. We ship releases without DIAGNOSTIC,
and acpicpu really needs to be fixed.

I did some "build.sh release" runs on a machine vaguely similar in spec
to your second one, a ThinkStation D30 with 2x Xeon E5-2696 v2.
DIAGNOSTIC and acpicpu are disabled in all kernels but they are otherwise
GENERIC. The 2020-04-?? kernel is HEAD plus the remaining changes from
the ad-namecache branch. A Linux result is included too for a reference
point. I would have tried FreeBSD as well, but don't have it installed
on this machine yet.

Andrew

HT disabled in BIOS so -j24:

2019-10-23      2445.46 real 17297.90 user 16725.01 sys
2020-01-16      2013.58 real 16772.79 user  7801.39 sys
2020-03-23      1850.98 real 16383.89 user  4777.89 sys
2020-04-??      1791.62 real 16367.51 user  3662.62 sys
Linux 5.4.19    1688.29 real 15682.22 user  1962.81 sys

HT enabled so -j48. With percentage real time reduction thanks to HT:

2019-10-23   -5% 2583.56 real 24459.69 user 45719.70 sys
2020-01-16    0% 2023.79 real 24495.20 user 20431.37 sys
2020-03-23    5% 1765.39 real 24348.67 user  8856.36 sys
2020-04-??    7% 1672.71 real 24770.28 user     .92 sys
Linux 5.4.19  3% 1644.41 real 24425.07 user  2926.18 sys
Re: 9.99.51 crash: kernel diagnostic assertion "ncp->nc_vp == vp" failed
On Mon, Mar 23, 2020 at 08:27:15AM +0100, Thomas Klausner wrote:
> I've updated to last night's 9.99.51, started a bulk build and went to
> sleep. It lasted about an hour or two, then it paniced with:
>
> kernel diagnostic assertion "ncp->nc_vp == vp" failed: file
> "/usr/src/sys/kern/vfs_cache.c", line 798

There was a race condition in vfs_cache.c which could have been triggered
during reverse lookup (for getcwd()) and given you're doing pbulk that
makes sense! Anyway it's fixed now so hopefully you don't see this one
again.

Thank you,
Andrew

> I have a core file, but it has at least 11000 frames that look like this:
>
> #3894 0xf000eef3f000eef3 in ?? ()
> #3895 0xf000eef3f000eef3 in ?? ()
> #3896 0xf000ff53f000ef57 in ?? ()
> #3897 0xf000eef3f000eef3 in ?? ()
> #3898 0xf000eef3f000e2c3 in ?? ()
> #3899 0xf000ff54f000eef3 in ?? ()
> #3900 0xf000322ff0003287 in ?? ()
> #3901 0xf000e987f000fea5 in ?? ()
> #3902 0xf000eef3f000eef3 in ?? ()
> #3903 0xf000eef3f000eef3 in ?? ()
>
> So I guess something went wrong writing or recovering it.
>  Thomas
Re: locking error using today's sources
Hi,

I looked into this, it's quite an old bug and you were unlucky to run
into it, there's a very small window of opportunity for it to occur.
I'll see about fixing it.

Thanks,
Andrew

On Thu, Mar 19, 2020 at 02:51:05PM -0700, David Hopper wrote:
> I just got this using today's kernel+userland during a pkgsrc Rust
> compilation while updating Firefox (pkgsrc also today's). System is a
> ThinkPad P53, i9 9880H and 48GB RAM:
>
> [ 8734.5350365] panic: kernel diagnostic assertion "l->l_refcnt != 0" failed:
> file "/usr/src/sys/kern/kern_lwp.c", line 1701
> [ 8734.5350365] cpu2: begin traceback...
> [ 8734.5350365] vpanic() at netbsd:vpanic+0x178
> [ 8734.5350365] kern_assert() at netbsd:kern_assert+0x48
> [ 8734.5350365] lwp_addref() at netbsd:lwp_addref+0xbc
> [ 8734.5350365] procfs_rw() at netbsd:procfs_rw+0xeb
> [ 8734.5350365] VOP_READ() at netbsd:VOP_READ+0x53
> [ 8734.5350365] vn_read() at netbsd:vn_read+0x88
> [ 8734.5350365] dofileread() at netbsd:dofileread+0x8c
> [ 8734.5350365] sys_read() at netbsd:sys_read+0x49
> [ 8734.5350365] syscall() at netbsd:syscall+0x299
> [ 8734.5350365] --- syscall (number 3) ---
> [ 8734.5350365] 7519d7a42baa:
> [ 8734.5350365] cpu2: End traceback...
> [ 8734.5350365] Mutex error: mutex_vector_enter,542: locking against myself
> [ 8734.5350365] lock address : 0xced93d804b00
> [ 8734.5350365] current cpu  : 2
> [ 8734.5350365] current lwp  : 0xced7c1ce42c0
> [ 8734.5350365] owner field  : 0xced7c1ce42c0 wait/spin: 0/0
> [ 8734.5350365] Skipping crash dump on recursive panic
> [ 8734.5350365] panic: lock error: Mutex: mutex_vector_enter,542: locking
> against myself: lock 0xced93d804b00 cpu 2 lwp 0xced7c1ce42c0
> [ 8734.5350365] cpu2: Begin traceback...
> [ 8734.5350365] vpanic() at netbsd:vpanic+0x178
> [ 8734.5350365] snprintf() at netbsd:snprintf
> [ 8734.5350620] lockdebug_abort() at netbsd:lockdebug_abort+0xe6
> [ 8734.5350620] mutex_vector_enter() at netbsd:mutex_vector_enter+0x3c1
> [ 8734.5350620] suspendsched() at netbsd:suspendsched+0xf5
> [ 8734.5350620] cpu_reboot() at netbsd:cpu_reboot+0x46
> [ 8734.5350620] sys_reboot() at netbsd:sys_reboot
> [ 8734.5350620] vpanic() at netbsd:vpanic+0x181
> [ 8734.5350620] kern_assert() at netbsd:kern_assert+0x48
> [ 8734.5350620] lwp_addref() at netbsd:lwp_addref+0xbc
> [ 8734.5350620] procfs_rw() at netbsd:procfs_rw+0xeb
> [ 8734.5350620] VOP_READ() at netbsd:VOP_READ+0x53
> [ 8734.5350620] vn_read() at netbsd:vn_read+0x88
> [ 8734.5350620] dofileread() at netbsd:dofileread+0x8c
> [ 8734.5350620] sys_read() at netbsd:sys_read+0x49
> [ 8734.5350620] syscall() at netbsd:syscall+0x299
> [ 8734.5350620] --- syscall (number 3) ---
> [ 8734.5350620] 7519d7a42baa:
> [ 8734.5350620] cpu2: End traceback...
>
> Let me know if there's anything else needed.
> David
Re: Heads up: UVM changes
On Sun, Mar 22, 2020 at 12:40:00PM -0700, Jason Thorpe wrote:
> > On Mar 22, 2020, at 12:34 PM, Andrew Doran wrote:
> >
> > This works well for me on amd64, but I haven't tried it on other machines.
> > From code inspection I can see that some architectures need more work so
> > this is disabled on them until I can make the needed changes: hppa, mips,
> > powerpc and riscv. It's enabled everywhere else.
>
> Can you provide a summary of what's needed to fix these other pmap
> modules? Or, at least, what the issue is that you're avoiding with the
> switch?

Sure, I'll try to give an outline.

Before this the pmap module could work with the assumption that for a
given page "pg", at all times there was an exclusive lock held by the VM
system on the owning object, and so there would never be any concurrent
pmap operations on "pg". The pmap module could also assume that within a
single pmap, there would never be concurrent operations on the same set
of VAs.

Both of those things have changed now. For example you can have one
process doing pmap_enter() on a page while a different process does
pmap_remove() for the same page. You can also have N threads in a
multi-threaded process racing to do pmap_enter() on exactly the same page
and same VA because they faulted on it at the same time and both think
it's unentered (those threads will never try to do something conflicting
though, i.e. the new mapping will always have the same attributes).

In the pmap module there are situations that can be excluded by making
assumptions about what locks are held in the VM system, but I don't think
the middle ground between "always strongly locked" and "never locked" is
a happy place to be - so for this concurrent handling of faults I think
the pmap should be able to handle anything thrown at it. The only
assumption I think is reasonable to make is that a page won't change
identity while being worked on in the pmap module, and won't change
identity between a call to e.g. pmap_remove() -> pmap_update().

With that in mind for each pmap it's a matter of code inspection while
thinking about the "what if" scenarios. For example the races above, or
like what if pmap_is_modified(pg) is called right in the middle of
another thread doing pmap_remove() for the same page - will the right
answer be produced.

I think the approach to solving it depends on the machine and how much
effort we can put into testing / maintaining it. For vax I've gone with
a global pmap_lock because it's a good compromise there. Not taxing on
the machine or the programmer at all. Obviously that wouldn't make the
grade for something modern with many CPUs. The other two basic
approaches we have to solving it are firstly per-pmap + per-page locking
like x86 or aarch64, and secondly the alpha (old x86) approach of a
global RW lock + per-pmap locks (benchmarked extensively back in 2007 -
it works for sure, but is worse than a global mutex from a performance
POV).

Andrew
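To illustrate the first of those approaches, a schematic of the per-pmap
+ per-page locking discipline (the structures and names here are
illustrative only, not the real NetBSD pmap internals):

    #include <sys/mutex.h>

    /* Schematic structures -- not the real pmap/vm_page. */
    struct page_s {
    	kmutex_t pv_lock;	/* protects this page's PV list */
    };
    struct pmap_s {
    	kmutex_t pm_lock;	/* protects this pmap's VA space */
    };

    static void
    pmap_enter_sketch(struct pmap_s *pm, struct page_s *pg)
    {
    	mutex_enter(&pm->pm_lock);
    	mutex_enter(&pg->pv_lock);
    	/*
    	 * Safe here against both races described above: a concurrent
    	 * pmap_remove() of the same page from another pmap must also
    	 * take pg->pv_lock before editing the PV list, and another
    	 * thread of this process faulting on the same VA must take
    	 * pm->pm_lock first and will then see the mapping exists.
    	 *
    	 * ...install the PTE and add the PV entry here...
    	 */
    	mutex_exit(&pg->pv_lock);
    	mutex_exit(&pm->pm_lock);
    }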
Heads up: UVM changes
Hi,

I changed UVM to allow for concurrent page faults on shared objects.
Previously this was single threaded due to locking, which caused a lot of
contention over busy objects like libc.so or PostgreSQL's shared buffer
for example.

This works well for me on amd64, but I haven't tried it on other machines.
From code inspection I can see that some architectures need more work so
this is disabled on them until I can make the needed changes: hppa, mips,
powerpc and riscv. It's enabled everywhere else.

I will keep an eye out for any bug reports.

Thank you,
Andrew
Re: Automated report: NetBSD-current/i386 build failure
Fixed with src/sys/kern/vfs_vnode.c 1.115.

Andrew

On Sun, Mar 22, 2020 at 04:23:47PM +0000, NetBSD Test Fixture wrote:
> This is an automatically generated notice of a NetBSD-current/i386
> build failure.
>
> The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host,
> using sources from CVS date 2020.03.22.14.43.05.
>
> An extract from the build.sh output follows:
>
> --- vfs_vnops.pico ---
> --- vfs_vnode.po ---
>
> -I/tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/include
> -I/tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/include/opt
> -I/tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/../arch
> -I/tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/..
> -DDIAGNOSTIC -DKTRACE -D_FORTIFY_SOURCE=2 -c -DGPROF -DPROF -pg -fPIC
> /tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnode.c
> -o vfs_vnode.po
> --- vfs_vnode.pico ---
>
> /tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnode.c:
> In function 'vcache_reclaim':
>
> /tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnode.c:1679:17:
> error: 'VI_DEADCHECK' undeclared (first use in this function); did you mean
> 'PG_READAHEAD'?
> vp->v_iflag |= VI_DEADCHECK; /* for genfs_getpages() */
> ^~~~
> PG_READAHEAD
>
> /tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnode.c:1679:17:
> note: each undeclared identifier is reported only once for each function it
> appears in
> *** [vfs_vnode.pico] Error code 1
> nbmake[7]: stopped in
> /tmp/bracket/build/2020.03.22.14.43.05-i386/src/lib/librumpvfs
> --- vfs_syscalls_50.pico ---
>
> The following commits were made between the last successful build and
> the failed build:
>
> 2020.03.22.13.30.10 pgoyette src/lib/librumpuser/rumpuser_dl.c,v 1.33
> 2020.03.22.13.30.10 pgoyette src/sys/rump/include/rump/rumpuser.h,v 1.116
> 2020.03.22.13.30.10 pgoyette src/sys/rump/librump/rumpkern/rump.c,v 1.343
> 2020.03.22.14.27.33 ad src/distrib/sets/lists/comp/mi,v 1.2313
> 2020.03.22.14.27.33 ad src/sys/sys/Makefile,v 1.172
> 2020.03.22.14.27.33 ad src/sys/sys/vnode_impl.h,v 1.22
> 2020.03.22.14.38.37 ad src/sys/kern/init_sysctl.c,v 1.225
> 2020.03.22.14.38.37 ad src/sys/kern/vfs_cache.c,v 1.128
> 2020.03.22.14.38.37 ad src/sys/kern/vfs_getcwd.c,v 1.56
> 2020.03.22.14.38.37 ad src/sys/kern/vfs_vnode.c,v 1.114
> 2020.03.22.14.38.37 ad src/sys/sys/namei.src,v 1.49
> 2020.03.22.14.38.37 ad src/sys/sys/vnode_impl.h,v 1.23
> 2020.03.22.14.39.03 ad src/sys/rump/include/rump/rump_namei.h,v 1.39
> 2020.03.22.14.39.03 ad src/sys/sys/namei.h,v 1.105
> 2020.03.22.14.39.28 ad src/usr.bin/vmstat/vmstat.c,v 1.237
> 2020.03.22.14.41.32 ad src/usr.bin/pmap/main.c,v 1.28
> 2020.03.22.14.41.32 ad src/usr.bin/pmap/pmap.c,v 1.55
> 2020.03.22.14.41.32 ad src/usr.bin/pmap/pmap.h,v 1.12
> 2020.03.22.14.43.05 ad src/sys/sys/param.h,v 1.654
>
> Logs can be found at:
>
> http://releng.NetBSD.org/b5reports/i386/commits-2020.03.html#2020.03.22.14.43.05
Re: Another pmap panic
Hi,

I meant to send a note yesterday but fatigue got the better of me. I
suggest updating to the latest, delivered yesterday, which has fixes for
every problem I have encountered or seen mentioned including this one,
and survives low memory stress testing for me:

    /* $NetBSD: pmap.c,v 1.378 2020/03/19 18:58:14 ad Exp $ */

Thank you,
Andrew

On Fri, Mar 20, 2020 at 05:49:59PM +0000, Chavdar Ivanov wrote:
> Hi,
>
> Overnight, while doing pkg_rolling-replace, my 'server' got:
> ...
> panic: kernel diagnostic assertion "ptp->wire_count == 1" failed: file
> "/home/sysbuild/src/sys/arch/x86/x86/pmap.c", line 2232
>
> cpu0: Begin traceback...
> vpanic() at netbsd:vpanic+0x178
> kern_assert() at netbsd:kern_assert+0x48
> pmap_unget_ptp() at netbsd:pmap_unget_ptp+0x1f1
> pmap_get_ptp() at netbsd:pmap_get_ptp+0x300
> pmap_enter_ma() at netbsd:pmap_enter_ma+0x6fb
> pmap_enter_default() at netbsd:pmap_enter_default+0x29
> uvm_fault_internal() at netbsd:uvm_fault_internal+0xf2e
> trap() at netbsd:trap+0x50a
> --- trap (number 6) ---
> 7f7eec2007e0:
> cpu0: End traceback...
>
> dumping to dev 168,2 (offset=8, size=5225879):
> ...
>
> --
Re: pmap panic
Ok. I think the problems here should be fixed.

Andrew

On Sun, Mar 15, 2020 at 04:16:20PM +0000, Andrew Doran wrote:
> Hi,
>
> Thanks for the reports. This and the NVMM related panics should be fixed
> now, with: 1.369 src/sys/arch/x86/x86/pmap.c
>
> I don't have a machine capable of running X11 on NetBSD at the moment so I
> will spin up qemu or VirtualBox or something to try that out now.
>
> Apologies for the disruption. This stuff is tough to get right. The good
> news is I am almost finished with my changes to the VM system.
>
> Cheers,
> Andrew
>
> On Sun, Mar 15, 2020 at 02:09:00PM +0000, Patrick Welche wrote:
> > Updating an amd64 laptop from Wednesday's to today Mar 15 13:47 source, I
> > get:
> >
> > #0  0x80222a45 in cpu_reboot (howto=howto@entry=260,
> >     bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:720
> > #1  0x806371fa in kern_reboot (howto=howto@entry=260,
> >     bootstr=bootstr@entry=0x0) at ../../../../kern/kern_reboot.c:73
> > #2  0x806691ef in vpanic (fmt=fmt@entry=0x80b325a8 "trap",
> >     ap=ap@entry=0xb0814d473808) at ../../../../kern/subr_prf.c:336
> > #3  0x806692b3 in panic (fmt=fmt@entry=0x80b325a8 "trap")
> >     at ../../../../kern/subr_prf.c:255
> > #4  0x8022484d in trap (frame=0xb0814d473950)
> >     at ../../../../arch/amd64/amd64/trap.c:326
> > #5  0x8021db75 in alltraps ()
> > #6  0x802402fb in pmap_lookup_pv (ptp=0xb08000815580,
> >     old_pp=old_pp@entry=0x0, va=va@entry=140187359342592,
> >     pmap=<optimized out>) at ../../../../arch/x86/x86/pmap.c:2010
> > #7  0x80243e48 in pmap_enter_ma (pmap=0x8e88998883c0,
> >     va=va@entry=140187359342592, ma=ma@entry=2147483648,
> >     pa=pa@entry=2147483648, prot=<optimized out>, flags=547,
> >     domid=domid@entry=0) at ../../../../arch/x86/x86/pmap.c:4671
> > #8  0x8024436a in pmap_enter_default (pmap=<optimized out>,
> >     va=va@entry=140187359342592, pa=pa@entry=2147483648,
> >     prot=<optimized out>, flags=<optimized out>)
> >     at ../../../../arch/x86/x86/pmap.c:4539
> > #9  0x805f4907 in udv_fault (ufi=0xb0814d473c98,
> >     vaddr=<optimized out>, pps=0xb0814d473d50, npages=<optimized out>,
> >     centeridx=0, access_type=2, flags=66) at ../../../../uvm/uvm_device.c:429
> > #10 0x805f5df9 in uvm_fault_internal (
> >     orig_map=orig_map@entry=0x8e889a67f1c0,
> >     vaddr=vaddr@entry=140187359342592, access_type=access_type@entry=2,
> >     fault_flag=fault_flag@entry=0) at ../../../../uvm/uvm_fault.c:878
> > #11 0x80224326 in trap (frame=0xb0814d473f00)
> >     at ../../../../arch/amd64/amd64/trap.c:520
> > #12 0x8021db75 in alltraps ()
> >
> > (custom kernel without DIAGNOSTIC nor DEBUG)
> >
> > 11 March fine...
> >
> > Cheers,
> >
> > Patrick
Re: current: completely stuck after four minutes of uptime
On Mon, Mar 16, 2020 at 11:14:38AM +0100, Lars Reichardt wrote:
> On 2020-03-16 10:45, Thomas Klausner wrote:
> > On Sun, Mar 15, 2020 at 11:29:15AM +0100, Thomas Klausner wrote:
> > > I've just upgraded my 9.99.49 kernel from March 12 to today's from an
> > > hour ago.
> > >
> > > After rebooting, the machine got stuck in less than five minutes.
> > >
> > > No reaction to CTRL-ALT-ESC from the console, no reaction to pressing
> > > the power button.
> >
> > This problem was still there with a kernel from last night.
> >  Thomas
>
> I've seen that behavior with a Thinkpad x201s i7 but not with my Ryzen 2700.
>
> The Thinkpad locks somehow up can switch consoles capslock ok but processes
> seem stuck.
>
> Top is not updating etc.
>
> The Ryzen seems completely unaffected running builds in the background with
> X11 running.

It sounds like the hanging systems make use of DRM/X11. I fixed an issue
today that could have caused that:

    1.10 src/sys/uvm/pmap/pmap_pvt.c

Andrew
Re: pmap panic
Hi,

Thanks for the reports. This and the NVMM related panics should be fixed
now, with: 1.369 src/sys/arch/x86/x86/pmap.c

I don't have a machine capable of running X11 on NetBSD at the moment so
I will spin up qemu or VirtualBox or something to try that out now.

Apologies for the disruption. This stuff is tough to get right. The good
news is I am almost finished with my changes to the VM system.

Cheers,
Andrew

On Sun, Mar 15, 2020 at 02:09:00PM +0000, Patrick Welche wrote:
> Updating an amd64 laptop from Wednesday's to today Mar 15 13:47 source, I
> get:
>
> #0  0x80222a45 in cpu_reboot (howto=howto@entry=260,
>     bootstr=bootstr@entry=0x0) at ../../../../arch/amd64/amd64/machdep.c:720
> #1  0x806371fa in kern_reboot (howto=howto@entry=260,
>     bootstr=bootstr@entry=0x0) at ../../../../kern/kern_reboot.c:73
> #2  0x806691ef in vpanic (fmt=fmt@entry=0x80b325a8 "trap",
>     ap=ap@entry=0xb0814d473808) at ../../../../kern/subr_prf.c:336
> #3  0x806692b3 in panic (fmt=fmt@entry=0x80b325a8 "trap")
>     at ../../../../kern/subr_prf.c:255
> #4  0x8022484d in trap (frame=0xb0814d473950)
>     at ../../../../arch/amd64/amd64/trap.c:326
> #5  0x8021db75 in alltraps ()
> #6  0x802402fb in pmap_lookup_pv (ptp=0xb08000815580,
>     old_pp=old_pp@entry=0x0, va=va@entry=140187359342592,
>     pmap=<optimized out>) at ../../../../arch/x86/x86/pmap.c:2010
> #7  0x80243e48 in pmap_enter_ma (pmap=0x8e88998883c0,
>     va=va@entry=140187359342592, ma=ma@entry=2147483648,
>     pa=pa@entry=2147483648, prot=<optimized out>, flags=547,
>     domid=domid@entry=0) at ../../../../arch/x86/x86/pmap.c:4671
> #8  0x8024436a in pmap_enter_default (pmap=<optimized out>,
>     va=va@entry=140187359342592, pa=pa@entry=2147483648,
>     prot=<optimized out>, flags=<optimized out>)
>     at ../../../../arch/x86/x86/pmap.c:4539
> #9  0x805f4907 in udv_fault (ufi=0xb0814d473c98,
>     vaddr=<optimized out>, pps=0xb0814d473d50, npages=<optimized out>,
>     centeridx=0, access_type=2, flags=66) at ../../../../uvm/uvm_device.c:429
> #10 0x805f5df9 in uvm_fault_internal (
>     orig_map=orig_map@entry=0x8e889a67f1c0,
>     vaddr=vaddr@entry=140187359342592, access_type=access_type@entry=2,
>     fault_flag=fault_flag@entry=0) at ../../../../uvm/uvm_fault.c:878
> #11 0x80224326 in trap (frame=0xb0814d473f00)
>     at ../../../../arch/amd64/amd64/trap.c:520
> #12 0x8021db75 in alltraps ()
>
> (custom kernel without DIAGNOSTIC nor DEBUG)
>
> 11 March fine...
>
> Cheers,
>
> Patrick
Re: change within last day broke nvmm
On Sun, Mar 15, 2020 at 02:38:19PM +0100, Tobias Nygren wrote:
> This is consistently reproducible while trying to boot Linux on nvmm.
>
> panic: LIST_INSERT_HEAD 0x88713368 x86/pmap.c:2135
> vpanic()
> panic()
> pmap_enter_pv()
> pmap_ept_enter()
> uvm_fault_lower_enter()
> uvm_fault_internal()
> nvmm_ioctl()
> sys_ioctl()
> syscall()

I see a related panic. Looking into it.

Andrew
Re: Automated report: NetBSD-current/i386 build failure
Should be fixed with 1.91 src/sys/miscfs/genfs/genfs_io.c.

Andrew

On Sat, Mar 14, 2020 at 07:46:02PM +0000, NetBSD Test Fixture wrote:
> This is an automatically generated notice of a NetBSD-current/i386
> build failure.
>
> The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host,
> using sources from CVS date 2020.03.14.18.08.40.
>
> An extract from the build.sh output follows:
>
> # compile librumpvfs/spec_vnops.po
> --- rump_vfs.po ---
>
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-objcopy
> -X rump_vfs.po
> --- spec_vnops.po ---
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-gcc
> -O2 -fno-delete-null-pointer-checks -ffreestanding -fno-strict-aliasing
> -msoft-float -mno-mmx -mno-sse -mno-avx -msoft-float -mno-mmx -mno-sse
> -mno-avx -std=gnu99 -Wall -Wstrict-prototypes -Wmissing-prototypes
> -Wpointer-arith -Wno-sign-compare -Wsystem-headers -Wno-traditional
> -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual
> -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Werror
> -Wno-format-zero-length -Wno-pointer-sign -fPIE -fstack-protector
> -Wstack-protector --param ssp-buffer-size=1
> --sysroot=/tmp/bracket/build/2020.03.14.18.08.40-i386/destdir -DCOMPAT_50
> -DCOMPAT_60 -DCOMPAT_70 -DCOMPAT_80 -nostdinc -imacros
> /tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/include/opt/opt_rumpkernel.h
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs -I.
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../..
> /sys/rump/../../common/include
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/include
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/include/opt
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/../arch
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/..
> -DDIAGNOSTIC -DKTRACE -D_FORTIFY_SOURCE=2 -c -DGPROF -DPROF -pg -fPIC
> /tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/../miscfs/specfs/spec_vnops.c
> -o spec_vnops.po
> --- subr_bufq.pico ---
> # compile librumpvfs/subr_bufq.pico
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-gcc
> -O2 -fno-delete-null-pointer-checks -ffreestanding -fno-strict-aliasing
> -msoft-float -mno-mmx -mno-sse -mno-avx -msoft-float -mno-mmx -mno-sse
> -mno-avx -std=gnu99 -Wall -Wstrict-prototypes -Wmissing-prototypes
> -Wpointer-arith -Wno-sign-compare -Wsystem-headers -Wno-traditional
> -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual
> -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Werror
> -Wno-format-zero-length -Wno-pointer-sign -fPIE -fstack-protector
> -Wstack-protector --param ssp-buffer-size=1
> --sysroot=/tmp/bracket/build/2020.03.14.18.08.40-i386/destdir -DCOMPAT_50
> -DCOMPAT_60 -DCOMPAT_70 -DCOMPAT_80 -nostdinc -imacros
> /tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../../sys/rump/include/opt/opt_rumpkernel.h
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs -I.
> -I/tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs/../..
> /sys/rump/../../common/include
> --- rumpvfs_if_wrappers.po ---
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/nbctfconvert -g -L
> VERSION rumpvfs_if_wrappers.po
>
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-objcopy
> -X rumpvfs_if_wrappers.po
> --- rumpblk.po ---
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/nbctfconvert -g -L
> VERSION rumpblk.po
>
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-objcopy
> -X rumpblk.po
> --- rumpvfs_if_wrappers.pico ---
>
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-objcopy
> -x rumpvfs_if_wrappers.pico
> --- rumpvfs_syscalls.po ---
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/nbctfconvert -g -L
> VERSION rumpvfs_syscalls.po
>
> /tmp/bracket/build/2020.03.14.18.08.40-i386/tools/bin/i486--netbsdelf-objcopy
> -X rumpvfs_syscalls.po
> --- genfs_io.pico ---
> cc1: all warnings being treated as errors
> *** [genfs_io.pico] Error code 1
> nbmake[7]: stopped in
> /tmp/bracket/build/2020.03.14.18.08.40-i386/src/lib/librumpvfs
> --- genfs_io.po ---
>
> The following commits were made between the last successful build and
> the failed build:
>
> 2020.03.14.13.34.43 ad src/sys/arch/sparc/sparc/intr.c,v 1.124
> 2020.03.14.13.37.49 ad src/sys/fs/tmpfs/tmpfs_subr.c,v 1.107
> 2020.03.14.13.39.36 ad src/sys/fs/tmpfs/tmpfs_vnops.c,v 1.135
> 2020.03.14.13.50.46 ad src/sys/arch/x86/acpi/acpi_cpu_md.c,v 1.82
> 2020.03.14.13.53.26 ad src/sys/uvm/uvm_pdpolicy_clock.c,v 1.35
> 202
Re: Weird qemu-nvmm problem
On Wed, Mar 11, 2020 at 06:45:06PM +0100, Maxime Villard wrote: > Please CC me for issues related to NVMM, there is a number of lists where > I'm not subscribed. > > My understanding is that this commit is the cause (CC ad@): > > https://mail-index.netbsd.org/source-changes/2019/12/06/msg111617.html > > NVMM reschedules the thread when the SPCF_SHOULDYIELD flag is set. But > after this change the flag never gets set, so the rescheduling never > occurs, and NVMM is stuck with running the guest forever unless a signal > is caught in the emulator thread. > > The test program below shows the difference. On NetBSD-9 you have many > "resched", as expected. On NetBSD-current you have none. > > Andrew, can you have a look? There is a good dozen of places that use > SPCF_SHOULDYIELD for reschedulings, and they too may potentially be buggy > now. Yes that was incorrect and I have fixed it. I will see about adding a wrapper function for this too. Andrew > > Thanks, > Maxime > > > > --- > > /* > * # gcc -o test test.c -lnvmm > * # ./test > */ > > #include > #include > #include > #include > #include > #include > #include > #include > > int main() > { > uint8_t instr[] = { 0xEB, 0xFE }; > struct nvmm_machine mach; > struct nvmm_vcpu vcpu; > uintptr_t hva; > > nvmm_init(); > nvmm_machine_create(&mach); > nvmm_vcpu_create(&mach, 0, &vcpu); > > hva = (uintptr_t)mmap(NULL, 4096, PROT_READ|PROT_WRITE, > MAP_ANON|MAP_PRIVATE, -1, 0); > nvmm_hva_map(&mach, hva, 4096); > nvmm_gpa_map(&mach, hva, 0x, 4096, PROT_READ|PROT_EXEC); > > memcpy((void *)hva, instr, sizeof(instr)); > > nvmm_vcpu_getstate(&mach, &vcpu, NVMM_X64_STATE_GPRS); > vcpu.state->gprs[NVMM_X64_GPR_RIP] = 0; > nvmm_vcpu_setstate(&mach, &vcpu, NVMM_X64_STATE_GPRS); > > while (1) { > printf("looping\n"); > nvmm_vcpu_run(&mach, &vcpu); > > switch (vcpu.exit->reason) { > case NVMM_VCPU_EXIT_NONE: > printf("resched\n"); > break; > default: > printf("unknown: %d\n", vcpu.exit->reason); > break; > } > } > }
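For reference, a minimal sketch of the kind of wrapper mentioned above, assuming it simply tests SPCF_SHOULDYIELD on the current CPU; the name preempt_needed() is illustrative, not necessarily the committed interface:

static inline bool
preempt_needed(void)
{
	/* Has the scheduler flagged this CPU as wanting a yield? */
	return (curlwp->l_cpu->ci_schedstate.spc_flags &
	    SPCF_SHOULDYIELD) != 0;
}

Callers like NVMM's vcpu run loop could then test preempt_needed() rather than reading spc_flags directly, so a future change to how the scheduler requests a yield has only one place to update.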
Re: diagnostic assertion curcpu()->ci_biglock_count == 0 failed
Hi, On Sun, Feb 23, 2020 at 10:30:24AM +0100, Thomas Klausner wrote: > With a 9.99.47/amd64 kernel from February 16, I just had a panic (handcopied): > > panic: kernel diagnostic assertion "curcpu()->ci_biglock_count == 0" failed: > file .../kern_exit.c line 214: kernel_lock leaked > cpu12: Begin traceback > vpanic > kern_assert > exit1 > sys_exit > syscall The source of this problem is not going to be identified any time soon, so I have put back an unconditional release of kernel_lock there; the panic should no longer occur. In theory LOCKDEBUG should catch this, but it changes the timing so much that it may not happen. Andrew
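The shape of that mitigation, as a minimal sketch only (the message text and exact placement are illustrative, not the committed kern_exit.c change):

	/* In exit1(): tolerate a leaked hold rather than asserting. */
	if (__predict_false(curcpu()->ci_biglock_count != 0)) {
		printf("kernel_lock leaked on exit, releasing\n");
		KERNEL_UNLOCK_ALL(l, NULL);
	}

This trades the assertion for forward progress: the exiting LWP drops whatever holds were leaked onto it so teardown can continue, at the cost of hiding the culprit.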
Re: panic: softint screwup
Has anyone observed this again in the last couple of weeks? Assuming it's fixed now. Thanks, Andrew On Sun, Feb 09, 2020 at 05:05:14PM +0100, Thomas Klausner wrote: > I just had a panic in 9.99.46/amd64: > > Feb 9 16:27:54 yt savecore: reboot after panic: [ 14300.7407347] panic: > softint screwup > > The backtrace is > > Reading symbols from netbsd.9.99.46.gdb... > targ(gdb) target kvm netbsd.5.core > 0x80224315 in cpu_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at > /disk/6/archive/foreign/src/sys/arch/amd64/amd64/machdep.c:720 > 720 dumpsys(); > (gdb) bt > #0 0x80224315 in cpu_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at > /disk/6/archive/foreign/src/sys/arch/amd64/amd64/machdep.c:720 > #1 0x809f7909 in kern_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at > /disk/6/archive/foreign/src/sys/kern/kern_reboot.c:61 > #2 0x80a39179 in vpanic (fmt=fmt@entry=0x81420663 "softint > screwup", ap=ap@entry=0x8d090723afe8) > at /disk/6/archive/foreign/src/sys/kern/subr_prf.c:336 > #3 0x80a3923d in panic (fmt=fmt@entry=0x81420663 "softint > screwup") at /disk/6/archive/foreign/src/sys/kern/subr_prf.c:255 > #4 0x80a08a47 in softint_dispatch (pinned=0x8b731acf7480, s=5) > at /disk/6/archive/foreign/src/sys/kern/kern_softint.c:858 > #5 0x8021ea4f in Xsoftintr () > (gdb) > > I have a core file if more details are needed. > Thomas
Re: assertion "pve == NULL" failed
On Wed, Mar 04, 2020 at 11:41:33AM +, Patrick Welche wrote: > Netbooting (uefi) a -current/amd64 kernel from yesterday morning, > with a serial console, I just had > > [ 131.6333574] panic: kernel diagnostic assertion "pve == NULL" failed: file > "/ > [ 131.7633964] cpu8: Begin traceback... > [ 131.8134059] vpanic() at netbsd:vpanic+0x178 > [ 131.8634212] kern_assert() at netbsd:kern_assert+0x48 > [ 131.9234393] pmap_remove_pv() at netbsd:pmap_remove_pv+0x2e3 > [ 131.9834571] pmap_enter_ma() at netbsd:pmap_enter_ma+0x7aa > [ 132.0534782] pmap_enter_default() at netbsd:pmap_enter_default+0x29 > [ 132.1234992] uvm_fault_lower_enter() at netbsd:uvm_fault_lower_enter+0x111 > [ 132.2035233] uvm_fault_internal() at netbsd:uvm_fault_internal+0x12c2 > [ 132.2835474] trap() at netbsd:trap+0x6f4 > [ 132.3235594] --- trap (number 6) --- > [ 132.3635714] 7f7ef58009ee: > > when hitting return at > > /etc/rc.conf is not configured. Multiuser boot aborted. > Enter pathname of shell or RETURN for /bin/sh: > > (am in ddb if that helps - unfortunately I cvs updated the source in > the meantime) I wonder if this has the same root as .. https://syzkaller.appspot.com/bug?id=ada8326eeb74dd1bab60e8677adacd95afb04ce7 .. which has been around for a while, but due to small changes in the pmap it now goes down the path with the assertion more readily. I don't see the problem but will look in depth this weekend. Thank you, Andrew
Re: Panic on aarch64
On Tue, Mar 03, 2020 at 10:03:38PM +, Robert Swindells wrote: > > I just got this: > > panic: pr_phinpage_check: [vcachepl] item 0x54d19880 not part of pool > cpu0: Begin traceback... > trace fp ffc0405efc90 > fp ffc0405efcb0 vpanic() at ffc000240880 netbsd:vpanic+0x160 > fp ffc0405efd20 panic() at ffc000240974 netbsd:panic+0x44 > fp ffc0405efdb0 pool_cache_put_paddr() at ffc00023e858 > netbsd:pool_cache_put_paddr+0x110 > fp ffc0405efde0 vrelel() at ffc00028b058 netbsd:vrelel+0x208 What file systems do you have in use? Andrew > fp ffc0405efe20 vrecycle() at ffc00028b90c netbsd:vrecycle+0x6c > fp ffc0405efe50 vdrain_thread() at ffc00028bf54 > netbsd:vdrain_thread+0x2ac > address 0x100 is invalid > address 0xe8 is invalid > cpu0: End traceback... > > I didn't have it set to enter ddb on panic and don't have enough swap > on the eMMC disk to dump to. > > Have enabled DDB now.
Re: benchmark results on ryzen 3950x with netbsd-9, -current, and -current (no DIAGNOSTIC)
Hi. On Tue, Mar 03, 2020 at 08:25:25PM +1100, matthew green wrote: > here are a few build benchmark tests on an amd ryzen 3950x > system, to see the cumulative effect of the various fixes we've > seen since netbsd-9, for this 16 core/ 32 thread CPU, 64GB of > ram, separate nvme ssd for src & obj. Cool! Thank you very much for doing this. Really interesting to see these results. > below has a full summary, but the highlights: > > - building kernels into tmpfs is 10-12% faster > - DIAGNOSTIC costs 3-8% > - current's better CPU thread aware scheduler helps -j16 on 32 CPUs significantly (more than double benefit compared to the other tests.) > - "build.sh release" is about 10% faster > - kernel builds are similar, about 10% faster > - builds of ircII are 22% faster, though configure only 11% > - -j32 is faster than -j16 or -j24 > - -j40 is not much worse than -j32, occasionally faster > > and the one lowlight: > > - "du -mcs *" on a src tree already in ram has lost about 30% performance, though still under 1 second. OK that's intriguing and something that can hopefully be diagnosed with a bit of dtrace. I'm not aware of a reason for it.
> time for amd64 GENERIC kernel builds:
>
>                  -j16     -j24     -j32     -j40
> netbsd-9         2m26.56  1m55:30  1m43:46  1m43:82
> current (DIAG)   2m01.25  1m46.84  1m40.22  1m41.12
> current          1m54.56  1m39.57  1m33.09  1m34.06
Another perspective and a couple of observations: this is from my dual socket system with 24 cores / 48 threads total, running -current and building GENERIC on a SATA disk. This is with -j48.
132.53s real  1501.10s user  3291.25s system   nov 2019
 86.29s real  1537.95s user   786.29s system   mar 2020
 79.16s real  1602.97s user   419.54s system   mar 2020 !DIAGNOSTIC
I agree with Greg, the picture with DIAGNOSTIC isn't good. I think one of the culprits may be in the x86 pmap where it regularly scans all CPUs checking if a pmap is in use - that should probably be DEBUG. Beyond that, I don't have good ideas (other than investigation warranted). In the case of the dual socket system, the difference is pronounced and my take on it is that contention stresses the interconnect, and backpressure is then exerted on every CPU in the system, not just those CPUs actively contending with others. With a single socket system that kind of fight stays on chip in the cache. Cheers, Andrew
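To illustrate the DIAGNOSTIC-versus-DEBUG point, the pattern would be to gate the cross-CPU scan under DEBUG so that DIAGNOSTIC kernels skip it; the assertion body here is hypothetical, only the gating is the point:

#ifdef DEBUG
	CPU_INFO_ITERATOR cii;
	struct cpu_info *ci;

	/* Expensive: touches every CPU, so keep it out of DIAGNOSTIC. */
	for (CPU_INFO_FOREACH(cii, ci)) {
		KASSERT(ci->ci_pmap != pmap);
	}
#endif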
Re: Regressions
On Sun, Mar 01, 2020 at 03:26:12PM +0200, Andreas Gustafsson wrote: > NetBSD-current is again suffering from a number of regressions. The > last time the ATF tests showed zero unexpected failures on real amd64 > hardware was on Dec 12, and the sparc, sparc64, pmax, and hpcmips > tests have all been unable to run to completion for more than a month. > > Here are the PRs for some of the issues: > > 50350 rump/rumpkern/t_sp/stress_{long,short} fail on Core 2 Quad > 55032 rump/rumpkern/t_vm:uvmwait test case now fails > 55018 atf tests for pppoe sometimes leave rump_server processes around Rump is very fragile. There's no real bug in the system captured here as far as I am aware. Have already spent a lot of time on these and related - see pthread changes etc which somewhat improved the picture. Will look into them again when finished addressing build performance which should be within the month. > 54845 sparc panics in sleepq_remove Spent a couple of days so far this year fixing sparc's maladies. Will cycle back when I have free time (as above). > 54810 sparc64 pool_redzone_check errors during install > 54923 pmax test runs fail to complete since Jan 15 > 55020 dbregs_dr?_dont_inherit_lwp test cases fail on real hardware I've not personally looked into these yet. Andrew
Re: Failures in x86 pmap
On Mon, Feb 24, 2020 at 01:22:15PM +, Patrick Welche wrote: > On Sun, Feb 23, 2020 at 06:59:50PM +0000, Andrew Doran wrote: > > I think I found the problem, which has existed since ~8PM GMT yesterday. > > Hopefully fixed by revision 1.17 of src/sys/arch/x86/x86/x86_tlb.c. > > With src from Mon Feb 24 13:18:07 GMT 2020 (so with v 1.17 of > x86_tlb.c), I now see: > > (gdb) x/s panicstr > 0x819535c0 : "kernel diagnostic assertion > \"uvm_page_owner_locked_p(pg, true)\" failed: file > \"/usr/src/sys/arch/x86/x86/pmap.c\", line 4041 " > > whenever logging into xdm with an nfs mounted /home. > > Booting with a kernel from last Monday gets me back to a working amd64 system. > (Not sure if related, but see the magic characters "pmap.c") I missed a few changes. Just made a bunch of commits and nfs survives a short fsx run for me so I'd say it should be fixed now. Thank you, Andrew > Cheers, > > Patrick > > (gdb) bt > #0 0x80224225 in cpu_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at /usr/src/sys/arch/amd64/amd64/machdep.c:720 > #1 0x809c888f in kern_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:73 > #2 0x80a0ac79 in vpanic ( > fmt=0x8135b650 "kernel %sassertion \"%s\" failed: file \"%s\", > line %d ", ap=ap@entry=0xaa814dffcad8) at /usr/src/sys/kern/subr_prf.c:336 > #3 0x80e51f13 in kern_assert ( > fmt=fmt@entry=0x8135b650 "kernel %sassertion \"%s\" failed: file > \"%s\", line %d ") at /usr/src/sys/lib/libkern/kern_assert.c:51 > #4 0x80251869 in pmap_clear_attrs (pg=0xaa80002dac50, > clearbits=clearbits@entry=4) at /usr/src/sys/arch/x86/x86/pmap.c:4046 > #5 0x808fd094 in pmap_page_protect (prot=1, pg=) > at ./x86/pmap.h:464 > #6 nfs_gop_write (vp=0xd6274866adc0, pgs=0xaa814dffcc48, npages=1, > flags=19) at /usr/src/sys/nfs/nfs_node.c:278 > #7 0x80a80c02 in genfs_do_putpages (vp=0xd6274866adc0, > startoff=0, endoff=9223372036854771712, origflags=19, busypg=0x0) > at /usr/src/sys/miscfs/genfs/genfs_io.c:1303 > #8 0x80a7cf40 in VOP_PUTPAGES (vp=vp@entry=0xd6274866adc0, > offlo=offlo@entry=0, offhi=offhi@entry=0, flags=flags@entry=19) > at /usr/src/sys/kern/vnode_if.c:1632 > #9 0x8092da43 in nfs_flush (vp=0xd6274866adc0, > cred=, waitfor=waitfor@entry=1, l=, > commit=commit@entry=0) at /usr/src/sys/nfs/nfs_vnops.c:3166 > #10 0x8092dadc in nfs_close (v=) at ./machine/cpu.h:72 > #11 0x80a7b4c0 in VOP_CLOSE (vp=vp@entry=0xd6274866adc0, > fflag=fflag@entry=2, cred=cred@entry=0xd62746737dc0) > at /usr/src/sys/kern/vnode_if.c:332 > #12 0x80a72d99 in vn_close (vp=0xd6274866adc0, flags=2, > cred=0xd62746737dc0) at /usr/src/sys/kern/vfs_vnops.c:396 > #13 0x809a23dd in closef (fp=fp@entry=0xd6274539db40) > at /usr/src/sys/kern/kern_descrip.c:832 > #14 0x809a26f0 in fd_close (fd=fd@entry=3) > at /usr/src/sys/kern/kern_descrip.c:715 > #15 0x80a17adb in sys_close (l=0xd62733a7c600, > uap=, retval=) > at /usr/src/sys/kern/sys_descrip.c:513 > #16 0x80255cb9 in sy_call (rval=0xaa814dffcfb0, > uap=0xaa814dffd000, l=0xd62733a7c600, > sy=0x8185b870 ) at /usr/src/sys/sys/syscallvar.h:65 > #17 sy_invoke (code=6, rval=0xaa814dffcfb0, uap=0xaa814dffd000, > l=0xd62733a7c600, sy=0x8185b870 ) > at /usr/src/sys/sys/syscallvar.h:94 > #18 syscall (frame=0xaa814dffd000) > at /usr/src/sys/arch/x86/x86/syscall.c:138 > #19 0x802096ad in handle_syscall ()
Re: Failures in x86 pmap
I think I found the problem, which has existed since ~8PM GMT yesterday. Hopefully fixed by revision 1.17 of src/sys/arch/x86/x86/x86_tlb.c. Andrew On Sun, Feb 23, 2020 at 06:29:38PM +, Andrew Doran wrote: > Having gotten a report of this privately I've now started running into it. > Has anyone else seen this, and if so any idea when it started happening? > I wonder if there is a memory corruption or TLB coherency issue. > > Andrew > > > > hanging here: > > db{0}> bt > pmap_pp_remove() at netbsd:pmap_pp_remove+0x46c > uvm_anon_dispose() at netbsd:uvm_anon_dispose+0xd0 > uvm_anon_freelst() at netbsd:uvm_anon_freelst+0x3d > amap_wipeout() at netbsd:amap_wipeout+0x101 > uvm_unmap_detach() at netbsd:uvm_unmap_detach+0x52 > > panics: > > [ 107.1489360] panic: kernel diagnostic assertion "pve != NULL" failed: file > "../../../../arch/x86/x86/pmap.c", line 1918 > [ 107.1589403] cpu5: Begin traceback... > [ 107.1589403] vpanic() at netbsd:vpanic+0x178 > [ 107.1689455] kern_assert() at netbsd:kern_assert+0x48 > [ 107.1689455] pmap_enter_pv() at netbsd:pmap_enter_pv+0x295 > [ 107.1789498] pmap_enter_ma() at netbsd:pmap_enter_ma+0x3dd > [ 107.1889536] pmap_enter_default() at netbsd:pmap_enter_default+0x29 > [ 107.1889536] uvm_fault_upper_enter.isra.6() at > netbsd:uvm_fault_upper_enter.isra.6+0xcc > [ 107.1989589] uvm_fault_internal() at netbsd:uvm_fault_internal+0xb2a > [ 107.2089635] trap() at netbsd:trap+0x6f4 > [ 107.2089635] --- trap (number 6) --- > [ 107.2189676] copyout() at netbsd:copyout+0x33 > [ 107.2189676] elf64_copyargs() at netbsd:elf64_copyargs+0x1b > [ 107.2289730] execve_runproc() at netbsd:execve_runproc+0x4bb > [ 107.2289730] execve1() at netbsd:execve1+0x4e > [ 107.2389768] sys_execve() at netbsd:sys_execve+0x2a > [ 107.2389768] syscall() at netbsd:syscall+0x299 > [ 107.2489822] --- syscall (number 59) --- > [ 107.2489822] 71c9fa842dfa: > [ 107.2489822] cpu5: End traceback... > [ 107.2589851] fatal breakpoint trap in supervisor mode > [ 107.2589851] trap type 1 code 0 rip 0x8021f55d cs 0x8 rflags 0x202 > cr2 0x7f7fff28a1f0 ilevel 0 rsp 0xc3891bdf35b0 > [ 107.2689905] curlwp 0xb724ccc90640 pid 6429.1 lowest kstack > 0xc3891bdf02c0 > Stopped in pid 6429.1 (ls) at netbsd:breakpoint+0x5: leave
Failures in x86 pmap
Having gotten a report of this privately I've now started running into it. Has anyone else seen this, and if so any idea when it started happening? I wonder if there is a memory corruption or TLB coherency issue. Andrew hanging here: db{0}> bt pmap_pp_remove() at netbsd:pmap_pp_remove+0x46c uvm_anon_dispose() at netbsd:uvm_anon_dispose+0xd0 uvm_anon_freelst() at netbsd:uvm_anon_freelst+0x3d amap_wipeout() at netbsd:amap_wipeout+0x101 uvm_unmap_detach() at netbsd:uvm_unmap_detach+0x52 panics: [ 107.1489360] panic: kernel diagnostic assertion "pve != NULL" failed: file "../../../../arch/x86/x86/pmap.c", line 1918 [ 107.1589403] cpu5: Begin traceback... [ 107.1589403] vpanic() at netbsd:vpanic+0x178 [ 107.1689455] kern_assert() at netbsd:kern_assert+0x48 [ 107.1689455] pmap_enter_pv() at netbsd:pmap_enter_pv+0x295 [ 107.1789498] pmap_enter_ma() at netbsd:pmap_enter_ma+0x3dd [ 107.1889536] pmap_enter_default() at netbsd:pmap_enter_default+0x29 [ 107.1889536] uvm_fault_upper_enter.isra.6() at netbsd:uvm_fault_upper_enter.isra.6+0xcc [ 107.1989589] uvm_fault_internal() at netbsd:uvm_fault_internal+0xb2a [ 107.2089635] trap() at netbsd:trap+0x6f4 [ 107.2089635] --- trap (number 6) --- [ 107.2189676] copyout() at netbsd:copyout+0x33 [ 107.2189676] elf64_copyargs() at netbsd:elf64_copyargs+0x1b [ 107.2289730] execve_runproc() at netbsd:execve_runproc+0x4bb [ 107.2289730] execve1() at netbsd:execve1+0x4e [ 107.2389768] sys_execve() at netbsd:sys_execve+0x2a [ 107.2389768] syscall() at netbsd:syscall+0x299 [ 107.2489822] --- syscall (number 59) --- [ 107.2489822] 71c9fa842dfa: [ 107.2489822] cpu5: End traceback... [ 107.2589851] fatal breakpoint trap in supervisor mode [ 107.2589851] trap type 1 code 0 rip 0x8021f55d cs 0x8 rflags 0x202 cr2 0x7f7fff28a1f0 ilevel 0 rsp 0xc3891bdf35b0 [ 107.2689905] curlwp 0xb724ccc90640 pid 6429.1 lowest kstack 0xc3891bdf02c0 Stopped in pid 6429.1 (ls) at netbsd:breakpoint+0x5: leave
Re: diagnostic assertion curcpu()->ci_biglock_count == 0 failed
Hi Thomas. On Sun, Feb 23, 2020 at 10:30:24AM +0100, Thomas Klausner wrote: > With a 9.99.47/amd64 kernel from February 16, I just had a panic (handcopied): > > panic: kernel diagnostic assertion "curcpu()->ci_biglock_count == 0" failed: > file .../kern_exit.c line 214: kernel_lock leaked > cpu12: Begin traceback > vpanic > kern_assert > exit1 > sys_exit > syscall > -- syscall (number 1) -- > 70497d18609a > cpu12: End traceback... > Mutex error: mutex_vector_enter,542: locking against myself > lock address: 0xa5ce9b40 > current cpu: 12 > current lwp: 0xa5cbee57da80 > owner field: 0xa5cbee57da80 wait/spin: 0/0 > > Skipping crash dump on recursive panic > > > It might have been a low-memory situation -- it was running a bulk > build and had some network activity. kardel@ has been running into this one too during bulk builds. Something is leaking a hold on kernel_lock; it isn't the LWP calling exit1() that takes the extra hold, but that is where the leak gets trapped. I will think about some additional diagnostics that don't require LOCKDEBUG. Andrew
Re: 9.99.47 panic: diagnostic assertion "lwp_locked(l, spc->spc_mutex)" failed: file ".../kern_synch.c", line 1001
Hi, On Sun, Feb 16, 2020 at 12:27:45PM +0100, Thomas Klausner wrote: > I just updated -current and quite soon had a panic: > cpu1: Begin traceback... > vpanic() > kern_assert > sched_lendpri > turnstile_block > rw_vector_enter > genfs_lock > layer_bypass > VOP_LOCK > vn_lock > layerfs_root > VFS_ROOT > lookup_once > namei_tryemulroot > namei > vn_open > do_open > do_sys_openat > sys_open > syscall Sorry about that - a race condition introduced yesterday - should be fixed with revision 1.340 of src/sys/kern/kern_synch.c. Andrew
Re: Automated report: NetBSD-current/i386 test failure
On Wed, Jan 29, 2020 at 02:45:22AM +, NetBSD Test Fixture wrote: > The newly failing test case is: > > lib/libpthread/t_detach:pthread_detach ... > 2020.01.27.20.50.05 ad src/lib/libpthread/pthread.c,v 1.157 Wrong error code from the kernel (ESRCH vs EINVAL) worked around with pthread.c 1.158 and will be fixed properly soon. Andrew
Re: 9.99.40: panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed
On Sun, Jan 26, 2020 at 09:09:47PM +, Andrew Doran wrote: > Hi Frank, > > On Sun, Jan 26, 2020 at 09:00:51PM +0100, Frank Kardel wrote: > > > While bulk building pkgsrc with 9.99.42 from Jan 25t I see > > > > panic:kernel diagnostic assertion "curcpu()->ci_biglockcount == 0" failed: > > ..kern_exit.c, line 209 kernel lock leaked > > > > That happens every couple of thousand packages - sorry no dump (locking > > against myself as expected). > > Thanks for letting me know. I will make a change tomorrow to mitigate the > panic, and allow the badly behaved code to be identified. Done. It won't panic any more but should print out a message instead with the name/address of the routine that last acquired kernel lock. If you get some of those messages I'd be very interested to see them. Thank you, Andrew > > Andrew > > > > > Frank > > > > > > > > On 01/22/20 17:02, Andrew Doran wrote: > > > On Tue, Jan 21, 2020 at 07:59:35PM +, Andrew Doran wrote: > > > > > > > Hi Thomas, > > > > > > > > On Tue, Jan 21, 2020 at 08:47:44PM +0100, Thomas Klausner wrote: > > > > > > > > > During a bulk build (in rust AFAICT), I got a panic with > > > > > panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" > > > > > failed: file "/usr/src/sys/sys/userret.h", line 88 > > > > > > > > > > That's this one: > > > > > > > > > > static __inline void > > > > > mi_userret(struct lwp *l) > > > > > { > > > > > struct cpu_info *ci; > > > > > > > > > > KPREEMPT_DISABLE(l); > > > > > ci = l->l_cpu; > > > > > KASSERT(l->l_blcnt == 0); > > > > > KASSERT(ci->ci_biglock_count == 0); > > > > > > > > > > > > > > > > > > > > The backtrace in the crash dump is not very helpful: > > > > > > > > > > (gdb) bt > > > > > #0 0x80224315 in cpu_reboot (howto=howto@entry=260, > > > > > bootstr=bootstr@entry=0x0) at > > > > > /usr/src/sys/arch/amd64/amd64/machdep.c:720 > > > > > #1 0x809f5ec3 in kern_reboot (howto=howto@entry=260, > > > > > bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:61 > > > > > #2 0x80a37109 in vpanic (fmt=0x8135e980 "kernel > > > > > %sassertion \"%s\" failed: file \"%s\", line %d ", > > > > > ap=ap@entry=0xad0928973f48) > > > > > at /usr/src/sys/kern/subr_prf.c:336 > > > > > #3 0x80e7b0b3 in kern_assert > > > > > (fmt=fmt@entry=0x8135e980 "kernel %sassertion \"%s\" failed: > > > > > file \"%s\", line %d ") > > > > > at /usr/src/sys/lib/libkern/kern_assert.c:51 > > > > > #4 0x802568ce in mi_userret (l=0xcfc320ca9c00) at > > > > > /usr/src/sys/sys/userret.h:91 > > > > > #5 userret (l=0xcfc320ca9c00) at ./machine/userret.h:81 > > > > > #6 syscall (frame=) at > > > > > /usr/src/sys/arch/x86/x86/syscall.c:166 > > > > > #7 0x802096ad in handle_syscall () > > > > hannken@ supplied me with a repro for this one so I'm going to look > > > > into it > > > > tomorrow morning. syzbot has also run into it recently. > > > This should be fixed now, with the following revisions: > > > > > > cvs rdiff -u -r1.165 -r1.166 src/sys/kern/kern_lock.c > > > cvs rdiff -u -r1.336 -r1.337 src/sys/kern/kern_synch.c > > > > > > Cheers, > > > Andrew > >
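A sketch of the diagnostic idea, with a hypothetical per-CPU field (ci_biglock_caller); the point is only to record enough at acquire time to attribute a leak without LOCKDEBUG:

	/* In _kernel_lock(), once the lock is taken: */
	ci->ci_biglock_caller = __builtin_return_address(0);

	/* At the exit1() check, report instead of panicking: */
	if (ci->ci_biglock_count != 0) {
		printf("kernel_lock leaked, last acquired from %p\n",
		    ci->ci_biglock_caller);
		KERNEL_UNLOCK_ALL(l, NULL);
	}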
Re: 9.99.40: panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed
Hi Frank, On Sun, Jan 26, 2020 at 09:00:51PM +0100, Frank Kardel wrote: > While bulk building pkgsrc with 9.99.42 from Jan 25t I see > > panic:kernel diagnostic assertion "curcpu()->ci_biglockcount == 0" failed: > ..kern_exit.c, line 209 kernel lock leaked > > That happens every couple of thousand packages - sorry no dump (locking > against myself as expected). Thanks for letting me know. I will make a change tomorrow to mitigate the panic, and allow the badly behaved code to be identified. Andrew > > Frank > > > > On 01/22/20 17:02, Andrew Doran wrote: > > On Tue, Jan 21, 2020 at 07:59:35PM +, Andrew Doran wrote: > > > > > Hi Thomas, > > > > > > On Tue, Jan 21, 2020 at 08:47:44PM +0100, Thomas Klausner wrote: > > > > > > > During a bulk build (in rust AFAICT), I got a panic with > > > > panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed: > > > > file "/usr/src/sys/sys/userret.h", line 88 > > > > > > > > That's this one: > > > > > > > > static __inline void > > > > mi_userret(struct lwp *l) > > > > { > > > > struct cpu_info *ci; > > > > > > > > KPREEMPT_DISABLE(l); > > > > ci = l->l_cpu; > > > > KASSERT(l->l_blcnt == 0); > > > > KASSERT(ci->ci_biglock_count == 0); > > > > > > > > > > > > > > > > The backtrace in the crash dump is not very helpful: > > > > > > > > (gdb) bt > > > > #0 0x80224315 in cpu_reboot (howto=howto@entry=260, > > > > bootstr=bootstr@entry=0x0) at > > > > /usr/src/sys/arch/amd64/amd64/machdep.c:720 > > > > #1 0x809f5ec3 in kern_reboot (howto=howto@entry=260, > > > > bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:61 > > > > #2 0x80a37109 in vpanic (fmt=0x8135e980 "kernel > > > > %sassertion \"%s\" failed: file \"%s\", line %d ", > > > > ap=ap@entry=0xad0928973f48) > > > > at /usr/src/sys/kern/subr_prf.c:336 > > > > #3 0x80e7b0b3 in kern_assert (fmt=fmt@entry=0x8135e980 > > > > "kernel %sassertion \"%s\" failed: file \"%s\", line %d ") > > > > at /usr/src/sys/lib/libkern/kern_assert.c:51 > > > > #4 0x802568ce in mi_userret (l=0xcfc320ca9c00) at > > > > /usr/src/sys/sys/userret.h:91 > > > > #5 userret (l=0xcfc320ca9c00) at ./machine/userret.h:81 > > > > #6 syscall (frame=) at > > > > /usr/src/sys/arch/x86/x86/syscall.c:166 > > > > #7 0x802096ad in handle_syscall () > > > hannken@ supplied me with a repro for this one so I'm going to look into > > > it > > > tomorrow morning. syzbot has also run into it recently. > > This should be fixed now, with the following revisions: > > > > cvs rdiff -u -r1.165 -r1.166 src/sys/kern/kern_lock.c > > cvs rdiff -u -r1.336 -r1.337 src/sys/kern/kern_synch.c > > > > Cheers, > > Andrew >
Re: assertion (pinned->l_flag & LW_RUNNING) != 0 failed
On Sat, Jan 25, 2020 at 09:13:49AM +, co...@sdf.org wrote: > -current is getting even less reliable at booting for me. > > I think the thing I'm doing unusual is having an Android phone connected > by USB. It's pretty chatty at boot compared to my other USB devices. > > It's hitting this assertion (in kern_softint.c:854). > > It seems to be new, but I haven't upgraded -current in a while. Do you get this message before it starts /sbin/init? If so it looks like we've had that bug for about ~13 years. Should be fixed with this: /cvsroot/src/sys/kern/kern_idle.c,v <-- kern_idle.c new revision: 1.31; previous revision: 1.30 Andrew
Re: 9.99.40: panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed
On Tue, Jan 21, 2020 at 07:59:35PM +, Andrew Doran wrote: > Hi Thomas, > > On Tue, Jan 21, 2020 at 08:47:44PM +0100, Thomas Klausner wrote: > > > During a bulk build (in rust AFAICT), I got a panic with > > panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed: file > > "/usr/src/sys/sys/userret.h", line 88 > > > > That's this one: > > > > static __inline void > > mi_userret(struct lwp *l) > > { > > struct cpu_info *ci; > > > > KPREEMPT_DISABLE(l); > > ci = l->l_cpu; > > KASSERT(l->l_blcnt == 0); > > KASSERT(ci->ci_biglock_count == 0); > > > > > > > > The backtrace in the crash dump is not very helpful: > > > > (gdb) bt > > #0 0x80224315 in cpu_reboot (howto=howto@entry=260, > > bootstr=bootstr@entry=0x0) at /usr/src/sys/arch/amd64/amd64/machdep.c:720 > > #1 0x809f5ec3 in kern_reboot (howto=howto@entry=260, > > bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:61 > > #2 0x80a37109 in vpanic (fmt=0x8135e980 "kernel > > %sassertion \"%s\" failed: file \"%s\", line %d ", > > ap=ap@entry=0xad0928973f48) > > at /usr/src/sys/kern/subr_prf.c:336 > > #3 0x80e7b0b3 in kern_assert (fmt=fmt@entry=0x8135e980 > > "kernel %sassertion \"%s\" failed: file \"%s\", line %d ") > > at /usr/src/sys/lib/libkern/kern_assert.c:51 > > #4 0x802568ce in mi_userret (l=0xcfc320ca9c00) at > > /usr/src/sys/sys/userret.h:91 > > #5 userret (l=0xcfc320ca9c00) at ./machine/userret.h:81 > > #6 syscall (frame=) at > > /usr/src/sys/arch/x86/x86/syscall.c:166 > > #7 0x802096ad in handle_syscall () > > hannken@ supplied me with a repro for this one so I'm going to look into it > tomorrow morning. syzbot has also run into it recently. This should be fixed now, with the following revisions: cvs rdiff -u -r1.165 -r1.166 src/sys/kern/kern_lock.c cvs rdiff -u -r1.336 -r1.337 src/sys/kern/kern_synch.c Cheers, Andrew
Re: 9.99.40: panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed
Hi Thomas, On Tue, Jan 21, 2020 at 08:47:44PM +0100, Thomas Klausner wrote: > During a bulk build (in rust AFAICT), I got a panic with > panic: kernel diagnostic assertion "ci->ci_biglock_count == 0" failed: file > "/usr/src/sys/sys/userret.h", line 88 > > That's this one: > > static __inline void > mi_userret(struct lwp *l) > { > struct cpu_info *ci; > > KPREEMPT_DISABLE(l); > ci = l->l_cpu; > KASSERT(l->l_blcnt == 0); > KASSERT(ci->ci_biglock_count == 0); > > > > The backtrace in the crash dump is not very helpful: > > (gdb) bt > #0 0x80224315 in cpu_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at /usr/src/sys/arch/amd64/amd64/machdep.c:720 > #1 0x809f5ec3 in kern_reboot (howto=howto@entry=260, > bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:61 > #2 0x80a37109 in vpanic (fmt=0x8135e980 "kernel %sassertion > \"%s\" failed: file \"%s\", line %d ", ap=ap@entry=0xad0928973f48) > at /usr/src/sys/kern/subr_prf.c:336 > #3 0x80e7b0b3 in kern_assert (fmt=fmt@entry=0x8135e980 > "kernel %sassertion \"%s\" failed: file \"%s\", line %d ") > at /usr/src/sys/lib/libkern/kern_assert.c:51 > #4 0x802568ce in mi_userret (l=0xcfc320ca9c00) at > /usr/src/sys/sys/userret.h:91 > #5 userret (l=0xcfc320ca9c00) at ./machine/userret.h:81 > #6 syscall (frame=) at /usr/src/sys/arch/x86/x86/syscall.c:166 > #7 0x802096ad in handle_syscall () hannken@ supplied me with a repro for this one so I'm going to look into it tomorrow morning. syzbot has also run into it recently. Andrew
Re: CVS commit: src/sys [freeze on boot]
Fix committed with sys/kern/kern_rwlock.c rev 1.62. I didn't see the problem as I am running with LOCKDEBUG. Apologies for the disruption. Andrew
Re: File corruption?
On Sun, Jan 19, 2020 at 12:21:06PM -0600, Robert Nestor wrote: > Thanks! I followed Andrew's instructions and got a photo of the stack > trace and sent it to him directly. Hope it helps him figure out what's > happening. Thanks for the photo. This is a problem in the DRM code. It was fixed a day or two ago, so if you update your kernel it won't happen any more. Andrew > -bob > > On Jan 19, 2020, at 11:29 AM, Greg Troxel wrote: > > > Robert Nestor writes: > > > >> Sorry for not being specific. When I do the shutdown on a subsequent > >> reboot all the filesystems are dirty forcing fsck to run. Sometimes > >> it finds some minor errors and repairs them. > > > > ok - I am trying to separate "corruption", which means that files that > > were not in the process of being written were damaged, from an unclean > > shutdown with the usual non-frightening fixups. > > > >> I'm running xfce4, so when I do the "shutdown -r now" I see xfce4 and > >> X exit bringing me back to the console display that was active when I > >> booted the system. As it goes thru the normal shutdown process it > >> reaches a point where I get the assertion error (something like > >> "uvm_page locked against owner") followed by a stack trace and then > >> quickly followed by the system rebooting. There is no crash file > >> generated. > > > > (Definitely follow ad@'s advice here.) > > > > You can of course exit xfce4 back to console before starting this. > > > >> I haven't changed any crash parameters from the stock setup. I seem > >> to recall there used to be one for kernel crashes, but can't find it > >> now. I guess next step is to boot up with the "-d" flag and see if I > >> can get something useful. Is that correct? > > > > See swapctl(8) and fstab(5). Basically you need to configure a dump > > device (almost always the swap device). swapctl -l is useful. > > > > But, it is likely that after sending ad@ a picture, you won't have to > > debug this any more. >
Re: File corruption?
Hi Robert, On Sun, Jan 19, 2020 at 10:42:37AM -0600, Robert Nestor wrote: > Sorry for not being specific. When I do the shutdown on a subsequent > reboot all the filesystems are dirty forcing fsck to run. Sometimes it > finds some minor errors and repairs them. > > I'm running xfce4, so when I do the "shutdown -r now" I see xfce4 and X > exit bringing me back to the console display that was active when I booted > the system. As it goes thru the normal shutdown process it reaches a > point where I get the assertion error (something like "uvm_page locked > against owner") followed by a stack trace and then quickly followed by the > system rebooting. There is no crash file generated. Could you please do a "sysctl -w ddb.onpanic=1" first, and then trigger the problem? It shouldn't leave the stack trace there and not reboot then. If you don't feel like typing it all in you could use a camera phone to take a picture of the backtrace or something like that. If you don't have somewhere to upload the image you could e-mail it to me directly. Andrew > I haven't changed any crash parameters from the stock setup. I seem to > recall there used to be one for kernel crashes, but can't find it now. I > guess next step is to boot up with the "-d" flag and see if I can get > something useful. Is that correct? > > -bob > > On Jan 19, 2020, at 9:52 AM, Greg Troxel wrote: > > > Robert Nestor writes: > > > >> I've downloaded and installed 9.99.38 (Jan 17 build) and the original > >> problem I was seeing with "git" is gone. However, I'm now seeing a > >> new problem with file corruption, but it only seems to happen when I > >> do a normal shutdown. If I do a "shutdown -r now" to shutdown and > > > > You say "corruption", but then you say "filesystems are dirty". Are you > > actually finding files with bad contents? > > > >> reboot the system I see a crash during the shutdown phase and on > >> subsequent reboot the filesystems are all dirty. There is an assertion > >> about uvm_page I think, but it quickly disappears on the reboot. > > > > Are you saying that if you "shutdown now", that the system shuts down > > with the crash? And that there are then files with bad contents? > > > > > >> Is there a log file someplace that is written on the shutdown or is > >> there an easy way for me to capture the traceback before it disappears > >> off my screen? There's no crash file produced. > > > > crash dumps may not be enabled. You could also enable ddb and do the > > shutdown not using X, and then get a trace there. >
Re: File corruption?
On Sun, Jan 19, 2020 at 05:17:25PM +, Andrew Doran wrote: > Hi Robert, > > On Sun, Jan 19, 2020 at 10:42:37AM -0600, Robert Nestor wrote: > > > Sorry for not being specific. When I do the shutdown on a subsequent > > reboot all the filesystems are dirty forcing fsck to run. Sometimes it > > finds some minor errors and repairs them. > > > > I'm running xfce4, so when I do the "shutdown -r now" I see xfce4 and X > > exit bringing me back to the console display that was active when I booted > > the system. As it goes thru the normal shutdown process it reaches a > > point where I get the assertion error (something like "uvm_page locked > > against owner") followed by a stack trace and then quickly followed by the > > system rebooting. There is no crash file generated. > > Could you please do a "sysctl -w ddb.onpanic=1" first, and then trigger the > problem? It shouldn't leave the stack trace there and not reboot then. If Argh. I meant to say it will leave the stack trace there and not reboot. > you don't feel like typing it all in you could use a camera phone to take a > picture of the backtrace or something like that. If you don't have > somewhere to upload the image you could e-mail it to me directly. > > Andrew > > > I haven't changed any crash parameters from the stock setup. I seem to > > recall there used to be one for kernel crashes, but can't find it now. I > > guess next step is to boot up with the "-d" flag and see if I can get > > something useful. Is that correct? > > > > -bob > > > > On Jan 19, 2020, at 9:52 AM, Greg Troxel wrote: > > > > > Robert Nestor writes: > > > > > >> I've downloaded and installed 9.99.38 (Jan 17 build) and the original > > >> problem I was seeing with "git" is gone. However, I'm now seeing a > > >> new problem with file corruption, but it only seems to happen when I > > >> do a normal shutdown. If I do a "shutdown -r now" to shutdown and > > > > > > You say "corruption", but then you say "filesystems are dirty". Are you > > > actually finding files with bad contents? > > > > > >> reboot the system I see a crash during the shutdown phase and on > > >> subsequent reboot the filesystems are all dirty. There is an assertion > > >> about uvm_page I think, but it quickly disappears on the reboot. > > > > > > Are you saying that if you "shutdown now", that the system shuts down > > > with the crash? And that there are then files with bad contents? > > > > > > > > >> Is there a log file someplace that is written on the shutdown or is > > >> there an easy way for me to capture the traceback before it disappears > > >> off my screen? There's no crash file produced. > > > > > > crash dumps may not be enabled. You could also enable ddb and do the > > > shutdown not using X, and then get a trace there. > >
Re: 9.99.38 panic - i915drm + EFI ?
On Fri, Jan 17, 2020 at 08:35:13PM +0100, Kamil Rytarowski wrote: > On 17.01.2020 20:29, Andrew Doran wrote: > > Hi, > > > > On Fri, Jan 17, 2020 at 07:58:52PM +0100, Kamil Rytarowski wrote: > > > >> My system with i915 survived with these changes applied (credit > >> riastradh@ for hints): > >> > >> https://www.netbsd.org/~kamil/patch-00215-i915-dirty-pages.txt > > > > That change looks good to me. I didn't put the mutex acquire in originally > > because I had no idea what context that stuff was operating in and didn't > > want to make it worse. Can you commit it? > > > > Thank you, > > Andrew > > > > I have triggered this panic without my patch: > > https://www.netbsd.org/~kamil/panic/i915.txt > > If you think that the patch is OK, please go for it. Committed - thank you very much. I changed one other file, also committed. Andrew
Re: 9.99.38 panic - i915drm + EFI ?
Hi, On Fri, Jan 17, 2020 at 07:58:52PM +0100, Kamil Rytarowski wrote: > My system with i915 survived with these changes applied (credit > riastradh@ for hints): > > https://www.netbsd.org/~kamil/patch-00215-i915-dirty-pages.txt That change looks good to me. I didn't put the mutex acquire in originally because I had no idea what context that stuff was operating in and didn't want to make it worse. Can you commit it? Thank you, Andrew
Re: 9.99.38 panic - i915drm + EFI ?
Hi, On Thu, Jan 16, 2020 at 06:14:40PM +, Chavdar Ivanov wrote: > Today's update brought 9.99.38, which fails to boot on my HP Envy 17 > laptop with Intel 530 graphics and NVidia GeForce; the latter not used. The > system uses EFI boot and the panic happens the moment it has to switch > the console in graphics mode. I managed to collect a dump; the trace is > as follows: > > > > #1 0x809fd663 in kern_reboot () > #2 0x807c1ec3 in db_reboot_cmd () > #3 0x807c26db in db_command () > #4 0x807c2a46 in db_command_loop () > #5 0x807c63ca in db_trap () > #6 0x80220af5 in kdb_trap () > #7 0x80225c66 in trap () > #8 0x8021ed43 in alltraps () > #9 0x8021f55d in breakpoint () > #10 0x80a3e380 in vpanic () > #11 0x80e85393 in kern_assert () > #12 0x809ba1ec in uvm_pagegetdirty () > #13 0x809ba218 in uvm_pagemarkdirty () > #14 0x80c0100e in intel_lr_context_deferred_alloc () > #15 0x80c01470 in logical_ring_init () > #16 0x80c016fd in intel_logical_rings_init () > #17 0x80b86359 in i915_gem_init () > #18 0x80b7b11d in i915_driver_load () > #19 0x8087f31b in drm_dev_register () > #20 0x806be10b in drm_pci_attach () > #21 0x80b757ab in i915drmkms_attach_real () > #22 0x80a20ebe in config_mountroot_thread () > #23 0x80209777 in lwp_trampoline () > #24 0x in ?? () I see the problem. DRM does not hold the appropriate locks. I don't think it needs to mark those pages dirty anyhow. I'll fix it tomorrow. Cheers, Andrew
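The shape of the fix being discussed, sketched on the assumption that the pages belong to a uvm_object whose vmobjlock must be held across dirty-state changes (which is what the uvm_pagemarkdirty() locking assertions check); the DRM context around it is elided:

	mutex_enter(obj->vmobjlock);
	uvm_pagemarkdirty(pg, UVM_PAGE_STATUS_DIRTY);
	mutex_exit(obj->vmobjlock);

As noted above, the simpler fix may be to not mark the pages dirty at all from this path.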
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 09:17:28PM +0100, Manuel Bouyer wrote: > On Mon, Jan 13, 2020 at 07:11:21PM +0000, Andrew Doran wrote: > > On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote: > > > > > On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote: > > > > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote: > > > > > > > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote: > > > > > > It also sets rsp and rbp. I think rbp is not set by anything else, > > > > > > at last > > > > > > in the Xen case. > > > > > > The different rbp value would explain why in one case we hit a > > > > > > KASSERT() > > > > > > in lwp_startup later. > > > > > > But I don't know what pcb_rbp contains; I couldn't find where the > > > > > > pcb for > > > > > > idlelwp is initialized. > > > > > > > > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto() > > > > > does. It doens't cause the lwp_startup() KASSERT as calling > > > > > cpu_switchto() > > > > > does; it also doesn't change the scheduler behavior. > > > > > > > > Wait - do you mean that everything works now? Or that everything still > > > > runs > > > > on CPU0? > > > > > > No, everything still runs on CPU0 > > > > Hmm, I don't understand why. I'll set up Xen and try it out. It might take > > me a day or two. > > OK thanks. I reproduced it on native x86. It's a bug in the CPU topology code. Now fixed with revision 1.11 src/sys/kern/subr_cpu.c - sorry about that. > OK, so I removed the call to cpu_switchto() before idle_loop(), > and added a few KASSERTS. > I guess you can back out the prev == NULL case from cpu_switchto(). Will do. Thank you Manuel. Andrew
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote: > On Mon, Jan 13, 2020 at 06:33:08PM +0000, Andrew Doran wrote: > > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote: > > > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote: > > > > It also sets rsp and rbp. I think rbp is not set by anything else, at > > > > least > > > > in the Xen case. > > > > The different rbp value would explain why in one case we hit a KASSERT() > > > > in lwp_startup later. > > > > But I don't know what pcb_rbp contains; I couldn't find where the pcb > > > > for > > > > idlelwp is initialized. > > > > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto() > > > does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto() > > > does; it also doesn't change the scheduler behavior. > > > > Wait - do you mean that everything works now? Or that everything still runs > > on CPU0? > > No, everything still runs on CPU0 Hmm, I don't understand why. I'll set up Xen and try it out. It might take me a day or two. > > The very first thing that idle_loop() does on amd64/i386 is set up the frame > > pointer - ebp/rbp. > > > > <idle_loop>: > > 0: 55 push %rbp > > 1: 48 89 e5 mov %rsp,%rbp > > 4: 41 56 push %r14 > > 6: 41 55 push %r13 > > OK, so it's OK that my patch doesn't change anything. > And so I still don't understand the KASSERT when cpu_switchto() is called > before idle_loop(). The assertion in lwp_startup() is because I made MI changes so that prevlwp is never NULL when calling cpu_switchto(), while fixing some problems with MP support on !x86 and making things more correct. lwp_startup()/mi_switch() now need to unlock prevlwp after it is finished in cpu_switchto(). I never expected anybody but mi_switch() to call cpu_switchto(). Andrew
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote: > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote: > > It also sets rsp and rbp. I think rbp is not set by anything else, at least > > in the Xen case. > > The different rbp value would explain why in one case we hit a KASSERT() > > in lwp_startup later. > > But I don't know what pcb_rbp contains; I couldn't find where the pcb for > > idlelwp is initialized. > > I tried the attached patch, which should set rsp/rbp as cpu_switchto() > does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto() > does; it also doesn't change the scheduler behavior. Wait - do you mean that everything works now? Or that everything still runs on CPU0? The very first thing that idle_loop() does on amd64/i386 is set up the frame pointer - ebp/rbp. <idle_loop>: 0: 55 push %rbp 1: 48 89 e5 mov %rsp,%rbp 4: 41 56 push %r14 6: 41 55 push %r13 Andrew
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 03:16:05PM +0100, Manuel Bouyer wrote: > On Mon, Jan 13, 2020 at 12:02:13PM +0000, Andrew Doran wrote: > > Ah yes it does, I saw something that made me think it affected x86_64 only. > > I'll make the change on i386 too. > > thanks. > > Now I get a different panic: > [ 1.000] vcpu0 at hypervisor0 > [ 1.000] vcpu0: 64 page colors > [ 1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz, id > 0x6fb > [ 1.000] vcpu0: node 0, package 0, core 1, smt 0 > [ 1.000] vcpu1 at hypervisor0 > [ 1.000] vcpu1: 2 page colors > [ 1.000] vcpu1: starting > [ 1.000] vcpu1: is started. > [ 1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz, id > 0x6fb > [ 1.000] vcpu1: node 0, package 0, core 0, smt 0 > [...] > [ 1.030] UVM: using package allocation scheme, 1 package(s) per bucket > [ 1.030] Xen vcpu1 clock: using event channel 7 > [ 1.8809493] vcpu1: running > [ 1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021 > [ 1.8809493] cpu1: Begin traceback... > [ 1.8809493] vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0) at netbsd:vpanic+0x134 > [ 1.8809493] kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at netbsd:kern_assert+0x23 > [ 1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at netbsd:lwp_startup+0x155 > [ 1.8809493] cpu1: End traceback... > > If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems > that all user processes are running on cpu0 only ... I looked and the only thing cpu_switchto() is doing there is setting curlwp, but that's already set in cpu_start_secondary(), so it's not needed. > I can't see what extra work the cpu_switchto() could be doing that would > matter, except maybe the %ebp/rbp init. Any idea ? Right, I don't think cpu_switchto() matters there. The strategy for assigning LWPs to CPUs in the scheduler has changed. If the machine is not busy everything is likely to stay on CPU0. Are you putting much load on it? Andrew
Re: Xen MP panics in cpu_switchto()
On Mon, Jan 13, 2020 at 12:56:22PM +0100, Manuel Bouyer wrote: > On Mon, Jan 13, 2020 at 11:42:17AM +0000, Andrew Doran wrote: > > Hi Manuel, > > > > On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote: > > > Hello, > > > A current Xen domU kernel fails to boot with: > > > [ 1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1 > > > [ 1.000] vcpu0 at hypervisor0 > > > [ 1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64 > > > [ 1.000] vcpu0: node 0, package 0, core 1, smt 1 > > > [ 1.000] vcpu1 at hypervisor0 > > > [ 1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64 > > > [ 1.000] vcpu1: node 0, package 1, core 0, smt 0 > > > [ 1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface > > > [ 1.000] xencons0 at hypervisor0: Xen Virtual Console Driver > > > [ 1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e > > > [ 1.9901295] fatal page fault in supervisor mode > > > [ 1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags > > > 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88 > > > [ 1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack > > > 0xb7802b1992c0 > > > kernel: page fault trap, code=0 > > > Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf:movq > > > 28(%r13),%rax > > > cpu_switchto() at netbsd:cpu_switchto+0xf > > > > > > both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is > > > missing in initialisations of secondary CPUs. > > > This happens with the 202001101800Z but the problem is probably older than > > > that (the testbed used vcpus=1 until today) > > > > > > Any idea ? > > > > It should work now with revision 1.199 of > > src/sys/arch/amd64/amd64/locore.S. > > The same problem happens with i386. Ah yes it does, I saw something that made me think it affected x86_64 only. I'll make the change on i386 too. Andrew
Re: Xen MP panics in cpu_switchto()
Hi Manuel, On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote: > Hello, > A current Xen domU kernel fails to boot with: > [ 1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1 > [ 1.000] vcpu0 at hypervisor0 > [ 1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64 > [ 1.000] vcpu0: node 0, package 0, core 1, smt 1 > [ 1.000] vcpu1 at hypervisor0 > [ 1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64 > [ 1.000] vcpu1: node 0, package 1, core 0, smt 0 > [ 1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface > [ 1.000] xencons0 at hypervisor0: Xen Virtual Console Driver > [ 1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e > [ 1.9901295] fatal page fault in supervisor mode > [ 1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags > 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88 > [ 1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack > 0xb7802b1992c0 > kernel: page fault trap, code=0 > Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf:movq > 28(%r13),%rax > cpu_switchto() at netbsd:cpu_switchto+0xf > > both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is > missing in initialisations of secondary CPUs. > This happens with the 202001101800Z but the problem is probably older than > that (the testbed used vcpus=1 until today) > > Any idea ? It should work now with revision 1.199 of src/sys/arch/amd64/amd64/locore.S. Nothing else in tree calls cpu_switchto() with prevlwp=NULL any more. Can Xen's cpu_hatch() call idle_loop() directly? Thank you, Andrew
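A sketch of what entering the idle loop directly from the hatch path could look like, assuming curlwp has already been pointed at the idle lwp by cpu_start_secondary(); the body is illustrative, not the actual Xen code:

void
cpu_hatch(void *v)
{
	struct cpu_info *ci = v;

	/* ... MD initialisation of the freshly started CPU ... */

	/* curlwp was set to the idle lwp in cpu_start_secondary(). */
	KASSERT(ci->ci_curlwp == ci->ci_data.cpu_idlelwp);
	idle_loop(NULL);
	/* NOTREACHED */
}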
Re: pmap lock changes: Xen panic
Manuel, On Tue, Jan 07, 2020 at 10:39:33AM +0100, Manuel Bouyer wrote: > Hello, > with 2020-01-05 00:40 UTC sources, Xen domUs panics because of what looks like > locking changes in the pmap code (full log at > http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/): > mlock error: Mutex: mutex_vector_enter,506: assertion failed: !cpu_intr_p(): > lock 0xc0b46300 cpu 0 lwp 0xc1aac100 > [ 39.3701414] cpu0: Begin traceback... > [ 39.3701414] > vpanic(c05b72e8,d7c7c728,d7c7c744,c041c699,c05b72e8,c05b04e0,c05746b4,1fa,c05b0454,c0b46300) > at netbsd:vpanic+0x134 > [ 39.3701414] > panic(c05b72e8,c05b04e0,c05746b4,1fa,c05b0454,c0b46300,0,c1aac100,d7c7c768,c03d2ec9) > at netbsd:panic+0x18 > [ 39.3701414] > lockdebug_abort(c05746b4,1fa,c0b46300,c0b414c8,c05b0454,c0b46240,c0b46300,d7c7c794,c03d3405,c05b0454) > at netbsd:lockdebug_abort+0xc9 > [ 39.3701414] > mutex_abort(c05b0454,0,d39e7004,79339,c0b37340,c1aac100,c0b46240,c0b46300,c1a5e1e8,d7c7c7dc) > at netbsd:mutex_abort+0x39 > [ 39.3701414] > mutex_vector_enter(c0b46300,d7c7c7b8,d7c7c7b4,c03cc135,86908857,d7c7a024,c0b37340,c1aac100,d7c7c7d4,c011924f) > at netbsd:mutex_vector_enter+0x355 > [ 39.3701414] > pmap_extract_ma(c0b46240,d6db3000,d7c7c820,0,c1733508,6,d7e7e000,c1a5e1e0,6,c1a5e008) > at netbsd:pmap_extract_ma+0x1a > [ 39.3701414] > xbd_diskstart(c189e908,c2672e2c,1c0,d7c7c884,c011d25e,c0b35880,c1a5e018,32b0,c1a5e008,) > at netbsd:xbd_diskstart+0x234 > [ 39.3701414] > dk_start(c1a5e008,0,c23fc4dc,23f1000,0,1,c2670558,c1a5e1e0,6,d7e7e000) at > netbsd:dk_start+0xea > [ 39.3701414] > xbd_handler(c1a5e008,6,d7c7c978,c18a6318,d7c7c924,c011cc99,c18a6318,d7c7c978,d7c7ca3c,c04a1a7d) > at netbsd:xbd_handler+0x12e > [ 39.3701414] > xen_intr_biglock_wrapper(c18a6318,d7c7c978,d7c7ca3c,c04a1a7d,c1cba008,23f,0,d7c7c9ec,d7c7ca10,c0c589b8) > at netbsd:xen_intr_biglock_wrapper+0x1f > [ 39.3701414] > evtchn_do_event(6,d7c7c978,c0e670fc,c0e62548,c0e68494,0,c0b37340,c0d9a000,c0b37340,0) > at netbsd:evtchn_do_event+0xf9 > [ 39.3701414] > do_hypervisor_callback(d7c7c978,0,11,31,11,c0b40011,0,0,d7c7ca44,bfec0d70) at > netbsd:do_hypervisor_callback+0x15f > [ 39.3701414] > Xhypervisor_pvhvm_callback(c0b46240,d81ae000,696f3000,1,1ef3000,0,1,21,7ff0,21) > at netbsd:Xhypervisor_pvhvm_callback+0x67 > > > In rev 1.35 of xen/x86/xen_pmap.c the pmap lock is taken unconditionally, > even in the pmap_kernel() case, which means pmap_extract_ma() can't be > used from interrrupt context any more. I don't think we can impose such > restrictions on pmap_kernel(); bus_dma(9) needs it. This was a mistake on my part. We should never need to lock pmap_kernel() for pmap_extract() since it only touches the PTEs. It should be fixed now with xen_pmap.c 1.37. Cheers, Andrew
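The fix amounts to taking the pmap's lock only for user pmaps, roughly as in this sketch (lock and field names follow the x86 pmap of the time, used here illustratively):

	/* pmap, va and pap are the pmap_extract_ma() arguments. */
	const bool locked = (pmap != pmap_kernel());

	if (locked)
		mutex_enter(&pmap->pm_lock);
	/* ... read the PTE and fill in *pap if a mapping exists ... */
	if (locked)
		mutex_exit(&pmap->pm_lock);

Since the kernel pmap's PTEs can be read safely without its lock, the function stays usable from interrupt context, which is what bus_dma(9) needs.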
Re: Automated report: NetBSD-current/i386 test failure
The remaining failures should be fixed by: 1.181 src/sys/rump/librump/rumpkern/vm.c Cheers, Andrew On Thu, Jan 02, 2020 at 01:26:42PM +, Andrew Doran wrote: > I think this is likely fixed already but will take a look now. > > Andrew > > On Thu, Jan 02, 2020 at 08:35:09AM +, NetBSD Test Fixture wrote: > > > This is an automatically generated notice of new failures of the > > NetBSD test suite. > > > > The newly failing test cases are: > > > > fs/ffs/t_miscquota:log_unlink > > fs/ffs/t_miscquota:log_unlink_remount > > fs/ffs/t_miscquota:npsnapshot_unconf_user > > fs/ffs/t_miscquota:npsnapshot_user > > fs/ffs/t_miscquota:psnapshot_unconf_user > > fs/ffs/t_miscquota:psnapshot_user > > fs/ffs/t_quota2_1:quota_1000_O1_be > > fs/ffs/t_quota2_1:quota_1000_O1_le > > fs/ffs/t_quota2_1:quota_1000_O2_be > > fs/ffs/t_quota2_1:quota_1000_O2_le > > fs/ffs/t_quota2_1:quota_40_O1_be > > fs/ffs/t_quota2_1:quota_40_O1_le > > fs/ffs/t_quota2_1:quota_40_O1_log > > fs/ffs/t_quota2_1:quota_40_O2_be > > fs/ffs/t_quota2_1:quota_40_O2_le > > fs/ffs/t_quota2_1:quota_40_O2_log > > fs/ffs/t_quota2_remount:quota_10_O1_be > > fs/ffs/t_quota2_remount:quota_10_O1_le > > fs/ffs/t_quotalimit:herit_defq_le_1_group > > fs/ffs/t_quotalimit:herit_defq_le_1_group_log > > fs/ffs/t_quotalimit:herit_defq_le_1_user > > fs/ffs/t_quotalimit:herit_defq_le_1_user_log > > fs/ffs/t_quotalimit:herit_idefq_le_1_group_log > > fs/ffs/t_quotalimit:herit_idefq_le_1_user_log > > fs/ffs/t_quotalimit:inolimit_le_1_group > > fs/ffs/t_quotalimit:inolimit_le_1_group_log > > fs/ffs/t_quotalimit:inolimit_le_1_user > > fs/ffs/t_quotalimit:inolimit_le_1_user_log > > fs/ffs/t_quotalimit:limit_le_1_group > > fs/ffs/t_quotalimit:limit_le_1_group_log > > fs/ffs/t_quotalimit:limit_le_1_user > > fs/ffs/t_quotalimit:limit_le_1_user_log > > fs/ffs/t_quotalimit:sinolimit_le_1_group > > fs/ffs/t_quotalimit:sinolimit_le_1_user > > fs/ffs/t_quotalimit:slimit_le_1_group > > fs/ffs/t_quotalimit:slimit_le_1_user > > fs/ffs/t_snapshot:snapshot > > fs/ffs/t_snapshot:snapshotstress > > fs/ffs/t_snapshot_log:snapshot > > fs/ffs/t_snapshot_log:snapshotstress > > fs/ffs/t_snapshot_v2:snapshot > > fs/ffs/t_snapshot_v2:snapshotstress > > fs/ffs/t_update_log:updaterwtolog > > fs/ffs/t_update_log:updaterwtolog_async > > fs/msdosfs/t_snapshot:snapshot > > fs/msdosfs/t_snapshot:snapshotstress > > fs/vfs/t_full:ext2fs_fillfs > > fs/vfs/t_full:tmpfs_fillfs > > fs/vfs/t_io:ext2fs_extendfile > > fs/vfs/t_io:ext2fs_extendfile_append > > fs/vfs/t_io:ext2fs_holywrite > > fs/vfs/t_io:ext2fs_overwrite512 > > fs/vfs/t_io:ext2fs_overwrite64k > > fs/vfs/t_io:ext2fs_overwrite_trunc > > fs/vfs/t_io:ext2fs_read_after_unlink > > fs/vfs/t_io:ext2fs_read_fault > > fs/vfs/t_io:ext2fs_shrinkfile > > fs/vfs/t_io:ext2fs_wrrd_after_unlink > > fs/vfs/t_io:ffs_extendfile > > fs/vfs/t_io:ffs_extendfile_append > > fs/vfs/t_io:ffs_holywrite > > fs/vfs/t_io:ffs_overwrite512 > > fs/vfs/t_io:ffs_read_after_unlink > > fs/vfs/t_io:ffs_read_fault > > fs/vfs/t_io:ffs_shrinkfile > > fs/vfs/t_io:ffs_wrrd_after_unlink > > fs/vfs/t_io:ffslog_extendfile > > fs/vfs/t_io:ffslog_extendfile_append > > fs/vfs/t_io:ffslog_holywrite > > fs/vfs/t_io:ffslog_overwrite512 > > fs/vfs/t_io:ffslog_read_after_unlink > > fs/vfs/t_io:ffslog_read_fault > > fs/vfs/t_io:ffslog_shrinkfile > > fs/vfs/t_io:ffslog_wrrd_after_unlink > > fs/vfs/t_io:nfs_extendfile > > fs/vfs/t_io:nfs_extendfile_append > > fs/vfs/t_io:nfs_holywrite > > fs/vfs/t_io:nfs_overwrite512 > > fs/vfs/t_io:nfs_read_after_unlink > > fs/vfs/t_io:nfs_read_fault > > 
fs/vfs/t_io:nfs_shrinkfile > > fs/vfs/t_io:nfs_wrrd_after_unlink > > fs/vfs/t_io:p2k_ffs_extendfile > > fs/vfs/t_io:p2k_ffs_extendfile_append > > fs/vfs/t_io:p2k_ffs_holywrite > > fs/vfs/t_io:p2k_ffs_overwrite512 > > fs/vfs/t_io:p2k_ffs_read_after_unlink > > fs/vfs/t_io:p2k_ffs_read_fault > > fs/vfs/t_io:p2k_ffs_shrinkfile > > fs/vfs/t_io:p2k_ffs_wrrd_after_unlink > > fs/vfs/t_io:tmpfs_extendfile > > fs/vfs/t_io:tmpfs_extendfile_append > > fs/
Re: Automated report: NetBSD-current/i386 test failure
I think this is likely fixed already but will take a look now. Andrew On Thu, Jan 02, 2020 at 08:35:09AM +, NetBSD Test Fixture wrote: > This is an automatically generated notice of new failures of the > NetBSD test suite. > > The newly failing test cases are: > > fs/ffs/t_miscquota:log_unlink > fs/ffs/t_miscquota:log_unlink_remount > fs/ffs/t_miscquota:npsnapshot_unconf_user > fs/ffs/t_miscquota:npsnapshot_user > fs/ffs/t_miscquota:psnapshot_unconf_user > fs/ffs/t_miscquota:psnapshot_user > fs/ffs/t_quota2_1:quota_1000_O1_be > fs/ffs/t_quota2_1:quota_1000_O1_le > fs/ffs/t_quota2_1:quota_1000_O2_be > fs/ffs/t_quota2_1:quota_1000_O2_le > fs/ffs/t_quota2_1:quota_40_O1_be > fs/ffs/t_quota2_1:quota_40_O1_le > fs/ffs/t_quota2_1:quota_40_O1_log > fs/ffs/t_quota2_1:quota_40_O2_be > fs/ffs/t_quota2_1:quota_40_O2_le > fs/ffs/t_quota2_1:quota_40_O2_log > fs/ffs/t_quota2_remount:quota_10_O1_be > fs/ffs/t_quota2_remount:quota_10_O1_le > fs/ffs/t_quotalimit:herit_defq_le_1_group > fs/ffs/t_quotalimit:herit_defq_le_1_group_log > fs/ffs/t_quotalimit:herit_defq_le_1_user > fs/ffs/t_quotalimit:herit_defq_le_1_user_log > fs/ffs/t_quotalimit:herit_idefq_le_1_group_log > fs/ffs/t_quotalimit:herit_idefq_le_1_user_log > fs/ffs/t_quotalimit:inolimit_le_1_group > fs/ffs/t_quotalimit:inolimit_le_1_group_log > fs/ffs/t_quotalimit:inolimit_le_1_user > fs/ffs/t_quotalimit:inolimit_le_1_user_log > fs/ffs/t_quotalimit:limit_le_1_group > fs/ffs/t_quotalimit:limit_le_1_group_log > fs/ffs/t_quotalimit:limit_le_1_user > fs/ffs/t_quotalimit:limit_le_1_user_log > fs/ffs/t_quotalimit:sinolimit_le_1_group > fs/ffs/t_quotalimit:sinolimit_le_1_user > fs/ffs/t_quotalimit:slimit_le_1_group > fs/ffs/t_quotalimit:slimit_le_1_user > fs/ffs/t_snapshot:snapshot > fs/ffs/t_snapshot:snapshotstress > fs/ffs/t_snapshot_log:snapshot > fs/ffs/t_snapshot_log:snapshotstress > fs/ffs/t_snapshot_v2:snapshot > fs/ffs/t_snapshot_v2:snapshotstress > fs/ffs/t_update_log:updaterwtolog > fs/ffs/t_update_log:updaterwtolog_async > fs/msdosfs/t_snapshot:snapshot > fs/msdosfs/t_snapshot:snapshotstress > fs/vfs/t_full:ext2fs_fillfs > fs/vfs/t_full:tmpfs_fillfs > fs/vfs/t_io:ext2fs_extendfile > fs/vfs/t_io:ext2fs_extendfile_append > fs/vfs/t_io:ext2fs_holywrite > fs/vfs/t_io:ext2fs_overwrite512 > fs/vfs/t_io:ext2fs_overwrite64k > fs/vfs/t_io:ext2fs_overwrite_trunc > fs/vfs/t_io:ext2fs_read_after_unlink > fs/vfs/t_io:ext2fs_read_fault > fs/vfs/t_io:ext2fs_shrinkfile > fs/vfs/t_io:ext2fs_wrrd_after_unlink > fs/vfs/t_io:ffs_extendfile > fs/vfs/t_io:ffs_extendfile_append > fs/vfs/t_io:ffs_holywrite > fs/vfs/t_io:ffs_overwrite512 > fs/vfs/t_io:ffs_read_after_unlink > fs/vfs/t_io:ffs_read_fault > fs/vfs/t_io:ffs_shrinkfile > fs/vfs/t_io:ffs_wrrd_after_unlink > fs/vfs/t_io:ffslog_extendfile > fs/vfs/t_io:ffslog_extendfile_append > fs/vfs/t_io:ffslog_holywrite > fs/vfs/t_io:ffslog_overwrite512 > fs/vfs/t_io:ffslog_read_after_unlink > fs/vfs/t_io:ffslog_read_fault > fs/vfs/t_io:ffslog_shrinkfile > fs/vfs/t_io:ffslog_wrrd_after_unlink > fs/vfs/t_io:nfs_extendfile > fs/vfs/t_io:nfs_extendfile_append > fs/vfs/t_io:nfs_holywrite > fs/vfs/t_io:nfs_overwrite512 > fs/vfs/t_io:nfs_read_after_unlink > fs/vfs/t_io:nfs_read_fault > fs/vfs/t_io:nfs_shrinkfile > fs/vfs/t_io:nfs_wrrd_after_unlink > fs/vfs/t_io:p2k_ffs_extendfile > fs/vfs/t_io:p2k_ffs_extendfile_append > fs/vfs/t_io:p2k_ffs_holywrite > fs/vfs/t_io:p2k_ffs_overwrite512 > fs/vfs/t_io:p2k_ffs_read_after_unlink > fs/vfs/t_io:p2k_ffs_read_fault > fs/vfs/t_io:p2k_ffs_shrinkfile > fs/vfs/t_io:p2k_ffs_wrrd_after_unlink > 
fs/vfs/t_io:tmpfs_extendfile > fs/vfs/t_io:tmpfs_extendfile_append > fs/vfs/t_io:tmpfs_holywrite > fs/vfs/t_io:tmpfs_overwrite512 > fs/vfs/t_io:tmpfs_overwrite64k > fs/vfs/t_io:tmpfs_overwrite_trunc > fs/vfs/t_io:tmpfs_read_after_unlink > fs/vfs/t_io:tmpfs_read_fault > fs/vfs/t_io:tmpfs_shrinkfile > fs/vfs/t_io:tmpfs_wrrd_after_unlink > fs/vfs/t_mtime_write:ext2fs_mtime_update_on_write > fs/vfs/t_mtime_write:ffs_mtime_update_on_write > fs/vfs/t_mtime_write:ffslog_mtime_update_on_write > fs/vfs/t_mtime_write:nfs_mtime_update_on_write > fs/vfs/t_mtime_write:p2k_ffs_mtime_update_on_write > fs/vfs/t_mtime_write:tmpfs_mtime_update_on_write > fs/vfs/t_ro:ext2fs_attrs > fs/vfs/t_ro:ext2fs_createlink > fs/vfs/t_ro:ext2fs_fileio > fs/vfs/t_ro:ext2fs_rmfile > fs/vfs/t_ro:ffs_attrs > fs/vfs/t_ro:ffs_createlink > fs/vfs/t_ro:ffs_fileio > fs/vfs/t_ro:ffs_rmfile > fs/vfs/t_ro:ffslog_attrs > fs/vfs/t_ro:ffslog_createlink > fs/vfs/t_ro:ffslog_fileio > fs/vfs/t_ro:ff
Re: 9.99.32 panic
Hi, Missing commit to sys/uvm/uvm_amap.c. Fixed today. Thanks, Andrew On Wed, Jan 01, 2020 at 05:48:35PM +, Chavdar Ivanov wrote: > Hi, > > I get: > ... > #0 0x80224245 in cpu_reboot () > #1 0x807b9723 in db_reboot_cmd () > #2 0x807b9f3b in db_command () > #3 0x807ba2a6 in db_command_loop () > #4 0x807bdc2a in db_trap () > #5 0x80220b15 in kdb_trap () > #6 0x80225c86 in trap () > #7 0x8021ed63 in alltraps () > #8 0x8021f57d in breakpoint () > #9 0x80a33270 in vpanic () > #10 0x80e79e33 in kern_assert () > #11 0x809ae062 in uvm_pageactivate () > #12 0x809947a2 in amap_cow_now () > #13 0x809a8d1e in uvmspace_fork () > #14 0x8099eea9 in uvm_proc_fork () > #15 0x809d9e8b in fork1 () > #16 0x809daa12 in sys_fork () > #17 0x80254d49 in syscall () > #18 0x802096bd in handle_syscall () > > > on > > uname -a > NetBSD marge.lorien.lan 9.99.32 NetBSD 9.99.32 (GENERIC) #13: Wed Jan > 1 12:31:34 GMT 2020 > sysbuild@ymir:/home/sysbuild/amd64/obj/home/sysbuild/src/sys/arch/amd64/compile/GENERIC > amd64 > > when I try to startxfce4. Normal startx works, as well as xrandr to > get me to the right resolution under VirtualBox (with client additions > 6.1); gdm also starts OK and subsequently the mate session works. > > Chavdar > > > -- >
Re: odd panic
Hi Michael, On Thu, Dec 26, 2019 at 11:04:12AM -0800, Michael Cheponis wrote: > what does this mean? (Received last night on RPi 3B+ that has been h/w > stable): > > panic: kernel diagnostic assertion "uvmexp.swpgonly + npages <= > uvmexp.swpginuse" failed: file "/c/usr/src/sys/uvm/uvm_pager.c", line 472 > > load averages: 3.32, 3.72, 3.93; up 0+11:45:09 > 09:54:27UTC > 63 processes: 6 runnable, 53 sleeping, 4 on CPU > CPU0 states: 68.6% user, 0.0% nice, 27.6% system, 3.8% interrupt, 0.0% > idle > CPU1 states: 94.3% user, 0.0% nice, 5.7% system, 0.0% interrupt, 0.0% > idle > CPU2 states: 89.5% user, 0.0% nice, 10.5% system, 0.0% interrupt, 0.0% > idle > CPU3 states: 74.0% user, 0.0% nice, 17.3% system, 8.7% interrupt, 0.0% > idle > Memory: 403M Act, 204M Inact, 13M Wired, 31M Exec, 259M File, 15M Free > Swap: 3425M Total, 7912K Used, 3418M Free > > [ 1.000] NetBSD 9.99.17 (GENERIC) #2: Wed Nov 13 09:59:13 UTC 2019 > [ 1.000] m...@s.culver.net: > /c/usr/src/sys/arch/evbarm/compile/obj/GENERIC > > after reboot: > > # swapctl -l > Device 1K-blocks UsedAvail Capacity Priority > /dev/ld0b 1310720 131072 0%1 > /swapfile 33763840 3376384 0%2 > Total 35074560 3507456 0 > > So it seems like there was plenty of Swap available. I just removed this assertion with rev 1.118 of src/sys/uvm/uvm_pager.c. The assertion itself is not safe and can race on a multi-CPU machine. Andrew
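For illustration of the race class: the assertion reads two related counters without holding the lock that keeps them consistent, so it can fire even when nothing is actually wrong. A minimal hypothetical sketch, not the actual uvm_pager.c code (the update order here is assumed):

	/* CPU A, under the relevant lock, retires npages swap pages: */
	uvmexp.swpginuse -= npages;	/* step 1 */
	uvmexp.swpgonly -= npages;	/* step 2 */

	/*
	 * CPU B, asserting with no lock held, can observe the counters
	 * between steps 1 and 2: swpginuse is already lowered but
	 * swpgonly is not yet, so the inequality is transiently false.
	 */
	KASSERT(uvmexp.swpgonly + npages <= uvmexp.swpginuse);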
Re: building kernel w/o (CPU_UCODE and COMPAT_60) fails
On Sat, Dec 21, 2019 at 12:46:12PM -, Michael van Elst wrote: > a...@netbsd.org (Andrew Doran) writes: > > >cvs rdiff -u -r1.88 -r1.89 src/sys/kern/kern_cpu.c > >cvs rdiff -u -r1.1 -r1.2 src/sys/kern/subr_cpu.c > > Still broken. The topology print references MD fields (e.g. ci_dev) that > don't exist for a RUMPKERNEL. Hmm. As useful as it is, rump is a minefield. Worked for me yesterday with a full build on amd64. Anyway I think this should fix it: cvs rdiff -u -r1.27 -r1.28 src/sys/arch/aarch64/aarch64/cpu.c cvs rdiff -u -r1.99 -r1.100 src/sys/arch/x86/x86/identcpu.c cvs rdiff -u -r1.2 -r1.3 src/sys/kern/subr_cpu.c cvs rdiff -u -r1.46 -r1.47 src/sys/sys/cpu.h Thanks, Andrew
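The generic shape of this class of fix is to fence machine-dependent accesses off from rump builds. A hypothetical sketch of the pattern only, not the committed diffs (ci_dev is the field named above; "buf", here a preformatted topology string, and the print call around it are made up):

	#ifndef _RUMPKERNEL
		/* ci_dev is MD state that a rump kernel does not provide */
		aprint_verbose_dev(ci->ci_dev, "%s", buf);
	#endif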
Re: building kernel w/o (CPU_UCODE and COMPAT_60) fails
Hi, On Sat, Dec 21, 2019 at 12:30:25PM +0100, K. Schreiner wrote: > after the last changes to src/sys/kern/kern_cpu.c compiling > a custom kernel w/o "options CPU_UCODE" and "options COMPAT_60" > fails in kern_cpu.c: > > compile vNBx64/kern_cpu.o > /u/NetBSD/src/sys/kern/kern_cpu.c: In function 'cpuctl_ioctl': > /u/NetBSD/src/sys/kern/kern_cpu.c:273:13: error: 'compat_cpuctl_ioctl' > undeclared (first use in this function > ); did you mean 'cpuctl_ioctl'? >error = (*compat_cpuctl_ioctl)(l, cmd, data); > ^~~ > cpuctl_ioctl > /u/NetBSD/src/sys/kern/kern_cpu.c:273:13: note: each undeclared identifier is > reported only once for each fun > ction it appears in This should be fixed now with: cvs rdiff -u -r1.88 -r1.89 src/sys/kern/kern_cpu.c cvs rdiff -u -r1.1 -r1.2 src/sys/kern/subr_cpu.c Thank you, Andrew
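For context, the usual shape of such a fix is to compile the hook reference only under an option that declares it, along these lines (a hypothetical sketch, not the committed diff; the fallback error value is purely illustrative):

	#if defined(CPU_UCODE) || defined(COMPAT_60)
		error = (*compat_cpuctl_ioctl)(l, cmd, data);
	#else
		error = ENOTTY;
	#endif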
Re: Automated report: NetBSD-current/i386 test failure
On Wed, Dec 18, 2019 at 01:27:32PM +, Andrew Doran wrote: > On Wed, Dec 18, 2019 at 08:25:15AM +, NetBSD Test Fixture wrote: > > > This is an automatically generated notice of new failures of the > > NetBSD test suite. > > > > The newly failing test cases are: > > > > fs/vfs/t_full:lfs_fillfs > > fs/vfs/t_io:lfs_extendfile > > fs/vfs/t_io:lfs_extendfile_append > > fs/vfs/t_io:lfs_holywrite > > fs/vfs/t_io:lfs_overwrite512 > > fs/vfs/t_io:lfs_overwrite64k > > fs/vfs/t_io:lfs_overwrite_trunc > > fs/vfs/t_io:lfs_read_after_unlink > > fs/vfs/t_io:lfs_read_fault > > fs/vfs/t_io:lfs_shrinkfile > > fs/vfs/t_io:lfs_wrrd_after_unlink > > fs/vfs/t_mtime_write:lfs_mtime_update_on_write > > fs/vfs/t_union:lfs_basic > > fs/vfs/t_vfsops:lfs_tfilehandle > > fs/vfs/t_vnops:lfs_fcntl_getlock_pids > > fs/vfs/t_vnops:lfs_fcntl_lock These should be fixed with revision 1.18 of src/sys/ufs/lfs/lfs_pages.c. Andrew
Re: Kaby Lake panic
Hi, I had a quick look and this code is confusing. I could not see what lock it's trying to get. Presumably you have a netbsd.gdb in the build directory (seems to be the way now). Could you feed it to gdb and try: "info line *execlists_update_context+0x1234" where 0x1234 is the actual offset shown in the backtrace. Andrew On Thu, Dec 19, 2019 at 03:31:15PM +0000, Patrick Welche wrote: > Just tried NetBSD on a laptop that has what I think is a Kaby Lake > Intel chip, and (no serial console - just copied off the screen): > > vpanic > snprintf > lockdebug_abort1 > mutex_enter > execlists_update_context > execlists_context_unqueue > gen8_emit_request > __i915_add_request > i915_gem_init_hw > i915_gem_init > i915_driver_load > drm_dev_register > drm_pci_attach > i915drmkms_attach_real > config_mountroot_thread > > (x/s panicstr doesn't show me much) > > ps shows 2 x configroot, 2 x idle. > > What does one look for in show locks? > > Lock 0 initialized at drm_dev_alloc, sleep/adaptive, no active turnstile > > > Lock 0 initialized at main, spin, holds 0 excl 1, wants 0 excl 1 > curcpu holds 1 wanted by 000 > Lock 1 initialized at logical_ring_init, spin, holds 0 excl 1, wants 0 excl 0 > wait/spin 0/1 > > > Cheers, > > Patrick
Re: Automated report: NetBSD-current/i386 test failure
On Wed, Dec 18, 2019 at 08:25:15AM +0000, NetBSD Test Fixture wrote: > This is an automatically generated notice of new failures of the > NetBSD test suite. > > The newly failing test cases are: > > fs/vfs/t_full:lfs_fillfs > fs/vfs/t_io:lfs_extendfile > fs/vfs/t_io:lfs_extendfile_append > fs/vfs/t_io:lfs_holywrite > fs/vfs/t_io:lfs_overwrite512 > fs/vfs/t_io:lfs_overwrite64k > fs/vfs/t_io:lfs_overwrite_trunc > fs/vfs/t_io:lfs_read_after_unlink > fs/vfs/t_io:lfs_read_fault > fs/vfs/t_io:lfs_shrinkfile > fs/vfs/t_io:lfs_wrrd_after_unlink > fs/vfs/t_mtime_write:lfs_mtime_update_on_write > fs/vfs/t_union:lfs_basic > fs/vfs/t_vfsops:lfs_tfilehandle > fs/vfs/t_vnops:lfs_fcntl_getlock_pids > fs/vfs/t_vnops:lfs_fcntl_lock Unsurprising. I had to spend a couple of hours finding & fixing an ancient bug in LFS to get the tests to pass before commit. > sbin/resize_ffs/t_grow:grow_16M_v1_16384 > sbin/resize_ffs/t_grow:grow_16M_v2_32768 > sbin/resize_ffs/t_grow_swapped:grow_16M_v0_65536 > sbin/resize_ffs/t_grow_swapped:grow_16M_v1_4096 > sbin/resize_ffs/t_grow_swapped:grow_16M_v2_8192 > sbin/resize_ffs/t_shrink:shrink_24M_16M_v0_32768 > sbin/resize_ffs/t_shrink:shrink_24M_16M_v1_65536 > sbin/resize_ffs/t_shrink_swapped:shrink_24M_16M_v0_4096 > sbin/resize_ffs/t_shrink_swapped:shrink_24M_16M_v1_8192 Hmm, I wonder if this is a rump issue. In any case I'll take a look into the failures this evening. Cheers, Andrew
Re: current/Xen i386 broken on 2019-12-16 01:20 UTC
Hi, On Wed, Dec 18, 2019 at 09:48:46AM +0100, Martin Husemann wrote: > On Wed, Dec 18, 2019 at 09:41:45AM +0100, Manuel Bouyer wrote: > > kernel diagnostic assertion "pg->offset >= nextoff" failed: file > > "/home/source/ab/HEAD/src/sys/miscfs/genfs/genfs_io.c", line 972 > > We see that on various architectures. Andrew, any news on that? I think it should be fixed by this commit from the 16th: http://mail-index.netbsd.org/source-changes/2019/12/16/msg111985.html Andrew
Re: amd64 -current build failure
Hi, On Tue, Dec 17, 2019 at 09:49:58AM +, Chavdar Ivanov wrote: > Last two days I haven't been able to build amd64 -current: > ... > /home/sysbuild/amd64/tools/lib/gcc/x86_64--netbsd/8.3.0/../../../../x86_64--netbsd/bin/ld: > /home/sysbuild/amd64/destdir/usr/lib/librump.so: undefined reference > to `rumpns_cpuctl_ioctl' > collect2: error: ld returned 1 exit status I think I fixed that one late last night - your followup message seems to confirm. Andrew
Re: Automated report: NetBSD-current/i386 build failure
On Sun, Dec 15, 2019 at 10:10:15PM +, NetBSD Test Fixture wrote: > This is an automatically generated notice of a NetBSD-current/i386 > build failure. Fixed already with rev 1.285 src/sys/sys/vnode.h. Andrew > The failure occurred on babylon5.netbsd.org, a NetBSD/amd64 host, > using sources from CVS date 2019.12.15.20.33.22. > > An extract from the build.sh output follows: > > > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/i486--netbsdelf-objcopy > -X vfs_quotactl.po > --- vfs_vnode.po --- > # compile librumpvfs/vfs_vnode.po > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/i486--netbsdelf-gcc > -O2 -fno-delete-null-pointer-checks -ffreestanding -fno-strict-aliasing > -msoft-float -mno-mmx -mno-sse -mno-avx -msoft-float -mno-mmx -mno-sse > -mno-avx -std=gnu99-Wall -Wstrict-prototypes -Wmissing-prototypes > -Wpointer-arith -Wno-sign-compare -Wsystem-headers -Wno-traditional > -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual > -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Werror > -Wno-format-zero-length -Wno-pointer-sign -fPIE -fstack-protector > -Wstack-protector --param ssp-buffer-size=1 > --sysroot=/tmp/bracket/build/2019.12.15.20.33.22-i386/destdir -DCOMPAT_50 > -DCOMPAT_60 -DCOMPAT_70 -DCOMPAT_80 -nostdinc -imacros > /tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/include/opt/opt_rumpkernel.h > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs -I. > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../.. > /sys/rump/../../common/include > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/include > > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/include/opt > > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/../arch > > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/.. 
> -DDIAGNOSTIC -DKTRACE -D_FORTIFY_SOURCE=2 -c -DGPROF -DPROF-pg -fPIC > /tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnode.c > -o vfs_vnode.po > --- vfs_subr.pico --- > > /tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_subr.c:1554:1: > error: no previous prototype for 'vfs_mount_print_all' > [-Werror=missing-prototypes] > vfs_mount_print_all(int full, void (*pr)(const char *, ...)) > ^~~ > --- uvm_vnode.po --- > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/nbctfconvert -g -L > VERSION uvm_vnode.po > > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/i486--netbsdelf-objcopy > -X uvm_vnode.po > --- vfs_vnops.pico --- > # compile librumpvfs/vfs_vnops.pico > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/i486--netbsdelf-gcc > -O2 -fno-delete-null-pointer-checks -ffreestanding -fno-strict-aliasing > -msoft-float -mno-mmx -mno-sse -mno-avx -msoft-float -mno-mmx -mno-sse > -mno-avx -std=gnu99-Wall -Wstrict-prototypes -Wmissing-prototypes > -Wpointer-arith -Wno-sign-compare -Wsystem-headers -Wno-traditional > -Wa,--fatal-warnings -Wreturn-type -Wswitch -Wshadow -Wcast-qual > -Wwrite-strings -Wextra -Wno-unused-parameter -Wno-sign-compare -Werror > -Wno-format-zero-length -Wno-pointer-sign -fPIE -fstack-protector > -Wstack-protector --param ssp-buffer-size=1 > --sysroot=/tmp/bracket/build/2019.12.15.20.33.22-i386/destdir -DCOMPAT_50 > -DCOMPAT_60 -DCOMPAT_70 -DCOMPAT_80 -nostdinc -imacros > /tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/include/opt/opt_rumpkernel.h > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs -I. > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../.. > /sys/rump/../../common/include > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/include > > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/include/opt > > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/../arch > > -I/tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/.. > -DDIAGNOSTIC -DKTRACE -D_FORTIFY_SOURCE=2 -c-fPIC > /tmp/bracket/build/2019.12.15.20.33.22-i386/src/lib/librumpvfs/../../sys/rump/../kern/vfs_vnops.c > -o vfs_vnops.pico > --- vfs_trans.pico --- > > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/i486--netbsdelf-objcopy > -x vfs_trans.pico > --- vfs_lockf.po --- > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/nbctfconvert -g -L > VERSION vfs_lockf.po > > /tmp/bracket/build/2019.12.15.20.33.22-i386/tools/bin/i486--netbsdelf-objcopy > -X vfs_lockf.po > --- vfs_vnops.po --- > --- vfs_wapbl.pico --- > --- vfs_vnops.po --- > # compile librumpvfs/vfs_vnops.po > /tmp/bracket
Re: Automated report: NetBSD-current/i386 build failure
Hi Robert, On Tue, Dec 10, 2019 at 08:54:12AM +0700, Robert Elz wrote: > Date: Tue, 10 Dec 2019 00:49:43 +0000 (UTC) > From: NetBSD Test Fixture > Message-ID: <157593898298.13655.3447375934086628...@babylon5.netbsd.org> > > | This is an automatically generated notice of a NetBSD-current/i386 > | build failure. > > | > /tmp/bracket/build/2019.12.09.21.08.56-i386/src/sys/kern/kern_mutex.c:836:7: > error: implicit declaration of function 'mutex_oncpu'; did you mean > 'mutex_dump'? [-Werror=implicit-function-declaration] > | rv = mutex_oncpu(MUTEX_OWNER(mtx->mtx_owner)); > | ^~~ > > > To answer gcc, obviously not ... > > The problem is that mutex_oncpu() is currently only defined with > MULTIPROCESSOR and Xen Dom0 kernels are not MULTIPROCESSOR capable. > > Andrew, can you either fix mutex_owner_running() to not use mutex_oncpu() > in the !MULTIPROCESSOR case, or supply a version of mutex_oncpu() that > is appropriate there. > > My thinking was something like > > #else /* MULTIPROCESSOR */ > static /*inline?*/ bool > mutex_oncpu(uintptr_t owner) > { > return MUTEX_OWNED(owner); > } > > > inserted just before > #endif /* MULTIPROCESSOR */ > > in the obvious place, but that is probably much too simple ? Done, thank you. Too much preprocessor goop in that file; I'll take a look some time. Andrew
Re: Current test failures
On Sat, Dec 07, 2019 at 09:53:35PM +0200, Andreas Gustafsson wrote: > Perhaps, but before Taylor made that commit, at least one other bug > was introduced that is causing the system to panic before finishing > the tests: > > fs/vfs/t_renamerace (726/847): 28 test cases > ext2fs_renamerace: [6.743565s] Failed: Test program received signal 11 > (core dumped) > ext2fs_renamerace_dirs: [6.690776s] Failed: Test program received signal > 11 (core dumped) > ffs_renamerace: [6.602727s] Failed: Test program received signal 11 (core > dumped) > ffs_renamerace_dirs: [ 3923.9308316] panic: kernel diagnostic assertion > "l->l_cpu == ci" failed: file > "/tmp/bracket/build/2019.12.06.21.45.14-amd64-baremetal/src/sys/kern/kern_synch.c", > line 764 > [ 3924.1108893] cpu7: Begin traceback... > [ 3924.1509019] vpanic() at netbsd:vpanic+0x178 > [ 3924.2009181] kern_assert() at netbsd:kern_assert+0x48 > [ 3924.2609379] mi_switch() at netbsd:mi_switch+0x569 > [ 3924.3209576] sleepq_block() at netbsd:sleepq_block+0xb7 > [ 3924.3809774] lwp_park() at netbsd:lwp_park+0x10d > [ 3924.4409956] syslwp_park60() at netbsd:syslwp_park60+0x5d > [ 3924.5110189] syscall() at netbsd:syscall+0x299 > [ 3924.5610351] --- syscall (number 478) --- > [ 3924.6110531] 7adcb44b035a: > [ 3924.6410624] cpu7: End traceback... Fixed with sys/kern/kern_synch.c revision 1.330. > Could everyone please refrain from committing new kernel-crashing bugs > until the test infrastructure has recovered from the previous round? I think that's a reasonable suggestion. Looking at it from a positive viewpoint, your system and ATF appear to be doing a brilliant job of finding problems. Also on a positive note, a minority of the bugs are ancient and were only exposed by recent changes. I will make an effort to run ATF more often locally. Thank you, Andrew
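For anyone wanting to run the suite locally too, the usual invocation on an installed system is along these lines, assuming the standard /usr/tests layout (atf-run/atf-report from the base system, or kyua where installed):

	$ cd /usr/tests
	$ atf-run | atf-report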
Re: LOCKDEBUG: Mutex error: mi_switch,528: spin lock held
Hi, On Sat, Dec 07, 2019 at 07:24:32PM +0900, Kimihiro Nonaka wrote: > I got a panic with recently updated source. This should be fixed with rev. 1.330 of sys/kern/kern_synch.c. Thank you, Andrew
Re: Testbed breakage
Hi, On Fri, Dec 06, 2019 at 04:03:05PM +0200, Andreas Gustafsson wrote: > For the last few days, most of the testbeds have been seeing the > system under test either hang or panic before the ATF tests have run > to completion. The failures are too many and varied to file a PR > about each, but for a start, you can look for "tests: did not > complete" in the following: > > http://releng.netbsd.org/b5reports/i386/commits-2019.12.html > http://releng.netbsd.org/b5reports/amd64/commits-2019.12.html > http://releng.netbsd.org/b5reports/evbarm-aarch64/commits-2019.12.html > http://releng.netbsd.org/b5reports/pmax/commits-2019.12.html > > For sparc, there is PR 54734. Both qemu and gxemul based testbeds are > failing, but my i386 and amd64 testbeds running on real hardware are > not (other than the latest amd64 test run showing 1336 new test > failures, which looks like an unrelated bug). That the failing hosts > are uniprocessors and the working ones are multiprocessors may or may > not be a coincidence. I think I may have just fixed the hangs: http://mail-index.netbsd.org/source-changes/2019/12/06/msg111617.html The xcall-related crash is baffling. I will spend a little time to look into it over the weekend. Andrew
Re: vm.ubc_direct
On Tue, Dec 03, 2019 at 11:28:07PM +0100, Jaromír Doleček wrote: > On Tue, Dec 3, 2019 at 18:59, Chuck Silvers wrote: > > > On Mon, Dec 02, 2019 at 07:10:52PM +0000, Andrew Doran wrote: > > > Hello, > > > > > > In light of the recent discussion, and having asked Jaromir his thoughts > > on > > > the subject, we both think it's time to enable this by default, so it > > gets > > > wider testing. Is there a good reason not to? > > > > > > Cheers, > > > Andrew > > > > The current ubc_direct code still has the problem that I pointed out > > originally, > > which is that it deadlocks if you read() or write() a page of a file into > > a mapping of itself. We should not enable this by default until that > > problem > > is fixed. > > > > Right, I completely forgot about this. > > I have a small program which triggers the deadlock quite reliably. Never > got around to actually adding it to the test suite because it caused problems > on another system I run it on as well. > > Andrew, would you by chance be interested in looking at this? Yes I am, but I think it'll be next year some time if I do, as there are more items on my list. Could you please send me the test program anyway? Incidentally I saw a thundering herd problem with this enabled (PG_WANTED vs. wakeup()). From a quick look it seems like more of a general UBC problem than a problem with ubc_direct, but definitely something that needs looking into. It could be mitigated somewhat with wakeup_one() and associated changes, but it would be nice to address the root cause. This was actually kind of amusing to watch in top(1); I wish I had a video: http://www.netbsd.org/~ad/sh.txt Andrew
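The self-mapping case Chuck describes is easy to express in code. A minimal hypothetical sketch of a trigger, not Jaromir's actual program (file name and sizes are made up, error handling omitted): it read()s page 0 of a file into a mapping of that same page.

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>

	int
	main(void)
	{
		/* "scratch" is assumed to exist and span at least one page */
		int fd = open("scratch", O_RDWR);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);

		/* the destination of the read is a mapping of the very
		   page being read, which is the deadlocking pattern */
		(void)pread(fd, p, 4096, 0);
		return 0;
	}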
Re: vm.ubc_direct
On Tue, Dec 03, 2019 at 09:58:39AM -0800, Chuck Silvers wrote: > The current ubc_direct code still has the problem that I pointed out > originally, > which is that it deadlocks if you read() or write() a page of a file into > a mapping of itself. We should not enable this by default until that problem > is fixed. I didn't know about that. Agree enabling it would be a bad idea then. Cheers, Andrew
Re: vm.ubc_direct
On Mon, Dec 02, 2019 at 08:30:58PM +0100, Kamil Rytarowski wrote: > On 02.12.2019 20:10, Andrew Doran wrote: > > Hello, > > > > In light of the recent discussion, and having asked Jaromir his thoughts on > > the subject, we both think it's time to enable this by default, so it gets > > wider testing. Is there a good reason not to? > > > > Cheers, > > Andrew > > > > There were reported issues in this mail: > > https://mail-index.netbsd.org/current-users/2019/11/20/msg037077.html Yup. I have seen that and am looking for further experiences with it. If someone does run into problems, they can turn it off with the sysctl (and hopefully get us some information about the failure). Cheers, Andrew
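Concretely, the knob is the sysctl this thread is named after, so backing out at runtime looks like:

	# sysctl -w vm.ubc_direct=0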
Re: Xen MP hang in pmap
On Mon, Dec 02, 2019 at 03:47:41PM +0100, Manuel Bouyer wrote: > in pmap_update(): > pmap_update(c0c751c0,c13e2000,c13e4000,c13e2000,2000,0,ccf7dd88,c041babf,c0c8d680,c13e2000) > at netbsd:pmap_update+0x21 > uvm_km_kmem_free(c0c8d680,c13e2000,2000,2,0,c13e3000,1,ccf7dd98,c03d267b,c13e2000) > at netbsd:uvm_km_kmem_free+0x49 > kmem_intr_free(c13e2000,2000,ccf7ddbc,c01359b0,c13e3000,c1428fcc,c0583f60,c13e3000,0,0) > at netbsd:kmem_intr_free+0xaf > kern_free(c13e3000,c1428fcc,c0583f60,c13e3000,0,0,c1423008,ccf7de00,c013ac71,1) > at netbsd:kern_free+0x2b > xenbus_printf(1,c1428fcc,c0583f60,c05b7843,1ff,c013b6f0,c1423008,0,c1331e64,28) > at netbsd:xenbus_printf+0x70 > xbd_xenbus_resume(c1331e48,c057371c,0,c1331e64,c0583f9a,c1331e48,c057d55f,c0b43a40,ccf7de54,c040b56c) > at netbsd:xbd_xenbus_resume+0x251 > xbd_xenbus_attach(c1331ac8,c1331e48,ccf7dee8,ccf7dee8,c0b439f8,c1331e48,ccf7dee8,c1331ac8,c0b439f8,ccf7de80) > at netbsd:xbd_xenbus_attach+0x127 > config_attach_loc(c1331ac8,c0b439f8,0,ccf7dee8,c0133ca0,c133a46c,c1428fcc,c1428f9c,0,ccf7dea0) > at netbsd:config_attach_loc+0x19c > config_found_sm_loc(c1331ac8,c0582f59,0,ccf7dee8,c0133ca0,0,ccf7df00,c0134294,c1331ac8,c0582f59) > at netbsd:config_found_sm_loc+0x5a > config_found_ia(c1331ac8,c0582f59,ccf7dee8,c0133ca0,a,0,c121e87c,0,c1232834,ccf7df2e) > at netbsd:config_found_ia+0x32 > xenbus_probe_device_type(ccf7df2e,1e,c05833be,c121e87c,ccf7df24,0,4,c121e86c,2,6564) at netbsd:xenbus_probe_device_type+0x2f4 > xenbus_probe_frontends(60,0,0,c01351b0,0,c058421b,0,0,ccf7df9c,c0134b6f) at > netbsd:xenbus_probe_frontends+0xb8 > xenbus_probe(0,c13433a0,c0134b20,0,0,c0102031,c13433a0,d6e000,c0b4f200,0) at > netbsd:xenbus_probe+0x2e > xenbus_probe_init(c13433a0,d6e000,c0b4f200,0,c010007a,0,0,0,0,0) at > netbsd:xenbus_probe_init+0x4f > > pmap_update+0x21 is the call to pmap_tlb_shootnow() (which unfortunately > doesn't show up in the stack trace). > > Any idea what can happen here? At this point we're still in autoconf. > Interrupts are working but it's possible that CPUs are not fully up yet. Oops. There was a bug in my TLB shootdown changes. Should be fixed with: /cvsroot/src/sys/arch/x86/x86/x86_tlb.c,v <-- x86_tlb.c new revision: 1.11; previous revision: 1.10 Andrew
vm.ubc_direct
Hello, In light of the recent discussion, and having asked Jaromir his thoughts on the subject, we both think it's time to enable this by default, so it gets wider testing. Is there a good reason not to? Cheers, Andrew
Re: Xen panic in lwp_need_userret()
OK, I just checked in a fix. Andrew On Fri, Nov 29, 2019 at 09:42:44AM +0100, Manuel Bouyer wrote: > On Tue, Nov 26, 2019 at 01:38:08PM +0000, Andrew Doran wrote: > > Hi Manuel, > > > > On Tue, Nov 26, 2019 at 09:01:28AM +0100, Manuel Bouyer wrote: > > > > > Any chance this has been fixed since 2 days ago ? > > > > Yes indeed, since yesterday with rev 1.51 src/sys/kern/kern_softint.c. > > Well, the 201911261940Z now panics with: > [ 1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface > [ 1.000] xencons0 at hypervisor0: Xen Virtual Console Driver > [ 1.030] panic: kernel diagnostic assertion "(flags & RESCHED_UPREEMPT) > != 0" failed: file "/home/source/ab/HEAD/src/sys/arch/x86/x86/x86_machdep.c", > line 317 > [ 1.030] cpu0: Begin traceback... > [ 1.030] vpanic() at netbsd:vpanic+0x146 > [ 1.030] kern_assert() at netbsd:kern_assert+0x48 > [ 1.030] cpu_need_resched() at netbsd:cpu_need_resched+0xb3 > [ 1.030] hardclock() at netbsd:hardclock+0xc4 > [ 1.030] xen_timer_handler() at netbsd:xen_timer_handler+0x66 > [ 1.030] Xresume_xenev7() at netbsd:Xresume_xenev7+0x49 > [ 1.030] --- interrupt --- > [ 1.030] Xspllower() at netbsd:Xspllower+0xe > [ 1.030] cpu0: End traceback... > > (http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201911261940Z_anita.txt) > > -- > Manuel Bouyer > NetBSD: 26 ans d'experience feront toujours la difference > --
Re: Xen panic in lwp_need_userret()
Hi, On Fri, Nov 29, 2019 at 09:42:44AM +0100, Manuel Bouyer wrote: > > Yes indeed, since yesterday with rev 1.51 src/sys/kern/kern_softint.c. > > Well, the 201911261940Z now panics with: > [ 1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface > [ 1.000] xencons0 at hypervisor0: Xen Virtual Console Driver > [ 1.030] panic: kernel diagnostic assertion "(flags & RESCHED_UPREEMPT) > != 0" failed: file "/home/source/ab/HEAD/src/sys/arch/x86/x86/x86_machdep.c", > line 317 > [ 1.030] cpu0: Begin traceback... > [ 1.030] vpanic() at netbsd:vpanic+0x146 > [ 1.030] kern_assert() at netbsd:kern_assert+0x48 > [ 1.030] cpu_need_resched() at netbsd:cpu_need_resched+0xb3 > [ 1.030] hardclock() at netbsd:hardclock+0xc4 > [ 1.030] xen_timer_handler() at netbsd:xen_timer_handler+0x66 > [ 1.030] Xresume_xenev7() at netbsd:Xresume_xenev7+0x49 > [ 1.030] --- interrupt --- > [ 1.030] Xspllower() at netbsd:Xspllower+0xe > [ 1.030] cpu0: End traceback... > > (http://www-soc.lip6.fr/~bouyer/NetBSD-tests/xen/HEAD/amd64/201911261940Z_anita.txt) You have a CPU hog stuck in kernel and it's trying to force it off with a kernel preemption, but NetBSD/xen doesn't have kernel preemption. I'll fix it when I get home from work later. Andrew
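The shape of the fix, as a hypothetical sketch only (RESCHED_UPREEMPT is from the panic message; RESCHED_KPREEMPT and __HAVE_PREEMPTION are assumed counterparts, and the real commit may differ): request an in-kernel preemption only on ports that support it, and downgrade to a user-boundary preemption otherwise.

	#if defined(__HAVE_PREEMPTION)
		if ((flags & RESCHED_KPREEMPT) != 0) {
			/* kick the CPU; the LWP is preempted in-kernel */
			return;
		}
	#else
		/* no kernel preemption on this port (e.g. Xen): preempt
		   on the next return to user space instead */
		flags |= RESCHED_UPREEMPT;
	#endif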
Re: Xen panic in lwp_need_userret()
Hi Manuel, On Tue, Nov 26, 2019 at 09:01:28AM +0100, Manuel Bouyer wrote: > Any chance this has been fixed since 2 days ago ? Yes indeed, since yesterday with rev 1.51 src/sys/kern/kern_softint.c. Cheers, Andrew
Re: Crash with HEAD on amd64 - in setrunnable()
Hi Paul, On Sun, Nov 24, 2019 at 07:15:24PM -0800, Paul Goyette wrote: > On Sun, 24 Nov 2019, Paul Goyette wrote: > > > With a very current kernel, I just got this: > > > > # crash -M /var/crash/netbsd.21.core -N /netbsd.gdb > > Crash version 9.99.18, image version 9.99.18. > > System panicked: kernel diagnostic assertion "lwp_locked(l, > > l->l_cpu->ci_schedstate.spc_lwplock)" failed: file > > "/build/netbsd-local/src_ro/sys/kern/kern_synch.c", line 910 > > Backtrace from time of crash is available. > > crash> bt > > _KERNEL_OPT_NVGA_RASTERCONSOLE() at 0 > > ?() at de890ce0af54 > > vpanic() at vpanic+0x181 > > kern_assert() at kern_assert+0x48 > > setrunnable() at setrunnable+0x179 > > lwp_start() at lwp_start+0xba > > do_lwp_create() at do_lwp_create+0xa1 > > sys__lwp_create() at sys__lwp_create+0xc1 > > syscall() at syscall+0x28a > > --- syscall (number 309) --- > > 45ae46: > > crash> > > > > > > (Obviously, I have a core dump, so I'll be happy to investigate further > > if anyone has suggestions.) > > Perhaps this is the "potential panic" that ad@ references in this commit > log message? :) Thanks for letting me know. Yes, that one should be fixed already. Andrew
Re: Automated report: NetBSD-current/sparc test failure
I checked in a potential fix for this. More scheduler changes to come later today, though. Andrew On Sat, Nov 23, 2019 at 12:48:29PM +, NetBSD Test Fixture wrote: > This is an automatically generated notice of new failures of the > NetBSD test suite. > > The newly failing test cases are: > > lib/libc/sys/t_sendrecv:sendrecv_basic > lib/libc/sys/t_sendrecv:sendrecv_rerror > > The above tests failed in each of the last 3 test runs, and passed in > at least 27 consecutive runs before that. > > The following commits were made between the last successful test and > the failed test: > > 2019.11.21.16.45.05 tkusumi src/usr.sbin/autofs/autounmountd.c,v 1.2 > 2019.11.21.17.47.23 ad src/sys/dev/pci/ichsmb.c,v 1.61 > 2019.11.21.17.47.53 ad src/sys/uvm/uvm_glue.c,v 1.170 > 2019.11.21.17.50.49 ad src/sys/kern/kern_resource.c,v 1.183 > 2019.11.21.17.50.49 ad src/sys/kern/kern_softint.c,v 1.49 > 2019.11.21.17.54.04 ad src/sys/kern/sys_pset.c,v 1.22 > 2019.11.21.17.54.04 ad src/sys/sys/pset.h,v 1.7 > 2019.11.21.17.57.40 ad src/sys/kern/kern_timeout.c,v 1.57 > 2019.11.21.18.17.36 ad src/sys/kern/kern_lwp.c,v 1.209 > 2019.11.21.18.17.36 ad src/sys/kern/kern_sig.c,v 1.380 > 2019.11.21.18.22.05 ad src/sys/kern/kern_lwp.c,v 1.210 > 2019.11.21.18.46.40 martin src/distrib/notes/common/main,v 1.552 > 2019.11.21.18.56.55 ad src/sys/kern/kern_sleepq.c,v 1.52 > 2019.11.21.18.56.55 ad src/sys/kern/kern_turnstile.c,v 1.33 > 2019.11.21.18.56.55 ad src/sys/sys/sleepq.h,v 1.26 > 2019.11.21.19.02.43 ad src/sys/arch/alpha/alpha/ipifuncs.c,v 1.49 > 2019.11.21.19.23.16 martin src/distrib/notes/Makefile.inc,v 1.44 > 2019.11.21.19.23.16 martin src/distrib/notes/acorn32/contents,v 1.5 > 2019.11.21.19.23.16 martin src/distrib/notes/alpha/contents,v 1.19 > 2019.11.21.19.23.17 martin src/distrib/notes/amd64/contents,v 1.8 > 2019.11.21.19.23.17 martin src/distrib/notes/amiga/contents,v 1.23 > 2019.11.21.19.23.17 martin src/distrib/notes/amiga/install,v 1.37 > 2019.11.21.19.23.17 martin src/distrib/notes/arc/contents,v 1.5 > 2019.11.21.19.23.17 martin src/distrib/notes/atari/contents,v 1.23 > 2019.11.21.19.23.17 martin src/distrib/notes/atari/xfer,v 1.18 > 2019.11.21.19.23.17 martin src/distrib/notes/bebox/contents,v 1.5 > 2019.11.21.19.23.17 martin src/distrib/notes/cats/contents,v 1.5 > 2019.11.21.19.23.17 martin src/distrib/notes/common/contents,v 1.179 > 2019.11.21.19.23.17 martin src/distrib/notes/common/legal.common,v 1.99 > 2019.11.21.19.23.17 martin src/distrib/notes/common/main,v 1.553 > 2019.11.21.19.23.17 martin src/distrib/notes/common/netboot,v 1.37 > 2019.11.21.19.23.17 martin src/distrib/notes/common/postinstall,v 1.84 > 2019.11.21.19.23.17 martin src/distrib/notes/common/sysinst,v 1.108 > 2019.11.21.19.23.17 martin src/distrib/notes/common/xfer,v 1.76 > 2019.11.21.19.23.17 martin src/distrib/notes/emips/contents,v 1.5 > 2019.11.21.19.23.17 martin src/distrib/notes/emips/install,v 1.3 > 2019.11.21.19.23.18 martin src/distrib/notes/evbarm/contents,v 1.5 > 2019.11.21.19.23.18 martin src/distrib/notes/evbarm/install,v 1.10 > 2019.11.21.19.23.18 martin src/distrib/notes/evbppc/contents,v 1.6 > 2019.11.21.19.23.18 martin src/distrib/notes/ews4800mips/contents,v 1.5 > 2019.11.21.19.23.18 martin src/distrib/notes/hp300/contents,v 1.20 > 2019.11.21.19.23.18 martin src/distrib/notes/hp300/upgrade,v 1.20 > 2019.11.21.19.23.18 martin src/distrib/notes/hpcarm/contents,v 1.6 > 2019.11.21.19.23.18 martin src/distrib/notes/hpcmips/contents,v 1.13 > 2019.11.21.19.23.18 martin src/distrib/notes/hpcsh/contents,v 1.6 > 
2019.11.21.19.23.18 martin src/distrib/notes/hppa/contents,v 1.5 > 2019.11.21.19.23.18 martin src/distrib/notes/i386/contents,v 1.32 > 2019.11.21.19.23.19 martin src/distrib/notes/landisk/contents,v 1.6 > 2019.11.21.19.23.19 martin src/distrib/notes/mac68k/contents,v 1.23 > 2019.11.21.19.23.19 martin src/distrib/notes/mac68k/install,v 1.31 > 2019.11.21.19.23.19 martin src/distrib/notes/mac68k/prep,v 1.18 > 2019.11.21.19.23.19 martin src/distrib/notes/mac68k/xfer,v 1.21 > 2019.11.21.19.23.19 martin src/distrib/notes/macppc/contents,v 1.17 > 2019.11.21.19.23.19 martin src/distrib/notes/macppc/install,v 1.42 > 2019.11.21.19.23.19 martin src/distrib/notes/mmeye/contents,v 1.6 > 2019.11.21.19.23.19 martin src/distrib/notes/mvme68k/contents,v 1.16 > 2019.11.21.19.23.19 martin src/distrib/notes/mvme68k/xfer,v 1.20 > 2019.11.21.19.23.19 martin src/distrib/notes/news68k/contents,v 1.9 > 2019.11.21.19.23.19 martin src/distrib/notes/newsmips/contents,v 1.5 > 2019.11.21.19.23.19 martin src/distrib/notes/next68k/contents,v 1.10 > 2019.11.21.19.23.20 martin src/distrib/notes/ofppc/contents,v 1.6 > 2019.11.21.19.23.20 martin src/distrib/notes/pmax/contents,v 1.20 > 2019.11.21.19.23.20 m