Re: CVS commit: src/sys/netinet
Ryota Ozakiwrites: > One possible fix has been committed. > > Can you update the source code and try a new kernel? Will do. Meanwhile, before I got around to building a kernel with debug options enabled, I had another hang. Got it into DDB successfully, but then I managed, while looking for a way to switch between CPUs (the man page is wrong in this respect), to get DDB to look at something it shouldn't have, so it just said "fatal pag", and that was that. I did get a backtrace of CPU 0, though, and it looked interesting: pool_catchup() pool_get() pool_cache_get_slow() pool_cache_get_paddr() m_get() m_gethdr() wm_add_rxbuf() wm_rxeof() wm_intr_legacy() intr_biglock_wrapper() Xintr_ioapic_level2() --- interrupt --- Xspllower() uvm_km_kmem_alloc() pool_page_alloc() pool_grow() pool_catchup() pool_get() pool_cache_get_slow() pool_cache_get_paddr() m_get() m_gethdr() tcp_output() tcp_send_wrapper() sosend() soo_write() dofilewrite() sys_write() syscall() --- syscall (number 4) --- -tih -- Most people who graduate with CS degrees don't understand the significance of Lisp. Lisp is the most important idea in computer science. --Alan Kay
Re: CVS commit: src/sys/netinet6
On Tue, 26 Dec 2017, Ryota Ozaki wrote: Well, since the lock _might_ be released (and subsequently reacquired) by callout_halt(), it might be easiest to modify all the callers to just unlock it before calling nd6_dad_stoptimer(), and reacquire the mutex after it returns (as well as re-establishing any values that might have changed while the mutex was released). The callers needs to be prepared for the release/reacquire situation anyway, so the change should not be significant. As noted in the callout_halt(9) man page ...to avoid race conditions the caller should always assume that [the] interlock has been released and reacquired, and act accordingly. Alternatively, you could modify all the callers to always acquire the mutex before calling nd6_dad_stoptimer(). I mean such changes are tough and mess up many codes which we don't hope. Yes, the changes are not trivial. But the first option above should already have been done when you changed from callout_stop() to _halt(). But making a run-time decision with mutex_owned() is not a good idea. If it must be respected we can back to callout_stop because it's unsafe but NetBSD 7 uses it without any issues; using callout_stop in nd6_dad_stoptimer might be safe somehow (if not NET_MPSAFE). IMHO, we definitely do not want to use mutex_owned() in this way. I do not believe that going backwards to callout_stop() is the right approach. Again IMHO, the _right_ thing to do is to continue using callout_halt() and modify the callers. Yes, it might be a lot of work, and it might initially introduce new errors, but in the long run it is the correct approach. IMHO. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++
Re: CVS commit: src/sys/netinet6
On Tue, Dec 26, 2017 at 12:37 PM, Paul Goyettewrote: > On Tue, 26 Dec 2017, Ryota Ozaki wrote: > >> On Tue, Dec 26, 2017 at 11:35 AM, Paul Goyette wrote: >>> >>> To generate a diff of this commit: >>> >>> >>> >>> # cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c >>> @@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp) >>> #ifdef NET_MPSAFE >>> callout_halt(>dad_timer_ch, NULL); >>> #else >>> - callout_halt(>dad_timer_ch, softnet_lock); >>> + /* XXX still need the trick for softnet_lock */ >>> + if (mutex_owned(softnet_lock)) >>> + callout_halt(>dad_timer_ch, softnet_lock); >>> + else >>> + callout_halt(>dad_timer_ch, NULL); >>> #endif >>> } >>> >>> This goes against the restriction noted in the mutex(9) man page: >>> >>>[mutex_owned()] should not be used to make locking decisions at run >>>time. ... >>> >>> Please find a different way to make this run-time decision. >> >> >> I know the restriction well, but for softnet_lock, following the >> restriction >> is sometimes hard. I don't have an option to statically decide if >> softnet_lock is held or not without messing up many codes. >> >> An option we can have here is to give up using callout_halt and use >> callout_stop that may be unsafe. >> >> Which is better for us? > > > Well, since the lock _might_ be released (and subsequently reacquired) > by callout_halt(), it might be easiest to modify all the callers to > just unlock it before calling nd6_dad_stoptimer(), and reacquire the > mutex after it returns (as well as re-establishing any values that > might have changed while the mutex was released). > > The callers needs to be prepared for the release/reacquire situation > anyway, so the change should not be significant. As noted in the > callout_halt(9) man page > >...to avoid race conditions the caller should always assume that >[the] interlock has been released and reacquired, and act >accordingly. > > > Alternatively, you could modify all the callers to always acquire the > mutex before calling nd6_dad_stoptimer(). I mean such changes are tough and mess up many codes which we don't hope. > > But making a run-time decision with mutex_owned() is not a good idea. If it must be respected we can back to callout_stop because it's unsafe but NetBSD 7 uses it without any issues; using callout_stop in nd6_dad_stoptimer might be safe somehow (if not NET_MPSAFE). ozaki-r
Re: CVS commit: src/sys/netinet6
On Tue, 26 Dec 2017, Ryota Ozaki wrote: On Tue, Dec 26, 2017 at 11:35 AM, Paul Goyettewrote: To generate a diff of this commit: # cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c @@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp) #ifdef NET_MPSAFE callout_halt(>dad_timer_ch, NULL); #else - callout_halt(>dad_timer_ch, softnet_lock); + /* XXX still need the trick for softnet_lock */ + if (mutex_owned(softnet_lock)) + callout_halt(>dad_timer_ch, softnet_lock); + else + callout_halt(>dad_timer_ch, NULL); #endif } This goes against the restriction noted in the mutex(9) man page: [mutex_owned()] should not be used to make locking decisions at run time. ... Please find a different way to make this run-time decision. I know the restriction well, but for softnet_lock, following the restriction is sometimes hard. I don't have an option to statically decide if softnet_lock is held or not without messing up many codes. An option we can have here is to give up using callout_halt and use callout_stop that may be unsafe. Which is better for us? Well, since the lock _might_ be released (and subsequently reacquired) by callout_halt(), it might be easiest to modify all the callers to just unlock it before calling nd6_dad_stoptimer(), and reacquire the mutex after it returns (as well as re-establishing any values that might have changed while the mutex was released). The callers needs to be prepared for the release/reacquire situation anyway, so the change should not be significant. As noted in the callout_halt(9) man page ...to avoid race conditions the caller should always assume that [the] interlock has been released and reacquired, and act accordingly. Alternatively, you could modify all the callers to always acquire the mutex before calling nd6_dad_stoptimer(). But making a run-time decision with mutex_owned() is not a good idea. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++
Re: CVS commit: src/sys/netinet
One possible fix has been committed. Can you update the source code and try a new kernel? Thanks, ozaki-r On Tue, Dec 26, 2017 at 1:00 AM, Ryota Ozakiwrote: > On Mon, Dec 25, 2017 at 8:31 PM, Tom Ivar Helbekkmo > wrote: >> Martin Husemann writes: >> >>> The sparc64 hang happened with: >>> >>> $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $ >> >> My amd64 that experienced hangs has: >> >> $NetBSD: ip_output.c,v 1.287 2017/12/15 04:03:46 ozaki-r Exp $ >> >> ...and was last updated from CVS on December 19th, which made it much >> more stable than a kernel from a few days previous to that (from the >> 12th, I think). After upgrading, I was trying to update a lot of >> installed packages from pkgsrc, with pkgsrc, distfiles, and the binary >> package repository NFS mounted, and a local pkgobj for building in. I >> gave up in the end, because of the hangs, but managed to build >> everything after updating on the 19th. Since then, I've only had the >> one hard hang. > > My recent changes: > - 12/6-7: IFNET_LOCK changes #1 > - 12/14-15: IFNET_LOCK changes #2 > - 12/19:Disable IFEF_MPSAFE by default > > So your experiences looks synchronized to the changes. > > Can you run the kernel with LOCKDEBUG and DEBUG enabled? And if once your > system hangs again, can you enter the DDB and collect threads stuck on > turnstile? > > Thanks, > ozaki-r
Re: CVS commit: src/sys/netinet6
On Tue, Dec 26, 2017 at 11:35 AM, Paul Goyettewrote: > >> To generate a diff of this commit: > > > # cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c > @@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp) > #ifdef NET_MPSAFE > callout_halt(>dad_timer_ch, NULL); > #else > - callout_halt(>dad_timer_ch, softnet_lock); > + /* XXX still need the trick for softnet_lock */ > + if (mutex_owned(softnet_lock)) > + callout_halt(>dad_timer_ch, softnet_lock); > + else > + callout_halt(>dad_timer_ch, NULL); > #endif > } > > This goes against the restriction noted in the mutex(9) man page: > >[mutex_owned()] should not be used to make locking decisions at run >time. ... > > Please find a different way to make this run-time decision. I know the restriction well, but for softnet_lock, following the restriction is sometimes hard. I don't have an option to statically decide if softnet_lock is held or not without messing up many codes. An option we can have here is to give up using callout_halt and use callout_stop that may be unsafe. Which is better for us? ozaki-r
Re: CVS commit: src/sys/netinet6
To generate a diff of this commit: # cvs rdiff -u -r1.139 -r1.140 src/sys/netinet6/nd6_nbr.c @@ -1097,7 +1097,11 @@ nd6_dad_stoptimer(struct dadq *dp) #ifdef NET_MPSAFE callout_halt(>dad_timer_ch, NULL); #else - callout_halt(>dad_timer_ch, softnet_lock); + /* XXX still need the trick for softnet_lock */ + if (mutex_owned(softnet_lock)) + callout_halt(>dad_timer_ch, softnet_lock); + else + callout_halt(>dad_timer_ch, NULL); #endif } This goes against the restriction noted in the mutex(9) man page: [mutex_owned()] should not be used to make locking decisions at run time. ... Please find a different way to make this run-time decision. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++
Re: CVS commit: src/sys/netinet
On Mon, Dec 25, 2017 at 8:31 PM, Tom Ivar Helbekkmowrote: > Martin Husemann writes: > >> The sparc64 hang happened with: >> >> $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $ > > My amd64 that experienced hangs has: > > $NetBSD: ip_output.c,v 1.287 2017/12/15 04:03:46 ozaki-r Exp $ > > ...and was last updated from CVS on December 19th, which made it much > more stable than a kernel from a few days previous to that (from the > 12th, I think). After upgrading, I was trying to update a lot of > installed packages from pkgsrc, with pkgsrc, distfiles, and the binary > package repository NFS mounted, and a local pkgobj for building in. I > gave up in the end, because of the hangs, but managed to build > everything after updating on the 19th. Since then, I've only had the > one hard hang. My recent changes: - 12/6-7: IFNET_LOCK changes #1 - 12/14-15: IFNET_LOCK changes #2 - 12/19:Disable IFEF_MPSAFE by default So your experiences looks synchronized to the changes. Can you run the kernel with LOCKDEBUG and DEBUG enabled? And if once your system hangs again, can you enter the DDB and collect threads stuck on turnstile? Thanks, ozaki-r
Re: CVS commit: src/sys/netinet
On Mon, Dec 25, 2017 at 8:16 PM, Martin Husemannwrote: > On Mon, Dec 25, 2017 at 12:05:28PM +0900, Ryota Ozaki wrote: >> Anyway I annoy that we often cannot have suspect commits because of cvs >> (you know the kernel version doesn't work at all for the purpose). I wish >> we had revision IDs of svn or commit IDs of git to know them. > > The sparc64 hang happened with: > > $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $ Thanks. Hmm, I have no idea why and how the change hangs the system. If it's a deadlock can you run ps in DDB and gather threads stuck on turnstile? If not, ...I have no idea :-/ ozaki-r
Re: CVS commit: src/distrib/ews4800mips/floppies/ramdisk
In article <20171225061550.e13a3f...@cvs.netbsd.org>, Rin Okuyamawrote: >-=-=-=-=-=- > >Module Name: src >Committed By: rin >Date: Mon Dec 25 06:15:50 UTC 2017 > >Modified Files: > src/distrib/ews4800mips/floppies/ramdisk: list > >Log Message: >Drop the following features, which reduces ramdisk.bin about 70KB: >- shutdown, rcmd, rcp >- support for byte-swapped FFS and Apple UFS in fsck_ffs and newfs >- support for byte-swapped label and interactive editor in disklabel >OK tsutsui You didn't need to do this, because the real issue was that the kernel grew (which I fixed by removing DDB and making compat a library since there are no compat options in the install kernel). So now we should have space to put back DDB or those :-) christos
Re: CVS commit: src/sys/netinet
Martin Husemannwrites: > The sparc64 hang happened with: > > $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $ My amd64 that experienced hangs has: $NetBSD: ip_output.c,v 1.287 2017/12/15 04:03:46 ozaki-r Exp $ ...and was last updated from CVS on December 19th, which made it much more stable than a kernel from a few days previous to that (from the 12th, I think). After upgrading, I was trying to update a lot of installed packages from pkgsrc, with pkgsrc, distfiles, and the binary package repository NFS mounted, and a local pkgobj for building in. I gave up in the end, because of the hangs, but managed to build everything after updating on the 19th. Since then, I've only had the one hard hang. -tih -- Most people who graduate with CS degrees don't understand the significance of Lisp. Lisp is the most important idea in computer science. --Alan Kay
Re: CVS commit: src/sys/netinet
On Mon, Dec 25, 2017 at 12:05:28PM +0900, Ryota Ozaki wrote: > Anyway I annoy that we often cannot have suspect commits because of cvs > (you know the kernel version doesn't work at all for the purpose). I wish > we had revision IDs of svn or commit IDs of git to know them. The sparc64 hang happened with: $NetBSD: ip_output.c,v 1.288 2017/12/22 11:22:37 ozaki-r Exp $ Martin
Re: CVS commit: src/sys
On 25.12.2017 02:21, m...@netbsd.org wrote: > On Tue, Dec 19, 2017 at 07:40:04PM +, Kamil Rytarowski wrote: >> Module Name: src >> Committed By:kamil >> Date:Tue Dec 19 19:40:04 UTC 2017 >> >> Modified Files: >> src/sys/compat/netbsd32: netbsd32_netbsd.c netbsd32_syscall.h >> netbsd32_syscallargs.h netbsd32_syscalls.c >> netbsd32_syscalls_autoload.c netbsd32_sysent.c >> netbsd32_systrace_args.c syscalls.master >> src/sys/kern: init_sysent.c syscalls.c syscalls.master >> syscalls_autoload.c systrace_args.c >> src/sys/rump/include/rump: rump_syscalls.h >> src/sys/rump/librump/rumpkern: rump_syscalls.c >> src/sys/sys: syscall.h syscallargs.h >> src/sys/uvm: uvm_unix.c >> >> Log Message: >> Drop SYS_vadvise >> >> The (o)vadvise syscall is dummy since the beginning of NetBSD. >> >> It is an obsolete remnant from the old UNIX. > > I think this removes a symbol from libc too > This is correct. $ nm /usr/lib/libc.so|grep vadvise 0003b980 T vadvise I will fix it. signature.asc Description: OpenPGP digital signature