Re: Working on ufs_rename patches for NetBSD-5
hello David. I'm still working on porting your ufs_rename patches to NetBSD-5.x, and I've made even more progress. Now, I can run without filesystem corruption of any kind without logging, or with WAPBL logging enabled. Softdep still isn't working, but I have a feeling I'm close to resolving that issue. However, I'm intermittently running into the same issue you reported in July. Namely, When runing your dirconc test with WAPBL logging enabled, I pretty reliably get: panic: lockdebug_barrier: holding 1 shared locks (curlwp = 0xcbc6fcc0) Note that I've patched the lockdebug code to show me the relevant curlwp at the time of the panic. In loking at my crash dump with gdb, I find I have a question. This panic comes from subr_lockdebug.c, line 664, or there abouts, under the NetBSD-5.x sources. It says, in part, if (l-l_shlocks != 0) { panic(lockdebug_barrier: holding %d shared locks (curlwp = 0x%x), l-l_shlocks, (unsigned int)l); } This is after it's checked for an actual lock structure, and just before it declares success. The arguments to the function in this case are all 0, meaning there's no actual lock to be checked, or, at least, I don't think there is. In fact, the call I'm losing things on is from: sys/sys/userret.h, line 104. That line reads: LOCKDEBUG_BARRIER(NULL, 0); I notice in earlier parts of subr_lockdebug.c, that l_shlocks gets set at the same time as ld-ld_shares. Does anyone know why, on this particular check, we do the check for a shared lock in the struct lwp without matching it up with an actual lock? In the backtrace below, you'll notice the call to lockdebug_barrier is passed 0's for arguments across the board, meaning all the code above the lines that panic are rendered moot, unless I'm not understanding something here in a big way, which is highly probabl. Also, everywhere else we manipulate l_shlocks, we do it while holding splhigh. We do this check, ans subsequent panic, without holding splhigh. Is it possible something's changing under us while we're still checking things out? I don't get the panic every time, and sometimes it takes a while to hit after I start the tests, but when it panics, the traces always look the same. Should this check really not fire if there isn't a matching ld structure to go with the lwp in question? Any light anyone could shed on these questions would be greatly appreciated. -thanks -Brian (gdb) target kvm netbsd.22.core #0 0xc050f8e2 in cpu_reboot (howto=256, bootstr=0x0) at /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c:924 924 /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c: No such file or directory. in /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c (gdb)kvm proc 0xcbc6fcc0 #0 0xc0448bdf in mi_switch (l=0xcbc6fcc0) at /usr/local/netbsd/src/sys/kern/kern_synch.c:765 765 /usr/local/netbsd/src/sys/kern/kern_synch.c: No such file or directory. in /usr/local/netbsd/src/sys/kern/kern_synch.c (gdb) bt #0 0xc0448bdf in mi_switch (l=0xcbc6fcc0) at /usr/local/netbsd/src/sys/kern/kern_synch.c:765 #1 0xc044594b in sleepq_block (timo=0, catch=false) at /usr/local/netbsd/src/sys/kern/kern_sleepq.c:269 #2 0xc0424f5c in cv_wait (cv=0xc122fcb0, mtx=0xca4fca10) at /usr/local/netbsd/src/sys/kern/kern_condvar.c:201 #3 0xc0492f52 in biowait (bp=0xc122fc18) at /usr/local/netbsd/src/sys/kern/vfs_bio.c:1515 #4 0xc04a665c in wapbl_doio (data=0xc135be00, len=512, devvp=0xca4fca10, pbn=1301769, flags=0) at /usr/local/netbsd/src/sys/kern/vfs_wapbl.c:745 ---Type return to continue, or q return to quit--- #5 0xc04a773f in wapbl_circ_write (wl=0xc1294700, data=0xc135be00, len=512, offp=0xcbc838c8) at /usr/local/netbsd/src/sys/kern/vfs_wapbl.c:800 #6 0xc04a7e59 in wapbl_flush (wl=0xc1294700, waitfor=0) at /usr/local/netbsd/src/sys/kern/vfs_wapbl.c:2000 #7 0xc03aea08 in ffs_sync (mp=0xcb38c644, waitfor=2, cred=0xcb926900) at /usr/local/netbsd/src/sys/ufs/ffs/ffs_vfsops.c:1823 #8 0xc049b44c in VFS_SYNC (mp=0xcb38c644, a=2, b=0xcb926900) at /usr/local/netbsd/src/sys/kern/vfs_subr.c:3064 #9 0xc04a2d9c in sys_sync (l=0xcbc6fcc0, v=0x0, retval=0x0) at /usr/local/netbsd/src/sys/kern/vfs_syscalls.c:825 ---Type return to continue, or q return to quit--- #10 0xc049c12e in vfs_shutdown () at /usr/local/netbsd/src/sys/kern/vfs_subr.c:2383 #11 0xc050f95b in cpu_reboot (howto=256, bootstr=0x0) at /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c:910 #12 0xc015f1e9 in db_sync_cmd (addr=-876070228, have_addr=false, count=-1, modif=0xcbc83a18 \003ÚÀc) at /usr/local/netbsd/src/sys/ddb/db_command.c:1304 #13 0xc015f9a8 in db_command (last_cmdp=0xc0a4917c) at /usr/local/netbsd/src/sys/ddb/db_command.c:926 #14 0xc015fc22 in db_command_loop () ---Type return to continue, or q return to quit--- at /usr/local/netbsd/src/sys/ddb/db_command.c:583 #15
Re: CVS commit: src
On Tue, Mar 13, 2012 at 06:41:18PM +, Elad Efrat wrote: Log Message: Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with something meaningful. All relevant documentation has been updated or written. Most of these changes were brought up in the following messages: http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html Thanks to christos, manu, njoly, and jmmv for input. Huge thanks to pgoyette for spinning these changes through some build cycles and ATF. This seems to cause deadlocks in the *fs_rename_dir tests. See recent (new) test failures both on i386 and amd64 in fs/vfs/t_vnops. To reprocuce, do: cd /usr/tests/fs/vfs atf-run t_vnops | atf-report Martin
Re: CVS commit: src
On Wed, Mar 14, 2012 at 09:55:21AM +, Martin Husemann wrote: This seems to cause deadlocks in the *fs_rename_dir tests. Also the page residency check written by thorpej years ago now fails for the first time. - Jukka.
patch for kern/46168
Hello, has someone comments about the patch I propose in kern/46168 before I commit it ? It's also attached below. Basically, it looks like a LWP may be added to a proc while lwp_wait1() has released the p_lock, even if lwp_wait1() returns 0. As a result, we're waiting for a LWP which doesn't have the LW_WEXIT flag. What the patch does is always reprocess the lwp list if lwp_wait1() returns and p_nlwps is still 1. This doesn't cause new test failures (I did 2 runs) and I coudln't reproduce the problem in kern/46168 with this patch. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference -- Index: kern/kern_exit.c === RCS file: /cvsroot/src/sys/kern/kern_exit.c,v retrieving revision 1.237 diff -u -p -u -r1.237 kern_exit.c --- kern/kern_exit.c19 Feb 2012 21:06:50 - 1.237 +++ kern/kern_exit.c14 Mar 2012 16:22:47 - @@ -598,44 +598,42 @@ exit_lwps(struct lwp *l) p = l-l_proc; KASSERT(mutex_owned(p-p_lock)); -retry: - /* -* Interrupt LWPs in interruptable sleep, unsuspend suspended -* LWPs and then wait for everyone else to finish. -*/ - LIST_FOREACH(l2, p-p_lwps, l_sibling) { - if (l2 == l) - continue; - lwp_lock(l2); - l2-l_flag |= LW_WEXIT; - if ((l2-l_stat == LSSLEEP (l2-l_flag LW_SINTR)) || - l2-l_stat == LSSUSPENDED || l2-l_stat == LSSTOP) { - /* setrunnable() will release the lock. */ - setrunnable(l2); - DPRINTF((exit_lwps: Made %d.%d runnable\n, - p-p_pid, l2-l_lid)); - continue; - } - lwp_unlock(l2); - } while (p-p_nlwps 1) { + /* +* Interrupt LWPs in interruptable sleep, unsuspend suspended +* LWPs and then wait for everyone else to finish. +*/ + LIST_FOREACH(l2, p-p_lwps, l_sibling) { + if (l2 == l) + continue; + lwp_lock(l2); + l2-l_flag |= LW_WEXIT; + if ((l2-l_stat == LSSLEEP (l2-l_flag LW_SINTR)) || + l2-l_stat == LSSUSPENDED || l2-l_stat == LSSTOP) { + /* setrunnable() will release the lock. */ + setrunnable(l2); + DPRINTF((exit_lwps: Made %d.%d runnable\n, + p-p_pid, l2-l_lid)); + continue; + } + lwp_unlock(l2); + } DPRINTF((exit_lwps: waiting for %d LWPs (%d zombies)\n, p-p_nlwps, p-p_nzlwps)); error = lwp_wait1(l, 0, waited, LWPWAIT_EXITCONTROL); if (p-p_nlwps == 1) break; - if (error == EDEADLK) { - /* -* LWPs can get suspended/slept behind us. -* (eg. sa_setwoken) -* kick them again and retry. -*/ - goto retry; - } - if (error) + if (error == 0) + DPRINTF((exit_lwps: Got LWP %d from lwp_wait1()\n, + waited)); + else if (error != EDEADLK) panic(exit_lwps: lwp_wait1 failed with error %d, error); - DPRINTF((exit_lwps: Got LWP %d from lwp_wait1()\n, waited)); + /* +* LWPs can get suspended/slept behind us. +* (eg. sa_setwoken) +* kick them again and retry. +*/ } KERNEL_LOCK(nlocks, l);
Re: CVS commit: src
On Mar 14, 2012, at 10:55 AM, Martin Husemann wrote: On Tue, Mar 13, 2012 at 06:41:18PM +, Elad Efrat wrote: Log Message: Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with something meaningful. All relevant documentation has been updated or written. Most of these changes were brought up in the following messages: http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html Thanks to christos, manu, njoly, and jmmv for input. Huge thanks to pgoyette for spinning these changes through some build cycles and ATF. This seems to cause deadlocks in the *fs_rename_dir tests. See recent (new) test failures both on i386 and amd64 in fs/vfs/t_vnops. To reprocuce, do: cd /usr/tests/fs/vfs atf-run t_vnops | atf-report At least this (sketched) diff looks strange, where is the vput(tdp) in the `dp-i_number == foundino' case, who did the review? file: sys/ufs/ufs/ufs_lookup.c,v ufs_lookup(void *v) ... - if (dp-i_number == foundino) { - vref(vdp); - *vpp = vdp; - error = 0; - goto out; - } if (flags ISDOTDOT) VOP_UNLOCK(vdp); /* race to get the inode */ error = VFS_VGET(vdp-v_mount, foundino, tdp); if (flags ISDOTDOT) vn_lock(vdp, LK_EXCLUSIVE | LK_RETRY); if (error) goto out; ... + if (dp-i_number == foundino) { + vref(vdp); + *vpp = vdp; + error = 0; goto out; } -- Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: CVS commit: src
On Mar 14, 2012, at 10:55 AM, Martin Husemann wrote: On Tue, Mar 13, 2012 at 06:41:18PM +, Elad Efrat wrote: Log Message: Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with something meaningful. All relevant documentation has been updated or written. Most of these changes were brought up in the following messages: http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html Thanks to christos, manu, njoly, and jmmv for input. Huge thanks to pgoyette for spinning these changes through some build cycles and ATF. This seems to cause deadlocks in the *fs_rename_dir tests. See recent (new) test failures both on i386 and amd64 in fs/vfs/t_vnops. Hmmm, they didn't fail when I tested Elad's changes. Either that, or I missed something in the test output. - | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com| | Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net | | Kernel Developer | | pgoyette at netbsd.org | -