Re: Working on ufs_rename patches for NetBSD-5

2012-03-14 Thread Brian Buhrow
hello David.  I'm still working on porting your ufs_rename patches to
NetBSD-5.x, and I've made even more progress.  Now, I can run without
filesystem corruption of any kind without logging, or with WAPBL logging 
enabled.
Softdep still isn't working, but I have a feeling I'm close to resolving
that issue.
However, I'm intermittently running into the same  issue you reported in
July.  Namely, When runing your dirconc test with WAPBL logging enabled, I
pretty reliably get:
panic: lockdebug_barrier: holding 1 shared locks (curlwp = 0xcbc6fcc0)

Note that I've patched the lockdebug code to show me the relevant curlwp at
the time of the panic.
In loking at my crash dump with gdb, I find I have a question.  
This panic comes from subr_lockdebug.c, line 664, or there abouts, under
the NetBSD-5.x sources.  It says, in part, 


if (l-l_shlocks != 0) {
panic(lockdebug_barrier: holding %d shared locks (curlwp = 
0x%x),
l-l_shlocks, (unsigned int)l);
}

This is after it's checked for an actual lock structure, and just
before it declares success.  The arguments to the function in this case are
all 0, meaning there's no actual lock to be checked, or, at least, I don't
think there is.  In fact, the call I'm losing things on is from:
sys/sys/userret.h, line 104.
That line reads:

LOCKDEBUG_BARRIER(NULL, 0);

I notice in earlier parts of subr_lockdebug.c, that l_shlocks gets set
at the same time as ld-ld_shares.
Does anyone know why, on this particular check, we do the check for  a
shared lock in the struct  lwp without matching it up with an actual lock?

In the backtrace below, you'll notice the call to lockdebug_barrier is
passed 0's for arguments across the board, meaning all the code above the
lines that panic are rendered moot, unless I'm not understanding something
here in a big way, which is highly probabl.  Also, everywhere else we
manipulate l_shlocks, we do it while holding splhigh.  We do this check,
ans subsequent panic, without holding splhigh.  Is it possible something's
changing under us while we're still checking things out?  I don't get the
panic every time, and sometimes it takes a while to  hit after I start the
tests, but when it panics, the traces always look the same.
Should this check really not fire if there isn't a matching ld structure to
go with the lwp in question?
Any light anyone could shed on these questions would be greatly
appreciated.
-thanks
-Brian


(gdb) target kvm netbsd.22.core
#0  0xc050f8e2 in cpu_reboot (howto=256, bootstr=0x0)
at /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c:924
924 /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c: No such file or 
directory.
in /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c
(gdb)kvm proc 0xcbc6fcc0
#0  0xc0448bdf in mi_switch (l=0xcbc6fcc0)
at /usr/local/netbsd/src/sys/kern/kern_synch.c:765
765 /usr/local/netbsd/src/sys/kern/kern_synch.c: No such file or directory.
in /usr/local/netbsd/src/sys/kern/kern_synch.c
(gdb) bt
#0  0xc0448bdf in mi_switch (l=0xcbc6fcc0)
at /usr/local/netbsd/src/sys/kern/kern_synch.c:765
#1  0xc044594b in sleepq_block (timo=0, catch=false)
at /usr/local/netbsd/src/sys/kern/kern_sleepq.c:269
#2  0xc0424f5c in cv_wait (cv=0xc122fcb0, mtx=0xca4fca10)
at /usr/local/netbsd/src/sys/kern/kern_condvar.c:201
#3  0xc0492f52 in biowait (bp=0xc122fc18)
at /usr/local/netbsd/src/sys/kern/vfs_bio.c:1515
#4  0xc04a665c in wapbl_doio (data=0xc135be00, len=512, devvp=0xca4fca10, 
pbn=1301769, flags=0) at /usr/local/netbsd/src/sys/kern/vfs_wapbl.c:745
---Type return to continue, or q return to quit---
#5  0xc04a773f in wapbl_circ_write (wl=0xc1294700, data=0xc135be00, len=512, 
offp=0xcbc838c8) at /usr/local/netbsd/src/sys/kern/vfs_wapbl.c:800
#6  0xc04a7e59 in wapbl_flush (wl=0xc1294700, waitfor=0)
at /usr/local/netbsd/src/sys/kern/vfs_wapbl.c:2000
#7  0xc03aea08 in ffs_sync (mp=0xcb38c644, waitfor=2, cred=0xcb926900)
at /usr/local/netbsd/src/sys/ufs/ffs/ffs_vfsops.c:1823
#8  0xc049b44c in VFS_SYNC (mp=0xcb38c644, a=2, b=0xcb926900)
at /usr/local/netbsd/src/sys/kern/vfs_subr.c:3064
#9  0xc04a2d9c in sys_sync (l=0xcbc6fcc0, v=0x0, retval=0x0)
at /usr/local/netbsd/src/sys/kern/vfs_syscalls.c:825
---Type return to continue, or q return to quit---
#10 0xc049c12e in vfs_shutdown ()
at /usr/local/netbsd/src/sys/kern/vfs_subr.c:2383
#11 0xc050f95b in cpu_reboot (howto=256, bootstr=0x0)
at /usr/local/netbsd/src/sys/arch/i386/i386/machdep.c:910
#12 0xc015f1e9 in db_sync_cmd (addr=-876070228, have_addr=false, count=-1, 
modif=0xcbc83a18 \003Ú­Àc)
at /usr/local/netbsd/src/sys/ddb/db_command.c:1304
#13 0xc015f9a8 in db_command (last_cmdp=0xc0a4917c)
at /usr/local/netbsd/src/sys/ddb/db_command.c:926
#14 0xc015fc22 in db_command_loop ()
---Type return to continue, or q return to quit---
at /usr/local/netbsd/src/sys/ddb/db_command.c:583
#15 

Re: CVS commit: src

2012-03-14 Thread Martin Husemann
On Tue, Mar 13, 2012 at 06:41:18PM +, Elad Efrat wrote:
 Log Message:
 Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with
 something meaningful. All relevant documentation has been updated or
 written.
 
 Most of these changes were brought up in the following messages:
 
 http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html
 http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html
 http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html
 
 Thanks to christos, manu, njoly, and jmmv for input.
 
 Huge thanks to pgoyette for spinning these changes through some build
 cycles and ATF.

This seems to cause deadlocks in the *fs_rename_dir tests.
See recent (new) test failures both on i386 and amd64 in fs/vfs/t_vnops.
To reprocuce, do:

   cd /usr/tests/fs/vfs
   atf-run t_vnops | atf-report


Martin


Re: CVS commit: src

2012-03-14 Thread Jukka Ruohonen
On Wed, Mar 14, 2012 at 09:55:21AM +, Martin Husemann wrote:
 This seems to cause deadlocks in the *fs_rename_dir tests.

Also the page residency check written by thorpej years ago now fails for the
first time.

- Jukka.


patch for kern/46168

2012-03-14 Thread Manuel Bouyer
Hello,
has someone comments about the patch I propose in kern/46168 before I commit
it ? It's also attached below.
Basically, it looks like a LWP may be added to a proc while lwp_wait1()
has released the p_lock, even if lwp_wait1() returns 0. As a result,
we're waiting for a LWP which doesn't have the LW_WEXIT flag.
What the patch does is always reprocess the lwp list if lwp_wait1() returns
and p_nlwps is still  1.
This doesn't cause new test failures (I did 2 runs) and I coudln't
reproduce the problem in kern/46168 with this patch.

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 ans d'experience feront toujours la difference
--
Index: kern/kern_exit.c
===
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.237
diff -u -p -u -r1.237 kern_exit.c
--- kern/kern_exit.c19 Feb 2012 21:06:50 -  1.237
+++ kern/kern_exit.c14 Mar 2012 16:22:47 -
@@ -598,44 +598,42 @@ exit_lwps(struct lwp *l)
 
p = l-l_proc;
KASSERT(mutex_owned(p-p_lock));
-retry:
-   /*
-* Interrupt LWPs in interruptable sleep, unsuspend suspended
-* LWPs and then wait for everyone else to finish.
-*/
-   LIST_FOREACH(l2, p-p_lwps, l_sibling) {
-   if (l2 == l)
-   continue;
-   lwp_lock(l2);
-   l2-l_flag |= LW_WEXIT;
-   if ((l2-l_stat == LSSLEEP  (l2-l_flag  LW_SINTR)) ||
-   l2-l_stat == LSSUSPENDED || l2-l_stat == LSSTOP) {
-   /* setrunnable() will release the lock. */
-   setrunnable(l2);
-   DPRINTF((exit_lwps: Made %d.%d runnable\n,
-   p-p_pid, l2-l_lid));
-   continue;
-   }
-   lwp_unlock(l2);
-   }
while (p-p_nlwps  1) {
+   /*
+* Interrupt LWPs in interruptable sleep, unsuspend suspended
+* LWPs and then wait for everyone else to finish.
+*/
+   LIST_FOREACH(l2, p-p_lwps, l_sibling) {
+   if (l2 == l)
+   continue;
+   lwp_lock(l2);
+   l2-l_flag |= LW_WEXIT;
+   if ((l2-l_stat == LSSLEEP  (l2-l_flag  LW_SINTR)) 
||
+   l2-l_stat == LSSUSPENDED || l2-l_stat == LSSTOP) {
+   /* setrunnable() will release the lock. */
+   setrunnable(l2);
+   DPRINTF((exit_lwps: Made %d.%d runnable\n,
+   p-p_pid, l2-l_lid));
+   continue;
+   }
+   lwp_unlock(l2);
+   }
DPRINTF((exit_lwps: waiting for %d LWPs (%d zombies)\n,
p-p_nlwps, p-p_nzlwps));
error = lwp_wait1(l, 0, waited, LWPWAIT_EXITCONTROL);
if (p-p_nlwps == 1)
break;
-   if (error == EDEADLK) {
-   /*
-* LWPs can get suspended/slept behind us.
-* (eg. sa_setwoken)
-* kick them again and retry.
-*/
-   goto retry;
-   }
-   if (error)
+   if (error == 0)
+   DPRINTF((exit_lwps: Got LWP %d from lwp_wait1()\n,
+   waited));
+   else if (error != EDEADLK)
panic(exit_lwps: lwp_wait1 failed with error %d,
error);
-   DPRINTF((exit_lwps: Got LWP %d from lwp_wait1()\n, waited));
+   /*
+* LWPs can get suspended/slept behind us.
+* (eg. sa_setwoken)
+* kick them again and retry.
+*/
}
 
KERNEL_LOCK(nlocks, l);


Re: CVS commit: src

2012-03-14 Thread J. Hannken-Illjes
On Mar 14, 2012, at 10:55 AM, Martin Husemann wrote:

 On Tue, Mar 13, 2012 at 06:41:18PM +, Elad Efrat wrote:
 Log Message:
 Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with
 something meaningful. All relevant documentation has been updated or
 written.
 
 Most of these changes were brought up in the following messages:
 
http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html
http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html
http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html
 
 Thanks to christos, manu, njoly, and jmmv for input.
 
 Huge thanks to pgoyette for spinning these changes through some build
 cycles and ATF.
 
 This seems to cause deadlocks in the *fs_rename_dir tests.
 See recent (new) test failures both on i386 and amd64 in fs/vfs/t_vnops.
 To reprocuce, do:
 
   cd /usr/tests/fs/vfs
   atf-run t_vnops | atf-report

At least this (sketched) diff looks strange, where is the vput(tdp) in
the `dp-i_number == foundino' case, who did the review?

file: sys/ufs/ufs/ufs_lookup.c,v

ufs_lookup(void *v)
...

-   if (dp-i_number == foundino) {
-   vref(vdp);
-   *vpp = vdp;
-   error = 0;
-   goto out;
-   }
if (flags  ISDOTDOT)
VOP_UNLOCK(vdp); /* race to get the inode */
error = VFS_VGET(vdp-v_mount, foundino, tdp);
if (flags  ISDOTDOT)
vn_lock(vdp, LK_EXCLUSIVE | LK_RETRY);
if (error)
goto out;
...

+   if (dp-i_number == foundino) {
+   vref(vdp);
+   *vpp = vdp;
+   error = 0;
goto out;
}

--
Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: CVS commit: src

2012-03-14 Thread Paul Goyette

On Mar 14, 2012, at 10:55 AM, Martin Husemann wrote:


On Tue, Mar 13, 2012 at 06:41:18PM +, Elad Efrat wrote:

Log Message:
Replace the remaining KAUTH_GENERIC_ISSUSER authorization calls with
something meaningful. All relevant documentation has been updated or
written.

Most of these changes were brought up in the following messages:

   http://mail-index.netbsd.org/tech-kern/2012/01/18/msg012490.html
   http://mail-index.netbsd.org/tech-kern/2012/01/19/msg012502.html
   http://mail-index.netbsd.org/tech-kern/2012/02/17/msg012728.html

Thanks to christos, manu, njoly, and jmmv for input.

Huge thanks to pgoyette for spinning these changes through some build
cycles and ATF.


This seems to cause deadlocks in the *fs_rename_dir tests.
See recent (new) test failures both on i386 and amd64 in fs/vfs/t_vnops.


Hmmm, they didn't fail when I tested Elad's changes.  Either that, or I 
missed something in the test output.



-
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:   |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com|
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |  | pgoyette at netbsd.org  |
-