Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Tue, Aug 29, 2006 at 05:05:26PM +, Michael Abbott wrote:
[I wrote]
> >>> An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
> >>> and then try again.
>
> So. I have done this. And I can't reproduce the problem.
>
> # uname -a
> FreeBSD venus.araneidae.co.uk 6.1-STABLE FreeBSD 6.1-STABLE #1: Mon Aug 28
> 18:32:17 UTC 2006 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386
>
> Hmm. Hopefully this is a *good* thing, ie, the problem really has been
> fixed, rather than just going into hiding.
>
> So, as far as I can tell, lockf works properly in this release.

Just as an interesting side note, I just experienced rpc.lockd crashing. The server is not running RELENG_6, but RELENG_5 (FreeBSD 5.5-STABLE #15: Thu Aug 24 18:47:20 CEST 2006).

Due to user error, someone ended up with over 1000 processes trying to lock the same NFS-mounted file at the same time. The result was over 1000 "Cannot allocate memory" errors, followed by rpc.lockd crashing.

I guess the server is telling me it wants an update...

-- 
greg byshenk - [EMAIL PROTECTED] - Leiden, NL

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: NFS locking: lockf freezes (rpc.lockd problem?)
> An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
> and then try again.

So. I have done this. And I can't reproduce the problem.

# uname -a
FreeBSD venus.araneidae.co.uk 6.1-STABLE FreeBSD 6.1-STABLE #1: Mon Aug 28 18:32:17 UTC 2006 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC i386

Hmm. Hopefully this is a *good* thing, ie, the problem really has been fixed, rather than just going into hiding.

So, as far as I can tell, lockf works properly in this release. Sorry to have generated so much traffic on this!
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Tue, 29 Aug 2006, Alexey Karagodov wrote:
> It's all very good, but what can you say about fixing the problem with
> rpc.lockd?

Well, I will repeat the test with RELENG_6 (as of yesterday lunchtime), probably tonight, and report back. Unfortunately, building takes around 8 hours on my test machine!
Re: NFS locking: lockf freezes (rpc.lockd problem?)
Alexey Karagodov wrote:
> It's all very good, but what can you say about fixing the problem with
> rpc.lockd?

It has been mentioned several times on this mailing list: rpc.lockd is in need of a complete rewrite. Someone will have to write a new rpc.lockd implementation. As far as I know, there is currently nobody working on it.

Best regards
Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"And believe me, as a C++ programmer, I don't hesitate to question
the decisions of language designers. After a decent amount of C++
exposure, Python's flaws seem ridiculously small." -- Ville Vainio
Re: NFS locking: lockf freezes (rpc.lockd problem?)
It's all very good, but what can you say about fixing the problem with rpc.lockd?

2006/8/29, Peter Jeremy <[EMAIL PROTECTED]>:
> On Mon, 2006-Aug-28 13:23:30 +, Michael Abbott wrote:
> > I think there is a case to be made for special-casing SIGKILL, but in a
> > sense it's not so much the fate of the process receiving the SIGKILL
> > that counts: after all, having sent -9 I know that it will never
> > process again.
>
> Currently, if you send SIGKILL, the process will never enter userland
> again. Going further, so that a process sent SIGKILL always terminates
> immediately, is significantly more difficult.
>
> In the normal case, a process is sleeping on some condition with PCATCH
> specified. If the process receives a signal, sleep(9) will return
> ERESTART or EINTR, and the code then has to arrange to return to
> userland (which will cause the signal to be handled as per sigaction(2)
> and the process's signal handlers). In some cases it may be
> inconvenient to unwind back to userland from a particular point, so
> PCATCH isn't specified on the sleep.
>
> -- 
> Peter Jeremy
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Mon, 2006-Aug-28 13:23:30 +, Michael Abbott wrote:
> I think there is a case to be made for special-casing SIGKILL, but in a
> sense it's not so much the fate of the process receiving the SIGKILL
> that counts: after all, having sent -9 I know that it will never process
> again.

Currently, if you send SIGKILL, the process will never enter userland again. Going further, so that a process sent SIGKILL always terminates immediately, is significantly more difficult.

In the normal case, a process is sleeping on some condition with PCATCH specified. If the process receives a signal, sleep(9) will return ERESTART or EINTR, and the code then has to arrange to return to userland (which will cause the signal to be handled as per sigaction(2) and the process's signal handlers). In some cases it may be inconvenient to unwind back to userland from a particular point, so PCATCH isn't specified on the sleep.

-- 
Peter Jeremy
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Mon, 28 Aug 2006, Oliver Fromme wrote:
> SIGKILL _does_ always work. However, signal processing can be delayed
> for various reasons. [...] Well, in theory, a special case could be
> made for SIGKILL, but it's quite difficult if you don't want to break
> existing semantics (or create holes).

Thank you, that was both instructive and interesting.

> if a process is stopped (SIGSTOP), further signals will only take
> effect when it continues (SIGCONT).

Um. Doesn't this mean that SIGCONT is already a special case?

I think there is a case to be made for special-casing SIGKILL, but in a sense it's not so much the fate of the process receiving the SIGKILL that counts: after all, having sent -9 I know that it will never process again. More to the point, all processes which are waiting for the killed process should be released. I think maybe I'd like to change the process into Z ('zombie') state while it's still blocked in I/O! Sounds like a new state to me, actually: K, "killed in disk wait".

Of course, ideally, all other resources held by the new zombie should also be released ... including the return context for the blocked I/O call! Tricky, but the process is never going to use its resources again. Of course, any resources held in the blocked I/O call itself are another matter...

Ah well. I guess it's a bit of an academic point.
Re: NFS locking: lockf freezes (rpc.lockd problem?)
Michael Abbott wrote:
> What about the non-interruptible sleep? Is this regarded as par for the
> course with NFS, or as a problem?
>
> I know that "hard" NFS mounts are treated as completely unkillable,
> though why `kill -9` isn't made to work escapes me, but a locking
> operation which (presumably) suffers a protocol error? Or is rpc.lockd
> simply waiting to hear back from the (presumably broken) NFS server?
> Even so: `kill -9` ought to work!

SIGKILL _does_ always work. However, signal processing can be delayed for various reasons. For example, if a process is stopped (SIGSTOP), further signals will only take effect when it continues (SIGCONT).

Signal processing does not occur if a process is currently not scheduled, which is the case if the process is blocked on I/O (indicated by "D" in the STAT column of ps(1), also called the "disk-wait" state). That can happen if the hardware is broken (disk, controller, cable), so an I/O request doesn't return. It can also happen if there are NFS hiccups, as seems to be the case here.

As soon as the "D" state ends, the process becomes runnable again (i.e. it's put on the scheduler's "run queue"), which means that it'll get a CPU share, and the SIGKILL signal that you sent it before will finally be processed.

Some background information: each process has a bit mask which stores the set of received signals. kill(2) (and therefore also kill(1)) only sets a bit in that mask. The next time the process is scheduled onto a CPU, the mask of received signals is processed and acted upon. That's not FreeBSD-specific; it works like that on almost all UNIX systems.

Why does it work that way? Well, if signals were processed for processes not on the CPU, then there would be a "hole": a process would be able to circumvent the scheduler, because signal processing happens on behalf of the process, which means that it runs with the credentials, resource limits, nice value etc. of that process.

Well, in theory, a special case could be made for SIGKILL, but it's quite difficult if you don't want to break existing semantics (or create holes).

Best regards
Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"UNIX was not designed to stop you from doing stupid things, because
that would also stop you from doing clever things." -- Doug Gwyn
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, 27 Aug 2006, Kostik Belousov wrote:
> On server,
> tcpdump -p -s 1500 -w file -i <iface> host <client>

Ok. I've run

saturn# tcpdump -p -s 1500 -w tcpdump.out -i xl0 host 10.0.0.105

and run the failing test on venus (with `rpc.lockd -d1`). The failing lockf has moved -- it took longer to fail this time -- but it does fail. As before, one of the lockd processes has vanished.

venus# ps axlww | grep rpc\\.
  0 18303     1  0  96  0 263460  916 select  Ss  ??  0:00.00 /usr/sbin/rpc.statd -d
  0 18308     1  0  96  0   1416 1024 select  Is  ??  0:00.01 /usr/sbin/rpc.lockd -d1
  1 18309 18308  0   4  0   1420 1036 nfsloc  I   ??  0:00.00 /usr/sbin/rpc.lockd -d1

venus# ps axlww | grep rpc\\.
  0 18303     1  0  96  0 263460  884 select  Ss  ??  0:00.00 /usr/sbin/rpc.statd -d
  1 18309     1  0   4  0   1440 1008 nfsloc  S   ??  0:00.00 /usr/sbin/rpc.lockd -d1

> Yes, this is very interesting. Does something appear in the logs?
> Also, you should use the -d option of rpc.lockd (and show the output
> together with the tcpdump output).

Well. See my previous message this morning for -d output. As for tcpdump, I have an interesting (and rather obvious) problem:

saturn# stat -f%z /tmp/tcpdump.out
161794058

Hmm. Perhaps you don't want that. I'll hang onto it for a bit: let me know what you want to do with it!
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Mon, Aug 28, 2006 at 09:48:48AM +, Michael Abbott wrote:
> > An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
> > and then try again.
>
> This machine is so painfully slow that I'll probably have to do that
> overnight, and then I'm out of time until next weekend. Just installed
> cvsup (from ports; oops, my mistake, forgot about having to build ezm3!)

BTW, at least one fix after 6.1 could fix exactly the issue with the client rpc.lockd disappearing that was mentioned before. Anyway, I think that tcpdump output will be crucial in debugging the issue.
Re: NFS locking: lockf freezes (rpc.lockd problem?)
Well, the result I have to report is so interesting, I'll give the executive summary right away:

    If rpc.lockd is run with -d255 it works perfectly.
    If rpc.lockd is run with -d1 (or no -d option) it locks up.

Sweet. Does anyone out there who understands rpc.lockd fancy a deeper look?

On Sun, 27 Aug 2006, Greg Byshenk wrote:
> The problem here is that the process is waiting for something, and thus
> not listening to signals (including your 'kill'). I'm not an expert on
> this, but my first guess would be that saturn (your server) is offering
> something that it can't deliver. That is, the client asks the server
> "can you do X?", and the server says "yes I can", so the client says
> "do X" and waits -- and the server never does it.

I buy that analysis. However, I think that the client's (venus) behaviour is unreasonable, and there must be some problem at the driver level: unkillable processes? (Tries to bang drum about this!) Interesting: it looks like `nfslocking stop` releases the processes.

> Or alternatively (based on your rpc.statd dying), rpc.lockd on your
> client is trying to use rpc.statd to communicate with your server. And
> it starts successfully, but then rpc.statd dies (for some reason) and
> your lock ends up waiting forever for it to answer.

Not quite: it was the first instantiation of rpc.lockd that went away, almost as if it was just waiting for something to happen! However, it doesn't do this on the simple one-line test, so yes, I think there's something here to investigate. Definitely: after running the test with no failures (see below), both lockd instances are still up and about.

> I would recommend starting both rpc.lockd and rpc.statd with the '-d'
> flag, to see if this provides any information as to what is going on.
> There may well be a bug somewhere, but you need to find where it is. I
> suspect that it is not actually in rpc.statd, as nothing in the source
> has changed since January 2005.

Ok, I'll try that. I'll try -d1, see what I get.

venus# /etc/rc.d/nfslocking stop
venus# /etc/rc.d/nfslocking start

Oddly, running restart only restarts statd, not lockd. Strange naming convention for the commands, too -- have to put the following in rc.conf:

rpc_lockd_enable=YES
lockd_flags=-d1
rpc_statd_enable=YES
statd_flags=-d

Hmm. Not terribly consistent. Ok, let's see what a single lockf exchange looks like:

venus# mount saturn:$EXPORT /mnt
venus# lockf /mnt/test ls /mnt
test
venus# tail -n3 /var/log/debug.log
Aug 28 08:52:44 venus rpc.statd: unmon_all for host: NFS NLM prog: 0 ver: 0 proc: 0
Aug 28 08:54:19 venus rpc.lockd: nlm_lock_res from saturn.araneidae.co.uk
Aug 28 08:54:19 venus rpc.lockd: nlm_unlock_res from saturn.araneidae.co.uk

Good. Now let's run the test:

venus# cd /usr/src; make installworld DESTDIR=/mnt

and, at the same time, in another shell:

venus# tail -f /var/log/debug.log

Well, that's odd. I get five nlm_lock_res/nlm_unlock_res pairs, with the last three in less than a second... and then nothing: the last, blocking, lockf doesn't generate any messages at all!

Interesting: stopping the lock daemon, by running `/etc/rc.d/nfslocking stop`, releases the lock! Good. Now I can run the test again with more logging (I'll set lockd_flags=-d255, though a quick grep of the source suggests that 6 would suffice).

Hmmm. It's working perfectly now! Well well well. What are the differences?

1. I'm rerunning the test after restarting the lock and stat daemons without an intervening reboot.
2. I'm running lockd with maximum debugging.
3. I'm running the test in a different virtual console (I think we can ignore that difference!)

Fantastic: it ran to completion without a fault! Wow. I'll reboot (keep the same debug level) and try again...

Astounding! Watch carefully:

1. Reboot.
2. Log in on virtual consoles 1 & 2.
3. On console 2 run
   venus# tail -f /var/log/debug.log
4. On console 1 run
   venus# mount saturn:$EXPORT /mnt
   venus# rm -rf /mnt/*
   venus# cd /usr/src; make installworld DESTDIR=/mnt
5. Switch to console 2 and watch the console and the PC's activity lights. Lots and lots of network activity (well, there's a surprise), and the occasional flurry of rpc.lockd messages.
6. When the lights stop flashing (five minutes or so), switch back to console 1 and see that everything ran perfectly.

Runs ok. Well. I can think of two possibilities:

1. High levels of logging do more than just logging: there's an inadvertent side effect.
2. There's a tight timing issue that is changed by the extra delays introduced by logging.

On the whole I buy 2, and it's not altogether clear whether the issue is on the client or the server. Hmm. Maybe I need to try Kostik Belousov's suggestion of running tcpdump. Another message for that...

> An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
> and then try again.

This machine is so painfully slow that I'll probably have to do that overnight, and then I'm out of time until next weekend. Just installed cvsup (from ports; oops, my mistake, forgot about having to build ezm3!)
Re: NFS locking: lockf freezes (rpc.lockd problem?)
Hello!

On Mon, 28 Aug 2006, Peter Jeremy wrote:
> On Sun, 2006-Aug-27 22:55:55 +0300, Kostik Belousov wrote:
> > On server,
> > tcpdump -p -s 1500 -w file -i <iface> host <client>
>
> Recent tcpdumps appear to want the ethernet frame size rather than the
> MTU: specifying 1500 appears to truncate full-size frames. Try
> '-s 1516' instead.

'tcpdump -s 0' always works OK for me.

Sincerely, Dmitry
-- 
Atlantis ISP, System Administrator
e-mail: [EMAIL PROTECTED]
nic-hdl: LYNX-RIPE
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, 2006-Aug-27 22:55:55 +0300, Kostik Belousov wrote:
> On server,
> tcpdump -p -s 1500 -w file -i <iface> host <client>

Recent tcpdumps appear to want the ethernet frame size rather than the MTU: specifying 1500 appears to truncate full-size frames. Try '-s 1516' instead.

-- 
Peter Jeremy
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, Aug 27, 2006 at 07:17:34PM +, Michael Abbott wrote:
> On Sun, 27 Aug 2006, Kostik Belousov wrote:
> > Make sure that rpc.statd is running.
>
> Yep. Took me some while to figure that one out, but the first lockf test
> failed without that.
[...]
> As for the other test, let's have a look. Here we are before the test
> (NFS server, 4.11, is saturn, test machine, 6.1, is venus):
>
> saturn$ ps auxww | grep rpc\\.
> root  48917  0.0  0.1    980  640  ??  Is  7:56am   0:00.01 rpc.lockd
> root    115  0.0  0.1 263096  536  ??  Is  18Aug06  0:00.00 rpc.statd
[...]
> Well, how odd: as soon as I start the test, process 515 on venus goes
> away. Now to wait for it to fail... (doesn't take too long):
[...]
> In conclusion: I agree with Greg Byshenk that the NFS server is bound to
> be the one at fault, BUT, is this "freeze until reboot" behaviour really
> what we want? I remain astonished (and irritated) that `kill -9`
> doesn't work!

The problem here is that the process is waiting for something, and thus not listening to signals (including your 'kill'). I'm not an expert on this, but my first guess would be that saturn (your server) is offering something that it can't deliver. That is, the client asks the server "can you do X?", and the server says "yes I can", so the client says "do X" and waits -- and the server never does it.

Or alternatively (based on your rpc.statd dying), rpc.lockd on your client is trying to use rpc.statd to communicate with your server. And it starts successfully, but then rpc.statd dies (for some reason) and your lock ends up waiting forever for it to answer.

I would recommend starting both rpc.lockd and rpc.statd with the '-d' flag, to see if this provides any information as to what is going on. There may well be a bug somewhere, but you need to find where it is. I suspect that it is not actually in rpc.statd, as nothing in the source has changed since January 2005.

An alternative would be to update to RELENG_6 (or at least RELENG_6_1) and then try again.

-- 
greg byshenk - [EMAIL PROTECTED] - Leiden, NL
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, Aug 27, 2006 at 07:17:34PM +, Michael Abbott wrote:
> On Sun, 27 Aug 2006, Kostik Belousov wrote:
> > For debugging purposes, tcpdump of the corresponding communications
> > would be quite useful. Besides this, output of ps auxww | grep 'rpc\.'
> > may be interesting.
>
> Um. How interesting would tcpdump be? I'm prepared to do the work, but
> as I've never used the tool, it may take me some effort and time to
> figure out the right commands. Yes: `man tcpdump | wc -l` == 1543.
> Fancy giving me a sample command to try?

On server,

tcpdump -p -s 1500 -w file -i <iface> host <client>

This is assuming you use ethernet with the usual MTU; <iface> is the interface where communication with the client comes from.

> saturn$ ps auxww | grep rpc\\.

My fault, better to use ps axlww.

> Well, how odd: as soon as I start the test, process 515 on venus goes
> away. Now to wait for it to fail... (doesn't take too long):

Yes, this is very interesting. Does something appear in the logs? Also, you should use the -d option of rpc.lockd (and show the output together with the tcpdump output).
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, 27 Aug 2006, Kostik Belousov wrote:
> Make sure that rpc.statd is running.

Yep. Took me some while to figure that one out, but the first lockf test failed without that.

> For debugging purposes, tcpdump of the corresponding communications
> would be quite useful. Besides this, output of ps auxww | grep 'rpc\.'
> may be interesting.

Um. How interesting would tcpdump be? I'm prepared to do the work, but as I've never used the tool, it may take me some effort and time to figure out the right commands. Yes: `man tcpdump | wc -l` == 1543. Fancy giving me a sample command to try?

As for the other test, let's have a look. Here we are before the test (NFS server, 4.11, is saturn, test machine, 6.1, is venus):

saturn$ ps auxww | grep rpc\\.
root  48917  0.0  0.1    980  640  ??  Is  7:56am   0:00.01 rpc.lockd
root    115  0.0  0.1 263096  536  ??  Is  18Aug06  0:00.00 rpc.statd

venus# ps auxww | grep rpc\\.
root    510  0.0  0.9 263460 1008  ??  Ss  6:05PM  0:00.01 /usr/sbin/rpc.statd
root    515  0.0  1.0   1416 1120  ??  Is  6:05PM  0:00.02 /usr/sbin/rpc.lockd
daemon  520  0.0  1.0   1420 1124  ??  I   6:05PM  0:00.00 /usr/sbin/rpc.lockd

That's interesting. Don't know how significant the differences are... Ok, let's run the test:

venus# cd /usr/src; make installworld DESTDIR=/mnt

Well, how odd: as soon as I start the test, process 515 on venus goes away. Now to wait for it to fail... (doesn't take too long):

saturn$ ps auxww | grep rpc\\.
root  48917  0.0  0.1    980  640  ??  Is  7:56am   0:00.01 rpc.lockd
root    115  0.0  0.1 263096  536  ??  Is  18Aug06  0:00.00 rpc.statd

venus# ps auxww | grep rpc\\.
root    510  0.0  0.9 263460  992  ??  Ss  6:05PM  0:00.01 /usr/sbin/rpc.statd
daemon  520  0.0  1.0   1440 1152  ??  S   6:05PM  0:00.01 /usr/sbin/rpc.lockd

venus# ps auxww | grep lockf
...
root   7034  0.0  0.5   1172  528  v0  D+  6:51PM  0:00.01 lockf -k /mnt/usr/...

(I've truncated the lockf call: the detail of the install call it's making is hardly relevant!) Note that now any call to lockf on this server will fail... Hmm. What about a different mount point? Bet I can't unmount ...

venus# umount /mnt
umount: unmount of /mnt failed: Device busy
venus# umount -f /mnt
venus# mount saturn:/tmp /mnt
venus# lockf /mnt/test ls
(Hangs)

Now this is interesting: the file saturn:/tmp/test exists! And it appears to be owned by uid=4294967294 (-2?)! How very odd. If I reboot venus and try just a single lockf:

venus# lockf /mnt/test stat -f%u /mnt/test
0

As one might expect, indeed. A hint as to who's got stuck (saturn, I'm sure), but beside the point, I guess. Note also that the `umount -f /mnt` *didn't* release the lockf, and also note that /tmp/test is still there (on saturn) after a reboot of venus.

In conclusion: I agree with Greg Byshenk that the NFS server is bound to be the one at fault, BUT, is this "freeze until reboot" behaviour really what we want? I remain astonished (and irritated) that `kill -9` doesn't work!
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, 27 Aug 2006, Greg Byshenk wrote:
> On Sun, Aug 27, 2006 at 11:24:13AM +, Michael Abbott wrote:
> > I've been trying to make some sense of the "NFS locking" issue. I am
> > trying to run
> > # make installworld DESTDIR=/mnt
> > where /mnt is an NFS mount on a FreeBSD 4.11 server, but I am unable
> > to get past a call to `lockf`.
>
> I have just performed a test of what you describe, using 'smbtest'
> (6.1-STABLE #17: Fri Aug 25 12:25:19 CEST 2006) as the client and
> 'data-2' (FreeBSD 6.1-STABLE #16: Wed Aug 9 15:38:12 CEST 2006) as the
> server. ... Which is to say that it completed successfully. Which
> suggests that there is not a serious and ongoing problem.

Hm. That's a useful data point: thanks for making the test!

What about the non-interruptible sleep? Is this regarded as par for the course with NFS, or as a problem?

I know that "hard" NFS mounts are treated as completely unkillable, though why `kill -9` isn't made to work escapes me, but a locking operation which (presumably) suffers a protocol error? Or is rpc.lockd simply waiting to hear back from the (presumably broken) NFS server? Even so: `kill -9` ought to work!
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, Aug 27, 2006 at 11:24:13AM +, Michael Abbott wrote:
> I've been trying to make some sense of the "NFS locking" issue. I am
> trying to run
>
> # make installworld DESTDIR=/mnt
>
> where /mnt is an NFS mount on a FreeBSD 4.11 server, but I am unable to
> get past a call to `lockf`.
>
> On this mailing list I've seen a thread starting with this message:
> http://lists.freebsd.org/pipermail/freebsd-stable/2006-August/027561.html
> and elsewhere I've seen this thread:
> http://www.gatago.com/mpc/lists/freebsd/stable/21851805.html
>
> The gist seems to be that rpc.lockd is badly behaved and broken and
> nobody knows how to fix it. So, in case my experience is any help, here
> is what I can report.
>
> 1. I have installed a fresh installation of FreeBSD 6.1 from the CD,
> 6.1-RELEASE-i386-disc1.iso, and have run `cd /usr/src; make buildworld;
> make buildkernel` successfully (takes nearly 8 hours, but then it is a
> fanless machine). The full distribution (as installed by sysinstall) is
> present, but nothing else.
>
> 2. Intending to experiment with network booting, I've attempted
> `make installworld DESTDIR=/mnt`, where /mnt is an NFS mount point on my
> master server, running FreeBSD 4.11-RELEASE-p11.
>
> 3. This fails when invoking lockf. To work around this, I have started
> rpc.lockd on the 4.11 server and configured all of the following lines
> in rc.conf:
>
> rpcbind_enable="YES"
> nfs_client_enable="YES"
> rpc_lockd_enable="YES"
> rpc_statd_enable="YES"
>
> 4. Now here is the behaviour:
>
> # mount $MY_SERVER:$MY_PATH /mnt
> # lockf /mnt/test ls
>
> This works just fine.
>
> # cd /usr/src; make installworld DESTDIR=/mnt
>
> This hangs in lockf, and is unkillable (even `kill -9` is no good, and
> ps shows state = D+). So let's start another shell (Alt-F2):
>
> # lockf /mnt/test ls
>
> Also hangs.
>
> Rebooting the test machine clears the problem, returning to the state at
> the start of point (4), and the problem is completely repeatable in my
> configuration.
>
> Some observations:
>
> - Hanging in "uninterruptible sleep" is not good. No doubt it's quite
> possible that my 4.11 server has a broken rpc.lockd (or maybe I've not
> configured it right: I just started rpc.lockd, rather than restarting
> the server), but the behaviour of 6.1 is exceptionally unfriendly. In
> particular, unkillable processes look like outright bugs to me.
>
> - The conversation on mpc.lists.freebsd.stable (and elsewhere) looks
> alarming. I get the impression that this part of FreeBSD 6.1 is really
> rather broken and that there's no real sense of what to do about it.

Make sure that rpc.statd is running.

For debugging purposes, tcpdump of the corresponding communications would be quite useful. Besides this, output of ps auxww | grep 'rpc\.' may be interesting.
Re: NFS locking: lockf freezes (rpc.lockd problem?)
On Sun, Aug 27, 2006 at 11:24:13AM +, Michael Abbott wrote: > I've been trying to make some sense of the "NFS locking" issue. I am > trying to run > # make installworld DESTDIR=/mnt > where /mnt is an NFS mount on a FreeBSD 4.11 server, but I am unable to > get past a call to `lockf`. I have not closely followed the discussion, as I have not experienced the problem. I am currently running FreeBSD6 based fileservers in an environment that includes FreeBSD, Linux (multiple flavors), Solaris, and Irix clients, and have experienced no nfs locking issues (I have one occasional problem with 64-bit Linux clients, but it is not locking related and appears to be due to a 64-bit Linux problem). Further, (though there may well be problems with nfs locking) I cannot recreate the problem you described -- at least in a FreeBSD6 environment. I have just performed a test of what you describe, using 'smbtest' (6.1-STABLE #17: Fri Aug 25 12:25:19 CEST 2006) as the client and 'data-2' (FreeBSD 6.1-STABLE #16: Wed Aug 9 15:38:12 CEST 2006) as the server. data-2 # mkdir /export/rw/bsd6root/ ## /export/rw is already exported via NFS smbtest # mount data-2:/export/rw/bsd6root /mnt smbtest # cd /usr/src smbtest # make installworld DESTDIR=/mnt [...] makewhatis /mnt/usr/share/man makewhatis /mnt/usr/share/openssl/man rm -rf /tmp/install.2INObZ3j smbtest # Which is to say that it completed successfully. Which suggests that there is not a serious and ongoing problem. There may well be a problem with FreeBSD4, but I no longer have any NFS servers running FreeBSD4.x, so I cannot confirm. Alternatively, there may have been a problem in 6.1-RELEASE that has since been solved in 6.1-STABLE that I am using. Or there could be a problem with the configuration of your server. Or there could be something else going on (in the network...?). 
But to see what exactly is happening in your case, you would probably want to look at what exactly is happening on the client, the server, and the network between them. -- greg byshenk - [EMAIL PROTECTED] - Leiden, NL ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"