Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-31 Thread Greg Byshenk
On Tue, Aug 29, 2006 at 05:05:26PM +, Michael Abbott wrote:

[I wrote]
 An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
 and then try again.
 
 So.  I have done this.  And I can't reproduce the problem.

 # uname -a
 FreeBSD venus.araneidae.co.uk 6.1-STABLE FreeBSD 6.1-STABLE #1: Mon Aug 28 
 18:32:17 UTC 2006 
 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  i386
 
 Hmm.  Hopefully this is a *good* thing, i.e., the problem really has been 
 fixed, rather than just going into hiding.
 
 So, as far as I can tell, lockf works properly in this release.


Just as an interesting side note, I just experienced rpc.lockd crashing.
The server is not running RELENG_6, but RELENG_5 (FreeBSD 5.5-STABLE
#15: Thu Aug 24 18:47:20 CEST 2006).  Due to user error, someone ended
up with over 1000 processes trying to lock the same NFS-mounted file at
the same time.  The result was over 1000 "Cannot allocate memory" errors
followed by rpc.lockd crashing.

I guess the server is telling me it wants an update...
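
For anyone who wants to approximate that kind of pile-up on a test mount,
here is a minimal sketch (not the code from the incident; the path,
process count and hold time are invented) that spawns many processes all
taking a blocking POSIX lock on the same file -- the operation rpc.lockd
services for an NFS mount:

    /* locktest.c -- hedged sketch: many blocking lockers on one file. */
    #include <sys/wait.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            const char *path = argc > 1 ? argv[1] : "/mnt/locktest";
            int i, fd, nproc = argc > 2 ? atoi(argv[2]) : 100;
            struct flock fl;

            for (i = 0; i < nproc; i++) {
                    if (fork() == 0) {
                            fd = open(path, O_RDWR | O_CREAT, 0644);
                            if (fd < 0)
                                    err(1, "open");
                            memset(&fl, 0, sizeof(fl));
                            fl.l_type = F_WRLCK;    /* whole file, from 0 */
                            fl.l_whence = SEEK_SET;
                            /* Blocking byte-range lock; on an NFSv2/v3 mount
                             * this goes out over the NLM protocol handled by
                             * rpc.lockd. */
                            if (fcntl(fd, F_SETLKW, &fl) == -1)
                                    err(1, "fcntl(F_SETLKW)");
                            usleep(10000);          /* hold the lock briefly */
                            fl.l_type = F_UNLCK;
                            fcntl(fd, F_SETLK, &fl);
                            _exit(0);
                    }
            }
            while (wait(NULL) > 0)
                    ;                               /* reap the children */
            return (0);
    }

Run it against a file on the NFS mount (e.g. ./locktest /mnt/locktest 500)
and watch rpc.lockd on both ends.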


-- 
greg byshenk  -  [EMAIL PROTECTED]  -  Leiden, NL


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-29 Thread Peter Jeremy
On Mon, 2006-Aug-28 13:23:30 +, Michael Abbott wrote:
I think there is a case to be made for special casing SIGKILL, but in a 
sense it's not so much the fate of the process receiving the SIGKILL that 
counts: after all, having sent -9 I know that it will never process again.

Currently, if you send SIGKILL, the process will never enter userland
again.

Going further, so that a process sent SIGKILL always terminates
immediately, is significantly more difficult.  In the normal
case, a process is sleeping on some condition with PCATCH specified.
If the process receives a signal, sleep(9) will return ERESTART or
EINTR and the code then has to arrange to return back to userland
(which will cause the signal to be handled as per sigaction(2) and
the process's signal handlers).  In some cases, it may be inconvenient
to unwind back to userland from a particular point, so PCATCH isn't
specified on the sleep.
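
To illustrate the idiom (a hedged, kernel-side sketch only -- not actual
nfsclient or rpc.lockd code, and it only builds inside a kernel source
tree; the function name, wait channel and "lockwait" wmesg are
hypothetical, while tsleep(9), PSOCK and PCATCH are the real interfaces):

    #include <sys/param.h>
    #include <sys/systm.h>

    /*
     * Hypothetical wait point.  With PCATCH, a pending signal makes
     * tsleep(9) return EINTR or ERESTART and the caller must unwind
     * back to userland so the signal can be acted on.  Without PCATCH
     * the sleep never notices signals at all, which is how a process
     * ends up unkillable in "D" state.
     */
    static int
    wait_for_reply(void *reply_chan)
    {
            int error;

            error = tsleep(reply_chan, PSOCK | PCATCH, "lockwait", 0);
            if (error == EINTR || error == ERESTART) {
                    /* Interrupted by a signal: abandon the request and
                     * let the error propagate back up to userland. */
                    return (error);
            }
            return (0);             /* normal wakeup */
    }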

-- 
Peter Jeremy




Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-29 Thread Alexey Karagodov

That's all very good, but what can you say about fixing the problem
with rpc.lockd?







Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-29 Thread Oliver Fromme
Alexey Karagodov wrote:
  That's all very good, but what can you say about fixing the problem
  with rpc.lockd?

It has been mentioned several times in this mailing list:
rpc.lockd is in need of a complete rewrite.  Someone will
have to write a new rpc.lockd implementation.  As far as
I know, there is currently nobody working on it.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

And believe me, as a C++ programmer, I don't hesitate to question
the decisions of language designers.  After a decent amount of C++
exposure, Python's flaws seem ridiculously small. -- Ville Vainio


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-29 Thread Michael Abbott

On Tue, 29 Aug 2006, Alexey Karagodov wrote:

That's all very good, but what can you say about fixing the problem
with rpc.lockd?


Well, I will repeat the test with RELENG_6 (as of yesterday lunchtime), 
probably tonight, and report back.  Unfortunately building takes around 8 
hours on my test machine!



Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-29 Thread Michael Abbott

An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
and then try again.


So.  I have done this.  And I can't reproduce the problem.

# uname -a
FreeBSD venus.araneidae.co.uk 6.1-STABLE FreeBSD 6.1-STABLE #1: Mon Aug 28 
18:32:17 UTC 2006 
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/GENERIC  i386


Hmm.  Hopefully this is a *good* thing, i.e., the problem really has been 
fixed, rather than just going into hiding.


So, as far as I can tell, lockf works properly in this release.

Sorry to have generated so much traffic on this!


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Peter Jeremy
On Sun, 2006-Aug-27 22:55:55 +0300, Kostik Belousov wrote:
On server,
tcpdump -p -s 1500 -w <file> -i <iface> host <client host ip>

Recent tcpdumps appear to want the ethernet frame size rather than
the MTU:  Specifying 1500 appears to truncate full-size frames.
Try '-s 1516' instead.

-- 
Peter Jeremy




Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Dmitry Pryanishnikov


Hello!

On Mon, 28 Aug 2006, Peter Jeremy wrote:

On Sun, 2006-Aug-27 22:55:55 +0300, Kostik Belousov wrote:

On server,
tcpdump -p -s 1500 -w <file> -i <iface> host <client host ip>


Recent tcpdumps appear to want the ethernet frame size rather than
the MTU:  Specifying 1500 appears to truncate full-size frames.
Try '-s 1516' instead.


 'tcpdump -s 0' always works OK for me.





Sincerely, Dmitry
--
Atlantis ISP, System Administrator
e-mail:  [EMAIL PROTECTED]
nic-hdl: LYNX-RIPE


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Michael Abbott
Well, the result I have to report is so interesting, I'll give the 
executive summary right away:


If rpc.lockd is run with -d255 it works perfectly
If rpc.lockd is run with -d1 (or no -d option) it locks up

Sweet.

Does anyone out there who understands rpc.lockd fancy a deeper look?

On Sun, 27 Aug 2006, Greg Byshenk wrote:

The problem here is that the process is waiting for something, and
thus not listening to signals (including your 'kill').

I'm not an expert on this, but my first guess would be that saturn (your
server) is offering something that it can't deliver.  That is, the client
asks the server "can you do X?", and the server says "yes I can", so the
client says "do X" and waits -- and the server never does it.


I buy that analysis.

However, I think that the client's (venus) behaviour is unreasonable, and 
there must be some problem at the driver level: unkillable processes? 
(Tries to bang drum about this!)


Interesting: it looks like `nfslocking stop` releases the processes.


Or alternatively (based on your rpc.statd dying), rpc.lockd on your
client is trying to use rpc.statd to communicate with your server.  And
it starts successfully, but then rpc.statd dies (for some reason) and
your lock ends up waiting forever for it to answer.


Not quite: it was the first instantiation of rpc.lockd that went away, 
almost as if it was just waiting for something to happen!  However, it 
doesn't do this on the simple one-line test, so yes, I think there's 
something here to investigate.


Definitely: after running the test with no failures (see below), both 
lockd instances are still up and about.



I would recommend starting both rpc.lockd and rpc.statd with the '-d'
flag, to see if this provides any information as to what is going on.
There may well be a bug somewhere, but you need to find where it is.
I suspect that it is not actually in rpc.statd, as nothing in the
source has changed since January 2005.


Ok, I'll try that.  I'll try -d1, see what I get.

venus# /etc/rc.d/nfslocking stop
venus# /etc/rc.d/nfslocking start
Oddly, running restart only restarts statd, not lockd.  Strange naming 
convention for the commands, too -- have to put the following in rc.conf:

   rpc_lockd_enable="YES"
   lockd_flags="-d1"
   rpc_statd_enable="YES"
   statd_flags="-d"
Hmm.  Not terribly consistent.

Ok, let's see what a single lockf exchange looks like:

venus# mount saturn:$EXPORT /mnt
venus# lockf /mnt/test ls /mnt
test
venus# tail -n3 /var/log/debug.log
Aug 28 08:52:44 venus rpc.statd: unmon_all for host: NFS NLM prog: 0 ver: 0 
proc: 0
Aug 28 08:54:19 venus rpc.lockd: nlm_lock_res from saturn.araneidae.co.uk
Aug 28 08:54:19 venus rpc.lockd: nlm_unlock_res from saturn.araneidae.co.uk

Good.  Now let's run the test:
venus# cd /usr/src; make installworld DESTDIR=/mnt
and, at the same time, in another shell:
venus# tail -f /var/log/debug.log

Well, that's odd.  I get five nlm_lock_res/nlm_unlock_res pairs, with the 
last three in less than a second... and then nothing: the last, blocking, 
lockf doesn't generate any messages at all!


Interesting: stopping the lock daemon, by running `/etc/rc.d/nfslocking 
stop`, releases the lock!  Good.  Now I can run the test again with more 
logging (I'll set lockd_flags="-d255", though a quick grep of the source 
suggests that 6 would suffice).


Hmmm.  It's working perfectly now!  Well well well.  What are the 
differences?
 1.  I'm rerunning the test after restarting the lock and stat daemons 
without an intervening reboot.

 2.  I'm running lockd with maximum debugging.
 3.  I'm running the test in a different virtual console (I think we can 
ignore that difference!)


Fantastic: it ran to completion without a fault!  Wow.  I'll reboot (keep 
the same debug level) and try again...


Astounding!  Watch carefully:

1. Reboot
2. Log in on virtual consoles 1 & 2
3. On console 2 run
venus# tail -f /var/log/debug.log
4. On console 1 run
venus# mount saturn:$EXPORT /mnt
venus# rm -rf /mnt/*
venus# cd /usr/src; make installworld DESTDIR=/mnt
5. Switch to console 2 and watch the console and the PC's activity lights. 
Lots and lots of network activity (well, there's a surprise), and the 
occasional flurry of rpc.lockd messages.
6. When the lights stop flashing (five minutes or so) switch back to 
console 1 and see that everything ran perfectly.


Runs ok.

Well.  I can think of two possibilities:
1. High levels of logging do more than just logging: there's an 
inadvertent side effect.
2. There's a tight timing issue that is changed by the extra delays 
introduced by logging.


On the whole I buy 2, and it's not altogether clear whether the issue is 
on the client or the server.  Hmm.  Maybe I need to try Kostik Belousov's 
suggestion of running tcpdump.  Another message for that...



An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
and then try again.


This machine is so painfully slow that I'll probably have to do that 
overnight, and then I'm out of time until next weekend.  Just installed 
cvsup (from ports; oops, my mistake, forgot about having to build ezm3!)

Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Kostik Belousov
On Mon, Aug 28, 2006 at 09:48:48AM +, Michael Abbott wrote:
 An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
 and then try again.
 
 This machine is so painfully slow that I'll probably have to do that 
 overnight, and then I'm out of time until next weekend.  Just installed 
 cvsup (from ports; oops, my mistake, forgot about having to build ezm3!)
BTW, at least one fix committed after 6.1 could fix exactly the issue with
the client rpc.lockd disappearing that was mentioned before.

Anyway, I think that tcpdump output would be crucial in debugging the issue.




Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Michael Abbott

On Sun, 27 Aug 2006, Kostik Belousov wrote:

On server,
tcpdump -p -s 1500 -w <file> -i <iface> host <client host ip>


Ok.  I've run
saturn# tcpdump -p -s 1500 -w tcpdump.out -i xl0 host 10.0.0.105

and run the failing test on venus (with `rpc.lockd -d1`).  The failing 
lockf has moved -- it took longer to fail this time -- but it does fail. 
As before, one of the lockd processes has vanished.


venus# ps axlww | grep rpc\\.
0 18303 1   0  96  0 263460   916 select Ss??0:00.00 
/usr/sbin/rpc.statd -d
0 18308 1   0  96  0  1416  1024 select Is??0:00.01 
/usr/sbin/rpc.lockd -d1
1 18309 18308   0   4  0  1420  1036 nfsloc I ??0:00.00 
/usr/sbin/rpc.lockd -d1
(run the test until it locks)
venus# ps axlww | grep rpc\\.
0 18303 1   0  96  0 263460   884 select Ss??0:00.00 
/usr/sbin/rpc.statd -d
1 18309 1   0   4  0  1440  1008 nfsloc S ??0:00.00 
/usr/sbin/rpc.lockd -d1


Yes, this is very interesting.  Does something appear in the logs?
Also, you should use the -d option of rpc.lockd (and show the output together
with the tcpdump output).


Well.  See my previous message this morning for the -d output.  As for 
tcpdump, I have an interesting (and rather obvious) problem:


saturn# stat -f%z /tmp/tcpdump.out
161794058

Hmm.  Perhaps you don't want that.  I'll hang onto it for a bit: let me 
know what you want to do with it!



Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Oliver Fromme
Michael Abbott wrote:
  What about the non-interruptible sleep?  Is this regarded as par for the 
  course with NFS, or as a problem?
  
  I know that hard NFS mounts are treated as completely unkillable, though 
  why `kill -9` isn't made to work escapes me, but a locking operation which 
  (presumably) suffers a protocol error?  Or is rpc.lockd simply waiting to 
  hear back from the (presumably broken) NFS server?  Even so: `kill -9` 
  ought to work!

SIGKILL _does_ always work.  However, signal processing can
be delayed for various reasons.  For example, if a process
is stopped (SIGSTOP), further signals will only take effect
when it continues (SIGCONT).

Signal processing does not occur if a process is currently
not scheduled, which is the case if the process is blocked
on I/O (indicated by D in the STAT column of ps(1), also
called the disk-wait state).  That can happen if the
hardware is broken (disk, controller, cable), so an I/O
request doesn't return.  It can also happen if there are
NFS hiccups, as seems to be the case here.

As soon as the D state ends, the process becomes runnable
again (i.e. it's put on the scheduler's run queue), which
means that it'll get a CPU share, and the SIGKILL signal
that you sent it before will finally be processed.

Some background information:  Each process has a bit mask
which stores the set of received signals.  kill(2) (and
therefore also kill(1)) only sets a bit in that bit mask.
The next time the process is scheduled onto a CPU, the mask
of received signals is processed and acted upon.  That's
not FreeBSD-specific; it works like that on almost all UNIX
systems.  Why does it work that way?  Well, if signals were
processed for processes not on the CPU, then there would be
a hole:  A process would be able to circumvent the
scheduler, because signal processing happens on behalf of
the process, which means that it runs with the credentials,
resource limits, nice value etc. of that process.  Well, in
theory, a special case could be made for SIGKILL, but it's
quite difficult if you don't want to break existing semantics
(or create holes).
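
A small userland illustration of that pending-signal set (a hedged
sketch, not FreeBSD-specific code; SIGUSR1 stands in because SIGKILL
cannot be caught, and sigprocmask(2) closes the "window" artificially,
whereas in the NFS case the window never opens because the process is
stuck in an uninterruptible sleep):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t handled = 0;

    static void on_usr1(int sig) { (void)sig; handled = 1; }

    int
    main(void)
    {
            sigset_t set, pending;

            signal(SIGUSR1, on_usr1);

            /* Close the window: the process cannot act on SIGUSR1 now. */
            sigemptyset(&set);
            sigaddset(&set, SIGUSR1);
            sigprocmask(SIG_BLOCK, &set, NULL);

            /* kill(2) only marks the signal pending in the process. */
            kill(getpid(), SIGUSR1);

            sigpending(&pending);
            printf("pending: %d  handled: %d\n",
                sigismember(&pending, SIGUSR1), (int)handled);

            /* Reopen the window: the pending signal is now acted upon. */
            sigprocmask(SIG_UNBLOCK, &set, NULL);
            printf("after unblock, handled: %d\n", (int)handled);
            return (0);
    }

This should print "pending: 1  handled: 0" followed by "after unblock,
handled: 1".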

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

UNIX was not designed to stop you from doing stupid things,
because that would also stop you from doing clever things.
-- Doug Gwyn


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-28 Thread Michael Abbott

On Mon, 28 Aug 2006, Oliver Fromme wrote:
SIGKILL _does_ always work.  However, signal processing can be delayed 
for various reasons.

[...]
Well, in theory, a special case could be made for SIGKILL, but it's 
quite difficult if you don't want to break existing semantics (or create 
holes).


Thank you, that was both instructive and interesting.

if a process is stopped (SIGSTOP), further signals will only take effect 
when it continues (SIGCONT).


Um.  Doesn't this mean that SIGCONT is already a special case?

I think there is a case to be made for special casing SIGKILL, but in a 
sense it's not so much the fate of the process receiving the SIGKILL that 
counts: after all, having sent -9 I know that it will never process again.


More to the point, all processes which are waiting for the killed process 
should be released.  I think maybe I'd like to change the process into Z 
('zombie') state while it's still blocked in IO!  Sounds like a new state 
to me, actually: K, killed in disk wait.


Of course, ideally, all other resources held by the new zombie should also 
be released ... including the return context for the blocked IO call! 
Tricky, but the process is never going to use its resources again. Of 
course, any resources held in the blocked IO call itself are another 
matter...


Ah well.  I guess it's a bit of an academic point.


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-27 Thread Greg Byshenk
On Sun, Aug 27, 2006 at 11:24:13AM +, Michael Abbott wrote:
 I've been trying to make some sense of the NFS locking issue.  I am 
 trying to run
   # make installworld DESTDIR=/mnt
 where /mnt is an NFS mount on a FreeBSD 4.11 server, but I am unable to 
 get past a call to `lockf`.

I have not closely followed the discussion, as I have not experienced 
the problem.

I am currently running FreeBSD 6-based fileservers in an environment that
includes FreeBSD, Linux (multiple flavors), Solaris, and Irix clients,
and have experienced no NFS locking issues (I have one occasional
problem with 64-bit Linux clients, but it is not locking-related and
appears to be due to a 64-bit Linux problem).

Further (though there may well be problems with NFS locking), I cannot
recreate the problem you described -- at least in a FreeBSD 6 environment.

I have just performed a test of what you describe, using 'smbtest'
(6.1-STABLE #17: Fri Aug 25 12:25:19 CEST 2006) as the client and 
'data-2' (FreeBSD 6.1-STABLE #16: Wed Aug  9 15:38:12 CEST 2006) as the
server.

   data-2 # mkdir /export/rw/bsd6root/
   ## /export/rw is already exported via NFS
   smbtest # mount data-2:/export/rw/bsd6root /mnt
   smbtest # cd /usr/src
   smbtest # make installworld DESTDIR=/mnt
   [...]
   makewhatis /mnt/usr/share/man
   makewhatis /mnt/usr/share/openssl/man
   rm -rf /tmp/install.2INObZ3j
   smbtest #

Which is to say that it completed successfully.  Which suggests that there
is not a serious and ongoing problem.

There may well be a problem with FreeBSD4, but I no longer have any NFS
servers running FreeBSD4.x, so I cannot confirm.  Alternatively, there
may have been a problem in 6.1-RELEASE that has since been solved in
6.1-STABLE that I am using.  Or there could be a problem with the 
configuration of your server.  Or there could be something else going
on (in the network...?).

But to see what exactly is happening in your case, you would probably 
want to look at what exactly is happening on the client, the server, and
the network between them.
 

-- 
greg byshenk  -  [EMAIL PROTECTED]  -  Leiden, NL


Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-27 Thread Kostik Belousov
On Sun, Aug 27, 2006 at 11:24:13AM +, Michael Abbott wrote:
 I've been trying to make some sense of the NFS locking issue.  I am 
 trying to run
   # make installworld DESTDIR=/mnt
 where /mnt is an NFS mount on a FreeBSD 4.11 server, but I am unable to 
 get past a call to `lockf`.
 
 On this mailing list I've seen a thread starting with this message:
   
 http://lists.freebsd.org/pipermail/freebsd-stable/2006-August/027561.html
 and elsewhere I've seen this thread:
   http://www.gatago.com/mpc/lists/freebsd/stable/21851805.html
 
 The gist seems to be that rpc.lockd is badly behaved and broken and nobody 
 knows how to fix it.  So, in case my experience is any help, here is what 
 I can report.
 
 1.  I have installed a fresh installation of FreeBSD 6.1 from the CD, 
 6.1-RELEASE-i386-disc1.iso, and have run `cd /usr/src; make buildworld; 
 make buildkernel` successfully (takes nearly 8 hours, but then it is a 
 fanless machine).  The full distribution (as installed by sysinstall) is 
 present, but nothing else.
 
 2.  Intending to experiment with network booting, I've attempted
 `make installworld DESTDIR=/mnt`, where /mnt is an NFS mount point on my 
 master server, running FreeBSD 4.11-RELEASE-p11.
 
 3.  This fails when invoking lockf.  To work around this, I have started 
 rpc.lockd on the 4.11 server and configured all of the following lines in 
 rc.conf:
   rpcbind_enable="YES"
   nfs_client_enable="YES"
   rpc_lockd_enable="YES"
   rpc_statd_enable="YES"
 
 4.  Now here is the behaviour:
 
   # mount $MY_SERVER:$MY_PATH /mnt
   # lockf /mnt/test ls
 This works just fine
   # cd /usr/src; make installworld DESTDIR=/mnt
 This hangs in lockf, and is unkillable (even `kill -9` is no good, and ps 
 shows state = D+).  So let's start another shell (Alt-F2):
   # lockf /mnt/test ls
 Also hangs.
 
 Rebooting the test machine clears the problem, returning to the state at 
 the start of point (4), and the problem is completely repeatable in my 
 configuration.
 
 
 Some observations:
 
  - Hanging in uninterruptible sleep is not good.  No doubt it's quite 
 possible that my 4.11 server has a broken rpc.lockd (or maybe I've not 
 configured it right: I just started rpc.lockd, rather than restarting the 
 server), but the behaviour of 6.1 is exceptionally unfriendly.  In 
 particular, unkillable processes look like outright bugs to me.
 
  - The conversation on mpc.lists.freebsd.stable (and elsewhere) looks 
 alarming.  I get the impression that this part of FreeBSD 6.1 is really 
 rather broken and that there's no real sense of what to do about it.

Make sure that rpc.statd is running.
For debugging purposes, a tcpdump of the corresponding communications would
be quite useful.  Besides this, the output of ps auxww | grep 'rpc\.' may be
interesting.




Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-27 Thread Michael Abbott

On Sun, 27 Aug 2006, Greg Byshenk wrote:

On Sun, Aug 27, 2006 at 11:24:13AM +, Michael Abbott wrote:

I've been trying to make some sense of the NFS locking issue.  I am
trying to run
# make installworld DESTDIR=/mnt
where /mnt is an NFS mount on a FreeBSD 4.11 server, but I am unable to
get past a call to `lockf`.



I have just performed a test of what you describe, using 'smbtest'
(6.1-STABLE #17: Fri Aug 25 12:25:19 CEST 2006) as the client and
'data-2' (FreeBSD 6.1-STABLE #16: Wed Aug  9 15:38:12 CEST 2006) as the
server.

...

Which is to say that it completed successfully.  Which suggests that there
is not a serious and ongoing problem.


Hm.  That's a useful data point: thanks for running the test!

What about the non-interruptible sleep?  Is this regarded as par for the 
course with NFS, or as a problem?


I know that hard NFS mounts are treated as completely unkillable, though 
why `kill -9` isn't made to work escapes me, but a locking operation which 
(presumably) suffers a protocol error?  Or is rpc.lockd simply waiting to 
hear back from the (presumably broken) NFS server?  Even so: `kill -9` 
ought to work!



Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-27 Thread Michael Abbott

On Sun, 27 Aug 2006, Kostik Belousov wrote:

Make sure that rpc.statd is running.
Yep.  Took me some while to figure that one out, but the first lockf test 
failed without that.


For debugging purposes, tcpdump of the corresponding communications 
would be quite useful. Besides this, output of ps auxww | grep 'rpc\.' 
may be interesting.


Um.  How interesting would tcpdump be?  I'm prepared to do the work, but 
as I've never used the tool, it may take me some effort and time to figure 
out the right commands.  Yes: `man tcpdump | wc -l` == 1543.  Fancy 
giving me a sample command to try?


As for the other test, let's have a look.  Here we are before the test 
(NFS server, 4.11, is saturn, test machine, 6.1, is venus):


saturn$ ps auxww | grep rpc\\.
root48917  0.0  0.1   980  640  ??  Is7:56am   0:00.01 rpc.lockd
root  115  0.0  0.1 263096  536  ??  Is   18Aug06   0:00.00 rpc.statd

venus# ps auxww | grep rpc\\.
root 510  0.0  0.9 263460  1008  ??  Ss6:05PM   0:00.01 
/usr/sbin/rpc.statd
root 515  0.0  1.0  1416  1120  ??  Is6:05PM   0:00.02 
/usr/sbin/rpc.lockd
daemon   520  0.0  1.0  1420  1124  ??  I 6:05PM   0:00.00 
/usr/sbin/rpc.lockd

That's interesting.  Don't know how significant the differences are... 
Ok, let's run the test:


venus# cd /usr/src; make installworld DESTDIR=/mnt

Well, how odd: as soon as I start the test, process 515 on venus goes away. 
Now to wait for it to fail... (doesn't take too long):


saturn$ ps auxww | grep rpc\\.
root48917  0.0  0.1   980  640  ??  Is7:56am   0:00.01 rpc.lockd
root  115  0.0  0.1 263096  536  ??  Is   18Aug06   0:00.00 rpc.statd

venus# ps auxww | grep rpc\\.
root 510  0.0  0.9 263460   992  ??  Ss6:05PM   0:00.01 
/usr/sbin/rpc.statd
daemon   520  0.0  1.0  1440  1152  ??  S 6:05PM   0:00.01 
/usr/sbin/rpc.lockd
venus# ps auxww | grep lockf
...
root7034  0.0  0.5  1172   528  v0  D+6:51PM   0:00.01 lockf -k 
/mnt/usr/...

(I've truncated the lockf call: the detail of the install call it's making 
is hardly relevant!)


Note that now any call to lockf on this server will fail...  Hmm.  What 
about a different mount point?  Bet I can't unmount ...


venus# umount /mnt
umount: unmount of /mnt failed: Device busy
venus# umount -f /mnt
venus# mount saturn:/tmp /mnt
venus# lockf /mnt/test ls
(Hangs)

Now this is interesting: the file saturn:/tmp/test exists!  And it appears 
to be owned by uid=4294967294 (-2?)!  How very odd.  If I reboot venus and 
try just a single lockf:


venus# lockf /mnt/test stat -f%u /mnt/test
0

As one might expect, indeed.  A hint as to who's got stuck (saturn, I'm 
sure), but beside the point, I guess.


Note also that the `umount -f /mnt` *didn't* release the lockf, and also 
note that /tmp/test is still there (on saturn) after a reboot of venus.



In conclusion: I agree with Greg Byshenk that the NFS server is bound to 
be the one at fault, BUT, is this "freeze until reboot" behaviour really 
what we want?  I remain astonished (and irritated) that `kill -9` doesn't 
work!



Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-27 Thread Kostik Belousov
On Sun, Aug 27, 2006 at 07:17:34PM +, Michael Abbott wrote:
 On Sun, 27 Aug 2006, Kostik Belousov wrote:
 For debugging purposes, tcpdump of the corresponding communications 
 would be quite useful. Besides this, output of ps auxww | grep 'rpc\.' 
 may be interesting.
 
 Um.  How interesting would tcpdump be?  I'm prepared to do the work, but 
 as I've never used the tool, it may take me some effort and time to figure 
 out the right commands.  Yes: `man tcpdump | wc -l` == 1543.  Fancy 
 giving me a sample command to try?
On server,
tcpdump -p -s 1500 -w <file> -i <iface> host <client host ip>
This assumes you use Ethernet with the usual MTU; <iface> is the interface
where communication with the client comes from.

 As for the other test, let's have a look.  Here we are before the test
 (NFS server, 4.11, is saturn, test machine, 6.1, is venus):

 saturn$ ps auxww | grep rpc\\.
My fault; better to use ps axlww.
[...]
 Well, how odd: as soon as I start the test, process 515 on venus goes away. 
 Now to wait for it to fail... (doesn't take too long):
Yes, this is very interesting.  Does something appear in the logs?
Also, you should use the -d option of rpc.lockd (and show the output together
with the tcpdump output).




Re: NFS locking: lockf freezes (rpc.lockd problem?)

2006-08-27 Thread Greg Byshenk
On Sun, Aug 27, 2006 at 07:17:34PM +, Michael Abbott wrote:
 On Sun, 27 Aug 2006, Kostik Belousov wrote:

 Make sure that rpc.statd is running.
 Yep.  Took me some while to figure that one out, but the first lockf test 
 failed without that.
 
[...]
 
 As for the other test, let's have a look.  Here we are before the test 
 (NFS server, 4.11, is saturn, test machine, 6.1, is venus):
 
 saturn$ ps auxww | grep rpc\\.
 root48917  0.0  0.1   980  640  ??  Is7:56am   0:00.01 rpc.lockd
 root  115  0.0  0.1 263096  536  ??  Is   18Aug06   0:00.00 rpc.statd
 
[...]
 
 Well, how odd: as soon as I start the test, process 515 on venus goes away. 
 Now to wait for it to fail... (doesn't take too long):
 
[...] 
 
 In conclusion: I agree with Greg Byshenk that the NFS server is bound to 
 be the one at fault, BUT, is this "freeze until reboot" behaviour really 
 what we want?  I remain astonished (and irritated) that `kill -9` doesn't 
 work!

The problem here is that the process is waiting for something, and 
thus not listening to signals (including your 'kill').

I'm not an expert on this, but my first guess would be that saturn (your
server) is offering something that it can't deliver.  That is, the client
asks the server "can you do X?", and the server says "yes I can", so the
client says "do X" and waits -- and the server never does it.

Or alternatively (based on your rpc.statd dying), rpc.lockd on your
client is trying to use rpc.statd to communicate with your server.  And
it starts successfully, but then rpc.statd dies (for some reason) and
your lock ends up waiting forever for it to answer.


I would recommend starting both rpc.lockd and rpc.statd with the '-d'
flag, to see if this provides any information as to what is going on.
There may well be a bug somewhere, but you need to find where it is.
I suspect that it is not actually in rpc.statd, as nothing in the
source has changed since January 2005.

An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
and then try again.


-- 
greg byshenk  -  [EMAIL PROTECTED]  -  Leiden, NL