rpc.lockd stalls

Tom Ierna Thu, 07 Sep 2006 11:46:53 -0700

Hello, list.

To simplify software and hardware management, I'm attempting to run a set of PXE-booted Client machines as web/db or mail servers.

The NFS/DHCP/YP servers are running on a 5.4-STABLE Server. I mostly followed the PXE guide when building these systems.

All of the disk (except for swap) sits on the master Server (which has a bunch of external drive sleds), and all of the Client machines boot via Gig-E.

Client machines are running 5.4-STABLE as well, but it is not compiled with the same kernel configuration as the master Server, as the hardware is slightly different. Client machines share userland with the Server.

At the moment I have one Client machine running about 40 domains of web and db with reasonably low traffic (less than 3Mbit/sec total), and one Client machine booted from the master Server but not doing anything.


Resource utilization on the master Server seems pretty low.

Sporadically, rpc.lockd appears to stall on some locks. These lock stalls exhibit "interesting" behavior on the Client machines: Slots will fill up on Apache in the "W" state. SSH login attempts to the client machine (passwd files get some user data via YP) will hang and time out. When I find a file (via Apache's extended status) which appears to be one of the stalled locks, and I attempt to do anything with the file via a shell on the client machine, such as "cat" it, that shell will become unresponsive. Any process which is stalled on one of these files cannot be killed.
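To check a suspect file without sacrificing a shell, the lock request can be made from a throwaway child process that the parent never waits on. What follows is only a minimal sketch, assuming a Python 3 interpreter is available on the client; `probe_lock`, its `timeout` and `poll_interval` parameters, and the "ok"/"error"/"stalled" labels are names made up for illustration and are not part of any FreeBSD tool.

```python
import subprocess
import sys
import time

# Program run in the child: take a shared advisory lock on the file named in
# argv[1], release it, and exit 0. Any failure (missing file, lock refused)
# ends the child with a non-zero exit status.
_CHILD = (
    "import fcntl, sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    fcntl.lockf(f, fcntl.LOCK_SH | fcntl.LOCK_NB)\n"
    "    fcntl.lockf(f, fcntl.LOCK_UN)\n"
)


def probe_lock(path, timeout=5.0, poll_interval=0.1):
    """Return "ok", "error", or "stalled" for a lock attempt on path.

    The attempt runs in a separate process, and the parent only polls it,
    so the caller gets an answer even if the child hangs and cannot be killed.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = child.poll()  # non-blocking check of the child's exit status
        if code is not None:
            return "ok" if code == 0 else "error"
        time.sleep(poll_interval)
    # Best effort only: a process stuck on a stalled lock may ignore the
    # signal, so the parent deliberately does not wait for it afterwards.
    child.kill()
    return "stalled"
```

Calling `probe_lock()` on a few files under the NFS-mounted docroot at regular intervals, and logging the result with a timestamp, might at least show when a stall begins and which files are affected.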

On the server, the only symptom I've witnessed is that rpc.lockd starts using a bit more CPU than it usually does. Normal utilization is 0.0, and when the problem is happening, it might go up to 3.0 or so. "cat"ing a file on the Server which appears stalled on the Client works fine.

A stop and start of nfslocking on the server seems to clear things up. Apache on the client will recover on its own, I'm guessing after each stalled lock reaches a timeout. I usually gracefully restart Apache, which forces the recovery to happen faster.

As far as timing goes, it doesn't appear to be consistently periodic. It doesn't appear to be load related, either: I suffered through a Digg of one of the sites, and while the client machine served more bandwidth in that couple of days than it had in a month, this particular problem did not occur.

Over the past three months or so, this issue has probably cropped up three or four times.

What can I do to troubleshoot this? I would like to add more client machines, but I can't until this problem is resolved.

Changing OS builds at this point, unless absolutely necessary, is not something I want to do.


Thanks for any insight!

--
Tom Ierna
President
Shockergroup, Inc.

_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
