Our vendor finally got back to us, suggesting we install patch 144532-01, so I am not entirely sure what happened with the IDR we requested. 144532-01 does not appear to be related to CR: 6976554, but I installed it 'just in case'.

I had Dave Brown run his tests again and we can confirm that 144532-01 does NOT fix the issues that we are having with CR: 6976554.

I will bounce it back to our vendor yet again, and wait a few more weeks in the hope that they escalate the right patch this time.

Lund




Jorgen Lundman wrote:

Just an update while the details are semi-fresh in my mind.

Thanks to the expert help from Marcel, Robert and Jan at Sun/Oracle, we
started to dig a little deeper into the NFS problems that we have been
experiencing.

Eventually we noticed that some of the internal lock counts are rather
high, which can be seen in the output of:


x4500-14:/# echo '::rfs4_db' | mdb -k
rfs4_database=ffffffff994875c8
debug_flags=00000000 shutdown: count=0 tables=ffffff80ec69d000
------------------ Table ------------------- Bkt ------- Indices -------
Address          Name         Flags       Cnt  Cnt Pointer           Cnt  Max
ffffff80ec69d000 DelegStateID 00000000 137788 2047 fffffe98eb209c00 0002 0002
ffffff80e4b7f900 File         00000000 126070 2047 fffffe8ccea06ec0 0001 0001
fffffe8cd5fda000 Lockowner    00000000   2196 2047 fffffe8cc1592d00 0002 0002
fffffe8cb6454c00 LockStateID  00000000   1321 2047 fffffe98eb269680 0002 0002
fffffe8cd6edae40 OpenStateID  00000000  54945 2047 fffffe89041d3dc0 0003 0003
fffffe8cd5fe7d80 OpenOwner    00000000 308101 2047 fffffe98ed09b900 0001 0001
fffffe8b4a893540 Client       00000000   0022 2047 fffffe8af751e580 0002 0002


I added the first "Cnt" column of this output to our Nagios graphs so
we could track it over time (see attachment; the drop is the daily remount).
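
For anyone who wants to graph the same counters, a minimal sketch of
pulling the per-table counts out of that output (the field positions
follow the table layout above; the script name and the idea of feeding
the pairs to a graphing system are my own, not stock tooling):

#!/bin/sh
# rfs4_counts.sh -- print "<table>=<count>" pairs from the rfs4
# state tables, suitable for feeding to a graphing system.
# Rows of interest start with a hex kernel address; $2 is the
# table name and $4 is its entry count.
echo '::rfs4_db' | mdb -k | \
    awk '/^[0-9a-f]+ / { printf "%s=%s ", $2, $4 }'
echo ""

Run as root on the server; on the box above it would print, among
other pairs, OpenOwner=308101.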

Soon it became apparent that "OpenOwner" is leaking locks over time.
There is a hard limit in the kernel of 1 million locks.

Eventually Marcel created CR: 6976554
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6976554


We are currently playing pass-the-bucket with our vendor in an attempt
to escalate this problem so that we can receive an IDR. In the
meantime, there are daily procedures which help alleviate the problem.

We can unmount/remount to free many of the locks, and about every 20
days we do scheduled maintenance where we restart nfsd.
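
For reference, the two workarounds amount to something like this (the
client mount point is hypothetical and the bare remount assumes a
vfstab entry; the SMF service name is the stock Solaris 10 one):

# On the clients (daily): remount to release accumulated state.
umount /export/cgi && mount /export/cgi

# On the affected x4540 (roughly every 20 days, in a maintenance
# window): restart the NFS server service.
svcadm restart svc:/network/nfs/server:default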

Should we find a fix for this issue in future, I will reply here again.

Thanks,






Jorgen Lundman wrote:
Just a progress update post. We are still experiencing this problem, and
since we now have 21 x4540s we see it fairly regularly.

The NOC staff remount the cgi and vmx cluster servers every morning, as
it makes the problem occur less frequently.

We now monitor /var/adm/messages for the string:

Jul 26 21:05:34 cgi12.unix Suspected server reboot.

This occurs with increasing frequency: first once a week or so,
eventually reaching twice a day. Each time, if left alone, it recovers
more slowly. It recovers a little faster if you restart processes and
remount the file-system.
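
A minimal sketch of how such a log check can be scripted (the script
name and state file are placeholders, and it ignores log rotation for
brevity):

#!/bin/sh
# check_suspected_reboot.sh -- Nagios-style check: go critical when
# new "Suspected server reboot" lines appear in the messages file.
STATE=/var/tmp/suspected_reboot.count
last=`cat $STATE 2>/dev/null`
[ -z "$last" ] && last=0
now=`grep -c "Suspected server reboot" /var/adm/messages`
echo "$now" > $STATE
if [ "$now" -gt "$last" ]; then
    echo "CRITICAL: new 'Suspected server reboot' messages"
    exit 2
fi
echo "OK: no new suspected server reboots"
exit 0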

During an outbreak, nfsd appears unresponsive, but not overloaded. In
the above case on x4500-14, the loadavg was 0.60, with ~350 threads in
use. 'df' commands on the cgi servers would take 4-20 seconds to get
statfs results for this mount.
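
To put a number on the slowness, simply timing a df against the
suspect mount shows the statfs latency (the mount point below is
hypothetical):

# Sample statfs latency once a minute; a healthy server answers
# instantly, during an outbreak this takes 4-20 seconds.
while :; do
    /usr/bin/time df -k /export/cgi > /dev/null
    sleep 60
done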

Currently, when we see the above message for a particular x4540 server,
we schedule a maintenance window and restart nfsd. This clears the
problem on that server for a few weeks.

This happens most on cgi, vmx and pop clusters (in decreasing order).

At a wild guess, and I have nothing but hunches to back this up, nfsd
has some sort of bug where it completely loses, or invalidates, all
stored clientid information, perhaps open files or locks. This forces
all NFS clients to re-negotiate all open files or locks, which results
in a network frenzy that takes several minutes to recover from. nfsd
then appears to have hung.

The frequency of triggering the nfsd bug seems to increase with age.

At the request of our vendor, we have run GUDS, livecore, gcore, dtrace,
and many stat commands.

We still run Sol10u8 (Generic_141445-09) everywhere.





--
Jorgen Lundman       | <[email protected]>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)