Yes, absolutely. We've run into this same problem, exactly as you describe, on Solaris 10 (all versions).
You can catch it with a kernel dump, but you have to be wary and quick.

Keep a 'vmstat 3' open (or similar), and when free memory drops below 5 GB or so, be ready. As soon as you start seeing nonzero po (page-outs) or de (anticipated memory deficit), that's when to take your crash dump.
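For example, something along these lines (a rough sketch; the 2 GB threshold is arbitrary, and savecore -L needs a dedicated dump device configured via dumpadm):

  #!/bin/sh
  # Poll free memory every few seconds; take a live crash dump once it
  # drops below a threshold (2 GB here, purely an example value).
  pgsz=$(pagesize)
  threshold=$((2 * 1024 * 1024 * 1024 / pgsz))   # threshold in pages
  while :; do
      free=$(kstat -p unix:0:system_pages:freemem | awk '{print $2}')
      if [ "$free" -lt "$threshold" ]; then
          savecore -L    # live dump of the running kernel
          break
      fi
      sleep 3
  done

A live dump pauses the box briefly, so treat it as a best-effort fallback if you can't catch the hang itself.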

Basically, what happens (from my understanding after talking with an Oracle kernel engineer a while back) is that the kernel just keeps allocating NFS buffers, which build up and build up, and there's no mechanism for freeing them quickly enough. There really ought to be a RED-style early drop or some other form of back pressure, but it doesn't seem to be there.
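If you want to see where the memory is actually going before it tips over, diffing kernel memory cache stats over time is one crude way (a generic observation sketch, nothing NFS-specific):

  # Snapshot kmem cache usage a minute apart and compare; whichever
  # caches grow without bound are the ones to stare at in the dump.
  echo '::kmastat' | mdb -k > /tmp/kmastat.1
  sleep 60
  echo '::kmastat' | mdb -k > /tmp/kmastat.2
  diff /tmp/kmastat.1 /tmp/kmastat.2 | less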

You can make this problem less likely by decreasing the client-side rsize and wsize. Linux CentOS/RHEL 6 (and similar 2.6+ kernels) exacerbates it by defaulting to 1 MB rsize and wsize, which makes the server burn through big NFS buffers; if you force the clients down to 32 KB or perhaps even smaller, you can push the problem off a bit.
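On the Linux side that's just a mount option change, e.g. in /etc/fstab (hypothetical server and path names):

  # Force NFSv3 over TCP with 32 KB transfer sizes
  fileserver:/export/data  /mnt/data  nfs  vers=3,proto=tcp,rsize=32768,wsize=32768  0  0

'nfsstat -m' on the client afterwards shows what rsize/wsize were actually negotiated, since the server can still cap them.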

Do you have a synthetic load test to reproduce it?
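If not, something as simple as a few parallel writers per client tends to be enough to generate the pressure; a rough bash sketch (illustrative mount point and sizes):

  #!/bin/bash
  # Hypothetical per-client load generator: 4 processes, each writing
  # files of random sizes (50 MB .. ~2 GB) to the NFS mount in a loop.
  MNT=/mnt/nfswrite
  for i in 1 2 3 4; do
      (
          n=0
          while :; do
              size=$(( RANDOM % 2000 + 50 ))   # size in MB
              dd if=/dev/zero of=$MNT/$(hostname)-$i-$n bs=1M count=$size
              n=$((n + 1))
          done
      ) &
  done
  wait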

On 5/4/2015 5:45 PM, Chris Siebenmann wrote:
  We now have a reproducible setup with OmniOS r151014 where an OmniOS
NFS fileserver will experience memory exhaustion and then hang in the
kernel if it receives sustained NFS write traffic from multiple clients
at a rate faster than its local disks can sustain. The machine will run
okay for a while but with mdb -k's ::memstat showing steadily increasing
'Kernel' memory usage; after a while it tips over the edge, the ZFS ARC
starts shrinking, free RAM reported by 'vmstat' goes basically to nothing
(eg 182 MB), and the system locks hard.
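
(One way to keep a record of that growth, sketched here with an arbitrary one-minute interval:

  # Append a timestamped ::memstat snapshot to a log every 60 seconds.
  while :; do
      date
      echo '::memstat' | mdb -k
      sleep 60
  done >> /var/tmp/memstat.log 2>&1

which at least documents how fast 'Kernel' climbs in the run-up to the hang.)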

(We have not at this point tried to make a crash dump, but past attempts
to do so in similar situations have been failures.)

  A fairly reliable signal that the system is about to lock up very
soon is that '::svc_pool nfs' will report a steadily increasing and often
very large number of 'Pending requests' (as well as all configured threads
being active). Our most recent lockup reported over 270,000 pending
requests. Our working hypothesis is that something in the NFS server code
is accepting (too many) incoming requests and filling all memory with them,
which then leads to the hard lock.
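
(Watching that number climb is easy to script; again, just a sketch:

  # Sample the NFS service pool every 10 seconds and pull out the
  # pending-requests line.
  while :; do
      date
      echo '::svc_pool nfs' | mdb -k | grep -i pending
      sleep 10
  done

and the last samples before the hang show how quickly it blows up.)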

(It's possible that lower levels are also involved, eg TCP socket
receive buffers.)

  Our current simplified test setup: the OmniOS machine has 64 GB RAM
with 2x 1G Ethernet for incoming NFS writes, writing to a single pool of
a mirrored pair of 2 TB WD SE SATA drives. There are six client machines
on one network, 25 on the other, and all client machines are running
multiple processes that are writing files of various sizes (from 50 MB
through several GB); all client machines are Ubuntu Linux. We believe
(but have not tested) that multiple clients and possibly multiple
processes are required to provoke this behavior. All NFS traffic is
NFS v3 over TCP.

  Has anyone seen or heard of anything like this before?

  Is there any way to limit the number of pending NFS requests that the
system will accept? Allowing 270,000 strikes me as kind of absurd.

(I don't suppose anyone with a test environment wants to take a shot
at reproducing this. For us, this happens within an hour or three of
running at this load, and generally happens faster with a smaller number
of NFS server threads.)

        - cks
_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
