On 3/5/07, Spencer Shepler <[EMAIL PROTECTED]> wrote:

On Mar 5, 2007, at 11:17 AM, Leon Koll wrote:

> On 3/5/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
>>
>> Leon Koll writes:
>>
>>  > On 3/5/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
>>  > >
>>  > > Leon Koll writes:
>>  > >  > On 2/28/07, Roch - PAE <[EMAIL PROTECTED]> wrote:
>>  > >  > >
>>  > >  > >
>>  > >  > > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988
>>  > >  > >
>>  > >  > > NFSD threads are created on a demand spike (all of them
>>  > >  > > waiting on I/O) but then tend to stick around servicing
>>  > >  > > moderate loads.
>>  > >  > >
>>  > >  > > -r
>>  > >  >
>>  > >  > Hello Roch,
>>  > >  > It's not my case. NFS stops servicing at some point, and the
>>  > >  > reason is in ZFS. It never happens with NFS/UFS.
>>  > >  > Briefly, my scenario:
>>  > >  > 1st SFS run, 2000 requested IOPS. NFS is fine, low number
>>  > >  > of threads.
>>  > >  > 2nd SFS run, 4000 requested IOPS. NFS cannot serve all
>>  > >  > requests, number of threads jumps to max.
>>  > >  > 3rd SFS run, 2000 requested IOPS. NFS cannot serve all
>>  > >  > requests, number of threads jumps to max.
>>  > >  > The system cannot get back to the same results under equal
>>  > >  > load (1st and 3rd). A reboot between the 2nd and 3rd doesn't
>>  > >  > help. The only persistent thing is the directory structure
>>  > >  > that was created during the 2nd run (in SFS, a higher
>>  > >  > requested load means more directories/files are created).
>>  > >  > I am sure it's a bug. I need help. I don't care that ZFS
>>  > >  > works N times worse than UFS. I really care that after heavy
>>  > >  > load everything is totally screwed.
>>  > >  >
>>  > >  > Thanks,
>>  > >  > -- Leon
>>  > >
>>  > > Hi Leon,
>>  > >
>>  > > How much is the slowdown between the 1st and 3rd? How full is
>>  >
>>  > A typical case is:
>>  > 1st: 1996 IOPS, latency 2.7 ms
>>  > 3rd: 1375 IOPS, latency 37.9 ms
>>  >
>>
>> The large latency increase is a side effect of requesting
>> more than what can be delivered. Queues build up and latency
>> follows, so it should not be the primary focus IMO. The
>> decrease in IOPS is the primary problem.
>>
>> One hypothesis is that over the life of the FS we're moving
>> toward spreading access across the full disk platter. We can
>> imagine some fragmentation hitting as well. I'm not sure
>> how I'd test either hypothesis.
>>
>>  > > the pool at each stage? What does 'NFS stops servicing' mean?
>>  >
>>  > There are a lot of error messages on the NFS (SFS) client:
>>  > sfs352: too many failed RPC calls - 416 good 27 bad
>>  > sfs3132: too many failed RPC calls - 302 good 27 bad
>>  > sfs3109: too many failed RPC calls - 533 good 31 bad
>>  > sfs353: too many failed RPC calls - 301 good 28 bad
>>  > sfs3144: too many failed RPC calls - 305 good 25 bad
>>  > sfs3121: too many failed RPC calls - 311 good 30 bad
>>  > sfs370: too many failed RPC calls - 315 good 27 bad
>>  >
>>
>> Can this be timeouts or queue-full drops? It might be a side
>> effect of SFS requesting more than what can be delivered.
>
> I don't know whether it was timeouts or queue-full drops. SFS marked
> such runs as INVALID.
> I can run whatever is needed to help investigate the problem. If you
> have a D script that will tell us more, please send it to me.
> I appreciate your help.
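
A minimal D sketch along the lines Leon asks for here, offered as a
hypothetical starting point rather than anything provided in the thread:
it uses only the stable io provider to record per-device I/O latency and
seek-distance distributions, so comparing a capture from the 1st run
against one from the 3rd would give some evidence for or against the
platter-spreading/fragmentation hypothesis above.

#!/usr/sbin/dtrace -s
/*
 * iolat_seek.d -- a hypothetical sketch, not taken from this thread.
 * Records per-device I/O latency and seek-distance distributions
 * using only the stable io provider.  Run it during the 1st and 3rd
 * SFS runs and compare the output.
 */
#pragma D option quiet

/* last block address seen per device, declared so the read below compiles */
uint64_t last_blk[string];

io:::start
{
	/* remember when this buf was issued, keyed on the buf pointer */
	start_ns[arg0] = timestamp;

	this->dev = args[1]->dev_statname;

	/* absolute distance, in blocks, from the previous I/O on this device */
	this->delta = last_blk[this->dev] == 0 ? 0 :
	    (args[0]->b_blkno > last_blk[this->dev] ?
	    args[0]->b_blkno - last_blk[this->dev] :
	    last_blk[this->dev] - args[0]->b_blkno);

	@seek[this->dev] = quantize(this->delta);
	last_blk[this->dev] = args[0]->b_blkno;
}

io:::done
/start_ns[arg0]/
{
	@lat[args[1]->dev_statname] = quantize(timestamp - start_ns[arg0]);
	start_ns[arg0] = 0;
}

dtrace:::END
{
	printf("I/O latency (ns) by device:\n");
	printa(@lat);
	printf("Seek distance (blocks) by device:\n");
	printa(@seek);
}

Keying the latency on the raw buf pointer (arg0) is the usual io-provider
idiom, since the same buf is passed to both io:::start and io:::done.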

The failed RPCs are indeed a result of the SFS client timing out
the requests it has made to the server.  The server is being
overloaded beyond its capabilities and the benchmark results
show that.  I agree with Roch that as the SFS benchmark adds
more data to the filesystems, additional latency is introduced,
and for this particular configuration the server is being
over-driven.

The helpful thing would be to run smaller increments in the
benchmark to determine where the response time increases
beyond what the SFS workload can handle.
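
As a hypothetical companion to that suggestion (again, not something
provided in the thread), a D sketch like the following could be left
running at each load point to watch where server-side NFSv3 service
times start to climb; the rfs3_* names are the NFSv3 handler functions
in the Solaris server and should be verified against the build in use.

#!/usr/sbin/dtrace -s
/*
 * rfs3lat.d -- a hypothetical sketch, not taken from this thread.
 * Service time per NFSv3 operation on the server, using fbt probes on
 * the rfs3_* handlers (verify the function names match your build).
 * Prints and resets the distributions every 10 seconds.
 */
#pragma D option quiet

fbt::rfs3_getattr:entry,
fbt::rfs3_lookup:entry,
fbt::rfs3_read:entry,
fbt::rfs3_write:entry,
fbt::rfs3_create:entry,
fbt::rfs3_commit:entry
{
	self->ts = timestamp;
}

fbt::rfs3_getattr:return,
fbt::rfs3_lookup:return,
fbt::rfs3_read:return,
fbt::rfs3_write:return,
fbt::rfs3_create:return,
fbt::rfs3_commit:return
/self->ts/
{
	@svc[probefunc] = quantize(timestamp - self->ts);
	self->ts = 0;
}

profile:::tick-10sec
{
	printf("NFSv3 service time (ns) by operation:\n");
	printa(@svc);
	trunc(@svc);
}

The entry and return probes for these handlers fire in the same nfsd
service thread, so a thread-local self->ts is enough here.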

There have been a number of changes in ZFS recently that should
help with SFS performance measurement but fundamentally it
all depends on the configuration of the server (number of spindles
and CPU available).  So there may be a limit that is being
reached based on the hardware configuration.

What is your real goal here, Leon?  Are you trying to gather SFS
data to feed into sizing a particular solution, or just trying
to gather performance results for general comparisons?

Spencer,
I am using the SFS benchmark to emulate the real-world environment for
the NAS solution that I've built. SFS is able to create as many
processes (each one emulating a separate client) as one needs, plus it
is rich in metadata operations.
My real goal is quiet, peaceful nights after my solution goes into
production :)
What I see now: beyond some point, the directory structure/on-disk
layout of ZFS(?) becomes an obstacle that cannot be overcome. I don't
want that to happen in the field, that's all.
And I am sure we have a bug here.

There are certainly better benchmarks than SFS for either
sizing or comparison purposes.

I haven't found anything better: client-independent,
multi-process/multi-threaded, metadata-rich, comparable. An obvious
drawback is the cost ($$). I think a free replacement for it is the
almost unknown and underestimated fstress
( http://www.cs.duke.edu/ari/fstress/ ).
Which ones are your candidates?

Thanks,
-- Leon
