Nikolay Denev wrote: > On Oct 11, 2012, at 7:20 PM, Nikolay Denev <nde...@gmail.com> wrote: > > > On Oct 11, 2012, at 8:46 AM, Nikolay Denev <nde...@gmail.com> wrote: > > > >> > >> On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmack...@uoguelph.ca> > >> wrote: > >> > >>> Nikolay Denev wrote: > >>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmack...@uoguelph.ca> > >>>> wrote: > >>>> > >>>>> Nikolay Denev wrote: > >>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem > >>>>>> <rmack...@uoguelph.ca> > >>>>>> wrote: > >>>>>> > >>>>>>> Garrett Wollman wrote: > >>>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem > >>>>>>>> <rmack...@uoguelph.ca> said: > >>>>>>>> > >>>>>>>>>> Simple: just use a sepatate mutex for each list that a > >>>>>>>>>> cache > >>>>>>>>>> entry > >>>>>>>>>> is on, rather than a global lock for everything. This would > >>>>>>>>>> reduce > >>>>>>>>>> the mutex contention, but I'm not sure how significantly > >>>>>>>>>> since > >>>>>>>>>> I > >>>>>>>>>> don't have the means to measure it yet. > >>>>>>>>>> > >>>>>>>>> Well, since the cache trimming is removing entries from the > >>>>>>>>> lists, > >>>>>>>>> I > >>>>>>>>> don't > >>>>>>>>> see how that can be done with a global lock for list > >>>>>>>>> updates? > >>>>>>>> > >>>>>>>> Well, the global lock is what we have now, but the cache > >>>>>>>> trimming > >>>>>>>> process only looks at one list at a time, so not locking the > >>>>>>>> list > >>>>>>>> that > >>>>>>>> isn't being iterated over probably wouldn't hurt, unless > >>>>>>>> there's > >>>>>>>> some > >>>>>>>> mechanism (that I didn't see) for entries to move from one > >>>>>>>> list > >>>>>>>> to > >>>>>>>> another. Note that I'm considering each hash bucket a > >>>>>>>> separate > >>>>>>>> "list". (One issue to worry about in that case would be > >>>>>>>> cache-line > >>>>>>>> contention in the array of hash buckets; perhaps > >>>>>>>> NFSRVCACHE_HASHSIZE > >>>>>>>> ought to be increased to reduce that.) > >>>>>>>> > >>>>>>> Yea, a separate mutex for each hash list might help. There is > >>>>>>> also > >>>>>>> the > >>>>>>> LRU list that all entries end up on, that gets used by the > >>>>>>> trimming > >>>>>>> code. > >>>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't > >>>>>>> looked > >>>>>>> at > >>>>>>> it in a while.) > >>>>>>> > >>>>>>> Also, increasing the hash table size is probably a good idea, > >>>>>>> especially > >>>>>>> if you reduce how aggressively the cache is trimmed. > >>>>>>> > >>>>>>>>> Only doing it once/sec would result in a very large cache > >>>>>>>>> when > >>>>>>>>> bursts of > >>>>>>>>> traffic arrives. > >>>>>>>> > >>>>>>>> My servers have 96 GB of memory so that's not a big deal for > >>>>>>>> me. > >>>>>>>> > >>>>>>> This code was originally "production tested" on a server with > >>>>>>> 1Gbyte, > >>>>>>> so times have changed a bit;-) > >>>>>>> > >>>>>>>>> I'm not sure I see why doing it as a separate thread will > >>>>>>>>> improve > >>>>>>>>> things. > >>>>>>>>> There are N nfsd threads already (N can be bumped up to 256 > >>>>>>>>> if > >>>>>>>>> you > >>>>>>>>> wish) > >>>>>>>>> and having a bunch more "cache trimming threads" would just > >>>>>>>>> increase > >>>>>>>>> contention, wouldn't it? > >>>>>>>> > >>>>>>>> Only one cache-trimming thread. 
The cache trim holds the > >>>>>>>> (global) > >>>>>>>> mutex for much longer than any individual nfsd service thread > >>>>>>>> has > >>>>>>>> any > >>>>>>>> need to, and having N threads doing that in parallel is why > >>>>>>>> it's > >>>>>>>> so > >>>>>>>> heavily contended. If there's only one thread doing the trim, > >>>>>>>> then > >>>>>>>> the nfsd service threads aren't spending time either > >>>>>>>> contending > >>>>>>>> on > >>>>>>>> the > >>>>>>>> mutex (it will be held less frequently and for shorter > >>>>>>>> periods). > >>>>>>>> > >>>>>>> I think the little drc2.patch which will keep the nfsd threads > >>>>>>> from > >>>>>>> acquiring the mutex and doing the trimming most of the time, > >>>>>>> might > >>>>>>> be > >>>>>>> sufficient. I still don't see why a separate trimming thread > >>>>>>> will > >>>>>>> be > >>>>>>> an advantage. I'd also be worried that the one cache trimming > >>>>>>> thread > >>>>>>> won't get the job done soon enough. > >>>>>>> > >>>>>>> When I did production testing on a 1Gbyte server that saw a > >>>>>>> peak > >>>>>>> load of about 100RPCs/sec, it was necessary to trim > >>>>>>> aggressively. > >>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is > >>>>>>> no > >>>>>>> longer relevant, I recently recall someone trying to run > >>>>>>> FreeBSD > >>>>>>> on a i486, although I doubt they wanted to run the nfsd on > >>>>>>> it.) > >>>>>>> > >>>>>>>>> The only negative effect I can think of w.r.t. having the > >>>>>>>>> nfsd > >>>>>>>>> threads doing it would be a (I believe negligible) increase > >>>>>>>>> in > >>>>>>>>> RPC > >>>>>>>>> response times (the time the nfsd thread spends trimming the > >>>>>>>>> cache). > >>>>>>>>> As noted, I think this time would be negligible compared to > >>>>>>>>> disk > >>>>>>>>> I/O > >>>>>>>>> and network transit times in the total RPC response time? > >>>>>>>> > >>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, > >>>>>>>> and > >>>>>>>> 10G > >>>>>>>> network connectivity, spinning on a contended mutex takes a > >>>>>>>> significant amount of CPU time. (For the current design of > >>>>>>>> the > >>>>>>>> NFS > >>>>>>>> server, it may actually be a win to turn off adaptive mutexes > >>>>>>>> -- > >>>>>>>> I > >>>>>>>> should give that a try once I'm able to do more testing.) > >>>>>>>> > >>>>>>> Have fun with it. Let me know when you have what you think is > >>>>>>> a > >>>>>>> good > >>>>>>> patch. > >>>>>>> > >>>>>>> rick > >>>>>>> > >>>>>>>> -GAWollman > >>>>>>>> _______________________________________________ > >>>>>>>> freebsd-hackers@freebsd.org mailing list > >>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>>>>>> To unsubscribe, send any mail to > >>>>>>>> "freebsd-hackers-unsubscr...@freebsd.org" > >>>>>>> _______________________________________________ > >>>>>>> freebsd...@freebsd.org mailing list > >>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs > >>>>>>> To unsubscribe, send any mail to > >>>>>>> "freebsd-fs-unsubscr...@freebsd.org" > >>>>>> > >>>>>> My quest for IOPS over NFS continues :) > >>>>>> So far I'm not able to achieve more than about 3000 8K read > >>>>>> requests > >>>>>> over NFS, > >>>>>> while the server locally gives much more. > >>>>>> And this is all from a file that is completely in ARC cache, no > >>>>>> disk > >>>>>> IO involved. > >>>>>> > >>>>> Just out of curiousity, why do you use 8K reads instead of 64K > >>>>> reads. 
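(For scale, assuming per-RPC overhead is what caps the throughput: the roughly 3000 8K reads/sec reported above works out to only about 24 MB/s, while the same RPC rate with 64K reads would be roughly 190 MB/s.)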
> >>>>> Since the RPC overhead (including the DRC functions) is per RPC, > >>>>> doing > >>>>> fewer larger RPCs should usually work better. (Sometimes large > >>>>> rsize/wsize > >>>>> values generate too large a burst of traffic for a network > >>>>> interface > >>>>> to > >>>>> handle and then the rsize/wsize has to be decreased to avoid > >>>>> this > >>>>> issue.) > >>>>> > >>>>> And, although this experiment seems useful for testing patches > >>>>> that > >>>>> try > >>>>> and reduce DRC CPU overheads, most "real" NFS servers will be > >>>>> doing > >>>>> disk > >>>>> I/O. > >>>>> > >>>> > >>>> This is the default blocksize the Oracle and probably most > >>>> databases > >>>> use. > >>>> It uses also larger blocks, but for small random reads in OLTP > >>>> applications this is what is used. > >>>> > >>> If the client is doing 8K reads, you could increase the read ahead > >>> "readahead=N" (N up to 16), to try and increase the bandwidth. > >>> (But if the CPU is 99% busy, then I don't think it will matter.) > >> > >> I'll try to check if this is possible to be set, as we are testing > >> not only with the Linux NFS client, > >> but also with the Oracle's built in so called DirectNFS client that > >> is built in to the app. > >> > >>> > >>>> > >>>>>> I've snatched some sample DTrace script from the net : [ > >>>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes > >>>>>> ] > >>>>>> > >>>>>> And modified it for our new NFS server : > >>>>>> > >>>>>> #!/usr/sbin/dtrace -qs > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:entry > >>>>>> { > >>>>>> self->ts = timestamp; > >>>>>> @counts[probefunc] = count(); > >>>>>> } > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:return > >>>>>> / self->ts > 0 / > >>>>>> { > >>>>>> this->delta = (timestamp-self->ts)/1000000; > >>>>>> } > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:return > >>>>>> / self->ts > 0 && this->delta > 100 / > >>>>>> { > >>>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50); > >>>>>> } > >>>>>> > >>>>>> fbt:kernel:nfsrvd_*:return > >>>>>> / self->ts > 0 / > >>>>>> { > >>>>>> @dist[probefunc, "ms"] = quantize(this->delta); > >>>>>> self->ts = 0; > >>>>>> } > >>>>>> > >>>>>> END > >>>>>> { > >>>>>> printf("\n"); > >>>>>> printa("function %-20s %@10d\n", @counts); > >>>>>> printf("\n"); > >>>>>> printa("function %s(), time in %s:%@d\n", @dist); > >>>>>> printf("\n"); > >>>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow); > >>>>>> } > >>>>>> > >>>>>> And here's a sample output from one or two minutes during the > >>>>>> run > >>>>>> of > >>>>>> Oracle's ORION benchmark > >>>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G > >>>>>> ethernet: > >>>>>> > >>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d > >>>>>> ^C > >>>>>> > >>>>>> function nfsrvd_access 4 > >>>>>> function nfsrvd_statfs 10 > >>>>>> function nfsrvd_getattr 14 > >>>>>> function nfsrvd_commit 76 > >>>>>> function nfsrvd_sentcache 110048 > >>>>>> function nfsrvd_write 110048 > >>>>>> function nfsrvd_read 283648 > >>>>>> function nfsrvd_dorpc 393800 > >>>>>> function nfsrvd_getcache 393800 > >>>>>> function nfsrvd_rephead 393800 > >>>>>> function nfsrvd_updatecache 393800 > >>>>>> > >>>>>> function nfsrvd_access(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_statfs(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_getattr(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_sentcache(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_rephead(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_updatecache(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800 > >>>>>> 1 | 0 > >>>>>> > >>>>>> function nfsrvd_getcache(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798 > >>>>>> 1 | 1 > >>>>>> 2 | 0 > >>>>>> 4 | 1 > >>>>>> 8 | 0 > >>>>>> > >>>>>> function nfsrvd_write(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039 > >>>>>> 1 | 5 > >>>>>> 2 | 4 > >>>>>> 4 | 0 > >>>>>> > >>>>>> function nfsrvd_read(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622 > >>>>>> 1 | 19 > >>>>>> 2 | 3 > >>>>>> 4 | 2 > >>>>>> 8 | 0 > >>>>>> 16 | 1 > >>>>>> 32 | 0 > >>>>>> 64 | 0 > >>>>>> 128 | 0 > >>>>>> 256 | 1 > >>>>>> 512 | 0 > >>>>>> > >>>>>> function nfsrvd_commit(), time in ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> -1 | 0 > >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44 > >>>>>> 1 |@@@@@@@ 14 > >>>>>> 2 | 0 > >>>>>> 4 |@ 1 > >>>>>> 8 |@ 1 > >>>>>> 16 | 0 > >>>>>> 32 |@@@@@@@ 14 > >>>>>> 64 |@ 2 > >>>>>> 128 | 0 > >>>>>> > >>>>>> > >>>>>> function nfsrvd_commit(), time in ms for >= 100 ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> < 100 | 0 > >>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >>>>>> 150 | 0 > >>>>>> > >>>>>> function nfsrvd_read(), time in ms for >= 100 ms: > >>>>>> value ------------- Distribution ------------- count > >>>>>> 250 | 0 > >>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 > >>>>>> 350 | 0 > >>>>>> > >>>>>> > >>>>>> Looks like the nfs server cache functions are quite fast, but > >>>>>> extremely frequently called. > >>>>>> > >>>>> Yep, they are called for every RPC. > >>>>> > >>>>> I may try coding up a patch that replaces the single mutex with > >>>>> one for each hash bucket, for TCP. > >>>>> > >>>>> I'll post if/when I get this patch to a testing/review stage, > >>>>> rick > >>>>> > >>>> > >>>> Cool. > >>>> > >>>> I've readjusted the precision of the dtrace script a bit, and I > >>>> can > >>>> see > >>>> now the following three functions as taking most of the time : > >>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache() > >>>> > >>>> This was recorded during a oracle benchmark run called SLOB, > >>>> which > >>>> caused 99% cpu load on the NFS server. > >>>> > >>> Even with the drc2.patch and a large value for > >>> vfs.nfsd.tcphighwater? > >>> (Assuming the mounts are TCP ones.) > >>> > >>> Have fun with it, rick > >>> > >> > >> I had upped it, but probably not enough. 
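For illustration only (this is not the actual drc2.patch code, and every name below is made up): the effect being discussed is that each nfsd thread does a cheap, unlocked size check and only takes the cache mutex to trim once the cache has grown past a highwater mark such as vfs.nfsd.tcphighwater. A minimal C sketch of that pattern:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* Sketch-only stand-ins for the real DRC state. */
static struct mtx sketch_cachemtx;              /* the global cache mutex */
static int sketch_cachesize;                    /* current number of cached replies */
static int sketch_tcphighwater = 100000;        /* think: vfs.nfsd.tcphighwater */

static void
sketch_trimcache(void)
{
        /*
         * Unlocked check first: while the cache is below the highwater
         * mark the nfsd thread returns without touching the mutex at all,
         * so the per-RPC trimming cost mostly disappears.
         */
        if (sketch_cachesize <= sketch_tcphighwater)
                return;

        mtx_lock(&sketch_cachemtx);
        /* ... walk the LRU list and free old entries until below the mark ... */
        mtx_unlock(&sketch_cachemtx);
}

The trade-off Rick mentions above still applies: with a very large highwater mark, a burst of traffic can leave a very large cache sitting in memory before anything gets trimmed.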
I'm now running with > >> vfs.nfsd.tcphighwater set > >> to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500. > >> So far it looks like good improvement as those functions no longer > >> show up in the dtrace script output. > >> I'll run some more benchmarks and testing today. > >> > >> Thanks! > >> > >>>> > >>>>>> I hope someone can find this information useful. > >>>>>> > >>>>>> _______________________________________________ > >>>>>> freebsd-hackers@freebsd.org mailing list > >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers > >>>>>> To unsubscribe, send any mail to > >>>>>> "freebsd-hackers-unsubscr...@freebsd.org" > >> > > > > I haven't had the opportunity today to run more DB tests over NFS as > > the DBA was busy with something else, > > however I tested a bit the large file transfers. > > And I'm seeing something strange probably not only NFS but also ZFS > > and ARC related. > > > > When I first tested the drc2.patch I reported a huge bandwidth > > improvement, > > but now I think that this was probably because of the machine > > freshly rebooted instead of just the patch. > > The patch surely improved things, especially CPU utilization > > combined with the increased cache. > > But today I'm again having a file completely cached in ZFS's ARC > > cache, which when transferred over NFS > > reaches about 300MB/s, when at some tests it reached 900+MB/s (as > > reported in my first email). > > The file locally can be read at about 3GB/s as reported by dd. > > > > Local: > > [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null > > bs=1M > > 30720+1 records in > > 30720+1 records out > > 32212262912 bytes transferred in 10.548485 secs (3053733573 > > bytes/sec) > > > > Over NFS: > > [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M > > 30720+1 records in > > 30720+1 records out > > 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s > > > > The machines are almost idle during this transfer and I can't see a > > reason why it can't reach the full bandwith when it's > > just reading it from RAM. > > > > I've tried again tracing with DTrace to see what's happening with > > this script : > > > > fbt:kernel:nfs*:entry > > { > > this->ts = timestamp; > > @counts[probefunc] = count(); > > } > > > > fbt:kernel:nfs*:return > > / this->ts > 0 / > > { > > @time[probefunc] = avg(timestamp - this->ts); > > } > > > > END > > { > > trunc(@counts, 10); > > trunc(@time, 10); > > printf("Top 10 called functions\n\n"); > > printa(@counts); > > printf("\n\nTop 10 slowest functions\n\n"); > > printa(@time); > > } > > > > And here's the result (several seconds during the dd test): > > > > Top 10 called functions > > nfsrc_freecache 88849 > > nfsrc_wanted 88849 > > nfsrv_fillattr 88849 > > nfsrv_postopattr 88849 > > nfsrvd_read 88849 > > nfsrvd_rephead 88849 > > nfsrvd_updatecache 88849 > > nfsvno_testexp 88849 > > nfsrc_trimcache 177697 > > nfsvno_getattr 177698 > > > > Top 10 slowest functions > > nfsd_excred 5673 > > nfsrc_freecache 5674 > > nfsrv_postopattr 5970 > > nfsrv_servertimer 6327 > > nfssvc_nfscommon 6596 > > nfsd_fhtovp 8000 > > nfsrvd_read 8380 > > nfssvc_program 92752 > > nfsvno_read 124979 > > nfsvno_fhtovp 1789523 > > > > I might try now to trace what nfsvno_fhtovp() is doing and where is > > spending it's time. 
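For illustration of the other change Rick mentions above (replacing the single mutex with one per hash bucket, for TCP), here is a rough sketch; the struct layout and names are invented for this example, and the real kernel code differs, notably in how it handles the shared LRU list:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define SKETCH_HASHSIZE 500             /* cf. NFSRVCACHE_HASHSIZE bumped to 500 above */

struct sketch_cacheentry;               /* stand-in for the real cache entry */

struct sketch_bucket {
        struct mtx lock;                /* protects only this bucket's list */
        LIST_HEAD(, sketch_cacheentry) head;
};

/* Each bucket lock would be mtx_init()'ed at nfsd startup; omitted here. */
static struct sketch_bucket sketch_hashtbl[SKETCH_HASHSIZE];

/*
 * Lookups lock only the bucket the xid hashes to, so nfsd threads working
 * on different buckets no longer contend with each other or with the
 * trimming code.
 */
static struct sketch_bucket *
sketch_bucket_lock(uint32_t xid)
{
        struct sketch_bucket *b;

        b = &sketch_hashtbl[xid % SKETCH_HASHSIZE];
        mtx_lock(&b->lock);
        return (b);
}

/*
 * Trimming takes one bucket lock at a time instead of holding a single
 * global mutex for the whole pass.  The complication is the LRU list that
 * all entries end up on, as Rick notes: it either needs its own lock or
 * has to become per-bucket as well.
 */
static void
sketch_trim(void)
{
        int i;

        for (i = 0; i < SKETCH_HASHSIZE; i++) {
                mtx_lock(&sketch_hashtbl[i].lock);
                /* ... free stale entries on sketch_hashtbl[i].head ... */
                mtx_unlock(&sketch_hashtbl[i].lock);
        }
}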
> > Any other ideas are welcome :)
>
> To take the network out of the equation I redid the test by mounting
> the same filesystem over NFS on the server:
>
> [18:23]root@goliath:~# mount -t nfs -o
> rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576
Just fyi, the maximum rsize,wsize is MAXBSIZE, which is 65536 for FreeBSD
currently. As I noted in the other email, I'd suggest
"rsize=65536,wsize=65536,readahead=16,...".
> localhost:/tank/spa_db/undo /mnt
> [18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
> [18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)
>
> During the first run I saw several nfsd threads in top, along with dd,
> and again zero disk I/O. There was an increase in memory usage because
> of the double buffering ARC->buffer cache.
> The second run was with all of the nfsd threads totally idle, and read
> directly from the buffer cache.

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"