Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-24 Thread Nikolay Denev

On Jan 24, 2013, at 4:24 PM, Wojciech Puchar  
wrote:
>> 
> Except it is on paper reliability.

This "on paper" reliability has saved my ass numerous times.
For example, I had one home NAS server with a flaky SATA controller that would
fail to detect one of the four drives from time to time on reboot.
This degraded the pool several times, and even rebooting with, say, disk4
missing into a situation where disk3 was missing did not corrupt any data.
I don't think this is possible with any other open source FS, let alone with
hardware RAID, which would drop the whole array because of this.
I have never personally lost any data on ZFS. Yes, performance is another
topic, and you must know what you are doing and what your usage pattern is,
but from a reliability standpoint ZFS looks more durable to me than anything
else.

P.S.: My home NAS has been running FreeBSD CURRENT with ZFS since the first
version available. Several drives have died, and twice the pool was expanded by
replacing all drives one by one and resilvering; not a single byte was lost.
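
For reference, the replace-and-resilver workflow described above looks roughly
like this; a minimal sketch, assuming a hypothetical pool "tank" and made-up
device names:

  zpool status tank              # identify the degraded vdev and the missing disk
  zpool replace tank ada3 ada4   # resilver the data onto the replacement drive
  zpool status tank              # watch resilver progress

  # and the scrub regimen from the subject line:
  zpool scrub tank
  zpool status -v tank           # lists any files with unrecoverable errors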




Re: ZFS regimen: scrub, scrub, scrub and scrub again.

2013-01-23 Thread Nikolay Denev

On Jan 23, 2013, at 11:09 PM, Mark Felder  wrote:

> On Wed, 23 Jan 2013 14:26:43 -0600, Chris Rees  wrote:
> 
>> 
>> So we have to take your word for it?
>> Provide a link if you're going to make assertions, or they're no more than
>> your own opinion.
> 
> I've heard this same thing -- every vdev == 1 drive in performance. I've 
> never seen any proof/papers on it though.


Here is a blog post that describes why this is true for IOPS:

http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
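
To make the rule of thumb concrete, here is a hedged sketch (made-up device
names) of two layouts built from the same twelve disks: a single wide raidz2
vdev gives roughly the random-read IOPS of one drive, while a stripe of six
mirror vdevs gives roughly six drives' worth, at the cost of capacity:

  # one 12-disk raidz2 vdev: ~1 drive of random IOPS
  zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11

  # six 2-way mirror vdevs: ~6 drives of random IOPS
  zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 \
      mirror da6 da7 mirror da8 da9 mirror da10 da11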




Re: pgbench performance is lagging compared to Linux and DragonflyBSD?

2012-11-08 Thread Nikolay Denev

On Nov 8, 2012, at 12:56 PM, Wojciech Puchar  
wrote:

>> EC> That thread starts here:
>> EC> http://lists.freebsd.org/pipermail/freebsd-arch/2010-April/010143.html
>> Year 2010! And we still limited by MAXPHYS (128K) transfers :(
> put
> options MAXPHYS=2097152
> in your kernel config.
> 
> EVERYTHING works in all production machines for over a year
> 
> 
> the only exception is my laptop with OCZ petrol SSD that hangs on any 
> transfer >1MB, i've set it to 0.5MB here.

Have you measured the performance increase?
I'm also interested in a bigger MAXBSIZE, as this is what the NFS server uses
as its maximum transfer size. Linux and Solaris can do up to 1MB.
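
For anyone wanting to experiment, a sketch of the tuning being discussed,
assuming a custom kernel configuration and a hypothetical export; as noted
above, not every controller tolerates large transfers:

  # in the custom kernel config file, then rebuild and install the kernel
  options MAXPHYS=2097152    # allow 2 MB physical transfers

  # on the NFS client, request large transfers (the server still caps them)
  mount -t nfs -o nfsv3,tcp,rsize=1048576,wsize=1048576 server:/export /mnt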



Re: pgbench performance is lagging compared to Linux and DragonflyBSD?

2012-11-07 Thread Nikolay Denev

On Nov 7, 2012, at 4:48 PM, Wojciech Puchar  
wrote:

>>> 
>>> actually FreeBSD defaults are actually good for COMMON usage. and can be 
>>> tuned.
>>> 
>>> default MAXBSIZE is one exception.
>> 
>> "Common usage" is vague. While FreeBSD might do ok for some applications 
>> (dev box, simple workstation/laptop, etc), there are other areas that 
>> require additional tuning to get better perf that arguably shouldn't as much 
>> (or there should be templates for doing so): 10GbE and mbuf and network 
>> tuning; file server and file descriptor, network tuning, etc; low latency 
>> desktop and scheduler tweaking; etc.
> 
> still any idea why MAXBSIZE is 128kB by default. for modern hard disk it is a 
> disaster. 2 or even 4 megabyte is OK.
> 
>> 
>> Not to say that freebsd is entirely at fault, but because it's more of a 
>> commodity OS that Linux, more tweaking is required...
> actually IMHO much more tweaking is needed with linux, at least from what i 
> know from other people. And they are not newbies

Actually, MAXBSIZE is 64k and MAXPHYS is 128k.

There was a thread about NFS performance where it was mentioned that a bigger
MAXBSIZE leads to KVA fragmentation.
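
For reference, a quick way to confirm those compile-time defaults against your
own source tree (assuming a standard /usr/src checkout):

  grep -n 'MAXBSIZE' /usr/src/sys/sys/param.h
  grep -n 'MAXPHYS'  /usr/src/sys/sys/param.h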



Re: NFS server bottlenecks

2012-10-22 Thread Nikolay Denev

On Oct 23, 2012, at 2:36 AM, Rick Macklem  wrote:

> Ivan Voras wrote:
>> On 20 October 2012 13:42, Nikolay Denev  wrote:
>> 
>>> Here are the results from testing both patches :
>>> http://home.totalterror.net/freebsd/nfstest/results.html
>>> Both tests ran for about 14 hours ( a bit too much, but I wanted to
>>> compare different zfs recordsize settings ),
>>> and were done first after a fresh reboot.
>>> The only noticeable difference seems to be much more context
>>> switches with Ivan's patch.
>> 
>> Thank you very much for your extensive testing!
>> 
>> I don't know how to interpret the rise in context switches; as this is
>> kernel code, I'd expect no context switches. I hope someone else can
>> explain.
>> 
>> But, you have also shown that my patch doesn't do any better than
>> Rick's even on a fairly large configuration, so I don't think there's
>> value in adding the extra complexity, and Rick knows NFS much better
>> than I do.
>> 
>> But there are a few things other than that I'm interested in: like why
>> does your load average spike almost to 20-ties, and how come that with
>> 24 drives in RAID-10 you only push through 600 MBit/s through the 10
>> GBit/s Ethernet. Have you tested your drive setup locally (AESNI
>> shouldn't be a bottleneck, you should be able to encrypt well into
>> Gbyte/s range) and the network?
>> 
>> If you have the time, could you repeat the tests but with a recent
>> Samba server and a CIFS mount on the client side? This is probably not
>> important, but I'm just curious of how would it perform on your
>> machine.
> 
> Oh, I realized that, if you are testing 9/stable (and not head), that
> you won't have r227809. Without that, all reads on a given file will
> be serialized, because the server will acquire an exclusive lock on
> the vnode.
> 
> The patch for r227809 in head is at:
>  http://people.freebsd.org/~rmacklem/lkshared.patch
> This should apply fine to a 9 system (but not 8.n), I think.
> 
> Good luck with it and have fun, rick
> 

Thanks, I've applied the patch by hand because of some differences, and I'm now
rebuilding.

In case they are still needed, here are the "dd" tests with a loopback UDP mount:

http://home.totalterror.net/freebsd/nfstest/udp-dd.html

Over UDP, writing degrades much more severely...
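
For anyone reproducing this, the two loopback mounts being compared look
roughly like the following (a sketch reusing the export path from the earlier
dd test; the write target is made up):

  # TCP mount, as in the earlier tests
  mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 localhost:/tank/spa_db/undo /mnt

  # UDP mount for comparison
  mount -t nfs -o rw,hard,intr,udp,nfsv3 localhost:/tank/spa_db/undo /mnt

  # same dd-style write test against either mount
  dd if=/dev/zero of=/mnt/testfile bs=1M count=8192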


Re: NFS server bottlenecks

2012-10-20 Thread Nikolay Denev

On Oct 20, 2012, at 10:45 PM, Outback Dingo  wrote:

> On Sat, Oct 20, 2012 at 3:28 PM, Ivan Voras  wrote:
>> On 20 October 2012 14:45, Rick Macklem  wrote:
>>> Ivan Voras wrote:
>> 
 I don't know how to interpret the rise in context switches; as this is
 kernel code, I'd expect no context switches. I hope someone else can
 explain.
 
>>> Don't the mtx_lock() calls spin for a little while and then context
>>> switch if another thread still has it locked?
>> 
>> Yes, but are in-kernel context switches also counted? I was assuming
>> they are light-weight enough not to count.
>> 
>>> Hmm, I didn't look, but were there any tests using UDP mounts?
>>> (I would have thought that your patch would mainly affect UDP mounts,
>>> since that is when my version still has the single LRU queue/mutex.
>> 
>> Another assumption - I thought UDP was the default.
>> 
>>> As I think you know, my concern with your patch would be correctness
>>> for UDP, not performance.)
>> 
>> Yes.
> 
> Ive got a similar box config here, with 2x 10GB intel nics, and 24 2TB
> drives on an LSI controller.
> Im watching the thread patiently, im kinda looking for results, and
> answers, Though Im also tempted to
> run benchmarks on my system also see if i get similar results I also
> considered that netmap might be one
> but not quite sure if it would help NFS, since its to hard to tell if
> its a network bottle neck, though it appears
> to be network related.
> 

It doesn't look like a network issue to me. From my observations it's more like
some overhead in NFS and ARC.
The boxes easily push 10G with a simple iperf test.
Running two iperf tests, one over each port of the dual-ported 10G NICs, gives
960MB/sec regardless of which machine is the server.
Also, I've seen over 960MB/sec over NFS with this setup, but I can't pin down
what type of workload was able to do it.
At some point I was able to do it with a simple dd; then after a reboot I was no
longer able to push that much traffic.
I'm thinking something like ARC/kmem fragmentation might be the issue?
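
A hedged way to poke at that hypothesis from userspace (standard sysctl and
vmstat output; interpreting it is the harder part):

  sysctl kstat.zfs.misc.arcstats | egrep 'size|hits|misses'   # ARC size and hit/miss counters
  vmstat -m | grep -i solaris     # kernel malloc usage by the ZFS/opensolaris layer
  vmstat -z | egrep -i 'zio|arc'  # UMA zones used by ZFS, if any are in use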
 



Re: NFS server bottlenecks

2012-10-20 Thread Nikolay Denev

On Oct 20, 2012, at 4:00 PM, Nikolay Denev  wrote:

> 
> On Oct 20, 2012, at 3:11 PM, Ivan Voras  wrote:
> 
>> On 20 October 2012 13:42, Nikolay Denev  wrote:
>> 
>>> Here are the results from testing both patches : 
>>> http://home.totalterror.net/freebsd/nfstest/results.html
>>> Both tests ran for about 14 hours ( a bit too much, but I wanted to compare 
>>> different zfs recordsize settings ),
>>> and were done first after a fresh reboot.
>>> The only noticeable difference seems to be much more context switches with 
>>> Ivan's patch.
>> 
>> Thank you very much for your extensive testing!
>> 
>> I don't know how to interpret the rise in context switches; as this is
>> kernel code, I'd expect no context switches. I hope someone else can
>> explain.
>> 
>> But, you have also shown that my patch doesn't do any better than
>> Rick's even on a fairly large configuration, so I don't think there's
>> value in adding the extra complexity, and Rick knows NFS much better
>> than I do.
>> 
>> But there are a few things other than that I'm interested in: like why
>> does your load average spike almost to 20-ties, and how come that with
>> 24 drives in RAID-10 you only push through 600 MBit/s through the 10
>> GBit/s Ethernet. Have you tested your drive setup locally (AESNI
>> shouldn't be a bottleneck, you should be able to encrypt well into
>> Gbyte/s range) and the network?
>> 
>> If you have the time, could you repeat the tests but with a recent
>> Samba server and a CIFS mount on the client side? This is probably not
>> important, but I'm just curious of how would it perform on your
>> machine.
> 
> The first iozone local run finished, I'll paste just the result here, and 
> also the same test over NFS for comparison:
> (This is iozone doing 8k sized IO ops, on ZFS dataset with recordsize=8k)
> 
> NFS:
>                                                      random  random    bkwd  record  stride
>         KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read
>   33554432       8    4973     5522    2930    2906    2908    3886
> 
> Local:
>                                                      random  random    bkwd  record  stride
>         KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read
>   33554432       8   34740    41390  135442  142534   24992   12493
> 
> 
> P.S.: I forgot to mention that the network is with 9K mtu.


Here are the full results of the test on the local fs :

http://home.totalterror.net/freebsd/nfstest/local_fs/

I'm now running the same test on an NFS mount over the loopback interface on
the NFS server machine.



Re: NFS server bottlenecks

2012-10-20 Thread Nikolay Denev

On Oct 20, 2012, at 3:11 PM, Ivan Voras  wrote:

> On 20 October 2012 13:42, Nikolay Denev  wrote:
> 
>> Here are the results from testing both patches : 
>> http://home.totalterror.net/freebsd/nfstest/results.html
>> Both tests ran for about 14 hours ( a bit too much, but I wanted to compare 
>> different zfs recordsize settings ),
>> and were done first after a fresh reboot.
>> The only noticeable difference seems to be much more context switches with 
>> Ivan's patch.
> 
> Thank you very much for your extensive testing!
> 
> I don't know how to interpret the rise in context switches; as this is
> kernel code, I'd expect no context switches. I hope someone else can
> explain.
> 
> But, you have also shown that my patch doesn't do any better than
> Rick's even on a fairly large configuration, so I don't think there's
> value in adding the extra complexity, and Rick knows NFS much better
> than I do.
> 
> But there are a few things other than that I'm interested in: like why
> does your load average spike almost to 20-ties, and how come that with
> 24 drives in RAID-10 you only push through 600 MBit/s through the 10
> GBit/s Ethernet. Have you tested your drive setup locally (AESNI
> shouldn't be a bottleneck, you should be able to encrypt well into
> Gbyte/s range) and the network?
> 
> If you have the time, could you repeat the tests but with a recent
> Samba server and a CIFS mount on the client side? This is probably not
> important, but I'm just curious of how would it perform on your
> machine.

The first local iozone run finished; I'll paste just the result here, along
with the same test over NFS for comparison
(this is iozone doing 8k-sized I/O ops on a ZFS dataset with recordsize=8k):

NFS:
                                                     random  random    bkwd  record  stride
        KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read
  33554432       8    4973     5522    2930    2906    2908    3886

Local:
                                                     random  random    bkwd  record  stride
        KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read
  33554432       8   34740    41390  135442  142534   24992   12493



P.S.: I forgot to mention that the network uses a 9K MTU.
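
For completeness, a sketch of the jumbo-frame setup assumed above; the
interface name is made up, and the switch must also pass 9K frames:

  ifconfig ix0 mtu 9000
  # or persistently in /etc/rc.conf:
  # ifconfig_ix0="inet 10.0.0.1/24 mtu 9000"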


Re: NFS server bottlenecks

2012-10-20 Thread Nikolay Denev

On Oct 20, 2012, at 3:11 PM, Ivan Voras  wrote:

> On 20 October 2012 13:42, Nikolay Denev  wrote:
> 
>> Here are the results from testing both patches : 
>> http://home.totalterror.net/freebsd/nfstest/results.html
>> Both tests ran for about 14 hours ( a bit too much, but I wanted to compare 
>> different zfs recordsize settings ),
>> and were done first after a fresh reboot.
>> The only noticeable difference seems to be much more context switches with 
>> Ivan's patch.
> 
> Thank you very much for your extensive testing!
> 
> I don't know how to interpret the rise in context switches; as this is
> kernel code, I'd expect no context switches. I hope someone else can
> explain.
> 
> But, you have also shown that my patch doesn't do any better than
> Rick's even on a fairly large configuration, so I don't think there's
> value in adding the extra complexity, and Rick knows NFS much better
> than I do.
> 
> But there are a few things other than that I'm interested in: like why
> does your load average spike almost to 20-ties, and how come that with
> 24 drives in RAID-10 you only push through 600 MBit/s through the 10
> GBit/s Ethernet. Have you tested your drive setup locally (AESNI
> shouldn't be a bottleneck, you should be able to encrypt well into
> Gbyte/s range) and the network?
> 
> If you have the time, could you repeat the tests but with a recent
> Samba server and a CIFS mount on the client side? This is probably not
> important, but I'm just curious of how would it perform on your
> machine.

I've now started this test locally.
From previous iozone runs I remember the local speed being much better, but I
will wait for this test to finish, as that will make for a fairer comparison.

But I think there is still something fishy… I have cases where I have reached
1000MB/s over NFS (from network stats, not local machine stats), but sometimes
it is very slow even for a file that is completely in ARC. Rick mentioned that
this could be due to RPC overhead and network round-trip time, but earlier in
this thread I did a test on the server alone by mounting the NFS-exported ZFS
dataset locally and running some tests with "dd":

> To take the network out of the equation I redid the test by mounting the same 
> filesystem over NFS on the server:
> 
> [18:23]root@goliath:~#  mount -t nfs -o 
> rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 
> localhost:/tank/spa_db/undo /mnt
> [18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M 
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
> [18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)
> 
> During the first run I saw several nfsd threads in top, along with dd and 
> again zero disk I/O.
> There was an increase in memory usage because of the double buffering 
> ARC->buffercache.
> The second run was with all of the nfsd threads totally idle, and read 
> directly from the buffercache.





Re: NFS server bottlenecks

2012-10-20 Thread Nikolay Denev

On Oct 18, 2012, at 6:11 PM, Nikolay Denev  wrote:

> 
> On Oct 15, 2012, at 5:34 PM, Ivan Voras  wrote:
> 
>> On 15 October 2012 16:31, Nikolay Denev  wrote:
>>> 
>>> On Oct 15, 2012, at 2:52 PM, Ivan Voras  wrote:
>> 
>>>> http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch
>>>> 
>>>> It should apply to HEAD without Rick's patches.
>>>> 
>>>> It's a bit different approach than Rick's, breaking down locks even more.
>>> 
>>> Applied and compiled OK, I will be able to test it tomorrow.
>> 
>> Ok, thanks!
>> 
>> The differences should be most visible in edge cases with a larger
>> number of nfsd processes (16+) and many CPU cores.
> 
> I'm now rebooting with your patch, and hopefully will have some results 
> tomorrow.
> 

Here are the results from testing both patches:
http://home.totalterror.net/freebsd/nfstest/results.html
Both tests ran for about 14 hours (a bit too much, but I wanted to compare
different zfs recordsize settings), and each was done right after a fresh reboot.
The only noticeable difference seems to be many more context switches with
Ivan's patch.
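
For anyone wanting to watch this while a test runs, a rough sketch of how the
context-switch rate can be observed on the server (system-wide counters, so an
otherwise idle box helps):

  vmstat 1       # the "cs" column is context switches per second
  top -SH        # -S shows kernel processes, -H shows the individual nfsd threads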



Re: NFS server bottlenecks

2012-10-18 Thread Nikolay Denev

On Oct 15, 2012, at 5:34 PM, Ivan Voras  wrote:

> On 15 October 2012 16:31, Nikolay Denev  wrote:
>> 
>> On Oct 15, 2012, at 2:52 PM, Ivan Voras  wrote:
> 
>>> http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch
>>> 
>>> It should apply to HEAD without Rick's patches.
>>> 
>>> It's a bit different approach than Rick's, breaking down locks even more.
>> 
>> Applied and compiled OK, I will be able to test it tomorrow.
> 
> Ok, thanks!
> 
> The differences should be most visible in edge cases with a larger
> number of nfsd processes (16+) and many CPU cores.

I'm now rebooting with your patch, and hopefully will have some results 
tomorrow.



Re: syncing large mmaped files

2012-10-18 Thread Nikolay Denev

On Oct 18, 2012, at 3:08 AM, Tristan Verniquet  wrote:

> 
> I want to work with large (1-10G) files in memory but eventually sync them 
> back out to disk. The problem is that the sync process appears to lock the 
> file in kernel for the duration of the sync, which can run into minutes. This 
> prevents other processes from reading from the file (unless they already have 
> it mapped) for this whole time. Is there any way to prevent this? I think I 
> read in a post somewhere about openbsd implementing partial-writes when it 
> hits a file with lots of dirty pages in order to prevent this. Is there 
> anything available for FreeBSD or is there another way around it?
> 
> Sorry if this is the wrong mailing list.
> 

Isn't msync(2) what you are looking for?


Re: NFS server bottlenecks

2012-10-15 Thread Nikolay Denev

On Oct 15, 2012, at 2:52 PM, Ivan Voras  wrote:

> On 13/10/2012 17:22, Nikolay Denev wrote:
> 
>> drc3.patch applied and build cleanly and shows nice improvement!
>> 
>> I've done a quick benchmark using iozone over the NFS mount from the Linux 
>> host.
>> 
> 
> Hi,
> 
> If you are already testing, could you please also test this patch:
> 
> http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch
> 
> It should apply to HEAD without Rick's patches.
> 
> It's a bit different approach than Rick's, breaking down locks even more.
> 

Applied and compiled OK, I will be able to test it tomorrow.



Re: NFS server bottlenecks

2012-10-15 Thread Nikolay Denev

On Oct 15, 2012, at 2:52 PM, Ivan Voras  wrote:

> On 13/10/2012 17:22, Nikolay Denev wrote:
> 
>> drc3.patch applied and build cleanly and shows nice improvement!
>> 
>> I've done a quick benchmark using iozone over the NFS mount from the Linux 
>> host.
>> 
> 
> Hi,
> 
> If you are already testing, could you please also test this patch:
> 
> http://people.freebsd.org/~ivoras/diffs/nfscache_lock.patch
> 
> It should apply to HEAD without Rick's patches.
> 
> It's a bit different approach than Rick's, breaking down locks even more.
> 

I will try to apply it to RELENG_9 as that's what I'm running and compare the 
results.



Re: NFS server bottlenecks

2012-10-13 Thread Nikolay Denev

On Oct 13, 2012, at 5:05 AM, Rick Macklem  wrote:

> I wrote:
>> Oops, I didn't get the "readahead" option description
>> quite right in the last post. The default read ahead
>> is 1, which does result in "rsize * 2", since there is
>> the read + 1 readahead.
>> 
>> "rsize * 16" would actually be for the option "readahead=15"
>> and for "readahead=16" the calculation would be "rsize * 17".
>> 
>> However, the example was otherwise ok, I think? rick
> 
> I've attached the patch drc3.patch (it assumes drc2.patch has already been
> applied) that replaces the single mutex with one for each hash list
> for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.
> 
> These patches are also at:
>  http://people.freebsd.org/~rmacklem/drc2.patch
>  http://people.freebsd.org/~rmacklem/drc3.patch
> in case the attachments don't get through.
> 
> rick
> ps: I haven't tested drc3.patch a lot, but I think it's ok?

drc3.patch applied and built cleanly and shows a nice improvement!

I've done a quick benchmark using iozone over the NFS mount from the Linux host.

drc2.patch (but with NFSRVCACHE_HASHSIZE=500)

TEST WITH 8K

-
Auto Mode
Using Minimum Record Size 8 KB
Using Maximum Record Size 8 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode. 
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 
-i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                     random  random    bkwd  record  stride
        KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read   fwrite frewrite   fread  freread
   2097152       8    1919     1914    2356    2321    2335    1706


TEST WITH 1M

-
Auto Mode
Using Minimum Record Size 1024 KB
Using Maximum Record Size 1024 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode. 
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 
-i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                     random  random    bkwd  record  stride
        KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read   fwrite frewrite   fread  freread
   2097152    1024      73       64     477     486     496      61



drc3.patch

TEST WITH 8K

-
Auto Mode
Using Minimum Record Size 8 KB
Using Maximum Record Size 8 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode. 
OPS Mode. Output is in operations per second.
Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 
-i 1 -i 2
Time Resolution = 0.01 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
                                                     random  random    bkwd  record  stride
        KB  reclen   write  rewrite    read  reread    read   write    read  rewrite    read   fwrite frewrite   fread  freread
   2097152       8    2108     2397    3001    3013    3010    2389



TEST WITH 1M

-
Auto Mode
Using Minimum Record Size 1024 KB
Using Maximum Record Size 1024 KB
Using minimum file size of 2097152 kilobytes.
Using maximum file size of 2097152 kilobytes.
O_DIRECT feature enabled
SYNC Mode. 
OPS Mode. Output is in operations per second.
Command line use

Re: NFS server bottlenecks

2012-10-11 Thread Nikolay Denev

On Oct 11, 2012, at 7:20 PM, Nikolay Denev  wrote:

> On Oct 11, 2012, at 8:46 AM, Nikolay Denev  wrote:
> 
>> 
>> On Oct 11, 2012, at 1:09 AM, Rick Macklem  wrote:
>> 
>>> Nikolay Denev wrote:
>>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem 
>>>> wrote:
>>>> 
>>>>> Nikolay Denev wrote:
>>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem 
>>>>>> wrote:
>>>>>> 
>>>>>>> Garrett Wollman wrote:
>>>>>>>> <>>>>>>>  said:
>>>>>>>> 
>>>>>>>>>> Simple: just use a sepatate mutex for each list that a cache
>>>>>>>>>> entry
>>>>>>>>>> is on, rather than a global lock for everything. This would
>>>>>>>>>> reduce
>>>>>>>>>> the mutex contention, but I'm not sure how significantly since
>>>>>>>>>> I
>>>>>>>>>> don't have the means to measure it yet.
>>>>>>>>>> 
>>>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>>>> lists,
>>>>>>>>> I
>>>>>>>>> don't
>>>>>>>>> see how that can be done with a global lock for list updates?
>>>>>>>> 
>>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>>> process only looks at one list at a time, so not locking the list
>>>>>>>> that
>>>>>>>> isn't being iterated over probably wouldn't hurt, unless there's
>>>>>>>> some
>>>>>>>> mechanism (that I didn't see) for entries to move from one list
>>>>>>>> to
>>>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>>>> "list". (One issue to worry about in that case would be
>>>>>>>> cache-line
>>>>>>>> contention in the array of hash buckets; perhaps
>>>>>>>> NFSRVCACHE_HASHSIZE
>>>>>>>> ought to be increased to reduce that.)
>>>>>>>> 
>>>>>>> Yea, a separate mutex for each hash list might help. There is also
>>>>>>> the
>>>>>>> LRU list that all entries end up on, that gets used by the
>>>>>>> trimming
>>>>>>> code.
>>>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>>>> looked
>>>>>>> at
>>>>>>> it in a while.)
>>>>>>> 
>>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>>> especially
>>>>>>> if you reduce how aggressively the cache is trimmed.
>>>>>>> 
>>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>>> bursts of
>>>>>>>>> traffic arrives.
>>>>>>>> 
>>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>>> 
>>>>>>> This code was originally "production tested" on a server with
>>>>>>> 1Gbyte,
>>>>>>> so times have changed a bit;-)
>>>>>>> 
>>>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>>>> improve
>>>>>>>>> things.
>>>>>>>>> There are N nfsd threads already (N can be bumped up to 256 if
>>>>>>>>> you
>>>>>>>>> wish)
>>>>>>>>> and having a bunch more "cache trimming threads" would just
>>>>>>>>> increase
>>>>>>>>> contention, wouldn't it?
>>>>>>>> 
>>>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>>>> mutex for much longer than any individual nfsd service thread has
>>>>>>>> any
>>>>>>>> need to, and having N threads doing that in parallel is why it's
>>>>>>>

Re: NFS server bottlenecks

2012-10-11 Thread Nikolay Denev
On Oct 11, 2012, at 8:46 AM, Nikolay Denev  wrote:

> 
> On Oct 11, 2012, at 1:09 AM, Rick Macklem  wrote:
> 
>> Nikolay Denev wrote:
>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem 
>>> wrote:
>>> 
>>>> Nikolay Denev wrote:
>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem 
>>>>> wrote:
>>>>> 
>>>>>> Garrett Wollman wrote:
>>>>>>> <>>>>>>  said:
>>>>>>> 
>>>>>>>>> Simple: just use a sepatate mutex for each list that a cache
>>>>>>>>> entry
>>>>>>>>> is on, rather than a global lock for everything. This would
>>>>>>>>> reduce
>>>>>>>>> the mutex contention, but I'm not sure how significantly since
>>>>>>>>> I
>>>>>>>>> don't have the means to measure it yet.
>>>>>>>>> 
>>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>>> lists,
>>>>>>>> I
>>>>>>>> don't
>>>>>>>> see how that can be done with a global lock for list updates?
>>>>>>> 
>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>> process only looks at one list at a time, so not locking the list
>>>>>>> that
>>>>>>> isn't being iterated over probably wouldn't hurt, unless there's
>>>>>>> some
>>>>>>> mechanism (that I didn't see) for entries to move from one list
>>>>>>> to
>>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>>> "list". (One issue to worry about in that case would be
>>>>>>> cache-line
>>>>>>> contention in the array of hash buckets; perhaps
>>>>>>> NFSRVCACHE_HASHSIZE
>>>>>>> ought to be increased to reduce that.)
>>>>>>> 
>>>>>> Yea, a separate mutex for each hash list might help. There is also
>>>>>> the
>>>>>> LRU list that all entries end up on, that gets used by the
>>>>>> trimming
>>>>>> code.
>>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>>> looked
>>>>>> at
>>>>>> it in a while.)
>>>>>> 
>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>> especially
>>>>>> if you reduce how aggressively the cache is trimmed.
>>>>>> 
>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>> bursts of
>>>>>>>> traffic arrives.
>>>>>>> 
>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>> 
>>>>>> This code was originally "production tested" on a server with
>>>>>> 1Gbyte,
>>>>>> so times have changed a bit;-)
>>>>>> 
>>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>>> improve
>>>>>>>> things.
>>>>>>>> There are N nfsd threads already (N can be bumped up to 256 if
>>>>>>>> you
>>>>>>>> wish)
>>>>>>>> and having a bunch more "cache trimming threads" would just
>>>>>>>> increase
>>>>>>>> contention, wouldn't it?
>>>>>>> 
>>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>>> mutex for much longer than any individual nfsd service thread has
>>>>>>> any
>>>>>>> need to, and having N threads doing that in parallel is why it's
>>>>>>> so
>>>>>>> heavily contended. If there's only one thread doing the trim,
>>>>>>> then
>>>>>>> the nfsd service threads aren't spending time either contending
>>>>>>> on
>>>>>>> the
>>>>>>> mutex (it will be held less frequently and for shorter periods).
>>>>>

Re: NFS server bottlenecks

2012-10-10 Thread Nikolay Denev

On Oct 11, 2012, at 1:09 AM, Rick Macklem  wrote:

> Nikolay Denev wrote:
>> On Oct 10, 2012, at 3:18 AM, Rick Macklem 
>> wrote:
>> 
>>> Nikolay Denev wrote:
>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem 
>>>> wrote:
>>>> 
>>>>> Garrett Wollman wrote:
>>>>>> <>>>>>  said:
>>>>>> 
>>>>>>>> Simple: just use a sepatate mutex for each list that a cache
>>>>>>>> entry
>>>>>>>> is on, rather than a global lock for everything. This would
>>>>>>>> reduce
>>>>>>>> the mutex contention, but I'm not sure how significantly since
>>>>>>>> I
>>>>>>>> don't have the means to measure it yet.
>>>>>>>> 
>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>> lists,
>>>>>>> I
>>>>>>> don't
>>>>>>> see how that can be done with a global lock for list updates?
>>>>>> 
>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>> process only looks at one list at a time, so not locking the list
>>>>>> that
>>>>>> isn't being iterated over probably wouldn't hurt, unless there's
>>>>>> some
>>>>>> mechanism (that I didn't see) for entries to move from one list
>>>>>> to
>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>> "list". (One issue to worry about in that case would be
>>>>>> cache-line
>>>>>> contention in the array of hash buckets; perhaps
>>>>>> NFSRVCACHE_HASHSIZE
>>>>>> ought to be increased to reduce that.)
>>>>>> 
>>>>> Yea, a separate mutex for each hash list might help. There is also
>>>>> the
>>>>> LRU list that all entries end up on, that gets used by the
>>>>> trimming
>>>>> code.
>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>> looked
>>>>> at
>>>>> it in a while.)
>>>>> 
>>>>> Also, increasing the hash table size is probably a good idea,
>>>>> especially
>>>>> if you reduce how aggressively the cache is trimmed.
>>>>> 
>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>> bursts of
>>>>>>> traffic arrives.
>>>>>> 
>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>> 
>>>>> This code was originally "production tested" on a server with
>>>>> 1Gbyte,
>>>>> so times have changed a bit;-)
>>>>> 
>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>> improve
>>>>>>> things.
>>>>>>> There are N nfsd threads already (N can be bumped up to 256 if
>>>>>>> you
>>>>>>> wish)
>>>>>>> and having a bunch more "cache trimming threads" would just
>>>>>>> increase
>>>>>>> contention, wouldn't it?
>>>>>> 
>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>> mutex for much longer than any individual nfsd service thread has
>>>>>> any
>>>>>> need to, and having N threads doing that in parallel is why it's
>>>>>> so
>>>>>> heavily contended. If there's only one thread doing the trim,
>>>>>> then
>>>>>> the nfsd service threads aren't spending time either contending
>>>>>> on
>>>>>> the
>>>>>> mutex (it will be held less frequently and for shorter periods).
>>>>>> 
>>>>> I think the little drc2.patch which will keep the nfsd threads
>>>>> from
>>>>> acquiring the mutex and doing the trimming most of the time, might
>>>>> be
>>>>> sufficient. I still don't see why a separate trimming thread will
>>>>> be
>>>>> an advantage. I'd also be worried 

Re: NFS server bottlenecks

2012-10-10 Thread Nikolay Denev

On Oct 10, 2012, at 3:18 AM, Rick Macklem  wrote:

> Nikolay Denev wrote:
>> On Oct 4, 2012, at 12:36 AM, Rick Macklem 
>> wrote:
>> 
>>> Garrett Wollman wrote:
>>>> <>>>  said:
>>>> 
>>>>>> Simple: just use a sepatate mutex for each list that a cache
>>>>>> entry
>>>>>> is on, rather than a global lock for everything. This would
>>>>>> reduce
>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>> don't have the means to measure it yet.
>>>>>> 
>>>>> Well, since the cache trimming is removing entries from the lists,
>>>>> I
>>>>> don't
>>>>> see how that can be done with a global lock for list updates?
>>>> 
>>>> Well, the global lock is what we have now, but the cache trimming
>>>> process only looks at one list at a time, so not locking the list
>>>> that
>>>> isn't being iterated over probably wouldn't hurt, unless there's
>>>> some
>>>> mechanism (that I didn't see) for entries to move from one list to
>>>> another. Note that I'm considering each hash bucket a separate
>>>> "list". (One issue to worry about in that case would be cache-line
>>>> contention in the array of hash buckets; perhaps
>>>> NFSRVCACHE_HASHSIZE
>>>> ought to be increased to reduce that.)
>>>> 
>>> Yea, a separate mutex for each hash list might help. There is also
>>> the
>>> LRU list that all entries end up on, that gets used by the trimming
>>> code.
>>> (I think? I wrote this stuff about 8 years ago, so I haven't looked
>>> at
>>> it in a while.)
>>> 
>>> Also, increasing the hash table size is probably a good idea,
>>> especially
>>> if you reduce how aggressively the cache is trimmed.
>>> 
>>>>> Only doing it once/sec would result in a very large cache when
>>>>> bursts of
>>>>> traffic arrives.
>>>> 
>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>> 
>>> This code was originally "production tested" on a server with
>>> 1Gbyte,
>>> so times have changed a bit;-)
>>> 
>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>> things.
>>>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>>>> wish)
>>>>> and having a bunch more "cache trimming threads" would just
>>>>> increase
>>>>> contention, wouldn't it?
>>>> 
>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>> mutex for much longer than any individual nfsd service thread has
>>>> any
>>>> need to, and having N threads doing that in parallel is why it's so
>>>> heavily contended. If there's only one thread doing the trim, then
>>>> the nfsd service threads aren't spending time either contending on
>>>> the
>>>> mutex (it will be held less frequently and for shorter periods).
>>>> 
>>> I think the little drc2.patch which will keep the nfsd threads from
>>> acquiring the mutex and doing the trimming most of the time, might
>>> be
>>> sufficient. I still don't see why a separate trimming thread will be
>>> an advantage. I'd also be worried that the one cache trimming thread
>>> won't get the job done soon enough.
>>> 
>>> When I did production testing on a 1Gbyte server that saw a peak
>>> load of about 100RPCs/sec, it was necessary to trim aggressively.
>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>> longer relevant, I recently recall someone trying to run FreeBSD
>>> on a i486, although I doubt they wanted to run the nfsd on it.)
>>> 
>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>> response times (the time the nfsd thread spends trimming the
>>>>> cache).
>>>>> As noted, I think this time would be negligible compared to disk
>>>>> I/O
>>>>> and network transit times in the total RPC response time?
>>>> 
>>>> With adaptive mutexes, many

Re: NFS server bottlenecks

2012-10-09 Thread Nikolay Denev
On Oct 9, 2012, at 5:12 PM, Nikolay Denev  wrote:

> 
> On Oct 4, 2012, at 12:36 AM, Rick Macklem  wrote:
> 
>> Garrett Wollman wrote:
>>> <>>  said:
>>> 
>>>>> Simple: just use a sepatate mutex for each list that a cache entry
>>>>> is on, rather than a global lock for everything. This would reduce
>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>> don't have the means to measure it yet.
>>>>> 
>>>> Well, since the cache trimming is removing entries from the lists, I
>>>> don't
>>>> see how that can be done with a global lock for list updates?
>>> 
>>> Well, the global lock is what we have now, but the cache trimming
>>> process only looks at one list at a time, so not locking the list that
>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>> mechanism (that I didn't see) for entries to move from one list to
>>> another. Note that I'm considering each hash bucket a separate
>>> "list". (One issue to worry about in that case would be cache-line
>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>> ought to be increased to reduce that.)
>>> 
>> Yea, a separate mutex for each hash list might help. There is also the
>> LRU list that all entries end up on, that gets used by the trimming code.
>> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
>> it in a while.)
>> 
>> Also, increasing the hash table size is probably a good idea, especially
>> if you reduce how aggressively the cache is trimmed.
>> 
>>>> Only doing it once/sec would result in a very large cache when
>>>> bursts of
>>>> traffic arrives.
>>> 
>>> My servers have 96 GB of memory so that's not a big deal for me.
>>> 
>> This code was originally "production tested" on a server with 1Gbyte,
>> so times have changed a bit;-)
>> 
>>>> I'm not sure I see why doing it as a separate thread will improve
>>>> things.
>>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>>> wish)
>>>> and having a bunch more "cache trimming threads" would just increase
>>>> contention, wouldn't it?
>>> 
>>> Only one cache-trimming thread. The cache trim holds the (global)
>>> mutex for much longer than any individual nfsd service thread has any
>>> need to, and having N threads doing that in parallel is why it's so
>>> heavily contended. If there's only one thread doing the trim, then
>>> the nfsd service threads aren't spending time either contending on the
>>> mutex (it will be held less frequently and for shorter periods).
>>> 
>> I think the little drc2.patch which will keep the nfsd threads from
>> acquiring the mutex and doing the trimming most of the time, might be
>> sufficient. I still don't see why a separate trimming thread will be
>> an advantage. I'd also be worried that the one cache trimming thread
>> won't get the job done soon enough.
>> 
>> When I did production testing on a 1Gbyte server that saw a peak
>> load of about 100RPCs/sec, it was necessary to trim aggressively.
>> (Although I'd be tempted to say that a server with 1Gbyte is no
>> longer relevant, I recently recall someone trying to run FreeBSD
>> on a i486, although I doubt they wanted to run the nfsd on it.)
>> 
>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>> response times (the time the nfsd thread spends trimming the cache).
>>>> As noted, I think this time would be negligible compared to disk I/O
>>>> and network transit times in the total RPC response time?
>>> 
>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>> network connectivity, spinning on a contended mutex takes a
>>> significant amount of CPU time. (For the current design of the NFS
>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>> should give that a try once I'm able to do more testing.)
>>> 
>> Have fun with it. Let me know when you have what you think is a good patch.
>> 
>> rick
>> 
>>> -GAWollman

Re: NFS server bottlenecks

2012-10-09 Thread Nikolay Denev

On Oct 4, 2012, at 12:36 AM, Rick Macklem  wrote:

> Garrett Wollman wrote:
>> <>  said:
>> 
 Simple: just use a sepatate mutex for each list that a cache entry
 is on, rather than a global lock for everything. This would reduce
 the mutex contention, but I'm not sure how significantly since I
 don't have the means to measure it yet.
 
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't
>>> see how that can be done with a global lock for list updates?
>> 
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>> 
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
> 
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
> 
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of
>>> traffic arrives.
>> 
>> My servers have 96 GB of memory so that's not a big deal for me.
>> 
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit;-)
> 
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things.
>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>> wish)
>>> and having a bunch more "cache trimming threads" would just increase
>>> contention, wouldn't it?
>> 
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time either contending on the
>> mutex (it will be held less frequently and for shorter periods).
>> 
> I think the little drc2.patch which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread will be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
> 
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recently recall someone trying to run FreeBSD
> on a i486, although I doubt they wanted to run the nfsd on it.)
> 
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>> 
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>> 
> Have fun with it. Let me know when you have what you think is a good patch.
> 
> rick
> 
>> -GAWollman

My quest for IOPS over NFS continues :)
So far I'm not able to achieve more than about 3000 8K read requests per second
over NFS, while the server locally delivers much more.
And this is all from a file that is completely in the ARC cache, no disk I/O
involved.
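
As an aside, a simple way to watch the server-side RPC rate while such a test
runs (a sketch, independent of the DTrace script below):

  nfsstat -w 1   # periodic summary of client- and server-side NFS RPC counts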

I've snatched a sample DTrace script from the net
[ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
and modified it for our new NFS server:

#!/usr/sbin/dtrace -qs 

fbt:kernel:nfsrvd_*:entry
{
self->ts = timestamp; 
@counts[probefunc] = count();
}

fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
this->delta = (timestamp-self->ts)/100;
}

fbt:kernel:nfsrvd_*:return
/ self

Re: NFS server bottlenecks

2012-10-06 Thread Nikolay Denev
On Oct 4, 2012, at 12:36 AM, Rick Macklem  wrote:

> Garrett Wollman wrote:
>> <>  said:
>> 
 Simple: just use a sepatate mutex for each list that a cache entry
 is on, rather than a global lock for everything. This would reduce
 the mutex contention, but I'm not sure how significantly since I
 don't have the means to measure it yet.
 
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't
>>> see how that can be done with a global lock for list updates?
>> 
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>> 
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
> 
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
> 
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of
>>> traffic arrives.
>> 
>> My servers have 96 GB of memory so that's not a big deal for me.
>> 
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit;-)
> 
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things.
>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>> wish)
>>> and having a bunch more "cache trimming threads" would just increase
>>> contention, wouldn't it?
>> 
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time either contending on the
>> mutex (it will be held less frequently and for shorter periods).
>> 
> I think the little drc2.patch which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread will be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
> 
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recently recall someone trying to run FreeBSD
> on a i486, although I doubt they wanted to run the nfsd on it.)
> 
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>> 
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>> 
> Have fun with it. Let me know when you have what you think is a good patch.
> 
> rick
> 
>> -GAWollman

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL machine
over a 10G network, and noticed the same nfsd threads issue.

Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server with
"dd if=/tank/32G.bin of=/dev/null bs=1M" to cache it completely in ARC (the
machine has 196G RAM); if I then do this again locally I get close to 4GB/sec
reads, entirely from the cache.

But if I try to read the file over NFS from the Linux machine I only get about
100MB/sec, sometimes a bit more, and all of the nfsd threads are clearly
visible in top. pmcstat also showed the same mutex contention as in the
original post.
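
For reference, a rough sketch of the kind of pmcstat sampling used to see this;
the generic "instructions" event alias depends on hwpmc support for the CPU:

  kldload hwpmc
  pmcstat -S instructions -T    # live, top(1)-like view of the hottest sampled functions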

I've now applied

accessing geom stats from the kernel

2009-12-08 Thread Nikolay Denev
Hello,

I have a small four-SATA-bay machine (HP ex470) which I'm using as a NAS with
FreeBSD+ZFS.
It has a dual-color LED for each SATA bay (red and blue, or purple if both are
lit), which I control either from userspace, by writing to the enclosure
management I/O port, or, more recently, via a kernel module I wrote that uses
the led(4) framework and exports the LEDs as device nodes in /dev.

What I'm wondering now is whether there is an easy way to access geom/disk
stats from inside the kernel and make the LEDs flash only on disk activity,
without having to do it in userspace?
 

--
Regards,
Nikolay Denev



