RE: NFS over RDMA benchmark
-----Original Message-----
From: J. Bruce Fields [mailto:bfie...@fieldses.org]
Sent: Wednesday, April 24, 2013 18:27
To: Yan Burman
Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfie...@fieldses.org]
> > Sent: Wednesday, April 24, 2013 00:06
> > To: Yan Burman
> > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
> > Subject: Re: NFS over RDMA benchmark
> >
> > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > -----Original Message-----
> > > From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
> > > Sent: Wednesday, April 17, 2013 21:06
> > > To: Atchley, Scott
> > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
> > > Subject: Re: NFS over RDMA benchmark
> > >
> > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov wrote:
> > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
> > > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
> > > > > > Hi.
> > > > > >
> > > > > > I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > > seem to only get about half of the bandwidth that the HW can give
> > > > > > me. My setup consists of 2 servers, each with 16 cores, 32GB of
> > > > > > memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3. These
> > > > > > servers are connected to a QDR IB switch. The backing storage on
> > > > > > the server is tmpfs mounted with noatime. I am running kernel
> > > > > > 3.5.7.
> > > > > >
> > > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > > 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec
> > > > > > for the same block sizes (4-512K). Running over IPoIB-CM, I get
> > > > > > 200-980MB/sec.
> > > >
> > > > Yan,
> > > >
> > > > Are you trying to optimize single client performance or server
> > > > performance with multiple clients?
> > >
> > > I am trying to get maximum performance from a single server - I used 2
> > > processes in the fio test - more than 2 did not show any performance
> > > boost. I tried running fio from 2 different PCs on 2 different files,
> > > but the sum of the two is more or less the same as running from a
> > > single client PC.
> > >
> > > What I did see is that the server is sweating a lot more than the
> > > clients and, more than that, it has 1 core (CPU5) at 100% in softirq
> > > tasklet: cat /proc/softirqs
> >
> > Would any profiling help figure out which code it's spending time in?
> > (E.g. something simple as perf top might have useful output.)
> >
> > Perf top for the CPU with high tasklet count gives:
> >
> >    samples  pcnt       RIP  function             DSO
> >    _______  _____  ________  ___________________  _____________
> >    2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux
>
> I guess that means lots of contention on some mutex?  If only we knew
> which one...
>
> perf should also be able to collect stack statistics, I forget how.

Googling around I think we want:

	perf record -a --call-graph

(give it a chance to collect some samples, then ^C)

	perf report --call-graph --stdio

Sorry it took me a while to get perf to show the call trace (did not
enable frame pointers in kernel and struggled with perf options...), but
what I get is:

36.18%  nfsd  [kernel.kallsyms]  [k] mutex_spin_on_owner
        |
        --- mutex_spin_on_owner
           |
           |--99.99%-- __mutex_lock_slowpath
           |          mutex_lock
           |          |
           |          |--85.30%-- generic_file_aio_write
           |          |          do_sync_readv_writev
           |          |          do_readv_writev
           |          |          vfs_writev
           |          |          nfsd_vfs_write
           |          |          nfsd_write
           |          |          nfsd3_proc_write
           |          |          nfsd_dispatch
           |          |          svc_process_common
           |          |          svc_process
           |          |          nfsd
           |          |          kthread
           |          |          kernel_thread_helper
           |          |
           |           --14.70%-- svc_send
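For reference, the collection recipe Bruce suggested boils down to the
following (a minimal sketch: the 30-second sampling window is an arbitrary
choice, and CONFIG_FRAME_POINTER=y is the kernel option whose absence made
the first attempts above useless):

    # Kernel built with CONFIG_FRAME_POINTER=y for usable call chains.
    # Sample all CPUs, with call graphs, while the fio job is running:
    perf record -a -g -- sleep 30
    # Plain-text report with the call trees expanded per symbol:
    perf report --call-graph --stdio

On newer perf the recording mode can also be spelled explicitly, e.g.
"perf record -a --call-graph fp".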
Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
> > > > > > > I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > > > seem to only get about half of the bandwidth that the HW can
> > > > > > > give me. My setup consists of 2 servers, each with 16 cores,
> > > > > > > 32GB of memory, and a Mellanox ConnectX3 QDR card over PCI-e
> > > > > > > gen3. These servers are connected to a QDR IB switch. The
> > > > > > > backing storage on the server is tmpfs mounted with noatime.
> > > > > > > I am running kernel 3.5.7.
> > > > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > > > 4-512K. When I run fio over rdma mounted nfs, I get
> > > > > > > 260-2200MB/sec for the same block sizes (4-512K). Running over
> > > > > > > IPoIB-CM, I get 200-980MB/sec.
...
> > > > I am trying to get maximum performance from a single server - I used
> > > > 2 processes in the fio test - more than 2 did not show any
> > > > performance boost. I tried running fio from 2 different PCs on 2
> > > > different files, but the sum of the two is more or less the same as
> > > > running from a single client PC.
> > > > What I did see is that the server is sweating a lot more than the
> > > > clients and, more than that, it has 1 core (CPU5) at 100% in softirq
> > > > tasklet: cat /proc/softirqs
...
> > > Perf top for the CPU with high tasklet count gives:
> > >
> > >    samples  pcnt       RIP  function             DSO
...
> > >    2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux
...
> Googling around I think we want:
>
> 	perf record -a --call-graph
>
> (give it a chance to collect some samples, then ^C)
>
> 	perf report --call-graph --stdio
>
> Sorry it took me a while to get perf to show the call trace (did not
> enable frame pointers in kernel and struggled with perf options...), but
> what I get is:
>
> 36.18%  nfsd  [kernel.kallsyms]  [k] mutex_spin_on_owner
>         |
>         --- mutex_spin_on_owner
>            |
>            |--99.99%-- __mutex_lock_slowpath
>            |          mutex_lock
>            |          |
>            |          |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

>            |          |          do_sync_readv_writev
>            |          |          do_readv_writev
>            |          |          vfs_writev
>            |          |          nfsd_vfs_write
>            |          |          nfsd_write
>            |          |          nfsd3_proc_write
>            |          |          nfsd_dispatch
>            |          |          svc_process_common
>            |          |          svc_process
>            |          |          nfsd
>            |          |          kthread
>            |          |          kernel_thread_helper
>            |          |
>            |           --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

>            |                     svc_process
>            |                     nfsd
>            |                     kthread
>            |                     kernel_thread_helper
>             --0.01%-- [...]
>
>  9.63%  nfsd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>         |
>         --- _raw_spin_lock_irqsave
>            |
>            |--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

>            |          intel_alloc_iova
>            |          __intel_map_single
>            |          intel_map_page
>            |          |
>            |          |--60.47%-- svc_rdma_sendto
>            |          |          svc_send
>            |          |          svc_process
>            |          |          nfsd
>            |          |          kthread
>            |          |          kernel_thread_helper
>            |          |
>            |          |--30.10%-- rdma_read_xdr
>            |          |          svc_rdma_recvfrom
>            |          |          svc_recv
>            |          |          nfsd
>            |          |          kthread
>            |          |          kernel_thread_helper
>            |          |
>            |          |--6.69%-- svc_rdma_post_recv
>            |          |          send_reply
>            |          |          svc_rdma_sendto
>            |
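A note on the biggest hit above: in kernels of this vintage,
generic_file_aio_write() takes the per-inode i_mutex around the whole
write, so every nfsd thread writing to the same file serializes there no
matter how many threads or clients are added - that is the 85.30% branch
in the trace. An abridged sketch of the v3.5-era shape of that function
(mm/filemap.c; plugging, BUG_ON, and the generic_write_sync() tail are
elided here):

    /*
     * Abridged from v3.5 mm/filemap.c, trimmed to show the locking.
     * All writers to one inode contend on i_mutex; the losers spin in
     * mutex_spin_on_owner(), which is what the profile shows.
     */
    ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
                                   unsigned long nr_segs, loff_t pos)
    {
            struct file *file = iocb->ki_filp;
            struct inode *inode = file->f_mapping->host;
            ssize_t ret;

            mutex_lock(&inode->i_mutex);    /* one writer per inode at a time */
            ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
            mutex_unlock(&inode->i_mutex);

            return ret;
    }

The lock is per inode, so a one-file-per-job layout only moves the
contention elsewhere - consistent with Yan's report that two clients on
two different files still didn't add up to more throughput.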
Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:
> > > > > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
> > > > > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > > > > 4-512K. When I run fio over rdma mounted nfs, I get
> > > > > > > > 260-2200MB/sec for the same block sizes (4-512K). Running
> > > > > > > > over IPoIB-CM, I get 200-980MB/sec.
...
[snip]

> > 36.18%  nfsd  [kernel.kallsyms]  [k] mutex_spin_on_owner
>
> That's the inode i_mutex.
>
> >            --14.70%-- svc_send
>
> That's the xpt_mutex (ensuring rpc replies aren't interleaved).
>
> >  9.63%  nfsd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>
> And that (and __free_iova below) looks like iova_rbtree_lock.

Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
--ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
--norandommap --group_reporting --exitall --buffered=0

* inode's i_mutex: If increasing the process/file count didn't help,
  maybe increasing iodepth (say 512?) could offset the i_mutex overhead
  a little bit?

* xpt_mutex: (no idea)

* iova_rbtree_lock: DMA mapping fragmentation? I have not studied
  whether NFS-RDMA routines such as svc_rdma_sendto() could do better,
  but maybe sequential IO (instead of randread) could help? A bigger
  block size (instead of 4K) could help?

-- Wendy
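Pulling those suggestions together, a follow-up run could look something
like this (illustrative only: the mount point and file name are
placeholders, and the file size is raised from the original --size=10k,
which looks far too small to keep 128 - let alone 512 - I/Os in flight):

    # Hypothetical variant of the fio command above: sequential reads,
    # bigger blocks, deeper queue.  /mnt/nfsrdma/testfile is made up.
    fio --name=nfsrdma-seqread --filename=/mnt/nfsrdma/testfile \
        --rw=read --bs=256k --numjobs=2 --iodepth=512 \
        --ioengine=libaio --direct=1 --buffered=0 --invalidate=1 \
        --size=1g --loops=25 --group_reporting --exitall

If the mutex_spin_on_owner and iova_rbtree_lock samples shrink under this
load, that points at the request pattern rather than the RDMA path
itself; the same perf record/report pair used earlier would confirm it.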