RE: NFS over RDMA benchmark

2013-04-28 Thread Yan Burman


 -----Original Message-----
 From: J. Bruce Fields [mailto:bfie...@fieldses.org]
 Sent: Wednesday, April 24, 2013 18:27
 To: Yan Burman
 Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
  On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
  
  
-----Original Message-----
From: J. Bruce Fields [mailto:bfie...@fieldses.org]
Sent: Wednesday, April 24, 2013 00:06
To: Yan Burman
Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark
   
On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:


  -----Original Message-----
  From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
  Sent: Wednesday, April 17, 2013 21:06
  To: Atchley, Scott
  Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
  linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
  Subject: Re: NFS over RDMA benchmark
 
   On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov wrote:
    On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
   
    On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
    Hi.
   
    I've been trying to do some benchmarks for NFS over RDMA and I seem to
    only get about half of the bandwidth that the HW can give me.
    My setup consists of 2 servers, each with 16 cores, 32GB of memory, and a
    Mellanox ConnectX-3 QDR card over PCIe gen3.
    These servers are connected to a QDR IB switch. The backing storage on
    the server is tmpfs mounted with noatime.
    I am running kernel 3.5.7.
   
    When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
    When I run fio over RDMA-mounted NFS, I get 260-2200 MB/sec for the
    same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
  
    Yan,
   
    Are you trying to optimize single client performance or server
    performance with multiple clients?
  

  I am trying to get maximum performance from a single server - I used 2
  processes in the fio test - more than 2 did not show any performance boost.
  I tried running fio from 2 different PCs on 2 different files, but the sum of
  the two is more or less the same as running from a single client PC.

  What I did see is that the server is sweating a lot more than the clients;
  more than that, it has 1 core (CPU5) at 100% in softirq tasklet processing,
  per cat /proc/softirqs.
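[For reference, a per-CPU imbalance like the one described can be spotted by parsing /proc/softirqs directly; a minimal Python sketch — the helper name and the trimmed sample below are illustrative, not from the thread:]

```python
def tasklet_counts(text):
    """Parse /proc/softirqs text; return {cpu_name: count} for the TASKLET row."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header row, e.g. ['CPU0', 'CPU1', ...]
    for line in lines[1:]:
        fields = line.split()
        if fields[0].rstrip(':') == 'TASKLET':
            return dict(zip(cpus, map(int, fields[1:])))
    return {}

# Trimmed sample; a real file lists every CPU and every softirq type.
sample = """\
        CPU0       CPU1       CPU5
HI:        3          0          0
TASKLET: 120         45    9934021
"""
counts = tasklet_counts(sample)
hot = max(counts, key=counts.get)
print(hot, counts[hot])   # the CPU carrying almost all tasklet work
```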
   
 Would any profiling help figure out which code it's spending time in?
 (E.g. something as simple as perf top might have useful output.)
   
  
  
    Perf top for the CPU with high tasklet count gives:
   
     samples  pcnt   RIP       function             DSO
     2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux
 
   I guess that means lots of contention on some mutex?  If only we knew
   which one. perf should also be able to collect stack statistics; I
   forget how.
 
  Googling around, I think we want:
 
   perf record -a --call-graph
   (give it a chance to collect some samples, then ^C)
   perf report --call-graph --stdio
 

Sorry it took me a while to get perf to show the call trace (I had not enabled
frame pointers in the kernel and struggled with perf options...), but what I get is:
36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
|
--- mutex_spin_on_owner
   |
   |--99.99%-- __mutex_lock_slowpath
   |  mutex_lock
   |  |
   |  |--85.30%-- generic_file_aio_write
   |  |  do_sync_readv_writev
   |  |  do_readv_writev
   |  |  vfs_writev
   |  |  nfsd_vfs_write
   |  |  nfsd_write
   |  |  nfsd3_proc_write
   |  |  nfsd_dispatch
   |  |  svc_process_common
   |  |  svc_process
   |  |  nfsd
   |  |  kthread
   |  |  kernel_thread_helper
   |  |
   |   --14.70%-- svc_send
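[The flat overhead lines at the top of perf report --stdio output, like the one above, can be tallied with a short helper; a minimal Python sketch, assuming the usual "percent command dso [k] symbol" line shape — the helper itself is hypothetical, not something from the thread:]

```python
import re

# Matches top-level overhead lines such as:
#   36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
LINE = re.compile(r'^\s*(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)')

def top_symbols(report):
    """Return (percent, symbol) pairs, highest overhead first."""
    out = []
    for line in report.splitlines():
        m = LINE.match(line)
        if m:                                   # call-graph branch lines don't match
            out.append((float(m.group(1)), m.group(5)))
    return sorted(out, reverse=True)

sample = """
36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
 9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
"""
print(top_symbols(sample))
# → [(36.18, 'mutex_spin_on_owner'), (9.63, '_raw_spin_lock_irqsave')]
```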

Re: NFS over RDMA benchmark

2013-04-28 Thread J. Bruce Fields
On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
  I've been trying to do some benchmarks for NFS over RDMA and I seem to
  only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers, each with 16 cores, 32GB of memory, and a
  Mellanox ConnectX-3 QDR card over PCIe gen3.
  These servers are connected to a QDR IB switch. The backing storage on
  the server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.

  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
  When I run fio over RDMA-mounted NFS, I get 260-2200 MB/sec for the
  same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
...
  I am trying to get maximum performance from a single server - I used 2
  processes in the fio test - more than 2 did not show any performance boost.
  I tried running fio from 2 different PCs on 2 different files, but the sum of
  the two is more or less the same as running from a single client PC.

  What I did see is that the server is sweating a lot more than the clients;
  more than that, it has 1 core (CPU5) at 100% in softirq tasklet processing,
  per cat /proc/softirqs.
...
 Perf top for the CPU with high tasklet count gives:

  samples  pcnt   RIP       function             DSO
...
  2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux
...
  Googling around, I think we want:
  
  perf record -a --call-graph
  (give it a chance to collect some samples, then ^C)
  perf report --call-graph --stdio
  
 
 Sorry it took me a while to get perf to show the call trace (I had not enabled
 frame pointers in the kernel and struggled with perf options...), but what I
 get is:
 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
 |
 --- mutex_spin_on_owner
|
|--99.99%-- __mutex_lock_slowpath
|  mutex_lock
|  |
|  |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

|  |  do_sync_readv_writev
|  |  do_readv_writev
|  |  vfs_writev
|  |  nfsd_vfs_write
|  |  nfsd_write
|  |  nfsd3_proc_write
|  |  nfsd_dispatch
|  |  svc_process_common
|  |  svc_process
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|   --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

| svc_process
| nfsd
| kthread
| kernel_thread_helper
 --0.01%-- [...]
 
  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 |
 --- _raw_spin_lock_irqsave
|
|--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

|  intel_alloc_iova
|  __intel_map_single
|  intel_map_page
|  |
|  |--60.47%-- svc_rdma_sendto
|  |  svc_send
|  |  svc_process
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|  |--30.10%-- rdma_read_xdr
|  |  svc_rdma_recvfrom
|  |  svc_recv
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|  |--6.69%-- svc_rdma_post_recv
|  |  send_reply
|  |  svc_rdma_sendto
|  
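[The alloc_iova / _raw_spin_lock_irqsave samples above point at the Intel IOMMU driver's DMA-mapping path, which serializes IOVA allocation under a single rbtree lock. One common way to take that lock out of the picture while benchmarking — at the cost of DMA isolation — was to boot with the IOMMU in pass-through mode. A sketch of the kernel command line, assuming a GRUB2 setup; the exact flags and paths depend on the distro:]

```shell
# /etc/default/grub -- illustrative only.
# iommu=pt maps host-owned devices 1:1, skipping per-mapping iova allocation.
GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt"
# Then regenerate the config and reboot, e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```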

Re: NFS over RDMA benchmark

2013-04-28 Thread Wendy Cheng
On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over RDMA-mounted NFS, I get 260-2200 MB/sec for the
 same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
 ...

[snip]

 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner

 That's the inode i_mutex.

 14.70%-- svc_send

 That's the xpt_mutex (ensuring rpc replies aren't interleaved).


  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave


 And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
--ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
--norandommap --group_reporting --exitall --buffered=0

* inode's i_mutex:
If increasing the process/file count didn't help, maybe increasing the iodepth
(say 512?) could offset the i_mutex overhead a little bit?

* xpt_mutex:
(no idea)

* iova_rbtree_lock:
DMA mapping fragmentation? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better, but maybe
sequential IO (instead of randread) could help? Could a bigger block size
(instead of 4K) help?
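[The suggestions above map onto fio job-file knobs roughly as follows; a sketch only, with illustrative values that are not taken from the thread:]

```ini
; seq-read.fio -- hypothetical variant of Yan's random-read job
[global]
ioengine=libaio
direct=1
; deeper queue, per the i_mutex suggestion
iodepth=512
; larger blocks, fewer lock round-trips per byte
bs=128k
numjobs=2
group_reporting

[seqread]
; sequential instead of randread
rw=read
size=1g
```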

-- Wendy