Re: NFS over RDMA benchmark

2013-06-20 Thread Or Gerlitz

On 19/06/2013 18:47, Wendy Cheng wrote:

what kind of HW I would need to run it ?


The mlx4 driver supports memory windows as of kernel 3.9

Or.


Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Talpey

On 4/30/2013 1:09 AM, Yan Burman wrote:




-Original Message-
From: J. Bruce Fields [mailto:bfie...@fieldses.org]
Sent: Sunday, April 28, 2013 17:43
To: Yan Burman
Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:

On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com

I've been trying to do some benchmarks for NFS over RDMA and I seem to
only get about half of the bandwidth that the HW can give me.
My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
These servers are connected to a QDR IB switch. The backing storage on
the server is tmpfs mounted with noatime.
I am running kernel 3.5.7.

When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

...

I am trying to get maximum performance from a single server - I used 2
processes in fio test - more than 2 did not show any performance boost.
I tried running fio from 2 different PCs on 2 different files, but the sum of
the two is more or less the same as running from single client PC.




I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also 
way higher now).
For some reason when I had intel IOMMU enabled, the performance dropped 
significantly.
I now get up to ~95K IOPS and 4.1GB/sec bandwidth.


Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.
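
For reference, the arithmetic behind that estimate (simple unit math, nothing
measured here):

    95,000 IOPS x 4 KiB = 95,000 x 4,096 bytes/sec ~ 389 MB/sec (~371 MiB/sec)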

What is the client CPU percentage you see under this workload, and
how different are the NFS/RDMA and NFS/IPoIB overheads?


Now I will take care of the issue that I am running only at 40Gbit/s instead of 
56Gbit/s, but that is another unrelated problem (I suspect I have a cable 
issue).

This is still strange, since ib_send_bw with intel iommu enabled did get up to 
4.5GB/sec, so why did intel iommu affect only nfs code?


You'll need to do more profiling to track that down. I would suspect
that ib_send_bw is using some sort of direct hardware access, bypassing
the IOMMU management and possibly performing no dynamic memory registration.

The NFS/RDMA code goes via the standard kernel DMA API, and correctly
registers/deregisters memory on a per-i/o basis in order to provide
storage data integrity. Perhaps there are overheads in the IOMMU
management which can be addressed.
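
One low-effort way to chase that would be to reuse the perf invocation mentioned
elsewhere in this thread and filter the report for the IOMMU mapping paths that
already showed up in the profile (alloc_iova, __intel_map_single,
__domain_mapping); the grep pattern below is only an illustration:

    perf record -a --call-graph
    (run the fio workload for a while, then ^C)
    perf report --call-graph --stdio | egrep -i 'iova|intel_map|domain_mapping'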


Re: NFS over RDMA benchmark

2013-04-30 Thread J. Bruce Fields
On Sun, Apr 28, 2013 at 10:42:48AM -0400, J. Bruce Fields wrote:
 On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
 y...@mellanox.com
 I've been trying to do some benchmarks for NFS over RDMA
 and I seem to
only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of
 memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The
 backing storage on
the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block 
 sizes 4-
   512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec
 for the
same block sizes (4-512K). running over IPoIB-CM, I get 200-
   980MB/sec.
 ...
   I am trying to get maximum performance from a single server - I
   used 2
  processes in fio test - more than 2 did not show any performance 
  boost.
   I tried running fio from 2 different PCs on 2 different files,
   but the sum of
  the two is more or less the same as running from single client PC.
  
   What I did see is that server is sweating a lot more than the
   clients and
  more than that, it has 1 core (CPU5) in 100% softirq tasklet:
   cat /proc/softirqs
 ...
 Perf top for the CPU with high tasklet count gives:

  samples  pcnt RIPfunction
 DSO
 ...
  2787.00 24.1% 81062a00 mutex_spin_on_owner
   /root/vmlinux
 ...
   Googling around  I think we want:
   
 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio
   
  
  Sorry it took me a while to get perf to show the call trace (did not enable 
  frame pointers in kernel and struggled with perf options...), but what I 
  get is:
  36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
  |
  --- mutex_spin_on_owner
 |
 |--99.99%-- __mutex_lock_slowpath
 |  mutex_lock
 |  |
 |  |--85.30%-- generic_file_aio_write
 
 That's the inode i_mutex.

Looking at the code ... With CONFIG_MUTEX_SPIN_ON_OWNER it spins
(instead of sleeping) as long as the lock owner's still running.  So
this is just a lot of contention on the i_mutex, I guess.  Not sure what
to do about that.

--b.


Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Talpey

On 4/30/2013 1:09 AM, Yan Burman wrote:

I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
...
 ib_send_bw with intel iommu enabled did get up to 4.5GB/sec


BTW, you may want to verify that these are the same GB. Many
benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB.

At GB/GiB, the difference is about 7.5%, very close to the
difference between 4.1 and 4.5.
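
A rough check of that ratio (plain unit arithmetic, not a measurement):

    1 GiB / 1 GB = 2^30 / 10^9 ~ 1.074, i.e. about a 7.4% difference
    4.1 GiB/sec ~ 4.4 GB/sec

So if one tool is reporting binary units and the other decimal, that alone
closes most of the 4.1 vs 4.5 gap.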

Just a thought.


RE: NFS over RDMA benchmark

2013-04-30 Thread Yan Burman


 -Original Message-
 From: Tom Talpey [mailto:t...@talpey.com]
 Sent: Tuesday, April 30, 2013 16:05
 To: Yan Burman
 Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
 r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On 4/30/2013 1:09 AM, Yan Burman wrote:
 
 
  -Original Message-
  From: J. Bruce Fields [mailto:bfie...@fieldses.org]
  Sent: Sunday, April 28, 2013 17:43
  To: Yan Burman
  Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
  linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
  Subject: Re: NFS over RDMA benchmark
 
  On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
  y...@mellanox.com
  I've been trying to do some benchmarks for NFS over RDMA
  and I seem to
  only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers each with 16 cores, 32Gb of
  memory, and
  Mellanox ConnectX3 QDR card over PCI-e gen3.
  These servers are connected to a QDR IB switch. The backing
  storage on
  the server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block
  sizes 4-
  512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec
  for the
  same block sizes (4-512K). running over IPoIB-CM, I get
  200-
  980MB/sec.
  ...
  I am trying to get maximum performance from a single server
  - I used 2
  processes in fio test - more than 2 did not show any performance
  boost.
  I tried running fio from 2 different PCs on 2 different files,
  but the sum of
  the two is more or less the same as running from single client PC.
 
 
  I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is
 also way higher now).
  For some reason when I had intel IOMMU enabled, the performance
 dropped significantly.
  I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
 
 Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.
 

That is not a typo. I get 95K IOPS with a randrw test at 4K block size.
I get 4.1GB/sec with a 256K block size randread test.

 What is the client CPU percentage you see under this workload, and how
 different are the NFS/RDMA and NFS/IPoIB overheads?

NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost
twice the bandwidth of IPoIB.
Overall, CPU usage gets up to about 20% for randread and 50% for randwrite.

 
  Now I will take care of the issue that I am running only at 40Gbit/s instead
 of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable
 issue).
 
  This is still strange, since ib_send_bw with intel iommu enabled did get up
 to 4.5GB/sec, so why did intel iommu affect only nfs code?
 
 You'll need to do more profiling to track that down. I would suspect that
 ib_send_bw is using some sort of direct hardware access, bypassing the
 IOMMU management and possibly performing no dynamic memory
 registration.
 
 The NFS/RDMA code goes via the standard kernel DMA API, and correctly
 registers/deregisters memory on a per-i/o basis in order to provide storage
 data integrity. Perhaps there are overheads in the IOMMU management
 which can be addressed.


RE: NFS over RDMA benchmark

2013-04-30 Thread Yan Burman


 -Original Message-
 From: Tom Talpey [mailto:t...@talpey.com]
 Sent: Tuesday, April 30, 2013 17:20
 To: Yan Burman
 Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
 r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On 4/30/2013 1:09 AM, Yan Burman wrote:
  I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
 ...
   ib_send_bw with intel iommu enabled did get up to 4.5GB/sec
 
 BTW, you may want to verify that these are the same GB. Many benchmarks
 say KB/MB/GB when they really mean KiB/MiB/GiB.
 
 At GB/GiB, the difference is about 7.5%, very close to the difference between
 4.1 and 4.5.
 
 Just a thought.

The question is not why there is a 400MBps difference between ib_send_bw and
NFSoRDMA.
The question is why, with the IOMMU enabled, ib_send_bw reached the same
bandwidth as without it, while NFSoRDMA got only half.

From some googling, it seems that when IOMMU is enabled, dma mapping functions 
get a lot more expensive.
Perhaps that is the reason for the performance drop.

Yan



Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Talpey

On 4/30/2013 10:23 AM, Yan Burman wrote:

-Original Message-
From: Tom Talpey [mailto:t...@talpey.com]

On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:

I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is
also way higher now).
For some reason when I had intel IOMMU enabled, the performance
dropped significantly.
I now get up to ~95K IOPS and 4.1GB/sec bandwidth.


Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.


That is not a typo. I get 95K IOPS with a randrw test at 4K block size.
I get 4.1GB/sec with a 256K block size randread test.


Well, then I suggest you decide whether your goal is high bandwidth or
high IOPS. They are two very different things, and clearly there are
still significant issues to track down in the server.


What is the client CPU percentage you see under this workload, and how
different are the NFS/RDMA and NFS/IPoIB overheads?


NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost
twice the bandwidth of IPoIB.


So, for 125% of the CPU, RDMA is delivering 200% of the bandwidth.
A common reporting approach is to calculate cycles per Byte (roughly,
CPU/MB/sec), and you'll find this can be a great tool for comparison
when overhead is a consideration.
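
As a sketch of that kind of calculation, using the ~20% randread figure below
and an assumed 2.6 GHz clock (the clock speed is not stated anywhere in this
thread):

    cycles/Byte ~ (busy cores x clock) / throughput
                = (16 x 0.20 x 2.6e9) / 4.1e9 ~ 2 cycles/Byte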


Overall, CPU usage gets up to about 20% for randread and 50% for randwrite.


This is *client* CPU? Writes require the server to take additional
overhead to make RDMA Read requests, but the client side is doing
practically the same thing for the read vs write path. Again, you
may want to profile more deeply to track that difference down.



Re: NFS over RDMA benchmark

2013-04-30 Thread Wendy Cheng
On Mon, Apr 29, 2013 at 10:09 PM, Yan Burman y...@mellanox.com wrote:

 I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also 
 way higher now).
 For some reason when I had intel IOMMU enabled, the performance dropped 
 significantly.
 I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
 Now I will take care of the issue that I am running only at 40Gbit/s instead 
 of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable 
 issue).

 This is still strange, since ib_send_bw with intel iommu enabled did get up 
 to 4.5GB/sec, so why did intel iommu affect only nfs code?


That's very exciting! The sad part is that the IOMMU has to be turned off.

I think ib_send_bw uses a single buffer so the DMA mapping search
overhead is not an issue.
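
If turning the IOMMU off outright is undesirable, one thing that might be worth
trying (an untested suggestion on my part) is DMA passthrough mode, which keeps
the IOMMU available but skips the per-mapping IOVA management:

    # kernel boot parameters, e.g. appended to the grub kernel line
    intel_iommu=on iommu=pt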

-- Wendy


Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Tucker

On 4/30/13 9:38 AM, Yan Burman wrote:



-Original Message-
From: Tom Talpey [mailto:t...@talpey.com]
Sent: Tuesday, April 30, 2013 17:20
To: Yan Burman
Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On 4/30/2013 1:09 AM, Yan Burman wrote:

I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
...
  ib_send_bw with intel iommu enabled did get up to 4.5GB/sec

BTW, you may want to verify that these are the same GB. Many benchmarks
say KB/MB/GB when they really mean KiB/MiB/GiB.

At GB/GiB, the difference is about 7.5%, very close to the difference between
4.1 and 4.5.

Just a thought.

The question is not why there is a 400MBps difference between ib_send_bw and
NFSoRDMA.
The question is why, with the IOMMU enabled, ib_send_bw reached the same
bandwidth as without it, while NFSoRDMA got only half.

NFSRDMA is constantly registering and unregistering memory when you use
FRMR mode. By contrast, IPoIB has a descriptor ring that is set up once
and re-used. I suspect this is the difference maker. Have you tried
running the server in ALL_PHYSICAL mode, i.e. where it uses a DMA_MR for
all of memory?


Tom

From some googling, it seems that when IOMMU is enabled, dma mapping functions 
get a lot more expensive.
Perhaps that is the reason for the performance drop.

Yan




RE: NFS over RDMA benchmark

2013-04-29 Thread Yan Burman


 -Original Message-
 From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
 Sent: Monday, April 29, 2013 08:35
 To: J. Bruce Fields
 Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
   same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
  ...
 
 [snip]
 
  36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
 
  That's the inode i_mutex.
 
  14.70%-- svc_send
 
  That's the xpt_mutex (ensuring rpc replies aren't interleaved).
 
 
   9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 
 
  And that (and __free_iova below) looks like iova_rbtree_lock.
 
 
 
 Let's revisit your command:
 
 FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
 ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
 --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --
 norandommap --group_reporting --exitall --buffered=0
 

I tried block sizes from 4K to 512K.
4K does not give 2.2GB/sec bandwidth - optimal bandwidth is achieved around a
128-256K block size.

 * inode's i_mutex:
 If increasing process/file count didn't help, maybe increase iodepth
 (say 512 ?) could offset the i_mutex overhead a little bit ?
 

I tried with different iodepth parameters, but found no improvement above 
iodepth 128.

 * xpt_mutex:
 (no idea)
 
 * iova_rbtree_lock
 DMA mapping fragmentation ? I have not studied whether NFS-RDMA
 routines such as svc_rdma_sendto() could do better but maybe sequential
 IO (instead of randread) could help ? Bigger block size (instead of 4K) can
 help ?
 

I am trying to simulate real load (more or less); that is the reason I use
randread. Anyhow, sequential read does not result in better performance.
That's probably because the backing storage is tmpfs...

Yan



Re: NFS over RDMA benchmark

2013-04-29 Thread Tom Tucker

On 4/29/13 7:16 AM, Yan Burman wrote:



-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Monday, April 29, 2013 08:35
To: J. Bruce Fields
Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:


On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

...

[snip]


 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner

That's the inode i_mutex.


 14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).


  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave


And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --
norandommap --group_reporting --exitall --buffered=0


I tried block sizes from 4-512K.
4K does not give 2.2GB bandwidth - optimal bandwidth is achieved around 
128-256K block size


* inode's i_mutex:
If increasing process/file count didn't help, maybe increase iodepth
(say 512 ?) could offset the i_mutex overhead a little bit ?


I tried with different iodepth parameters, but found no improvement above 
iodepth 128.


* xpt_mutex:
(no idea)

* iova_rbtree_lock
DMA mapping fragmentation ? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better but maybe sequential
IO (instead of randread) could help ? Bigger block size (instead of 4K) can
help ?



I think the biggest issue is that max_payload for TCP is 2MB but only 
256k for RDMA.
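
A quick way to see what payload size a client actually negotiated for a given
mount (just a side check, not output captured in this thread):

    nfsstat -m    # lists each NFS mount with its rsize/wsize and proto options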



I am trying to simulate real load (more or less), that is the reason I use 
randread. Anyhow, read does not result in better performance.
It's probably because backing storage is tmpfs...

Yan



Re: NFS over RDMA benchmark

2013-04-29 Thread Tom Tucker

On 4/29/13 8:05 AM, Tom Tucker wrote:

On 4/29/13 7:16 AM, Yan Burman wrote:



-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Monday, April 29, 2013 08:35
To: J. Bruce Fields
Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields 
bfie...@fieldses.org wrote:



On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
same block sizes (4-512K). running over IPoIB-CM, I get 
200-980MB/sec.

...

[snip]


 36.18%  nfsd [kernel.kallsyms]   [k] mutex_spin_on_owner

That's the inode i_mutex.


 14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

  9.63%  nfsd [kernel.kallsyms]   [k] 
_raw_spin_lock_irqsave



And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 
--randrepeat=1 --

norandommap --group_reporting --exitall --buffered=0


I tried block sizes from 4-512K.
4K does not give 2.2GB bandwidth - optimal bandwidth is achieved 
around 128-256K block size



* inode's i_mutex:
If increasing process/file count didn't help, maybe increase iodepth
(say 512 ?) could offset the i_mutex overhead a little bit ?

I tried with different iodepth parameters, but found no improvement 
above iodepth 128.



* xpt_mutex:
(no idea)

* iova_rbtree_lock
DMA mapping fragmentation ? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better but maybe 
sequential
IO (instead of randread) could help ? Bigger block size (instead 
of 4K) can

help ?



I think the biggest issue is that max_payload for TCP is 2MB but only 
256k for RDMA.


Sorry, I mean 1MB...



I am trying to simulate real load (more or less), that is the reason 
I use randread. Anyhow, read does not result in better performance.

It's probably because backing storage is tmpfs...

Yan



RE: NFS over RDMA benchmark

2013-04-29 Thread Yan Burman


 -Original Message-
 From: J. Bruce Fields [mailto:bfie...@fieldses.org]
 Sent: Sunday, April 28, 2013 17:43
 To: Yan Burman
 Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
 y...@mellanox.com
 I've been trying to do some benchmarks for NFS over
 RDMA and I seem to
only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores,
 32Gb of memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The
 backing storage on
the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for
 block sizes 4-
   512K.
 When I run fio over rdma mounted nfs, I get
 260-2200MB/sec for the
same block sizes (4-512K). running over IPoIB-CM, I get
200-
   980MB/sec.
 ...
   I am trying to get maximum performance from a single server
   - I used 2
  processes in fio test - more than 2 did not show any performance
 boost.
   I tried running fio from 2 different PCs on 2 different
   files, but the sum of
  the two is more or less the same as running from single client PC.
  

I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also 
way higher now).
For some reason when I had intel IOMMU enabled, the performance dropped 
significantly.
I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
Now I will take care of the issue that I am running only at 40Gbit/s instead of 
56Gbit/s, but that is another unrelated problem (I suspect I have a cable 
issue).

This is still strange, since ib_send_bw with intel iommu enabled did get up to 
4.5GB/sec, so why did intel iommu affect only nfs code?

Yan



RE: NFS over RDMA benchmark

2013-04-28 Thread Yan Burman


 -Original Message-
 From: J. Bruce Fields [mailto:bfie...@fieldses.org]
 Sent: Wednesday, April 24, 2013 18:27
 To: Yan Burman
 Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
  On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
  
  
-Original Message-
From: J. Bruce Fields [mailto:bfie...@fieldses.org]
Sent: Wednesday, April 24, 2013 00:06
To: Yan Burman
Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark
   
On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:


  -Original Message-
  From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
  Sent: Wednesday, April 17, 2013 21:06
  To: Atchley, Scott
  Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
  linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
  Subject: Re: NFS over RDMA benchmark
 
  On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
  atchle...@ornl.gov
  wrote:
   On Apr 17, 2013, at 1:15 PM, Wendy Cheng
   s.wendy.ch...@gmail.com
  wrote:
  
   On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
   y...@mellanox.com
  wrote:
   Hi.
  
   I've been trying to do some benchmarks for NFS over RDMA
   and I seem to
  only get about half of the bandwidth that the HW can give me.
   My setup consists of 2 servers each with 16 cores, 32Gb of
   memory, and
  Mellanox ConnectX3 QDR card over PCI-e gen3.
   These servers are connected to a QDR IB switch. The
   backing storage on
  the server is tmpfs mounted with noatime.
   I am running kernel 3.5.7.
  
   When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-
 512K.
   When I run fio over rdma mounted nfs, I get 260-2200MB/sec
   for the
  same block sizes (4-512K). running over IPoIB-CM, I get 200-
 980MB/sec.
  
   Yan,
  
   Are you trying to optimize single client performance or
   server performance
  with multiple clients?
  

 I am trying to get maximum performance from a single server - I
 used 2
processes in fio test - more than 2 did not show any performance boost.
 I tried running fio from 2 different PCs on 2 different files,
 but the sum of
the two is more or less the same as running from single client PC.

 What I did see is that server is sweating a lot more than the
 clients and
more than that, it has 1 core (CPU5) in 100% softirq tasklet:
 cat /proc/softirqs
   
Would any profiling help figure out which code it's spending time in?
(E.g. something simple as perf top might have useful output.)
   
  
  
   Perf top for the CPU with high tasklet count gives:
  
samples  pcnt RIPfunction
   DSO
___ _ 
   ___
  
 _
 __
  
2787.00 24.1% 81062a00 mutex_spin_on_owner
 /root/vmlinux
 
  I guess that means lots of contention on some mutex?  If only we knew
  which one perf should also be able to collect stack statistics, I
  forget how.
 
 Googling around  I think we want:
 
   perf record -a --call-graph
   (give it a chance to collect some samples, then ^C)
   perf report --call-graph --stdio
 

Sorry it took me a while to get perf to show the call trace (did not enable 
frame pointers in kernel and struggled with perf options...), but what I get is:
36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
|
--- mutex_spin_on_owner
   |
   |--99.99%-- __mutex_lock_slowpath
   |  mutex_lock
   |  |
   |  |--85.30%-- generic_file_aio_write
   |  |  do_sync_readv_writev
   |  |  do_readv_writev
   |  |  vfs_writev
   |  |  nfsd_vfs_write
   |  |  nfsd_write
   |  |  nfsd3_proc_write
   |  |  nfsd_dispatch
   |  |  svc_process_common
   |  |  svc_process
   |  |  nfsd
   |  |  kthread
   |  |  kernel_thread_helper
   |  |
   |   --14.70%-- svc_send

Re: NFS over RDMA benchmark

2013-04-28 Thread J. Bruce Fields
On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
y...@mellanox.com
I've been trying to do some benchmarks for NFS over RDMA
and I seem to
   only get about half of the bandwidth that the HW can give me.
My setup consists of 2 servers each with 16 cores, 32Gb of
memory, and
   Mellanox ConnectX3 QDR card over PCI-e gen3.
These servers are connected to a QDR IB switch. The
backing storage on
   the server is tmpfs mounted with noatime.
I am running kernel 3.5.7.
   
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
4-
  512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec
for the
   same block sizes (4-512K). running over IPoIB-CM, I get 200-
  980MB/sec.
...
  I am trying to get maximum performance from a single server - I
  used 2
 processes in fio test - more than 2 did not show any performance 
 boost.
  I tried running fio from 2 different PCs on 2 different files,
  but the sum of
 the two is more or less the same as running from single client PC.
 
  What I did see is that server is sweating a lot more than the
  clients and
 more than that, it has 1 core (CPU5) in 100% softirq tasklet:
  cat /proc/softirqs
...
Perf top for the CPU with high tasklet count gives:
   
 samples  pcnt RIPfunction  
  DSO
...
 2787.00 24.1% 81062a00 mutex_spin_on_owner
  /root/vmlinux
...
  Googling around  I think we want:
  
  perf record -a --call-graph
  (give it a chance to collect some samples, then ^C)
  perf report --call-graph --stdio
  
 
 Sorry it took me a while to get perf to show the call trace (did not enable 
 frame pointers in kernel and struggled with perf options...), but what I get 
 is:
 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
 |
 --- mutex_spin_on_owner
|
|--99.99%-- __mutex_lock_slowpath
|  mutex_lock
|  |
|  |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

|  |  do_sync_readv_writev
|  |  do_readv_writev
|  |  vfs_writev
|  |  nfsd_vfs_write
|  |  nfsd_write
|  |  nfsd3_proc_write
|  |  nfsd_dispatch
|  |  svc_process_common
|  |  svc_process
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|   --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

| svc_process
| nfsd
| kthread
| kernel_thread_helper
 --0.01%-- [...]
 
  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 |
 --- _raw_spin_lock_irqsave
|
|--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

|  intel_alloc_iova
|  __intel_map_single
|  intel_map_page
|  |
|  |--60.47%-- svc_rdma_sendto
|  |  svc_send
|  |  svc_process
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|  |--30.10%-- rdma_read_xdr
|  |  svc_rdma_recvfrom
|  |  svc_recv
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|  |--6.69%-- svc_rdma_post_recv
|  |  send_reply
|  |  svc_rdma_sendto
|  

Re: NFS over RDMA benchmark

2013-04-28 Thread Wendy Cheng
On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
  same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 ...

[snip]

 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner

 That's the inode i_mutex.

 14.70%-- svc_send

 That's the xpt_mutex (ensuring rpc replies aren't interleaved).


  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave


 And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
--ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
--norandommap --group_reporting --exitall --buffered=0

* inode's i_mutex:
If increasing process/file count didn't help, maybe increase iodepth
(say 512 ?) could offset the i_mutex overhead a little bit ?

* xpt_mutex:
(no idea)

* iova_rbtree_lock
DMA mapping fragmentation ? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better but maybe
sequential IO (instead of randread) could help ? Bigger block size
(instead of 4K) can help ?
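
For what it's worth, the kind of variant run suggested above (sequential reads,
larger blocks, deeper queue) would look roughly like this - a sketch only, with
the file path, size and runtime as placeholders rather than values used in this
thread:

fio --name=seqread --rw=read --bs=256k --numjobs=2 --iodepth=512 \
    --ioengine=libaio --direct=1 --size=10g --runtime=60 --time_based \
    --group_reporting --filename=/mnt/nfs/testfile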

-- Wendy


Re: NFS over RDMA benchmark

2013-04-25 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:
 On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com
 wrote:

 So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
 tar ball) ... Here is a random thought (not related to the rb tree
 comment).

 The inflight packet count seems to be controlled by
 xprt_rdma_slot_table_entries that is currently hard-coded as
 RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
 with the bandwidth number if we pump it up, say 64 instead ? Not sure
 whether FMR pool size needs to get adjusted accordingly though.

 1)

 The client slot count is not hard-coded, it can easily be changed by
 writing a value to /proc and initiating a new mount. But I doubt that
 increasing the slot table will improve performance much, unless this is
 a small-random-read, and spindle-limited workload.

Hi Tom !

It was a shot in the dark :) .. as our test bed has not been set up
yet. However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this
moment as a /proc entry is easy to add. More questions on the server
though (see below) ...


 2)

 The observation appears to be that the bandwidth is server CPU limited.
 Increasing the load offered by the client probably won't move the needle,
 until that's addressed.


Could you give more hints on which part of the path is CPU limited?
Is there a known Linux-based filesystem that is reasonably tuned for
NFS-RDMA? Are there any specific filesystem features that would work well
with NFS-RDMA? I'm wondering, when disk+FS are added into the
configuration, how much advantage would NFS-RDMA get when compared
with a plain TCP/IP transport, say IPoIB-CM?

-- Wendy


Re: NFS over RDMA benchmark

2013-04-25 Thread Phil Pishioneri

On 4/25/13 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.

It was a shot in the dark :)  .. as our test bed has not been setup
yet .However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this
moment as /proc entry is easy to add. More questions on the server
though (see below) ...


Might there be confusion between the RDMA slot table and the TCP/UDP 
ones (which have proc entries under /proc/sys/sunrpc)?


-Phil


Re: NFS over RDMA benchmark

2013-04-25 Thread Tom Talpey

On 4/25/2013 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com
wrote:



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.


Hi Tom !

It was a shot in the dark :)  .. as our test bed has not been setup
yet .However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this


The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
for is called rdma_slot_table_entries.


moment as /proc entry is easy to add. More questions on the server
though (see below) ...



2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.



Could you give more hints on which part of the path is CPU limited ?


Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server
spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
has some ideas on the srv rdma code, but it could also be in the sunrpc
or infiniband driver layers, can't really tell without the call stacks.


Is there a known Linux-based filesystem that is reasonably tuned for
NFS-RDMA? Are there any specific filesystem features that would work well
with NFS-RDMA? I'm wondering, when disk+FS are added into the
configuration, how much advantage would NFS-RDMA get when compared
with a plain TCP/IP transport, say IPoIB-CM?


NFS-RDMA is not really filesystem dependent, but certainly there are
considerations for filesystems to support NFS, and of course the goal in
general is performance. NFS-RDMA is a network transport, applicable to
both client and server. Filesystem choice is a server consideration.

I don't have a simple answer to your question about how much better
NFS-RDMA is over other transports. Architecturally, a lot. In practice,
there are many, many variables. Have you seen RFC5532, that I cowrote
with the late Chet Juszczak? You may find it's still quite relevant.
http://tools.ietf.org/html/rfc5532


Re: NFS over RDMA benchmark

2013-04-25 Thread Tom Talpey

On 4/25/2013 3:01 PM, Phil Pishioneri wrote:

On 4/25/13 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.

It was a shot in the dark :)  .. as our test bed has not been setup
yet .However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this
moment as /proc entry is easy to add. More questions on the server
though (see below) ...


Might there be confusion between the RDMA slot table and the TCP/UDP
ones (which have proc entries under /proc/sys/sunrpc)?



No, the xprtrdma.ko creates similar slot table controls when it loads.
See the names below, prefixed with rdma:


tmt@Home:~$ ls /proc/sys/sunrpc
max_resvport  nfsd_debug  nlm_debug  tcp_fin_timeout 
tcp_slot_table_entries  udp_slot_table_entries
min_resvport  nfs_debug   rpc_debug  tcp_max_slot_table_entries  transports
tmt@Home:~$ sudo insmod xprtrdma
tmt@Home:~$ ls /proc/sys/sunrpc
max_resvport  nlm_debug  rdma_memreg_strategy 
tcp_fin_timeout udp_slot_table_entries
min_resvport  rdma_inline_write_padding  rdma_pad_optimize
tcp_max_slot_table_entries
nfsd_debugrdma_max_inline_read   rdma_slot_table_entries  
tcp_slot_table_entries
nfs_debug rdma_max_inline_write  rpc_debugtransports
tmt@Home:~$
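
Putting those pieces together, bumping the client slot count before a mount
would look something like this (a sketch - 64 is just the value floated earlier
in the thread, and the mount options are illustrative):

modprobe xprtrdma                                   # makes the rdma_* entries appear
echo 64 > /proc/sys/sunrpc/rdma_slot_table_entries
mount -t nfs -o proto=rdma,port=20049 server:/export /mnt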





Re: NFS over RDMA benchmark

2013-04-25 Thread Tom Tucker

On 4/25/13 3:04 PM, Tom Talpey wrote:

On 4/25/2013 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com
wrote:



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.


Hi Tom !

It was a shot in the dark :)  .. as our test bed has not been setup
yet .However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this


The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
for is called rdma_slot_table_entries.


moment as /proc entry is easy to add. More questions on the server
though (see below) ...



2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.



Could you give more hints on which part of the path is CPU limited ?


Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server
spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
has some ideas on the srv rdma code, but it could also be in the sunrpc
or infiniband driver layers, can't really tell without the call stacks.


The Mellanox driver uses red-black trees extensively for resource 
management, e.g. QP ID, CQ ID, etc... When completions come in from the 
HW, these are used to find the associated software data structures I 
believe. It is certainly possible that these trees get hot on lookup when 
we're pushing a lot of data. I'm surprised, however, to see 
rb_insert_color there, because I'm not aware of anywhere that resources
are being inserted into and/or removed from a red-black tree in the data path.


They are also used by IPoIB and the IB CM; however, connections should not
be coming and going unless we've got other problems. IPoIB is only used by
the IB transport for connection setup, and my impression is that this
trace is for the IB transport.


I don't believe that red-black trees are used by either the client or 
server transports directly. Note that the rb_lock in the client is for 
buffers; not, as the name might imply, a red-black tree.


I think the key here is to discover what lock is being waited on. Are we 
certain that it's a lock on a red-black tree and if so, which one?


Tom



Is there a known Linux-based filesystem that is reasonably tuned for
NFS-RDMA? Are there any specific filesystem features that would work well
with NFS-RDMA? I'm wondering, when disk+FS are added into the
configuration, how much advantage would NFS-RDMA get when compared
with a plain TCP/IP transport, say IPoIB-CM?


NFS-RDMA is not really filesystem dependent, but certainly there are
considerations for filesystems to support NFS, and of course the goal in
general is performance. NFS-RDMA is a network transport, applicable to
both client and server. Filesystem choice is a server consideration.

I don't have a simple answer to your question about how much better
NFS-RDMA is over other transports. Architecturally, a lot. In practice,
there are many, many variables. Have you seen RFC5532, that I cowrote
with the late Chet Juszczak? You may find it's still quite relevant.
http://tools.ietf.org/html/rfc5532


Re: NFS over RDMA benchmark

2013-04-25 Thread Wendy Cheng
On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker t...@opengridcomputing.com wrote:
 The Mellanox driver uses red-black trees extensively for resource
 management, e.g. QP ID, CQ ID, etc... When completions come in from the HW,
 these are used to find the associated software data structures I believe. It
 is certainly possible that these trees get hot on lookup when we're pushing
 a lot of data. I'm surprised, however, to see rb_insert_color there because
 I'm not aware of any where that resources are being inserted into and/or
 removed from a red-black tree in the data path.


I think they (the rb calls) are from the base kernel, not from any NFS and/or
IB module (e.g. RPC, MLX, etc). See the right column? It says
/root/vmlinux. Just a guess - I don't know much about this perf
command.

 -- Wendy


Re: NFS over RDMA benchmark

2013-04-25 Thread Wendy Cheng
On Thu, Apr 25, 2013 at 2:58 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker t...@opengridcomputing.com 
 wrote:
 The Mellanox driver uses red-black trees extensively for resource
 management, e.g. QP ID, CQ ID, etc... When completions come in from the HW,
 these are used to find the associated software data structures I believe. It
 is certainly possible that these trees get hot on lookup when we're pushing
 a lot of data. I'm surprised, however, to see rb_insert_color there because
 I'm not aware of any where that resources are being inserted into and/or
 removed from a red-black tree in the data path.


 I think they (rb calls) are from base kernel, not from any NFS and/or
 IB module (e.g. RPC, MLX, etc). See the right column ?  it says
 /root/vmlinux. Just a guess - I don't know much about this perf
 command.



Oops .. take my words back ! I confused Linux's RB tree w/ BSD's.
BSD's is a set of macros inside a header file while Linux's
implementation is a base kernel library. So every KMOD is a suspect
here :)

-- Wendy


RE: NFS over RDMA benchmark

2013-04-24 Thread Yan Burman


 -Original Message-
 From: J. Bruce Fields [mailto:bfie...@fieldses.org]
 Sent: Wednesday, April 24, 2013 00:06
 To: Yan Burman
 Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
 
 
   -Original Message-
   From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
   Sent: Wednesday, April 17, 2013 21:06
   To: Atchley, Scott
   Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
   linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
   Subject: Re: NFS over RDMA benchmark
  
   On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
   atchle...@ornl.gov
   wrote:
On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
   wrote:
   
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
   wrote:
Hi.
   
I've been trying to do some benchmarks for NFS over RDMA and I
seem to
   only get about half of the bandwidth that the HW can give me.
My setup consists of 2 servers each with 16 cores, 32Gb of
memory, and
   Mellanox ConnectX3 QDR card over PCI-e gen3.
These servers are connected to a QDR IB switch. The backing
storage on
   the server is tmpfs mounted with noatime.
I am running kernel 3.5.7.
   
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
the
   same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
   
Yan,
   
Are you trying to optimize single client performance or server
performance
   with multiple clients?
   
 
  I am trying to get maximum performance from a single server - I used 2
 processes in fio test - more than 2 did not show any performance boost.
  I tried running fio from 2 different PCs on 2 different files, but the sum 
  of
 the two is more or less the same as running from single client PC.
 
  What I did see is that server is sweating a lot more than the clients and
 more than that, it has 1 core (CPU5) in 100% softirq tasklet:
  cat /proc/softirqs
 
 Would any profiling help figure out which code it's spending time in?
 (E.g. something simple as perf top might have useful output.)
 


Perf top for the CPU with high tasklet count gives:

 samples  pcnt RIPfunctionDSO
 ___ _  ___ 
___

 2787.00 24.1% 81062a00 mutex_spin_on_owner 
/root/vmlinux
  978.00  8.4% 810297f0 clflush_cache_range 
/root/vmlinux
  445.00  3.8% 812ea440 __domain_mapping
/root/vmlinux
  441.00  3.8% 00018c30 svc_recv
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  344.00  3.0% 813a1bc0 _raw_spin_lock_bh   
/root/vmlinux
  333.00  2.9% 813a19e0 _raw_spin_lock_irqsave  
/root/vmlinux
  288.00  2.5% 813a07d0 __schedule  
/root/vmlinux
  249.00  2.1% 811a87e0 rb_prev 
/root/vmlinux
  242.00  2.1% 813a19b0 _raw_spin_lock  
/root/vmlinux
  184.00  1.6% 2e90 svc_rdma_sendto 
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
  177.00  1.5% 810ac820 get_page_from_freelist  
/root/vmlinux
  174.00  1.5% 812e6da0 alloc_iova  
/root/vmlinux
  165.00  1.4% 810b1390 put_page
/root/vmlinux
  148.00  1.3% 00014760 sunrpc_cache_lookup 
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  128.00  1.1% 00017f20 svc_xprt_enqueue
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  126.00  1.1% 8139f820 __mutex_lock_slowpath   
/root/vmlinux
  108.00  0.9% 811a81d0 rb_insert_color 
/root/vmlinux
  107.00  0.9% 4690 svc_rdma_recvfrom   
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
  102.00  0.9% 2640 send_reply  
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   99.00  0.9% 810e6490 kmem_cache_alloc
/root/vmlinux
   96.00  0.8% 810e5840 __slab_alloc
/root/vmlinux
   91.00  0.8% 6d30 mlx4_ib_post_send   
/lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
   88.00  0.8% 0dd0 svc_rdma_get_context
/lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   86.00  0.7% 813a1a10 _raw_spin_lock_irq  
/root

Re: NFS over RDMA benchmark

2013-04-24 Thread J. Bruce Fields
On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
 
 
  -Original Message-
  From: J. Bruce Fields [mailto:bfie...@fieldses.org]
  Sent: Wednesday, April 24, 2013 00:06
  To: Yan Burman
  Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
  linux-...@vger.kernel.org; Or Gerlitz
  Subject: Re: NFS over RDMA benchmark
  
  On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
  
  
-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Wednesday, April 17, 2013 21:06
To: Atchley, Scott
Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: NFS over RDMA benchmark
   
On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
atchle...@ornl.gov
wrote:
 On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I
 seem to
only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of
 memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing
 storage on
the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
 the
same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

 Yan,

 Are you trying to optimize single client performance or server
 performance
with multiple clients?

  
   I am trying to get maximum performance from a single server - I used 2
  processes in fio test - more than 2 did not show any performance boost.
   I tried running fio from 2 different PCs on 2 different files, but the 
   sum of
  the two is more or less the same as running from single client PC.
  
   What I did see is that server is sweating a lot more than the clients and
  more than that, it has 1 core (CPU5) in 100% softirq tasklet:
   cat /proc/softirqs
  
  Would any profiling help figure out which code it's spending time in?
  (E.g. something simple as perf top might have useful output.)
  
 
 
 Perf top for the CPU with high tasklet count gives:
 
  samples  pcnt RIPfunctionDSO
  ___ _  ___ 
 ___
 
  2787.00 24.1% 81062a00 mutex_spin_on_owner 
 /root/vmlinux

I guess that means lots of contention on some mutex?  If only we knew
which one perf should also be able to collect stack statistics, I
forget how.

--b.

   978.00  8.4% 810297f0 clflush_cache_range 
 /root/vmlinux
   445.00  3.8% 812ea440 __domain_mapping
 /root/vmlinux
   441.00  3.8% 00018c30 svc_recv
 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   344.00  3.0% 813a1bc0 _raw_spin_lock_bh   
 /root/vmlinux
   333.00  2.9% 813a19e0 _raw_spin_lock_irqsave  
 /root/vmlinux
   288.00  2.5% 813a07d0 __schedule  
 /root/vmlinux
   249.00  2.1% 811a87e0 rb_prev 
 /root/vmlinux
   242.00  2.1% 813a19b0 _raw_spin_lock  
 /root/vmlinux
   184.00  1.6% 2e90 svc_rdma_sendto 
 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   177.00  1.5% 810ac820 get_page_from_freelist  
 /root/vmlinux
   174.00  1.5% 812e6da0 alloc_iova  
 /root/vmlinux
   165.00  1.4% 810b1390 put_page
 /root/vmlinux
   148.00  1.3% 00014760 sunrpc_cache_lookup 
 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   128.00  1.1% 00017f20 svc_xprt_enqueue
 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   126.00  1.1% 8139f820 __mutex_lock_slowpath   
 /root/vmlinux
   108.00  0.9% 811a81d0 rb_insert_color 
 /root/vmlinux
   107.00  0.9% 4690 svc_rdma_recvfrom   
 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   102.00  0.9% 2640 send_reply  
 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
99.00  0.9% 810e6490 kmem_cache_alloc
 /root/vmlinux
96.00  0.8% 810e5840 __slab_alloc
 /root/vmlinux

Re: NFS over RDMA benchmark

2013-04-24 Thread J. Bruce Fields
On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
  
  
   -Original Message-
   From: J. Bruce Fields [mailto:bfie...@fieldses.org]
   Sent: Wednesday, April 24, 2013 00:06
   To: Yan Burman
   Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
   linux-...@vger.kernel.org; Or Gerlitz
   Subject: Re: NFS over RDMA benchmark
   
   On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
   
   
 -Original Message-
 From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
 Sent: Wednesday, April 17, 2013 21:06
 To: Atchley, Scott
 Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
 linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
 Subject: Re: NFS over RDMA benchmark

 On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
 atchle...@ornl.gov
 wrote:
  On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
 wrote:
 
  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
 wrote:
  Hi.
 
  I've been trying to do some benchmarks for NFS over RDMA and I
  seem to
 only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers each with 16 cores, 32Gb of
  memory, and
 Mellanox ConnectX3 QDR card over PCI-e gen3.
  These servers are connected to a QDR IB switch. The backing
  storage on
 the server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
  4-512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
  the
 same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 
  Yan,
 
  Are you trying to optimize single client performance or server
  performance
 with multiple clients?
 
   
I am trying to get maximum performance from a single server - I used 2
   processes in fio test - more than 2 did not show any performance boost.
I tried running fio from 2 different PCs on 2 different files, but the 
sum of
   the two is more or less the same as running from single client PC.
   
What I did see is that server is sweating a lot more than the clients 
and
   more than that, it has 1 core (CPU5) in 100% softirq tasklet:
cat /proc/softirqs
   
   Would any profiling help figure out which code it's spending time in?
   (E.g. something simple as perf top might have useful output.)
   
  
  
  Perf top for the CPU with high tasklet count gives:
  
   samples  pcnt RIPfunction
  DSO
   ___ _  ___ 
  ___
  
   2787.00 24.1% 81062a00 mutex_spin_on_owner 
  /root/vmlinux
 
 I guess that means lots of contention on some mutex?  If only we knew
 which one perf should also be able to collect stack statistics, I
 forget how.

Googling around, I think we want:

perf record -a --call-graph
(give it a chance to collect some samples, then ^C)
perf report --call-graph --stdio
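
If it helps, a narrower capture that only samples the busy core is often easier
to read - a sketch, assuming this perf build accepts -a together with -C; CPU 5
is the core reported earlier at 100% softirq:

    perf record -a -C 5 -g sleep 30
    perf report -g --stdio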

--b.

 
 --b.
 
978.00  8.4% 810297f0 clflush_cache_range 
  /root/vmlinux
445.00  3.8% 812ea440 __domain_mapping
  /root/vmlinux
441.00  3.8% 00018c30 svc_recv
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
344.00  3.0% 813a1bc0 _raw_spin_lock_bh   
  /root/vmlinux
333.00  2.9% 813a19e0 _raw_spin_lock_irqsave  
  /root/vmlinux
288.00  2.5% 813a07d0 __schedule  
  /root/vmlinux
249.00  2.1% 811a87e0 rb_prev 
  /root/vmlinux
242.00  2.1% 813a19b0 _raw_spin_lock  
  /root/vmlinux
184.00  1.6% 2e90 svc_rdma_sendto 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
177.00  1.5% 810ac820 get_page_from_freelist  
  /root/vmlinux
174.00  1.5% 812e6da0 alloc_iova  
  /root/vmlinux
165.00  1.4% 810b1390 put_page
  /root/vmlinux
148.00  1.3% 00014760 sunrpc_cache_lookup 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
128.00  1.1% 00017f20 svc_xprt_enqueue
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
126.00  1.1% 8139f820 __mutex_lock_slowpath   
  /root/vmlinux
108.00  0.9% 811a81d0 rb_insert_color 
  /root/vmlinux
107.00  0.9% 4690 svc_rdma_recvfrom   
  /lib

Re: NFS over RDMA benchmark

2013-04-24 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote:
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
 
 
 
  Perf top for the CPU with high tasklet count gives:
 
   samples  pcnt RIPfunction
  DSO
   ___ _  ___ 
  ___
 
   2787.00 24.1% 81062a00 mutex_spin_on_owner 
  /root/vmlinux

 I guess that means lots of contention on some mutex?  If only we knew
 which one perf should also be able to collect stack statistics, I
 forget how.

 Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio


I have not looked at the NFS RDMA (and 3.x kernel) source yet. But do you see
that rb_prev up in the #7 spot? Do we have a red-black tree somewhere
in these paths? Trees like that require extensive locking.

-- Wendy

.

978.00  8.4% 810297f0 clflush_cache_range 
  /root/vmlinux
445.00  3.8% 812ea440 __domain_mapping
  /root/vmlinux
441.00  3.8% 00018c30 svc_recv
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
344.00  3.0% 813a1bc0 _raw_spin_lock_bh   
  /root/vmlinux
333.00  2.9% 813a19e0 _raw_spin_lock_irqsave  
  /root/vmlinux
288.00  2.5% 813a07d0 __schedule  
  /root/vmlinux
249.00  2.1% 811a87e0 rb_prev 
  /root/vmlinux
242.00  2.1% 813a19b0 _raw_spin_lock  
  /root/vmlinux
184.00  1.6% 2e90 svc_rdma_sendto 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
177.00  1.5% 810ac820 get_page_from_freelist  
  /root/vmlinux
174.00  1.5% 812e6da0 alloc_iova  
  /root/vmlinux
165.00  1.4% 810b1390 put_page
  /root/vmlinux
148.00  1.3% 00014760 sunrpc_cache_lookup 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
128.00  1.1% 00017f20 svc_xprt_enqueue
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
126.00  1.1% 8139f820 __mutex_lock_slowpath   
  /root/vmlinux
108.00  0.9% 811a81d0 rb_insert_color 
  /root/vmlinux
107.00  0.9% 4690 svc_rdma_recvfrom   
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
102.00  0.9% 2640 send_reply  
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
 99.00  0.9% 810e6490 kmem_cache_alloc
  /root/vmlinux
 96.00  0.8% 810e5840 __slab_alloc
  /root/vmlinux
 91.00  0.8% 6d30 mlx4_ib_post_send   
  /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
 88.00  0.8% 0dd0 svc_rdma_get_context
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
 86.00  0.7% 813a1a10 _raw_spin_lock_irq  
  /root/vmlinux
 86.00  0.7% 1530 svc_rdma_send   
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
 85.00  0.7% 81060a80 prepare_creds   
  /root/vmlinux
 83.00  0.7% 810a5790 find_get_pages_contig   
  /root/vmlinux
 79.00  0.7% 810e4620 __slab_free 
  /root/vmlinux
 79.00  0.7% 813a1a40 _raw_spin_unlock_irqrestore 
  /root/vmlinux
 77.00  0.7% 81065610 finish_task_switch  
  /root/vmlinux
 76.00  0.7% 812e9270 pfn_to_dma_pte  
  /root/vmlinux
 75.00  0.6% 810976d0 __call_rcu  
  /root/vmlinux
 73.00  0.6% 811a2fa0 _atomic_dec_and_lock
  /root/vmlinux
 73.00  0.6% 02e0 svc_rdma_has_wspace 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
 67.00  0.6% 813a1a70 _raw_read_lock  
  /root/vmlinux
 65.00  0.6% f590 svcauth_unix_set_client 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
 63.00  0.5% 000180e0 svc_reserve 
  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
 60.00  0.5% 64d0 stamp_send_wqe  

Re: NFS over RDMA benchmark

2013-04-24 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote:
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
 
 
 
  Perf top for the CPU with high tasklet count gives:
 
   samples  pcnt RIPfunction
  DSO
   ___ _  ___ 
  ___
 
   2787.00 24.1% 81062a00 mutex_spin_on_owner 
  /root/vmlinux

 I guess that means lots of contention on some mutex?  If only we knew
 which one perf should also be able to collect stack statistics, I
 forget how.

 Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio


 I have not looked at NFS RDMA (and 3.x kernel) source yet. But see
 that rb_prev up in the #7 spot ? Do we have Red Black tree somewhere
 in the paths ? Trees like that requires extensive lockings.


So I did a quick read of the sunrpc/xprtrdma source (based on the OFA 1.5.4.1
tarball)... Here is a random thought (not related to the rb tree
comment).

The in-flight packet count seems to be controlled by
xprt_rdma_slot_table_entries, which is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help
with the bandwidth number if we pump it up, say to 64 instead? Not sure
whether the FMR pool size needs to be adjusted accordingly, though.

In short, if anyone has a benchmark setup handy, bumping up the slot
table size as follows might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h
2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h  2013-04-24
10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
 #define RPCRDMA_MAX_SLOT_TABLE (256U)

 #define RPCRDMA_DEF_INLINE  (1024) /* default inline max */

-- Wendy


Re: NFS over RDMA benchmark

2013-04-24 Thread Tom Talpey

On 4/24/2013 2:04 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote:

On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:

On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:




Perf top for the CPU with high tasklet count gives:

  samples  pcnt RIPfunctionDSO
  ___ _  ___ 
___

  2787.00 24.1% 81062a00 mutex_spin_on_owner 
/root/vmlinux


I guess that means lots of contention on some mutex?  If only we knew
which one perf should also be able to collect stack statistics, I
forget how.


Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio



I have not looked at NFS RDMA (and 3.x kernel) source yet. But see
that rb_prev up in the #7 spot ? Do we have Red Black tree somewhere
in the paths ? Trees like that requires extensive lockings.



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded; it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, spindle-limited workload.
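
(For reference, a sketch of that /proc route on the client - assuming the
xprtrdma sysctl names from net/sunrpc/xprtrdma/transport.c still apply, and
with the export path as a placeholder:)

    echo 64 > /proc/sys/sunrpc/rdma_slot_table_entries
    mount -t nfs -o rdma,port=20049 server:/export /mnt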

2)

The observation appears to be that the bandwidth is server-CPU limited.
Increasing the load offered by the client probably won't move the needle
until that's addressed.
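
(A quick way to confirm that during a run - a sketch, assuming the sysstat
tools are installed on the server:)

    mpstat -P ALL 1    # per-CPU utilization on the server while fio runs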




In short, if anyone has benchmark setup handy, bumping up the slot
table size as the following might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h
2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h  2013-04-24
10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
   * a single chunk type per message is supported currently.
   */
  #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
  #define RPCRDMA_MAX_SLOT_TABLE (256U)

  #define RPCRDMA_DEF_INLINE  (1024) /* default inline max */

-- Wendy


Re: NFS over RDMA benchmark

2013-04-23 Thread J. Bruce Fields
On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
 
 
  -Original Message-
  From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
  Sent: Wednesday, April 17, 2013 21:06
  To: Atchley, Scott
  Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
  linux-...@vger.kernel.org
  Subject: Re: NFS over RDMA benchmark
  
  On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov
  wrote:
   On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
  wrote:
  
   On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
  wrote:
   Hi.
  
   I've been trying to do some benchmarks for NFS over RDMA and I seem to
  only get about half of the bandwidth that the HW can give me.
   My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
  Mellanox ConnectX3 QDR card over PCI-e gen3.
   These servers are connected to a QDR IB switch. The backing storage on
  the server is tmpfs mounted with noatime.
   I am running kernel 3.5.7.
  
   When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
   When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
  same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
  
   Yan,
  
   Are you trying to optimize single client performance or server performance
  with multiple clients?
  
 
 I am trying to get maximum performance from a single server - I used 2 
 processes in fio test - more than 2 did not show any performance boost.
 I tried running fio from 2 different PCs on 2 different files, but the sum of 
 the two is more or less the same as running from single client PC.
 
 What I did see is that server is sweating a lot more than the clients and 
 more than that, it has 1 core (CPU5) in 100% softirq tasklet:
 cat /proc/softirqs

Would any profiling help figure out which code it's spending time in?
(E.g. something as simple as perf top might have useful output.)

--b.

 CPU0   CPU1   CPU2   CPU3   CPU4   
 CPU5   CPU6   CPU7   CPU8   CPU9   CPU10  CPU11  
 CPU12  CPU13  CPU14  CPU15
   HI:  0  0  0  0  0  
 0  0  0  0  0  0  0  
 0  0  0  0
TIMER: 418767  46596  43515  44547  50099  
 34815  40634  40337  39551  93442  73733  42631  
 42509  41592  40351  61793
   NET_TX:  28719309   1421   1294   1730   
 1243832937 11 44 41 20
  26 19 15 29
   NET_RX: 612070 19 22 21  6
 235  3  2  9  6 17 16 
 20 13 16 10
BLOCK:   5941  0  0  0  0  
 0  0  0519259   1238272
 253174215   2618
 BLOCK_IOPOLL:  0  0  0  0  0  
 0  0  0  0  0  0  0  
 0  0  0  0
  TASKLET: 28  1  1  1  1
 1540653  1  1 29  1  1  1 
  1  1  1  2
SCHED: 364965  26547  16807  18403  22919   
 8678  14358  14091  16981  64903  47141  18517  
 19179  18036  17037  38261
  HRTIMER: 13  0  1  1  0  
 0  0  0  0  0  0  0  
 1  1  0  1
  RCU: 945823 841546 715281 892762 823564  
 42663 863063 841622 333577 389013 393501 239103 
 221524 258159 313426 234030
  
   Remember there are always gaps between wire speed (that ib_send_bw
   measures) and real world applications.
 
 I realize that, but I don't expect the difference to be more than twice.
 
  
   That being said, does your server use default export (sync) option ?
   Export the share with async option can bring you closer to wire
   speed. However, the practice (async) is generally not recommended in
   a real production system - as it can cause data integrity issues, e.g.
   you have more chances to lose data when the boxes crash.
 
 I am running with async export option, but that should not matter too much, 
 since my backing storage is tmpfs mounted with noatime.
 
  
   -- Wendy
  
  
   Wendy,
  
   It has a been a few years since I looked at RPCRDMA, but I seem to
  remember that RPCs were limited to 32KB which means that you have to
  pipeline them to get linerate. In addition to requiring

RE: NFS over RDMA benchmark

2013-04-22 Thread Yan Burman


 -Original Message-
 From: Peng Tao [mailto:bergw...@gmail.com]
 Sent: Friday, April 19, 2013 05:28
 To: Yan Burman
 Cc: J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-
 n...@vger.kernel.org
 Subject: Re: NFS over RDMA benchmark
 
 On Wed, Apr 17, 2013 at 10:36 PM, Yan Burman y...@mellanox.com
 wrote:
  Hi.
 
  I've been trying to do some benchmarks for NFS over RDMA and I seem to
 only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
 Mellanox ConnectX3 QDR card over PCI-e gen3.
  These servers are connected to a QDR IB switch. The backing storage on the
 server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same
 block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
  I got to these results after the following optimizations:
  1. Setting IRQ affinity to the CPUs that are part of the NUMA node the
  card is on 2. Increasing
  /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests and
  /proc/sys/sunrpc/svc_rdma/max_requests to 256 on server 3. Increasing
  RPCNFSDCOUNT to 32 on server
 Did you try to affine nfsd to corresponding CPUs where your IB card locates?
 Given that you see a bottleneck on CPU (as in your later email), it might be
 worth trying.

I tried to affine nfsd to CPUs on the NUMA node the IB card is on.
I also set tmpfs memory policy to allocate from the same NUMA node.
I did not see a big difference.
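
(Roughly what that pinning looks like as commands - a sketch; the CPU list
8-15 and NUMA node 1 are placeholders for wherever the HCA actually sits:)

    for pid in $(pgrep nfsd); do taskset -cp 8-15 $pid; done
    mount -t tmpfs -o noatime,mpol=bind:1 tmpfs /export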

 
  4. FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
  --ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
  --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
  --norandommap --group_reporting --exitall --buffered=0
 
 On client side, it may be good to affine FIO processes and nfsiod to CPUs
 where IB card locates as well, in case client is the bottleneck.
 

I am doing that - cpumask=255 affines fio to the NUMA node my card is on.
For some reason, running taskset on nfsiod fails.
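
(That is expected if nfsiod is a workqueue rather than a dedicated kernel
thread on this kernel, which seems to be the case on 3.x - then there is no
nfsiod pid for taskset to act on. A quick check:)

    ps -e -o pid,comm | grep -w nfsiod    # no output means no nfsiod thread exists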

 --
 Thanks,
 Tao

RE: NFS over RDMA benchmark

2013-04-18 Thread Yan Burman


 -Original Message-
 From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
 Sent: Wednesday, April 17, 2013 21:06
 To: Atchley, Scott
 Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org
 Subject: Re: NFS over RDMA benchmark
 
 On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov
 wrote:
  On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
 wrote:
 
  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
 wrote:
  Hi.
 
  I've been trying to do some benchmarks for NFS over RDMA and I seem to
 only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
 Mellanox ConnectX3 QDR card over PCI-e gen3.
  These servers are connected to a QDR IB switch. The backing storage on
 the server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
 same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 
  Yan,
 
  Are you trying to optimize single client performance or server performance
 with multiple clients?
 

I am trying to get maximum performance from a single server - I used 2
processes in the fio test; more than 2 did not show any performance boost.
I tried running fio from 2 different PCs on 2 different files, but the sum of
the two is more or less the same as running from a single client PC.

What I did see is that the server is sweating a lot more than the clients; more
than that, it has 1 core (CPU5) at 100% in softirq (tasklet):
cat /proc/softirqs
CPU0   CPU1   CPU2   CPU3   CPU4   CPU5 
  CPU6   CPU7   CPU8   CPU9   CPU10  CPU11  CPU12   
   CPU13  CPU14  CPU15
  HI:  0  0  0  0  0  0 
 0  0  0  0  0  0  0
  0  0  0
   TIMER: 418767  46596  43515  44547  50099  34815 
 40634  40337  39551  93442  73733  42631  42509
  41592  40351  61793
  NET_TX:  28719309   1421   1294   1730   1243 
   832937 11 44 41 20 26
 19 15 29
  NET_RX: 612070 19 22 21  6235 
 3  2  9  6 17 16 20
 13 16 10
   BLOCK:   5941  0  0  0  0  0 
 0  0519259   1238272253
174215   2618
BLOCK_IOPOLL:  0  0  0  0  0  0 
 0  0  0  0  0  0  0
  0  0  0
 TASKLET: 28  1  1  1  11540653 
 1  1 29  1  1  1  1
  1  1  2
   SCHED: 364965  26547  16807  18403  22919   8678 
 14358  14091  16981  64903  47141  18517  19179
  18036  17037  38261
 HRTIMER: 13  0  1  1  0  0 
 0  0  0  0  0  0  1
  1  0  1
 RCU: 945823 841546 715281 892762 823564  42663 
863063 841622 333577 389013 393501 239103 221524
 258159 313426 234030
 
  Remember there are always gaps between wire speed (that ib_send_bw
  measures) and real world applications.

I realize that, but I don't expect the difference to be more than a factor of two.

 
  That being said, does your server use default export (sync) option ?
  Export the share with async option can bring you closer to wire
  speed. However, the practice (async) is generally not recommended in
  a real production system - as it can cause data integrity issues, e.g.
  you have more chances to lose data when the boxes crash.

I am running with the async export option, but that should not matter too much, 
since my backing storage is tmpfs mounted with noatime.

 
  -- Wendy
 
 
  Wendy,
 
  It has a been a few years since I looked at RPCRDMA, but I seem to
 remember that RPCs were limited to 32KB which means that you have to
 pipeline them to get linerate. In addition to requiring pipelining, the
 argument from the authors was that the goal was to maximize server
 performance and not single client performance.
 

What I see is that performance increases almost linearly up to block size 256K
and falls a little at block size 512K.
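
(The sweep itself, sketched as a loop - the filename, size and mount point are
placeholders; the remaining options are the ones listed earlier in the thread:)

    for bs in 4k 16k 64k 256k 512k; do
        fio --name=randread-$bs --rw=randread --bs=$bs --numjobs=2 --iodepth=128 \
            --ioengine=libaio --direct=1 --size=10g \
            --filename=/mnt/nfs/testfile --group_reporting
    done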

  Scott
 
 
 That (client count) brings up a good

Re: NFS over RDMA benchmark

2013-04-18 Thread Wendy Cheng
On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote:


 What do you suggest for benchmarking NFS?


I believe SPECsfs has been widely used by NFS (server) vendors to
position their product lines. Its workload was based on a real-life
NFS deployment. I think it is geared more toward an office type of workload
(large client/user count with smaller file sizes, e.g. software
development with build, compile, etc.).

BTW, we're experimenting with a similar project and would be interested to
know your findings.

-- Wendy


Re: NFS over RDMA benchmark

2013-04-18 Thread Wendy Cheng
On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
spencer.shep...@gmail.com wrote:

 Note that SPEC SFS does not support RDMA.


IIRC, the benchmark comes with source code - I'm wondering whether anyone has
modified it to run over RDMA? Or is there any real user who can share their
experience?

-- Wendy

 
 From: Wendy Cheng
 Sent: 4/18/2013 9:16 AM
 To: Yan Burman
 Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz

 Subject: Re: NFS over RDMA benchmark

 On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote:


 What do you suggest for benchmarking NFS?


 I believe SPECsfs has been widely used by NFS (server) vendors to
 position their product lines. Its workload was based on a real life
 NFS deployment. I think it is more torward office type of workload
 (large client/user count with smaller file sizes e.g. software
 development with build, compile, etc).

 BTW, we're experimenting a similar project and would be interested to
 know your findings.

 -- Wendy


Re: NFS over RDMA benchmark

2013-04-18 Thread Atchley, Scott
On Apr 18, 2013, at 3:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

 On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
 spencer.shep...@gmail.com wrote:
 
 Note that SPEC SFS does not support RDMA.
 
 
 IIRC, the benchmark comes with source code - wondering anyone has
 modified it to run on RDMA ?  Or is there any real user to share the
 experience ?

I am not familiar with SpecSFS, but if it exercises the filesystem, it does not 
know which RPC layer NFS uses, no? Or does it implement its own client and 
directly access the RPC layer?

 
 -- Wendy
 
 
 From: Wendy Cheng
 Sent: 4/18/2013 9:16 AM
 To: Yan Burman
 Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 
 Subject: Re: NFS over RDMA benchmark
 
 On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote:
 
 
 What do you suggest for benchmarking NFS?
 
 
 I believe SPECsfs has been widely used by NFS (server) vendors to
 position their product lines. Its workload was based on a real life
 NFS deployment. I think it is more torward office type of workload
 (large client/user count with smaller file sizes e.g. software
 development with build, compile, etc).
 
 BTW, we're experimenting a similar project and would be interested to
 know your findings.
 
 -- Wendy


Re: NFS over RDMA benchmark

2013-04-18 Thread Peng Tao
On Wed, Apr 17, 2013 at 10:36 PM, Yan Burman y...@mellanox.com wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I seem to only 
 get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of memory, and 
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on the 
 server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block 
 sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 I got to these results after the following optimizations:
 1. Setting IRQ affinity to the CPUs that are part of the NUMA node the card 
 is on
 2. Increasing /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests and 
 /proc/sys/sunrpc/svc_rdma/max_requests to 256 on server
 3. Increasing RPCNFSDCOUNT to 32 on server
Did you try to affine nfsd to the CPUs on the NUMA node where your IB card
is located? Given that you see a CPU bottleneck (as in your later
email), it might be worth trying.

 4. FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 
 --ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255 
 --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 
 --norandommap --group_reporting --exitall --buffered=0

On the client side, it may be good to affine the FIO processes and nfsiod to
the CPUs where the IB card is located as well, in case the client is the bottleneck.
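
(For reference, the server-side knobs from the list quoted above as commands -
a sketch; the mlx4 match, the CPU list and the thread count are placeholders
for this particular box:)

    # spread the HCA interrupts over the local NUMA node's CPUs
    for irq in $(awk -F: '/mlx4/ {print $1}' /proc/interrupts); do
        echo 8-15 > /proc/irq/$irq/smp_affinity_list   # older kernels: smp_affinity (hex mask)
    done
    # svc_rdma credits and nfsd thread count
    echo 256 > /proc/sys/sunrpc/svc_rdma/max_requests
    echo 256 > /proc/sys/sunrpc/svc_rdma/max_outbound_read_requests
    rpc.nfsd 32    # roughly what RPCNFSDCOUNT=32 does at service start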

--
Thanks,
Tao


Re: NFS over RDMA benchmark

2013-04-18 Thread Spencer


On Apr 18, 2013, at 6:03 PM, Atchley, Scott wrote:

 On Apr 18, 2013, at 3:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 
 On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
 spencer.shep...@gmail.com wrote:
 
 Note that SPEC SFS does not support RDMA.
 
 
 IIRC, the benchmark comes with source code - wondering anyone has
 modified it to run on RDMA ?  Or is there any real user to share the
 experience ?
 
 I am not familiar with SpecSFS, but if it exercises the filesystem, it does 
 not know which RPC layer that NFS uses, no? Or does it implement its own 
 client and directly access the RPC layer?


Yes, the SPEC SFS benchmark implements its own NFSv3 client, RPC layer, etc.

Spencer

 
 
 -- Wendy
 
 
 From: Wendy Cheng
 Sent: 4/18/2013 9:16 AM
 To: Yan Burman
 Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 
 Subject: Re: NFS over RDMA benchmark
 
 On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote:
 
 
 What do you suggest for benchmarking NFS?
 
 
 I believe SPECsfs has been widely used by NFS (server) vendors to
 position their product lines. Its workload was based on a real life
 NFS deployment. I think it is more torward office type of workload
 (large client/user count with smaller file sizes e.g. software
 development with build, compile, etc).
 
 BTW, we're experimenting a similar project and would be interested to
 know your findings.
 
 -- Wendy


Re: NFS over RDMA benchmark

2013-04-17 Thread Wendy Cheng
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I seem to only 
 get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of memory, and 
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on the 
 server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block 
 sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

Remember there are always gaps between wire speed (what ib_send_bw
measures) and real-world applications.

That being said, does your server use the default export (sync) option?
Exporting the share with the async option can bring you closer to wire
speed. However, async is generally not recommended in a
real production system, as it can cause data integrity issues, e.g.
you have a higher chance of losing data when the boxes crash.
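
(To make the sync/async distinction concrete, hypothetical /etc/exports lines -
the path and client range are placeholders:)

    /export 192.168.1.0/24(rw,sync,no_subtree_check)     # default: safer, slower
    /export 192.168.1.0/24(rw,async,no_subtree_check)    # faster, riskier on a crash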

-- Wendy


Re: NFS over RDMA benchmark

2013-04-17 Thread Atchley, Scott
On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
 Hi.
 
 I've been trying to do some benchmarks for NFS over RDMA and I seem to only 
 get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of memory, and 
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on the 
 server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.
 
 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same 
 block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

Yan,

Are you trying to optimize single client performance or server performance with 
multiple clients?


 Remember there are always gaps between wire speed (that ib_send_bw
 measures) and real world applications.
 
 That being said, does your server use default export (sync) option ?
 Export the share with async option can bring you closer to wire
 speed. However, the practice (async) is generally not recommended in a
 real production system - as it can cause data integrity issues, e.g.
 you have more chances to lose data when the boxes crash.
 
 -- Wendy


Wendy,

It has been a few years since I looked at RPCRDMA, but I seem to remember 
that RPCs were limited to 32KB, which means that you have to pipeline them to 
get line rate. In addition to requiring pipelining, the argument from the 
authors was that the goal was to maximize server performance and not single 
client performance.
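
(One way to see what transfer size a client actually negotiates - a sketch;
20049 is the conventional NFS/RDMA port and the rsize/wsize values are just
examples, the server may cap them lower:)

    mount -t nfs -o rdma,port=20049,rsize=262144,wsize=262144 server:/export /mnt
    nfsstat -m    # shows the rsize/wsize in effect for each NFS mount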

Scott



Re: NFS over RDMA benchmark

2013-04-17 Thread Wendy Cheng
On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov wrote:
 On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I seem to only 
 get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of memory, and 
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on the 
 server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same 
 block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

 Yan,

 Are you trying to optimize single client performance or server performance 
 with multiple clients?


 Remember there are always gaps between wire speed (that ib_send_bw
 measures) and real world applications.

 That being said, does your server use default export (sync) option ?
 Export the share with async option can bring you closer to wire
 speed. However, the practice (async) is generally not recommended in a
 real production system - as it can cause data integrity issues, e.g.
 you have more chances to lose data when the boxes crash.

 -- Wendy


 Wendy,

 It has a been a few years since I looked at RPCRDMA, but I seem to remember 
 that RPCs were limited to 32KB which means that you have to pipeline them to 
 get linerate. In addition to requiring pipelining, the argument from the 
 authors was that the goal was to maximize server performance and not single 
 client performance.

 Scott


That (client count) brings up a good point ...

FIO is really not a good benchmark for NFS. Does anyone have SPECsfs
numbers for NFS over RDMA to share?

-- Wendy