RE: NFSoRDMA developers bi-weekly meeting minutes (11/20)

2014-11-24 Thread Yan Burman


 -----Original Message-----
 From: Shirley Ma [mailto:shirley...@oracle.com]
 Sent: Friday, November 21, 2014 00:00
 To: Cheng, Wendy; Charles EDWARD Lever; anna.schuma...@netapp.com;
 devesh.sha...@emulex.com; dledf...@redhat.com;
 dominique.marti...@cea.fr; jeffrey.c.bec...@nasa.gov; rsdance@soft-
 forge.com; s...@lanl.gov; spra...@redhat.com; ste...@redhat.com;
 sw...@opengridcomputing.com; Yan Burman; linux-rdma; Linux NFS Mailing
 List
 Subject: Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)
 
 
 On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
  -----Original Message-----
  From: Shirley Ma [mailto:shirley...@oracle.com]
  Sent: Thursday, November 20, 2014 10:24 AM
 
  
  iSER 8K could reach 4.5GB/s at 56Gb/s link speed, 1.5 million IOPS.
  32K could reach 1.8 million IOPS.
 
 
  How was the iSER data measured? Was the measurement done at the iSER layer,
  block layer, or filesystem layer?
 
 Here is the link on how to set up iSER and measure performance:
 http://community.mellanox.com/docs/DOC-1483

The actual numbers are as follows (there seems to be some misunderstanding in
the meeting minutes).
For a single LUN/session with iSER on a ConnectX-3 FDR link and an 8-core
2.6GHz Xeon:
8K block size reaches 2.5GB/s
Somewhere between 16K and 32K block size, iSER reaches 5.5GB/s, which is
almost line rate
256K block size gives 5.7GB/s

With 16 sessions, it is possible to reach 1.7M IOPS with 1K block size and 
about 600K IOPS with 8K block size.
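
As a rough cross-check of figures like these, bandwidth and IOPS are related by bandwidth = IOPS x block size. The sketch below is only illustrative arithmetic: the helper names are made up, and it assumes 1K = 1024 bytes and 1GB = 10^9 bytes, neither of which the thread specifies.

```python
# Illustrative arithmetic only; helper names are hypothetical.
# Assumptions (not stated in the thread): 1K = 1024 bytes, 1 GB = 1e9 bytes.

KIB = 1024
GB = 1e9


def bandwidth_gb_per_s(iops, block_bytes):
    """Bandwidth (GB/s) implied by an IOPS rate at a given block size."""
    return iops * block_bytes / GB


def iops_from_bandwidth(gb_per_s, block_bytes):
    """IOPS implied by a bandwidth (GB/s) at a given block size."""
    return gb_per_s * GB / block_bytes


if __name__ == "__main__":
    # 16 sessions: 1.7M IOPS at 1K and about 600K IOPS at 8K.
    print(f"1.7M IOPS @ 1K -> {bandwidth_gb_per_s(1.7e6, 1 * KIB):.2f} GB/s")
    print(f"600K IOPS @ 8K -> {bandwidth_gb_per_s(600e3, 8 * KIB):.2f} GB/s")
    # Single session: 2.5GB/s at 8K.
    print(f"2.5 GB/s @ 8K  -> {iops_from_bandwidth(2.5, 8 * KIB):,.0f} IOPS")
    # Figure from the minutes: 1.5M IOPS at 8K, next to the raw 56Gb/s
    # link speed expressed in bytes (56e9 / 8 = 7 GB/s).
    print(f"1.5M IOPS @ 8K -> {bandwidth_gb_per_s(1.5e6, 8 * KIB):.2f} GB/s "
          f"(link: {56e9 / 8 / GB:.1f} GB/s)")
```

Under these assumptions, 1.5 million IOPS at 8K would imply roughly 12.3GB/s, which is more than the 7GB/s raw rate of a 56Gb/s link. That is consistent with the minutes having mixed up some of the figures.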

Note that these numbers are for the SCST iSER implementation, and there are
further tunings and enhancements that can be applied to improve performance.

Another issue that came up is the benefit of RDMA vs. send.
To check this, you can run ib_send_lat and ib_write_lat and compare the
latencies at different block sizes.
From past experience, RDMA starts to get more efficient somewhere around 8K.
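
One way to turn two such latency sweeps into an answer is to line them up by message size and look for the first size where the write latency drops below the send latency. The sketch below assumes the per-size latencies have already been copied by hand from the ib_send_lat and ib_write_lat output into dictionaries. The function name and data layout are hypothetical, and no real measurements are included.

```python
# Hypothetical helper: compares two latency tables keyed by message size.
# The tables are assumed to be filled in by hand from the tools' output;
# nothing here runs or parses ib_send_lat / ib_write_lat.


def first_size_where_write_wins(send_lat_usec, write_lat_usec):
    """Return the smallest message size (bytes) at which the RDMA write
    latency is lower than the send latency, or None if it never is.

    Both arguments map message size in bytes -> latency in microseconds.
    Only sizes present in both tables are compared.
    """
    common_sizes = sorted(set(send_lat_usec) & set(write_lat_usec))
    for size in common_sizes:
        if write_lat_usec[size] < send_lat_usec[size]:
            return size
    return None
```

Real sweeps can be noisy, so it is probably worth checking that the ordering holds for the larger sizes as well, rather than trusting a single crossover point.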

Yan

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)

2014-11-20 Thread Chuck Lever

On Nov 20, 2014, at 1:24 PM, Shirley Ma shirley...@oracle.com wrote:

 Attendees:
 
 Jeff Becker (NASA)
 Yan Burman (Mellanox)
 Wendy Cheng (Intel)
 Rupert Dance (Soft Forge)
 Steve Dickson (Red Hat)
 Chuck Lever (Oracle)
 Doug Ledford (Red Hat)
 Shirley Ma (Oracle)
 Sachin Prabhu (Red Hat)
 Devesh Sharma (Emulex)
 Anna Schumaker (NetApp)
 Steve Wise (OpenGridComputing, Chelsio)
 
 Moderator:
 Shirley Ma (Oracle)
 
 The NFSoRDMA developers bi-weekly meeting helps organize NFSoRDMA development
 and test efforts across different resources, to speed up NFSoRDMA upstream
 kernel work and the development of NFSoRDMA diagnosing/debugging tools.
 Hopefully the quality of NFSoRDMA upstream patches can be improved by being
 tested with a quorum of HW vendors.
 
 Today's meeting notes:
 
 NFSoRDMA performance:
 ---------------------
 Even though NFSoRDMA performance seems better than IPoIB-cm, the gap between
 what the IB protocol can provide and what NFS (RDMA, IPoIB-cm) can achieve is
 still big at small I/O block sizes (the focus is on 8K I/O size for database
 workloads). Even at large I/O block sizes (128K and above), NFS performance is
 not comparable to RDMA microbenchmarks. We are focusing our effort on finding
 the root cause. Several experimental methods have been tried to improve
 NFSoRDMA performance.
 
 Yan saw that the NFS server does RDMA send for small packet sizes, less than
 100 bytes, which should have used post_send instead.

This is an artifact of how NFS/RDMA works.

The client provides a registered area for the server to write
into if an RPC reply is larger than the small pre-posted
buffers that are normally used.

Most of the time, each RPC reply is small enough to use RDMA
SEND, and the server can convey the RPC/RDMA header and the
RPC reply in a single SEND operation.

If the reply is large, the server conveys the RPC/RDMA header
via RDMA SEND, and the RPC reply via an RDMA WRITE into the
client’s registered buffer.

Solaris server chooses RDMA SEND in nearly every case.

Linux server chooses RDMA SEND then RDMA WRITE whenever
the client offers that choice.
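
The choice between the two reply paths described above can be sketched roughly as follows. This is a simplified model, not the actual server code: all names are hypothetical, the 1024-byte inline threshold is an assumed value, and the RPC/RDMA header size is ignored.

```python
# Sketch of the reply-path choice described above. All names are
# hypothetical, and the 1024-byte inline threshold is an assumed value.

from enum import Enum


class ReplyPath(Enum):
    SEND_ONLY = "RPC/RDMA header and RPC reply in a single SEND"
    SEND_PLUS_WRITE = "header via SEND, RPC reply via RDMA WRITE"


def choose_reply_path(reply_len, client_offered_buffer,
                      prefer_write=False, inline_threshold=1024):
    """Pick how a server conveys an RPC reply.

    reply_len: size of the RPC reply in bytes.
    client_offered_buffer: True if the client registered an area for
        the server to write into.
    prefer_write: True models a server that uses RDMA WRITE whenever the
        client offers the choice; False models one that uses SEND
        whenever the reply fits in the small pre-posted buffers.
    """
    fits_inline = reply_len <= inline_threshold
    if not client_offered_buffer:
        if not fits_inline:
            raise ValueError("large reply but no registered buffer offered")
        return ReplyPath.SEND_ONLY
    if prefer_write or not fits_inline:
        return ReplyPath.SEND_PLUS_WRITE
    return ReplyPath.SEND_ONLY
```

With prefer_write=False this roughly models the Solaris behavior, and with prefer_write=True the Linux behavior. Passing client_offered_buffer=False leaves SEND as the only option for a small reply.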

Originally, it was felt that doing the RDMA WRITE is better
for the client because the client doesn’t have to copy the
RPC header from the RDMA receive buffer back into rq_rcv_buf.
Note that the RPC header is generally just a few hundred
bytes.

Several people have claimed that RDMA WRITE for small I/O
is relatively expensive and should be avoided. It’s also
expensive for the client to register and deregister the
receive buffer for the RDMA WRITE if the server doesn’t
use it.

I’ve explored changing the client to offer no registered
buffer if it knows the RPC reply will be small, thus
forcing the server to use RDMA SEND where it’s safe.

Solaris server worked fine. Of course, it already works
this way.

Linux server showed some data and metadata corruption on
complex workloads like kernel builds. There’s a bug in
there somewhere that will need to be addressed before we
can change the client behavior.

The improvement was consistent, but under ten microseconds
per RPC with FRWR (more with FMR because deregistering the
buffer takes longer and is synchronous with RPC execution).

At this stage, there are bigger problems to be addressed,
so this is not a top priority.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





RE: NFSoRDMA developers bi-weekly meeting minutes (11/20)

2014-11-20 Thread Cheng, Wendy
 -----Original Message-----
 From: Shirley Ma [mailto:shirley...@oracle.com]
 Sent: Thursday, November 20, 2014 10:24 AM
 
 
 iSER 8K could reach 4.5GB/s at 56Gb/s link speed, 1.5 million IOPS. 32K could
 reach 1.8 million IOPS.
 

How was the iSER data measured? Was the measurement done at the iSER layer,
block layer, or filesystem layer?

-- Wendy 
 



Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)

2014-11-20 Thread Shirley Ma

On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
 -----Original Message-----
 From: Shirley Ma [mailto:shirley...@oracle.com]
 Sent: Thursday, November 20, 2014 10:24 AM

 
 iSER 8K could reach 4.5GB/s at 56Gb/s link speed, 1.5 million IOPS. 32K could
 reach 1.8 million IOPS.

 
 How was the iSER data measured? Was the measurement done at the iSER layer,
 block layer, or filesystem layer?

Here is the link on how to set up iSER and measure performance:
http://community.mellanox.com/docs/DOC-1483