RE: NFSoRDMA developers bi-weekly meeting minutes (11/20)
> -----Original Message-----
> From: Shirley Ma [mailto:shirley...@oracle.com]
> Sent: Friday, November 21, 2014 00:00
> To: Cheng, Wendy; Charles EDWARD Lever; anna.schuma...@netapp.com;
> devesh.sha...@emulex.com; dledf...@redhat.com; dominique.marti...@cea.fr;
> jeffrey.c.bec...@nasa.gov; rsdance@soft-forge.com; s...@lanl.gov;
> spra...@redhat.com; ste...@redhat.com; sw...@opengridcomputing.com;
> Yan Burman; linux-rdma; Linux NFS Mailing List
> Subject: Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)
>
> On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
>>> -----Original Message-----
>>> From: Shirley Ma [mailto:shirley...@oracle.com]
>>> Sent: Thursday, November 20, 2014 10:24 AM
>>>
>>> iser 8K could reach 4.5GB/s at 56Gb/s link speed, 1.5 million IOPS.
>>> 32K could reach 1.8 million IOPS.
>>
>> How did the iSER data get measured? Was the measurement done at the
>> iSER layer, the block layer, or the filesystem layer?
>
> Here is the link on how to set up iSER and measure performance:
> http://community.mellanox.com/docs/DOC-1483

The actual numbers are (there seems to be some misunderstanding in the
meeting minutes) as follows. For a single LUN/session in iSER on a
ConnectX-3 FDR link with an 8-core 2.6GHz Xeon:

- 8K block size reaches 2.5GB/s.
- Somewhere between 16K and 32K block size, iSER reaches 5.5GB/s, which is
  almost line rate.
- 256K block size gives 5.7GB/s.

With 16 sessions, it is possible to reach 1.7M IOPS with 1K block size and
about 600K IOPS with 8K block size. Note that these numbers are for the SCST
iSER implementation, and there are more tunings and enhancements that can be
applied to further improve performance.

Another issue that came up is the benefit of RDMA vs. send. To check that,
you can compare ib_send_lat vs. ib_write_lat and look at the latencies at
different block sizes. From past experience, RDMA starts to get more
efficient somewhere around 8K.
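As a quick consistency check on figures like these, throughput, IOPS, and block size are related by throughput = IOPS x block size. A minimal Python sketch (illustrative only, not part of any benchmark tool):

```python
def iops(throughput_bytes_per_sec, block_size_bytes):
    """IOPS implied by a given throughput at a given block size."""
    return throughput_bytes_per_sec / block_size_bytes

def throughput(iops_count, block_size_bytes):
    """Throughput (bytes/s) implied by an IOPS figure at a given block size."""
    return iops_count * block_size_bytes

# 1.7M IOPS at 1K block size is about 1.74GB/s on the wire:
print(throughput(1.7e6, 1024) / 1e9)   # ~1.74

# 2.5GB/s at 8K block size corresponds to roughly 305K IOPS:
print(iops(2.5e9, 8192))               # ~305,000
```

This also shows why small-block results are quoted in IOPS and large-block results in GB/s: at 1K the link is nowhere near line rate even at 1.7M IOPS, while at 256K the bandwidth saturates long before the IOPS limit.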
Yan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)
On 11/20/2014 12:15 PM, Cheng, Wendy wrote:
>> -----Original Message-----
>> From: Shirley Ma [mailto:shirley...@oracle.com]
>> Sent: Thursday, November 20, 2014 10:24 AM
>>
>> iser 8K could reach 4.5GB/s at 56Gb/s link speed, 1.5 million IOPS.
>> 32K could reach 1.8 million IOPS.
>
> How did the iSER data get measured? Was the measurement done at the iSER
> layer, the block layer, or the filesystem layer?

Here is the link on how to set up iSER and measure performance:
http://community.mellanox.com/docs/DOC-1483
RE: NFSoRDMA developers bi-weekly meeting minutes (11/20)
> -----Original Message-----
> From: Shirley Ma [mailto:shirley...@oracle.com]
> Sent: Thursday, November 20, 2014 10:24 AM
>
> iser 8K could reach 4.5GB/s at 56Gb/s link speed, 1.5 million IOPS.
> 32K could reach 1.8 million IOPS.

How did the iSER data get measured? Was the measurement done at the iSER
layer, the block layer, or the filesystem layer?

--
Wendy
Re: NFSoRDMA developers bi-weekly meeting minutes (11/20)
On Nov 20, 2014, at 1:24 PM, Shirley Ma wrote:

> Attendees:
>
> Jeff Becker (NASA)
> Yan Burman (Mellanox)
> Wendy Cheng (Intel)
> Rupert Dance (Soft Forge)
> Steve Dickson (Red Hat)
> Chuck Lever (Oracle)
> Doug Ledford (Red Hat)
> Shirley Ma (Oracle)
> Sachin Prabhu (Red Hat)
> Devesh Sharma (Emulex)
> Anna Schumaker (NetApp)
> Steve Wise (OpenGridComputing, Chelsio)
>
> Moderator:
> Shirley Ma (Oracle)
>
> The NFSoRDMA developers bi-weekly meeting helps organize NFSoRDMA
> development and test effort from different groups, to speed up NFSoRDMA
> upstream kernel work and the development of NFSoRDMA diagnostic/debugging
> tools. Hopefully the quality of NFSoRDMA upstream patches can be improved
> by testing with a quorum of HW vendors.
>
> Today's meeting notes:
>
> NFSoRDMA performance:
> ---------------------
> Even though NFSoRDMA performance seems better than IPoIB-CM, the gap
> between what the IB protocol can provide and what NFS (RDMA, IPoIB-CM) can
> achieve is still big at small I/O block sizes (focused on 8K I/O size for
> database workloads). Even at large I/O block sizes (128K and above), NFS
> performance is not comparable to RDMA microbenchmarks. We are focusing the
> effort on figuring out the root cause. Several experimental methods have
> been tried to improve NFSoRDMA performance.
>
> Yan saw the NFS server do an RDMA WRITE for small packet sizes, less than
> 100 bytes, where it should have used RDMA SEND (post_send) instead.

This is an artifact of how NFS/RDMA works. The client provides a registered
area for the server to write into if an RPC reply is larger than the small
pre-posted buffers that are normally used.

Most of the time, each RPC reply is small enough to use RDMA SEND, and the
server can convey the RPC/RDMA header and the RPC reply in a single SEND
operation. If the reply is large, the server conveys the RPC/RDMA header via
RDMA SEND, and the RPC reply via an RDMA WRITE into the client's registered
buffer.

The Solaris server chooses RDMA SEND in nearly every case.
The Linux server chooses RDMA SEND then RDMA WRITE whenever the client
offers that choice. Originally, it was felt that doing the RDMA WRITE is
better for the client, because the client doesn't have to copy the RPC
header from the RDMA receive buffer back into rq_rcv_buf. Note that the RPC
header is generally just a few hundred bytes.

Several people have claimed that RDMA WRITE for small I/O is relatively
expensive and should be avoided. It's also expensive for the client to
register and deregister the receive buffer for the RDMA WRITE if the server
doesn't use it.

I've explored changing the client to offer no registered buffer when it
knows the RPC reply will be small, thus forcing the server to use RDMA SEND
where it's safe. The Solaris server worked fine; of course, it already works
this way. The Linux server showed some data and metadata corruption on
complex workloads like kernel builds. There's a bug in there somewhere that
will need to be addressed before we can change the client behavior.

The improvement was consistent, but under ten microseconds per RPC with FRWR
(more with FMR, because deregistering the buffer takes longer and is
synchronous with RPC execution). At this stage, there are bigger problems to
be addressed, so this is not a top priority.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
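The server-side choice Chuck describes can be summarized in a small sketch. This is illustrative Python, not kernel code; the names and the 1K inline threshold are assumptions for the example, not actual identifiers or values from either implementation:

```python
# Hypothetical sketch of the NFS/RDMA reply-path choice described above.
# INLINE_THRESHOLD stands in for the size of the client's small pre-posted
# receive buffers; the real value is negotiated per connection.
INLINE_THRESHOLD = 1024

def choose_reply_path(reply_len, client_offers_write_buf):
    """Return how the server conveys an RPC reply.

    Small replies fit the pre-posted receive buffer: the RPC/RDMA header
    and the RPC reply go in a single RDMA SEND. Large replies require an
    RDMA WRITE into the client's registered reply buffer, followed by a
    SEND carrying the RPC/RDMA header.
    """
    if reply_len <= INLINE_THRESHOLD:
        return "SEND"          # header + reply in one SEND
    if client_offers_write_buf:
        return "WRITE+SEND"    # reply via WRITE, header via SEND
    raise ValueError("reply too large and no reply buffer offered")

# The client-side experiment amounts to withholding the registered buffer
# when the reply is known to be small, forcing the SEND path:
assert choose_reply_path(200, client_offers_write_buf=False) == "SEND"
assert choose_reply_path(32 * 1024, client_offers_write_buf=True) == "WRITE+SEND"
```

The corruption Chuck saw with the Linux server corresponds to the last branch: when the client stops offering the write buffer, any server code path that still assumes one is available must fail cleanly rather than proceed.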