Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On 4/29/2013 11:36 PM, Jason Gunthorpe wrote: On Mon, Apr 29, 2013 at 10:52:21PM +0300, Or Gerlitz wrote: On Fri, Apr 26, 2013 at 12:40 AM, Jason Gunthorpe wrote: But I don't follow why the send QPNs have to be sequential for IPoIB. It looks like this is being motivated by RSS, and RSS QPNs are just being reused for TSS?

Go read "It turns out that there are IPoIB drivers used by some operating systems and/or hypervisors in a para-virtualization (PV) scheme which extract the source QPN from the CQ WC associated with an incoming packet in order to.." and what follows in the change-log of patch 4/5 http://marc.info/?l=linux-rdma&m=136412901621797&w=2

This is what I said in the first place: the RFC is premised on the src QPN being set properly, you can't just mess with it, because stuff needs it. I think you should have split this patch up, there is lots going on here.
- Add proper TSS that doesn't change the wire protocol
- Add fake TSS that does change the wire protocol, and properly document those changes so other people can follow/implement them
- Add RSS

And.. 'tss_qpn_mask_sz' seems unnecessarily limiting. Using WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN (ie use a signed offset, not a mask) seems much better than WC.srcQPN & ~((1 << (ipoib_header.tss_qpn_mask_sz >> 12)) - 1) == real QPN (Did I even get that right?) Specifically it means the requirements for alignment and contiguousness are gone. This means you can implement it without using the QP groups API and it will work immediately with every HCA out there. I think if we are going to actually mess with the wire protocol this sort of broad applicability is important.

As for the other two questions: seems reasonable to me. Without a consensus among HW vendors on how to do this it makes sense to move ahead *in the kernel* with a minimal API. Userspace is a different question of course.. Jason

Hi Jason, Your suggestion could have been valid if the IPoIB header were larger.
Please note that a QPN occupies 3 octets and thus its value lies in the range [0..0xFFFFFF]. On the other hand, the reserved field in the IPoIB header occupies only 2 octets, so given an arbitrary group of source QPNs it may not be possible to recover the real QPN. This is why the real QPN should be aligned to a power of two and the rest should have consecutive numbers. And since the number of TSS QPs is relatively small, that is, on the order of the number of cores, masking the lower bits of the WC.srcQPN will recover the real QPN. Also, by sending only the mask length we don't use the entire reserved field but only 4 bits, leaving 12 bits for future use. Best regards, S.P. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On 4/30/2013 1:09 AM, Yan Burman wrote: -----Original Message----- From: J. Bruce Fields [mailto:bfie...@fieldses.org] Sent: Sunday, April 28, 2013 17:43 To: Yan Burman Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote: On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32GB of memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). Running over IPoIB-CM, I get 200-980MB/sec. ... I am trying to get maximum performance from a single server - I used 2 processes in the fio test - more than 2 did not show any performance boost. I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from a single client PC. I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had Intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps. What is the client CPU percentage you see under this workload, and how different are the NFS/RDMA and NFS/IPoIB overheads? Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue). 
This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code? You'll need to do more profiling to track that down. I would suspect that ib_send_bw is using some sort of direct hardware access, bypassing the IOMMU management and possibly performing no dynamic memory registration. The NFS/RDMA code goes via the standard kernel DMA API, and correctly registers/deregisters memory on a per-I/O basis in order to provide storage data integrity. Perhaps there are overheads in the IOMMU management which can be addressed.
Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 10:42:48AM -0400, J. Bruce Fields wrote: On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote: On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32GB of memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). Running over IPoIB-CM, I get 200-980MB/sec. ... I am trying to get maximum performance from a single server - I used 2 processes in the fio test - more than 2 did not show any performance boost. I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from a single client PC. What I did see is that the server is sweating a lot more than the clients, and more than that, it has 1 core (CPU5) at 100% in a softirq tasklet: cat /proc/softirqs ... Perf top for the CPU with the high tasklet count gives:

    samples  pcnt   RIP       function             DSO
    ...
    2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux
    ...

Googling around I think we want: perf record -a --call-graph (give it a chance to collect some samples, then ^C) perf report --call-graph --stdio Sorry it took me a while to get perf to show the call trace (did not enable frame pointers in the kernel and struggled with perf options...), but what I get is:

    36.18%  nfsd  [kernel.kallsyms]  [k] mutex_spin_on_owner
            |
            --- mutex_spin_on_owner
                |
                |--99.99%-- __mutex_lock_slowpath
                |           mutex_lock
                |           |
                |           |--85.30%-- generic_file_aio_write

That's the inode i_mutex. 
Looking at the code: with CONFIG_MUTEX_SPIN_ON_OWNER it spins (instead of sleeping) as long as the lock owner is still running. So this is just a lot of contention on the i_mutex, I guess. Not sure what to do about that. --b.
Re: NFS over RDMA benchmark
On 4/30/2013 1:09 AM, Yan Burman wrote: I now get up to ~95K IOPS and 4.1GB/sec bandwidth. ... ib_send_bw with intel iommu enabled did get up to 4.5GB/sec BTW, you may want to verify that these are the same GB. Many benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB. At GB/GiB, the difference is about 7.5%, very close to the difference between 4.1 and 4.5. Just a thought.
RE: NFS over RDMA benchmark
-----Original Message----- From: Tom Talpey [mailto:t...@talpey.com] Sent: Tuesday, April 30, 2013 16:05 To: Yan Burman Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux- r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark On 4/30/2013 1:09 AM, Yan Burman wrote: -----Original Message----- From: J. Bruce Fields [mailto:bfie...@fieldses.org] Sent: Sunday, April 28, 2013 17:43 To: Yan Burman Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote: On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32GB of memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). Running over IPoIB-CM, I get 200-980MB/sec. ... I am trying to get maximum performance from a single server - I used 2 processes in the fio test - more than 2 did not show any performance boost. I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from a single client PC. I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had Intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps. That is not a typo. I get 95K IOPS with randrw test with block size 4K. 
I get 4.1GBps with block size 256K randread test. What is the client CPU percentage you see under this workload, and how different are the NFS/RDMA and NFS/IPoIB overheads? NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost twice the bandwidth of IPoIB. Overall, CPU usage gets up to about 20% for randread and 50% for randwrite. Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue). This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code? You'll need to do more profiling to track that down. I would suspect that ib_send_bw is using some sort of direct hardware access, bypassing the IOMMU management and possibly performing no dynamic memory registration. The NFS/RDMA code goes via the standard kernel DMA API, and correctly registers/deregisters memory on a per-I/O basis in order to provide storage data integrity. Perhaps there are overheads in the IOMMU management which can be addressed.
Re: [PATCH 1/2] libibverbs: Use autoreconf in autogen.sh
Bump bump. :-) On Apr 25, 2013, at 11:38 AM, Jeff Squyres (jsquyres) jsquy...@cisco.com wrote: Bump. On Apr 22, 2013, at 1:41 PM, Jeff Squyres jsquy...@cisco.com wrote: The old sequence of Autotools commands listed in autogen.sh is no longer correct. Instead, just use the single autoreconf command, which will invoke all the Right Autotools commands in the correct order. Signed-off-by: Jeff Squyres jsquy...@cisco.com
---
 autogen.sh | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/autogen.sh b/autogen.sh
index fd47839..6c9233e 100755
--- a/autogen.sh
+++ b/autogen.sh
@@ -1,8 +1,4 @@
 #! /bin/sh
 set -x
-aclocal -I config
-libtoolize --force --copy
-autoheader
-automake --foreign --add-missing --copy
-autoconf
+autoreconf -ifv -I config
-- 
1.8.1.1

-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
RE: NFS over RDMA benchmark
-----Original Message----- From: Tom Talpey [mailto:t...@talpey.com] Sent: Tuesday, April 30, 2013 17:20 To: Yan Burman Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux- r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark On 4/30/2013 1:09 AM, Yan Burman wrote: I now get up to ~95K IOPS and 4.1GB/sec bandwidth. ... ib_send_bw with intel iommu enabled did get up to 4.5GB/sec BTW, you may want to verify that these are the same GB. Many benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB. At GB/GiB, the difference is about 7.5%, very close to the difference between 4.1 and 4.5. Just a thought. The question is not why there is 400MBps difference between ib_send_bw and NFSoRDMA. The question is why with IOMMU ib_send_bw got to the same bandwidth as without it while NFSoRDMA got half. From some googling, it seems that when IOMMU is enabled, dma mapping functions get a lot more expensive. Perhaps that is the reason for the performance drop. Yan
Re: NFS over RDMA benchmark
On 4/30/2013 10:23 AM, Yan Burman wrote: -----Original Message----- From: Tom Talpey [mailto:t...@talpey.com] On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote: I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had Intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps. That is not a typo. I get 95K IOPS with randrw test with block size 4K. I get 4.1GBps with block size 256K randread test. Well, then I suggest you focus on whether you are satisfied with a high bandwidth goal or a high IOPS goal. They are two very different things, and clearly there are still significant issues to track down in the server. What is the client CPU percentage you see under this workload, and how different are the NFS/RDMA and NFS/IPoIB overheads? NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost twice the bandwidth of IPoIB. So, for 125% of the CPU, RDMA is delivering 200% of the bandwidth. A common reporting approach is to calculate cycles per Byte (roughly, CPU/MB/sec), and you'll find this can be a great tool for comparison when overhead is a consideration. Overall, CPU usage gets up to about 20% for randread and 50% for randwrite. This is *client* CPU? Writes require the server to take additional overhead to make RDMA Read requests, but the client side is doing practically the same thing for the read vs write path. Again, you may want to profile more deeply to track that difference down.
Re: NFS over RDMA benchmark
On Mon, Apr 29, 2013 at 10:09 PM, Yan Burman y...@mellanox.com wrote: I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue). This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code? That's very exciting ! The sad part is that IOMMU has to be turned off. I think ib_send_bw uses a single buffer so the DMA mapping search overhead is not an issue. -- Wendy
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Tue, Apr 30, 2013 at 12:04:25PM +0300, Shlomo Pongratz wrote: And.. 'tss_qpn_mask_sz' seems unnecessarily limiting, using WC.srcQPN + ipoib_header.tss_qpn_offset == real QPN (ie use a signed offset, not a mask) Seems much better than WC.srcQPN & ~((1 << (ipoib_header.tss_qpn_mask_sz >> 12)) - 1) == real QPN (Did I even get that right?) Your suggestion could have been valid if the IPoIB header were larger. Please note that a QPN occupies 3 octets and thus its value lies in the range [0..0xFFFFFF]. I am aware of this, and it isn't really a problem; adaptors that allocate randomly across the entire QPN space would not be compatible with this approach, but most adaptors allocate QPNs quasi-contiguously. Basically, at startup, IPoIB would allocate a TX QP, then allocate TSS QPs, and throw away any that can't fit in the encoding, until it reaches the target number or tries too long. No need for a special API to the driver. Jason
Re: NFS over RDMA benchmark
On 4/30/13 9:38 AM, Yan Burman wrote: -----Original Message----- From: Tom Talpey [mailto:t...@talpey.com] Sent: Tuesday, April 30, 2013 17:20 To: Yan Burman Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux- r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark On 4/30/2013 1:09 AM, Yan Burman wrote: I now get up to ~95K IOPS and 4.1GB/sec bandwidth. ... ib_send_bw with intel iommu enabled did get up to 4.5GB/sec BTW, you may want to verify that these are the same GB. Many benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB. At GB/GiB, the difference is about 7.5%, very close to the difference between 4.1 and 4.5. Just a thought. The question is not why there is 400MBps difference between ib_send_bw and NFSoRDMA. The question is why with IOMMU ib_send_bw got to the same bandwidth as without it while NFSoRDMA got half. NFSRDMA is constantly registering and unregistering memory when you use FRMR mode. By contrast IPoIB has a descriptor ring that is set up once and re-used. I suspect this is the difference maker. Have you tried running the server in ALL_PHYSICAL mode, i.e. where it uses a DMA_MR for all of memory? Tom From some googling, it seems that when IOMMU is enabled, dma mapping functions get a lot more expensive. Perhaps that is the reason for the performance drop. Yan
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
Jason Gunthorpe jguntho...@obsidianresearch.com wrote: For the TSS case, I'd say just allocate normal QPs and provide something like ibv_override_ud_src_qpn(). This is very general and broadly useful for any application using UD QPs. I've lost you. How do you suggest implementing ibv_override_ud_src_qpn()? Is that for future HW, or do you have a method to get it working today?
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Tue, Apr 30, 2013 at 11:08:19PM +0300, Or Gerlitz wrote: Jason Gunthorpe jguntho...@obsidianresearch.com wrote: For the TSS case, I'd say just allocate normal QPs and provide something like ibv_override_ud_src_qpn(). This is very general and broadly useful for any application using UD QPs. I've lost you, how you suggest to implement ibv_override_ud_src_qpn(), is that for future HW or you have a method to get work today. I meant as a user space API alternative to the parent/child group API for transmit. It would require some level of driver/FW/HW support of course. Jason
Re: [PATCH for-next 6/9] IB/core: Add receive Flow Steering support
On Mon, Apr 29, 2013 at 10:40 PM, Christoph Lameter c...@linux.com wrote: On Mon, 29 Apr 2013, Steve Wise wrote: Hey Or, This looks good at first glance. I must confess I cannot tell yet if this will provide everything we need for Chelsio's RAW packet requirements. But I think we should move forward on this, and enhance as needed. Well, we are using the raw QPs here too and would like to use receive flow steering. Could we please get this merged? Steve, Christoph -- thanks for the positive feedback. So Roland, not that I expect this double ack to behave as our gerrit system where a +2 feedback triggers acceptance... but still, there's real-world need here and real patches that address that need - any questions or comments on them? If not, are they going to get into 3.10? Or.