Re: tune ib stack
2013/4/9 Sebastian Riemer sebastian.rie...@profitbricks.com:
> Because 2048 is the default and 4096 is the max. supported MTU by the hardware.
>
>> How can I set the active MTU?
>
> Something like this:
> echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

After doing this, all SRP connections go down and the port goes down. I need to restart openibd.

06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Subsystem: Mellanox Technologies Device 0017
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 42
    Region 0: Memory at df90 (64-bit, non-prefetchable) [size=1M]
    Region 2: Memory at de00 (64-bit, prefetchable) [size=8M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] Vital Product Data
        Not readable
    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
        Vector table: BAR=0 offset=0007c000
        PBA: BAR=0 offset=0007d000
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s 64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [148 v1] Device Serial Number 00-25-90-ff-ff-17-9b-24
    Capabilities: [18c v1] #19
    Kernel driver in use: mlx4_core
    Kernel modules: mlx4_core

> Could be a bug. Which OFED/Kernel (if using in-tree IB modules) do you use?
> Mine says with ConnectX2 QDR: 40 Gb/sec (4X QDR)

I'm using a stock 3.8.6 kernel with Xen patches on top, and I use the modules provided with the kernel (only ib_srp comes from Bart's GitHub repo).

> You should see 40 Gb/sec (4X QDR) here. Perhaps the OFED is too old so that FDR and ConnectX 3 aren't supported, yet. 10 Gb/sec (4X) seems to be the default case if a rate isn't supported.

Yes, with an older ConnectX card I see this, but with ConnectX-3 I see only 10 Gb.

--
Vasiliy Tolstov,
e-mail: v.tols...@selfip.ru
jabber: v...@selfip.ru
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tune ib stack
On 09.04.2013 13:51, Vasiliy Tolstov wrote:
>> Something like this:
>> echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
>
> After doing this, all SRP connections go down and the port goes down. I need to restart openibd.

Sorry for that! It's much easier to set the IP MTU. Managed switches support setting the RDMA MTU. So it could be possible that it is a setting in the SM config. But I'm not sure.

$ man opensm

says that it can be set in the partitions.conf

>> You should see 40 Gb/sec (4X QDR) here. Perhaps the OFED is too old so that FDR and ConnectX 3 aren't supported, yet. 10 Gb/sec (4X) seems to be the default case if a rate isn't supported.
>
> Yes, with an older ConnectX card I see this, but with ConnectX-3 I see only 10 Gb.

The kernel version is okay. It depends on the user space. There is a support note in OFED 3.5:

- ConnectX-3 (fw-ConnectX3 Rev 2.11.0500) (FDR and FDR10 Modes are Supported)

Before OFED 3.5 these HCAs aren't supported. A look at the related source code could be worth a try.

Cheers,
Sebastian
Re: tune ib stack
On 4/9/2013 8:15 AM, Sebastian Riemer wrote:
> On 09.04.2013 13:51, Vasiliy Tolstov wrote:
>>> Something like this:
>>> echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
>>
>> After doing this, all SRP connections go down and the port goes down. I need to restart openibd.
>
> Sorry for that! It's much easier to set the IP MTU. Managed switches support setting the RDMA MTU. So it could be possible that it is a setting in the SM config. But I'm not sure.

IP MTU is different than link MTU. For UD mode, it's link MTU - 4. For RC (connected) mode, this can be a much larger number than the link MTU as the HCA does the segmentation/reassembly down to the path MTU.

> $ man opensm
> says that it can be set in the partitions.conf

Yes, MTU for the IPoIB interface is set in the partition file. This would need configuring for the larger (4K) MTU, assuming all ports support the 4K MTU. If not, some ports won't be able to join the IPoIB broadcast (or other) IB multicast groups and IPoIB won't work.

-- Hal
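[Editor's sketch] Hal's "link MTU - 4" rule for UD mode can be written out as a tiny helper. This is only an illustration: the function name ipoib_ud_ip_mtu is made up here, and the 4 bytes are assumed to be the IPoIB encapsulation header (the constant the kernel calls IPOIB_ENCAP_LEN).

```c
#include <stdint.h>

/* Assumed: the 4-byte IPoIB encapsulation header (IPOIB_ENCAP_LEN in the kernel). */
#define IPOIB_ENCAP_LEN 4u

/*
 * Hypothetical helper, not a kernel function.
 * In datagram (UD) mode each IP packet has to fit into a single IB packet,
 * so the usable IP MTU is the link MTU minus the encapsulation header.
 */
static uint32_t ipoib_ud_ip_mtu(uint32_t link_mtu_bytes)
{
    return link_mtu_bytes - IPOIB_ENCAP_LEN;
}
```

With a 2048-byte link MTU this gives the 2044-byte IPoIB MTU that shows up later in the thread; a 4096-byte link MTU would give 4092.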
RE: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers
From: Roland Dreier rol...@purestorage.com
> +    if (wc->byte_len > IPOIB_UD_HEAD_SIZE) {
> +        page = priv->rx_ring[wr_id].page;
> +        priv->rx_ring[wr_id].page = NULL;
> +    } else {
> +        page = NULL;
> +    }
> +
>      /*
>       * If we can't allocate a new RX buffer, dump
>       * this packet and reuse the old buffer.
>       */
>      if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
>          ++dev->stats.rx_dropped;
> +        priv->rx_ring[wr_id].page = page;
>          goto repost;
>      }

Can you go through the else of the first if (page is NULL), then enter the second if? If so, isn't the page lost?

Dean
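[Editor's sketch] The path Dean describes can be modelled outside the kernel. The struct and both helpers below are hypothetical stand-ins, not the IPoIB code, and the guard in restore_page() is just one possible way to avoid the loss, not necessarily the fix that was applied.

```c
#include <stddef.h>

/* Hypothetical stand-in for one RX ring entry; only the page pointer matters here. */
struct rx_slot {
    void *page;
};

/*
 * Mirrors the first "if" of the quoted hunk: the page is detached from the
 * slot only when the packet is larger than the linear head. Otherwise the
 * slot keeps its page and NULL is returned.
 */
static void *detach_page(struct rx_slot *slot, unsigned int byte_len,
                         unsigned int head_size)
{
    void *page = NULL;

    if (byte_len > head_size) {
        page = slot->page;
        slot->page = NULL;
    }
    return page;
}

/*
 * Error path after a failed buffer allocation. Writing "slot->page = page"
 * unconditionally would overwrite a still-attached page with NULL when
 * nothing was detached; the guard keeps it.
 */
static void restore_page(struct rx_slot *slot, void *page)
{
    if (page != NULL)
        slot->page = page;
}
```

For a small packet, detach_page() returns NULL and the slot still owns its page; restore_page() then leaves the slot untouched, whereas an unconditional assignment would drop the only reference to the page.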
Re: tune ib stack
On 09.04.2013 14:49, Hal Rosenstock wrote:
> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>> Hello. I have some servers with Mellanox ConnectX-3 and have some questions:
>> Why does max_mtu differ from active_mtu?
>
> What does peer port say for max MTU ?
>
>> How can I set the active MTU?
>
> SM sets active MTU to min of peer ports max MTUs.

So with peer port max MTU do you mean this file?:
/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I've seen that it can be set as well. I've got two ConnectX-2 machines connected back2back. In general these have 4K max and active. So let's try something:

Host1:
$ echo 2048 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
# Port is not active, let's reactivate it.
$ echo 1 > /sys/class/infiniband/mlx4_0/device/enable

ibv_devinfo
Host1:
    max_mtu:    2048 (4)
    active_mtu: 2048 (4)
Host2:
    max_mtu:    4096 (5)
    active_mtu: 2048 (4)

Both had 4096 (5) before everywhere.

So that's the recommended way to reduce the MTU? I've heard that reducing the MTU in a fabric can help fighting congestion issues. As congestion control doesn't work yet, could this help against congestion?

Cheers,
Sebastian
Re: tune ib stack
On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
> On 09.04.2013 14:49, Hal Rosenstock wrote:
>> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>>> Hello. I have some servers with Mellanox ConnectX-3 and have some questions:
>>> Why does max_mtu differ from active_mtu?
>>
>> What does peer port say for max MTU ?
>>
>>> How can I set the active MTU?
>>
>> SM sets active MTU to min of peer ports max MTUs.
>
> So with peer port max MTU do you mean this file?:
> /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I meant NeighborMTU from PortInfo as active MTU and MTUCap there is supported MTU.

-- Hal
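[Editor's sketch] The rule "SM sets active MTU to min of peer ports max MTUs" is easy to check against the ibv_devinfo numbers quoted in this thread. The names below are invented for illustration; the encodings 1..5 are the values ibv_devinfo prints in parentheses, e.g. "2048 (4)".

```c
#include <stdint.h>

/* IB MTU encodings as printed in parentheses by ibv_devinfo. */
enum mtu_code {
    MTU_CODE_256  = 1,
    MTU_CODE_512  = 2,
    MTU_CODE_1024 = 3,
    MTU_CODE_2048 = 4,
    MTU_CODE_4096 = 5
};

/* Encoded value -> bytes (1 -> 256 ... 5 -> 4096); 0 for anything else. */
static uint32_t mtu_code_to_bytes(int code)
{
    if (code < MTU_CODE_256 || code > MTU_CODE_4096)
        return 0;
    return 128u << code;
}

/*
 * Hypothetical helper: the active MTU of a link is the smaller of the
 * two peers' MTU capabilities (MTUCap in PortInfo).
 */
static int active_mtu_code(int mtu_cap_a, int mtu_cap_b)
{
    return mtu_cap_a < mtu_cap_b ? mtu_cap_a : mtu_cap_b;
}
```

In the back-to-back experiment above, Host1 was lowered to 2048 (4) while Host2 stayed at 4096 (5); the minimum is 4, which matches the active_mtu of 2048 reported on both hosts.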
Re: tune ib stack
On 09.04.2013 15:34, Hal Rosenstock wrote:
> On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
>> On 09.04.2013 14:49, Hal Rosenstock wrote:
>>> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>>>> Hello. I have some servers with Mellanox ConnectX-3 and have some questions:
>>>> Why does max_mtu differ from active_mtu?
>>>
>>> What does peer port say for max MTU ?
>>>
>>>> How can I set the active MTU?
>>>
>>> SM sets active MTU to min of peer ports max MTUs.
>>
>> So with peer port max MTU do you mean this file?:
>> /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
>
> I meant NeighborMTU from PortInfo as active MTU and MTUCap there is supported MTU.

So these values are exactly the same as in ibv_devinfo and can be set in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.

I've found the PortInfo with the command

smpquery portinfo -C mlx4_0 3 1

where I'm using the first HCA to contact the SM. I tell the SM the destination LID ('3' here in my case) and the destination port ('1').

Is there another method to set the max MTU?

I know that switches can also set the max MTU for their switch ports where most of them use 2048 as default.

How to change these switch port MTUs for unmanaged switches? On managed switches this can be done over the web front-end.

Cheers,
Sebastian
Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
On 03/04/2013 23:12, Hefty, Sean wrote:
>> Hi Sean, Ping. You had concerns on the suggested concept, we want to know if we addressed them, can you comment?
>
> I'm in meetings this week until tomorrow. I'll try to take a look at the updated patches then or Friday.

any feedback?
Re: tune ib stack
On 4/9/2013 9:56 AM, Sebastian Riemer wrote:
> On 09.04.2013 15:34, Hal Rosenstock wrote:
>> On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
>>> On 09.04.2013 14:49, Hal Rosenstock wrote:
>>>> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>>>>> Hello. I have some servers with Mellanox ConnectX-3 and have some questions:
>>>>> Why does max_mtu differ from active_mtu?
>>>>
>>>> What does peer port say for max MTU ?
>>>>
>>>>> How can I set the active MTU?
>>>>
>>>> SM sets active MTU to min of peer ports max MTUs.
>>>
>>> So with peer port max MTU do you mean this file?:
>>> /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
>>
>> I meant NeighborMTU from PortInfo as active MTU and MTUCap there is supported MTU.
>
> So these values are exactly the same as in ibv_devinfo and can be set in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.
>
> I've found the PortInfo with the command
>
> smpquery portinfo -C mlx4_0 3 1
>
> where I'm using the first HCA to contact the SM. I tell the SM the destination LID ('3' here in my case) and the destination port ('1').
>
> Is there another method to set the max MTU?

That doesn't set max MTU (MTUCap) but merely reads it (for that port).

> I know that switches can also set the max MTU for their switch ports where most of them use 2048 as default.

You would need to contact your CA and/or switch vendor(s) (see below).

> How to change these switch port MTUs for unmanaged switches? On managed switches this can be done over the web front-end.

Yes. MTUCap is RO in terms of the SM, so there are only out-of-band mechanisms to change this, which are vendor specific, like a web front end.

-- Hal

> Cheers,
> Sebastian
Re: tune ib stack
On 09.04.2013 16:23, Hal Rosenstock wrote:
>> So these values are exactly the same as in ibv_devinfo and can be set in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.
>>
>> I've found the PortInfo with the command
>>
>> smpquery portinfo -C mlx4_0 3 1
>>
>> where I'm using the first HCA to contact the SM. I tell the SM the destination LID ('3' here in my case) and the destination port ('1').
>>
>> Is there another method to set the max MTU?
>
> That doesn't set max MTU (MTUCap) but merely reads it (for that port).

Sorry, copy and paste error. I meant the mlx4 file:
/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

But you've answered that: it is vendor specific. Thanks for the valuable information!

For us, the most interesting question would be whether the MTU can be changed live without any service disruption. Looks like the mlx4 driver can't provide that. Perhaps switches can do that.

Cheers,
Sebastian
Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers
On Tue, Apr 9, 2013 at 6:13 AM, Luick, Dean dean.lu...@intel.com wrote:
> Can you go through the else of the first if (page is NULL), then enter the second if? If so, isn't the page lost?

Thanks, good catch. I'll fix that up.
RE: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
> any feedback?

I have no issue with RSS/TSS. But the 'qp group' interface to using this seems kludgy.

On a node, this is multiple send/receive queues grouped together to form a larger construct. On the wire, this is a single QP - maybe? I'm still not clear on that. From what's written, all the send queues appear as a single QPN. The receive queues appear as different QPNs.
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
> To be more specific, here's what I did:
>
> 1. apply kernel module patch - re-insert module
> 1. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
> 2. Start the RDMA migration
> 3. Migration completes without any errors
>
> This test does *not* work with a cgroup swap limit, however. The process gets killed. (Both with and without GIFT)
>
> - Michael

Try to attach a debugger and see where it is when it gets killed?

> On 04/05/2013 04:43 PM, Roland Dreier wrote:
>> On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote:
>>> I also removed the IBV_*_WRITE flags on the sender-side and activated cgroups with the memory.memsw.limit_in_bytes activated and the migration with RDMA also succeeded without any problems (both with *and* without GIFT also worked).
>>
>> Not sure I'm interpreting this correctly. Are you saying that things worked without actually setting the GIFT flag? In which case why are we adding this flag?
>>
>> - R.
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On 04/09/2013 12:39 PM, Michael S. Tsirkin wrote:
> On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
>> To be more specific, here's what I did:
>>
>> 1. apply kernel module patch - re-insert module
>> 1. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
>> 2. Start the RDMA migration
>> 3. Migration completes without any errors
>>
>> This test does *not* work with a cgroup swap limit, however. The process gets killed. (Both with and without GIFT)
>>
>> - Michael
>
> Try to attach a debugger and see where it is when it gets killed?

It's killed by cgroups - not a CPU exception.

The same test works fine using TCP migration with cgroups - everything is fine there.

The memory that RDMA attempted to register hits some kind of cgroups policy which results in a kernel message saying that the cgroup swap limit was hit and then it goes ahead and kills the process altogether.

It's not a QEMU problem - it seems to be a kernel bug.
Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers
> -IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
> +/* add 128 bytes of tailroom for IP/TCP headers */
> +IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,

Hello,

version 3 of the patch finally works. I can see the performance gains but I cannot feel them (in real life). Here are the results of my testbed:

Test 1: netperf/netserver, message size 16K

kernel 3.5 default      :  5.1 GBit/s
kernel 3.5 + patch v3   :  7.7 GBit/s
kernel 3.5 + max MTU 3K : 10.8 GBit/s

Test 2: Disk write performance, VM with disk mounted on IB async NFS server

block size | default  | patch v3 | max MTU 3K
-----------+----------+----------+-----------
      1 KB |  10 MB/s |  10 MB/s |  10 MB/s
      2 KB |  20 MB/s |  21 MB/s |  20 MB/s
      4 KB |  40 MB/s |  40 MB/s |  43 MB/s
      8 KB |  68 MB/s |  70 MB/s |  78 MB/s
     16 KB | 105 MB/s | 105 MB/s | 120 MB/s
     32 KB | 150 MB/s | 150 MB/s | 170 MB/s
     64 KB | 200 MB/s | 210 MB/s | 260 MB/s
    128 KB | 270 MB/s | 290 MB/s | 400 MB/s
    256 KB | 300 MB/s | 310 MB/s | 430 MB/s
    512 KB | 305 MB/s | 320 MB/s | 470 MB/s
   1024 KB | 310 MB/s | 325 MB/s | 500 MB/s
   2048 KB | 310 MB/s | 325 MB/s | 510 MB/s
   4096 KB | 370 MB/s | 325 MB/s | 510 MB/s
   8192 KB | 400 MB/s | 325 MB/s | 520 MB/s

As you can see, netperf throughput increases while NFS does not even care about the optimizations. Maybe it does not work well with fragmented SKBs. The max MTU 3K values once again are forced through a hack inside ipoib_main.c.

Out of curiosity I changed the block splitting in your v3 patch from small head with large fragment to large head with small fragment in this line:

IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 3072

In my 2044 MTU case this brings the netperf and NFS throughput to the same levels as the dirty hack. Of course this no longer reflects a head but is more or less equal to something like a new constant IPOIB_UD_FIXED_SKB_SIZE. I guess 4K MTU will not see any further gains, but avoiding the skb_pull calls should improve speed as well.

Maybe a final adaptation could put the cherry on the cake.

Markus
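[Editor's sketch] Markus's "small head, large fragment" versus "large head, small fragment" comparison comes down to simple arithmetic on IPOIB_UD_HEAD_SIZE. The model below is a simplification, not the driver code: split_rx() and struct rx_split are invented names, and IB_GRH_BYTES = 40 and IPOIB_ENCAP_LEN = 4 are assumed to match the kernel constants.

```c
#include <stdint.h>

/* Assumed to match the kernel constants of the same names. */
#define IB_GRH_BYTES     40u
#define IPOIB_ENCAP_LEN   4u

/* How many received bytes land in the linear head and how many in the page fragment. */
struct rx_split {
    uint32_t head;
    uint32_t frag;
};

/*
 * Simplified model: the first head_size bytes of a received UD packet go
 * into the linear skb buffer, whatever is left goes into the page fragment.
 */
static struct rx_split split_rx(uint32_t byte_len, uint32_t head_size)
{
    struct rx_split s;

    s.head = byte_len < head_size ? byte_len : head_size;
    s.frag = byte_len - s.head;
    return s;
}
```

A full-size packet at the 2044-byte IPoIB MTU is 40 + 4 + 2044 = 2088 bytes on the receive side. With the v3 head of 40 + 4 + 128 = 172 bytes, 1916 bytes end up in the fragment; with the 3072-byte variant (head of 3116 bytes) the whole packet fits in the head and the fragment stays empty, which would be consistent with the "fixed SKB size" behaviour Markus describes.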
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
presumably is_dup_page reads the page, so should not break COW ...

I'm not sure about the cgroups swap limit - you might have too many non-COW pages, so attempting to fault them all in makes you exceed the limit. You really should look at what is going on in the pagemap, to see if there's measurable gain from the patch.

On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
> Well, I have the is_dup_page() commented out...when RDMA is activated.
>
> Is there something else in QEMU that could be touching the page that I don't know about?
>
> - Michael
>
> On 04/05/2013 05:03 PM, Roland Dreier wrote:
>> On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote:
>>> Sorry, I was wrong. Ignore the comments about cgroups. That's still broken (i.e. trying to register RDMA memory while using a cgroup swap limit causes the process to get killed).
>>>
>>> But the GIFT flag patch works (my understanding is that GIFT flag allows the adapter to transmit stale memory information, it does not have anything to do with cgroups specifically).
>>
>> The point of the GIFT patch is to avoid triggering copy-on-write so that memory doesn't blow up during migration. If that doesn't work then there's no point to the patch.
>>
>> - R.
RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
>
> On Apr 4, 2013, at 1:57 PM, Weiny, Ira ira.we...@intel.com wrote:
>
>>>> In hindsight, the user space API never should have exposed the mtu as an enum...
>>>
>>> Since an enum is an int, and we're never going to have anything with an mtu <= 5 bytes, couldn't we just store all new mtu values directly as their byte value?
>>
>> That seems like a pretty good idea.
>
> Agreed, but changing to an int would seem to have some fairly serious backwards compatibility issues. What is the right way to move forward here?
>
> Just to re-state: our issue is that there does not seem to be any other way to get the max UD message size without knowing the actual MTU (are we incorrect about that?). Hence, using the IB-defined values is not really sufficient.

I guess I am confused. Is this patch trying to support RoCE or a VNIC?

Ira
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Fri, Apr 05, 2013 at 01:43:49PM -0700, Roland Dreier wrote:
> On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote:
>> I also removed the IBV_*_WRITE flags on the sender-side and activated cgroups with the memory.memsw.limit_in_bytes activated and the migration with RDMA also succeeded without any problems (both with *and* without GIFT also worked).
>
> Not sure I'm interpreting this correctly. Are you saying that things worked without actually setting the GIFT flag? In which case why are we adding this flag?
>
> - R.

We are adding the flag to reduce memory when there's lots of COW pages. There's no guarantee there will be COW pages so I expect things to work both with and without breaking COW, just using much more memory when we break COW.

-- MST
Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean sean.he...@intel.com wrote:
> I have no issue with RSS/TSS. But the 'qp group' interface to using this seems kludgy.

OK, so let's take it over the patch that has the QP group description.

> On a node, this is multiple send/receive queues grouped together to form a larger construct. On the wire, this is a single QP - maybe? I'm still not clear on that. From what's written, all the send queues appear as a single QPN. The receive queues appear as different QPNs.
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Fri, Apr 05, 2013 at 04:17:36PM -0400, Michael R. Hines wrote:
> The userland part of the patch was missing (IBV_ACCESS_GIFT).
>
> I added that flag to /usr/include in addition to this patch and did a test RDMA migrate and it seems to work without any problems.
>
> I also removed the IBV_*_WRITE flags on the sender-side and activated cgroups with the memory.memsw.limit_in_bytes activated and the migration with RDMA also succeeded without any problems (both with *and* without GIFT also worked).
>
> Any additional tests you would like?
>
> - Michael

RDMA can't really work with swap so not sure how that's relevant. Please check memory.usage_in_bytes - is it lower with the GIFT flag? I think this is what we really care about.

-- MST
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups
> This patch introduces the concept of RSS and TSS QP groups, which allows for implementing them by low level drivers and using them by IPoIB and later also by user space ULPs.
>
> A QP group is a set of QPs consisting of a parent QP and two disjoint sets of RSS and TSS QPs. The creation of a QP group is a two stage process:
>
> In the 1st stage, the parent QP is created.
>
> In the 2nd stage the children QPs of the parent are created. Each child QP indicates if it is an RSS or TSS QP. Both the TSS and RSS sets of QPs should have contiguous QP numbers.
>
> It is forbidden to modify the parent QP state before all RSS/TSS children were created. In the same manner it is disallowed to destroy the parent QP unless all RSS/TSS children were destroyed.
>
> A few new elements/concepts are introduced to support this:
>
> Three new device capabilities that can be set by the low level driver:
>
> - IB_DEVICE_QPG which is set to indicate QP groups are supported.
>
> - IB_DEVICE_UD_RSS which is set to indicate that the device supports RSS, that is applying a hash function on incoming TCP/UDP/IP packets and dispatching them to multiple rings (child QPs).
>
> - IB_DEVICE_UD_TSS which is set to indicate that the device supports HW TSS, which means that the HW is capable of over-riding the source UD QPN present in the sent IB datagram header (DETH) with the parent's QPN.
>
> Low level drivers not supporting HW TSS could still support QP groups; such a combination is referred to as SW TSS. In this case, the low level driver fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from ib_create_qp, such that this mask can be used to retrieve the parent QPN from incoming packets carrying a child QPN (as of the contiguous QP numbers requirement).
>
> - max rss table size device attribute, which is the maximal size of the RSS indirection table supported by the device
>
> - qp group type attribute for qp creation saying whether this is a parent QP or rx/tx (rss/tss) child QP or none of the above for non rss/tss QPs.
> - per qp group type, another attribute is added, for parent QPs, the number of rx/tx child QPs and for child QPs pointer to the parent.
>
> - IB_QP_GROUP_RSS attribute mask, which should be used when modifying the parent QP state from reset to init

On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean sean.he...@intel.com wrote:
> I have no issue with RSS/TSS. But the 'qp group' interface to using this seems kludgy.

Let's try to be more specific.

> On a node, this is multiple send/receive queues grouped together to form a larger construct. On the wire, this is a single QP - maybe? I'm still not clear on that. From what's written, all the send queues appear as a single QPN. The receive queues appear as different QPNs.

Starting with RSS QP groups: it's a group made of one parent QP and N RSS child QPs. On the wire everything is sent to the RSS parent QP. However, when the HW receives a packet for which this QP/QPN is the destination, it applies a hash function on the packet header and, subject to the hash result, dispatches the packet to one of the N child QPs.

The design applies to IB UD QPs and Raw Ethernet Packet QP types. Under IB the QPN of the parent is on the wire. Under Eth there are no QPNs on the wire, but the HW has some steering rule which makes certain packets be steered to that RSS parent, and the RSS parent in turn makes a further dispatching decision (hashing) to determine which of the child RSS QPs will actually receive that packet.

With IPoIB, the remote side is provided with the RSS parent QPN as part of the IPoIB HW address provided in the ARP reply payload, so packets are sent to that QPN. With RAW Packet Eth QPs, the remote side isn't aware of QPNs at all; everything goes through a steering rule which directs to the RSS parent.

You can send packets over RSS packet QP but not receive packets.

So for RSS, the remote side isn't aware of that QP group at all. Makes sense?
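[Editor's sketch] The RSS dispatch step described above (hash the packet header, then pick one of the N child QPs) can be illustrated in a few lines. Everything here is hypothetical: header_hash() is a plain FNV-1a hash standing in for whatever hash the hardware applies, and rss_pick_child() simply relies on the patch's requirement that the child QPs have contiguous QP numbers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in hash (FNV-1a) over the header bytes. The hardware's real hash
 * function is not specified in this thread; this choice is an assumption.
 */
static uint32_t header_hash(const uint8_t *hdr, size_t len)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= hdr[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Hypothetical dispatch: the children are numbered first_child_qpn ..
 * first_child_qpn + n_children - 1, so the hash only has to select an
 * offset. n_children must be non-zero.
 */
static uint32_t rss_pick_child(uint32_t first_child_qpn, uint32_t n_children,
                               const uint8_t *hdr, size_t len)
{
    return first_child_qpn + header_hash(hdr, len) % n_children;
}
```

The sender only ever addresses the parent QPN; packets of one flow carry the same header fields, so they hash to the same child and stay on one receive ring.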

As for TSS QP groups, generally speaking, the only case that really matters is applications/drivers that care about the source QPN of a packet. But let's get there after hopefully agreeing on what an RSS QP group is.

Or.

> Signed-off-by: Shlomo Pongratz shlo...@mellanox.com
> ---
>  drivers/infiniband/core/uverbs_cmd.c         |    1 +
>  drivers/infiniband/core/verbs.c              |  118 ++
>  drivers/infiniband/hw/amso1100/c2_provider.c |    3 +
>  drivers/infiniband/hw/cxgb3/iwch_provider.c  |    2 +
>  drivers/infiniband/hw/cxgb4/qp.c             |    3 +
>  drivers/infiniband/hw/ehca/ehca_qp.c         |    3 +
>  drivers/infiniband/hw/ipath/ipath_qp.c       |    3 +
>  drivers/infiniband/hw/mlx4/qp.c              |    3 +
>  drivers/infiniband/hw/mthca/mthca_provider.c |    3 +
>  drivers/infiniband/hw/nes/nes_verbs.c        |    3 +
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |    5 +
>  drivers/infiniband/hw/qib/qib_qp.c           |    5 +
>  include/rdma/ib_verbs.h                      |   40
Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
On Apr 8, 2013, at 6:16 PM, Hefty, Sean sean.he...@intel.com wrote:

> Why can't IB_MTU_1500 = 1500?

It certainly could.

Additionally, since Roland was a little concerned about the IB prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could have a different prefix -- perhaps RDMA_MTU_1500. Although I admit that it would be weird to have an enum that contains values with different prefixes:

enum ib_mtu {
    IB_MTU_256    = 1,
    IB_MTU_512    = 2,
    IB_MTU_1024   = 3,
    IB_MTU_2048   = 4,
    IB_MTU_4096   = 5,
    RDMA_MTU_1500 = 1500,
    RDMA_MTU_9000 = 9000
};

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
> -Original Message-
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
>
> On Apr 8, 2013, at 6:16 PM, Hefty, Sean sean.he...@intel.com wrote:
>
> > Why can't IB_MTU_1500 = 1500?

Sean,

If the IBTA were to release new MTU enumerations, which values would you recommend then?

Ira

> It certainly could.
>
> Additionally, since Roland was a little concerned about the IB prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could have a different prefix -- perhaps RDMA_MTU_1500. Although I admit that it would be weird to have an enum that contains values with different prefixes:
>
> enum ib_mtu {
>     IB_MTU_256    = 1,
>     IB_MTU_512    = 2,
>     IB_MTU_1024   = 3,
>     IB_MTU_2048   = 4,
>     IB_MTU_4096   = 5,
>     RDMA_MTU_1500 = 1500,
>     RDMA_MTU_9000 = 9000
> };
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
With respect, I'm going to offload testing this patch back to the author =) because I'm trying to address all of Paolo's other minor issues with the RDMA patch before we can merge.

Since dynamic page registration (as you requested) is now fully implemented, this patch is less urgent since we now have a mechanism in place to avoid page pinning on both sides of the migration.

- Michael

On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
> presumably is_dup_page reads the page, so should not break COW ...
>
> I'm not sure about the cgroups swap limit - you might have too many non-COW pages, so attempting to fault them all in makes you exceed the limit. You really should look at what is going on in the pagemap, to see if there's measurable gain from the patch.
>
> On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
>> Well, I have the is_dup_page() commented out...when RDMA is activated.
>>
>> Is there something else in QEMU that could be touching the page that I don't know about?
>>
>> - Michael
>>
>> On 04/05/2013 05:03 PM, Roland Dreier wrote:
>>> On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote:
>>>> Sorry, I was wrong. Ignore the comments about cgroups. That's still broken (i.e. trying to register RDMA memory while using a cgroup swap limit causes the process to get killed).
>>>>
>>>> But the GIFT flag patch works (my understanding is that GIFT flag allows the adapter to transmit stale memory information, it does not have anything to do with cgroups specifically).
>>>
>>> The point of the GIFT patch is to avoid triggering copy-on-write so that memory doesn't blow up during migration. If that doesn't work then there's no point to the patch.
>>>
>>> - R.
RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
-Original Message-
From: Hefty, Sean
Sent: Tuesday, April 09, 2013 6:30 PM
To: Weiny, Ira; Jeff Squyres (jsquyres)
Cc: Hal Rosenstock; Roland Dreier; linux-rdma@vger.kernel.org; Upinder Malhi (umalhi)
Subject: RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

If the IBTA were to release new MTU enumerations which values would you recommend then?

I don't think there's a great solution here. We're mixing IBTA encoded values with non-IBTA values. We could reserve the 6-bit encoded values for IB, and use direct values for others (or at least jump beyond the 6-bit range). Or we can stop matching new IBTA MTU encodings (e.g. IB_MTU_1500 = 6). Or we go back in time and make mtu an int.

I thought reserving the 6 bits for IB and allowing the enum values to match the MTU was a pretty good compromise. Especially since PathRecord is defined in sa.h which is provided by libibverbs. That allows for that IB MTU enum to be used there.

OTOH, now that we have moved toward decent defines in the libibumad library we could define the MTU enum there. But then we again go down the path of defining things multiple places and confusing the users... :-(

As an aside I like the use of RDMA_MTU_* for these values. Again to distinguish them from the IBTA values. But I know that is poor form.

Ira
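The option of reserving the 6-bit encoded range for IBTA values and using direct values beyond it can be sketched as follows. This is only an illustration under assumptions: the enum is the one posted earlier in this thread, while mtu_enum_to_bytes() and MTU_IBTA_ENCODING_MAX are hypothetical names made up for this example, not existing kernel or libibverbs symbols.

```c
/* Enum as posted earlier in this thread: IBTA encodings plus direct values. */
enum ib_mtu {
    IB_MTU_256    = 1,
    IB_MTU_512    = 2,
    IB_MTU_1024   = 3,
    IB_MTU_2048   = 4,
    IB_MTU_4096   = 5,
    RDMA_MTU_1500 = 1500,
    RDMA_MTU_9000 = 9000
};

/* Hypothetical constant: largest value that fits in a 6-bit MTU field. */
#define MTU_IBTA_ENCODING_MAX 63

/*
 * Hypothetical helper: convert an enum value to an MTU in bytes.
 * Returns -1 for values in the reserved 6-bit range that have no
 * defined IBTA meaning in the enum above (0 and 6..63).
 */
static int mtu_enum_to_bytes(int mtu)
{
    if (mtu >= IB_MTU_256 && mtu <= IB_MTU_4096)
        return 128 << mtu;      /* encoded: 1 -> 256, ..., 5 -> 4096 */
    if (mtu > MTU_IBTA_ENCODING_MAX)
        return mtu;             /* direct: the value is the MTU itself */
    return -1;                  /* reserved for future IBTA encodings */
}
```

With this split, a new IBTA encoding (say 6) could later be given a meaning without colliding with a direct value, since direct values always lie above 63.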
Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag
On Tue, Apr 09, 2013 at 09:26:59PM -0400, Michael R. Hines wrote:

With respect, I'm going to offload testing this patch back to the author =) because I'm trying to address all of Paolo's other minor issues with the RDMA patch before we can merge.

Fair enough, this likely means it won't happen anytime soon though.

Since dynamic page registration (as you requested) is now fully implemented, this patch is less urgent since we now have a mechanism in place to avoid page pinning on both sides of the migration.

- Michael

Which mechanism do you refer to? Your patches still seem to pin each page in guest memory at some point, which will break all COW. In particular any pagemap tricks to detect duplicates on source that I suggested won't work.

On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:

presumably is_dup_page reads the page, so should not break COW ... I'm not sure about the cgroups swap limit - you might have too many non COW pages so attempting to fault them all in makes you exceed the limit. You really should look at what is going on in the pagemap, to see if there's measurable gain from the patch.

On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:

Well, I have the is_dup_page() commented out...when RDMA is activated. Is there something else in QEMU that could be touching the page that I don't know about?

- Michael

On 04/05/2013 05:03 PM, Roland Dreier wrote:

On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines mrhi...@linux.vnet.ibm.com wrote:

Sorry, I was wrong. Ignore the comments about cgroups. That's still broken (i.e. trying to register RDMA memory while using a cgroup swap limit causes the process to get killed). But the GIFT flag patch works (my understanding is that GIFT flag allows the adapter to transmit stale memory information, it does not have anything to do with cgroups specifically).

The point of the GIFT patch is to avoid triggering copy-on-write so that memory doesn't blow up during migration.
If that doesn't work then there's no point to the patch. - R.