Re: tune ib stack

2013-04-09 Thread Vasiliy Tolstov
2013/4/9 Sebastian Riemer sebastian.rie...@profitbricks.com:
 Because 2048 is the default and 4096 is the max. supported MTU by the
 hardware.

 How can i set active mtu?

 Something like this:
 echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

After doing this, all SRP connections go down and the port is down. I need to
restart openibd.

06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Subsystem: Mellanox Technologies Device 0017
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 42
    Region 0: Memory at df90 (64-bit, non-prefetchable) [size=1M]
    Region 2: Memory at de00 (64-bit, prefetchable) [size=8M]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] Vital Product Data
        Not readable
    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
        Vector table: BAR=0 offset=0007c000
        PBA: BAR=0 offset=0007d000
    Capabilities: [60] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 0
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [148 v1] Device Serial Number 00-25-90-ff-ff-17-9b-24
    Capabilities: [18c v1] #19
    Kernel driver in use: mlx4_core
    Kernel modules: mlx4_core


 Could be a bug. Which OFED/Kernel (if using in-tree IB modules) do you use?
 Mine says with ConnectX2 QDR: 40 Gb/sec (4X QDR)

I'm using a stock 3.8.6 kernel with Xen patches on top, and I use the
modules provided with the kernel (only ib_srp comes from Bart's GitHub
repo).


 You should see 40 Gb/sec (4X QDR) here. Perhaps the OFED is too old so
 that FDR and ConnectX 3 aren't supported, yet. 10 Gb/sec (4X) seems to
 be the default case if a rate isn't supported.

Yes, with an older ConnectX card I see this, but with ConnectX-3 I see only 10 Gb.

--
Vasiliy Tolstov,
e-mail: v.tols...@selfip.ru
jabber: v...@selfip.ru
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 13:51, Vasiliy Tolstov wrote:
 Something like this:
 echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
 
 After doing this all srp connections down and port is down. I need to
 restart openibd

Sorry for that! It's much easier to set the IP MTU. Managed switches
support setting the RDMA MTU. So it could be possible that it is a
setting in the SM config. But I'm not sure.

$ man opensm
says that it can be set in the partitions.conf

 You should see 40 Gb/sec (4X QDR) here. Perhaps the OFED is too old so
 that FDR and ConnectX 3 aren't supported, yet. 10 Gb/sec (4X) seems to
 be the default case if a rate isn't supported.
 
 Yes, with an older ConnectX card I see this, but with ConnectX-3 I see only 10 Gb.

The kernel version is okay. It depends on the user space.
There is a support note in OFED 3.5:
- ConnectX-3 (fw-ConnectX3 Rev 2.11.0500) (FDR and FDR10 Modes are
Supported)

Before OFED 3.5 these HCAs aren't supported. A look at the related
source code could be worth a try.

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 8:15 AM, Sebastian Riemer wrote:
 On 09.04.2013 13:51, Vasiliy Tolstov wrote:
 Something like this:
 echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

 After doing this all srp connections down and port is down. I need to
 restart openibd
 
 Sorry for that! It's much easier to set the IP MTU. Managed switches
 support setting the RDMA MTU. So it could be possible that it is a
 setting in the SM config. But I'm not sure.

IP MTU is different than link MTU. For UD mode, it's link MTU - 4. For
RC (connected) mode, this can be a much larger number than the link MTU
as the HCA does the segmentation/reassembly down to the path MTU.
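
To make the UD arithmetic concrete, here is a minimal C sketch. The helper
names are made up for illustration (they are not the kernel's). The code
assumes the usual IB MTU codes 1..5 for 256..4096 bytes and the 4-byte IPoIB
encapsulation header.

```c
/* IPoIB encapsulation header length in bytes (the "- 4" above). */
#define IPOIB_ENCAP_LEN 4

/* Hypothetical helper: IB MTU code (1..5) -> bytes, or -1 if out of range. */
int ib_mtu_code_to_bytes(int code)
{
    if (code < 1 || code > 5)
        return -1;
    return 256 << (code - 1);
}

/* Hypothetical helper: largest IP MTU in UD (datagram) mode. */
int ipoib_ud_ip_mtu(int link_mtu_code)
{
    int link_bytes = ib_mtu_code_to_bytes(link_mtu_code);

    return link_bytes < 0 ? -1 : link_bytes - IPOIB_ENCAP_LEN;
}
```

With a 2048-byte link MTU (code 4) this gives 2044, and with 4096 (code 5)
it gives 4092.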

 $ man opensm
 says that it can be set in the partitions.conf

Yes, MTU for the IPoIB interface is set in the partition file. This
would need configuring for the larger (4K) MTU assuming all ports
support the 4K MTU. If not, some ports won't be able to join the IPoIB
broadcast (or other) IB multicast groups and IPoIB won't work.

-- Hal


RE: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Luick, Dean
 From: Roland Dreier rol...@purestorage.com
 +	if (wc->byte_len < IPOIB_UD_HEAD_SIZE) {
 +		page = priv->rx_ring[wr_id].page;
 +		priv->rx_ring[wr_id].page = NULL;
 +	} else {
 +		page = NULL;
 +	}
 +
 	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
 	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
 		++dev->stats.rx_dropped;
 +		priv->rx_ring[wr_id].page = page;
 		goto repost;
 	}


Can you go through the else of the first if (page is NULL), then enter the 
second if? If so, isn't the page lost?
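
The path in question can be replayed in a small user-space model. This is
only a sketch: the struct and function names are hypothetical, it assumes the
comparison in the quoted hunk is "less than", and it assumes a 40-byte GRH
plus the 4-byte IPoIB header plus the 128 bytes added by the patch for the
head size.

```c
#include <stddef.h>

/* 40-byte GRH + 4-byte IPoIB header + 128 bytes, as in the v3 patch. */
#define HEAD_SIZE (40 + 4 + 128)

struct rx_slot {
    void *page; /* stand-in for priv->rx_ring[wr_id].page */
};

/*
 * Replays the quoted hunk for one completion whose replacement buffer
 * allocation fails. Returns 1 if the ring slot still references its page
 * afterwards, 0 if the reference was overwritten with NULL.
 */
int slot_keeps_page_after_failed_alloc(unsigned int byte_len)
{
    static char old_page;               /* pretend page owned by the slot */
    struct rx_slot slot = { &old_page };
    void *page;

    if (byte_len < HEAD_SIZE) {         /* assumed comparison, see above */
        page = slot.page;
        slot.page = NULL;
    } else {
        page = NULL;
    }

    /* ipoib_alloc_rx_skb() failed: the patch stores the saved pointer back. */
    slot.page = page;

    return slot.page != NULL;
}
```

In this model a short packet keeps its page, while a packet that takes the
else branch leaves the slot with a NULL page after the failed allocation.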


Dean


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 14:49, Hal Rosenstock wrote:
 On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
 Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
 Why max_mtu differs with active_mtu? 
 
 What does peer port say for max MTU ?
 
 How can i set active mtu?
 
 SM sets active MTU to min of peer ports max MTUs.

So with peer port max MTU do you mean this file?:

/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I've seen that it can be set as well. I've got two ConnectX-2 machines
connected back2back. In general these have 4K max and active.

So let's try something:

Host1:
$ echo 2048 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
# Port is not active, let's reactivate it.
$ echo 1 > /sys/class/infiniband/mlx4_0/device/enable

ibv_devinfo Host1:
max_mtu:    2048 (4)
active_mtu: 2048 (4)

Host2:
max_mtu:    4096 (5)
active_mtu: 2048 (4)

Both had 4096 (5) before everywhere.
So that's the recommended way to reduce the MTU?

I've heard that reducing the MTU in a fabric can help fighting
congestion issues. As congestion control doesn't work yet, could this
help against congestion?

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
 On 09.04.2013 14:49, Hal Rosenstock wrote:
 On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
 Hello. I have some servers, with mellanox ConnectX-3 and have some 
 questions:
 Why max_mtu differs with active_mtu? 

 What does peer port say for max MTU ?

 How can i set active mtu?

 SM sets active MTU to min of peer ports max MTUs.
 
 So with peer port max MTU do you mean this file?:
 
 /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
supported MTU.
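
As a rough illustration of the rule that the active MTU is the minimum of the
two peers' maximum MTUs, here is a tiny C sketch. The function name is
hypothetical, and the arguments are the MTU codes (1 = 256 ... 5 = 4096) that
ibv_devinfo prints in parentheses.

```c
/* Hypothetical helper: the smaller of two MTU codes (1 = 256 ... 5 = 4096). */
int link_active_mtu_code(int local_mtu_cap, int peer_mtu_cap)
{
    return local_mtu_cap < peer_mtu_cap ? local_mtu_cap : peer_mtu_cap;
}
```

This matches the back-to-back experiment earlier in the thread: one side
limited to 2048 (4) and the other at 4096 (5) ended up with an active MTU of
2048 (4) on both hosts.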

-- Hal


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 15:34, Hal Rosenstock wrote:
 On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
 On 09.04.2013 14:49, Hal Rosenstock wrote:
 On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
 Hello. I have some servers, with mellanox ConnectX-3 and have some 
 questions:
 Why max_mtu differs with active_mtu? 

 What does peer port say for max MTU ?

 How can i set active mtu?

 SM sets active MTU to min of peer ports max MTUs.

 So with peer port max MTU do you mean this file?:

 /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
 
 I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
 supported MTU.

So these values are exactly the same as in ibv_devinfo and can be set
in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.

I've found the PortInfo with the command
smpquery portinfo -C mlx4_0 3 1
where I'm using the first HCA to contact the SM. I tell the SM the
destination LID ('3' here in my case) and the destination port ('1').

Is there another method to set the max MTU?

I know that switches can also set the max MTU for their switch ports
where most of them use 2048 as default.
How to change these switch port MTUs for unmanaged switches?

On managed switches this can be done over the web front-end.

Cheers,
Sebastian


Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support

2013-04-09 Thread Or Gerlitz

On 03/04/2013 23:12, Hefty, Sean wrote:

Hi Sean, Ping. You had concerns on the suggested concept, we want to
know if we addressed them, can you comment?

I'm in meetings this week until tomorrow.  I'll try to take a look at the 
updated patches then or Friday.



any feedback?


Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 9:56 AM, Sebastian Riemer wrote:
 On 09.04.2013 15:34, Hal Rosenstock wrote:
 On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
 On 09.04.2013 14:49, Hal Rosenstock wrote:
 On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
 Hello. I have some servers, with mellanox ConnectX-3 and have some 
 questions:
 Why max_mtu differs with active_mtu? 

 What does peer port say for max MTU ?

 How can i set active mtu?

 SM sets active MTU to min of peer ports max MTUs.

 So with peer port max MTU do you mean this file?:

 /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

 I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
 supported MTU.
 
 So these values are exactly the same as in ibv_devinfo and can be set
 in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.
 
 I've found the PortInfo with the command
 smpquery portinfo -C mlx4_0 3 1
 where I'm using the first HCA to contact the SM. I tell the SM the
 destination LID ('3' here in my case) and the destination port ('1').
 
 Is there another method to set the max MTU?

That doesn't set max MTU (MTUCap) but merely reads it (for that port).

 I know that switches can also set the max MTU for their switch ports
 where most of them use 2048 as default.

You would need to contact your CA and/or switch vendor(s) (see below).

 How to change these switch port MTUs for unmanaged switches?
 
 On managed switches this can be done over the web front-end.

Yes. MTUCap is RO in terms of the SM so there are only out of band
mechanisms to change this which are vendor specific like a web front end.

-- Hal

 Cheers,
 Sebastian
 



Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 16:23, Hal Rosenstock wrote:
 So these values are exactly the same as in ibv_devinfo and can be set
 in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.

 I've found the PortInfo with the command
 smpquery portinfo -C mlx4_0 3 1
 where I'm using the first HCA to contact the SM. I tell the SM the
 destination LID ('3' here in my case) and the destination port ('1').

 Is there another method to set the max MTU?
 
 That doesn't set max MTU (MTUCap) but merely reads it (for that port).

Sorry, copy and paste error. I meant the mlx4 file:
/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

But you've answered that by vendor specific. Thanks for the valuable
information!

What interests us most is whether the MTU can be changed live without
any service disruption. It looks like the mlx4 driver can't provide that.
Perhaps switches can.

Cheers,
Sebastian



Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Roland Dreier
On Tue, Apr 9, 2013 at 6:13 AM, Luick, Dean dean.lu...@intel.com wrote:
 Can you go through the else of the first if (page is NULL), then enter the 
 second if? If so, isn't the page lost?

Thanks, good catch.  I'll fix that up.


RE: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support

2013-04-09 Thread Hefty, Sean
 any feedback?

I have no issue with RSS/TSS.  But the 'qp group' interface to using this seems 
kludgy.

On a node, this is multiple send/receive queues grouped together to form a 
larger construct.  On the wire, this is a single QP - maybe?  I'm still not 
clear on that.  From what's written, all the send queues appear as a single 
QPN.  The receive queues appear as different QPNs.



Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
 To be more specific, here's what I did:
 
 1. apply kernel module patch - re-insert module
 2. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
 3. Start the RDMA migration
 4. Migration completes without any errors
 
 This test does *not* work with a cgroup swap limit, however. The
 process gets killed. (Both with and without GIFT)
 
 - Michael

Try to attach a debugger and see where it is when it gets killed?

 On 04/05/2013 04:43 PM, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
 I also removed the IBV_*_WRITE flags on the sender-side and activated
 cgroups with the memory.memsw.limit_in_bytes activated and the migration
 with RDMA also succeeded without any problems (both with *and* without GIFT
 also worked).

 Not sure I'm interpreting this correctly.  Are you saying that things
 worked without actually setting the GIFT flag?   In which case why are
 we adding this flag?
 
   - R.
 
 


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael R. Hines

On 04/09/2013 12:39 PM, Michael S. Tsirkin wrote:

On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:

To be more specific, here's what I did:

1. apply kernel module patch - re-insert module
2. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
3. Start the RDMA migration
4. Migration completes without any errors

This test does *not* work with a cgroup swap limit, however. The
process gets killed. (Both with and without GIFT)

- Michael

Try to attach a debugger and see where it is when it gets killed?



It's killed by cgroups - not a CPU exception.

The same test works fine using TCP migration with cgroups - everything 
is fine there.


The memory that RDMA attempted to register hits some kind of cgroups policy
which results in a kernel message saying that the cgroup swap limit was hit
and then it goes ahead and kills the process altogether.

It's not a QEMU problem - it seems to be a kernel bug.



Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Markus Stockhausen

 
-IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
+/* add 128 bytes of tailroom for IP/TCP headers */
+IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,

Hello,

the version 3 of the patch finally works. I can see the performance
gains but I cannot feel them (in real life). Here are the results
of my testbed:

Test 1:
netperf/netserver message size 16K

kernel 3.5 default :  5.1 GBit/s
kernel 3.5 + patch v3  :  7.7 GBit/s
kernel 3.5 + max MTU 3K: 10.8 GBit/s

Test 2:
Disk write performance
VM with disk mounted on IB async NFS server

block size | default  | patch v3 | max MTU 3K
-----------+----------+----------+-----------
      1 KB |  10 MB/s |  10 MB/s |  10 MB/s
      2 KB |  20 MB/s |  21 MB/s |  20 MB/s
      4 KB |  40 MB/s |  40 MB/s |  43 MB/s
      8 KB |  68 MB/s |  70 MB/s |  78 MB/s
     16 KB | 105 MB/s | 105 MB/s | 120 MB/s
     32 KB | 150 MB/s | 150 MB/s | 170 MB/s
     64 KB | 200 MB/s | 210 MB/s | 260 MB/s
    128 KB | 270 MB/s | 290 MB/s | 400 MB/s
    256 KB | 300 MB/s | 310 MB/s | 430 MB/s
    512 KB | 305 MB/s | 320 MB/s | 470 MB/s
   1024 KB | 310 MB/s | 325 MB/s | 500 MB/s
   2048 KB | 310 MB/s | 325 MB/s | 510 MB/s
   4096 KB | 370 MB/s | 325 MB/s | 510 MB/s
   8192 KB | 400 MB/s | 325 MB/s | 520 MB/s


As you can see netperf throughput increases while NFS does not
even care about the optimizations. Maybe it does not work well
with fragmented SKBs. The MAX MTU 3K values once again are
forced through a hack inside ipoib_main.c.

Out of curiosity I changed the block splitting in your v3 patch
from a small head with a large fragment to a large head with a small
fragment in this line:

IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 3072

In my 2044 MTU case this brings the netperf and NFS throughput to
the same levels as the dirty hack. Of course this no longer
reflects a head but is more or less equivalent to something like a
new constant IPOIB_UD_FIXED_SKB_SIZE.
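
The effect of the two head sizes can be worked through with a small C sketch.
The function names are hypothetical, and the model assumes a received UD
packet occupies the 40-byte GRH, the 4-byte IPoIB header and then up to one
IP MTU of payload.

```c
#define IB_GRH_BYTES    40  /* global route header in front of each UD receive */
#define IPOIB_ENCAP_LEN 4   /* IPoIB encapsulation header */

/* Head size for a given amount of extra room (128 in v3, 3072 in the test). */
int ud_head_size(int extra)
{
    return IB_GRH_BYTES + IPOIB_ENCAP_LEN + extra;
}

/* 1 if a full-size packet for the given IP MTU fits entirely in the head. */
int full_packet_fits_in_head(int ip_mtu, int extra)
{
    int wire_len = IB_GRH_BYTES + IPOIB_ENCAP_LEN + ip_mtu;

    return wire_len <= ud_head_size(extra);
}
```

Under these assumptions the v3 head is 172 bytes and the enlarged one is 3116
bytes. A full 2044-byte packet (2088 bytes with headers) spills into the
fragment with the v3 head but fits completely in the enlarged head, while a
full 4092-byte packet would not fit in either.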

I guess 4K MTU will not see any further gains, but avoiding the
skb_pull calls should improve speed as well. Maybe a final
adaptation could put the cherry on the cake.

Markus




Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
presumably is_dup_page reads the page, so should not break COW ...

I'm not sure about the cgroups swap limit - you might have
too many non COW pages so attempting to fault them all in
makes you exceed the limit. You really should look at
what is going on in the pagemap, to see if there's
a measurable gain from the patch.


On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
 Well, I have the is_dup_page() commented out...when RDMA is
 activated.
 
 Is there something else in QEMU that could be touching the page that
 I don't know about?
 
 - Michael
 
 
 On 04/05/2013 05:03 PM, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
 Sorry, I was wrong. Ignore the comments about cgroups. That's still broken.
 (i.e. trying to register RDMA memory while using a cgroup swap limit causes
 the process to get killed).
 
 But the GIFT flag patch works (my understanding is that GIFT flag allows the
 adapter to transmit stale memory information, it does not have anything to
 do with cgroups specifically).

 The point of the GIFT patch is to avoid triggering copy-on-write so
 that memory doesn't blow up during migration.  If that doesn't work
 then there's no point to the patch.
 
   - R.
 


RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Weiny, Ira
 -Original Message-
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
 
 On Apr 4, 2013, at 1:57 PM, Weiny, Ira ira.we...@intel.com wrote:
 
  In hindsight, the user space API never should have exposed the mtu as
  an enum...
 
  Since an enum is an int, and we're never going to have anything with
  an mtu <= 5 bytes, couldn't we just store all new mtu values directly
  as their byte value?
 
  That seems like a pretty good idea.
 
 
 Agreed, but changing to an int would seem to have some fairly serious
 backwards compatibility issues.
 
 What is the right way to move forward here?
 
 Just to re-state: our issue is that there does not seem to be any other way to
 get the max UD message size without knowing the actual MTU (are we
 incorrect about that?).  Hence, using the IB-defined values is not really
 sufficient.

I guess I am confused.  Is this patch trying to support RoCE or a VNIC?

Ira




Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 01:43:49PM -0700, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
  I also removed the IBV_*_WRITE flags on the sender-side and activated
  cgroups with the memory.memsw.limit_in_bytes activated and the migration
  with RDMA also succeeded without any problems (both with *and* without GIFT
  also worked).
 
 Not sure I'm interpreting this correctly.  Are you saying that things
 worked without actually setting the GIFT flag?   In which case why are
 we adding this flag?
 
  - R.

We are adding the flag to reduce memory when there's lots of COW pages.
There's no guarantee there will be COW pages so I expect things to work
both with and without breaking COW, just using much more memory when we
break COW.

-- 
MST


Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support

2013-04-09 Thread Or Gerlitz
On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean sean.he...@intel.com wrote:

 I have no issue with RSS/TSS.  But the 'qp group' interface to using this 
 seems kludgy.

OK, so let's take it over to the patch that has the QP group description.

 On a node, this is multiple send/receive queues grouped together to form a 
 larger
 construct.  On the wire, this is a single QP - maybe?  I'm still not clear on 
 that.  From
 what's written, all the send queues appear as a single QPN.  The receive 
 queues
 appear as different QPNs.


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 04:17:36PM -0400, Michael R. Hines wrote:
 The userland part of the patch was missing (IBV_ACCESS_GIFT).
 
 I added that flag to /usr/include in addition to this patch and did
 a test RDMA migrate and it seems to work without any problems.
 
 I also removed the IBV_*_WRITE flags on the sender-side and
 activated cgroups with the memory.memsw.limit_in_bytes activated
 and the migration with RDMA also succeeded without any problems
 (both with *and* without GIFT also worked).
 
 Any additional tests you would like?
 
 
 - Michael

RDMA can't really work with swap so not sure how that's relevant.

Please check memory.usage_in_bytes - is it lower with
the GIFT flag?  I think this is what we really care about.

-- 
MST


Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-09 Thread Or Gerlitz
 This patch introduces the concept of RSS and TSS QP groups which
 allows for implementing them by low level drivers and using it
 by IPoIB and later also by user space ULPs.

 A QP group is a set of QPs consisting of a parent QP and two disjoint sets
 of RSS and TSS QPs. The creation of a QP group is a two stage process:

 In the 1st stage, the parent QP is created.

 In the 2nd stage the children QPs of the parent are created.

 Each child QP indicates if it's an RSS or TSS QP. Both the TSS
 and RSS sets of QPs should have contiguous QP numbers.

 It is forbidden to modify parent QP state before all RSS/TSS children
 were created. In the same manner it is disallowed to destroy the parent
 QP unless all RSS/TSS children were destroyed.

 A few new elements/concepts are introduced to support this:

 Three new device capabilities that can be set by the low level driver:

 - IB_DEVICE_QPG which is set to indicate QP groups are supported.

 - IB_DEVICE_UD_RSS which is set to indicate that the device supports
 RSS, that is applying hash function on incoming TCP/UDP/IP packets and
 dispatching them to multiple rings (child QPs).

 - IB_DEVICE_UD_TSS which is set to indicate that the device supports
 HW TSS which means that the HW is capable of over-riding the source
 UD QPN present in sent IB datagram header (DTH) with the parent's QPN.

 Low level drivers not supporting HW TSS could still support QP groups; such
 a combination is referred to as SW TSS. In this case, the low level driver
 fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from
 ib_create_qp, such that this mask can be used to retrieve the parent QPN from
 incoming packets carrying a child QPN (owing to the contiguous QP numbers
 requirement).

 - max rss table size device attribute, which is the maximal size of the RSS
 indirection table  supported by the device

 - qp group type attribute for qp creation saying whether this is a parent QP
 or rx/tx (rss/tss) child QP or none of the above for non rss/tss QPs.

 - per qp group type, another attribute is added, for parent QPs, the number
 of rx/tx child QPs and for child QPs pointer to the parent.

 - IB_QP_GROUP_RSS attribute mask, which should be used when modifying
 the parent QP state from reset to init


On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean sean.he...@intel.com wrote:

 I have no issue with RSS/TSS.  But the 'qp group' interface to using this 
 seems kludgy.

Let's try to be more specific.

 On a node, this is multiple send/receive queues grouped together to form a 
 larger
 construct.  On the wire, this is a single QP - maybe?  I'm still not clear on 
 that.  From
 what's written, all the send queues appear as a single QPN.  The receive 
 queues
 appear as different QPNs.

Starting with RSS QP groups: it's a group made of one parent QP and N
RSS child QPs.

On the wire everything is sent to the RSS parent QP, however, when the
HW receives a packet for which this QP/QPN is the destination, it
applies a hash function on the packet header and subject to the hash
result dispatches the packet to one of the N child QPs.
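
The dispatch step can be sketched in C as follows. This is only an
illustration: the function names and the choice of header fields are
hypothetical, and FNV-1a is used purely as a stand-in, since the hash the
hardware actually applies is not described here.

```c
#include <stdint.h>

/* FNV-1a over the flow fields; a stand-in for the hardware's hash. */
uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port)
{
    uint32_t words[3];
    uint32_t h = 2166136261u;
    int i, b;

    words[0] = src_ip;
    words[1] = dst_ip;
    words[2] = ((uint32_t)src_port << 16) | dst_port;

    for (i = 0; i < 3; i++) {
        for (b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;
        }
    }
    return h;
}

/*
 * Picks which of the n_children RSS child QPs receives a packet that
 * arrived on the parent QPN. n_children must be non-zero.
 */
unsigned int rss_child_index(uint32_t src_ip, uint32_t dst_ip,
                             uint16_t src_port, uint16_t dst_port,
                             unsigned int n_children)
{
    return flow_hash(src_ip, dst_ip, src_port, dst_port) % n_children;
}
```

The point of the sketch is that packets of one flow always land on the same
child QP, while the sender only ever addresses the parent QPN.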

The design applies to IB UD QPs and Raw Ethernet Packet QP types.
Under IB the QPN of the parent is on the wire; under Eth there are no
QPNs on the wire, but the HW has some steering rule which causes
certain packets to be steered to that RSS parent, and the RSS parent
in turn makes a further dispatching decision (hashing) to determine which
of the child RSS QPs will actually receive that packet.

With IPoIB, the remote side is provided with the RSS parent QPN as
part of the IPoIB HW address provided in the ARP reply payload, so
packets are sent to that QPN. With Raw Packet Eth QPs, the remote side
isn't aware of QPNs at all; everything goes through a steering rule that
directs packets to the RSS parent.

You can send packets over RSS packet QP but not receive packets.

So for RSS, the remote side isn't aware of that QP group at all.

Makes sense?

As for TSS QP groups, generally speaking, the only case
that really matters is applications/drivers that care about the source
QPN of a packet.

But let's get there after hopefully agreeing on what an RSS QP group is.

Or.


 Signed-off-by: Shlomo Pongratz shlo...@mellanox.com
 ---
  drivers/infiniband/core/uverbs_cmd.c         |   1 +
  drivers/infiniband/core/verbs.c              | 118 ++
  drivers/infiniband/hw/amso1100/c2_provider.c |   3 +
  drivers/infiniband/hw/cxgb3/iwch_provider.c  |   2 +
  drivers/infiniband/hw/cxgb4/qp.c             |   3 +
  drivers/infiniband/hw/ehca/ehca_qp.c         |   3 +
  drivers/infiniband/hw/ipath/ipath_qp.c       |   3 +
  drivers/infiniband/hw/mlx4/qp.c              |   3 +
  drivers/infiniband/hw/mthca/mthca_provider.c |   3 +
  drivers/infiniband/hw/nes/nes_verbs.c        |   3 +
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |   5 +
  drivers/infiniband/hw/qib/qib_qp.c           |   5 +
  include/rdma/ib_verbs.h                      |  40

Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Jeff Squyres (jsquyres)
On Apr 8, 2013, at 6:16 PM, Hefty, Sean sean.he...@intel.com wrote:

 Why can't IB_MTU_1500 = 1500?


It certainly could.  Additionally, since Roland was a little concerned about 
the IB prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could 
have a different prefix -- perhaps RDMA_MTU_1500.  

Although I admit that it would be weird to have an enum that contains values 
with different prefixes:

enum ib_mtu {
    IB_MTU_256    = 1,
    IB_MTU_512    = 2,
    IB_MTU_1024   = 3,
    IB_MTU_2048   = 4,
    IB_MTU_4096   = 5,
    RDMA_MTU_1500 = 1500,
    RDMA_MTU_9000 = 9000
};
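
If new values were stored directly as byte counts, a conversion helper could
look roughly like the C sketch below. The enum and function names are
hypothetical (the enum is renamed so it does not clash with the real ib_mtu),
and the rule "codes 1..5 are IB MTUs, anything larger is already in bytes" is
an assumption taken from this discussion, not an existing API.

```c
/* Illustrative copy of the proposed enum under a different name. */
enum mtu_sketch {
    IB_MTU_256    = 1,
    IB_MTU_512    = 2,
    IB_MTU_1024   = 3,
    IB_MTU_2048   = 4,
    IB_MTU_4096   = 5,
    RDMA_MTU_1500 = 1500,
    RDMA_MTU_9000 = 9000
};

/* Hypothetical helper: MTU value -> bytes, or -1 for invalid input. */
int mtu_to_bytes(int mtu)
{
    if (mtu <= 0)
        return -1;
    if (mtu <= IB_MTU_4096)
        return 256 << (mtu - 1);   /* IB-defined codes 1..5 */
    return mtu;                    /* larger values are already bytes */
}
```

Existing callers that pass the IB codes would keep getting 256..4096, and the
new values would pass through unchanged.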

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Weiny, Ira
 -Original Message-
 From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
 Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
 
 On Apr 8, 2013, at 6:16 PM, Hefty, Sean sean.he...@intel.com wrote:
 
  Why can't IB_MTU_1500 = 1500?
 

Sean,

If the IBTA were to release new MTU enumerations, which values would you 
recommend then?

Ira

 
 It certainly could.  Additionally, since Roland was a little concerned about 
 the
 IB prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could
 have a different prefix -- perhaps RDMA_MTU_1500.
 
 Although I admit that it would be weird to have an enum that contains values
 with different prefixes:
 
 enum ib_mtu {
 IB_MTU_256  = 1,
 IB_MTU_512  = 2,
 IB_MTU_1024 = 3,
 IB_MTU_2048 = 4,
 IB_MTU_4096 = 5,
 RDMA_MTU_1500 = 1500,
 RDMA_MTU_9000 =   9000
 };
 
 --
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael R. Hines

With respect, I'm going to offload testing this patch back to the author =)
because I'm trying to address all of Paolo's other minor issues
with the RDMA patch before we can merge.

Since dynamic page registration (as you requested) is now fully
implemented, this patch is less urgent since we now have a
mechanism in place to avoid page pinning on both sides of the migration.

- Michael

On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:

presumably is_dup_page reads the page, so should not break COW ...

I'm not sure about the cgroups swap limit - you might have
too many non COW pages so attempting to fault them all in
makes you exceed the limit. You really should look at
what is going on in the pagemap, to see if there's
measurable gain from the patch.
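
For anyone wanting to check the pagemap as suggested, a rough sketch follows. It assumes the Linux /proc/self/pagemap layout (one 64-bit entry per virtual page, bit 63 = page present in RAM, bit 62 = page swapped). The pm_* names are made up for illustration and come from neither QEMU nor the patch. Note that this only tells whether a page is resident or swapped; deciding whether a page is still shared (COW not yet broken) needs more, e.g. per-frame map counts from /proc/kpagecount, which is not shown here.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Flag bits of a pagemap entry (Linux /proc/<pid>/pagemap). */
#define PM_PRESENT (1ULL << 63)	/* page is resident in RAM */
#define PM_SWAPPED (1ULL << 62)	/* page is in swap */

/* Hypothetical helpers: decode the two flag bits of one entry. */
int pm_present(uint64_t entry) { return (entry & PM_PRESENT) != 0; }
int pm_swapped(uint64_t entry) { return (entry & PM_SWAPPED) != 0; }

/*
 * Read the pagemap entry for the page containing addr in the
 * calling process.  Returns 0 on success, -1 on any error.
 */
int pm_read_entry(const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t offset;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;

	/* One 8-byte entry per virtual page, indexed by page number. */
	offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
		 (off_t)sizeof(*entry);

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}

	n = read(fd, entry, sizeof(*entry));
	close(fd);

	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

Walking the guest RAM range page by page with pm_read_entry() before and after registration, and counting how many entries flip to present, would give a first rough number for how much memory gets faulted in.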


On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:

Well, I have the is_dup_page() commented out...when RDMA is
activated.

Is there something else in QEMU that could be touching the page that
I don't know about?

- Michael


On 04/05/2013 05:03 PM, Roland Dreier wrote:

On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
mrhi...@linux.vnet.ibm.com wrote:

Sorry, I was wrong. Ignore the comments about cgroups. That's still broken.
(i.e. trying to register RDMA memory while using a cgroup swap limit causes
the process to get killed).

But the GIFT flag patch works (my understanding is that the GIFT flag allows
the adapter to transmit stale memory information; it does not have anything
to do with cgroups specifically).

The point of the GIFT patch is to avoid triggering copy-on-write so
that memory doesn't blow up during migration.  If that doesn't work
then there's no point to the patch.

  - R.





RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Weiny, Ira
 -Original Message-
 From: Hefty, Sean
 Sent: Tuesday, April 09, 2013 6:30 PM
 To: Weiny, Ira; Jeff Squyres (jsquyres)
 Cc: Hal Rosenstock; Roland Dreier; linux-rdma@vger.kernel.org; Upinder
 Malhi (umalhi)
 Subject: RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
 
  If the IBTA were to release new MTU enumerations, which values would
  you recommend then?
 
 I don't think there's a great solution here.  We're mixing IBTA encoded values
 with non-IBTA values.  We could reserve the 6-bit encoded values for IB, and
 use direct values for others (or at least jump beyond the 6-bit range).  Or we
 can stop matching new IBTA MTU encodings (e.g. IB_MTU_1500 = 6).  Or we
 go back in time and make mtu an int.
 

I thought reserving the 6 bits for IB and allowing the enum values to match 
the MTU was a pretty good compromise, especially since PathRecord is defined 
in sa.h, which is provided by libibverbs.  That allows the IB MTU enum to 
be used there.
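
A minimal sketch of that compromise, assuming values 0 through 63 stay reserved for IBTA 6-bit encodings and anything above that range is a direct byte count. The sk_ names and the helper are hypothetical, made up for illustration; they are not taken from the patch or from ib_verbs.h.

```c
/*
 * Hypothetical layout: 0..63 is reserved for IBTA-encoded MTUs,
 * larger values are the MTU in bytes.
 */
enum sk_mtu {
	SK_IB_MTU_256    = 1,
	SK_IB_MTU_512    = 2,
	SK_IB_MTU_1024   = 3,
	SK_IB_MTU_2048   = 4,
	SK_IB_MTU_4096   = 5,
	SK_RDMA_MTU_1500 = 1500,
	SK_RDMA_MTU_9000 = 9000
};

/* Largest value representable in the 6-bit IBTA field. */
#define SK_IBTA_MTU_ENC_MAX 0x3f

/*
 * Convert an enum value to an MTU in bytes.
 * Returns -1 for values inside the reserved 6-bit range that have
 * no known IBTA meaning (so a future IBTA encoding is not misread).
 */
int sk_mtu_to_bytes(int mtu)
{
	/* Known IBTA encodings: 1 -> 256, 2 -> 512, ..., 5 -> 4096. */
	if (mtu >= SK_IB_MTU_256 && mtu <= SK_IB_MTU_4096)
		return 128 << mtu;

	/* Beyond the 6-bit range the value is the byte count itself. */
	if (mtu > SK_IBTA_MTU_ENC_MAX)
		return mtu;

	return -1;
}
```

With this layout, sk_mtu_to_bytes(SK_IB_MTU_2048) gives 2048 and sk_mtu_to_bytes(SK_RDMA_MTU_1500) gives 1500, while an encoding in the reserved range that is not defined yet (e.g. 6) is rejected rather than guessed.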

OTOH, now that we have moved toward decent defines in the libibumad library, 
we could define the MTU enum there.  But then we again go down the path of 
defining things in multiple places and confusing the users...  :-(

As an aside, I like the use of RDMA_MTU_* for these values, again to 
distinguish them from the IBTA values.  But I know that is poor form.

Ira



Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin

On Tue, Apr 09, 2013 at 09:26:59PM -0400, Michael R. Hines wrote:
 With respect, I'm going to offload testing this patch back to the author =)
 because I'm trying to address all of Paolo's other minor issues
 with the RDMA patch before we can merge.

Fair enough, this likely means it won't happen anytime soon though.

 Since dynamic page registration (as you requested) is now fully
 implemented, this patch is less urgent since we now have a
 mechanism in place to avoid page pinning on both sides of the migration.
 
 - Michael
 

Which mechanism do you refer to? Your patches still seem to pin
each page in guest memory at some point, which will break all
COW. In particular, any pagemap tricks to detect duplicates
on the source that I suggested won't work.

 On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
 presumably is_dup_page reads the page, so should not break COW ...
 
 I'm not sure about the cgroups swap limit - you might have
 too many non COW pages so attempting to fault them all in
 makes you exceed the limit. You really should look at
 what is going on in the pagemap, to see if there's
 measurable gain from the patch.
 
 
 On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
 Well, I have the is_dup_page() commented out...when RDMA is
 activated.
 
 Is there something else in QEMU that could be touching the page that
 I don't know about?
 
 - Michael
 
 
 On 04/05/2013 05:03 PM, Roland Dreier wrote:
 On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
 mrhi...@linux.vnet.ibm.com wrote:
 Sorry, I was wrong. Ignore the comments about cgroups. That's still
 broken.
 (i.e. trying to register RDMA memory while using a cgroup swap limit causes
 the process to get killed).
 
 But the GIFT flag patch works (my understanding is that the GIFT flag allows
 the adapter to transmit stale memory information; it does not have anything
 to do with cgroups specifically).
 The point of the GIFT patch is to avoid triggering copy-on-write so
 that memory doesn't blow up during migration.  If that doesn't work
 then there's no point to the patch.
 
   - R.
 