IB/iSER major problems with Linux 3.0 and Solaris targets

2012-01-11 Thread Sebastian Riemer
Hi list, hi Or,

at the moment we use our kernel 3.0 ported OFA kernel modules from
OFED-1.5.4 to make iSER work together with our kernel 3.0 ported
open-iscsi 2.0.869 and Solaris 11 COMSTAR targets.

Now, we've got the problem that ib_iser from in there ignores its
debug_level parameter. /sys/module/ib_iser/parameters/debug_level
shows 0 but it spams the whole kernel log with debug messages. Seems
to be a code bug.

We've also tested a 3.0.15 mainline kernel with in-tree IB modules
together with the OFED-1.5.4 user-space and this has much better IPoIB
performance than the kernel stuff from OFED. So, we want to use them
instead, but there is the same problem with the iSER debug messages and
iSER doesn't work together with our Solaris 11 COMSTAR targets.

We've tested this with open-iscsi (2.0.872, commit
4323e342d2c9fb8ed7233ce855001c189ec55b23) user-space.
TCP is O.K. but iSER reports an error while login attempt:
iscsiadm: initiator reported error (11 - iSCSI PDU timed out)

The sent PDUs from iscsid debugging are the same, but there is an IO
page fault in the kernel log. I've attached the relevant part and the
iscsid log.

This looks interesting:
iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57

Or, could you please investigate/explain?

It is a pain that we need both: working iSER and IPoIB traffic with good
performance.

Cheers,

Sebastian


On 19/12/11 10:14, Sebastian Riemer wrote:
 Hi list,

 I've already sent this to the open-iscsi mailing list, but I guess
 this is more relevant for linux-rdma.

 Finally I've got IB/iSER running on Debian Squeeze with Linux kernel 3.0
 smoothly.

 The problem was that we did not have the suitable OFED for our kernel
 and we did not use the open-iscsi from OFED. Kernel 3.0 is supported
 since OFED-1.5.4 from 2011-12-05.

 So, I've taken the 1.5.2-based stuff from Debian/Experimental and I've
 updated it to 1.5.4 from OFA. Then, I've noticed that Debian doesn't
 build ib_iser in the OFA kernel source and that they don't build the
 open-iscsi kernel/user-space code - I made it do so.

 The next problem was that open-iscsi kernel code in OFED-1.5.4 is for =
 2.6.32 based RedHat distributions. I had to port the source from 2.6.30
 to 3.0 due to kernel API changes. OFA even forgot libiscsi_tcp.[ch] in
 OFED-1.5.4. So, I had to import it from 2.6.30 mainline.
 I did so, because we wanted to compare TCP and iSER speed over
 InfiniBand. Our Solaris COMSTAR targets provide both.

 After fixing the kernel, there was still a problem in the open-iscsi
 2.0.869 user-space from OFED. Some sysfs magic has changed - so that the
 iSCSI host number couldn't be found.

 After fixing that, it worked for me.

 Cheers,

 Sebastian
   

Jan 11 12:53:25 pserver214 kernel: [  716.518372] SCSI subsystem initialized
Jan 11 12:53:25 pserver214 kernel: [  716.521146] Loading iSCSI transport class v2.0-870.
Jan 11 12:53:25 pserver214 kernel: [  716.528756] iscsi: registered transport (tcp)
Jan 11 12:53:30 pserver214 kernel: [  721.903544] iscsi: registered transport (iser)
Jan 11 12:54:46 pserver214 kernel: [  797.537439] iser: iser_connect:connecting to: 10.1.24.204, port 0xbc0c
Jan 11 12:54:46 pserver214 kernel: [  797.563158] iser: iser_cma_handler:event 0 status 0 conn 880807b17a80 id 880807594400
Jan 11 12:54:46 pserver214 kernel: [  797.566402] iser: iser_cma_handler:event 2 status 0 conn 880807b17a80 id 880807594400
Jan 11 12:54:46 pserver214 kernel: [  797.579704] iser: iser_create_ib_conn_res:setting conn 880807b17a80 cma_id 880807594400: fmr_pool 88082426b400 qp 8807ed22aa00
Jan 11 12:54:46 pserver214 kernel: [  797.586557] iser: iser_cma_handler:event 9 status 0 conn 880807b17a80 id 880807594400
Jan 11 12:54:46 pserver214 kernel: [  797.787932] iser: iscsi_iser_ep_poll:ib conn 880807b17a80 rc = 1
Jan 11 12:54:46 pserver214 kernel: [  797.788137] scsi0 : iSCSI Initiator over iSER, v.0.1
Jan 11 12:54:46 pserver214 kernel: [  797.794249] iser: iscsi_iser_conn_bind:binding iscsi/iser conn 8808058deab8 8808058decc8 to ib_conn 880807b17a80
Jan 11 12:54:46 pserver214 kernel: [  797.794710] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488000 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.794919] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488200 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.794998] iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57
Jan 11 12:54:46 pserver214 kernel: [  797.795006]  connection1:0: detected conn error (1011)
Jan 11 12:54:46 pserver214 kernel: [  797.795338] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488100 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.795535] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488040 flags=0x0050]
Jan 11 12:54:46 pserver214 kernel: [  797.795730] AMD-Vi: 

Re: Send with immediate data completion

2012-01-11 Thread Atchley, Scott
On Jan 3, 2012, at 12:35 PM, Atchley, Scott wrote:

 On Jan 3, 2012, at 11:55 AM, Hefty, Sean wrote:
 
 I have a question about a completion for a send with immediate data. The IB
 spec (1.2.1) only mentions that the WC's immediate data be present at the
 receiver. It is silent on the value on the sender at completion. It does say
 that it is only valid if the WC's immediate data indicator is set.
 
 Can you provide a section reference to the spec on the areas that you're 
 looking at?  Looking quickly, section 11.4.2.1 reads like immediate data 
 should be available in either case.
 
 I've never checked imm data on the send wc.  I'm just trying to determine if 
 there's an issue in the spec that should be addressed, or if this is simply 
 a bug in the hca/driver.
 
 There is the definition in the glossary:
 
 Immediate Data
 
 Data contained in a Work Queue Element that is sent along with the payload to 
 the remote Channel Adapter and placed in a Receive Work Completion.
 
 Section 3.7.4 Transport Layer:
 
 The Immediate Data (IMMDT) field is optionally present in RDMA WRITE and SEND 
 messages. It contains data that the consumer placed in the Send or RDMA Write 
 request and the receiving QP will place that value in the current receive 
 WQE. An RDMA Write with immediate data will consume a receive WQE even though 
 the QP did not place any data into the receive buffer since the IMMDT is 
 placed in a CQE that references the receive WQE and indicates that the WQE 
 has completed.
 
 Section 11.4.1.1 Post Send Request has:
 
 Immediate Data Indicator. This is set if Immediate Data is to
 be included in the outgoing request. Valid only for Send or
 Write RDMA operations.
 
 4-byte Immediate Data. Valid only for Send or Write RDMA operations.
 
 11.4.2.1 Poll for Completion
 
 Immediate data indicator. This is set if immediate data is present.
 
 4-byte immediate data.
 
 
 None specifically mention the sender's completion event.

Sean,

Any thoughts?

Personally, I would like to have it in the send completion, but it might not be 
possible for all drivers to implement. If not, then the spec should be 
clarified.

Scott--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Upstream support for multicast IBoE

2012-01-11 Thread Shawn Bohrer
On Wed, Nov 23, 2011 at 11:17:02PM +0200, Or Gerlitz wrote:
 Roland Dreier rol...@purestorage.com wrote:
  looks like the libmlx4 patches I applied from you are missing multicast AH 
  support?
 
 Yes, when I reworked the patches for submission I went from easier
 (RC) to harder (ucast UD + Vlans) and eventually didn't nailed down
 the multicast front. I probably planned getting there on a later point
 but it remained behind, anyway, if you're okay to have the ad hoc
 MGID - ethernet address mapping as you commented on this thread, I
 can make the patch to add that.

Is there any estimate on when we might see something like this
upstream?

Thanks,
Shawn


---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] opensm: Get correct guid in case of multiple ports

2012-01-11 Thread Goldwyn Rodrigues

Hi Alex,

Let me start with how we encountered the problem:
This problem came up when our customer was using a 2 port card with only
one of the port active. opensm could not get the guid of the port
that was active in daemon mode.

 On 13:36 Wed 05 Oct , Goldwyn Rodrigues wrote:
  
  In case of multiple ports and running in daemon mode, the active port is 
  not selected because opt.guid is set to INVALID_GUID in main() but the 
  check in get_port_guid is done against zero: 
  if (port_guid == 0) {
 
 opt.guid is set to 0 by default.
 opt.guid is set to INVALID_GUID if a user used -g WRONG_GUID command line
 option when executing the SM. 

What happens when -g 0 -B is specified? Check the getopt code. It sets guid
to INVALID_GUID. Consider /etc/sysconfig/opensm  as well.

What happens when you provide -g WRONG_GUID -B?
I think in this case, -B should take priority and set with the first
active port available.


 In that case, when SM runs not in daemon mode,
 SM prompts the user to choose available port GUID out of available range.
 In case when SM runs in daemon mode, it can't prompt the user so it just 
 exits.
 
  
  On second thoughts, passing port_guid is worthless because this function is 
  called only when no guid is supplied at the command prompt. So, removed the 
  port_guid parameter from the function altogether.
  
  If not in daemon mode, it would show the list of ports as intended.
  
  Also added error message if no ports are found.
  
  Signed-off-by: Goldwyn Rodrigues rgold...@suse.de
  
  diff --git a/opensm/main.c b/opensm/main.c
  index 51c8291..a236859 100644
  --- a/opensm/main.c
  +++ b/opensm/main.c
  @@ -403,7 +403,7 @@ static void show_usage(void)
  exit(2);
   }
   
  -static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t 
  port_guid)
  +static ib_net64_t get_port_guid(IN osm_opensm_t *p_osm)
   {
  ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS];
  uint32_t num_ports = MAX_LOCAL_IBPORTS;
  @@ -436,21 +436,19 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * 
  p_osm, uint64_t port_guid)
 cl_hton64(attr_array[0].port_guid));
  return attr_array[0].port_guid;
  }
  -   /* If port_guid is 0 - use the first connected port */
  -   if (port_guid == 0) {
  +   /* If in daemon mode autoselect first available port */
  +   if (p_osm-subn.opt.daemon) {
  for (i = 0; i  num_ports; i++)
  if (attr_array[i].link_state  IB_LINK_DOWN)
  break;
  +   /* No port found which is available */
  if (i == num_ports)
  -   i = 0;
  +   return 0;
  printf(Using default GUID 0x% PRIx64 \n,
 cl_hton64(attr_array[i].port_guid));
  return attr_array[i].port_guid;
  }
   
  -   if (p_osm-subn.opt.daemon)
  -   return 0;
  -
  /* More than one possible port - list all ports and let the user
   * to choose. */
  while (1) {
  @@ -1106,10 +1104,12 @@ int main(int argc, char *argv[])
 then get a port GUID value with which to bind.
   */
  if (opt.guid == 0 || cl_hton64(opt.guid) == CL_HTON64(INVALID_GUID))
  -   opt.guid = get_port_guid(osm, opt.guid);
  +   opt.guid = get_port_guid(osm);
   
  -   if (opt.guid == 0)
  +   if (opt.guid == 0) {
  +   printf(\nError: No available ports\n);
  goto Exit;
  +   }
   
  status = osm_opensm_bind(osm, opt.guid);
  if (status != IB_SUCCESS) {
  
  -- 
  Goldwyn
  --
  To unsubscribe from this list: send the line unsubscribe linux-rdma in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 -- 
 
 -- Alex

-- 
Goldwyn
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Send with immediate data completion

2012-01-11 Thread Hefty, Sean
 Any thoughts?
 
 Personally, I would like to have it in the send completion, but it might not
 be possible for all drivers to implement. If not, then the spec should be
 clarified.

From my interpretation of the spec, it looks like a bug and the immediate data 
should be available in the send wc, since it's not explicitly excluded.  I'll 
verify that this is the intent with the IBTA.

I haven't traced through the library and driver code to see if mlx4 driver can 
report it.  I'll see if I can do this within the next couple of days.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-3.3 0/6] IB: set of fixes and additions

2012-01-11 Thread Or Gerlitz
The series is made of four elements

This subset (patches 1-3) fixes two bugs port info related bugs, where
the first patch is just making following code related to speed possible...

[PATCH 1/6] IB: use central enum for speed instead of hard-coded values
[PATCH 2/6] IB/core: fix wrong display of rate in sysfs
[PATCH 3/6] IB/mlx4: fix wrong info returned when querying IBoE ports

This patch comes to address the long pending patch for 4k mtu
along the lines of what was requested over the last review

[PATCH 4/6] net/mlx4: allow for dynamic mtu configuration for IB ports

This patch is simple and comes to unify the approach for checksum
indication with possible / hopefully soon changes in libibverbs

[PATCH 5/6] IB: change CQ TCP/IP checksum offload mark to use bit flag 
indication instead of integer

This patch extends the IPoIB child scheme, and will serve as
a building block when using IPoIB in the Virtualization space.

[PATCH 6/6] IB/ipoib: add support for clones / multiple childs on the same 
partition

Roland, Yep, this submission comes a bit into the 3.3 merge cycle, but
with its fairly limited scope, I hope it can make it there... as of the recent
changes in mlx4_core, patches 3  4 need you to pull Linus tree, but this should
be done now anyway for any future patches to the IB stack through out the 3.3 
cycle
and elsewhere


Or Gerlitz (6):
  IB: use central enum for speed instead of hard-coded values
  IB/core: fix wrong display of rate in sysfs
  IB/mlx4: fix wrong info returned when querying IBoE ports
  net/mlx4: allow for dynamic mtu configuration for IB ports
  IB: change CQ TCP/IP checksum offload mark to use bit flag indication instead 
of integer
  IB/ipoib: add support for clones / multiple childs on the same partition

 Documentation/infiniband/ipoib.txt   |   24 +
 drivers/infiniband/core/sysfs.c  |   25 --
 drivers/infiniband/hw/amso1100/c2_provider.c |2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +-
 drivers/infiniband/hw/cxgb4/provider.c   |2 +-
 drivers/infiniband/hw/ehca/ehca_hca.c|2 +-
 drivers/infiniband/hw/mlx4/cq.c  |3 +-
 drivers/infiniband/hw/mlx4/main.c|  107 +---
 drivers/infiniband/hw/mthca/mthca_cq.c   |3 +-
 drivers/infiniband/hw/nes/nes_verbs.c|2 +-
 drivers/infiniband/hw/qib/qib_rc.c   |1 -
 drivers/infiniband/hw/qib/qib_uc.c   |1 -
 drivers/infiniband/ulp/ipoib/ipoib.h |9 ++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c  |3 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c|   48 ---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c|   40 ++---
 drivers/net/ethernet/mellanox/mlx4/main.c|  118 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h|2 +
 drivers/net/ethernet/mellanox/mlx4/port.c|   25 +-
 include/linux/mlx4/device.h  |1 +
 include/rdma/ib_verbs.h  |   11 ++-
 21 files changed, 330 insertions(+), 101 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] IB: use central enum for speed instead of hard-coded values

2012-01-11 Thread Or Gerlitz
The kernel IB stack uses one enumeration for IB speed, which wasn't
explicitly specified in the verbs header file. Add that enum, and
use it all over the code. Note that the IB speed/width notation is
also used by iWARP and IBoE hw drivers who apply the convention of
rate = speed X width, to advertize their port link rate.

Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/core/sysfs.c  |   15 +--
 drivers/infiniband/hw/amso1100/c2_provider.c |2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +-
 drivers/infiniband/hw/cxgb4/provider.c   |2 +-
 drivers/infiniband/hw/ehca/ehca_hca.c|2 +-
 drivers/infiniband/hw/mlx4/main.c|   10 +-
 drivers/infiniband/hw/nes/nes_verbs.c|2 +-
 include/rdma/ib_verbs.h  |9 +
 8 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index c61bca3..9ce70ca 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -189,21 +189,24 @@ static ssize_t rate_show(struct ib_port *p, struct 
port_attribute *unused,
rate = (25 * attr.active_speed) / 10;

switch (attr.active_speed) {
-   case 2:
+   case IB_SPEED_SDR:
+   speed =  SDR;
+   break;
+   case IB_SPEED_DDR:
speed =  DDR;
break;
-   case 4:
+   case IB_SPEED_QDR:
speed =  QDR;
break;
-   case 8:
+   case IB_SPEED_FDR10:
speed =  FDR10;
rate = 10;
break;
-   case 16:
+   case IB_SPEED_FDR:
speed =  FDR;
rate = 14;
break;
-   case 32:
+   case IB_SPEED_EDR:
speed =  EDR;
rate = 25;
break;
@@ -214,7 +217,7 @@ static ssize_t rate_show(struct ib_port *p, struct 
port_attribute *unused,
return -EINVAL;

return sprintf(buf, %d%s Gb/sec (%dX%s)\n,
-  rate, (attr.active_speed == 1) ? .5 : ,
+  rate, (attr.active_speed == IB_SPEED_SDR) ? .5 : ,
   ib_width_enum_to_int(attr.active_width), speed);
 }

diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c 
b/drivers/infiniband/hw/amso1100/c2_provider.c
index 12f923d..07eb3a8 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -94,7 +94,7 @@ static int c2_query_port(struct ib_device *ibdev,
props-pkey_tbl_len = 1;
props-qkey_viol_cntr = 0;
props-active_width = 1;
-   props-active_speed = 1;
+   props-active_speed = IB_SPEED_SDR;

return 0;
 }
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 37c224f..0bdf09a 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1227,7 +1227,7 @@ static int iwch_query_port(struct ib_device *ibdev,
props-gid_tbl_len = 1;
props-pkey_tbl_len = 1;
props-active_width = 2;
-   props-active_speed = 2;
+   props-active_speed = IB_SPEED_DDR;
props-max_msg_sz = -1;

return 0;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c 
b/drivers/infiniband/hw/cxgb4/provider.c
index 247fe70..be1c18f 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -329,7 +329,7 @@ static int c4iw_query_port(struct ib_device *ibdev, u8 port,
props-gid_tbl_len = 1;
props-pkey_tbl_len = 1;
props-active_width = 2;
-   props-active_speed = 2;
+   props-active_speed = IB_SPEED_DDR;
props-max_msg_sz = -1;

return 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c 
b/drivers/infiniband/hw/ehca/ehca_hca.c
index 73edc36..9ed4d25 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -233,7 +233,7 @@ int ehca_query_port(struct ib_device *ibdev,
props-phys_state  = 5;
props-state   = rblock-state;
props-active_width= IB_WIDTH_12X;
-   props-active_speed= 0x1;
+   props-active_speed= IB_SPEED_SDR;
}

 query_port1:
diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 7b445df..6ff6bdf 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -215,16 +215,16 @@ static int ib_link_query_port(struct ib_device *ibdev, u8 
port,

switch (ext_active_speed) {
case 1:
-   props-active_speed = 16; /* FDR */
+   props-active_speed = IB_SPEED_FDR;
break;
case 2:
-   props-active_speed = 32; /* EDR */
+

[PATCH 2/6] IB/core: fix wrong display of rate in sysfs

2012-01-11 Thread Or Gerlitz
commit 71eeba16 IB: Add new InfiniBand link speeds introduced a bug
under which the rate for IB SDR/4X links was displayed as 8.5Gbs
instead of 10Gbs, fix that.

Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/core/sysfs.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index 9ce70ca..0b3edc0 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -213,12 +213,18 @@ static ssize_t rate_show(struct ib_port *p, struct 
port_attribute *unused,
}

rate *= ib_width_enum_to_int(attr.active_width);
+
if (rate  0)
return -EINVAL;

-   return sprintf(buf, %d%s Gb/sec (%dX%s)\n,
-  rate, (attr.active_speed == IB_SPEED_SDR) ? .5 : ,
-  ib_width_enum_to_int(attr.active_width), speed);
+   if (attr.active_speed == IB_SPEED_SDR) {
+   rate = (25 * ib_width_enum_to_int(attr.active_width)) / 10;
+   return sprintf(buf, %d%s Gb/sec (%dX%s)\n, rate,
+  (attr.active_width == IB_WIDTH_1X) ? .5 : ,
+   ib_width_enum_to_int(attr.active_width), speed);
+   } else
+   return sprintf(buf, %d Gb/sec (%dX%s)\n,
+  rate, ib_width_enum_to_int(attr.active_width), speed);
 }

 static ssize_t phys_state_show(struct ib_port *p, struct port_attribute 
*unused,
-- 
1.7.1


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] IB/mlx4: fix wrong info returned when querying IBoE ports

2012-01-11 Thread Or Gerlitz
To issue port query, use the QUERY_(Ethernet)_PORT command instead of
MAD_IFC command as the latter attempts to query the firmware IB SMA agent,
which is irrelevant for IBoE ports. This allows to support both 10Gb/s and
40Gb/s rates (e.g in sysfs), using QDR speed (10Gb/s) and width of 1X or 4X.

Signed-off-by: Dotan Barak dot...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/main.c |   97 -
 1 files changed, 53 insertions(+), 44 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 6ff6bdf..85f6048 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -182,12 +182,27 @@ mlx4_ib_port_link_layer(struct ib_device *device, u8 
port_num)
 }

 static int ib_link_query_port(struct ib_device *ibdev, u8 port,
- struct ib_port_attr *props,
- struct ib_smp *in_mad,
- struct ib_smp *out_mad)
+ struct ib_port_attr *props)
 {
+   struct ib_smp *in_mad  = NULL;
+   struct ib_smp *out_mad = NULL;
int ext_active_speed;
-   int err;
+   int err = -ENOMEM;
+
+   in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
+   out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
+   if (!in_mad || !out_mad)
+   goto out;
+
+   init_query_mad(in_mad);
+   in_mad-attr_id  = IB_SMP_ATTR_PORT_INFO;
+   in_mad-attr_mod = cpu_to_be32(port);
+
+   err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL,
+   in_mad, out_mad);
+   if (err)
+   goto out;
+

props-lid  = be16_to_cpup((__be16 *) (out_mad-data + 16));
props-lmc  = out_mad-data[34]  0x7;
@@ -234,15 +249,17 @@ static int ib_link_query_port(struct ib_device *ibdev, u8 
port,
err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port,
   NULL, NULL, in_mad, out_mad);
if (err)
-   return err;
+   goto out;

/* Checking LinkSpeedActive for FDR-10 */
if (out_mad-data[15]  0x1)
props-active_speed = IB_SPEED_FDR10;
}
}
-
-   return 0;
+out:
+   kfree(in_mad);
+   kfree(out_mad);
+   return err;
 }

 static u8 state_to_phys_state(enum ib_port_state state)
@@ -251,32 +268,42 @@ static u8 state_to_phys_state(enum ib_port_state state)
 }

 static int eth_link_query_port(struct ib_device *ibdev, u8 port,
-  struct ib_port_attr *props,
-  struct ib_smp *out_mad)
+  struct ib_port_attr *props)
 {
-   struct mlx4_ib_iboe *iboe = to_mdev(ibdev)-iboe;
+
+   struct mlx4_ib_dev *mdev = to_mdev(ibdev);
+   struct mlx4_ib_iboe *iboe = mdev-iboe;
struct net_device *ndev;
enum ib_mtu tmp;
+   struct mlx4_cmd_mailbox *mailbox;
+   int err = 0;
+
+   mailbox = mlx4_alloc_cmd_mailbox(mdev-dev);
+   if (IS_ERR(mailbox))
+   return PTR_ERR(mailbox);
+
+   err = mlx4_cmd_box(mdev-dev, 0, mailbox-dma, port, 0,
+  MLX4_CMD_QUERY_PORT, MLX4_CMD_TIME_CLASS_B,
+  MLX4_CMD_WRAPPED);
+   if (err)
+   goto out;

-   props-active_width = IB_WIDTH_1X;
+   props-active_width =  (((u8 *)mailbox-buf)[5] == 0x40) ?
+   IB_WIDTH_4X : IB_WIDTH_1X;
props-active_speed = IB_SPEED_QDR;
props-port_cap_flags   = IB_PORT_CM_SUP;
-   props-gid_tbl_len  = to_mdev(ibdev)-dev-caps.gid_table_len[port];
-   props-max_msg_sz   = to_mdev(ibdev)-dev-caps.max_msg_sz;
+   props-gid_tbl_len  = mdev-dev-caps.gid_table_len[port];
+   props-max_msg_sz   = mdev-dev-caps.max_msg_sz;
props-pkey_tbl_len = 1;
-   props-bad_pkey_cntr= be16_to_cpup((__be16 *) (out_mad-data + 46));
-   props-qkey_viol_cntr   = be16_to_cpup((__be16 *) (out_mad-data + 48));
props-max_mtu  = IB_MTU_4096;
-   props-subnet_timeout   = 0;
-   props-max_vl_num   = out_mad-data[37]  4;
-   props-init_type_reply  = 0;
+   props-max_vl_num   = 2;
props-state= IB_PORT_DOWN;
props-phys_state   = state_to_phys_state(props-state);
props-active_mtu   = IB_MTU_256;
spin_lock(iboe-lock);
ndev = iboe-netdevs[port - 1];
if (!ndev)
-   goto out;
+   goto out_unlock;

tmp = iboe_get_mtu(ndev-mtu);
props-active_mtu = tmp ? min(props-max_mtu, tmp) : IB_MTU_256;
@@ -284,41 +311,23 @@ static int eth_link_query_port(struct ib_device *ibdev, 
u8 port,
  

[PATCH 4/6] net/mlx4: allow for dynamic mtu configuration for IB ports

2012-01-11 Thread Or Gerlitz
Set the mtu for IB ports by the driver instead of using the firmware
2KB default, with the default being 4KB. Allow for dynamic mtu
configuration through a new, per port sysfs entry.

As there's a dependency between the port mtu to the maximal number
of HW VLs the port can support, apply a mim/max approach, using a
loop that goes down from the highest possible number of VLs to
the lowest. Use the firmware return status as an indication for
the requested number of VLs being impossible with that mtu.

For now, and as done with the dynamic link type change / VPI support,
the sysfs entry to change the mtu is exposed only when running in
NON SRIOV mode. To allow changing the mtu for the master in SRIOV
mode, PF initiated FLR (Function Level Reset) has to be implemented.

Signed-off-by: Or Gerlitz ogerl...@mellanox.com

---

Roland, on v1 - http://marc.info/?l=linux-rdmam=130636143927387w=2 you made a 
comment
saying I mean set the MTU port-by-port with the module loaded, the same way we 
are
supposed to be able to do for the port type.  Rather than having one global 
module
parameter, see thread http://marc.info/?l=linux-rdmam=130632548008337w=2, 
this
posting addresses that comment

changes from v1
  - added ib_port_mtu field to mlx4 device ports and use it on SET_PORT
  - removed module param
  - added sysfs entries to set and show the port IB mtu
  - rebased the patch against Linus tree which has the SRIOV patches

Also, some code (enum ibta_mtu and ibta_mtu_to_int) could be eliminated
if we allow for the mlx4_core driver to #include rdma/ib_verbs.h
basically I don't see a concrete problem arising from such inclusion.

 drivers/net/ethernet/mellanox/mlx4/main.c |  118 -
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |2 +
 drivers/net/ethernet/mellanox/mlx4/port.c |   25 ++-
 include/linux/mlx4/device.h   |1 +
 4 files changed, 141 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 6bb62c5..7a58613 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -575,7 +575,6 @@ static ssize_t show_port_type(struct device *dev,

return strlen(buf);
 }
-
 static ssize_t set_port_type(struct device *dev,
 struct device_attribute *attr,
 const char *buf, size_t count)
@@ -648,6 +647,99 @@ out:
return err ? err : count;
 }

+enum ibta_mtu {
+   IB_MTU_256  = 1,
+   IB_MTU_512  = 2,
+   IB_MTU_1024 = 3,
+   IB_MTU_2048 = 4,
+   IB_MTU_4096 = 5
+};
+
+static inline int int_to_ibta_mtu(int mtu)
+{
+   switch (mtu) {
+   case 256:  return IB_MTU_256;
+   case 512:  return IB_MTU_512;
+   case 1024: return IB_MTU_1024;
+   case 2048: return IB_MTU_2048;
+   case 4096: return IB_MTU_4096;
+   default: return -1;
+   }
+}
+
+static inline int ibta_mtu_to_int(enum ibta_mtu mtu)
+{
+   switch (mtu) {
+   case IB_MTU_256:  return  256;
+   case IB_MTU_512:  return  512;
+   case IB_MTU_1024: return 1024;
+   case IB_MTU_2048: return 2048;
+   case IB_MTU_4096: return 4096;
+   default: return -1;
+   }
+}
+
+static ssize_t show_port_ib_mtu(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   struct mlx4_port_info *info = container_of(attr, struct mlx4_port_info,
+  port_mtu_attr);
+   struct mlx4_dev *mdev = info-dev;
+
+   if (mdev-caps.port_type[info-port] == MLX4_PORT_TYPE_ETH)
+   mlx4_warn(mdev, port level mtu is only used for IB ports\n);
+
+   sprintf(buf, %d\n,
+   ibta_mtu_to_int(mdev-caps.port_ib_mtu[info-port]));
+   return strlen(buf);
+}
+
+static ssize_t set_port_ib_mtu(struct device *dev,
+struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct mlx4_port_info *info = container_of(attr, struct mlx4_port_info,
+  port_mtu_attr);
+   struct mlx4_dev *mdev = info-dev;
+   struct mlx4_priv *priv = mlx4_priv(mdev);
+   int err, port, mtu, ibta_mtu = -1;
+
+   if (mdev-caps.port_type[info-port] == MLX4_PORT_TYPE_ETH) {
+   mlx4_warn(mdev, port level mtu is only used for IB ports\n);
+   return -EINVAL;
+   }
+
+   err = sscanf(buf, %d, mtu);
+   if (err  0)
+   ibta_mtu = int_to_ibta_mtu(mtu);
+
+   if (err = 0 || ibta_mtu  0) {
+   mlx4_err(mdev, %s is invalid IBTA mtu\n, buf);
+   return -EINVAL;
+   }
+
+   mdev-caps.port_ib_mtu[info-port] = ibta_mtu;
+
+   mlx4_stop_sense(mdev);
+   mutex_lock(priv-port_mutex);
+   mlx4_unregister_device(mdev);
+   for (port = 1; port = 

[PATCH 5/6] IB: change CQ TCP/IP checksum offload mark to use bit flag indication instead of integer

2012-01-11 Thread Or Gerlitz

Use a bit in wc_flags rather then a whole integer, this change for itself
doesn't reduce the size of struct ib_wc on 64bit machines, so it stays on 56
bytes as of padding. However, it will allow to add more fields in the future w.o
enlarging that structure. Also, it will serve to have unified approach with
future libibverbs checksum offload reporting towards applications, where
adding a bit flag doesn't break the library ABI. This patch was during
conversation with Liran Liss lir...@mellanox.com

Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/cq.c |3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c  |3 ++-
 drivers/infiniband/hw/qib/qib_rc.c  |1 -
 drivers/infiniband/hw/qib/qib_uc.c  |1 -
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |3 ++-
 include/rdma/ib_verbs.h |2 +-
 6 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 5ecf38d..09eb1d1 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -720,7 +720,8 @@ repoll:
wc-dlid_path_bits = (g_mlpath_rqpn  24)  0x7f;
wc-wc_flags  |= g_mlpath_rqpn  0x8000 ? IB_WC_GRH : 0;
wc-pkey_index = be32_to_cpu(cqe-immed_rss_invalid)  0x7f;
-   wc-csum_ok= mlx4_ib_ipoib_csum_ok(cqe-status, 
cqe-checksum);
+   wc-wc_flags  |= mlx4_ib_ipoib_csum_ok(cqe-status,
+   cqe-checksum) ? IB_WC_IP_CSUM_OK : 0;
if (rdma_port_get_link_layer(wc-qp-device,
(*cur_qp)-port) == IB_LINK_LAYER_ETHERNET)
wc-sl  = be16_to_cpu(cqe-sl_vid)  13;
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c 
b/drivers/infiniband/hw/mthca/mthca_cq.c
index 53157b8..40ba833 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -643,7 +643,8 @@ static inline int mthca_poll_one(struct mthca_dev *dev,
entry-wc_flags   |= cqe-g_mlpath  0x80 ? IB_WC_GRH : 0;
checksum = (be32_to_cpu(cqe-rqpn)  24) |
((be32_to_cpu(cqe-my_ee)  16)  0xff00);
-   entry-csum_ok = (cqe-sl_ipok  1  checksum == 0x);
+   entry-wc_flags   |=  (cqe-sl_ipok  1  checksum == 0x) ?
+   IB_WC_IP_CSUM_OK : 0;
}

entry-status = IB_WC_SUCCESS;
diff --git a/drivers/infiniband/hw/qib/qib_rc.c 
b/drivers/infiniband/hw/qib/qib_rc.c
index 894afac..765b4cb 100644
--- a/drivers/infiniband/hw/qib/qib_rc.c
+++ b/drivers/infiniband/hw/qib/qib_rc.c
@@ -2048,7 +2048,6 @@ send_last:
wc.pkey_index = 0;
wc.dlid_path_bits = 0;
wc.port_num = 0;
-   wc.csum_ok = 0;
/* Signal completion event if the solicited bit is set. */
qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc,
 (ohdr-bth[0] 
diff --git a/drivers/infiniband/hw/qib/qib_uc.c 
b/drivers/infiniband/hw/qib/qib_uc.c
index 847e7af..7ce2ac2 100644
--- a/drivers/infiniband/hw/qib/qib_uc.c
+++ b/drivers/infiniband/hw/qib/qib_uc.c
@@ -422,7 +422,6 @@ last_imm:
wc.pkey_index = 0;
wc.dlid_path_bits = 0;
wc.port_num = 0;
-   wc.csum_ok = 0;
/* Signal completion event if the solicited bit is set. */
qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc,
 (ohdr-bth[0] 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 4115be5..5c1bc99 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -296,7 +296,8 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, 
struct ib_wc *wc)
dev-stats.rx_bytes += skb-len;

skb-dev = dev;
-   if ((dev-features  NETIF_F_RXCSUM)  likely(wc-csum_ok))
+   if ((dev-features  NETIF_F_RXCSUM) 
+   likely(wc-wc_flags  IB_WC_IP_CSUM_OK))
skb-ip_summed = CHECKSUM_UNNECESSARY;

napi_gro_receive(priv-napi, skb);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 01179b0..31996e3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -518,6 +518,7 @@ enum ib_wc_flags {
IB_WC_GRH   = 1,
IB_WC_WITH_IMM  = (11),
IB_WC_WITH_INVALIDATE   = (12),
+   IB_WC_IP_CSUM_OK= (13),
 };

 struct ib_wc {
@@ -538,7 +539,6 @@ struct ib_wc {
u8  sl;
u8  dlid_path_bits;
u8  port_num;   /* valid only for DR SMPs on 
switches */
-   int csum_ok;
 };

 enum ib_cq_notify_flags {
-- 
1.7.1


--
To unsubscribe from this list: send the line unsubscribe 

[PATCH 6/6] IB/ipoib: add support for clones / multiple childs on the same partition

2012-01-11 Thread Or Gerlitz
Allow creating clone child interfaces which further partition an IPoIB
interface to sub interfaces which either use the same pkey as their parent
or use the same pkey as an already created child interface.

Each child now has a child index, which together with the pkey is
used as the identifier of the created network device.

Child interfaces can still be created/deleted using only the pkey,
and they are referred to as legacy childs.

Non-legacy, clone childs of IPoIB device named ibN are named ibN.pkey.index
and those using the same pkey as the parent ibN device are named ibN.index,
where (0x8001 = pkey = 0x) as before this change, and (0  index = 255),
example device names are:

legacy - ib0, ib1, ib0.8001, ib0.8002

non-legacy (clones) - ib0.1, ib0.2, ib0.8001.1, ib0.8001.2

All sorts of childs are still created/deleted through sysfs, in a similar
manner to the way legacy child interfaces are.

A major use case for clone childs is for virtualization purposes, e.g under
schemes where a per VM NIC / HW queue is desired at the hypervisor level.

Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---

for the case of non-legacy and same pkey childs I wanted to use
a notation of ibN.pkey:index and ibN:index but this is problematic with
tools (e.g ifconfig) who treat devices with colon in their names as aliases
which are restriced, e.g w.r.t counters, etc, any ideas?

 Documentation/infiniband/ipoib.txt|   24 ++
 drivers/infiniband/ulp/ipoib/ipoib.h  |9 -
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   48 +---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c |   40 +++
 4 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/Documentation/infiniband/ipoib.txt 
b/Documentation/infiniband/ipoib.txt
index 64eeb55..2c958f8 100644
--- a/Documentation/infiniband/ipoib.txt
+++ b/Documentation/infiniband/ipoib.txt
@@ -24,6 +24,30 @@ Partitions and P_Keys
   The P_Key for any interface is given by the pkey file, and the
   main interface for a subinterface is in parent.

+Clones
+
+  It is possible to further partition an IPoIB interfaces, and create
+  clone child interfaces which either use the same pkey as their
+  parent, or as an already created child interface. Each child now has
+  a child index, which together with the pkey is used as the identifier
+  of the created network device.
+
+ All sorts of childs are still created/deleted through sysfs, in a
+ similar manner to the way conventional child interfaces are, for example:
+
+echo 0x8001.1  /sys/class/net/ib0/create_child
+
+  will create an interface named ib0.8001.1 with P_Key 0x8001 and index 1
+
+echo .1  /sys/class/net/ib0/create_child
+
+  will create an interface named ib0.1 with same P_Key as ib0 and index 1
+
+  To remove a such subinterface, use the delete_child file:
+
+echo 0x8001.1  /sys/class/net/ib0/create_child
+echo .1   /sys/class/net/ib0/create_child
+
 Datagram vs Connected modes

   The IPoIB driver supports two modes of operation: datagram and
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index b3cc1e0..53e021d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -101,6 +101,8 @@ enum {

MAX_SEND_CQE  = 16,
IPOIB_CM_COPYBREAK= 256,
+
+   MAX_CHILDS_PER_PKEY   = 255,
 };

 #defineIPOIB_OP_RECV   (1ul  31)
@@ -330,6 +332,7 @@ struct ipoib_dev_priv {
struct net_device *parent;
struct list_head child_intfs;
struct list_head list;
+   int child_index;

 #ifdef CONFIG_INFINIBAND_IPOIB_CM
struct ipoib_cm_dev_priv cm;
@@ -488,8 +491,10 @@ void ipoib_transport_dev_cleanup(struct net_device *dev);
 void ipoib_event(struct ib_event_handler *handler,
 struct ib_event *record);

-int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey);
-int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
+int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey,
+   unsigned char clone_index);
+int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey,
+   unsigned char clone_index);

 void ipoib_pkey_poll(struct work_struct *work);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 3514ca0..3a6848d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1110,17 +1110,44 @@ int ipoib_add_umcast_attr(struct net_device *dev)
return device_create_file(dev-dev, dev_attr_umcast);
 }

+static int parse_child(struct device *dev, const char *buf, int *pkey,
+   int *child_index)
+{
+   int ret;
+   struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev));
+
+   *pkey = 

RE: [PATCH 1/6] IB: use central enum for speed instead of hard-coded values

2012-01-11 Thread Hefty, Sean
 +enum ib_port_seed {

typo on enum name (seed - speed)

The rest of the patch looks good to me.

- Sean

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/6] IB/core: fix wrong display of rate in sysfs

2012-01-11 Thread Hefty, Sean
 @@ -213,12 +213,18 @@ static ssize_t rate_show(struct ib_port *p, struct
 port_attribute *unused,
   }
 
   rate *= ib_width_enum_to_int(attr.active_width);
 +
   if (rate  0)
   return -EINVAL;
 
 - return sprintf(buf, %d%s Gb/sec (%dX%s)\n,
 -rate, (attr.active_speed == IB_SPEED_SDR) ? .5 : ,
 -ib_width_enum_to_int(attr.active_width), speed);
 + if (attr.active_speed == IB_SPEED_SDR) {
 + rate = (25 * ib_width_enum_to_int(attr.active_width)) / 10;
 + return sprintf(buf, %d%s Gb/sec (%dX%s)\n, rate,
 +(attr.active_width == IB_WIDTH_1X) ? .5 : ,
 + ib_width_enum_to_int(attr.active_width), speed);
 + } else
 + return sprintf(buf, %d Gb/sec (%dX%s)\n,
 +rate, ib_width_enum_to_int(attr.active_width), speed);
  }

This function ends up like this, with my comments inline:

rate = (25 * attr.active_speed) / 10;
We set rate here, but...

switch (attr.active_speed) {
case 2:
speed =  DDR;
break;
case 4:
speed =  QDR;
break;
case 8:
speed =  FDR10;
rate = 10;
break;
case 16:
speed =  FDR;
rate = 14;
break;
case 32:
speed =  EDR;
rate = 25;
break;
}
The above switch statement changes the value for 3 of the 5 cases, with 1 case 
not covered.

rate *= ib_width_enum_to_int(attr.active_width);
It is modified again here

if (rate  0)
return -EINVAL;

if (attr.active_speed == IB_SPEED_SDR) {
rate = (25 * ib_width_enum_to_int(attr.active_width)) / 10;
And we override it in this case, which is the missing switch case.

return sprintf(buf, %d%s Gb/sec (%dX%s)\n, rate,
   (attr.active_width == IB_WIDTH_1X) ? .5 : ,
ib_width_enum_to_int(attr.active_width), speed);
} else
return sprintf(buf, %d Gb/sec (%dX%s)\n,
   rate, ib_width_enum_to_int(attr.active_width), speed);

How about just setting rate (possibly as a string value) one time in each 
switch case?

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/6] IB/core: fix wrong display of rate in sysfs

2012-01-11 Thread Or Gerlitz
On Wed, Jan 11, 2012 at 9:30 PM, Hefty, Sean sean.he...@intel.com wrote:
 This function ends up like this, with my comments inline:
        rate = (25 * attr.active_speed) / 10;
 We set rate here, but [...] The above switch statement
  changes the value for 3 of the 5 cases, with 1 case not covered.
        rate *= ib_width_enum_to_int(attr.active_width);
 It is modified again here

Yep, this function is surely not programming state of the art..
however, except for SDR, the above line doing rate = rate *
ib_width_enum_to_int(attr.active_width) would hold for all the speeds
and all the widths (1X, 4X, 12X), if we just set rate (possibly as a
string value) one time in each switch case, we would need to repeat
that multiplication there... I'm okay with that, would you recommend
to go that way? Roland any preference?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Upstream support for multicast IBoE

2012-01-11 Thread Or Gerlitz
Shawn Bohrer sboh...@rgmadvisors.com wrote:
 Is there any estimate on when we might see something like this upstream?

Could you elaborate a little on your use case for multicast IBoE
traffic? e.g how the setup looks like and how are your Ethernet
switches act to route that traffic.

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/iSER major problems with Linux 3.0 and Solaris targets

2012-01-11 Thread Or Gerlitz
Sebastian Riemer sebastian.rie...@profitbricks.com wrote:
 [...] we've also tested a 3.0.15 mainline kernel with in-tree IB modules
 together with the OFED-1.5.4 user-space and this has much better IPoIB
 performance than the kernel stuff from OFED. So, we want to use them
 instead, but there is the same problem with the iSER debug messages and
 iSER doesn't work together with our Solaris 11 COMSTAR targets.

On the full part of the glass, its nice to hear the upstream has good
IPoIB performance, on the other part, I'll give 3.0.15 a try tomorrow,
however, the error you're getting iser_drain_tx_cq:tx id
88402391f898 status 4 vend_err 57 means that iser got local
protection error (=4) on the first buffer we used with IB (the
connection handshake buffers belong to the IB CM, not to the ULP)
which is the login request. Sounds like something is broken maybe dma
mapping wise, for this reason I think its likely that the problem
might not hit me on my testbed (but I will try, sure).

What environment you're running in - is that Dom0 for some hypervisor?
if yes, is it KVM which means plain Linux or Xen? it would help if you
send me your (compressed, please) .config  - send to my @mellanox
email. Also, please run with iser debug logs open, set debug_level=2
(lets discuss the extra verbosity later, please) and also the session
debug param for the libiscsi module. Is there a chance you can build
the latest kernel 3.2 for your environment and run with it once to see
if the problem happens there as well?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Send with immediate data completion

2012-01-11 Thread Hefty, Sean
 From my interpretation of the spec, it looks like a bug and the immediate data
 should be available in the send wc, since it's not explicitly excluded.  I'll
 verify that this is the intent with the IBTA.

I'm still waiting on feedback from the IBTA, but they are looking into the 
matter.
 
 I haven't traced through the library and driver code to see if mlx4 driver can
 report it.  I'll see if I can do this within the next couple of days.

The following patch to libmlx4 added the immediate data to the send work 
completion for me.  A similar change looks like it would be needed in the 
kernel.

Report immediate data with send work completions

From: Sean Hefty sean.he...@intel.com

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 src/cq.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 8226b6b..ad1d20c 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -46,6 +46,7 @@
 
 #include mlx4.h
 #include doorbell.h
+#include wqe.h
 
 enum {
MLX4_CQ_DOORBELL= 0x20
@@ -194,6 +195,7 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
 {
struct mlx4_wq *wq;
struct mlx4_cqe *cqe;
+   struct mlx4_wqe_ctrl_seg *ctrl;
struct mlx4_srq *srq;
uint32_t qpn;
uint32_t g_mlpath_rqpn;
@@ -240,6 +242,8 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
wq = (*cur_qp)-sq;
wqe_index = ntohs(cqe-wqe_index);
wq-tail += (uint16_t) (wqe_index - (uint16_t) wq-tail);
+   ctrl = (*cur_qp)-buf.buf + (*cur_qp)-sq.offset +
+   ((wq-tail  (wq-wqe_cnt - 1))  
(*cur_qp)-sq.wqe_shift);
wc-wr_id = wq-wrid[wq-tail  (wq-wqe_cnt - 1)];
++wq-tail;
} else if ((*cur_qp)-ibv_qp.srq) {
@@ -265,11 +269,13 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
switch (cqe-owner_sr_opcode  MLX4_CQE_OPCODE_MASK) {
case MLX4_OPCODE_RDMA_WRITE_IMM:
wc-wc_flags |= IBV_WC_WITH_IMM;
+   wc-imm_data = ctrl-imm;
case MLX4_OPCODE_RDMA_WRITE:
wc-opcode= IBV_WC_RDMA_WRITE;
break;
case MLX4_OPCODE_SEND_IMM:
wc-wc_flags |= IBV_WC_WITH_IMM;
+   wc-imm_data = ctrl-imm;
case MLX4_OPCODE_SEND:
wc-opcode= IBV_WC_SEND;
break;


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Send with immediate data completion

2012-01-11 Thread Hefty, Sean
 I'm still waiting on feedback from the IBTA, but they are looking into the
 matter.

The intent is for immediate data only to be provided on receive work 
completions.  The IBTA will clarify the spec on this.  I'll submit patches that 
remove setting the wc flag, which may help avoid this confusion some.

Thanks,
- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Send with immediate data completion

2012-01-11 Thread Roland Dreier
On Wed, Jan 11, 2012 at 2:22 PM, Hefty, Sean sean.he...@intel.com wrote:
 The intent is for immediate data only to be provided on receive work 
 completions.  The IBTA will clarify the spec on this.  I'll submit patches 
 that remove setting the wc flag, which may help avoid this confusion some.

The unfortunate thing is that we never defined enum values like

IBV_WC_RDMA_WRITE_WITH_IMM

to go along with

IBV_WR_RDMA_WRITE_WITH_IMM

If there was a good reason for that, I've long since forgotten it.

But in any case we don't have a good way to return the opcode
in the completion queue entry for an RDMA write with immediate
request, except the way we do it now.

Perhaps the only way forward is with a documentation update?

 - R.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Send with immediate data completion

2012-01-11 Thread Jason Gunthorpe
On Wed, Jan 11, 2012 at 03:35:58PM -0800, Roland Dreier wrote:
 On Wed, Jan 11, 2012 at 2:22 PM, Hefty, Sean sean.he...@intel.com wrote:
  The intent is for immediate data only to be provided on receive
  work completions. ?The IBTA will clarify the spec on this. ?I'll
  submit patches that remove setting the wc flag, which may help
  avoid this confusion some.
 
 The unfortunate thing is that we never defined enum values like
 
 IBV_WC_RDMA_WRITE_WITH_IMM
 
 to go along with
 
 IBV_WR_RDMA_WRITE_WITH_IMM
 
 If there was a good reason for that, I've long since forgotten it.

Do you think there is a need to have a WC discern if there was an
attached immediate data? There is no resource attached to the
immediate data that needs special handling.

That is about the only argument I can see for continuing to set
the IBV_WC_WITH_IMM flag on the WC.

All I think is really needed here is a firm note someplace that
imm_data is only valid if IBV_WC_RECV is set, while IBV_WC_WITH_IMM
is set based on the opcode.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/qib: detour pcie_caps for certain chip sets

2012-01-11 Thread Mike Marciniszyn
Commit 8d4548f2b (IB/qib: Default some module parameters optimally)
introduced an issue with older root complexes.  They cannot handle
the pcie_caps of 0x51.

A typical diagnostic in this situation reported by syslog contains
the text:

  [PCIe Poisoned TLP][Send DMA memory read]

Insure the 0x51 is not used on the suspect chip sets.

Reviewed-by: Mark Debbage mark.debb...@qlogic.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com
---
 drivers/infiniband/hw/qib/qib_pcie.c |   19 ++-
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_pcie.c 
b/drivers/infiniband/hw/qib/qib_pcie.c
index 0de55c0..bb23ba0 100644
--- a/drivers/infiniband/hw/qib/qib_pcie.c
+++ b/drivers/infiniband/hw/qib/qib_pcie.c
@@ -577,7 +577,7 @@ static int qib_tune_pcie_coalesce(struct qib_devdata *dd)
  * BIOS may not set PCIe bus-utilization parameters for best performance.
  * Check and optionally adjust them to maximize our throughput.
  */
-static int qib_pcie_caps = 0x51;
+static int qib_pcie_caps;
 module_param_named(pcie_caps, qib_pcie_caps, int, S_IRUGO);
 MODULE_PARM_DESC(pcie_caps, Max PCIe tuning: Payload (0..3), ReadReq (4..7));
 
@@ -589,6 +589,7 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd)
u16 pcaps, pctl, ecaps, ectl;
int rc_sup, ep_sup;
int rc_cur, ep_cur;
+   int caps = 0x51; /* may be overridden below */
 
/* Find out supported and configured values for parent (root) */
parent = dd-pcidev-bus-self;
@@ -609,6 +610,14 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd)
pci_read_config_word(dd-pcidev, epos + PCI_EXP_DEVCTL, ectl);
} else
goto bail;
+   if (!qib_pcie_caps) {
+   u16 devid = parent-device;
+   if ((devid = 0x25e2  devid = 0x25fa) ||
+   (devid = 0x65e2  devid = 0x65fa) ||
+   (devid = 0x4021  devid = 0x402e))
+   caps = 0;
+   } else
+   caps = qib_pcie_caps;
ret = 0;
/* Find max payload supported by root, endpoint */
rc_sup = fld2val(pcaps, PCI_EXP_DEVCAP_PAYLOAD);
@@ -620,8 +629,8 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd)
ep_cur = fld2val(ectl, PCI_EXP_DEVCTL_PAYLOAD);
 
/* If Supported greater than limit in module param, limit it */
-   if (rc_sup  (qib_pcie_caps  7))
-   rc_sup = qib_pcie_caps  7;
+   if (rc_sup  (caps  7))
+   rc_sup = caps  7;
/* If less than (allowed, supported), bump root payload */
if (rc_sup  rc_cur) {
rc_cur = rc_sup;
@@ -643,8 +652,8 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd)
 * which is code '5' (log2(4096) - 7)
 */
rc_sup = 5;
-   if (rc_sup  ((qib_pcie_caps  4)  7))
-   rc_sup = (qib_pcie_caps  4)  7;
+   if (rc_sup  ((caps  4)  7))
+   rc_sup = (caps  4)  7;
rc_cur = fld2val(pctl, PCI_EXP_DEVCTL_READRQ);
ep_cur = fld2val(ectl, PCI_EXP_DEVCTL_READRQ);
 


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/qib: detour pcie_caps for certain chip sets

2012-01-11 Thread Jason Gunthorpe
On Wed, Jan 11, 2012 at 10:00:49PM -0500, Mike Marciniszyn wrote:
 Commit 8d4548f2b (IB/qib: Default some module parameters optimally)
 introduced an issue with older root complexes.  They cannot handle
 the pcie_caps of 0x51.

Should whatever this issue is be a general PCI fixup? Like broken MSI,
etc.

Might be nice to include what 0x51 tunes in the commit to aide other
peoole with the broken chipset :)

Isn't it necesary to check the PCI vendor as well as the devid?

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/qib: detour pcie_caps for certain chip sets

2012-01-11 Thread Roland Dreier
On Wed, Jan 11, 2012 at 7:00 PM, Mike Marciniszyn
mike.marcinis...@qlogic.com wrote:
 +       if (!qib_pcie_caps) {
 +               u16 devid = parent-device;
 +               if ((devid = 0x25e2  devid = 0x25fa) ||
 +                   (devid = 0x65e2  devid = 0x65fa) ||
 +                   (devid = 0x4021  devid = 0x402e))
 +                       caps = 0;
 +       } else

Does this work on systems where the broken chipset might
not be the immediate parent of the qib device (ie there are
some PCIe switches in between)?

Pushing knowledge of completely unrelated devices out into
every device driver looks it will be impossible to maintain.
If you need to know about the whole PCIe fabric to do this
type of tuning, the only way I can see to make this work
is to add some helper functions to set these values for
a given device, and then add all the workaround handling
to the core PCI driver.

 - R.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/qib: detour pcie_caps for certain chip sets

2012-01-11 Thread Roland Dreier
On Wed, Jan 11, 2012 at 9:48 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:
 Isn't it necesary to check the PCI vendor as well as the devid?

That definitely seems like a good idea.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][TRIVIAL] infiniband-diags/configure.in: Fix typo

2012-01-11 Thread Ira Weiny
On Tue, 10 Jan 2012 12:22:58 -0800
Hal Rosenstock h...@dev.mellanox.co.il wrote:

 
 Signed-off-by: Hal Rosenstock h...@mellanox.com

Thanks applied,
Ira

 ---
 diff --git a/configure.in b/configure.in
 index 5a596dc..215c1ea 100644
 --- a/configure.in
 +++ b/configure.in
 @@ -65,7 +65,7 @@ IBDIAG_CONFIG_PATH_TMP1=`eval echo ${sysconfdir}`
  IBDIAG_CONFIG_PATH_TMP2=`echo $IBDIAG_CONFIG_PATH_TMP1 | sed 
 's/^NONE/$ac_default_prefix/'`
  IBDIAG_CONFIG_PATH=`eval echo $IBDIAG_CONFIG_PATH_TMP2`/infiniband-diags
  AC_SUBST(IBDIAG_CONFIG_PATH)
 -AC_DEFINE_UNQUOTED([IBDIAG_CONFIG_PATH], $IBDIAG_CONFIG_PATH, [Define the 
 path to configuratios])
 +AC_DEFINE_UNQUOTED([IBDIAG_CONFIG_PATH], $IBDIAG_CONFIG_PATH, [Define the 
 path to configurations])
  
  dnl Check if we should include test utilities
  AC_MSG_CHECKING(for --enable-test-utils)


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ibswitches command displays wrong information when switch description string includes base

2012-01-11 Thread Ira Weiny
While Jason is correct that I am working to deprecate the scripts this bug 
still exists in the 1.5 branch.

Thanks for pointing this out.  As Open Fabrics is doing an additional 1.5 
release I will attempt to fix before that release.

Thanks,
Ira

On Fri, 6 Jan 2012 16:22:32 -0800
pierre orzechowski pierre.e.orzechow...@oracle.com wrote:

 Hello,
 
 We have a customer reporting an issue with the ibswitches command part 
 of the infiniband-diags package.
 If the switch description includes the string base anywhere, then the 
 match operation in the awk section of ibwitches does not behave as 
 expected and output is garbled.
 
 Example :
 
 [root@slcc09sw-ib2 tmp]# grep Switch ibnetdisc.out
 Switch  36 S-002128468ee1a0a0 # SUN DCS 36P QDR basesw-ib3 
 10.242.48.23 enhanced port 0 lid 1 lmc 0
 Switch  36 S-0021286cc8a3a0a0 # SUN DCS 46P QDR slcc09sw-ib3 
 10.242.48.24 enhanced port 0 lid 12 lmc 0
 [root@slcc09sw-ib2 tmp]# ibswitches ibnetdisc.out
 Switch  : 0x002128468ee1a0a0 ports 36 SUN DCS 36P QDR base port 0 lid 1 
 lmc 0
 Switch  : 0x0021286cc8a3a0a0 ports 36 SUN DCS 46P QDR slcc09sw-ib3 
 10.242.48.24 enhanced port 0 lid 12 lmc 0
 
 The 1 liner workaround we have is to replace in red :
 echo $text | awk '
 /^Switch/   {
  l=$0
  desc=substr(l, match(l, #[ \t]*)+RLENGTH)
  pi=match(desc, port 0.*)
  pinfo=substr(desc, pi)
  desc=substr(desc, 1, pi-2)
 type=base
  ti=match(desc, type)
 ...
 
 With
 
 echo $text | awk '
 /^Switch/   {
  l=$0
  desc=substr(l, match(l, #[ \t]*)+RLENGTH)
  pi=match(desc, port 0.*)
  pinfo=substr(desc, pi)
  desc=substr(desc, 1, pi-2)
 type=\base\
  ti=match(desc, type)
 ...
 
 Thanks for your help in reviewing this potential defect.
 Apologies if posting to the wrong mailing list.
 
 Best regards,
 Pierre Orzechowski
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html