Re: [net-next-2.6 PATCH] ipoib: remove addrlen check for mc addresses

2010-03-23 Thread Or Gerlitz
Eli Cohen wrote:
 Could you send a link to the git tree where I can find this commit and
 the related fixes?

basically, as the subject line suggests, it should be in Dave's net-next tree

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3 for-2.6.35] ib/iser: fix multipathing over iser, reduce fail-over time

2010-05-05 Thread Or Gerlitz
Roland,

This patch series fixes DM multipath fail-over over iscsi/iser and
reduces the fail-over time; the core patch is #3.

Or.


[PATCH 2/3] ib/iser: remove buggy back-pointer setting

2010-05-05 Thread Or Gerlitz
The iscsi connection object life cycle includes binding and unbinding
(conn_stop) to/from the iscsi transport connection object. Since
iscsi connection objects are recycled, by the time the transport
connection (e.g. iser's ib connection) is released it is illegal
to touch the iscsi connection tied to the transport back-pointer, as
it may already point to a different transport connection.

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>

---
 drivers/infiniband/ulp/iser/iser_verbs.c |2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -346,8 +346,6 @@ static void iser_conn_release(struct ise
 	/* on EVENT_ADDR_ERROR there's no device yet for this conn */
 	if (device != NULL)
 		iser_device_try_release(device);
-	if (ib_conn->iser_conn)
-		ib_conn->iser_conn->ib_conn = NULL;
 	iscsi_destroy_endpoint(ib_conn->ep);
 }



[PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing

2010-05-05 Thread Or Gerlitz
The iser connection teardown flow isn't over till the underlying
Connection Manager (e.g. the IB CM) delivers a disconnected or timeout
event through the RDMA-CM. When the remote (target) side isn't reachable,
e.g. when some HW (port/hca/switch) isn't functioning or is taken down
administratively, the CM timeout flow is used and the event may be
generated only after a relatively long time, on the order of tens of seconds.

The current iser code exposes this possibly long delay to higher layers,
specifically to the iscsid daemon and iscsi kernel stack. As a result,
the iscsi stack doesn't respond well, to the extent of this low-level CM
delay being added to the fail-over time under HA schemes such as the one
provided by DM multipath through the multipathd(8) service.

This patch enhances the reference counting scheme on iser's IB
connections such that the disconnect flow initiated by iscsid from
user space (ep_disconnect) isn't waiting for the CM to deliver the
disconnect/timeout event. On the other hand, the connection teardown
isn't done from iser's view point till the event is delivered.

The iser ib (rdma) connection object is destroyed when its reference
count reaches zero. When this happens on the RDMA-CM callback context,
extra care is taken such that the RDMA-CM does the actual destroying
of the associated ID as doing it in the callback is prohibited.

The reference count of iser ib connection would normally reach
three, where the ref, deref relations are
1. conn init, terminate
2. conn bind, stop/destroy
3. cma id create, disconnect/error/timeout callbacks
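
The three ref/deref pairs listed above can be illustrated with a minimal userspace sketch. This is not the driver code: `struct conn_sketch`, `conn_init`, `conn_get` and `conn_put` are hypothetical names, and the plain `int` counter assumes a single thread (kernel code would need an atomic counter).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for iser's IB connection object, not the driver's struct. */
struct conn_sketch {
	int  refcount;   /* number of holders still using the connection */
	bool released;   /* set once the last holder drops its reference */
};

/* 1. conn init takes the first reference. */
void conn_init(struct conn_sketch *c)
{
	c->refcount = 1;
	c->released = false;
}

/* Taken on 2. conn bind and on 3. cma id create. */
void conn_get(struct conn_sketch *c)
{
	assert(!c->released);
	c->refcount++;
}

/*
 * Dropped on terminate, on stop/destroy and from the
 * disconnect/error/timeout callback. Returns true when this call
 * dropped the last reference, i.e. the object may now be destroyed.
 */
bool conn_put(struct conn_sketch *c)
{
	assert(c->refcount > 0);
	if (--c->refcount == 0) {
		c->released = true;
		return true;
	}
	return false;
}
```

Whichever of the three holders drops its reference last sees `conn_put()` return true, so the user-space initiated disconnect does not have to wait for the CM event, while the object still outlives that event.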

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>

---

with this patch, multipath fail-over time is about 30 seconds,
which is seen here, when a DD over the multi-path device is done
before/during/after the fail-over

regularly, before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.926 s, 1.0 GB/s

taking a port down, causing fail-over during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 46.6117 s, 369 MB/s

after path-failure, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6474 s, 1.0 GB/s

13:00:09 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
13:00:24 connection8:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
13:00:24 connection8:0: detected conn error (1011)
13:00:24 iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)
13:00:39 cto-1 kernel: device-mapper: multipath: Failing path 8:48.
13:00:39 cto-1 multipathd: 8:48: mark as failed
13:00:39 cto-1 multipathd: mpathd: remaining active paths: 1
-- the disconnected event is delivered after the IB CM timeout expires
-- but fail-over doesn't pend on this
13:01:56 iser: iser_cma_handler:event 10 status 0 conn 88022dcb39b0 id 88022cf09400

without this patch, multipath fail-over time is about 130 seconds

before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6812 s, 1.0 GB/s

taking a port down during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 145.094 s, 118 MB/s

after fail-over, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.8935 s, 1.0 GB/s

14:24:05 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
14:24:20 connection4:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
14:24:20 kernel: connection4:0: detected conn error (1011)
14:24:21 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
-- the disconnected event is delivered after the IB CM timeout expires
-- fail-over pending on this
14:25:59 iser: iser_cma_handler:event 10 conn 88022625a1b0 id 880222537c00
14:26:14 session4: session recovery timed out after 15 secs
14:26:14 device-mapper: multipath: Failing path 8:64.
14:26:14 multipathd: mpathd: remaining active paths: 1

 drivers/infiniband/ulp/iser/iscsi_iser.c |9 ++-
 drivers/infiniband/ulp/iser/iscsi_iser.h |3 -
 drivers/infiniband/ulp/iser/iser_verbs.c |   72 +--
 3 files changed, 46 insertions(+), 38 deletions(-)

Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -238,7 +238,7 @@ alloc_err:
  * releases the FMR pool, QP and CMA ID objects, returns 0 on success,
  * -1 on failure
  */
-static int iser_free_ib_conn_res(struct iser_conn *ib_conn)
+static int iser_free_ib_conn_res(struct iser_conn *ib_conn, int can_destroy_id)
 {
 	BUG_ON(ib_conn == NULL);
 
@@ -253,7 +253,8 @@ static int iser_free_ib_conn_res(struct
 	if (ib_conn->qp != NULL)
 		rdma_destroy_qp(ib_conn->cma_id);
 
-	if (ib_conn->cma_id

Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info

2010-05-05 Thread Or Gerlitz

Roland Dreier wrote:

 +CXGB4 ETHERNET DRIVER (CXGB4)
  
not sure who's the butterfly that caused this, but this was somehow 
committed as  CXGB4 ETHERNET DRIVER (CXGB3) and same goes for the IW_ 
piece


Or.



Re: [patch v3] infiniband: ulp/iser, fix error retval in iser_create_ib_conn_res

2010-05-06 Thread Or Gerlitz
Roland Dreier wrote:
 Or, I don't think we ever fixed this.  This patch looks correct to me,
 any problem with merging this for 2.6.35?

Roland, please use V4 below, the patch is okay and would apply before and after 
applying the multipathing patches I sent yesterday (same goes for them).

[PATCH V4] ib/iser: fix error flow in iser_create_ib_conn_res

From: Dan Carpenter <erro...@gmail.com>

We shouldn't free things here because we free them later.
The call tree looks like this:
iser_connect() ==> initiating the connection establishment
and later
iser_cma_handler() => iser_route_handler() => iser_create_ib_conn_res()
If we fail here, eventually iser_conn_release() is called, resulting in a
double free.
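
The ownership rule behind this fix can be sketched in plain userspace C: the create path frees nothing on failure, and a single release function frees whatever was allocated. The names `struct res_sketch`, `res_create` and `res_release` are hypothetical and only echo the patch; `malloc`/`free` stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource holder; the field names only echo the patch. */
struct res_sketch {
	void *login_buf;
	void *page_vec;
};

/*
 * Allocate resources. On failure nothing is freed here: whatever was
 * allocated stays owned by r and is freed later by res_release().
 * fail_second simulates a failure of the second allocation.
 */
int res_create(struct res_sketch *r, int fail_second)
{
	r->login_buf = NULL;
	r->page_vec = NULL;

	r->login_buf = malloc(64);
	if (!r->login_buf)
		return -1;

	r->page_vec = fail_second ? NULL : malloc(64);
	if (!r->page_vec)
		return -1;	/* login_buf is left for res_release() */

	return 0;
}

/* The single place where resources are freed; safe after a failed create. */
void res_release(struct res_sketch *r)
{
	free(r->page_vec);	/* free(NULL) is a no-op */
	free(r->login_buf);
	r->page_vec = NULL;
	r->login_buf = NULL;
}
```

Because the pointers are reset to NULL after freeing, a second call to `res_release()` is harmless, which is the same reason V3 of the patch clears the dangling ERR_PTR() in fmr_pool.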

 
Signed-off-by: Dan Carpenter <erro...@gmail.com>
Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>
---
V1 fixed unreachable code
V2 noticed that the original code had a double free
V3 Roland Dreier points out that I left a dangling ERR_PTR() in
   ib_conn-fmr_pool which would be freed later on.
V4 reviewed/enhanced the change-log
---
 drivers/infiniband/ulp/iser/iser_verbs.c |   25 +
 1 file changed, 9 insertions(+), 16 deletions(-)

Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -163,10 +163,8 @@ static int iser_create_ib_conn_res(struc
 	device = ib_conn->device;
 
 	ib_conn->login_buf = kmalloc(ISER_RX_LOGIN_SIZE, GFP_KERNEL);
-	if (!ib_conn->login_buf) {
-		goto alloc_err;
-		ret = -ENOMEM;
-	}
+	if (!ib_conn->login_buf)
+		goto out_err;
 
 	ib_conn->login_dma = ib_dma_map_single(ib_conn->device->ib_device,
 				(void *)ib_conn->login_buf, ISER_RX_LOGIN_SIZE,
@@ -175,10 +173,9 @@ static int iser_create_ib_conn_res(struc
 	ib_conn->page_vec = kmalloc(sizeof(struct iser_page_vec) +
 				    (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE +1)),
 				    GFP_KERNEL);
-	if (!ib_conn->page_vec) {
-		ret = -ENOMEM;
-		goto alloc_err;
-	}
+	if (!ib_conn->page_vec)
+		goto out_err;
+
 	ib_conn->page_vec->pages = (u64 *) (ib_conn->page_vec + 1);
 
 	params.page_shift        = SHIFT_4K;
@@ -198,7 +195,8 @@ static int iser_create_ib_conn_res(struc
 	ib_conn->fmr_pool = ib_create_fmr_pool(device->pd, &params);
 	if (IS_ERR(ib_conn->fmr_pool)) {
 		ret = PTR_ERR(ib_conn->fmr_pool);
-		goto fmr_pool_err;
+		ib_conn->fmr_pool = NULL;
+		goto out_err;
 	}
 
 	memset(&init_attr, 0, sizeof init_attr);
@@ -216,7 +214,7 @@ static int iser_create_ib_conn_res(struc
 
 	ret = rdma_create_qp(ib_conn->cma_id, device->pd, &init_attr);
 	if (ret)
-		goto qp_err;
+		goto out_err;
 
 	ib_conn->qp = ib_conn->cma_id->qp;
 	iser_err("setting conn %p cma_id %p: fmr_pool %p qp %p\n",
@@ -224,12 +222,7 @@ static int iser_create_ib_conn_res(struc
 		 ib_conn->fmr_pool, ib_conn->cma_id->qp);
 	return ret;
 
-qp_err:
-	(void)ib_destroy_fmr_pool(ib_conn->fmr_pool);
-fmr_pool_err:
-	kfree(ib_conn->page_vec);
-	kfree(ib_conn->login_buf);
-alloc_err:
+out_err:
 	iser_err("unable to alloc mem or create resource, err %d\n", ret);
 	return ret;
 }


Re: [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing

2010-05-11 Thread Or Gerlitz
Or Gerlitz <ogerl...@voltaire.com> wrote:
  [...] with this patch, multipath fail-over time is about 30 seconds, which 
 is seen here,
 when a DD over the multi-path device is done before/during/after the 
 fail-over [...] without
  this patch, multipath fail-over time is about 130 seconds

Hi Roland, as we're @ -rc7 now, I wanted to check with you if there's
any issue merging this patch series for 2.6.35. If you have any
question or anything needs to be addressed/fixed, I'd like to do that
sooner rather than later.

Or


Re: [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing

2010-05-12 Thread Or Gerlitz
Roland Dreier <rdre...@cisco.com> wrote:

 I have these 3 + Dan Carpenter's fix applied now.

cool

Or.


Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1

2010-05-15 Thread Or Gerlitz

Eli Cohen wrote:
 Roland Dreier wrote:
   @@ -1007,7 +1010,7 @@ static void ib_sa_add_one(struct ib_device *device)
   - sa_dev = kmalloc(sizeof *sa_dev +
   + sa_dev = kzalloc(sizeof *sa_dev +

  Do you happen to remember why you needed these kmalloc -> kzalloc conversions?

 I can't remember why. I do have this habit of preferring kzalloc over kmalloc
 because it saves trouble sometimes.

Hi Eli, just a friendly comment: it is best if such a cleanup is done in a
separate patch, else someone later attempting to debug/bisect (who might
be yourself, btw) could spend a hell of a time wondering why it was done
here and in the framework of this patch...


Or.


Re: [ewg] [PATCHv8 03/11] IB/umad: Enable support only for IB ports

2010-05-15 Thread Or Gerlitz

Eli Cohen wrote:
 Roland Dreier wrote:
  Why do we not allow umad for IBoE ports?  I understand there's no QP0 but why
  can't userspace use QP1 just like for IB link layer ports?

 Currently QP1 is only used by the CM protocol which is implemented in the
 kernel. Since we handle the iboe specific flow in the cma rather than the SA,
 there is no need to expose qp1 to userspace.

Eli, any reason not to allow reading the HCA/port traffic counters
(e.g. with perfquery) with IBoE?


Or.


Re: [ANNOUNCE] librdmacm 1.0.12

2010-05-25 Thread Or Gerlitz
Sean Hefty wrote:
 I've pushed out release 1.0.12 of librdmacm.  

Hi Sean, below is a tiny patch which will help direct users to the correct 
mailing list


set the mailing list info to be linux-rdma instead of the ofa general list

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>

diff --git a/configure.in b/configure.in
index d6c4a62..d0f2623 100644
--- a/configure.in
+++ b/configure.in
@@ -1,7 +1,7 @@
 dnl Process this file with autoconf to produce a configure script.
 
 AC_PREREQ(2.57)
-AC_INIT(librdmacm, 1.0.12, gene...@lists.openfabrics.org)
+AC_INIT(librdmacm, 1.0.12, linux-rdma@vger.kernel.org)
 AC_CONFIG_SRCDIR([src/cma.c])
 AC_CONFIG_AUX_DIR(config)
 AM_CONFIG_HEADER(config.h)


Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-03 Thread Or Gerlitz

Moni Shoua wrote:
 Did you try OFED-1.5.1 or even better, OFED-1.5.2? I know patches for counters
 with RoCEE were submitted since OFED-1.5 and I saw it working

Mony, I'm not using ofed, sorry... I am interested in a clarification in
the context of the upstream submission, e.g. does the problem exist in
the latest patch-set, is there a bz case tracking this, etc.


Or.


Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-07 Thread Or Gerlitz
Eli Cohen wrote:
 counter should work as regular in upstream kernel patches for IB link layer. 

okay good, can you validate that? basically, I can set some time to clone
Roland's tree and use the iboe branch as a basis for testing that the IB
stack is live and kicking as it used to be before the patches. I just need
an updated copy of the rest of the patch set (Roland has three patches so
far) for that end. Over the review process there were a bunch of comments
but no new posting; how are you planning to proceed in the review/merge
process?

 for IBoE, they will not work since the SMA does not support them
 I have patches that allow to show counters using sysfs 

I am not with you. The counters are read using a MAD sent to the firmware PMA
(QP #1); this applies to both perfquery and sysfs, doesn't it?

Or.


Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-07 Thread Or Gerlitz

Eli Cohen wrote:
 Why are you asking me to validate that? Did you actually encounter a problem
 with this?

yes, I did. It didn't work with some ofed drop I was using. Anyway, as I
said, I can do some validation that IBoE doesn't break upstream IB, I just
need the patches for that end, so once they are available, I will give
them a try over 2.6.35-rcX

 The counters patches will divert the code: for iboe it will not issue a MAD to
 the firmware. It will use another command.

Can you be more specific about the origin of this new design? Is it a
HW limitation, a firmware limitation or something else? In case it's not a
hardware limitation, I don't think we need to go for a non-MAD based
scheme, at least not for mlx4


Or.



Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage

2010-06-11 Thread Or Gerlitz
On Fri, Jun 11, 2010 at 3:47 PM, Chien Tung <chien.tin.t...@intel.com> wrote:

 V2 changes:

What do you consider to be V1, this thread from 2007?

Or.


Re: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-11 Thread Or Gerlitz
Walukiewicz, Miroslaw <miroslaw.walukiew...@intel.com> wrote:
 The patch adds a new test application describing a usage of the 
 IBV_QPT_RAW_ETH
 for IPv4 multicast acceleration on iWARP cards. See man mcraw for parameters 
 description

So this is the only raw qp related patch to librdmacm? Any reason not
to patch mckey to support both IB and Ethernet raw QPs? Does the raw qp
have any relation to the iWARP/TOE HW stack? There's also a raw qp patch
posted to ewg for mlx4, which has no backing iwarp logic.

Or


Re: [ewg] [PATCH] node description patch

2010-06-14 Thread Or Gerlitz
Mike Heinz wrote:
 This patch fixes a problem with the openibd initialization script. 
 On machines using slower DHCP servers, openibd frequently sets the HCA's node 
 description 
 to HCA-1. This patch modifies openibd to add a @ instead of the hostname 
 and adds a 
 small hook in the core drivers to replace the @ sign with the system's 
 utsname().
 Because this patch depends on changes to openibd, it cannot be submitted 
 to the upstream kernel, but it still corrects an outstanding issue with OFED 

Mike, 

The fact that your patch touches both user and kernel space code doesn't mean
the kernel part can't be submitted upstream. I suggest you repost the patches
to linux-rdma as a series made of two patches, one to the kernel and one to the
service script. The kernel part could then be picked up by the maintainer and
will come into play once there's user space code plugging into it. This is
similar to cases where people have kernel netlink agent code: merging is not
dependent on the existence of specific matching user space code.
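
For readers following along, the '@' substitution Mike describes could look roughly like the userspace sketch below. It is not his patch: `expand_node_desc` is a hypothetical helper, and the host name is passed in by the caller (in the kernel it would come from utsname(), in userspace e.g. from uname(2)).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy desc into out, replacing every '@' with host.
 * Returns 0 on success, -1 if the result (including the
 * terminating NUL) does not fit in out_len bytes.
 */
int expand_node_desc(const char *desc, const char *host,
		     char *out, size_t out_len)
{
	size_t host_len = strlen(host);
	size_t used = 0;

	assert(out_len > 0);
	for (; *desc != '\0'; desc++) {
		if (*desc == '@') {
			if (used + host_len >= out_len)
				return -1;
			memcpy(out + used, host, host_len);
			used += host_len;
		} else {
			if (used + 1 >= out_len)
				return -1;
			out[used++] = *desc;
		}
	}
	out[used] = '\0';
	return 0;
}
```

With this split, the service script only has to write a fixed string such as "@ HCA-1", and the late-arriving host name is filled in on the kernel side.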

As for the user space part, the IB stack provided by the distros does have a 
service script and this service script attempts to set the node descriptor, e.g 
here's the RHEL6 beta rdma service code 

 # grep -A 13 "node description" /etc/rc.d/init.d/rdma
     # Add node description to sysfs
     IBSYSDIR=/sys/class/infiniband
     if [ -d ${IBSYSDIR} ]; then
         declare -i hca_id=1
         for hca in ${IBSYSDIR}/*
         do
             if [ -w ${hca}/node_desc ]; then
                 echo -n "$(hostname | cut -f 1 -d .) HCA-${hca_id}" > ${hca}/node_desc 2> /dev/null
             fi
             let hca_id++
         done
     fi
 
     errata_58

If you want to patch this, you can open a bugzilla case with the relevant
distro and propose a patch.

Or.


Re: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-14 Thread Or Gerlitz
Walukiewicz, Miroslaw wrote:
 The mckey works on UD_QP type and mcraw works on RAW_QP type. 
 The data payload prepared for UD and RAW_QP are on different layers. 
 The mckey uses rdma_join_multicast() that triggers a state machine for IB 
 multicast joining. 
 The mcraw does not trigger such state machine because for sending the 
 ethernet multicast there is 
 no need for any multicast joining state machine. The multicast destination 
 address on ethernet 
 is determined by multicast group address.

Miroslaw,

I tend not to agree with the entire set of your arguments, to start with, for 
example, the code issues IP_ADD_MEMBERSHIP call on a socket and also computes 
the actual L2 address derived from the L3 multicast address. 

If these ops are required for raw ethernet multicast operation, you can enhance 
the rdmacm to have rdma_join_multicast carry these ops similarly to what it 
does for PS_UDP and PS_IPOIB over IB (determine the L2 address, join through 
the relevant state machine, etc).
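
As an illustration of the "L2 address derived from the L3 multicast address" step, here is a sketch of the standard IPv4-to-Ethernet multicast mapping (fixed prefix 01:00:5e plus the low 23 bits of the group address). Whether mcraw computes it exactly this way is an assumption, and `ipv4_mcast_to_mac` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet
 * multicast MAC: 01:00:5e followed by the low 23 bits of the group.
 */
void ipv4_mcast_to_mac(uint32_t group, uint8_t mac[6])
{
	assert((group >> 28) == 0xE);	/* must be in 224.0.0.0/4 */

	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this byte is dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

Since only 23 bits survive, several groups share one MAC (for example 224.1.1.1 and 239.129.1.1), which is one reason the join logic is better kept in a common place such as the rdma-cm rather than in each test application.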

It's not as if you claim to be able to run raw multicast without any relation
to the rdma-cm, e.g. you even want this code to be shipped by librdmacm... so
let's understand the architecture, for example what port space is needed here.
Looking at the code, it looks like you want it to operate in the same manner as
PS_IPOIB, so maybe extend this port space or declare an equivalent one for
ethernet.

Or.



Re: [PATCH 5/5] dapl-2.0 - scm, ucm: add pkey, pkey_index, sl override for QP's

2010-06-17 Thread Or Gerlitz

Hefty, Sean wrote:
 The index isn't guaranteed to be the same across all nodes.  If a consumer is
 going to manually control this, they should really be forced to use the actual
 pkey.

yes, I saw this confusion in action: for most users the pkey index doesn't
mean anything, and it may also change over time, which can break
scripts/settings that run specific jobs using specific partitions.
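
To make the point concrete, a consumer can be handed the pkey value and have the index looked up per node. The sketch below is illustrative only: `pkey_to_index` is a hypothetical helper working on a plain array, and it compares the low 15 bits because the top bit of a P_Key is the membership bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of pkey in this node's pkey table, or -1 if absent.
 * The membership bit (0x8000) is ignored and empty entries are skipped.
 */
int pkey_to_index(const uint16_t *table, size_t len, uint16_t pkey)
{
	assert(table != NULL);

	for (size_t i = 0; i < len; i++) {
		uint16_t base = table[i] & 0x7fff;

		if (base != 0 && base == (pkey & 0x7fff))
			return (int)i;
	}
	return -1;
}
```

Two nodes holding the same partitions in a different table order return different indices for the same pkey, so a job script that hard-codes an index only works by luck.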


Or.


Re: [ewg] [PATCH] pkey fix for ipoib - resubmission

2010-06-22 Thread Or Gerlitz
Jason Gunthorpe wrote:
 Be aware that mainline and OFED are different in this regard, OFED overrides 
 the pkey unconditionally for multicast addresses, while mainline doesn't
 
Can you clarify this, please?

 ipoib bonding had much the same problem with invalid maddrs, and a
 patch was put in that flushed the maddr table in certain bond scenarios. 

Yes, reading through this thread, I tend to agree with Jason that we're in the
same boat (problem) that bonding/ipoib used to be in, which was fixed in commit
75c78500ddad74b229cd0691496b8549490496a2 "bonding: remap multicast addresses
without using dev_close() and dev_open()", so I assume a similar solution
can/should be applied here as well, unless someone comes up with a magic
approach to eliminate the problem altogether...

Or.


mlx4 pci device table

2010-06-22 Thread Or Gerlitz
Hi Yevgeny, Roland

I wonder if you can spare a few words on what would be the correct location
of the PCI ID table under the two-tier architecture of the mlx4 driver?

If the table is placed in mlx4_core (as it is today upstream), then I
assume mlx4_en and mlx4_ib aren't being probed by the pci hot-plug
mechanisms, correct? Else, if you put it in the _en, _ib et al. files, then
one has to maintain two copies of the table, but maybe this would be
the correct approach? How should this work with multi-protocol mlx4
devices and/or IBoE?

Yevgeny, I see you placed in ofed some patch, which isn't upstream,
that puts a copy or some modified clone of the table in mlx4_en; what
problem were you trying to solve?

Or.


Re: [ewg] [PATCH] pkey fix for ipoib - resubmission

2010-06-23 Thread Or Gerlitz

Jason Gunthorpe wrote:
 OFED works on kernels that have compiled-in inline'd multicast map functions
 that do not include the pkey copy, while mainline's multicast map functions do.
 So to work around this there is a bit of code in OFED to overwrite the pkey in
 the multicast hw address. This means on OFED with those kernels ip maddr
 returns the wrong hw address sometimes..

okay, got it. Anyway, with this not being the essence of the patch nor
the discussion here, I would wait to hear what Todd and Mike think
about your suggestion to apply the approach taken for the bonding
problem and solution.


Or.


Re: sysfs IPoIB root owned writable files

2010-06-24 Thread Or Gerlitz



 the following files created under /sys which are world writeable
 /sys/class/net/ib0/delete_child  /sys/class/net/ib0/create_child
 At least the create_child & delete_child files appear to be dangerous to leave
 as world writeable because they result in resource allocations.

Roland,

If I see a patch in linux-rdma patchwork, e.g.
https://patchwork.kernel.org/patch/104502 with the below patch, does
this mean it will get to be reviewed/merged towards 2.6.36, or do you
prefer a reminder on the list?


Or.

Yes, this looks bad. The below patch fixes that, I tested it on 2.6.35-rc1

[PATCH] make ipoib child entries non-world writable

Sumeet Lahorani <sumeet.lahor...@oracle.com> reported that the ipoib
child entries are world writable; make them writable by root only.


Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index df3eb8c..b4b2257 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1163,7 +1163,7 @@ static ssize_t create_child(struct device *dev,
 
 	return ret ? ret : count;
 }
-static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
+static DEVICE_ATTR(create_child, S_IWUSR, NULL, create_child);
 
 static ssize_t delete_child(struct device *dev,
 			    struct device_attribute *attr,
@@ -1183,7 +1183,7 @@ static ssize_t delete_child(struct device *dev,
 	return ret ? ret : count;
 
 }
-static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
+static DEVICE_ATTR(delete_child, S_IWUSR, NULL, delete_child);
 
 int ipoib_add_pkey_attr(struct net_device *dev)
 {




Re: mlx4 pci device table

2010-06-28 Thread Or Gerlitz

Roland Dreier wrote:
 I think the current upstream location is correct.  This matches the practice of
 eg iw_cxgb3 as well as cxgb3i, bnx2i etc.  This does have the disadvantage that
 mlx4_en and mlx4_ib are not auto-loaded by PCI hotplug, but so it goes.

okay. Still, it's too bad that ofed ships patches that do things the
other way around vs upstream. Yevgeny, if you have reasoning in place to
do things the other way, why not submit upstream?


Or.



Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-06 Thread Or Gerlitz
miroslaw.walukiew...@intel.com wrote:
 adds a IB_QPT_RAW_PACKET QP type implementation for nes driver 

 +++ b/drivers/infiniband/hw/nes/nes_ud.c
 +static const struct file_operations nes_ud_sksq_fops = {
 + .owner = THIS_MODULE,
 + .open = nes_ud_sksq_open,
 + .release = nes_ud_sksq_close,
 + .write = nes_ud_sksq_write,
 + .read = nes_ud_sksq_read,
 + .mmap = nes_ud_sksq_mmap,
 +};
 +
 +
 +static struct miscdevice nes_ud_sksq_misc = {
 + .minor = MISC_DYNAMIC_MINOR,
 + .name = "nes_ud_sksq",
 + .fops = &nes_ud_sksq_fops,
 +};

Reading through the May 2010 RDMA/nes: IB_QPT_RAW_PACKET QP type support for 
nes driver email thread, e.g at the below links, you say


 The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is 
 shared with
 all other user-kernel  communication and it is quite complex. It is a perfect 
 path
 for QP/CQ/PD/mem management but for me it is too complex for traffic 
 acceleration.
 The user-kernel  path  through additional driver, shared page for 
 lkey/vaddr/len
 passing and SW memory translation in kernel is much more effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what the performance issue with the uverbs
post_send/post_recv is and, if there is one, why it can't be fixed, to avoid
introducing a lib/driver nes-specific char device. Could you explain it in some
more detail? You mentioned the rdma-cm device file, but the uverbs cmd api
is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway
a slow path.

Also, I understand that the .read (.write) entry maps to posting a receive
(send) buffer; what is the use case for the .mmap entry?

 --- a/drivers/infiniband/hw/nes/nes_verbs.c
 +++ b/drivers/infiniband/hw/nes/nes_verbs.c

 @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
 - atomic_inc(&qps_created);
 @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
[...]
 + /* moved here to be sure that QP is really created */
 + /*(now it counted a number of QP creation trials */
 + atomic_inc(&qps_created);

It would be best if this change, and a couple more like it, were placed in a
clean-up patch to nes_verbs.c, such that the amount of RAW QP related changes
to review is minimized.

 @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
   nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
   nesqp->iwarp_state, atomic_read(&nesqp->refcount));
  
 + if (ibqp->qp_type == IB_QPT_RAW_PACKET)
 + return 0;

isn't a raw qp associated with a specific port of the device?

Or.


Re: root owned writable files under /sys

2010-07-06 Thread Or Gerlitz
Sumeet Lahorani wrote:
 # find /sys -type f -perm -222
 /sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
 /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
 /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1

Jack, Tziporet,

Can you clarify the status of the upstream kernel mlx4 multi-protocol support? 
Looking at Linus' git tree, I see one commit, 
7ff93f8b7ecbc36e7ffc5c11a61643821c1bfee5 "mlx4_core: Multiple port type 
support", dated Oct 2008, whereas ofed ships a couple of patches touching this 
area, e.g. adding the above sysfs entries. So what extra functionality is 
introduced, or which bugs are fixed, by those patches? Is there any reason not 
to push them upstream? 


Or.


Re: When IBoE will be merged to upstream?

2010-07-07 Thread Or Gerlitz
Liran Liss wrote:
 but keeping ib_create_ah() callable from any context is not a goal by itself.

Going with your approach: if your proposed design is accepted, I believe that 
you would need to patch all the call chains that make calls under the 
current assumption.

 I am looking for constructive ideas for supporting iboe without breaking 
 Verbs/CQE/CM syntax. 

I don't agree that exposing the Ethernet L2 related information to the caller 
is breaking something; on the contrary, it is a required enhancement. 

I think we need to resolve through the rdma-cm, and expose at the 
consumer level, the source / destination MACs, vlan id and vlan 
priority used by an IBoE QP, in exactly the manner that the IB equivalents 
(src/dst lid, pkey, sl) are resolved by the rdma-cm and exposed to the consumer 
app for an IB QP.

Or.



Re: root owned writable files under /sys

2010-07-07 Thread Or Gerlitz

Tziporet Koren wrote:

 Jack is on vacation and will be back in 2 weeks. I will ask him to look at this 
 when he is back

All this could have been much simpler if Yevgeny were responding; he 
signed off on the multi-protocol related patches shipped with ofed. So far, 
I have had a hard time getting responses from him on any of the notes I sent re 
mlx4_en and _core


Or.


Re: root owned writeable files under /sys

2010-07-07 Thread Or Gerlitz
Roland Dreier rdre...@cisco.com wrote:
 thanks, applied

I don't see it, nor any of the other patches you accepted last night,
in your for-next branch. Where are they...?

Or.


Re: some dapl assistance

2010-07-13 Thread Or Gerlitz
Davis, Arlin R wrote:
 There is limited debug in the non-debug builds. If you want full debugging 
 capabilities
 you can install the source RPM and configure and make as follows [..] (OFED 
 target example):

Okay, got that. Once I built the sources by hand as you suggested I could see 
debug prints, but things didn't really work, so I stepped back and installed 
the latest rpms (dapl-2.0.29-1 and compat-dapl-1.2.18-1). Now I can't get 
intel-mpi to run:

 [r...@dodly0 ~]# rpm -qav | grep dapl
 dapl-utils-2.0.29-1
 dapl-2.0.29-1
 compat-dapl-1.2.18-1

 [r...@dodly0 ~]# ldconfig -p | grep libdat
 libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
 libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1

 [r...@dodly0 ~]# rpm -qf /usr/lib64/libdat.so.1
 compat-dapl-1.2.18-1
 [r...@dodly0 ~]# rpm -qf /usr/lib64/libdat2.so.2
 dapl-2.0.29-1

 [r...@dodly0 ~]# /opt/intel/impi/4.0.0.027/intel64/bin/mpiexec -ppn 1 -n 2  
 -env DAPL_IB_PKEY 0x8002 -env DAPL_DBG_TYPE 0xff -env DAPL_DBG_DEST 0x3  -env 
 I_MPI_DEBUG 3 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env I_MPI_FABRICS 
 dapl:dapl /tmp/osu
 [0] MPI startup(): cannot open dynamic library libdat.so
 [1] MPI startup(): cannot open dynamic library libdat.so
 [0] MPI startup(): cannot open dynamic library libdat2.so
 [0] dapl fabric is not available and fallback fabric is not enabled
 [1] MPI startup(): cannot open dynamic library libdat2.so
 [1] dapl fabric is not available and fallback fabric is not enabled
 rank 1 in job 5  dodly0_54941   caused collective abort of all ranks
   exit status of rank 1: return code 254
 rank 0 in job 5  dodly0_54941   caused collective abort of all ranks
   exit status of rank 0: return code 254

Any idea what we're doing wrong?

BTW, before things stopped working, when exporting LD_DEBUG=libs to the MPI 
rank I noticed that it used the compat-1.2 rpm ...

Now, I can run dapltest fine,
 [r...@dodly0 ~]# dapltest -T S -D ofa-v2-mthca0-1
 Dapltest: Service Point Ready - ofa-v2-mthca0-1
 Dapltest: Service Point Ready - ofa-v2-mthca0-1
 Server: Transaction Test Finished for this client

 [r...@dodly4 ~]# dapltest -T T -D ofa-v2-mlx4_0-1 -s dodly0 -i 1000 server SR 
 65536 4 client SR 65536 4
 Server Name: dodly0
 Server Net Address: 172.30.3.230
 DT_cs_Client: Starting Test ...
 - Stats  : 1 threads, 1 EPs
 Total WQE:2919.70 WQE/Sec
 Total Time   :   0.68 sec
 Total Send   : 262.14 MB - 382.69 MB/Sec
 Total Recv   : 262.14 MB - 382.69 MB/Sec
 Total RDMA Read  :   0.00 MB -   0.00 MB/Sec
 Total RDMA Write :   0.00 MB -   0.00 MB/Sec
 DT_cs_Client: == End of Work -- Client Exiting

I also noted that dapl-utils and compat-dapl-utils are mutually exclusive, 
as both attempt to install the same man page for dat.conf:
 # rpm -Uvh /usr/src/redhat/RPMS/x86_64/compat-dapl-utils-1.2.18-1.x86_64.rpm
 Preparing...### [100%]
 file /usr/share/man/man5/dat.conf.5.gz from install of 
 compat-dapl-utils-1.2.18-1.x86_64 conflicts with file from package 
 dapl-utils-2.0.29-1.x86_64

Or.


Re: root owned writable files under /sys

2010-07-18 Thread Or Gerlitz
Jack Morgenstein wrote:
 The sysfs entries you refer to are introduced in commit 
 7ff93f8b7ecbc36e7ffc5c11a61643821c1bfee5
 which patches in ofed but not upstream are you referring to?

Hi Jack,

I took another look; indeed the mlx4_port{1,2} sysfs entries are introduced in 
the commit you pointed to, and their permissions look okay (S_IRUGO | S_IWUSR): 
they are not world writable. 

As for the port_trigger sysfs entry, it is introduced by a patch shipped with 
ofed which isn't upstream (mlx4_1190_sense_port_trigger.patch) and indeed this 
entry is world writable.

So the question here is whether there's any reason for multi-protocol related 
patches, such as this one and the others like it, not to be pushed upstream. 
I failed to get any constructive response (== patches to Roland or Dave 
Miller) from Yevgeny and I was hoping you could be helpful here.

Or.

 Sumeet Lahorani wrote:
 # find /sys -type f -perm -222
 /sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
 /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
 /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1


Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-18 Thread Or Gerlitz
 I don't think there are applications around which would use raw qp AND
 are linked against libibverbs-1.0, such that they would exercise the 1_0
 wrapper, so we can ignore the 1st allocation, the one at the wrapper code.
 As for the 2nd allocation, since a WQE --posting-- is synchronous, 
 using the maximal values specified during the creation of the QP, I
 believe that this allocation can be done once per QP and used later.

[...] 

Hi Mirek, any comment on my response to the NES patch you sent?

Or.



 
 dive to kernel:
 ib_uverbs_post_send()
 user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); - 3. dyn alloc
 next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) +
user_wr->num_sge * sizeof (struct ib_sge),
GFP_KERNEL); - 4. dyn
 alloc
  And now there is finel call to driver. 
 ~same here for #4 you can compute/allocate once the maximal possible
 size for next per qp and use it later. As for #3, this need further
 thinking.
 
 But before diving to all this design changes, what was the penalty
 introduced by these allocations? is it in packets-per-second, latency?
 
 Diving to kernel is treated as a something like passing signal to
 kernel that there is prepared information to post_send/post_recv. The
 information about buffers are passed through shared page (available to
 userspace through mmap) to avoid copying of data. Write() ops is used
 to passing signal about post_send. Read() ops is used to pass
 information about post_recv(). We avoid additional copying of the data
 that way.
 thanks for the heads-up, I took a look and this user/kernel shared
 memory page is used to hold the work-request, nothing to do with data.
 
 As for the work request, you still have to copy it in user space from
 the user work request to the library mmaped buffer. So the only
 difference would be the copy_from_user done by uverbs, for few tens of
 bytes, can you tell if/what is the extra penalty introduced by this copy?
 
 struct nes_ud_send_wr {
 u32   wr_cnt;
 u32   qpn;
 u32   flags;
 u32   resv[1];
 struct ib_sge sg_list[64];
 };

 struct nes_ud_recv_wr {
 u32   wr_cnt;
 u32   qpn;
 u32   resv[2];
 struct ib_sge sg_list[64];
 };
 Looking on struct nes_ud_send/recv_wr, I wasn't sure to follow, the same
 instance can be used to post list of work requests, where is work
 request is limited to use one SGE, am I correct?
 
 I don't think there a need to support posting 64 --send-- requests, for
 recv it might makes sense, but it could be done in a batch/background
 flow, thoughts?
 
 Or.


Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver

2010-07-19 Thread Or Gerlitz
Walukiewicz, Miroslaw wrote:
 I agree with you that it is possible to fix the post_send path in OFED.
 Let me think a few days yet.

Hi Mirek,

okay. 

Just one comment: the way I see it, ofed is very much not something that has 
a post_send path. It's a temporary, ad hoc, very far from well organized, 
and actually much worse than you may think (try the archives for "shovel in 
unreviewed junk" or "pile of shit") collection of bits which pretends to be a 
distribution of the Linux IB stack.

The credit or discredit, and/or questions, patches, bugs & flames for this or 
that element of the IB stack, should all go to the maintainer/s, specifically 
those of libibverbs, ib_uverbs etc. (who happen to be CC-ed here...)

Or.


sense remote hardware address change by rdma-cm applications

2010-07-19 Thread Or Gerlitz
Today, the kernel neighbour maintenance state-machine / engine
doesn't come into play for neighbours created on behalf of rdma-cm
consumers. This is b/c the send path is offloaded away from the
network stack to the app QP, and as such the neighbour created
following the ARP request / reply initiated by rdma_resolve_addr is
quickly aged and deleted. Am I correct in that?

This behaviour makes rdma-cm RC apps sense a remote hardware address
change based only on the RC QP timeout, whereas UD apps have no way
other than implementing some sort of keep-alive / probing mechanism to
make sure their AH is valid. So how about:


A. ref a neighbour created on behalf of or used by an rdma-cm ID  (*)

B. enhance the rdma-cm address_change event to report on remote
hardware address change, based on neighbour events

Or.

(*) would a per-ID neigh_hold() call (paired with neigh_release() when
the ID gets destroyed) work to that end?
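
To illustrate the pairing in (*), here is a rough userspace model, not kernel code. All the sketch_* names are hypothetical, and the reclaim rule is simplified to "no holders left"; the real neighbour table keeps its own reference and has its own garbage collection rules.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a refcounted neighbour entry. */
struct sketch_neigh {
    int refcnt;   /* number of holders in this simplified model */
};

/* Hypothetical model of an rdma-cm ID that pins the neighbour it resolved. */
struct sketch_cm_id {
    struct sketch_neigh *neigh;
};

static void sketch_neigh_hold(struct sketch_neigh *n)    { n->refcnt++; }
static void sketch_neigh_release(struct sketch_neigh *n) { n->refcnt--; }

/* Address resolution done: take a reference for the lifetime of the ID. */
static void sketch_id_bind_neigh(struct sketch_cm_id *id, struct sketch_neigh *n)
{
    sketch_neigh_hold(n);
    id->neigh = n;
}

/* ID destroyed: drop the reference taken at bind time. */
static void sketch_id_destroy(struct sketch_cm_id *id)
{
    if (id->neigh) {
        sketch_neigh_release(id->neigh);
        id->neigh = NULL;
    }
}

/* In this model an entry may be reclaimed only when nobody holds it. */
static bool sketch_neigh_can_reclaim(const struct sketch_neigh *n)
{
    return n->refcnt == 0;
}
```

The point of the model is only that the entry stays pinned between bind and destroy, so the neighbour subsystem would keep maintaining it for as long as the ID lives.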


Re: NULL pointer dereference in rdma_ucm

2010-07-20 Thread Or Gerlitz

Josh England wrote:

 It may be that the in-kernel field cm_id_priv has a NULL ->alt_av.port, 
 causing the Oops, but I don't know for sure.  Any ideas on how to debug this?

Seems like this was reported in the past but remained unsolved:
http://lists.openfabrics.org/pipermail/general/2009-August/thread.html#61522

Or.



Re: sense remote hardware address change by rdma-cm applications

2010-07-20 Thread Or Gerlitz
Jason Gunthorpe wrote:
 It is a bit wider problem than just ND entries, changes in routing can
 also alter the L2 address, so that needs to be tracked as well. 

Sure. When we did the address change work (see commit dd5bdff "RDMA/cma: Add 
RDMA_CM_EVENT_ADDR_CHANGE event"), the problem I wanted to solve was related to 
local bonding. During the review thread, remote address changes related 
to bonding fail-over and routing changes were mentioned, and left to future 
work.


 this is back to original criticisms from netdev of this whole separated 
 stack idea - it isn't integrated, so where do you draw the line? What gets 
 left out? 
 Today, it is pretty clear that only the CM portion integrates at all
 with netdev and after that things are separate.

The address change event was an attempt to make the CM part, which integrates 
with netdev, go a step further and help the offloaded data path be more 
consistent with netdev. This email is about going another step.

 So.. I think to tackle this you need to start looking at how the
 dst_entry structure works in netdev and apply the same idea to RDMA-CM
 and reflect the changes in AH back to the QP owner.

I can take a look (a pointer would be very much appreciated...). Still, the dst 
entry is used for every netdev xmit, whereas here the xmit is offloaded, so I 
don't see what could really be used from the dst code, but I might be wrong. 
The rdma app uses the neighbour once, upon address resolving, and I was trying 
to see if we can ref the neighbour so the neigh sub-system probes would keep 
going even though the neighbour is not directly used.

 Is this an iwarp problem too? Not sure how L3-L2 translation works there.

I never managed to understand how address resolving really works with iwarp... 

Doing a bit of detective work... you can see that addr4_resolve says

 /* If the device does ARP internally, return 'done' */
 if (rt->idev->dev->flags & IFF_NOARP) {
 rdma_copy_addr(addr, rt->idev->dev, NULL);
 goto put;
 }

and later cma_connect_iw places into the iwarp cm the src/dst IP addresses

 sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr;
 cm_id->local_addr = *sin;
 sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr;
 cm_id->remote_addr = *sin;

so all the iwarp providers do ARP resolving in their TOE stack?! Steve, can you
clarify that?

 
 Not sure what you do about UD.. Maybe RDMA-CM learns to do UC where
 the only action is to register notification monitors for L2 addressing
 changes in the kernel?

The problem exists for all IB transports (even for RD, if it had been 
implemented...). The only difference between the U and R ones is that for the 
R's, if the remote side vanished, eventually the IB HW would let you know about 
that in the form of a CQ error.

 Can this be hidden with Sean's recent work on simplified progamming models?

not sure how Sean's work relates to this proposed change.

Or.


[PATCH for-2.6.36] ib: fix some sparse warnings

2010-07-20 Thread Or Gerlitz
Fix the following issues reported by sparse under drivers/infiniband:

  CHECK   drivers/infiniband/hw/cxgb3/iwch_cm.c
iwch_cm.c:140:5: warning: symbol 'iwch_l2t_send' was not declared. Should it be 
static?
  CHECK   drivers/infiniband/hw/nes/nes_verbs.c
nes_verbs.c:1944:45: warning: Using plain integer as NULL pointer
nes_verbs.c:1944:48: warning: Using plain integer as NULL pointer
  CHECK   drivers/infiniband/hw/nes/nes_cm.c
nes_cm.c:2645:43: warning: mixing different enum types
nes_cm.c:2645:43: int enum iw_cm_event_type  versus
nes_cm.c:2645:43: int enum iw_cm_event_status
  CHECK   drivers/infiniband/ulp/iser/iser_initiator.c
iser_initiator.c:173:5: warning: symbol 'iser_alloc_rx_descriptors' was not 
declared. Should it be static?

Signed-off-by: Or Gerlitz ogerl...@voltaire.com


I didn't address these two

  CHECK   drivers/infiniband/hw/cxgb3/iwch_cq.c
drivers/infiniband/hw/cxgb3/iwch_cq.c:192:9: warning: context imbalance in 
'iwch_poll_cq_one' - different lock contexts for basic block
  CHECK   drivers/infiniband/hw/cxgb3/iwch_qp.c
drivers/infiniband/hw/cxgb3/iwch_qp.c:805:13: warning: context imbalance in 
'__flush_qp' - unexpected unlock

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c 
b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index ebfb117..3cdb535 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -137,7 +137,7 @@ static void stop_ep_timer(struct iwch_ep *ep)
put_ep(&ep->com);
 }

-int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry 
*l2e)
+static int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct 
l2t_entry *l2e)
 {
int error = 0;
struct cxio_rdev *rdev;
diff --git a/drivers/infiniband/hw/nes/nes_cm.c 
b/drivers/infiniband/hw/nes/nes_cm.c
index 986d6f3..98887af 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -2565,7 +2565,7 @@ static int nes_cm_disconn_true(struct nes_qp *nesqp)
u16 last_ae;
u8 original_hw_tcp_state;
u8 original_ibqp_state;
-   enum iw_cm_event_type disconn_status = IW_CM_EVENT_STATUS_OK;
+   enum iw_cm_event_status  disconn_status = IW_CM_EVENT_STATUS_OK;
int issue_disconn = 0;
int issue_close = 0;
int issue_flush = 0;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c 
b/drivers/infiniband/hw/nes/nes_verbs.c
index 9bc2d74..0df51a4 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1941,7 +1941,7 @@ static int nes_reg_mr(struct nes_device *nesdev, struct 
nes_pd *nespd,
u8  use_256_pbls = 0;
u8  use_4k_pbls = 0;
u16 use_two_level = (pbl_count_4k > 1) ? 1 : 0;
-   struct nes_root_vpbl new_root = {0, 0, 0};
+   struct nes_root_vpbl new_root = {0, NULL, NULL};
u32 opcode = 0;
u16 major_code;

diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c 
b/drivers/infiniband/ulp/iser/iser_initiator.c
index 0b9ef07..95a08a8 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -170,7 +170,7 @@ static void iser_create_send_desc(struct iser_conn  
*ib_conn,
 }


-int iser_alloc_rx_descriptors(struct iser_conn *ib_conn)
+static int iser_alloc_rx_descriptors(struct iser_conn *ib_conn)
 {
int i, j;
u64 dma_addr;


Re: knockdown voltaire switch with ARP multicast

2010-07-21 Thread Or Gerlitz

Bob Ciotti wrote:

Maybe someone on the voltaire side can help.
I'm working the issue now Wed Jul 21 00:34:14 PDT 2010

Hi Bob,

I understand that some folks from Voltaire are working with you directly.

Or.


Re: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Or Gerlitz
Josh England wrote:
 Do you think upgrading to OFED-1.5.1 would help at all?

It might help you diagnose the problem better. If you read through the
thread I pointed to (it's very short, four messages, less than two minutes),
you would see that Arthur is reporting on the lap_state and Sean is suggesting 
to use the IB CM sysfs counters to further debug this. I don't know if these 
counters exist in the IB stack used for the ofed drop you're using, but they 
should be in 1.5.x

Or.


Re: sense remote hardware address change by rdma-cm applications

2010-07-21 Thread Or Gerlitz
Steve Wise wrote:
 The cxgb3/4 drivers do not set IFF_NOARP and rely on ND being done as
 part of connection setup.  The driver will initiate ND if there isn't a
 neigh entry available at the time the iwarp driver tries to send a SYN or 
 SYN/ACK.  

Okay, understood, thanks for clarifying this.

 The cxgb* drivers actually reference the neigh and dst structs until the
 offload connection is gone.  Also if the the offloaded connection has
 problems transmitting (due to a L2 address change, for example), then
 the driver will initiate ND again by calling neigh_event_send().  See
 t4_l2t_send_event() in l2t.c which is called by the iwarp driver in
 peer_abort() from iwch_cm.c when the HW tells us its retransmitting too much.

In the general case of an rdma-cm consumer, e.g. IB RC based and/or UD unicast 
based, we don't have such a feedback mechanism from the HW. As such, I would 
draw the line here around adopting into the rdma-cm the behavior of referencing 
the neigh and dst structures until the connection is gone (could you point to 
the func/path in drivers/net/cxgb3/l2t.c which does this? I wasn't sure).

 What doesn't happen is active positive feedback during the connection to
 avoid NUD.  IE once the connection is setup, nobody calls dst_confirm()
 It is only called during connection setup/teardown.

I think we can live with that. This is similar to the case of an app using UDP 
in a uni-directional manner between host A --> B, where the NUD part of the 
network stack @ host A has to issue timely probes to validate the L2 address of 
host B. The only difference is that we have the A --> B comm offloaded, and 
eventually, without keeping the ref, the neighbour and dst are deleted; the 
proposed patch eliminates this deletion.

Or.


Re: sense remote hardware address change by rdma-cm applications

2010-07-21 Thread Or Gerlitz
Jason Gunthorpe wrote:
 I'm thinking something like this..
 - The RDMA CM gets the dst from its route lookup locks it and stores it.
 - Instead of doing a route lookup cxgb gets the dst from RDMA CM,
   locks it and stores it
 - RDMA CM traps all notifications/etc and generates callback to cxgb
   to say the dst has changed.
 - cxgb releases the old dst and grabs the new one, updates the HW, etc.


Jason,

I'm up for extending the rdma-cm address change event, on which an app can 
decide whether to react or not. For example, the in-tree iser and rds code 
treat this event the same as a disconnection request arriving, which means the 
higher layer (e.g. the user space iscsi daemon in the iser case) would try to 
re-connect. This has the advantage of simplifying the ULP state-machine, so 
there's no need for special handling of address-change; just treat it as a 
hint that re-connection is needed.

The cxgb* code takes this deeper, as it handles L2 changes at the driver level 
and not as an event delivered to the ULP, which can optionally address or ignore it.

Or.


Re: Issue with RDMA_CM on systems with multiple IB HCA's.

2010-07-22 Thread Or Gerlitz
Hari Subramoni wrote:
 [subra...@amd6 perftest]$ ./ib_rdma_bw -c 172.16.1.5
 11928: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | 
 duplex=0 | cma=1 |
 11928: Local address:  LID , QPN 00, PSN 0x5bfbba RKey 0x90042602 
 VAddr 0x002b27feabe000
 11928: Remote address: LID , QPN 00, PSN 0x392fe6, RKey 0xf8042605 
 VAddr 0x002b9d5c93b000


You can see the lid and qp numbers are zero, so something is broken... When you 
use the rdma-cm, the address provided to the utility should be on an IPoIB 
subnet; is that what you're doing?

Basically, I would suggest that you first use rping(1), provided by 
librdmacm-utils, to make sure things are working well in your configuration, 
and then move to the perftest utils.

Or.



Re: Issue with RDMA_CM on systems with multiple IB HCA's.

2010-07-23 Thread Or Gerlitz
Hari Subramoni subra...@cse.ohio-state.edu wrote:

 The nodes have LID's assigned to them and OpenSM is running fine.
 I've attached the configurations of the two hosts along with this e-mail.
  As Jonathan mentioned, we are able to ping between them.

are the two HCAs on each of the nodes connected to the same IB subnet?

 The issue is intermittent. It happens at times and at other times, things
 work fine. Please let us know if you need any more information.

Let's focus on rping. Please use both the -v and -d flags with rping; also,
when rping fails, please send the neighbours info (# ip neigh show)
from host .5

Or.


Re: CMA handler status code

2010-07-27 Thread Or Gerlitz
Eldad Zinger wrote:
 event.status = ib_event->param.sidr_rep_rcvd.status
 event.status = ib_event->param.rej_rcvd.reason
 event.status should be 0 for success, or negative value of generic error code.
 In that code, the error code is positive and do not comply with generic error 
 code.

Basically, I believe the fact that the status equals the reject reason for an 
rdma-cm reject event is known to the kernel developers who deal with the 
rdma-cm. Personally, I'm fine with it. We could document that, but currently 
there's no rdma-cm document under Documentation/infiniband which could hold 
this.
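
A consumer that wants to cope with both conventions could classify the value roughly as below. This is only a sketch under the reading discussed in this thread (zero for success, negative for a generic error code, positive for a transport specific value such as an IB CM reject reason); the sketch_* names are hypothetical and are not part of librdmacm or the kernel rdma-cm.

```c
#include <errno.h>

enum sketch_status_kind {
    SKETCH_STATUS_OK,         /* status == 0 */
    SKETCH_STATUS_ERRNO,      /* negative generic error code, e.g. -ETIMEDOUT */
    SKETCH_STATUS_TRANSPORT   /* positive, transport specific value */
};

/* Hypothetical helper: decide how an event status value should be read. */
static enum sketch_status_kind sketch_classify_status(int status)
{
    if (status == 0)
        return SKETCH_STATUS_OK;
    if (status < 0)
        return SKETCH_STATUS_ERRNO;
    return SKETCH_STATUS_TRANSPORT;
}
```

A handler would then interpret a SKETCH_STATUS_TRANSPORT value according to the event type, e.g. as a reject reason only for a rejected event.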

For user space, I would add a comment in the man pages

 In order to make the status field available for other modules (like
 SDP), that field should be format-consistent.

With SDP being out of tree for about four to six years (and counting), it is
somewhat hard to take into account claims related to it.

Or.


Re: CMA handler status code

2010-07-27 Thread Or Gerlitz
 For user space, I would add a comment in the man pages

[PATCH] librdmacm/man: document status field semantics for rejected event

document status being the IB reject reason for RDMA_CM_EVENT_REJECTED event

Signed-off-by: Or Gerlitz ogerl...@voltaire.com

diff --git a/man/rdma_get_cm_event.3 b/man/rdma_get_cm_event.3
index 79bf606..91317c4 100644
--- a/man/rdma_get_cm_event.3
+++ b/man/rdma_get_cm_event.3
@@ -126,7 +126,8 @@ Generated on the active side to notify the user that the 
remote server is
 not reachable or unable to respond to a connection request.
 .IP RDMA_CM_EVENT_REJECTED
 Indicates that a connection request or response was rejected by the remote
-end point.
+end point. Under Infiniband, the event status field contains the reject reason
+as provided by the IB CM.
 .IP RDMA_CM_EVENT_ESTABLISHED
 Indicates that a connection has been established with the remote end point.
 .IP RDMA_CM_EVENT_DISCONNECTED


[PATCH] ib/mlx4: add IB_CQ_REPORT_MISSED_EVENTS support

2010-07-27 Thread Or Gerlitz
enhance the cq arming code to support IB_CQ_REPORT_MISSED_EVENTS

Signed-off-by: Or Gerlitz ogerl...@voltaire.com



I noted that the IB_CQ_REPORT_MISSED_EVENTS flag was added in the same cycle 
as mlx4, and maybe because of this, mlx4 didn't implement the flag, which is 
used by IPoIB.

The patch is compile tested only, if the patch seems okay, I can conduct 
further testing.

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 5a219a2..4366811 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -755,6 +755,13 @@ int mlx4_ib_arm_cq(struct ib_cq *ibcq, enum 
ib_cq_notify_flags flags)
to_mdev(ibcq->device)->uar_map,
MLX4_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->uar_lock));
 
+   if (flags & IB_CQ_REPORT_MISSED_EVENTS) {
+   struct mlx4_cqe *cqe;
+   cqe = next_cqe_sw(to_mcq(ibcq));
+   if (cqe)
+   return 1;
+   }
+
return 0;
 }
 


Re: [PATCH] ib/mlx4: add IB_CQ_REPORT_MISSED_EVENTS support

2010-07-27 Thread Or Gerlitz
Eli Cohen wrote:
 returning 1 means that you must poll the CQ to avoid a race condition
 which is not true for mlx4. 

makes sense, thanks for clarifying that.
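
For reference, the consumer side pattern behind the flag can be modelled in userspace roughly as follows. This is a sketch only: struct sketch_cq and the sketch_* functions are hypothetical stand-ins, not the ib_verbs or mlx4 API, and the "late arrivals" parameter just simulates completions that race in between polling and arming.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a completion queue: a count of pending completions. */
struct sketch_cq {
    int pending;
    bool armed;
};

/* Drain everything currently queued; returns the number of completions handled. */
static int sketch_poll_all(struct sketch_cq *cq)
{
    int n = cq->pending;
    cq->pending = 0;
    return n;
}

/*
 * Arm the CQ. With report_missed set, return 1 when completions are already
 * queued, i.e. the caller must poll again to avoid the race.
 */
static int sketch_arm(struct sketch_cq *cq, bool report_missed)
{
    cq->armed = true;
    if (report_missed && cq->pending > 0)
        return 1;
    return 0;
}

/* Poll, arm, and poll again for as long as arming reports missed events. */
static int sketch_poll_and_arm(struct sketch_cq *cq, int late_arrivals)
{
    int handled = 0;

    do {
        handled += sketch_poll_all(cq);
        /* Model completions that arrive between polling and arming. */
        cq->pending += late_arrivals;
        late_arrivals = 0;
    } while (sketch_arm(cq, true) > 0);

    return handled;
}
```

In this model, a provider whose arming is race free, which is what Eli describes for mlx4, can simply return 0 from the arm step, and the loop then runs once.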

Or.


Re: CMA handler status code

2010-07-29 Thread Or Gerlitz
Hefty, Sean wrote:
 The original intent was to expose the transport specific status values to the 
 user, 
 rather than trying to map them.

Yes, this makes sense. Are you okay with documenting that, e.g. in the spirit of 
the patch I sent?

Or.


Re: InfiniBand/RDMA merge plans for 2.6.36

2010-08-05 Thread Or Gerlitz

Walukiewicz, Miroslaw wrote:

 Hello Roland,  What about a series from Aleksey Senin [...] And my patch 
 RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver 
 https://patchwork.kernel.org/patch/110252

Hi Mirek,

Reading your response @ http://marc.info/?l=linux-rdma&m=127954552519544 
to the comments made during the review, I was under the impression that 
you were going to try and modify the NES implementation. Isn't this the 
case any more?


Or.


Re: FW: [PATCH v2] rdma/ib_pack.h: add new bth opcodes

2010-08-09 Thread Or Gerlitz
Robert Pearson wrote:
 Several new opcodes have been added since the last time ib_pack.h was updated.
 These changes add them.
 +++ b/include/rdma/ib_pack.h
 +   IB_OPCODE_CN= 0x80,
 +   IB_OPCODE_XRC   = 0xA0,

Is this tied to some existing/new IBA 1.2 annex? A pointer would be appreciated.

Or.


Re: [PATCH v2] rdma/ib_pack.h: add new bth opcodes

2010-08-11 Thread Or Gerlitz
Bob Pearson wrote:
 My interest is supporting the rxe driver, a software implementation of
 the IB transport over Ethernet, [...] I spent a little time looking at
 trying to exploit congestion notification to see if it would be useful in 
 this context.

Hi Bob,

Since IB congestion control / notification relies on the IB switches marking
packets with FECNs, I don't see how IB CCA fits into the IBoE scheme. Paul?

Or.


Re: [PATCH v2] rdma/ib_pack.h: add new bth opcodes

2010-08-12 Thread Or Gerlitz

Bob Pearson wrote:
I was wondering if I could use this to cause ConnectX RDMAoE senders to slow down 
in response to these packets. There is a challenge managing fast ROCE senders 
in networks that may not fully implement per priority pause.
  

Hi Bob,

QCN (IEEE 802.1 based Ethernet congestion control mechanism) can apply 
for IBoE traffic, in the same manner it would for FCoE, IP etc. Is there 
a specific reason you wanted to apply the IB mechanism and not use the 
Ethernet one? Yep, PFC is helpful.


Or.


Re: [PATCH] RDMA/nes: double CLOSE event indication crash

2010-08-14 Thread Or Gerlitz
Faisal Latif wrote:
 During a stress testing in a large cluster, multiple close event is detected
 and BUG() is hit in core. The cause is [...]

Do you refer to the core of the IB stack? if not, to whose core?


Or.


Re: using same IP subnet on multiple interfaces (was: dual HCAs with upstream kernel)

2010-08-15 Thread Or Gerlitz
Hefty, Sean wrote:
 Does anyone have a system with multiple HCAs that's running a recent upstream 
 kernel?
 Oracle has reported a bug connecting between two HCAs in the same system 
 using the rdma_cm 

Sean,

With 2.6.35, I was hitting the reported failure (address error event, status
-ETIMEDOUT) with a simpler configuration: two ports belonging to the same HCA.
I used ucmatose and not rping, as the former allows specifying a local binding
whereas the latter doesn't (see below).

Next, I realized that a similar test with ping(8) doesn't work either: the arp
request was transmitted through one interface (ib0) and received on the other
(ib1), but no reply was generated. At this point, I thought that maybe one of
the arp-related sysctls could affect that, and I got an initial hit: following
commit 8153a10, once I set net.ipv4.conf.ib1.accept_local to 1, I could
# ping -I ib0 to ib1's address where before that I couldn't, and ucmatose
started working as well, no problem.

 commit 8153a10c08f1312af563bb92532002e46d3f504a
 Author: Patrick McHardy ka...@trash.net
 Date:   Thu Dec 3 01:25:58 2009 +
[...]
 Change fib_validate_source() to accept packets with a local source address when
 the accept_local sysctl is set for the incoming inet device. Combined with the
 previous patches, this allows to communicate between multiple local
 interfaces over the wire.

 # ip r s
 192.168.20.0/24 dev ib0  proto kernel  scope link  src 192.168.20.1
 192.168.20.0/24 dev ib1  proto kernel  scope link  src 192.168.20.100

Before net.ipv4.conf.ib1.accept_local is set to 1, ping doesn't work:

 # ping -I ib0 192.168.20.100 -q
 # PING 192.168.20.100 (192.168.20.100) from 192.168.20.1 ib0: 56(84) bytes of data.

 # tcpdump -ni ib0
 10:12:14.679101 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 10:12:15.679337 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

 # tcpdump -ni ib1
 10:13:35.798332 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 10:13:36.798569 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

 # ip n s
 192.168.20.100 dev ib0  INCOMPLETE

After net.ipv4.conf.ib1.accept_local is set to 1, ping (and ucmatose) work, but

 # tcpdump -ni ib0
 10:29:32.196866 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 10:29:32.197047 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
 10:29:32.197058 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64
 10:29:32.197125 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64
 10:29:33.197013 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 2, length 64

 # tcpdump -ni ib1
 10:29:32.196920 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 10:29:32.196944 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
 10:29:32.197029 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
 10:29:32.197136 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64
 10:29:33.197023 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 2, length 64
 10:29:34.197357 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 3, length 64

the echo requests go on the wire, the replies don't, probably (...) internally,
Patrick?

I noted that the neighbour entry on the NIC which is replying quickly goes stale
and is later aged out:

 # ip  n s
 192.168.20.100 dev ib0 lladdr 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8 REACHABLE
 192.168.20.1   dev ib1 lladdr 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e7 STALE
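
The neighbour states in a listing like this one can also be pulled out by a script, which helps when watching an entry go from REACHABLE to STALE. A minimal sketch, assuming `ip neigh show` prints one entry per line with the state as the last field, as in the listing here; `neigh_state` is a hypothetical helper name, not part of iproute2:

```shell
#!/bin/sh
# neigh_state IP DEV: read `ip neigh show` output on stdin and print the
# state (last field) of the entry matching the given address and device.
# The helper name is made up for this sketch.
neigh_state() {
    awk -v ip="$1" -v dev="$2" \
        '$1 == ip && $2 == "dev" && $3 == dev { print $NF }'
}

# Typical use on a live system:
#   ip neigh show | neigh_state 192.168.20.100 ib0
```

Matching on both the address and the device matters here, since with two interfaces on one subnet the same address could show up under either device.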

Or.

This is my related configuration. I tried changing rp_filter to 0, but it didn't
change things either.

 # sysctl -a | grep accept_local | grep ib[0,1]
 net.ipv4.conf.ib0.accept_local = 1
 net.ipv4.conf.ib1.accept_local = 1

 # sysctl -a | grep rp_ | grep ib[0,1]
 net.ipv4.conf.ib0.rp_filter = 1
 net.ipv4.conf.ib0.arp_filter = 0
 net.ipv4.conf.ib0.arp_announce = 0
 net.ipv4.conf.ib0.arp_ignore = 1
 net.ipv4.conf.ib0.arp_accept = 0
 net.ipv4.conf.ib0.arp_notify = 0
 net.ipv4.conf.ib0.proxy_arp_pvlan = 0
 net.ipv4.conf.ib1.rp_filter = 1
 net.ipv4.conf.ib1.arp_filter = 0
 net.ipv4.conf.ib1.arp_announce = 0
 net.ipv4.conf.ib1.arp_ignore = 1
 net.ipv4.conf.ib1.arp_accept = 0
 net.ipv4.conf.ib1.arp_notify = 0
 net.ipv4.conf.ib1.proxy_arp_pvlan = 0


Re: [PATCH] RDMA/nes: double CLOSE event indication crash

2010-08-16 Thread Or Gerlitz
Latif, Faisal wrote:
 BUG() was in iw_cm.ko in its close handler mentioned as core in my email and 
 caused by iw_nes.ko.

I see, looks like iwcm.c accounts for most of the BUG* calls made from the
core; it would be nice to reduce them over time.

Or.

 # grep -n BUG drivers/infiniband/core/*.c | grep "("
 drivers/infiniband/core/cma.c:1262: BUG_ON(1);
 drivers/infiniband/core/cm.c:1169:  BUG_ON(cm_id->state != IB_CM_IDLE);
 drivers/infiniband/core/cm.c:1318:  BUG_ON(!work);
 drivers/infiniband/core/device.c:175:   BUG_ON(size < sizeof (struct ib_device));
 drivers/infiniband/core/device.c:194:   BUG_ON(device->reg_state != IB_DEV_UNREGISTERED);
 drivers/infiniband/core/iwcm.c:120: BUG_ON(!list_empty(&cm_id_priv->work_free_list));
 drivers/infiniband/core/iwcm.c:163: BUG_ON(atomic_read(&cm_id_priv->refcount)==0);
 drivers/infiniband/core/iwcm.c:165: BUG_ON(!list_empty(&cm_id_priv->work_list));
 drivers/infiniband/core/iwcm.c:186: BUG_ON(!list_empty(&cm_id_priv->work_list));
 drivers/infiniband/core/iwcm.c:241: BUG_ON(qp == NULL);
 drivers/infiniband/core/iwcm.c:298: BUG();
 drivers/infiniband/core/iwcm.c:374: BUG();
 drivers/infiniband/core/iwcm.c:397: BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags));
 drivers/infiniband/core/iwcm.c:518: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV);
 drivers/infiniband/core/iwcm.c:583: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT);
 drivers/infiniband/core/iwcm.c:620: BUG_ON(iw_event->status);
 drivers/infiniband/core/iwcm.c:695: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV);
 drivers/infiniband/core/iwcm.c:723: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT);
 drivers/infiniband/core/iwcm.c:795: BUG();
 drivers/infiniband/core/iwcm.c:824: BUG();
 drivers/infiniband/core/iwcm.c:865: BUG_ON(atomic_read(&cm_id_priv->refcount)==0);
 drivers/infiniband/core/iwcm.c:869: BUG_ON(!list_empty(&cm_id_priv->work_list));
 drivers/infiniband/core/mad.c:587:  BUG_ON(!mad_list->mad_queue);
 drivers/infiniband/core/mad.c:1396: BUG_ON(!*method);
 drivers/infiniband/core/mad.c:1406: BUG_ON(*method);
 drivers/infiniband/core/mad.c:2242: BUG_ON(1);
 drivers/infiniband/core/verbs.c:91: BUG();
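
The observation that iwcm.c accounts for most of the hits can be checked by tallying the listing per file. A minimal sketch, assuming the `file:line:text` format that `grep -n` prints when given several files; `tally_by_file` is a hypothetical helper name:

```shell
#!/bin/sh
# tally_by_file: read "file:line:text" lines (grep -n output) on stdin and
# print "count file" per file, highest count first.
# The helper name is made up for this sketch.
tally_by_file() {
    awk -F: '{ n[$1]++ } END { for (f in n) printf "%d %s\n", n[f], f }' |
        sort -rn
}

# Typical use from the top of a kernel tree:
#   grep -n BUG drivers/infiniband/core/*.c | grep "(" | tally_by_file
```

Counting the entries of the listing by hand gives 17 of the 27 hits in iwcm.c, which is what such a tally would be expected to report on that tree.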




Re: using same IP subnet on multiple interfaces

2010-08-16 Thread Or Gerlitz
Jason Gunthorpe wrote:
 [...] The socket that is bound to a device will then use its device for 
 sending, 
 but other sockets not bound to devices will do route lookups and use the lo 
 device.
 Do: [...] To see the difference in each side.

sure, makes sense, the ping-reply code does a route lookup and will use the
loopback device.

I took a 2nd look at ping w.r.t. various sysctl states, and when rp_filter is
set to its default

 # sysctl -a | grep -wE "accept_local|rp_filter|arp_ignore" | grep ib
 net.ipv4.conf.ib0.rp_filter = 1
 net.ipv4.conf.ib0.accept_local = 1
 net.ipv4.conf.ib0.arp_ignore = 1
 net.ipv4.conf.ib1.rp_filter = 1
 net.ipv4.conf.ib1.accept_local = 1
 net.ipv4.conf.ib1.arp_ignore = 1

ping isn't working since there's no arp reply

 # ping -I ib0 192.168.20.100
 PING 192.168.20.100 (192.168.20.100) from 192.168.20.1 ib0: 56(84) bytes of data.
 From 192.168.20.1 icmp_seq=2 Destination Host Unreachable
 From 192.168.20.1 icmp_seq=3 Destination Host Unreachable
 From 192.168.20.1 icmp_seq=4 Destination Host Unreachable

 # tcpdump -ni ib0
 18:04:39.492306 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 18:04:40.492541 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

 # tcpdump -ni ib1
 18:04:42.497039 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 18:04:43.497268 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

Once I set net.ipv4.conf.ib1.rp_filter=0, arp replies are generated and ping
works as you explained: echo-request externally, echo-reply internally

 # tcpdump -ni ib1
 18:06:33.103248 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
 18:06:33.103281 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
 18:06:33.103369 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
 18:06:33.103461 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 26906, seq 1, length 64
 18:06:34.107465 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 26906, seq 2, length 64

Now, if I return rp_filter to 1, ping keeps working using the neighbour
previously created. ping even keeps working when I set
net.ipv4.conf.ib1.accept_local to 0, which is a bit weird unless this sysctl
acts at the neighbour level (i.e. controls arp replies and not every packet
xmit).

 To really effect a full external loopback you need to have both sides
 bound to their respective devices. Note that binding to a device and
 binding to a source IP are not the same thing in Linux.

Even without being fully into the details of what binding to a source IP
actually translates to, I understand there's a difference.

 In the RDMA CM case the listening side doesn't do any IP
 routing operations at all so a device bind isn't necessary.

Yes, indeed. As for the active side, the RDMA CM doesn't have a BINDTODEVICE 
equivalent.

As for the original issue we were discussing here, Sean - the conclusion is
that with upstream 2.6.35 bits, for the rdma connection to go from hca1 port1
to hca1 port2 (or from hca1 port1 to hca2 port1), the rdma-cm needs a
neighbour, similarly to a ping -I ib0 to ib1's address.

A neighbour isn't created unless the responding NIC (ib1 in my example) has
both rp_filter set to 0 and accept_local set to 1. Jason, does this make sense?
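
The two settings named in this conclusion can be written down as a small script. A minimal sketch that only prints the settings rather than applying them; `print_loopback_sysctls` is a hypothetical helper name, and ib1 is the responding interface from the example in this thread:

```shell
#!/bin/sh
# print_loopback_sysctls IFACE: print, in key=value form, the per-interface
# settings this thread found necessary on the responding interface.
# Nothing is applied; the helper name is made up for this sketch.
print_loopback_sysctls() {
    iface=$1
    printf 'net.ipv4.conf.%s.rp_filter=0\n' "$iface"
    printf 'net.ipv4.conf.%s.accept_local=1\n' "$iface"
}

print_loopback_sysctls ib1
```

Applying them needs root, for example `print_loopback_sysctls ib1 | xargs -n1 sysctl -w`. Printing first keeps the change reviewable, which seems worthwhile given that relaxing rp_filter loosens source validation on that interface.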

Or.


Re: using same IP subnet on multiple interfaces

2010-08-18 Thread Or Gerlitz

Jason Gunthorpe wrote:

  As for the original issue we were discussing here, the conclusion is that with
  upstream 2.6.35 bits for the rdma connection to go from hca1 port1 to hca1
  port2 (or from hca1 port1 to hca2 port1), the rdma-cm needs a neighbour,
  similarly to a ping -I ib0 to ib1 address. A neighbour isn't created unless the
  responding NIC (ib1 in my example) has both rp_filter set to 0 and accept_local
  set to 1, does this makes sense?

 This description seemed reasonable to me. It is pretty confusing what binding
 means in RDMA CM, it is different than sockets, and is some combination of
 SO_BINDTODEVICE and bind to address.

I was thinking that one of the things taken care of by the patch set to
addr.c/cma.c that you, David and Sean did last year was to make binding in
rdma-cm be bind-to-address by-the-book; in what aspect is it different
now?


Or.


Re: [PATCH v2] rdma/ib_pack.h: add new bth opcodes

2010-08-19 Thread Or Gerlitz
Bob Pearson wrote:
 I was curious to see if I could force a ConnectX device to slow down from a 
 remote application.
 But since the MADs have been crippled for IBOE there is no way to configure 
 it.

QP1 MADs are working for ConnectX, e.g. the IB CM is fully functional for IBoE,
and I don't think the mad layer was modified to emulate MADs for the CM over a
regular UD QP, UDP or the like. Eli, am I correct in that? For some reason the
PMA (QP1 performance counters) service isn't exposed, but it should be working
(and helpful) as well.

Or.



Re: net-next pull request: RDS

2010-09-12 Thread Or Gerlitz
Hi Andy, 

Some clarifications/questions from whatever quick look one can have over 107
patches...

Zach Brown's "RDS/IB: print IB event strings as well as their number" - commit
1bde04a63d532c2540d6fdee0a661530a62b1686 in net-next-2.6 - looks perfect to
reside as a helper function in the core IB stack, where it can be used by other
rdma drivers (e.g. ipoib, iser, srp, etc).

Chris Mason's "rds: recycle FMRs through lockless lists" -
6fa70da6081bbcf948801fd5ee0be4d98a43 - adds net/rds/xlist.h; isn't this
something that would be better placed under include/linux/?

And last, your "RDS/IB: add _to_node() macros for numa and use
{k,v}malloc_node()" patch looks interesting. 1st, it has some macros which
could be placed in more general locations, e.g. pcidev_to_node and
ibdev_to_node. Also, your "significantly helps performance" comment is
interesting; I'll send a separate note about that to the rdma mailing list.

Or.


Re: RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()

2010-09-12 Thread Or Gerlitz
Hi Andy, looking at this net-next-2.6 patch, I wonder if you can elaborate on
your "significantly helps performance" comment - what improvement do you see
with this patch?

What about the QP/CQ memory, would it be better to place it node-local to the
HCA as well?

Or.

commit e4c52c98e04937ea87b0979a81354d0040d284f9
Author: Andy Grover andy.gro...@oracle.com
Date:   Fri Apr 23 10:49:53 2010 -0700

RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()

Allocate send/recv rings in memory that is node-local to the HCA.
This significantly helps performance.

Signed-off-by: Andy Grover andy.gro...@oracle.com

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 7a2131d..7d289d7 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -77,7 +77,7 @@ void rds_ib_add_one(struct ib_device *device)
goto free_attr;
}
 
-   rds_ibdev = kmalloc(sizeof *rds_ibdev, GFP_KERNEL);
+   rds_ibdev = kmalloc_node(sizeof *rds_ibdev, GFP_KERNEL, ibdev_to_node(device));
if (!rds_ibdev)
goto free_attr;
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index c506604..4bc3e2f 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -3,6 +3,8 @@
 
 #include <rdma/ib_verbs.h>
 #include <rdma/rdma_cm.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
 #include "rds.h"
 #include "rdma_transport.h"
 
@@ -167,6 +169,10 @@ struct rds_ib_device {
spinlock_t  spinlock;   /* protect the above */
 };
 
+#define pcidev_to_node(pcidev) pcibus_to_node(pcidev->bus)
+#define ibdev_to_node(ibdev) pcidev_to_node(to_pci_dev(ibdev->dma_device))
+#define rdsibdev_to_node(rdsibdev) ibdev_to_node(rdsibdev->dev)
+
 /* bits for i_ack_flags */
 #define IB_ACK_IN_FLIGHT   0
 #define IB_ACK_REQUESTED   1
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 75eda9c..b5d0b60 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -347,7 +347,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
goto out;
}
 
-   ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));
+   ic->i_sends = vmalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
+  ibdev_to_node(dev));
if (!ic->i_sends) {
ret = -ENOMEM;
rdsdebug("send allocation failed\n");
@@ -355,7 +356,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
}
memset(ic->i_sends, 0, ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));
 
-   ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work));
+   ic->i_recvs = vmalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
+  ibdev_to_node(dev));
if (!ic->i_recvs) {
ret = -ENOMEM;
rdsdebug("recv allocation failed\n");
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 7315fff..cc341cd 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -297,7 +297,7 @@ static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev)
rds_ib_flush_mr_pool(pool, 0);
}
 
-   ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL);
+   ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, rdsibdev_to_node(rds_ibdev));
if (!ibmr) {
err = -ENOMEM;
goto out_no_cigar;
@@ -376,7 +376,8 @@ static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibm
if (page_cnt > fmr_message_size)
return -EINVAL;
 
-   dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC);
+   dma_pages = kmalloc_node(sizeof(u64) * page_cnt, GFP_ATOMIC,
+rdsibdev_to_node(rds_ibdev));
if (!dma_pages)
return -ENOMEM;
 


Re: [rds-devel] net-next pull request: RDS

2010-09-15 Thread Or Gerlitz
Andrew Grover wrote:
 Once net-next gets pushed to mainline and Roland pulls from that, 
 then we'll be in a good position to put these helpers where they should go, 
 and change other ULPs to use them.

Andy, as Roland commented, you can push such helpers through Dave once Roland
has reviewed them, in a similar manner to a situation where an iser patch is
merged by the iscsi maintainer, etc. If going to the review now... Roland -
what's your take on the below net-next-2.6 patch, and also on its net-next-2.6
buddy, "RDS/IB: print IB event strings as well as their number"
(1bde04a63d532c2540d6fdee0a661530a62b1686)?

Or.


commit 59f740a6aeb2cde2f79fe0df38262d4c1ef35cd8
Author: Zach Brown zach.br...@oracle.com
Date:   Tue Aug 3 13:52:47 2010 -0700

RDS/IB: print string constants in more places

This prints the constant identifier for work completion status and rdma
cm event types, like we already do for IB event types.

A core string array helper is added that each string type uses.

Signed-off-by: Zach Brown zach.br...@oracle.com

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 8e3886d..bb6ad81 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -40,6 +40,15 @@
 
 #include "rds.h"
 
+char *rds_str_array(char **array, size_t elements, size_t index)
+{
+   if ((index < elements) && array[index])
+   return array[index];
+   else
+   return "unknown";
+}
+EXPORT_SYMBOL(rds_str_array);
+
 /* this is just used for stats gathering :/ */
 static DEFINE_SPINLOCK(rds_sock_lock);
 static unsigned long rds_sock_count;
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 2189fd4..7ad3d57 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -345,6 +345,7 @@ u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest);
 extern wait_queue_head_t rds_ib_ring_empty_wait;
 
 /* ib_send.c */
+char *rds_ib_wc_status_str(enum ib_wc_status status);
 void rds_ib_xmit_complete(struct rds_connection *conn);
 int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
unsigned int hdr_off, unsigned int sg, unsigned int off);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 0e2fea8..bc3dbc1 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -39,7 +39,8 @@
 #include "ib.h"
 
 static char *rds_ib_event_type_strings[] = {
-#define RDS_IB_EVENT_STRING(foo) [IB_EVENT_##foo] = __stringify(foo)
+#define RDS_IB_EVENT_STRING(foo) \
+   [IB_EVENT_##foo] = __stringify(IB_EVENT_##foo)
RDS_IB_EVENT_STRING(CQ_ERR),
RDS_IB_EVENT_STRING(QP_FATAL),
RDS_IB_EVENT_STRING(QP_REQ_ERR),
@@ -63,11 +64,8 @@ static char *rds_ib_event_type_strings[] = {
 
 static char *rds_ib_event_str(enum ib_event_type type)
 {
-   if (type < ARRAY_SIZE(rds_ib_event_type_strings) &&
-   rds_ib_event_type_strings[type])
-   return rds_ib_event_type_strings[type];
-   else
-   return "unknown";
+   return rds_str_array(rds_ib_event_type_strings,
+ARRAY_SIZE(rds_ib_event_type_strings), type);
 };
 
 /*
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index a2f5f6f..e29e0ca 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -966,8 +966,9 @@ static inline void rds_poll_cq(struct rds_ib_connection *ic,
struct rds_ib_recv_work *recv;
 
while (ib_poll_cq(ic->i_recv_cq, 1, &wc) > 0) {
-   rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
-(unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+   rdsdebug("wc wr_id 0x%llx status %u (%s) byte_len %u imm_data %u\n",
+(unsigned long long)wc.wr_id, wc.status,
+rds_ib_wc_status_str(wc.status), wc.byte_len,
 be32_to_cpu(wc.ex.imm_data));
rds_ib_stats_inc(s_ib_rx_cq_event);
 
@@ -985,10 +986,11 @@ static inline void rds_poll_cq(struct rds_ib_connection *ic,
} else {
/* We expect errors as the qp is drained during shutdown */
if (rds_conn_up(conn) || rds_conn_connecting(conn))
-   rds_ib_conn_error(conn, "recv completion on "
- "%pI4 had status %u, disconnecting and "
+   rds_ib_conn_error(conn, "recv completion on %pI4 had "
+ "status %u (%s), disconnecting and "
  "reconnecting\n", &conn->c_faddr,
- wc.status);
+ wc.status,
+ rds_ib_wc_status_str(wc.status));
}
 
/*
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 15f7569..808544a 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -38,6 +38,40 @@
 #include rds.h
 #include 

Re: gratuitous arps lost during IB switch failure

2010-10-02 Thread Or Gerlitz
Sumeet sumeet.lahor...@oracle.com wrote:
 It turns out that this problem was being caused because we had multiple IPs
 configured on the bonded infiniband interface. It appears that grat. arps are
 being sent out for only one of those IPs. [...]  Can the bonding
 driver be fixed to send out grat arps for both these IPs?

is there anything that makes you think this issue has something to do
with ipoib/bonding? did you check with ethernet? the bonding driver
isn't maintained over linux-rdma but rather over netdev.

Or.


re: mlx4: propagate node_description changes down to FW

2010-10-03 Thread Or Gerlitz
Hi Jack, 

I just came across this patch of yours which was placed in ofed 1.5.2. I didn't
see any trace of it here @ linux-rdma (any specific reason for that?) - some
questions/issues to discuss -

1st and foremost, for (say) a 1k node cluster, is it correct that for each node
doing start/restart of the openibd service a trap will be sent to opensm and
the latter will do a heavy sweep?! This doesn't sound very scalable...
Have you tested it over large clusters? What was the impact?

Or.

mlx4: propagate node_description changes down to FW.

The Node Description cannot be changed via MADs (it is read-only).
Until now, it was changed in the driver, and the new Node Description
was simply overwritten by the driver on MAD responses.

The node description was modified in the driver by openibd via sysfs.
However, that generated a race condition, where OpenSM could get the
FW node description rather than the overwritten description if OpenSM
queried the device before openibd had a chance to enter the new description.

The solution is a new FW command (SET_NODE) which allows passing the
new node description to FW. When this command is invoked, FW issues
a 144 trap to OpenSM.  Upon receiving this trap, OpenSM initiates a
heavy sweep, thus updating the node description properly -- and eliminating
the race.

This patch works whether or not the new FW command is available.  If SET_NODE
is not available, things work as before.

Fixes FM82320

Signed-off-by: Jack Morgenstein ja...@dev.mellanox.co.il

Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c  2010-09-27 
17:20:54.069787000 +0200
+++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c   2010-09-27 
17:21:15.07481 +0200
@@ -421,14 +421,34 @@ out:
 static int mlx4_ib_modify_device(struct ib_device *ibdev, int mask,
 struct ib_device_modify *props)
 {
+   struct mlx4_cmd_mailbox *mailbox;
+   int err;
+
if (mask & ~IB_DEVICE_MODIFY_NODE_DESC)
return -EOPNOTSUPP;
 
-   if (mask & IB_DEVICE_MODIFY_NODE_DESC) {
-   spin_lock(&to_mdev(ibdev)->sm_lock);
-   memcpy(ibdev->node_desc, props->node_desc, 64);
-   spin_unlock(&to_mdev(ibdev)->sm_lock);
-   }
+   if (!(mask & IB_DEVICE_MODIFY_NODE_DESC))
+   return 0;
+
+   spin_lock(&to_mdev(ibdev)->sm_lock);
+   memcpy(ibdev->node_desc, props->node_desc, 64);
+   spin_unlock(&to_mdev(ibdev)->sm_lock);
+
+   /* if possible, pass node desc to FW, so it can generate
+* a 144 trap. If cmd fails, just ignore.
+*/
+   mailbox = mlx4_alloc_cmd_mailbox(to_mdev(ibdev)->dev);
+   if (IS_ERR(mailbox))
+   return 0;
+
+   memset(mailbox->buf, 0, 256);
+   memcpy(mailbox->buf, props->node_desc, 64);
+   err = mlx4_cmd(to_mdev(ibdev)->dev, mailbox->dma, 1, 0,
+  MLX4_CMD_SET_NODE, MLX4_CMD_TIME_CLASS_A);
+   if (err)
+   mlx4_ib_dbg("SET_NODE command failed (%d)", err);
+
+   mlx4_free_cmd_mailbox(to_mdev(ibdev)->dev, mailbox);
 
return 0;
 }
Index: ofed_kernel/include/linux/mlx4/cmd.h
===
--- ofed_kernel.orig/include/linux/mlx4/cmd.h   2010-09-27 17:20:40.519054000 
+0200
+++ ofed_kernel/include/linux/mlx4/cmd.h2010-09-27 17:21:15.081799000 
+0200
@@ -58,6 +58,7 @@ enum {
MLX4_CMD_SENSE_PORT  = 0x4d,
MLX4_CMD_HW_HEALTH_CHECK = 0x50,
MLX4_CMD_SET_PORT= 0xc,
+   MLX4_CMD_SET_NODE= 0x5a,
MLX4_CMD_ACCESS_DDR  = 0x2e,
MLX4_CMD_MAP_ICM = 0xffa,
MLX4_CMD_UNMAP_ICM   = 0xff9,
Index: ofed_kernel/drivers/net/mlx4/cmd.c
===
--- ofed_kernel.orig/drivers/net/mlx4/cmd.c 2010-09-27 17:20:32.995814000 
+0200
+++ ofed_kernel/drivers/net/mlx4/cmd.c  2010-09-27 17:21:15.088792000 +0200
@@ -242,8 +242,11 @@ static int mlx4_cmd_poll(struct mlx4_dev
  __raw_readl(hcr + HCR_OUT_PARAM_OFFSET + 4));
stat = be32_to_cpu((__force __be32) __raw_readl(hcr + HCR_STATUS_OFFSET)) >> 24;
err = mlx4_status_to_errno(stat);
-   if (err)
-   mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n", op, stat);
+   if (err) {
+   if (op != MLX4_CMD_SET_NODE || stat != CMD_STAT_BAD_OP)
+   mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n",
+op, stat);
+   }
 
 out:
up(&priv->cmd.poll_sem);
@@ -296,8 +299,9 @@ static int mlx4_cmd_wait(struct mlx4_dev
 
err = context->result;
if (err) {
-   mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n",
-op, context->fw_status);
+   if (op != MLX4_CMD_SET_NODE || 

Re: mlx4: propagate node_description changes down to FW

2010-10-04 Thread Or Gerlitz

Jack Morgenstein wrote:
 I have not yet submitted the patch to the list.

sounds like it's about time to do that... could you send this for
review/merge into 2.6.37?

From what was commented here and further looking, the sentence "[...]
Upon receiving this trap, OpenSM initiates a heavy sweep, thus updating
the node description properly [...]" isn't accurate. I suggest changing
it to something like "Upon receiving this trap, OpenSM issues
SubnGet(NodeDescription) to the node that sent the trap, thus updating
the node description properly". Also, I guess "Fixes FM82320" isn't
meaningful for the upstream change log...

Or.



Re: Maximum size for memory registration (ibv_reg_mr)

2010-10-07 Thread Or Gerlitz

Eli Cohen wrote:
 If you create a MR in kernel, it covers the entire address space and
 the HCA does not pose any limit since you do not consume MTTs. And if
 you use MTTs then the page size is a parameter in this calculation -
 huge page, regular page etc.

I agree that the kernel case is not of large interest, even though what
you wrote only applies to a dma mr; when some FMR scheme is used, MTTs
are consumed, of course. But typically kernel code will not go to the
order of gigabytes, in other words it will not hit the HCA limit.

 So leaving it as is seems to be the most accurate thing...

I would implement it for regular pages and drop a note in the libibverbs
man page that if huge pages are used (well, the huge pages patch set
isn't fully merged, maybe it's about time to make this happen...) then
the actual limit is bigger, e.g. follows the proportion between the
regular and the huge pages used.
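
The proportion mentioned here can be made concrete with a little arithmetic. A minimal sketch, assuming one MTT entry maps exactly one page, so the bound is the entry count times the page size; the entry count of 2^20 is a made-up figure for illustration and `max_mr_bytes` is a hypothetical helper name, neither taken from any HCA documentation:

```shell
#!/bin/sh
# max_mr_bytes MTT_ENTRIES PAGE_SIZE: upper bound, in bytes, on the memory
# that can be registered when each MTT entry maps one page of PAGE_SIZE.
# Both the one-entry-per-page model and the helper name are assumptions.
max_mr_bytes() {
    echo $(( $1 * $2 ))
}

# Hypothetical device with 2^20 MTT entries:
max_mr_bytes 1048576 4096       # regular 4 KiB pages
max_mr_bytes 1048576 2097152    # 2 MiB huge pages
```

Under this model the huge-page bound is larger by the ratio of the page sizes, a factor of 512 for 2 MiB over 4 KiB pages, which is the kind of note the man page could carry.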


Or.


Re: [PATCHv10 0/11] IBoE support to Infiniband

2010-10-07 Thread Or Gerlitz
Eli Cohen wrote:
 We have successfully tested MPI, SDP, RDS, and native Verbs applications over 
 IBoE.

I came across your ofed commit e5414cccaa13e6dd80d8d6fc3dafe95355facdef
"sdp: module parameter to disable SDP over ROCEE" and wasn't sure what's
behind it. Can you clarify?

Or.



Re: [PATCHv10 0/11] IBoE support to Infiniband

2010-10-10 Thread Or Gerlitz

Amir Vadai wrote:

 It is from the days when SDP over RoCE wasn't stable. In addition,
 customers had a very long delay before a TCP connection was established in
 the following scenario:
    1. in libsdp.conf, setting mode to 'both' (try SDP and fall back to TCP)
    2. application tries to open a socket to a remote peer connected using 10G
       ethernet
    3. remote host doesn't support RoCE.
 It took a few seconds until the CMA gave up trying to connect, the SDP
 connection failed, and the TCP connection was established.

Thanks for the clarification re the remote host not supporting IBoE. Anyway,
I don't see if/what this has to do with SDP stability; it's just a delay
in connection establishment, and you say the default for this param is to
be changed to off.


Or.


Re: [PATCH] mlx4: Limit num of fast reg WRs

2010-10-11 Thread Or Gerlitz
Eli Cohen e...@dev.mellanox.co.il wrote:
 Fix the limit of max fast registration WRs that can be posted to CX to match
 hardware capabilities.

Guys, can you clarify whether the hardware limitation is 511 entries, or
(PAGE_SIZE / sizeof(pointer)) - 1, which is 4096 / 8 - 1 = 511 but can
change if the page size gets bigger or smaller?

Or.
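
The arithmetic in the question can be sketched as below. fast_reg_limit is
a hypothetical helper, not an mlx4 function, and whether the hardware limit
really follows this formula is exactly what is being asked.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of entry_size-byte page-list entries that fit
 * in one page of page_size bytes, minus one. This only reproduces the
 * formula from the question; it is not taken from the driver.
 */
size_t fast_reg_limit(size_t page_size, size_t entry_size)
{
	return page_size / entry_size - 1;
}
```

With 4 KB pages and 8-byte pointers this gives 511. With 64 KB pages it
would give 8191, which is why a fixed 511 and the formula are not
interchangeable.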


Re: Work completions generated after a queue pair has made the transition to an error state

2010-10-12 Thread Or Gerlitz
Bart Van Assche bvanass...@acm.org wrote:
 Has anyone been looking into this before ?

nope, never ever, what hca is that?

Or.


Re: Work completions generated after a queue pair has made the transition to an error state

2010-10-13 Thread Or Gerlitz

Eli Cohen wrote:
  Completions with non-zero (error) status and a wr_id / opcode
  combination were received that were never queued by the application.

 In case of error the opcode of the completed operation is not provided. I am
 not sure why.

Eli, there's nothing in the IB spec that mandates the WC.opcode of an
unsuccessful work request to be valid. The only WC fields that must be
valid are the work-request ID (cookie) and the status code; I believe
that hardware vendors would also make sure to have the vendor id valid...

Bart, reading your initial posting, I was under the impression that the
wr_id is something your app didn't post, so in that respect I take back
my response. And of course, when you program to IB you can't assume
anything about the WC.opcode of an errored WR.


Or.
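
A minimal sketch of what this means for a consumer: remember the opcode at
post time, keyed by wr_id, and never read it back from a failed completion.
The types below are simplified stand-ins I made up for illustration; they
are not the kernel's struct ib_wc or its enums.

```c
#include <stdint.h>

/* Simplified stand-ins for the verbs types; names are illustrative only. */
enum wc_status { WC_SUCCESS = 0, WC_WR_FLUSH_ERR };
enum wc_opcode { WC_SEND, WC_RDMA_WRITE, WC_RDMA_READ, WC_RECV };

struct wc {
	uint64_t wr_id;
	enum wc_status status;
	enum wc_opcode opcode;	/* meaningful only when status is success */
};

/* Per-request bookkeeping owned by the application, indexed by wr_id. */
struct request {
	enum wc_opcode posted_opcode;
	int done;
	int failed;
};

/*
 * Returns the opcode to act on: the one reported in the completion on
 * success, the one remembered at post time on error.
 */
enum wc_opcode handle_wc(const struct wc *wc, struct request *reqs)
{
	struct request *req = &reqs[wc->wr_id];

	req->done = 1;
	if (wc->status != WC_SUCCESS) {
		req->failed = 1;
		return req->posted_opcode;	/* ignore wc->opcode */
	}
	return wc->opcode;
}
```

The design choice is simply that wr_id and status are the only inputs
trusted on the error path.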





Note: some work requests were queued with and some without the flag
IB_SEND_SIGNALED. I'm not sure however whether that has anything to do
with the observed behavior.

If you have WRs for which you did not set IB_SEND_SIGNALED, they are
not considered completed before a completion entry is pushed to the CQ
that corresponds to that send queue. I am not sure if it means that all
the WR in the send queue should be completed with error.

This behavior is easy to reproduce. If I interpret the InfiniBand
Architecture Specification correctly, this behavior is non-compliant.

Has anyone been looking into this before ?

I haven't seen it. It isn't supposed to happen.

What hardware and software are you using and how do you
reproduce it?

Hello Ralph and Or,

The way I reproduce that behavior is by modifying the state of a queue
pair into IB_QPS_ERR while RDMA is ongoing. The application, which is
multithreaded, performs RDMA by calling ib_post_recv() and
ib_post_send() (opcodes IB_WR_SEND, IB_WR_RDMA_READ and
IB_WR_RDMA_WRITE). This has been observed with the mlx4 driver, a
ConnectX HCA and firmware version 2.7.0.

Bart.


Re: [PATCHv10 08/12] mlx4: Add support for IBoE - address resolution

2010-10-22 Thread Or Gerlitz
Eli Cohen e...@dev.mellanox.co.il wrote:
 [...] Address resolution is done atomically in the
 case of a link local address or a multicast GID and otherwise -EINVAL is
 returned.  mlx4 transport packets were changed too to accommodate for IBoE.
 Multicast groups attach/detach calls dev_mc_add/remove to update the NIC's
 multicast filters.

This change log, and I assume the patch as well, deals a lot with
multicast. However, patch 0/10 says "With these patches, IBoE
multicast frames may be broadcast as there is currently no use of
a L2 multicast group membership protocol." Does this mean some/much
of the code added/changed by this patch is dead code or not needed
at this point?

Or.


Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port

2010-10-25 Thread Or Gerlitz
Eli Cohen wrote:
 On Sun, Oct 24, 2010 at 6:22 PM, Roland Dreier rdre...@cisco.com wrote:
   No you did not. It was there already but we never noticed before Yossi's 
 patch.
 But AFAICT Yossi's patch (5eb620c8) went into 2.6.22 about 2.5 years
 ago... wasn't that already there way before the IBoE stuff started?

 I see... I think the reason it started failing comes from this portion of 
 patch 8:

I pulled/built/booted with the for-next branch of Roland's tree, and I can't
get IB link for the node. I don't think this is my problem, since I'm on L2 IB
and not Eth, but should this work with pre-2.7 firmware?! If not, maybe patch
the mlx4 driver to print some error.

 # ibv_devinfo
 hca_id: mlx4_0
 transport:  InfiniBand (0)
 fw_ver: 2.6.818
 node_guid:  0002:c903:0002:6be2
 sys_image_guid: 0002:c903:0002:6be5
 vendor_id:  0x02c9
 vendor_part_id: 26418
 hw_ver: 0xA0
 board_id:   MT_0A50110002
 phys_port_cnt:  2
 port:   1
 state:  PORT_INIT (2)
 max_mtu:2048 (4)
 active_mtu: 2048 (4)
 sm_lid: 0
 port_lid:   0
 port_lmc:   0x00
 
 port:   2
 state:  PORT_INIT (2)
 max_mtu:2048 (4)
 active_mtu: 2048 (4)
 sm_lid: 0
 port_lid:   0
 port_lmc:   0x00

 # dmesg
 mlx4_core: Mellanox ConnectX core driver v0.01 (May 1, 2007)
 mlx4_core: Initializing :0b:00.0
 mlx4_core :0b:00.0: PCI INT A - GSI 30 (level, low) - IRQ 30
 mlx4_core :0b:00.0: setting latency timer to 64
 mlx4_core :0b:00.0: FW version 2.6.818 (cmd intf rev 3), max commands 16
 mlx4_core :0b:00.0: Catastrophic error buffer at 0x1f020, size 0x10, BAR 0
 mlx4_core :0b:00.0: FW size 385 KB
 mlx4_core :0b:00.0: Clear int @ f0058, BAR 0
 mlx4_core :0b:00.0: Mapped 26 chunks/6168 KB for FW.
 mlx4_core :0b:00.0: BlueFlame available (reg size 512, regs/page 256)
 mlx4_core :0b:00.0: Base MM extensions: flags 0cc0, rsvd L_Key 0500
 mlx4_core :0b:00.0: Max ICM size 4294967296 MB
 mlx4_core :0b:00.0: Max QPs: 16777216, reserved QPs: 64, entry size: 256
 mlx4_core :0b:00.0: Max SRQs: 16777216, reserved SRQs: 64, entry size: 128
 mlx4_core :0b:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 128
 mlx4_core :0b:00.0: Max EQs: 512, reserved EQs: 4, entry size: 128
 mlx4_core :0b:00.0: reserved MPTs: 16, reserved MTTs: 16
 mlx4_core :0b:00.0: Max PDs: 8388608, reserved PDs: 4, reserved UARs: 1
 mlx4_core :0b:00.0: Max QP/MCG: 8388608, reserved MGMs: 0
 mlx4_core :0b:00.0: Max CQEs: 4194304, max WQEs: 16384, max SRQ WQEs: 16384
 mlx4_core :0b:00.0: Local CA ACK delay: 15, max MTU: 4096, port width cap: 3
 mlx4_core :0b:00.0: Max SQ desc size: 1008, max SQ S/G: 62
 mlx4_core :0b:00.0: Max RQ desc size: 512, max RQ S/G: 32
 mlx4_core :0b:00.0: Max GSO size: 131072
 mlx4_core :0b:00.0: DEV_CAP flags:
 mlx4_core :0b:00.0: RC transport
 mlx4_core :0b:00.0: UC transport
 mlx4_core :0b:00.0: UD transport
 mlx4_core :0b:00.0: XRC transport
 mlx4_core :0b:00.0: FCoIB support
 mlx4_core :0b:00.0: SRQ support
 mlx4_core :0b:00.0: IPoIB checksum offload
 mlx4_core :0b:00.0: P_Key violation counter
 mlx4_core :0b:00.0: Q_Key violation counter
 mlx4_core :0b:00.0: Big LSO headers
 mlx4_core :0b:00.0: APM support
 mlx4_core :0b:00.0: Atomic ops support
 mlx4_core :0b:00.0: Address vector port checking support
 mlx4_core :0b:00.0: UD multicast support
 mlx4_core :0b:00.0: Router support
 mlx4_core :0b:00.0: IBoE support
 mlx4_core :0b:00.0:   profile[ 0] (  CMPT): 2^26 entries @ 0x 0, size 0x 1
 mlx4_core :0b:00.0:   profile[ 1] (RDMARC): 2^21 entries @ 0x 1, size 0x   400
 mlx4_core :0b:00.0:   profile[ 2] (   MTT): 2^20 entries @ 0x 10400, size 0x   400
 mlx4_core :0b:00.0:   profile[ 3] (QP): 2^17 entries @ 0x 10800, size 0x   200
 mlx4_core :0b:00.0:   profile[ 4] (  ALTC): 2^17 entries @ 0x 10a00, size 0x80
 mlx4_core :0b:00.0:   profile[ 5] (   SRQ): 2^16 entries @ 0x 10a80, size 0x80
 mlx4_core :0b:00.0:   profile[ 6] (CQ): 2^16 entries @ 0x 10b00, size 0x80
 mlx4_core :0b:00.0:   

Re: can't get IB link with the for-next branch / IBoE patches (was mlx4: Fix unneeded return error...)

2010-10-25 Thread Or Gerlitz
 I pulled/built/booted with the for-next branch of Roland's tree, and I can't 
 get IB link for the node, 
 I don't think this is my problem, since I'm on L2 IB and not Eth, but should 
 this work with pre 2.7 
 firmware?! if not, maybe patch the mlx4 driver to print some error,

Okay, I verified that with 2.6.36 this node gets IB link and IPoIB is working
fine, so it must be something in or related to the for-next branch; I assume
it is around the IBoE patches that touch mlx4. With 2.6.36 I also see the

 awk: /etc/ofed/setup-mlx4.awk:6: (FILENAME=/etc/ofed/mlx4.conf FNR=21) fatal:
 cannot open file `/sbin/setup-mlx4' for reading (No such file or directory)

warning when loading mlx4_ib, but it isn't disruptive in the sense that the
node works fine, IB-wise.


Or.


Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port

2010-10-25 Thread Or Gerlitz
On Mon, Oct 25, 2010 at 1:34 PM, Eli Cohen e...@dev.mellanox.co.il wrote:
 IBoE will not work with firmware prior to 2.7.000. I don't think an
 error message is required in this case.

But I'm on **IB**, not IBoE. You don't mean that the Linux kernel IB stack
is not functional over pre-2.7 firmware with the IBoE patches, do you?!


Re: can't get IB link with the for-next branch / IBoE patches

2010-10-25 Thread Or Gerlitz
On Mon, Oct 25, 2010 at 6:17 PM, Eli Cohen e...@dev.mellanox.co.il wrote:
 On Mon, Oct 25, 2010 at 06:36:43AM -0700, Roland Dreier wrote:

 I suspect I broke either the UD header packing or the build_mlx_header
 function when I cleaned up the patches.  I see the same problem, I'll
 take a look today.

 I think this will fix things up. The + operator has precedence over
 the ? operator so we end up with packet_length equal IB_GRH_BYTES / 4
 which is wrong.

Once you guys feel you have a fix, I would be happy to give Roland's
for-next bits some further basic kernel testing (e.g. IB link up/down, IPoIB,
running SM on a node with IBoE patches) and a bit of more advanced testing
(e.g. IB/iSER, IB/RDS [Andy]) to see that things are in place with L2 IB.
I would also recommend that the iWARP folks do the same, as the
addr/rdma-cm modules were also modified.

The merge window still has about 9 days, so we're okay with delaying
the push by 1-2 days. Thoughts, people?

Or.
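
For anyone following along, the precedence problem Eli describes can be
reproduced in isolation. This is a simplified sketch, not the actual UD
header packing code: the function names and the base argument are made up,
and only the value 40 for the GRH size matches the kernel's IB_GRH_BYTES.

```c
/* GRH size in bytes; same value as the kernel's IB_GRH_BYTES. */
#define GRH_BYTES 40

/*
 * Buggy form: '+' binds tighter than '?:', so this is parsed as
 * (base + grh_present) ? GRH_BYTES / 4 : 0 and base is lost.
 */
int packet_length_buggy(int base, int grh_present)
{
	return base + grh_present ? GRH_BYTES / 4 : 0;
}

/* Intended form: the conditional selects only the GRH contribution. */
int packet_length_fixed(int base, int grh_present)
{
	return base + (grh_present ? GRH_BYTES / 4 : 0);
}
```

With any non-zero base the buggy version returns GRH_BYTES / 4 whether or
not a GRH is present, which matches the symptom described above.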


Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port

2010-10-25 Thread Or Gerlitz
On Mon, Oct 25, 2010 at 4:36 PM, Eli Cohen e...@dev.mellanox.co.il wrote:
 Of course not. I just noticed that the IB link for IB link layer does
 come up, is that what you're seeing?

No, I didn't have IB Link when I used the for-next bits


Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port

2010-10-25 Thread Or Gerlitz
On Mon, Oct 25, 2010 at 7:13 PM, Eli Cohen e...@dev.mellanox.co.il wrote:
 On Mon, Oct 25, 2010 at 06:46:39PM +0200, Or Gerlitz wrote:
 No, I didn't have IB Link when I used the for-next bits
 Can you summarize what is the problem that you're seeing?

Eli, this is pretty simple, I do the following
1. pull/build/boot Roland's for-next
2. modprobe mlx4_ib
-- the port state is INIT forever, is that clear?

Or.


Re: can't get IB link with the for-next branch / IBoE patches

2010-10-26 Thread Or Gerlitz
Roland Dreier wrote:
 Yep, looks like that's where my cleanup broke things.  I rolled this in
 and pushed it out; I'm testing it myself now.
 
 My IB port comes to active now, I think that fixed things.

same here, I have IB port coming to active and basic IPoIB, opensm working okay
on the node with the current for-next/IBoE bits

Or.


Re: can't get IB link with the for-next branch / IBoE patches

2010-10-26 Thread Or Gerlitz
 I have IB port coming to active and basic IPoIB, opensm working okay
 on the node with the current for-next/IBoE bits

Doing a little bit of stress testing, I came across the oops below when running
IPoIB and a couple of iperf/UDP sessions; it doesn't look like a problem in the
IB stack.

Also with RDS, using rds-stress from rds-tools-1.5-1.el5 and

 rds-stress -s 192.168.20.18 -p 4000 -t 1 -q 1K -a 1K -D 1M

on the client side, the node running the for-next/IBoE bits and acting as the
passive side of the test hung. Here too, this could be a bug in RDS and not in
the IBoE patches; I know that the RDS guys queued about a hundred (!) patches
for 2.6.37, so with these patches things might be better. I have the oops trace
as a jpg and will send it to Andy, Roland and Eli. I guess we can continue these
tests for 2-3 days and have the push over the weekend, or push it before and
get fixes, if needed, through the -rc cycle.


Oct 26 12:36:30 nsg2 kernel: BUG: spinlock bad magic on CPU#0, iperf/20845
Oct 26 12:36:30 nsg2 kernel:  lock: 81663ef8, .magic: , .owner: 
none/-1, .owner_cpu: 0
Oct 26 12:36:30 nsg2 kernel: Pid: 20845, comm: iperf Not tainted 
2.6.36-rc5-42052-gce806e1 #1
Oct 26 12:36:30 nsg2 kernel: Call Trace:
Oct 26 12:36:30 nsg2 kernel:  [811542b7] ? do_raw_spin_lock+0x22/0x122
Oct 26 12:36:30 nsg2 kernel:  [81268b2b] ? dev_queue_xmit+0x10d/0x346
Oct 26 12:36:30 nsg2 kernel:  [8128ca13] ? 
ip_push_pending_frames+0x2bf/0x318
Oct 26 12:36:30 nsg2 kernel:  [812a7e66] ? 
udp_push_pending_frames+0x2d2/0x351
Oct 26 12:36:30 nsg2 kernel:  [812a970c] ? udp_sendmsg+0x4b0/0x59c
Oct 26 12:36:30 nsg2 kernel:  [8112e9f7] ? cap_socket_sendmsg+0x0/0x3
Oct 26 12:36:30 nsg2 kernel:  [812e7d8e] ? common_interrupt+0xe/0x13
Oct 26 12:36:30 nsg2 kernel:  [8112e9f7] ? cap_socket_sendmsg+0x0/0x3
Oct 26 12:36:30 nsg2 kernel:  [81256bbb] ? sock_aio_write+0xf5/0x10d
Oct 26 12:36:30 nsg2 kernel:  [810029ae] ? 
reschedule_interrupt+0xe/0x20
Oct 26 12:36:30 nsg2 kernel:  [812e7d8e] ? common_interrupt+0xe/0x13
Oct 26 12:36:30 nsg2 kernel:  [812e7d8e] ? common_interrupt+0xe/0x13
Oct 26 12:36:30 nsg2 kernel:  [810b9b49] ? do_sync_write+0xab/0xeb
Oct 26 12:36:30 nsg2 kernel:  [812e7abf] ? 
_raw_spin_unlock_irq+0x9/0xd
Oct 26 12:36:30 nsg2 kernel:  [8112e83f] ? 
security_file_permission+0x18/0x6b
Oct 26 12:36:30 nsg2 kernel:  [810ba1f7] ? vfs_write+0xbe/0x132
Oct 26 12:36:30 nsg2 kernel:  [810ba754] ? sys_write+0x45/0x6e
Oct 26 12:36:30 nsg2 kernel:  [81001e6b] ? 
system_call_fastpath+0x16/0x1b


Re: can't get IB link with the for-next branch / IBoE patches

2010-10-26 Thread Or Gerlitz
 doing a little bit stress testing, I came across the below oops, when running 
 IPoIB
 and couple of iperf/udp sessions, it doesn't look like a problem in the IB 
 stack.

To trigger this I ran the following from the client node:

 iperf -uc 192.168.21.18 -l 64000 -t 72000 -i 1 -b 40g -d -P 4

where the server node (21.18 here) was the one that has the IBoE patches
and got this oops.

 Oct 26 12:36:30 nsg2 kernel: BUG: spinlock bad magic on CPU#0, iperf/20845
 Oct 26 12:36:30 nsg2 kernel:  lock: 81663ef8, .magic: , 
 .owner: none/-1, .owner_cpu: 0
 Oct 26 12:36:30 nsg2 kernel: Pid: 20845, comm: iperf Not tainted 
 2.6.36-rc5-42052-gce806e1 #1
 Oct 26 12:36:30 nsg2 kernel: Call Trace:
 Oct 26 12:36:30 nsg2 kernel:  [811542b7] ? 
 do_raw_spin_lock+0x22/0x122
 Oct 26 12:36:30 nsg2 kernel:  [81268b2b] ? 
 dev_queue_xmit+0x10d/0x346
 Oct 26 12:36:30 nsg2 kernel:  [8128ca13] ? 
 ip_push_pending_frames+0x2bf/0x318
 Oct 26 12:36:30 nsg2 kernel:  [812a7e66] ? 
 udp_push_pending_frames+0x2d2/0x351
 Oct 26 12:36:30 nsg2 kernel:  [812a970c] ? udp_sendmsg+0x4b0/0x59c
 Oct 26 12:36:30 nsg2 kernel:  [8112e9f7] ? 
 cap_socket_sendmsg+0x0/0x3
 Oct 26 12:36:30 nsg2 kernel:  [812e7d8e] ? common_interrupt+0xe/0x13
 Oct 26 12:36:30 nsg2 kernel:  [8112e9f7] ? 
 cap_socket_sendmsg+0x0/0x3
 Oct 26 12:36:30 nsg2 kernel:  [81256bbb] ? sock_aio_write+0xf5/0x10d
 Oct 26 12:36:30 nsg2 kernel:  [810029ae] ? 
 reschedule_interrupt+0xe/0x20
 Oct 26 12:36:30 nsg2 kernel:  [812e7d8e] ? common_interrupt+0xe/0x13
 Oct 26 12:36:30 nsg2 kernel:  [812e7d8e] ? common_interrupt+0xe/0x13
 Oct 26 12:36:30 nsg2 kernel:  [810b9b49] ? do_sync_write+0xab/0xeb
 Oct 26 12:36:30 nsg2 kernel:  [812e7abf] ? 
 _raw_spin_unlock_irq+0x9/0xd
 Oct 26 12:36:30 nsg2 kernel:  [8112e83f] ? 
 security_file_permission+0x18/0x6b
 Oct 26 12:36:30 nsg2 kernel:  [810ba1f7] ? vfs_write+0xbe/0x132
 Oct 26 12:36:30 nsg2 kernel:  [810ba754] ? sys_write+0x45/0x6e
 Oct 26 12:36:30 nsg2 kernel:  [81001e6b] ? 
 system_call_fastpath+0x16/0x1b



Re: [PATCH 0/13] IB-mgmt: Port madeye to userspace

2010-11-04 Thread Or Gerlitz

Hefty, Sean wrote:

 [...] an alternative goal of these patches is to allow ibacm and similar
 applications to detect and react to SA and CM timeouts.

Hi Sean,

As far as I understand, a CM timeout is an event, not a mad... When
referring to detecting/reacting to CM timeouts, did you mean detecting
mads like CM retries and reacting to them?


Or.


Re: [PATCH V2 0/3] New RAW_PACKET QP type

2010-11-04 Thread Or Gerlitz

Aleksey Senin wrote:

 The following patches add a new QP type named RAW_PACKET.

Is there anything different in this patch set compared to V1 at
https://patchwork.kernel.org/patch/110153 or is it just a repost?


Or.


Re: [PATCH V2 0/3] New RAW_PACKET QP type

2010-11-07 Thread Or Gerlitz
Steve Wise wrote:
 I'm working on similar code for Chelsio that will use these QPs. 

Will the TX flow require going into kernel space or will it be fully offloaded?

Or.



re: Fix IPoIB to conform to ethtool definitions

2010-11-07 Thread Or Gerlitz
Hi Eli, can this patch of yours which you placed in ofed be pushed
upstream? Or.

From 4237a1fbc1bae6bb86665f81cd93cfac37b216d2 Mon Sep 17 00:00:00 2001
From: Eli Cohen e...@mellanox.co.il
Date: Wed, 3 Nov 2010 10:56:38 +0200
Subject: [PATCH] IPoIB: Fix IPoIB to conform to ethtool definitions

Ethtool documentation states that when one of the parameters,
rx_coalesce_usecs or rx_max_coalesced_frames, is set to zero while the other
has a non-zero value, the non-zero parameter should still be operative. For
example, if rx_max_coalesced_frames is set to zero while rx_coalesce_usecs
is > 0, the rate of events is limited to not exceed (1 / rx_coalesce_usecs).
In the opposite case, an event is generated only after rx_max_coalesced_frames
have arrived. The documentation also states that setting both to zero is
invalid.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index e9795f6..e602b7f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -78,6 +78,13 @@ static int ipoib_set_coalesce(struct net_device *dev,
 	    coal->rx_max_coalesced_frames > 0xffff)
 		return -EINVAL;
 
+	if (coal->rx_max_coalesced_frames | coal->rx_coalesce_usecs) {
+		if (!coal->rx_max_coalesced_frames)
+			coal->rx_max_coalesced_frames = 0xffff;
+		else if (!coal->rx_coalesce_usecs)
+			coal->rx_coalesce_usecs = 0xffff;
+	}
+
 	ret = ib_modify_cq(priv->recv_cq, coal->rx_max_coalesced_frames,
 			   coal->rx_coalesce_usecs);
 	if (ret && ret != -ENOSYS) {
-- 
1.7.3.2
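
The behaviour the change log describes can be sketched standalone as
follows. This is not the driver code: struct coalesce and
normalize_coalesce are hypothetical names, and the 0xffff upper bound is my
assumption about the range the driver accepts.

```c
#include <stdint.h>

#define COAL_MAX 0xffffu	/* ASSUMED upper bound for both parameters */

struct coalesce {
	uint32_t usecs;		/* stand-in for rx_coalesce_usecs */
	uint32_t max_frames;	/* stand-in for rx_max_coalesced_frames */
};

/*
 * Returns 0 on success, -1 if a value is out of range.
 * A zero parameter next to a non-zero one is raised to the maximum, so it
 * never triggers first and the non-zero parameter stays in control.
 * Both zero is passed through unchanged.
 */
int normalize_coalesce(struct coalesce *c)
{
	if (c->usecs > COAL_MAX || c->max_frames > COAL_MAX)
		return -1;

	if (c->usecs | c->max_frames) {
		if (!c->max_frames)
			c->max_frames = COAL_MAX;
		else if (!c->usecs)
			c->usecs = COAL_MAX;
	}
	return 0;
}
```

For example, usecs = 50 with max_frames = 0 becomes usecs = 50 with
max_frames at the maximum, so only the timer limits the event rate.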



IBoE fixes/enhancements

2010-11-07 Thread Or Gerlitz
Hi Eli, are there known IBoE fixes which are in ofed but missed 2.6.37-rc1?
Also, can the below and/or any other enhancements you've placed in ofed be
pushed upstream? it would be great to have perf counters operating fine for IBoE

Or.

From 72c316b60f62401e031520fe3f55ec6879bbc42b Mon Sep 17 00:00:00 2001
From: Eli Cohen e...@mellanox.co.il
Date: Wed, 6 Jan 2010 14:09:38 +0200
Subject: [PATCH 12/12] mlx4: add support for reading performance counters

This patch uses basic or extended counters which can be read by a command
interface, to report counters for all the QPs that work on an rdmaoe port. This
effectively allows to implement performance counter ala IB.

Signed-off-by: Eli Cohen e...@mellanox.co.il
---
 drivers/infiniband/hw/mlx4/mad.c |   86 -
 drivers/infiniband/hw/mlx4/main.c|   17 ++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |1 +
 drivers/infiniband/hw/mlx4/qp.c  |2 +
 drivers/net/mlx4/fw.h|1 -
 drivers/net/mlx4/main.c  |   22 +++--
 include/linux/mlx4/cmd.h |4 ++
 include/linux/mlx4/device.h  |   36 ++
 8 files changed, 159 insertions(+), 10 deletions(-)

Index: ofed_kernel-fixes/drivers/infiniband/hw/mlx4/mad.c
===
--- ofed_kernel-fixes.orig/drivers/infiniband/hw/mlx4/mad.c 2010-09-01 
15:30:01.0 +0300
+++ ofed_kernel-fixes/drivers/infiniband/hw/mlx4/mad.c  2010-09-01 
15:33:48.571462204 +0300
@@ -229,9 +229,9 @@ static void forward_trap(struct mlx4_ib_
}
 }

-int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,u8 
port_num,
-   struct ib_wc *in_wc, struct ib_grh *in_grh,
-   struct ib_mad *in_mad, struct ib_mad *out_mad)
+static int ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
+  struct ib_wc *in_wc, struct ib_grh *in_grh,
+  struct ib_mad *in_mad, struct ib_mad *out_mad)
 {
u16 slid, prev_lid = 0;
int err;
@@ -299,6 +299,87 @@ int mlx4_ib_process_mad(struct ib_device
return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;
 }

+static __be32 be64_to_be32(__be64 b64)
+{
+   return cpu_to_be32(be64_to_cpu(b64)  0x);
+}
+
+static void edit_counters(struct mlx4_counters *cnt, void *data)
+{
+   *(__be32 *)(data + 40 + 24) = be64_to_be32(cnt-tx_bytes);
+   *(__be32 *)(data + 40 + 28) = be64_to_be32(cnt-rx_bytes);
+   *(__be32 *)(data + 40 + 32) = be64_to_be32(cnt-tx_frames);
+   *(__be32 *)(data + 40 + 36) = be64_to_be32(cnt-rx_frames);
+}
+
+static void edit_ext_counters(struct mlx4_counters_ext *cnt, void *data)
+{
+   *(__be32 *)(data + 40 + 24) = be64_to_be32(cnt-tx_uni_bytes);
+   *(__be32 *)(data + 40 + 28) = be64_to_be32(cnt-rx_uni_bytes);
+   *(__be32 *)(data + 40 + 32) = be64_to_be32(cnt-tx_uni_frames);
+   *(__be32 *)(data + 40 + 36) = be64_to_be32(cnt-rx_uni_frames);
+   *(__be32 *)(data + 40 + 8) = be64_to_be32(cnt-rx_err_frames);
+}
+
+static int rdmaoe_process_mad(struct ib_device *ibdev, int mad_flags, u8 
port_num,
+  struct ib_wc *in_wc, struct ib_grh *in_grh,
+  struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+   struct mlx4_cmd_mailbox *mailbox;
+   struct mlx4_ib_dev *dev = to_mdev(ibdev);
+   int err;
+   u32 inmod = dev-counters[port_num - 1]  0x;
+   int mode;
+
+if (in_mad-mad_hdr.mgmt_class != IB_MGMT_CLASS_PERF_MGMT)
+   return -EINVAL;
+
+   mailbox = mlx4_alloc_cmd_mailbox(dev-dev);
+   if (IS_ERR(mailbox))
+   return IB_MAD_RESULT_FAILURE;
+
+   err = mlx4_cmd_box(dev-dev, 0, mailbox-dma, inmod, 0,
+  MLX4_CMD_QUERY_IF_STAT, MLX4_CMD_TIME_CLASS_C);
+   if (err)
+   err = IB_MAD_RESULT_FAILURE;
+   else {
+   memset(out_mad-data, 0, sizeof out_mad-data);
+   mode = be32_to_cpu(((struct mlx4_counters 
*)mailbox-buf)-counter_mode)  0xf;
+   switch (mode) {
+   case 0:
+   edit_counters(mailbox-buf, out_mad-data);
+   err = IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;
+   break;
+   case 1:
+   edit_ext_counters(mailbox-buf, out_mad-data);
+   err = IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;
+   break;
+   default:
+   err = IB_MAD_RESULT_FAILURE;
+   }
+   }
+
+   mlx4_free_cmd_mailbox(dev-dev, mailbox);
+
+   return err;
+}
+
+int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,u8 
port_num,
+   struct ib_wc *in_wc, struct ib_grh *in_grh,
+   struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+   switch 

Re: Fix IPoIB to conform to ethtool definitions

2010-11-07 Thread Or Gerlitz
Eli Cohen wrote:
 Sure, I was going to. I will send later today.

I saw that you've dropped in an implementation of inline/blue-flame sending
for kernel space. What was the motivation: is it SDP, RDS or the like, or
something else?

Or.


Re: IBoE fixes/enhancements

2010-11-08 Thread Or Gerlitz

Eli Cohen wrote:
 I was going to send [...] upstream

Also, you had a fix to the port speed and something related to SL which I
didn't understand; please send them for review.


Or.


Re: kernel inline sending / UARs / etc

2010-11-08 Thread Or Gerlitz

Eli Cohen wrote:

 The idea is to let kernel consumers enjoy the improvements to latency that blue
 flame gives. And yes, SDP is motivating us but I am going to push to IPoIB too.

From my recollection of numbers, for user space apps, using inline
accounts for about 1us improvement in the latency, if this is indeed the
case, I'm sure if there's great value here for kernel consumers, do you
have any numbers to support this patch?

 I want to take the opportunity that you raised the issue to hear others'
 opinions about changing the bitmap allocator to maintain an avail variable
 that will count the number of available UARs. I want to use this to limit
 the number of UARs that a kernel consumer can allocate so that there will
 always be some available for userspace.

Is it correct that today all the kernel QPs use the same UAR? Any
problem with that?


Or.
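
To make the proposal easier to discuss, here is a rough sketch of a bitmap
allocator that keeps an avail counter and refuses kernel allocations once
only a reserve is left. It is not the mlx4 bitmap code: every name below is
hypothetical, and the 64-entry size is only for illustration.

```c
#include <stdint.h>

#define NUM_UARS 64	/* illustrative size, not a hardware value */

struct uar_bitmap {
	uint64_t bits;		/* bit i set means UAR i is in use */
	unsigned int avail;	/* number of free UARs, kept in sync with bits */
	unsigned int reserved;	/* UARs the kernel must leave for userspace */
};

void uar_bitmap_init(struct uar_bitmap *bm, unsigned int reserved)
{
	bm->bits = 0;
	bm->avail = NUM_UARS;
	bm->reserved = reserved;
}

/* Returns the allocated index, or -1 when nothing may be handed out. */
int uar_alloc(struct uar_bitmap *bm, int kernel_consumer)
{
	unsigned int i;

	if (bm->avail == 0)
		return -1;
	/* Kernel consumers may not dip into the userspace reserve. */
	if (kernel_consumer && bm->avail <= bm->reserved)
		return -1;

	for (i = 0; i < NUM_UARS; i++) {
		if (!(bm->bits & (1ULL << i))) {
			bm->bits |= 1ULL << i;
			bm->avail--;
			return (int)i;
		}
	}
	return -1;
}

void uar_free(struct uar_bitmap *bm, unsigned int idx)
{
	if (idx < NUM_UARS && (bm->bits & (1ULL << idx))) {
		bm->bits &= ~(1ULL << idx);
		bm->avail++;
	}
}
```

The counter makes the reserve check O(1); without it the allocator would
have to count free bits on every kernel allocation.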


Re: kernel inline sending / UARs / etc

2010-11-08 Thread Or Gerlitz
Or Gerlitz wrote:
 Eli Cohen wrote:
   
 From my recollection of numbers, for user space apps, using inline
 accounts for about 1us improvement in the latency, if this is indeed the
 case, I'm sure if there's great value here for kernel consumers, do you
 have any numbers to support this patch?

I wanted to say that I'm NOT sure, sorry for the spam

Or.


Re: kernel inline sending / UARs / etc

2010-11-08 Thread Or Gerlitz
Eli Cohen wrote:
> It indeed improves SDP's latency - I don't have exact numbers.

The SDP number is very interesting (Amir, do you have it?) but irrelevant for
upstream. Any IPoIB numbers?

Or.


Re: [PATCH 0/13] IB-mgmt: Port madeye to userspace

2010-11-08 Thread Or Gerlitz

Hefty, Sean wrote:

> CM mads aren't reliable, however they are retried.  If a CM REQ does not
> receive a response after so many retries (usually 15), the REQ fails (status is
> timeout).  The mad layer reports the timeout to the cm module.  With snooping
> in place, a user will be notified that a mad send has failed and be given a
> copy of the mad.

mmm, got that - I also see that ib_mad_send_wc has both the status and
the content of the mad, upon which you base the design.

> 3. ibacm returns a path record.  The path record _may_ have come from cached
> data.
> 4. The librdmacm tries to establish a connection.
> 5. The kernel ib_cm module issues REQ.
> 6. The ib_mad module retries the REQ until it times out.
> 7. The mad timeout is reported to any users wishing to capture errors.
> In this example, the ibacm service would be registered and receive a copy of
> the failed REQ.  The ibacm can look at the data in the REQ, see if it has
> cached path record data which matches, and remove the cached data if so.
> 8. The librdmacm will see a connection failure.

So the usage of mad snooping would be for cache invalidations. I wonder
if registering on GID/MGID IN/OUT traps would be sufficient for the same purpose?
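
The invalidation step described in the quoted flow (match the failed REQ against cached path data and drop the match) could look roughly like the sketch below. It is not ibacm code: struct gid, struct cached_path and invalidate_failed_path() are hypothetical names, and matching on the SGID/DGID pair alone is an assumption about what "matches" would mean.

```c
#include <stddef.h>
#include <string.h>

struct gid {
	unsigned char raw[16];
};

/* Hypothetical cache entry: only the fields needed for matching. */
struct cached_path {
	struct gid sgid;
	struct gid dgid;
	int valid;
};

/*
 * Invalidate every cached path whose endpoints match those carried in a
 * REQ that timed out. Returns the number of entries invalidated.
 */
size_t invalidate_failed_path(struct cached_path *cache, size_t n,
			      const struct gid *sgid, const struct gid *dgid)
{
	size_t i, dropped = 0;

	for (i = 0; i < n; i++) {
		if (!cache[i].valid)
			continue;
		if (memcmp(&cache[i].sgid, sgid, sizeof(*sgid)) == 0 &&
		    memcmp(&cache[i].dgid, dgid, sizeof(*dgid)) == 0) {
			cache[i].valid = 0;
			dropped++;
		}
	}
	return dropped;
}
```

With this shape, the next resolution for the same pair misses the cache and has to be resolved again.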


Or.



Re: kernel inline sending / UARs / etc

2010-11-08 Thread Or Gerlitz
Eli Cohen wrote:
> For IPoIB it gives ~1 usec for improvement in latency.

Yep, this is what I expected. So on your testbed, from what value to what
value? It would also be important to note the change in CPU utilization
(e.g. a few lines of vmstat 1 output before/after the change, while running
IPoIB traffic).

Or.


Re: [PATCH 0/13] IB-mgmt: Port madeye to userspace

2010-11-08 Thread Or Gerlitz
Hefty, Sean wrote:
> That requires registration with the SA.  The intent is to avoid using a
> centralized service when possible.

Yep, makes sense, looks like this design finally went the decentralized way...
cool

Or.


Re: ib receive completion error

2010-11-10 Thread Or Gerlitz

Usha Srinivasan wrote:

> Can someone from Mellanox tell me what the vendor error 0x32 means? I am
> getting this error for wc.opcode 128 (IB_WC_RECV) wc.status 4
> (IB_WC_LOC_PROT_ERR). I am running ofed 1.5.2 and am getting it on both rhel5
> and sles11

You can't count on the wc.opcode when the status isn't success (0). And
yes, we also saw tons of "ib0: failed recv event (status=4, wrid=154
vend_err 32)" errors when running iscsi/tcp over IPoIB stress from a
Windows client to a Linux node acting as the iscsi target and using ofed
1.5.2. It starts to look like a DoS attack is possible here.
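
The point about wc.opcode can be shown with a small stand-alone sketch: check the status first, and read the opcode only when the completion succeeded. The types below are local stand-ins, not the verbs headers; sketch_wc and classify_completion() are hypothetical names, and the numeric values are only the ones quoted in this thread (status 4, opcode 128).

```c
#include <stdio.h>

/* Stand-in constants; values are the ones quoted in the thread. */
enum { WC_SUCCESS = 0, WC_LOC_PROT_ERR = 4 };
enum { WC_OPCODE_RECV = 128 };

/* Hypothetical, cut-down work completion. */
struct sketch_wc {
	unsigned long long wr_id;
	int status;
	int opcode;               /* meaningful only if status == WC_SUCCESS */
	unsigned int vendor_err;
};

/*
 * Returns 1 for a successful receive, 0 for any other successful
 * completion, and -1 for a failed one. The opcode is never looked at
 * when the status reports an error.
 */
int classify_completion(const struct sketch_wc *wc)
{
	if (wc->status != WC_SUCCESS) {
		fprintf(stderr, "failed completion: status=%d wrid=%llu vend_err %x\n",
			wc->status, wc->wr_id, wc->vendor_err);
		return -1;
	}
	return wc->opcode == WC_OPCODE_RECV ? 1 : 0;
}
```

A consumer that needs to know which kind of request failed would look it up through wr_id rather than through the opcode field.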


Or.


Re: [PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API

2010-11-16 Thread Or Gerlitz
 Tom Tucker t...@ogc.us wrote:

> This patch changes the bus mapping logic to avoid page_address() where
> necessary

Hi Tom,

Does "where necessary" mean that the invocations of page_address()
which remain in the code after this patch is applied are safe and
no kmap call is needed?

Or.


Re: [PATCH 3/3] ibacm: check for special handling of loopback requests

2010-11-17 Thread Or Gerlitz

Ralph Campbell wrote:

> I guess what I'm objecting to is hard coding mlx4. I was trying to think of a
> way that would allow other HCAs to support the block loopback option in the
> future. It looks like ipoib sets IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK for
> kernel QPs but this isn't defined in libibverbs yet. It seems reasonable to add
> that feature some time in the future and change ibacm to use it. In the mean
> time, I guess I don't see an alternative to your patch.

Ralph, Sean

It's been a couple of years since some folks from Voltaire tried to push
this flag and the grounds for adding similar flags for QP creation. On
the bright side, it's there for kernel consumers, where the existing flags are
LSO and block-multicast-loopback. On the somewhat disappointing side, we
didn't get that merged for user space. Basically, there was a claim of a
dependency on the XRC patch set, which also added flags for QPs. At some
point, Ron Livne managed to introduce a patch set which is independent of
XRC, see (*) below, but it didn't get in. As such, one of our
application teams pushed to ofed that mlx4 patch which sets this bit by
default, and the acm code is trying to identify and act upon its
existence (**).



Or.

(*) latest post of the patch set

0/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4392994
1/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393054
2/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393004
3/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393014
4/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393024

(**) ofed patch
http://git.openfabrics.org/git?p=ofed_1_5/linux-2.6.git;a=blob;f=kernel_patches/fixes/mlx4_0290_mcast_loopback.patch;h=786a3926529befac2c2d1fa6d8c36bada79d61a7;hb=HEAD





Re: [PATCH 3/3] ibacm: check for special handling of loopback requests

2010-11-17 Thread Or Gerlitz

Hefty, Sean wrote:

> One could argue that this change is reasonable regardless of the OFED kernel
> patch.  It avoids sending multicast traffic when the destination is local.  The
> main drawback beyond the extra code is that a node can't send a multicast
> message to itself, with the intent that remote listeners will be able to cache
> the address data.

Sean,

To be precise, the bit avoids receiving multicast packets by the QP that
--sent-- them, not by other QPs subscribed to that group on the same
node/hca. The patch change-log even states that inter-QP multicast
packets on the relevant HCA will still be delivered. You can test that
with two mckey processes running on a node which has the patch active.
So with acm, is the functionality you need for the same QP to receive the
packets it sent?


Or.




rdma_lat whos

2010-11-17 Thread Or Gerlitz
Hi Ido, 

We came into a situation where running rdma_lat with vs. without the -c flag
(i.e. with or without the rdma-cm) introduces a notable ~1us difference in
latency for 1k messages: ~3us without the rdma-cm and 3.9us with it.

I have reproduced that now with the latest code from your git tree and also
with the RHEL-provided package perftest-1.2.3-1.el5, see the results below.
Also, your tree is not available through the ofa git web service. Vlad, can you
help set this up?

Now, Jack, using this patch, 


Index: perftest/rdma_lat.c
===
--- perftest.orig/rdma_lat.c
+++ perftest/rdma_lat.c
@@ -666,7 +666,7 @@ static int pp_connect_ctx(struct pingpon
 {
         struct ibv_qp_attr attr = {
                 .qp_state           = IBV_QPS_RTR,
-                .path_mtu           = IBV_MTU_256,
+                .path_mtu           = IBV_MTU_2048,
                 .dest_qp_num        = data->rem_dest->qpn,
                 .rq_psn             = data->rem_dest->psn,
                 .max_dest_rd_atomic = 1,




I could get rdma_lat which doesn't use the rdma-cm (which means setting all the
low-level QP params by hand) to produce the SAME result of 3.9us as with the
rdma-cm. As you can see, it's a one-liner patch which uses a higher MTU of 2048
vs. the hard-coded MTU of 256 used in the code. This is quite counterintuitive
for messages whose size is > 256, correct? Is there any known issue that can
explain that?! The SA is convinced that 2048 (0x84) is the best MTU for that
path. Both nodes have ConnectX DDR with firmware 2.7.0.

Or.


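 # saquery -p --src-to-dst 1:14
 Path record for 1 -> 14
 PathRecord dump:
         service_id..............0x
         dgid....................fe80::8:f104:399:3c91
         sgid....................fe80::2:c903:2:6be3
         dlid....................0xE
         slid....................0x1
         hop_flow_raw............0x0
         tclass..................0x0
         num_path_revers.........0x80
         pkey....................0x
         qos_class...............0x0
         sl......................0x0
         mtu.....................0x84
         rate....................0x86
         pkt_life................0x92
         preference..............0x0
         resv2...................0x0
         resv3...................0x0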
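
A note on reading the mtu field in the dump above: in a PathRecord the top two bits of the byte are a selector (2 meaning "exactly") and the low six bits are the IB MTU enumeration (1 = 256 bytes up to 5 = 4096 bytes), so 0x84 reads as "exactly 2048". The sketch below decodes that and counts how many packets a message needs at a given MTU. The helper names are made up for illustration and are not taken from saquery or perftest.

```c
#include <stdint.h>

/* Top two bits of a PathRecord mtu/rate/pkt_life byte: the selector. */
unsigned int pr_selector(uint8_t field)
{
	return field >> 6;
}

/* Low six bits: the enumerated value. */
unsigned int pr_value(uint8_t field)
{
	return field & 0x3f;
}

/* IB MTU enumeration 1..5 maps to 256..4096 bytes; 0 means unknown. */
unsigned int mtu_enum_to_bytes(unsigned int v)
{
	return (v >= 1 && v <= 5) ? 128u << v : 0;
}

/* Number of packets needed to carry msg_bytes at the given MTU. */
unsigned int packets_for(unsigned int msg_bytes, unsigned int mtu_bytes)
{
	return (msg_bytes + mtu_bytes - 1) / mtu_bytes;
}
```

For the 1024-byte messages used in the runs below, this gives four packets at MTU 256 and a single packet at MTU 2048, which is why the higher latency with the larger MTU is surprising.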


before the patch

active side, w.o rdma-cm
 # rdma_lat 192.168.20.15 -s 1024 -n 1
 26113:pp_client_connect: Couldn't connect to 192.168.20.15:18515
 [r...@nsg1 ~]# rdma_lat 192.168.20.15 -s 1024 -n 1
  local address: LID 0x0e QPN 0x1c004d PSN 0x3a3dca RKey 0x48002600 VAddr 0x0008a71400
 remote address: LID 0x04 QPN 0x20004c PSN 0x27973 RKey 0x50042700 VAddr 0x001b724400
 Latency typical: 3.01932 usec
 Latency best   : 2.97582 usec
 Latency worst  : 11.3183 usec

passive side w.o rdma-cm
 # rdma_lat -s 1024 -n 1
  local address: LID 0x04 QPN 0x20004c PSN 0x27973 RKey 0x50042700 VAddr 0x001b724400
 remote address: LID 0x0e QPN 0x1c004d PSN 0x3a3dca RKey 0x48002600 VAddr 0x0008a71400
 Latency typical: 3.02386 usec
 Latency best   : 2.97436 usec
 Latency worst  : 6.63569 usec

active side, with rdma-cm
 # rdma_lat 192.168.20.15 -s 1024 -n 1 -c
 26133: Local address:  LID , QPN 00, PSN 0xa12538 RKey 0x50002600 VAddr 0x0013d27400
 26133: Remote address: LID , QPN 00, PSN 0x5c01e8, RKey 0x58042700 VAddr 0x0006dbb400
 
 Latency typical: 3.89977 usec
 Latency best   : 3.83227 usec
 Latency worst  : 13.6462 usec

passive side, with rdma-cm
 # rdma_lat -s 1024 -n 1 -c
 21826: Local address:  LID , QPN 00, PSN 0x5c01e8 RKey 0x58042700 VAddr 0x0006dbb400
 21826: Remote address: LID , QPN 00, PSN 0xa12538, RKey 0x50002600 VAddr 0x0013d27400
 
 Latency typical: 3.89982 usec
 Latency best   : 3.83082 usec
 Latency worst  : 13.6974 usec

after the patch, the result w.o -c and with MTU=2048 becomes 3.9us as well,

  /home/ogerlitz/linux/tools/perftest/rdma_lat 192.168.20.15 -s 1024 -n 1
  local address: LID 0x0e QPN 0x3c004d PSN 0x14ff1e RKey 0x68002600 VAddr 0x0016c5d400
 remote address: LID 0x04 QPN 0x40004c PSN 0xba137e RKey 0x70042700 VAddr 0x001f259400
 Latency typical: 3.88327 usec
 Latency best   : 3.80378 usec
 Latency worst  : 8.27951 usec



