Re: root owned writeable files under /sys

2010-06-08 Thread Eli Cohen
I don't understand why mlx4_port1 and mlx4_port2 have world write
permissions on your system. I can't see this from the sources nor from
installing ofed-1.5.1 on my system. I agree though that the
permissions for port_trigger and clear_diag should be changed. We'll
push a fix to OFED 1.5.2.

On Sun, Jun 6, 2010 at 7:08 PM, Sumeet Lahorani
sumeet.lahor...@oracle.com wrote:

 Thanks. I realized that my earlier find command didn't capture all the files
 I was looking for. After your patch, the following still need to be
 addressed (all are mlx4 files)

 # find /sys -type f -perm -222
 /sys/class/infiniband/mlx4_0/diag_counters/clear_diag
 /sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
 /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
 /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1

 - Sumeet

 Or Gerlitz wrote:

 Sumeet Lahorani wrote:


 I see the following files created under /sys which are world writeable
 /sys/class/net/ib0/delete_child
 /sys/class/net/ib0/create_child
 At least the create_child and delete_child files appear to be dangerous to
 leave as world writeable because they result in resource allocations.


 Yes, this looks bad. The below patch fixes that, I tested it on 2.6.35-rc1

 [PATCH] make ipoib child entries non-world writable

 Sumeet Lahorani sumeet.lahor...@oracle.com reported that the ipoib child
 entries are world writable, fix them to be root only writable

 Signed-off-by: Or Gerlitz ogerl...@voltaire.com

 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
 b/drivers/infiniband/ulp/ipoib/ipoib_main.c
 index df3eb8c..b4b2257 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
 @@ -1163,7 +1163,7 @@ static ssize_t create_child(struct device *dev,
        return ret ? ret : count;
  }
 -static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
 +static DEVICE_ATTR(create_child, S_IWUSR, NULL, create_child);
  static ssize_t delete_child(struct device *dev,
                            struct device_attribute *attr,
 @@ -1183,7 +1183,7 @@ static ssize_t delete_child(struct device *dev,
        return ret ? ret : count;
  }
 -static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
 +static DEVICE_ATTR(delete_child, S_IWUSR, NULL, delete_child);
  int ipoib_add_pkey_attr(struct net_device *dev)
  {


 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: Handling busy responses from the SA

2010-06-08 Thread Hal Rosenstock
Mike,

On Mon, Jun 7, 2010 at 12:00 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal said:
 Should a busy be retried at all at the mad layer ? Is a special (longer)
 timeout policy for busy needed ?

 Also, should this be done for all MADs classified by ib_response_mad (e.g. 
 trap represses) ?

 Hal,

 The idea of processing BUSY responses in the MAD layer is to treat BUSY
 responses like timeouts - which are currently handled by the MAD layer. Right now there
 is an issue where various apps and ULPs either treat BUSY as a cause to 
 immediately retry or as a permanent error. This doesn't seem to affect users 
 of the OpenSM so much because (as I understand it) the OpenSM seems to 
 discard requests when it gets too busy - but for other SA/SMs, it can cause a 
 major packet storm or, worse, a simple loss of connectivity where MPI jobs or 
 kernel ULPs simply assume the SA is broken because they got a BUSY reply.

 By treating the BUSY reply as a timeout, we're actually simplifying matters 
 by fitting into existing practice.

Understood. Timing these out makes sense to me but still does not
preclude the client from potentially handling this if the retries
fail.

 As for needing a longer timeout - in our old proprietary stack, QLogic did 
 have a longer timeout for retrying busy replies than for normal timeouts

How much longer ? What are the two timeouts used ?

 - but we should try to get this in now so we can get some relief before we 
 begin the long term discussion of the best way to handle this issue overall.

All I was getting at here was: does retrying when busy work ? If not,
why retry at all at the MAD layer (regardless of retries requested)
and perhaps use a longer timeout for this. If it does work, maybe the
timeout on the subsequent retries should be extended.

I think my two other comments on details are relevant to an updated patch.

-- Hal


RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 As for needing a longer timeout - in our old proprietary stack, QLogic did
 have a longer timeout for retrying busy replies than for normal timeouts -
 but we should try to get this in now so we can get some relief before we
 begin the long term discussion of the best way to handle this issue
 overall.

Because applications may handle BUSY replies differently, we shouldn't simply 
start hiding them from the user.  I would much rather agree on the longer term 
plan, so that the ABI can reflect the proper semantics.  I don't see any issue 
with changing the current behavior for kernel clients, however.

- Sean


RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Sean said,

 Because applications may handle BUSY replies differently, we shouldn't simply 
 start hiding them from the user.  

Sean - remember that this patch will still return a BUSY status to the caller:
if retries are exhausted and the last return code was BUSY, then that's what
the caller will get. Thus, code which sets retries to zero will not be affected
by this patch at all.
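
The behaviour described above can be sketched as a plain retry loop. This is not the code from the patch; the outcome codes, the send_once_fn hook and the scripted_send stand-in are all hypothetical names used only to make the sketch self-contained.

```c
#include <stddef.h>

enum mad_result { MAD_OK, MAD_TIMEOUT, MAD_BUSY };	/* hypothetical outcome codes */

/* Hypothetical hook: sends the MAD once and reports the outcome. */
typedef enum mad_result (*send_once_fn)(void *ctx);

/*
 * Send once, then retry up to `retries` times while the outcome is a
 * timeout or a BUSY reply.  The caller sees whatever the last attempt
 * produced, so with retries == 0 a BUSY reply is returned immediately.
 */
static enum mad_result send_with_retries(send_once_fn send_once, void *ctx,
					 unsigned int retries)
{
	enum mad_result res = send_once(ctx);

	while (res != MAD_OK && retries > 0) {
		retries--;
		res = send_once(ctx);
	}
	return res;
}

/* Stand-in transport for trying the loop out: replays a fixed list of outcomes. */
struct scripted_replies {
	const enum mad_result *outcomes;
	size_t next;
};

static enum mad_result scripted_send(void *ctx)
{
	struct scripted_replies *s = ctx;

	return s->outcomes[s->next++];
}
```

A BUSY reply seen on an earlier attempt is discarded by this loop; only the outcome of the final attempt reaches the caller.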

Hal said,

 All I was getting at here was: does retrying when busy work ? If not,
 why retry at all at the MAD layer (regardless of retries requested)
 and perhaps use a longer timeout for this. If it does work, maybe the
 timeout on the subsequent retries should be extended.

Personally, I think it's been extremely helpful - we've been using busy status 
to tell compute nodes to slow down since our old proprietary stack and we've 
seen a significant improvement in overall traffic congestion when we added this 
patch to OFED clusters using our SM. In addition, use of the BUSY return code 
simplifies debugging traffic congestion problems (since it allows you to 
immediately differentiate between SA overload and other traffic issues) and it 
paves the way for more sophisticated back-off strategies in the future.

As to that, and your question, our old stack used two different timeout values 
specified by the client. One value was for actual timeouts and one for busy 
responses. In the case of busy responses, we added a randomization factor to 
spread out the traffic.

The issue with adapting that to the Linux-RDMA stack is that it's an API 
change. What I would suggest personally, is something like this:

1. Take either the timeout passed by the caller OR a predefined constant, 
whichever is larger. I would suggest setting the predefined constant to 
something moderate, say 2 seconds.
2. Add a randomization factor - say between -250 and +250 ms?
3. Update the packet timeout with this new value.
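
The three steps above can be sketched in userspace C as follows. The constant and function names are invented for the sketch, rand() stands in for whatever random source the MAD layer would really use, and the 2 second and +/-250 ms figures are simply the values suggested above.

```c
#include <stdlib.h>

#define BUSY_MIN_TIMEOUT_MS 2000	/* "say 2 seconds" from step 1 */
#define BUSY_JITTER_MS       250	/* +/- range from step 2 */

/* Step 1: the larger of the caller's timeout and the predefined constant. */
static int busy_base_timeout_ms(int caller_timeout_ms)
{
	return caller_timeout_ms > BUSY_MIN_TIMEOUT_MS ?
	       caller_timeout_ms : BUSY_MIN_TIMEOUT_MS;
}

/* Steps 2 and 3: add a jitter in [-250, +250] ms to get the new timeout. */
static int busy_retry_timeout_ms(int caller_timeout_ms)
{
	int jitter = rand() % (2 * BUSY_JITTER_MS + 1) - BUSY_JITTER_MS;

	return busy_base_timeout_ms(caller_timeout_ms) + jitter;
}
```

For a caller timeout of 500 ms this yields a value between 1750 and 2250 ms, so nodes that were told BUSY at the same moment do not all retry at the same moment.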



RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Anyone know why my messages are being appended with interesting garbage?

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Mike Heinz
Sent: Tuesday, June 08, 2010 11:49 AM
To: Hal Rosenstock
Cc: linux-rdma@vger.kernel.org
Subject: RE: Handling busy responses from the SA



RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Anyone know why my messages are being appended with interesting garbage?

I get that too.  I first noticed it a couple of weeks ago.  It eventually went 
back to the normal 'To unsubscribe from this list' message.


RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last 
retry, otherwise, the BUSY response is dropped.  It also looks like it applies 
to all MADs, including vendor specific ones, and not just those from the SA.

- Sean


RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Right. Effectively this is similar to the I/O resolution timeout policy laid 
out in the spec.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:27 PM
To: Mike Heinz; Hal Rosenstock
Cc: linux-rdma@vger.kernel.org
Subject: RE: Handling busy responses from the SA

 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last 
retry, otherwise, the BUSY response is dropped.  It also looks like it applies 
to all MADs, including vendor specific ones, and not just those from the SA.

- Sean


RE: [PATCH] Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Also, I guess, it would be a good API choice if the caller could say
 'get me a reply for this mad or error within 60s' rather than specify
 details like retry counts, etc. The timeout values should be globally
 set and derived from the usual SA provided data for network transits...

I agree with this.  Within the framework of the existing umad ABI, this could 
be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, 
assuming that no one is using that bit in practice.  The kernel could then 
freely select the retry/timeout policy for these clients, which for starters 
could include dropping BUSY responses and adjusting the timeout using an 
approach similar to what Mike mentioned in a separate email.  Kernel clients 
could be updated to use this new mode.

Any disagreements to this approach?  
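
A minimal sketch of the flag encoding proposed above, assuming timeout_ms stays a 32-bit field. The macro and helper names are hypothetical; nothing in the existing umad ABI defines this bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: the high bit of the 32-bit timeout_ms field. */
#define UMAD_TIMEOUT_KERNEL_POLICY (UINT32_C(1) << 31)

/* User side: ask the kernel to choose the retry/timeout policy. */
static uint32_t umad_timeout_request_kernel_policy(uint32_t timeout_ms)
{
	return timeout_ms | UMAD_TIMEOUT_KERNEL_POLICY;
}

/* Kernel side: did the client opt in to the kernel-selected policy? */
static bool umad_timeout_wants_kernel_policy(uint32_t field)
{
	return (field & UMAD_TIMEOUT_KERNEL_POLICY) != 0;
}

/* Kernel side: the overall deadline with the flag bit masked off. */
static uint32_t umad_timeout_value_ms(uint32_t field)
{
	return field & ~UMAD_TIMEOUT_KERNEL_POLICY;
}
```

The cost of this encoding is that the largest expressible timeout drops to 2^31 - 1 ms, and any client already passing a value with the high bit set would silently change behaviour.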


RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Sean -

Is there a case where we would ever want to treat BUSY responses differently from 
timeouts?



-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:27 PM
To: Mike Heinz; Hal Rosenstock
Cc: linux-rdma@vger.kernel.org
Subject: RE: Handling busy responses from the SA

 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last 
retry, otherwise, the BUSY response is dropped.  It also looks like it applies 
to all MADs, including vendor specific ones, and not just those from the SA.

- Sean


RE: Handling busy responses from the SA

2010-06-08 Thread Hefty, Sean
 Is there a case where we would ever want to treat BUSY responses differently
 from timeouts?

I doubt it for a single MAD, but I can't say what people may have implemented.  
The main difference I can think of is that a busy response requires a retry, 
whereas a timeout does not.  This affects the retry policy when multiple MADs 
are outstanding.  E.g. if there are 10 requests outstanding and the first times 
out, we may only resend the first request and increase the timeouts of the 
other 9.  If the 10 requests all receive a busy, then they must all be retried.
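
The difference described above can be sketched with some hypothetical bookkeeping. The struct and function names are invented for illustration and are not taken from the MAD layer.

```c
#include <stdbool.h>
#include <stddef.h>

struct pending_mad {		/* hypothetical per-request state */
	int  timeout_ms;
	bool resend;
};

/* Request `idx` timed out: resend only that one and give the others more time. */
static void on_timeout(struct pending_mad *reqs, size_t n, size_t idx,
		       int extra_ms)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == idx)
			reqs[i].resend = true;
		else
			reqs[i].timeout_ms += extra_ms;
	}
}

/* Request `idx` received a BUSY reply: that request itself must be retried. */
static void on_busy(struct pending_mad *reqs, size_t idx)
{
	reqs[idx].resend = true;
}

/* Number of requests currently marked for retransmission. */
static size_t count_resends(const struct pending_mad *reqs, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (reqs[i].resend)
			count++;
	return count;
}
```

With 10 outstanding requests, one timeout marks a single request for resend, while 10 BUSY replies mark all 10.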

To me, it looks like it makes more sense to never send busy, except maybe when 
receive buffer space is fully consumed, but implement a more intelligent 
timeout/retry mechanism on the sender side.  The SA almost needs some sort of 
MRA like message.

- Sean


RE: [PATCH v2] allow passthrough of rmpp protocol to user mad clients

2010-06-08 Thread Mike Heinz
On a different subject - have we come to any conclusions about this patch? 

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Mike Heinz
Sent: Friday, June 04, 2010 1:14 PM
To: linux-rdma@vger.kernel.org; Hal Rosenstock; Hefty, Sean; Roland Dreier
Subject: [PATCH v2] allow passthrough of rmpp protocol to user mad clients

This is an update to the previous version of the patch, based on feedback from 
Hal.


Currently, if a user application calls umad_register() or umad_register_oui() 
with an rmpp_version of zero, incoming rmpp messages are discarded and if the 
rmpp_version is 1, incoming rmpp packets are collected by the kernel layer and 
passed as a group to the user application.

This patch changes this behavior so that rmpp_version of 255 causes incoming 
rmpp packets to be passed through without alteration, instead.

There are IB users who have requested the ability to perform RMPP transaction 
handling in user space.  This was an option in old proprietary stacks and this 
is useful to migrate old applications to OFED while containing the scope of 
their application changes.  

Signed-Off-By: Michael Heinz michael.he...@qlogic.com

---

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index ef1304f..efca783 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -207,12 +207,18 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	int ret2, qpn;
 	unsigned long flags;
 	u8 mgmt_class, vclass;
+	u8 rmpp_passthru = 0;
 
 	/* Validate parameters */
 	qpn = get_spl_qp_index(qp_type);
 	if (qpn == -1)
 		goto error1;
 
+	if (rmpp_version == IB_MGMT_RMPP_PASSTHRU) {
+		rmpp_passthru = 255;
+		rmpp_version = 0;
+	}
+
 	if (rmpp_version && rmpp_version != IB_MGMT_RMPP_VERSION)
 		goto error1;
 
@@ -244,6 +250,7 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 			if (!is_vendor_oui(mad_reg_req->oui))
 				goto error1;
 		}
+
 		/* Make sure class supplied is consistent with RMPP */
 		if (!ib_is_mad_class_rmpp(mad_reg_req->mgmt_class)) {
 			if (rmpp_version)
@@ -302,6 +309,7 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	mad_agent_priv->qp_info = &port_priv->qp_info[qpn];
 	mad_agent_priv->reg_req = reg_req;
 	mad_agent_priv->agent.rmpp_version = rmpp_version;
+	mad_agent_priv->agent.rmpp_passthru = rmpp_passthru;
 	mad_agent_priv->agent.device = device;
 	mad_agent_priv->agent.recv_handler = recv_handler;
 	mad_agent_priv->agent.send_handler = send_handler;
@@ -1792,7 +1800,7 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 
 	INIT_LIST_HEAD(&mad_recv_wc->rmpp_list);
 	list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list);
-	if (mad_agent_priv->agent.rmpp_version) {
+	if (mad_agent_priv->agent.rmpp_version && !mad_agent_priv->agent.rmpp_passthru) {
 		mad_recv_wc = ib_process_rmpp_recv_wc(mad_agent_priv,
 						      mad_recv_wc);
 		if (!mad_recv_wc) {
@@ -1801,29 +1809,47 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 		}
 	}
 
+	/*
+	 * At this point, the MAD is either not an RMPP or we are passing RMPPs thru to
+	 * the client.
+	 */
 	/* Complete corresponding request */
 	if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
-		if (!mad_send_wr) {
+		if (mad_send_wr) {
+			ib_mark_mad_done(mad_send_wr);
 			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
-			ib_free_recv_mad(mad_recv_wc);
-			deref_mad_agent(mad_agent_priv);
-			return;
-		}
-		ib_mark_mad_done(mad_send_wr);
-		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
-		/* Defined behavior is to complete response before request */
-		mad_recv_wc->wc->wr_id = (unsigned long) &mad_send_wr->send_buf;
-		mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent,
-						   mad_recv_wc);
-		atomic_dec(&mad_agent_priv->refcount);
+			/* Defined behavior is to complete response before request */
+			mad_recv_wc->wc->wr_id = (unsigned long) &mad_send_wr->send_buf;
+			mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent,
+ 
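
The registration semantics described in the patch text above (0 discards incoming RMPP, 1 lets the kernel reassemble it, 255 passes it through untouched) can be summarised in a small sketch. The enum and function are hypothetical; IB_MGMT_RMPP_PASSTHRU is given the value 255 stated in the description.

```c
#include <stdint.h>

#define IB_MGMT_RMPP_VERSION   1	/* the RMPP version the kernel accepts */
#define IB_MGMT_RMPP_PASSTHRU  255	/* value given in the patch description */

enum rmpp_mode {			/* hypothetical summary of the outcomes */
	RMPP_DISCARD,			/* incoming RMPP messages are dropped */
	RMPP_KERNEL_REASSEMBLY,		/* kernel collects segments for the client */
	RMPP_PASSTHRU,			/* segments are handed to the client as-is */
	RMPP_INVALID			/* registration is rejected */
};

static enum rmpp_mode rmpp_mode_for_version(uint8_t rmpp_version)
{
	switch (rmpp_version) {
	case 0:
		return RMPP_DISCARD;
	case IB_MGMT_RMPP_VERSION:
		return RMPP_KERNEL_REASSEMBLY;
	case IB_MGMT_RMPP_PASSTHRU:
		return RMPP_PASSTHRU;
	default:
		return RMPP_INVALID;	/* the "goto error1" path in the patch */
	}
}
```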

RE: [PATCH] Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
It's workable, although I really wish there was a way to handle stupid apps 
that aren't written to handle a busy response.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:44 PM
To: Jason Gunthorpe
Cc: Mike Heinz; linux-rdma@vger.kernel.org; e...@openfabrics.org
Subject: RE: [PATCH] Handling busy responses from the SA

 Also, I guess, it would be a good API choice if the caller could say
 'get me a reply for this mad or error within 60s' rather than specify
 details like retry counts, etc. The timeout values should be globally
 set and derived from the usual SA provided data for network transits...

I agree with this.  Within the framework of the existing umad ABI, this could 
be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, 
assuming that no one is using that bit in practice.  The kernel could then 
freely select the retry/timeout policy for these clients, which for starters 
could include dropping BUSY responses and adjusting the timeout using an 
approach similar to what Mike mentioned in a separate email.  Kernel clients 
could be updated to use this new mode.

Any disagreements to this approach?  


librdmacm 1.0.12 release notes for OFED 1.5.2

2010-06-08 Thread Hefty, Sean
Here is the first cut at release notes -- attached and inline -- for the OFED 
1.5.2 release of the librdmacm.

- Sean

---

librdmacm release notes
---
Several enhancements were added to librdmacm release 1.0.12 that
are intended to simplify using RDMA devices and address scalability issues.
These changes were in response to long standing requests to make
connection establishment 'more like sockets'.  For full details,
users should refer to the appropriate man pages.  Major changes include:


* Support synchronous operation for library calls.  Users can control
  whether an rdma_cm_id operates asynchronously or synchronously based on
  the rdma_event_channel parameter.  Use of synchronous operations
  reduces the amount of application code required to use the librdmacm
  by eliminating the need for event processing code.

  An rdma_cm_id will be marked for synchronous operation if the
  rdma_event_channel parameter is NULL for rdma_create_id or
  rdma_migrate_id.  Users can toggle between synchronous and
  asynchronous operation through the rdma_migrate_id call.

  Calls that operate synchronously include rdma_resolve_addr,
  rdma_resolve_route, rdma_connect, rdma_accept, and rdma_get_request.
  Synchronous event data is returned to the user through the
  rdma_cm_id.


* The addition of a new API: rdma_getaddrinfo.  This call is modeled
  after getaddrinfo, but for RDMA devices and connections.  It has the
  following notable deviations from getaddrinfo:

  A source address is returned as part of the call to allow the
  user to allocate necessary local HW resources for connections.

  Optional routing information may be returned to support
  Infiniband fabrics.  IB routing information includes necessary
  path record data.  rdma_getaddrinfo will obtain this information
  if IB ACM support (see below) is enabled.  The use of IB ACM
  is not required for rdma_getaddrinfo.

  rdma_getaddrinfo provides future extensions to support
  more complex address and route resolution mechanisms, such as
  multiple path support and failover.


* Support for new APIs: rdma_get_request, rdma_create_ep, and
  rdma_destroy_ep.  rdma_get_request simplifies the passive side
  implementation by adding synchronous support for accepting new
  connections.  rdma_create_ep combines the functionality of
  rdma_create_id, rdma_create_qp, rdma_resolve_addr, and rdma_resolve_route
  in a single API that uses the output of rdma_getaddrinfo as its input.

  
* Support for optional parameters.  To simplify support for casual RDMA
  developers and researchers, the librdmacm can allocate protection
  domains, completion queues, and queue pairs on a user's behalf.
  This reduces the amount of information that a developer
  must learn in order to use RDMA, plus allows the user to take
  advantage of higher-level completion processing abstractions.

  In addition to optional parameters, a user can also specify that the
  librdmacm should automatically select usable values for RDMA read
  operations.


* Add support for IB ACM.  IB ACM (InfiniBand Assistant for Communication
  Management) defines a socket based protocol to an IB address and route
  resolution service.  One implementation of that service is provided
  separately by the ibacm package, but anyone can implement the service
  provided that they adhere to the IB ACM socket protocol.  IB ACM is an
  experimental service targeted at increasing the scalability of applications
  running on a large cluster.
  
  Use of IB ACM is not required and is controlled through the build option
  '--with-ib_acm'.  If the librdmacm fails to contact the IB ACM service, it
  reverts to using kernel services to resolve address and routing data.


* Add RDMA helper routines.  The librdmacm provides a set of simpler verbs
  calls for posting work requests, registering memory, and checking for
  completions.  These calls are wrappers around libibverbs routines.







Re: Handling busy responses from the SA

2010-06-08 Thread Hal Rosenstock
On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote:
 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

 It looks like it only returns the BUSY response if that matches with the last 
 retry, otherwise, the BUSY response is dropped.  It also looks like it 
 applies to all MADs, including vendor specific ones, and not just those from 
 the SA.

Per the proposed patch, it currently includes trap represses (as
determined by ib_response_mad). Shouldn't busy be ignored for that
case ? I don't think that would be used but it seems safer to me.
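
One way to express that exclusion is sketched below. The method values match the kernel's include/rdma/ib_mad.h; the status macro name, the struct and the predicate are made up for the sketch, and the BM class case that ib_response_mad also handles is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define IB_MGMT_METHOD_TRAP_REPRESS 0x07
#define IB_MGMT_METHOD_RESP         0x80	/* response bit of the method field */
#define MAD_STATUS_BUSY             0x0001	/* bit 0 of the MAD status field */

struct mad_hdr_view {		/* hypothetical, fields in host byte order */
	uint8_t  method;
	uint16_t status;
};

/* Should a BUSY status on this incoming MAD be turned into a retry? */
static bool busy_should_trigger_retry(const struct mad_hdr_view *hdr)
{
	if (hdr->method == IB_MGMT_METHOD_TRAP_REPRESS)
		return false;	/* never retry on a trap repress */
	if (!(hdr->method & IB_MGMT_METHOD_RESP))
		return false;	/* not a response at all */
	return (hdr->status & MAD_STATUS_BUSY) != 0;
}
```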

-- Hal


 - Sean



RE: Handling busy responses from the SA

2010-06-08 Thread Mike Heinz
Hal,

I may be confused - but I thought the spec said there was no valid response to 
a trap repress. I interpreted

o14-3.a4: The SMA shall not send any message in response to a valid 
SubnTrapRepress() message

to mean that the SMA isn't allowed to respond with a BUSY status for a trap 
repress.

-Original Message-
From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com] 
Sent: Tuesday, June 08, 2010 3:09 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-rdma@vger.kernel.org
Subject: Re: Handling busy responses from the SA

On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote:
 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

 It looks like it only returns the BUSY response if that matches with the last 
 retry, otherwise, the BUSY response is dropped.  It also looks like it 
 applies to all MADs, including vendor specific ones, and not just those from 
 the SA.

Per the proposed patch, it currently includes trap represses (as
determined by ib_response_mad). Shouldn't busy be ignored for that
case ? I don't think that would be used but it seems safer to me.

-- Hal


 - Sean


ibacm 1.0.0 release notes for OFED 1.5.2

2010-06-08 Thread Hefty, Sean
Here are release notes -- attached and inline -- for IB ACM 1.0.0 for OFED 
1.5.2.

 - Sean
---
Assistant for InfiniBand Communication Management (IB ACM)

Note: The IB ACM should be considered experimental.


Overview

The IB ACM package implements and provides a framework for experimental name,
address, and route resolution services over InfiniBand.  It is intended to
address connection setup scalability issues running MPI applications on
large clusters.  The IB ACM provides information needed to establish a
connection, but does not implement the CM protocol.

The librdmacm can invoke IB ACM services when built using the --with-ib_acm
option.  The IB ACM services tie in under the rdma_resolve_addr,
rdma_resolve_route, and rdma_getaddrinfo routines.  For maximum benefit,
the rdma_getaddrinfo routine should be used, however existing applications
should still see significant connection scaling benefits using the calls
available in librdmacm 1.0.11 and previous releases.

The IB ACM is focused on being scalable and efficient.  The current
implementation limits network traffic, SA interactions, and centralized
services.  ACM supports multiple resolution protocols in order to handle
different fabric topologies.

This release 1.0.0 is limited in its handling of dynamic changes.

The IB ACM package is comprised of two components: the ib_acm service
and a test/configuration utility - ib_acme.  Both are userspace components
and are available for Linux and Windows.  Additional details are given below.


Quick Start Guide
-
1. Prerequisites: libibverbs and libibumad must be installed.
   The IB stack should be running with IPoIB configured.
   These steps assume that the user has administrative privileges.
2. Install the IB ACM package
   This installs ib_acm, and ib_acme.
3. Run ib_acme -A -O
   This will generate IB ACM address and options configuration files.
   (acm_addr.cfg and acm_opts.cfg)
4. Run ib_acm and leave running.
   ib_acm will eventually be converted to a service/daemon, but for now
   is a userspace application.  Because ib_acm uses the libibumad
   interfaces, it should be run with administrative privileges.
5. Optionally, run ib_acme -s source_ip -d dest_ip -v
   This will verify that the ib_acm service is running.
6. Install librdmacm using the build option --with-ib_acm.
   The librdmacm will automatically use the ib_acm service.
   On failures, the librdmacm will fall back to normal resolution.


Details
---
ib_acme:
The ib_acme program serves a dual role.  It acts as a utility to test
ib_acm operation and help verify if the ib_acm service and selected
protocol are usable for a given cluster configuration.   Additionally,
it automatically generates ib_acm configuration files to assist with
or eliminate manual setup.


acm configuration files:
The ib_acm service relies on two configuration files.

The acm_addr.cfg file contains name and address mappings for each IB
device, port, pkey endpoint.  Although the names in the acm_addr.cfg
file can be anything, ib_acme maps the host name and IP addresses to
the IB endpoints.

The acm_opts.cfg file provides a set of configurable options for the
ib_acm service, such as timeout, number of retries, logging level, etc.
ib_acme generates the acm_opts.cfg file using static information.  A
future enhancement would adjust options based on the current system
and cluster size. 


ib_acm:
The ib_acm service is responsible for resolving names and addresses to
InfiniBand path information and caching such data.  It is currently
implemented as an executable application, but is a conceptual service
or daemon that should execute with administrative privileges.

The ib_acm implements a client interface over TCP sockets, which is
abstracted by the librdmacm library.  One or more back-end protocols are
used by the ib_acm service to satisfy user requests.  Although the
ib_acm supports standard SA path record queries on the back-end, it
provides an experimental multicast resolution protocol in hope of
achieving greater scalability.  The latter is not usable on all fabric
topologies, specifically ones that may not have reversible paths.
Users should use the ib_acme utility to verify that the multicast protocol
is usable before running other applications.

Conceptually, the ib_acm service implements an ARP like protocol and either
uses IB multicast records to construct path record data or queries the
SA directly, depending on the selected route protocol.  By default, the
ib_acm service uses and caches SA path record queries.

Specifically, all IB endpoints join a number of multicast groups.
Multicast groups differ based on rates, mtu, sl, etc., and are prioritized.
All participating endpoints must be able to communicate on the lowest
priority multicast group.  The ib_acm assigns one or more names/addresses
to each IB endpoint using the acm_addr.cfg file.  Clients provide source
and destination names or addresses as input to the service, and receive
as 

librdma_cm: client example failed

2010-06-08 Thread joel vennin
Hi,

I've downloaded the latest version of librdma_cm 1.0.12.

I got an unexpected segfault (well, a segfault is usually not expected ;)

I run rdma_server and rdma_client on the same server. The first waits
for an incoming message; the second segfaults. This is the backtrace
of rdma_client:
#0  0x7f3de011482d in ?? () from /usr/lib/libmlx4-rdmav2.so
#1  0x7f3de011640e in ?? () from /usr/lib/libmlx4-rdmav2.so
#2  0x7f3de0a94244 in __ibv_modify_qp (qp=0x2458d40, attr=0x0,
attr_mask=57) at src/verbs.c:474
#3  0x7f3de0c9b292 in ucma_init_conn_qp (id_priv=0x24590d0,
qp=0x2458d40) at src/cma.c:1060
#4  0x7f3de0c9b3a3 in rdma_create_qp (id=0x24590d0, pd=<value
optimized out>, qp_init_attr=0x7fff973aa090) at src/cma.c:1203
#5  0x7f3de0c9d6a9 in rdma_create_ep (id=0x601650, res=0x2458650,
pd=0x0, qp_init_attr=0x7fff973aa090) at src/cma.c:2153
#6  0x00400bae in run () at examples/rdma_client.c:67
#7  0x00400fcf in main (argc=1, argv=0x7fff973aa288) at
examples/rdma_client.c:131


The IB chipset is the following one:
04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR,
PCIe 2.0 5GT/s] (rev a0)

Kernel use:
Linux 2.6.32.7.v01 #1 SMP Wed Feb 3 15:45:37 CET 2010 x86_64 GNU/Linux

libibverbs: 1.1.2
libmlx4: 1.0

Any help will be appreciated,

Thank you

Joel


Re: Handling busy responses from the SA

2010-06-08 Thread Roland Dreier
  Is there a case where we would ever want to treat BUSY responses
  differently from timeouts?

If there isn't then it's silly for the SA to ever send a BUSY response.

 - R.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


RE: MPI traffic with service Level

2010-06-08 Thread Hefty, Sean
 I'm interested in MPI traffic that takes the SL (service level) to path
 mapping into consideration (as decided by the SA/SM), e.g. the LASH routing
 algorithm, which uses SL/VL for deadlock avoidance. Is there any way to
 make MPI traffic use the SL indicated by the SA/SM? Any help/hints would
 be appreciated. Thanks again.

If MPI is set to use the rdma_cm, it will obtain SL information from the SA.  I 
believe most MPIs support this option through some means.  Beyond that, there 
may be other ways to do this based on the MPI that you're using.

- Sean


RE: librdma_cm: client example failed

2010-06-08 Thread Hefty, Sean
 On the same server I execute the rdma_server and rdma_client. The
 first is waiting incoming message. The second one segfault. This is
 the backtrace of the rdma_client:
 #0  0x7f3de011482d in ?? () from /usr/lib/libmlx4-rdmav2.so
 #1  0x7f3de011640e in ?? () from /usr/lib/libmlx4-rdmav2.so
 #2  0x7f3de0a94244 in __ibv_modify_qp (qp=0x2458d40, attr=0x0,
 attr_mask=57) at src/verbs.c:474

The attr parameter doesn't look right.  ucma_init_conn_qp calls ibv_modify_qp 
using an attr parameter from the stack.  The attr_mask looks like it could be 
correct.
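
As a quick sanity check of the attr_mask value in frame #2, the sketch
below decodes 57 into individual bits. The bit values are local copies of
the first few entries of enum ibv_qp_attr_mask as I read them in
libibverbs' verbs.h, redefined so the example builds without the library.
The helper name decode_qp_attr_mask is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the low ibv_qp_attr_mask bits (assumed to match
 * libibverbs' verbs.h; redefined so this builds without the library). */
enum {
	QP_STATE               = 1 << 0,
	QP_CUR_STATE           = 1 << 1,
	QP_EN_SQD_ASYNC_NOTIFY = 1 << 2,
	QP_ACCESS_FLAGS        = 1 << 3,
	QP_PKEY_INDEX          = 1 << 4,
	QP_PORT                = 1 << 5,
};

/* Hypothetical helper: write the names of the bits set in mask into buf,
 * separated by '|'. Bits outside the table are ignored. Returns buf. */
char *decode_qp_attr_mask(int mask, char *buf, size_t len)
{
	static const struct { int bit; const char *name; } tbl[] = {
		{ QP_STATE,               "STATE" },
		{ QP_CUR_STATE,           "CUR_STATE" },
		{ QP_EN_SQD_ASYNC_NOTIFY, "EN_SQD_ASYNC_NOTIFY" },
		{ QP_ACCESS_FLAGS,        "ACCESS_FLAGS" },
		{ QP_PKEY_INDEX,          "PKEY_INDEX" },
		{ QP_PORT,                "PORT" },
	};
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		if (!(mask & tbl[i].bit))
			continue;
		/* Add a separator before every name except the first. */
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, tbl[i].name, len - strlen(buf) - 1);
	}
	return buf;
}
```

With a 64-byte buffer, decode_qp_attr_mask(57, buf, sizeof(buf)) gives
"STATE|ACCESS_FLAGS|PKEY_INDEX|PORT". That is the attribute set one would
expect for a transition to INIT, so the mask is plausible and the NULL
attr pointer is the suspicious part.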

Can you try updating the libmlx4 library and see if you get the same results?

- Sean


Re: Handling busy responses from the SA

2010-06-08 Thread Hal Rosenstock
Mike,

On Tue, Jun 8, 2010 at 3:59 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal,

 I may be confused - but I thought the spec said there was no valid response 
 to a trap repress. I interpreted

 o14-3.a4: The SMA shall not send any message in response to a valid 
 SubnTrapRepress() message

 to mean that the SMA isn't allowed to respond with a BUSY status for a trap 
 repress.

I'm referring to the receipt of the TrapRepress with busy status.
Wouldn't your patch cause the original Trap to be resent when
retries > 0? TrapRepress is essentially a response to Trap and classified as
such by ib_response_mad. Your proposed patch treats a busy as a
timeout and can cause retry of the original sent Trap.
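
To make the concern concrete, here is a small userspace model of the
classification. It is a sketch, not the kernel code: is_response_method
mirrors only the response-bit and TrapRepress cases of ib_response_mad
(the kernel function may check more), and busy_counts_as_timeout is a
hypothetical policy helper showing what "ignore busy for TrapRepress"
could look like. The constants are local copies of the values I believe
include/rdma/ib_mad.h uses, and status is taken in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of MAD method/status values (assumed to match the
 * kernel's include/rdma/ib_mad.h). */
#define MGMT_METHOD_TRAP         0x05
#define MGMT_METHOD_TRAP_REPRESS 0x07
#define MGMT_METHOD_RESP         0x80	/* response bit */
#define MGMT_METHOD_GET_RESP     0x81
#define MAD_STATUS_BUSY          0x0001

/* Simplified model of ib_response_mad(): a MAD is a response if the
 * response bit is set or if it is a TrapRepress. */
bool is_response_method(uint8_t method)
{
	return (method & MGMT_METHOD_RESP) ||
	       method == MGMT_METHOD_TRAP_REPRESS;
}

/* Hypothetical policy: a BUSY response is handled like a timeout (so
 * the request may be resent), except when the response is a
 * TrapRepress, where BUSY is ignored and the Trap is not resent. */
bool busy_counts_as_timeout(uint8_t method, uint16_t status)
{
	if (!is_response_method(method))
		return false;
	if (!(status & MAD_STATUS_BUSY))
		return false;
	return method != MGMT_METHOD_TRAP_REPRESS;
}
```

Under this model a GetResp with the busy bit set is retried, while a
TrapRepress with the busy bit set is treated as an ordinary response.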

-- Hal


 -Original Message-
 From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
 Sent: Tuesday, June 08, 2010 3:09 PM
 To: Hefty, Sean
 Cc: Mike Heinz; linux-rdma@vger.kernel.org
 Subject: Re: Handling busy responses from the SA

 On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote:
 Sean - remember that this patch will still return a BUSY status to the
 caller, if retries are exhausted and the last return code was BUSY, then
 that's what the caller will get. Thus, code which sets retries to zero will
 not be affected by this patch at all.

 It looks like it only returns the BUSY response if that matches with the 
 last retry, otherwise, the BUSY response is dropped.  It also looks like it 
 applies to all MADs, including vendor specific ones, and not just those from 
 the SA.
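
The behaviour described in the two paragraphs above can be modelled as
follows. This is only a sketch of the semantics as discussed in this
thread, not the patch itself; the enum and final_status are hypothetical
names, and results[] stands in for what each send would return.

```c
#include <stddef.h>

enum attempt_result { ATTEMPT_OK, ATTEMPT_TIMEOUT, ATTEMPT_BUSY };

/* Hypothetical model: results[i] is the outcome of the i-th send (the
 * initial send plus up to `retries` resends). BUSY is handled like a
 * timeout, so the MAD is resent while retries remain. The caller sees
 * only the outcome of the last send that was actually made. */
enum attempt_result final_status(const enum attempt_result *results,
				 size_t n_results, unsigned retries)
{
	enum attempt_result last = ATTEMPT_TIMEOUT;
	size_t sends = (size_t)retries + 1;
	size_t i;

	for (i = 0; i < sends && i < n_results; i++) {
		last = results[i];
		if (last == ATTEMPT_OK)
			break;	/* a good response ends the exchange */
	}
	return last;
}
```

With retries = 0 a BUSY on the only send is returned unchanged. With
retries = 1, BUSY followed by BUSY returns BUSY, while BUSY followed by a
timeout returns the timeout and the earlier BUSY is dropped.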

 Per the proposed patch, it currently includes trap represses (as
 determined by ib_response_mad). Shouldn't busy be ignored for that
 case ? I don't think that would be used but it seems safer to me.

 -- Hal


 - Sean




Re: [PATCH] opensm/osm_sa.c: In osm_sa_respond, only fill in attr offset if RMPP method

2010-06-08 Thread Sasha Khapyorsky
Hi Hal,

On 09:42 Thu 03 Jun , Hal Rosenstock wrote:
 
 Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
 ---
  opensm/opensm/osm_sa.c |   12 ++--
  1 files changed, 10 insertions(+), 2 deletions(-)
 
 diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
 index 0aca81f..8325632 100644
 --- a/opensm/opensm/osm_sa.c
 +++ b/opensm/opensm/osm_sa.c
 @@ -3,6 +3,7 @@
   * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
 + * Copyright (c) 2010 HNR Consulting. All rights reserved.
   *
   * This software is available to you under a choice of one of two
   * licenses.  You may choose to be licensed under the terms of the GNU
 @@ -454,8 +455,15 @@ void osm_sa_respond(osm_sa_t *sa, osm_madw_t *madw, 
 size_t attr_size,
   /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
   resp_sa_mad->sm_key = 0;
  
 - /* Fill in the offset (paylen will be done by the rmpp SAR) */
 - resp_sa_mad->attr_offset = num_rec ? ib_get_attr_offset(attr_size) : 0;
 +#ifdef DUAL_SIDED_RMPP
 + if (resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP ||
 + resp_sa_mad->method == IB_MAD_METHOD_GETMULTI_RESP) {
 +#else
 + if (resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) {
 +#endif
 + /* Fill in the offset (paylen will be done by the rmpp SAR) */
 + resp_sa_mad->attr_offset = num_rec ? ib_get_attr_offset(attr_size) : 0;
 + }

What is wrong with current implementation?

Sasha

  
   p = ib_sa_mad_get_payload_ptr(resp_sa_mad);
  
 -- 
 1.5.6.4
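
For readers following the patch, its effect can be sketched outside
OpenSM as below. The function names are hypothetical, the method values
are local copies of what I understand GetTableResp and GetMultiResp to
be, and the offset is computed in host byte order in 8-byte units with
rounding up. That arithmetic is an assumption for illustration and is not
claimed to be what ib_get_attr_offset does. The sketch also assumes the
field is otherwise left at zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of SA method values (assumed). */
#define METHOD_GET_RESP      0x81
#define METHOD_GETTABLE_RESP 0x92
#define METHOD_GETMULTI_RESP 0x94

/* Hypothetical helper: does this response method carry RMPP data?
 * GetMultiResp counts only when dual-sided RMPP is enabled, which the
 * patch expresses with #ifdef DUAL_SIDED_RMPP. */
bool method_uses_rmpp(uint8_t method, bool dual_sided_rmpp)
{
	if (method == METHOD_GETTABLE_RESP)
		return true;
	return dual_sided_rmpp && method == METHOD_GETMULTI_RESP;
}

/* Hypothetical model of the patched logic: AttributeOffset is filled in
 * only for RMPP methods with at least one record, otherwise it is 0.
 * The offset is expressed in 8-byte units (rounded up, host order). */
uint16_t attr_offset_for(uint8_t method, bool dual_sided_rmpp,
			 uint32_t attr_size, uint32_t num_rec)
{
	if (!method_uses_rmpp(method, dual_sided_rmpp) || num_rec == 0)
		return 0;
	return (uint16_t)((attr_size + 7) / 8);
}
```

For a 64-byte record, a GetTableResp with three records gets an offset
of 8, while a plain GetResp gets 0 under this model.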
 