Re: root owned writeable files under /sys
I don't understand why mlx4_port1 and mlx4_port2 have world write permissions on your system. I can't see this from the sources nor from installing ofed-1.5.1 on my system. I agree though that the permissions for port_trigger and clear_diag should be changed. We'll push a fix to OFED 1.5.2.

On Sun, Jun 6, 2010 at 7:08 PM, Sumeet Lahorani sumeet.lahor...@oracle.com wrote:

Thanks. I realized that my earlier find command didn't capture all the files I was looking for. After your patch, the following still need to be addressed (all are mlx4 files):

# find /sys -type f -perm -222
/sys/class/infiniband/mlx4_0/diag_counters/clear_diag
/sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
/sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
/sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1

- Sumeet

Or Gerlitz wrote:

Sumeet Lahorani wrote:

I see the following files created under /sys which are world writeable

/sys/class/net/ib0/delete_child
/sys/class/net/ib0/create_child

At least the create_child and delete_child files appear to be dangerous to leave as world writeable because they result in resource allocations.

Yes, this looks bad. The below patch fixes that, I tested it on 2.6.35-rc1

[PATCH] make ipoib child entries non-world writable

Sumeet Lahorani sumeet.lahor...@oracle.com reported that the ipoib child entries are world writable, fix them to be root only writable

Signed-off-by: Or Gerlitz ogerl...@voltaire.com

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index df3eb8c..b4b2257 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1163,7 +1163,7 @@ static ssize_t create_child(struct device *dev,

 	return ret ? ret : count;
 }
-static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
+static DEVICE_ATTR(create_child, S_IWUSR, NULL, create_child);

 static ssize_t delete_child(struct device *dev,
 			    struct device_attribute *attr,
@@ -1183,7 +1183,7 @@ static ssize_t delete_child(struct device *dev,

 	return ret ? ret : count;
 }
-static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
+static DEVICE_ATTR(delete_child, S_IWUSR, NULL, delete_child);

 int ipoib_add_pkey_attr(struct net_device *dev)
 {

-- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Handling busy responses from the SA
Mike, On Mon, Jun 7, 2010 at 12:00 PM, Mike Heinz michael.he...@qlogic.com wrote: Hal said: Should a busy be retried at all at the mad layer ? Is a special (longer) timeout policy for busy needed ? Also, should this be done for all MADs classified by ib_response_mad (e.g. trap represses) ? Hal, The idea of processing BUSY responses in the MAD layer is to treat BUSY responses like timeouts - which are currently handled by the MAD layer. Right now there is an issue where various apps and ULPs either treat BUSY as a cause to immediately retry or as a permanent error. This doesn't seem to affect users of the OpenSM so much because (as I understand it) the OpenSM seems to discard requests when it gets too busy - but for other SA/SMs, it can cause a major packet storm or, worse, a simple loss of connectivity where MPI jobs or kernel ULPs simply assume the SA is broken because they got a BUSY reply. By treating the BUSY reply as a timeout, we're actually simplifying matters by fitting into existing practice. Understood. Timing these out makes sense to me but still does not preclude the client from potentially handling this if the retries fail. As for needing a longer timeout - in our old proprietary stack, QLogic did have a longer timeout for retrying busy replies than for normal timeouts How much longer ? What are the two timeouts used ? - but we should try to get this in now so we can get some relief before we begin the long term discussion of the best way to handle this issue overall. All I was getting at here was: does retrying when busy work ? If not, why retry at all at the MAD layer (regardless of retries requested) and perhaps use a longer timeout for this. If it does work, maybe the timeout on the subsequent retries should be extended. I think my two other comments on details are relevant to an updated patch.
-- Hal
RE: Handling busy responses from the SA
As for needing a longer timeout - in our old proprietary stack, QLogic did have a longer timeout for retrying busy replies than for normal timeouts - but we should try to get this in now so we can get some relief before we begin the long term discussion of the best way to handle this issue overall. Because applications may handle BUSY replies differently, we shouldn't simply start hiding them from the user. I would much rather agree on the longer term plan, so that the ABI can reflect the proper semantics. I don't see any issue with changing the current behavior for kernel clients, however. - Sean
RE: Handling busy responses from the SA
Sean said, Because applications may handle BUSY replies differently, we shouldn't simply start hiding them from the user. Sean - remember that this patch will still return a BUSY status to the caller, if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all. Hal said, All I was getting at here was: does retrying when busy work ? If not, why retry at all at the MAD layer (regardless of retries requested) and perhaps use a longer timeout for this. If it does work, maybe the timeout on the subsequent retries should be extended. Personally, I think it's been extremely helpful - we've been using busy status to tell compute nodes to slow down since our old proprietary stack and we've seen a significant improvement in overall traffic congestion when we added this patch to OFED clusters using our SM. In addition, use of the BUSY return code simplifies debugging traffic congestion problems (since it allows you to immediately differentiate between SA overload and other traffic issues) and it paves the way for more sophisticated back-off strategies in the future. As to that, and your question, our old stack used two different timeout values specified by the client. One value was for actual timeouts and one for busy responses. In the case of busy responses, we added a randomization factor to spread out the traffic. The issue with adapting that to the Linux-RDMA stack is that it's an API change. What I would suggest personally, is something like this:

1. Take either the timeout passed by the caller OR a predefined constant, whichever is larger. I would suggest setting the predefined constant to something moderate, say 2 seconds.
2. Add a randomization factor - say between -250 and +250 ms?
3. Update the packet timeout with this new value.
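The three steps Mike suggests can be sketched in a few lines of C. This is a userspace illustration only, not code from the patch: the constant and function names are hypothetical, the 2 second floor and the 250 ms jitter are the values floated in the message, and rand() stands in for whatever random source a real implementation would use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical floor for BUSY retries ("say 2 seconds" in the thread). */
#define BUSY_MIN_TIMEOUT_MS 2000
/* Hypothetical jitter half-width ("between -250 and +250 ms"). */
#define BUSY_JITTER_MS 250

/* Step 1: the larger of the caller's timeout and the predefined constant. */
static int busy_base_timeout_ms(int caller_timeout_ms)
{
	return caller_timeout_ms > BUSY_MIN_TIMEOUT_MS ?
	       caller_timeout_ms : BUSY_MIN_TIMEOUT_MS;
}

/* Step 2: draw a jitter in [-BUSY_JITTER_MS, +BUSY_JITTER_MS]. */
static int busy_random_jitter_ms(void)
{
	return rand() % (2 * BUSY_JITTER_MS + 1) - BUSY_JITTER_MS;
}

/*
 * Step 3: the value that would replace the packet timeout. The jitter is
 * passed in so the arithmetic can be checked deterministically.
 */
static int busy_retry_timeout_ms(int caller_timeout_ms, int jitter_ms)
{
	assert(jitter_ms >= -BUSY_JITTER_MS && jitter_ms <= BUSY_JITTER_MS);
	return busy_base_timeout_ms(caller_timeout_ms) + jitter_ms;
}
```

A caller would combine the pieces as busy_retry_timeout_ms(timeout, busy_random_jitter_ms()). Taking the jitter as a parameter is a design choice for testability; it does not change the policy described in the message.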
RE: Handling busy responses from the SA
Anyone know why my messages are being appended with interesting garbage? -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Mike Heinz Sent: Tuesday, June 08, 2010 11:49 AM To: Hal Rosenstock Cc: linux-rdma@vger.kernel.org Subject: RE: Handling busy responses from the SA
RE: Handling busy responses from the SA
Anyone know why my messages are being appended with interesting garbage? I get that too. I first noticed it a couple of weeks ago. It eventually went back to the normal 'To unsubscribe from this list' message.
RE: Handling busy responses from the SA
Sean - remember that this patch will still return a BUSY status to the caller, if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all. It looks like it only returns the BUSY response if that matches with the last retry, otherwise, the BUSY response is dropped. It also looks like it applies to all MADs, including vendor specific ones, and not just those from the SA. - Sean
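The behavior Sean describes can be made concrete with a small simulation. This is a sketch of the policy as discussed in the thread, not the patch itself: the enum, the function name, and the scripted-outcome array are all hypothetical. BUSY is treated like a timeout, so an intermediate BUSY is dropped and only the outcome of the final attempt reaches the caller.

```c
#include <assert.h>

/* Hypothetical outcome of one send attempt. */
enum mad_result { MAD_OK, MAD_TIMEOUT, MAD_BUSY };

/*
 * outcomes[i] is the scripted result of attempt i. The request is sent
 * once and then retried up to `retries` times. Returns what the caller
 * would finally see.
 */
static enum mad_result simulate_request(const enum mad_result *outcomes,
					int n_outcomes, int retries)
{
	enum mad_result last = MAD_TIMEOUT;
	int attempt;

	assert(retries >= 0 && n_outcomes > retries);
	for (attempt = 0; attempt <= retries; attempt++) {
		last = outcomes[attempt];
		if (last == MAD_OK)
			return MAD_OK;
		/* BUSY is handled like TIMEOUT: dropped, then retried. */
	}
	/* Retries exhausted: the last status (BUSY or TIMEOUT) is returned. */
	return last;
}
```

With retries set to zero the first status is returned unchanged, which matches Mike's point that such callers are unaffected. A BUSY followed by a timeout on the last retry is reported as a timeout, which is the dropped-BUSY case Sean points out.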
RE: Handling busy responses from the SA
Right. Effectively this is similar to the I/O resolution timeout policy laid out in the spec. -Original Message- From: Hefty, Sean [mailto:sean.he...@intel.com] Sent: Tuesday, June 08, 2010 12:27 PM To: Mike Heinz; Hal Rosenstock Cc: linux-rdma@vger.kernel.org Subject: RE: Handling busy responses from the SA Sean - remember that this patch will still return a BUSY status to the caller, if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all. It looks like it only returns the BUSY response if that matches with the last retry, otherwise, the BUSY response is dropped. It also looks like it applies to all MADs, including vendor specific ones, and not just those from the SA. - Sean
RE: [PATCH] Handling busy responses from the SA
Also, I guess, it would be a good API choice if the caller could say 'get me a reply for this mad or error within 60s' rather than specify details like retry counts, etc. The timeout values should be globally set and derived from the usual SA provided data for network transits... I agree with this. Within the framework of the existing umad ABI, this could be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, assuming that no one is using that bit in practice. The kernel could then freely select the retry/timeout policy for these clients, which for starters could include dropping BUSY responses and adjusting the timeout using an approach similar to what Mike mentioned in a separate email. Kernel clients could be updated to use this new mode. Any disagreements to this approach?
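The high-bit idea could look roughly like the following. This is a sketch of the proposal only: no such flag exists in the umad ABI as discussed here, the macro and helper names are hypothetical, and timeout_ms is assumed to be a 32-bit unsigned field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag: high bit of timeout_ms selects kernel-chosen policy. */
#define UMAD_TIMEOUT_KERNEL_POLICY (UINT32_C(1) << 31)

/* Client side: request "reply or error within deadline_ms". */
static uint32_t umad_timeout_encode_deadline(uint32_t deadline_ms)
{
	/* The deadline itself must fit in the remaining 31 bits. */
	assert((deadline_ms & UMAD_TIMEOUT_KERNEL_POLICY) == 0);
	return deadline_ms | UMAD_TIMEOUT_KERNEL_POLICY;
}

/* Kernel side: did the client opt in to the new mode? */
static int umad_timeout_is_kernel_policy(uint32_t timeout_ms)
{
	return (timeout_ms & UMAD_TIMEOUT_KERNEL_POLICY) != 0;
}

/* Kernel side: the time value with the flag bit masked off. */
static uint32_t umad_timeout_value(uint32_t timeout_ms)
{
	return timeout_ms & ~UMAD_TIMEOUT_KERNEL_POLICY;
}
```

Existing clients that pass an ordinary timeout keep the current semantics as long as they never set bit 31, which is the assumption Sean states in the message.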
RE: Handling busy responses from the SA
Sean - Is there a case where we would ever want to treat BUSY responses differently from timeouts? -Original Message- From: Hefty, Sean [mailto:sean.he...@intel.com] Sent: Tuesday, June 08, 2010 12:27 PM To: Mike Heinz; Hal Rosenstock Cc: linux-rdma@vger.kernel.org Subject: RE: Handling busy responses from the SA Sean - remember that this patch will still return a BUSY status to the caller, if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all. It looks like it only returns the BUSY response if that matches with the last retry, otherwise, the BUSY response is dropped. It also looks like it applies to all MADs, including vendor specific ones, and not just those from the SA. - Sean
RE: Handling busy responses from the SA
Is there a case where we would ever want to treat BUSY responses differently from timeouts? I doubt it for a single MAD, but I can't say what people may have implemented. The main difference I can think of is that a busy response requires a retry, whereas a timeout does not. This affects the retry policy when multiple MADs are outstanding. E.g. if there are 10 requests outstanding and the first times out, we may only resend the first request and increase the timeouts of the other 9. If the 10 requests all receive a busy, then they must all be retried. To me, it looks like it makes more sense to never send busy, except maybe when receive buffer space is fully consumed, but implement a more intelligent timeout/retry mechanism on the sender side. The SA almost needs some sort of MRA like message. - Sean
RE: [PATCH v2] allow passthrough of rmpp protocol to user mad clients
On a different subject - have we come to any conclusions about this patch? -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Mike Heinz Sent: Friday, June 04, 2010 1:14 PM To: linux-rdma@vger.kernel.org; Hal Rosenstock; Hefty, Sean; Roland Dreier Subject: [PATCH v2] allow passthrough of rmpp protocol to user mad clients This is an update to the previous version of the patch, based on feedback from Hal. Currently, if a user application calls umad_register() or umad_register_oui() with an rmpp_version of zero, incoming rmpp messages are discarded and if the rmpp_version is 1, incoming rmpp packets are collected by the kernel layer and passed as a group to the user application. This patch changes this behavior so that rmpp_version of 255 causes incoming rmpp packets to be passed through without alteration, instead. There are IB users who have requested the ability to perform RMPP transaction handling in user space. This was an option in old proprietary stacks and this is useful to migrate old applications to OFED while containing the scope of their application changes. 
Signed-Off-By: Michael Heinz michael.he...@qlogic.com
---
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index ef1304f..efca783 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -207,12 +207,18 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	int ret2, qpn;
 	unsigned long flags;
 	u8 mgmt_class, vclass;
+	u8 rmpp_passthru = 0;

 	/* Validate parameters */
 	qpn = get_spl_qp_index(qp_type);
 	if (qpn == -1)
 		goto error1;

+	if (rmpp_version == IB_MGMT_RMPP_PASSTHRU) {
+		rmpp_passthru = 255;
+		rmpp_version = 0;
+	}
+
 	if (rmpp_version && rmpp_version != IB_MGMT_RMPP_VERSION)
 		goto error1;

@@ -244,6 +250,7 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 			if (!is_vendor_oui(mad_reg_req->oui))
 				goto error1;
 		}
+
 		/* Make sure class supplied is consistent with RMPP */
 		if (!ib_is_mad_class_rmpp(mad_reg_req->mgmt_class)) {
 			if (rmpp_version)
@@ -302,6 +309,7 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	mad_agent_priv->qp_info = &port_priv->qp_info[qpn];
 	mad_agent_priv->reg_req = reg_req;
 	mad_agent_priv->agent.rmpp_version = rmpp_version;
+	mad_agent_priv->agent.rmpp_passthru = rmpp_passthru;
 	mad_agent_priv->agent.device = device;
 	mad_agent_priv->agent.recv_handler = recv_handler;
 	mad_agent_priv->agent.send_handler = send_handler;
@@ -1792,7 +1800,7 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,

 	INIT_LIST_HEAD(&mad_recv_wc->rmpp_list);
 	list_add(&mad_recv_wc->recv_buf.list, &mad_recv_wc->rmpp_list);
-	if (mad_agent_priv->agent.rmpp_version) {
+	if (mad_agent_priv->agent.rmpp_version && !mad_agent_priv->agent.rmpp_passthru) {
 		mad_recv_wc = ib_process_rmpp_recv_wc(mad_agent_priv,
 						      mad_recv_wc);
 		if (!mad_recv_wc) {
@@ -1801,29 +1809,47 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 		}
 	}

+	/*
+	 * At this point, the MAD is either not an RMPP or we are passing RMPPs thru to
+	 * the client.
+	 */
 	/* Complete corresponding request */
 	if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
 		mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
-		if (!mad_send_wr) {
+		if (mad_send_wr) {
+			ib_mark_mad_done(mad_send_wr);
 			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
-			ib_free_recv_mad(mad_recv_wc);
-			deref_mad_agent(mad_agent_priv);
-			return;
-		}
-		ib_mark_mad_done(mad_send_wr);
-		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);

-		/* Defined behavior is to complete response before request */
-		mad_recv_wc->wc->wr_id = (unsigned long) &mad_send_wr->send_buf;
-		mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent,
-						   mad_recv_wc);
-		atomic_dec(&mad_agent_priv->refcount);
+			/* Defined behavior is to complete response before request */
+			mad_recv_wc->wc->wr_id = (unsigned long) &mad_send_wr->send_buf;
+			mad_agent_priv->agent.recv_handler(&mad_agent_priv->agent,
+
RE: [PATCH] Handling busy responses from the SA
It's workable, although I really wish there was a way to handle stupid apps that aren't written to handle a busy response. -Original Message- From: Hefty, Sean [mailto:sean.he...@intel.com] Sent: Tuesday, June 08, 2010 12:44 PM To: Jason Gunthorpe Cc: Mike Heinz; linux-rdma@vger.kernel.org; e...@openfabrics.org Subject: RE: [PATCH] Handling busy responses from the SA Also, I guess, it would be a good API choice if the caller could say 'get me a reply for this mad or error within 60s' rather than specify details like retry counts, etc. The timeout values should be globally set and derived from the usual SA provided data for network transits... I agree with this. Within the framework of the existing umad ABI, this could be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, assuming that no one is using that bit in practice. The kernel could then freely select the retry/timeout policy for these clients, which for starters could include dropping BUSY responses and adjusting the timeout using an approach similar to what Mike mentioned in a separate email. Kernel clients could be updated to use this new mode. Any disagreements to this approach?
librdmacm 1.0.12 release notes for OFED 1.5.2
Here is the first cut at release notes -- attached and inline -- for the OFED 1.5.2 release of the librdmacm. - Sean --- librdmacm release notes --- Several enhancements were added to librdmacm release 1.0.12 that are intended to simplify using RDMA devices and address scalability issues. These changes were in response to long standing requests to make connection establishment 'more like sockets'. For full details, users should refer to the appropriate man pages. Major changes include: * Support synchronous operation for library calls. Users can control whether an rdma_cm_id operates asynchronously or synchronously based on the rdma_event_channel parameter. Use of synchronous operations reduces the amount of application code required to use the librdmacm by eliminating the need for event processing code. An rdma_cm_id will be marked for synchronous operation if the rdma_event_channel parameter is NULL for rdma_create_id or rdma_migrate_id. Users can toggle between synchronous and asynchronous operation through the rdma_migrate_id call. Calls that operate synchronously include rdma_resolve_addr, rdma_resolve_route, rdma_connect, rdma_accept, and rdma_get_request. Synchronous event data is returned to the user through the rdma_cm_id. * The addition of a new API: rdma_getaddrinfo. This call is modeled after getaddrinfo, but for RDMA devices and connections. It has the following notable deviations from getaddrinfo: A source address is returned as part of the call to allow the user to allocate necessary local HW resources for connections. Optional routing information may be returned to support Infiniband fabrics. IB routing information includes necessary path record data. rdma_getaddrinfo will obtain this information if IB ACM support (see below) is enabled. The use of IB ACM is not required for rdma_getaddrinfo. rdma_getaddrinfo provides future extensions to support more complex address and route resolution mechanisms, such as multiple path support and failover. 
* Support for new APIs: rdma_get_request, rdma_create_ep, and rdma_destroy_ep. rdma_get_request simplifies the passive side implementation by adding synchronous support for accepting new connections. rdma_create_ep combines the functionality of rdma_create_id, rdma_create_qp, rdma_resolve_addr, and rdma_resolve_route in a single API that uses the output of rdma_getaddrinfo as its input. * Support for optional parameters. To simplify support for casual RDMA developers and researchers, the librdmacm can allocate protection domains, completion queues, and queue pairs on a user's behalf. This simplifies the amount of information that a developer must learn in order to use RDMA, plus allows the user to take advantage of higher-level completion processing abstractions. In addition to optional parameters, a user can also specify that the librdmacm should automatically select usable values for RDMA read operations. * Add support for IB ACM. IB ACM (InfiniBand Assistant for Communication Management) defines a socket based protocol to an IB address and route resolution service. One implementation of that service is provided separately by the ibacm package, but anyone can implement the service provided that they adhere to the IB ACM socket protocol. IB ACM is an experimental service targeted at increasing the scalability of applications running on a large cluster. Use of IB ACM is not required and is controlled through the build option '--with-ib_acm'. If the librdmacm fails to contact the IB ACM service, it reverts to using kernel services to resolve address and routing data. * Add RDMA helper routines. The librdmacm provides a set of simpler verbs calls for posting work requests, registering memory, and checking for completions. These calls are wrappers around libibverbs routines.
Re: Handling busy responses from the SA
On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote: Sean - remember that this patch will still return a BUSY status to the caller, if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all. It looks like it only returns the BUSY response if that matches with the last retry, otherwise, the BUSY response is dropped. It also looks like it applies to all MADs, including vendor specific ones, and not just those from the SA. Per the proposed patch, it currently includes trap represses (as determined by ib_response_mad). Shouldn't busy be ignored for that case ? I don't think that would be used but it seems safer to me. -- Hal - Sean
RE: Handling busy responses from the SA
Hal, I may be confused - but I thought the spec said there was no valid response to a trap repress. I interpreted o14-3.a4: The SMA shall not send any message in response to a valid SubnTrapRepress() message to mean that the SMA isn't allowed to respond with a BUSY status for a trap repress. -Original Message- From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com] Sent: Tuesday, June 08, 2010 3:09 PM To: Hefty, Sean Cc: Mike Heinz; linux-rdma@vger.kernel.org Subject: Re: Handling busy responses from the SA On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote: Sean - remember that this patch will still return a BUSY status to the caller, if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all. It looks like it only returns the BUSY response if that matches with the last retry, otherwise, the BUSY response is dropped. It also looks like it applies to all MADs, including vendor specific ones, and not just those from the SA. Per the proposed patch, it currently includes trap represses (as determined by ib_response_mad). Shouldn't busy be ignored for that case ? I don't think that would be used but it seems safer to me. -- Hal - Sean
ibacm 1.0.0 release notes for OFED 1.5.2
Here are release notes -- attached and inline -- for IB ACM 1.0.0 for OFED 1.5.2. - Sean --- Assistant for InfiniBand Communication Management (IB ACM) Note: The IB ACM should be considered experimental. Overview The IB ACM package implements and provides a framework for experimental name, address, and route resolution services over InfiniBand. It is intended to address connection setup scalability issues running MPI applications on large clusters. The IB ACM provides information needed to establish a connection, but does not implement the CM protocol. The librdmacm can invoke IB ACM services when built using the --with-ib_acm option. The IB ACM services tie in under the rdma_resolve_addr, rdma_resolve_route, and rdma_getaddrinfo routines. For maximum benefit, the rdma_getaddrinfo routine should be used, however existing applications should still see significant connection scaling benefits using the calls available in librdmacm 1.0.11 and previous releases. The IB ACM is focused on being scalable and efficient. The current implementation limits network traffic, SA interactions, and centralized services. ACM supports multiple resolution protocols in order to handle different fabric topologies. This release 1.0.0 is limited in its handling of dynamic changes. The IB ACM package is comprised of two components: the ib_acm service and a test/configuration utility - ib_acme. Both are userspace components and are available for Linux and Windows. Additional details are given below. Quick Start Guide - 1. Prerequisites: libibverbs and libibumad must be installed. The IB stack should be running with IPoIB configured. These steps assume that the user has administrative privileges. 2. Install the IB ACM package This installs ib_acm, and ib_acme. 3. Run ib_acme -A -O This will generate IB ACM address and options configuration files. (acm_addr.cfg and acm_opts.cfg) 4. Run ib_acm and leave running. 
ib_acm will eventually be converted to a service/daemon, but for now is a userspace application. Because ib_acm uses the libibumad interfaces, it should be run with administrative privileges. 5. Optionally, run ib_acme -s source_ip -d dest_ip -v This will verify that the ib_acm service is running. 6. Install librdmacm using the build option --with-ib_acm. The librdmacm will automatically use the ib_acm service. On failures, the librdmacm will fall back to normal resolution. Details --- ib_acme: The ib_acme program serves a dual role. It acts as a utility to test ib_acm operation and help verify if the ib_acm service and selected protocol is usable for a given cluster configuration. Additionally, it automatically generates ib_acm configuration files to assist with or eliminate manual setup. acm configuration files: The ib_acm service relies on two configuration files. The acm_addr.cfg file contains name and address mappings for each IB device, port, pkey endpoint. Although the names in the acm_addr.cfg file can be anything, ib_acme maps the host name and IP addresses to the IB endpoints. The acm_opts.cfg file provides a set of configurable options for the ib_acm service, such as timeout, number of retries, logging level, etc. ib_acme generates the acm_opts.cfg file using static information. A future enhancement would adjust options based on the current system and cluster size. ib_acm: The ib_acm service is responsible for resolving names and addresses to InfiniBand path information and caching such data. It is currently implemented as an executable application, but is a conceptual service or daemon that should execute with administrative privileges. The ib_acm implements a client interface over TCP sockets, which is abstracted by the librdmacm library. One or more back-end protocols are used by the ib_acm service to satisfy user requests.
Although the ib_acm supports standard SA path record queries on the back-end, it provides an experimental multicast resolution protocol in hope of achieving greater scalability. The latter is not usable on all fabric topologies, specifically ones that may not have reversible paths. Users should use the ib_acme utility to verify that multicast protocol is usable before running other applications. Conceptually, the ib_acm service implements an ARP like protocol and either uses IB multicast records to construct path record data or queries the SA directly, depending on the selected route protocol. By default, the ib_acm services uses and caches SA path record queries. Specifically, all IB endpoints join a number of multicast groups. Multicast groups differ based on rates, mtu, sl, etc., and are prioritized. All participating endpoints must be able to communicate on the lowest priority multicast group. The ib_acm assigns one or more names/addresses to each IB endpoint using the acm_addr.cfg file. Clients provide source and destination names or addresses as input to the service, and receive as
librdma_cm: client example failed
Hi, I've downloaded the latest version of librdma_cm 1.0.12. I got an unexpected segfault (hum, usually a segfault is not really expected .. ;) On the same server I execute the rdma_server and rdma_client. The first is waiting for an incoming message. The second one segfaults. This is the backtrace of the rdma_client:

#0 0x7f3de011482d in ?? () from /usr/lib/libmlx4-rdmav2.so
#1 0x7f3de011640e in ?? () from /usr/lib/libmlx4-rdmav2.so
#2 0x7f3de0a94244 in __ibv_modify_qp (qp=0x2458d40, attr=0x0, attr_mask=57) at src/verbs.c:474
#3 0x7f3de0c9b292 in ucma_init_conn_qp (id_priv=0x24590d0, qp=0x2458d40) at src/cma.c:1060
#4 0x7f3de0c9b3a3 in rdma_create_qp (id=0x24590d0, pd=<value optimized out>, qp_init_attr=0x7fff973aa090) at src/cma.c:1203
#5 0x7f3de0c9d6a9 in rdma_create_ep (id=0x601650, res=0x2458650, pd=0x0, qp_init_attr=0x7fff973aa090) at src/cma.c:2153
#6 0x00400bae in run () at examples/rdma_client.c:67
#7 0x00400fcf in main (argc=1, argv=0x7fff973aa288) at examples/rdma_client.c:131

The IB chipset is the following one: 04:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX IB QDR, PCIe 2.0 5GT/s] (rev a0)
Kernel in use: Linux 2.6.32.7.v01 #1 SMP Wed Feb 3 15:45:37 CET 2010 x86_64 GNU/Linux
libibverbs: 1.1.2
libmlx4: 1.0

Any help will be appreciated, Thank you Joel
Re: Handling busy responses from the SA
Is there a case where we would ever want to treat BUSY responses differently from timeouts? If there isn't then it's silly for the SA to ever send a BUSY response. - R. -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
RE: MPI traffic with service Level
I'm interested in MPI traffic that takes into consideration the SL (service level) to path mapping (as decided by the SA/SM), e.g. the LASH routing algorithm, which uses SL/VL for deadlock avoidance in routing. Is there any way that I can make MPI traffic use the SL indicated by the SA/SM? Any help/hints would be appreciated. Thanks again. If MPI is set to use the rdma_cm, it will obtain SL information from the SA. I believe most MPIs support this option through some means. Beyond that, there may be other ways to do this based on the MPI that you're using. - Sean
RE: librdma_cm: client example failed
On the same server I execute the rdma_server and rdma_client. The first is waiting incoming message. The second one segfault. This is the backtrace of the rdma_client: #0 0x7f3de011482d in ?? () from /usr/lib/libmlx4-rdmav2.so #1 0x7f3de011640e in ?? () from /usr/lib/libmlx4-rdmav2.so #2 0x7f3de0a94244 in __ibv_modify_qp (qp=0x2458d40, attr=0x0, attr_mask=57) at src/verbs.c:474 The attr parameter doesn't look right. ucma_init_conn_qp calls ibv_modify_qp using an attr parameter from the stack. The attr_mask looks like it could be correct. Can you try updating the libmlx4 library and see if you get the same results? - Sean
Re: Handling busy responses from the SA
Mike,

On Tue, Jun 8, 2010 at 3:59 PM, Mike Heinz michael.he...@qlogic.com wrote:

Hal, I may be confused - but I thought the spec said there was no valid response to a trap repress. I interpreted o14-3.a4: "The SMA shall not send any message in response to a valid SubnTrapRepress() message" to mean that the SMA isn't allowed to respond with a BUSY status for a trap repress.

I'm referring to the receipt of the TrapRepress with busy status. Wouldn't your patch cause the original Trap to be resent when retries > 0? TrapRepress is essentially a response to Trap and is classified as such by ib_response_mad. Your proposed patch treats a busy as a timeout and can cause a retry of the originally sent Trap.

-- Hal

-----Original Message-----
From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
Sent: Tuesday, June 08, 2010 3:09 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-rdma@vger.kernel.org
Subject: Re: Handling busy responses from the SA

On Tue, Jun 8, 2010 at 12:27 PM, Hefty, Sean sean.he...@intel.com wrote:

Sean - remember that this patch will still return a BUSY status to the caller: if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all.

It looks like it only returns the BUSY response if that matches with the last retry; otherwise, the BUSY response is dropped. It also looks like it applies to all MADs, including vendor-specific ones, and not just those from the SA.

Per the proposed patch, it currently includes trap represses (as determined by ib_response_mad). Shouldn't busy be ignored for that case? I don't think that would be used but it seems safer to me.

-- Hal

- Sean
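The behaviour debated in this thread can be summarized in a small sketch. This is not the proposed patch, only one reading of the discussion: a BUSY response is treated like a timeout while retries remain, is delivered to the caller once retries are exhausted, and (following Hal's suggestion) is never turned into a retry for a TrapRepress. The enum, the function classify_response and its parameters are hypothetical; the two constants use the values found in the kernel's include/rdma/ib_mad.h, and status is assumed to be in host byte order here.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel's include/rdma/ib_mad.h. */
#define IB_MGMT_MAD_STATUS_BUSY		0x0001
#define IB_MGMT_METHOD_TRAP_REPRESS	0x07

/* Hypothetical result of looking at one received response MAD. */
enum busy_action {
	DELIVER_TO_CALLER,	/* pass the response and its status up */
	RETRY_AS_TIMEOUT,	/* drop the response, resend the request */
};

/* Hypothetical helper: status is the MAD common status field, already
 * converted to host byte order (an assumption of this sketch). */
static enum busy_action classify_response(uint8_t method, uint16_t status,
					  int retries_left)
{
	assert(retries_left >= 0);

	/* Anything that is not BUSY is handled as before. */
	if (!(status & IB_MGMT_MAD_STATUS_BUSY))
		return DELIVER_TO_CALLER;

	/* Suggested exception: never resend the original Trap because
	 * a TrapRepress arrived with a busy status. */
	if (method == IB_MGMT_METHOD_TRAP_REPRESS)
		return DELIVER_TO_CALLER;

	/* With no retries left the caller sees the BUSY status, so code
	 * that sets retries to zero is unaffected. */
	if (retries_left == 0)
		return DELIVER_TO_CALLER;

	return RETRY_AS_TIMEOUT;
}
```

For example, classify_response(0x81, IB_MGMT_MAD_STATUS_BUSY, 3) (a GetResp with retries remaining) selects RETRY_AS_TIMEOUT, while the same status on method 0x07 is delivered unchanged.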
Re: [PATCH] opensm/osm_sa.c: In osm_sa_respond, only fill in attr offset if RMPP method
Hi Hal,

On 09:42 Thu 03 Jun , Hal Rosenstock wrote:

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
 opensm/opensm/osm_sa.c | 12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index 0aca81f..8325632 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
+ * Copyright (c) 2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -454,8 +455,15 @@ void osm_sa_respond(osm_sa_t *sa, osm_madw_t *madw, size_t attr_size,
 	/* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
 	resp_sa_mad->sm_key = 0;
-	/* Fill in the offset (paylen will be done by the rmpp SAR) */
-	resp_sa_mad->attr_offset = num_rec ? ib_get_attr_offset(attr_size) : 0;
+#ifdef DUAL_SIDED_RMPP
+	if (resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP ||
+	    resp_sa_mad->method == IB_MAD_METHOD_GETMULTI_RESP) {
+#else
+	if (resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP) {
+#endif
+		/* Fill in the offset (paylen will be done by the rmpp SAR) */
+		resp_sa_mad->attr_offset = num_rec ? ib_get_attr_offset(attr_size) : 0;
+	}

What is wrong with the current implementation?

Sasha

 	p = ib_sa_mad_get_payload_ptr(resp_sa_mad);
--
1.5.6.4
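To make the effect of the patch concrete, here is a standalone sketch of the decision it introduces: the attribute offset is filled in only for RMPP response methods and stays zero otherwise. It is not OpenSM code. The functions attr_offset_units and sa_resp_attr_offset are hypothetical, the compile-time DUAL_SIDED_RMPP switch is modelled as a runtime flag, the offset is kept in host byte order (OpenSM's ib_get_attr_offset returns network order), and rounding the size up to the next 8 bytes is an assumption of this sketch. The method values are the standard SA GetTableResp (0x92) and GetMultiResp (0x94) codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IB_MAD_METHOD_GETTABLE_RESP	0x92
#define IB_MAD_METHOD_GETMULTI_RESP	0x94

/* Hypothetical helper: attribute size in bytes expressed in 8-byte
 * units, host byte order. Rounding up is an assumption of this sketch. */
static uint16_t attr_offset_units(uint32_t attr_size)
{
	assert(attr_size <= 0xFFFFu * 8u);
	return (uint16_t)((attr_size + 7) >> 3);
}

/* Hypothetical model of the patched logic: only RMPP response methods
 * get a non-zero offset, and only when there are records to return.
 * Other methods are assumed to keep a zeroed offset field. */
static uint16_t sa_resp_attr_offset(uint8_t method, uint32_t attr_size,
				    uint32_t num_rec, bool dual_sided_rmpp)
{
	bool rmpp_method = method == IB_MAD_METHOD_GETTABLE_RESP ||
		(dual_sided_rmpp && method == IB_MAD_METHOD_GETMULTI_RESP);

	if (!rmpp_method || num_rec == 0)
		return 0;

	return attr_offset_units(attr_size);
}
```

With a 64-byte attribute, a GetTableResp carrying at least one record gets an offset of 8, whereas a plain GetResp (0x81) gets 0. Before the patch, the offset was computed for every response method.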