Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?
Or,

Counters should work as usual with the upstream kernel patches for the IB link layer. For IBoE, they will not work, since the SMA does not support them. I have patches that expose the counters through sysfs, but those have not yet been sent upstream. They are available in the latest OFED-1.5.2 builds.

On Thu, Jun 3, 2010 at 4:33 PM, Or Gerlitz ogerl...@voltaire.com wrote:

> Moni Shoua wrote:
>
>> Did you try OFED-1.5.1 or even better, OFED-1.5.2? I know patches for counters with RoCEE were submitted since OFED-1.5 and I saw it working
>
> Mony, I'm not using ofed, sorry... I am interested in a clarification in the context of the upstream submission, e.g. does the problem exist in the latest patch-set, is there a bz case tracking this, etc.
>
> Or.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2] opensm: Add a rate based mechanism for SMP transactions
In order to better handle non-responsive SMAs (when the link is physically up but the SMA does not respond), a rate based mechanism for SMPs is added to better enable forward progress in a more timely fashion. So rather than wait for timeouts and outstanding wire SMPs to drop below some configured value, there is also a periodic rate for transaction based SMPs. These rate based SMPs are capped at a configured maximum value.

Two new options are added for this:

rate_based_smp_usecs indicates the number of microseconds between rate based SMPs.

max_rate_based_smps indicates the maximum number of rate based SMPs supported. When this limit is reached, rate based SMPs are no longer sent (until the number of outstanding ones drops below this limit).

The rate based SMP mechanism can be disabled by setting rate_based_smp_usecs to 0. This is equivalent to the (current) algorithm prior to this change. By default, this mechanism is disabled.

Also, the maximum max_wire_smps is reduced to 0x3FFF from 0x7FFF so the sum of max_wire_smps and max_rate_based_smps does not wrap.

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
Changes from v1:
Algorithm change is isolated to vl15_poller rather than involving the vendor layer.

diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h
index e0d6c66..16241ec 100644
--- a/opensm/include/opensm/osm_base.h
+++ b/opensm/include/opensm/osm_base.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  *
@@ -440,6 +440,30 @@ BEGIN_C_DECLS
 */
 #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
 /***/
+/****d* OpenSM: Base/OSM_DEFAULT_SMP_RATE_MAX
+* NAME
+*	OSM_DEFAULT_SMP_RATE_MAX
+*
+* DESCRIPTION
+*	Specifies the default maximum number of outstanding rate based SMPs.
+*
+* SYNOPSIS
+*/
+#define OSM_DEFAULT_SMP_RATE_MAX 100
+/***/
+/****d* OpenSM: Base/OSM_DEFAULT_SMP_RATE
+* NAME
+*	OSM_DEFAULT_SMP_RATE
+*
+* DESCRIPTION
+*	Specifies the default rate (in usec) for rate based SMPs.
+*	The default rate is 0. A value of 0 (or EVENT_NO_TIMEOUT)
+*	disables the rate based SMP mechanism.
+*
+* SYNOPSIS
+*/
+#define OSM_DEFAULT_SMP_RATE 0
+/***/
 /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
 * NAME
 *	OSM_SM_DEFAULT_QP0_RCV_SIZE
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 4e8c862..49acf2d 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
@@ -149,6 +149,8 @@ typedef struct osm_subn_opt {
 	ib_net16_t m_key_lease_period;
 	uint32_t sweep_interval;
 	uint32_t max_wire_smps;
+	uint32_t max_rate_based_smps;
+	uint32_t rate_based_smp_usecs;
 	uint32_t transaction_timeout;
 	uint32_t transaction_retries;
 	uint8_t sm_priority;
@@ -264,6 +266,14 @@ typedef struct osm_subn_opt {
 *	max_wire_smps
 *		The maximum number of SMPs sent in parallel. Default is 4.
 *
+*	max_rate_based_smps
+*		The maximum number of rate based SMPs allowed to be outstanding.
+*		Default is 1000.
+*
+*	rate_based_smp_usecs
+*		The wait time in usec for rate based SMPs. Default is 0
+*		(disabled).
+*
 *	transaction_timeout
 *		The maximum time in milliseconds allowed for a transaction
 *		to complete. Default is 200.
diff --git a/opensm/include/opensm/osm_vl15intf.h b/opensm/include/opensm/osm_vl15intf.h
index 15ed56c..887733f 100644
--- a/opensm/include/opensm/osm_vl15intf.h
+++ b/opensm/include/opensm/osm_vl15intf.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -117,6 +117,8 @@ typedef struct osm_vl15 {
 	osm_thread_state_t thread_state;
 	osm_vl15_state_t state;
 	uint32_t max_wire_smps;
+	uint32_t max_rate_based_smps;
+	uint32_t rate_based_smp_usecs;
 	cl_event_t signal;
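
The send decision described in the patch text above can be sketched roughly as follows. This is only an illustration of the description, not the code from the patch (the vl15_poller changes are not shown in this excerpt). The struct, enum and function names are hypothetical; only the three option names come from the patch description.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical view of the limits; field names mirror the option names
 * in the patch description, everything else is illustrative. */
struct smp_limits {
	uint32_t max_wire_smps;        /* cap on ordinary outstanding SMPs */
	uint32_t max_rate_based_smps;  /* cap on outstanding rate based SMPs */
	uint32_t rate_based_smp_usecs; /* 0 disables the rate based mechanism */
};

enum smp_send_kind { SMP_SEND_NONE, SMP_SEND_WIRE, SMP_SEND_RATE_BASED };

/* Decide whether the poller may send another SMP right now. */
static enum smp_send_kind
smp_send_decision(const struct smp_limits *lim, uint32_t outstanding_wire,
		  uint32_t outstanding_rate, uint64_t usecs_since_rate_send)
{
	assert(lim != NULL);

	/* Existing behaviour: send while below the on-wire limit. */
	if (outstanding_wire < lim->max_wire_smps)
		return SMP_SEND_WIRE;

	/* Rate based path is off when the interval is 0. */
	if (lim->rate_based_smp_usecs == 0)
		return SMP_SEND_NONE;

	/* Cap reached: wait until outstanding rate based SMPs drain. */
	if (outstanding_rate >= lim->max_rate_based_smps)
		return SMP_SEND_NONE;

	/* Not enough time has passed since the last rate based SMP. */
	if (usecs_since_rate_send < lim->rate_based_smp_usecs)
		return SMP_SEND_NONE;

	return SMP_SEND_RATE_BASED;
}
```

With rate_based_smp_usecs set to 0 the function reduces to the plain on-wire check, which matches the statement that the default leaves the current algorithm unchanged.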
Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?
Eli Cohen wrote:

> counter should work as regular in upstream kernel patches for IB link layer.

Okay, good. Can you validate that? Basically, I can set aside some time to clone Roland's tree and use the iboe branch as a basis for testing that the IB stack is alive and kicking as it was before the patches. I just need an updated copy of the rest of the patch set (Roland has three patches so far) to that end. Over the review process there were a bunch of comments but no new posting. How are you planning to proceed in the review/merge process?

> for IBoE, they will not work since the SMA does not support them
> I have patches that allow to show counters using sysfs

I am not with you. The counters are read using a MAD sent to the firmware PMA (QP #1). This applies to both perfquery and sysfs, doesn't it?

Or.
Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?
Eli Cohen wrote:

> Why are you asking me to validate that? Did you actually encounter a problem with this?

Yes, I did. It didn't work with some ofed drop I was using. Anyway, as I said, I can do some validation that IBoE doesn't break upstream IB. I just need the patches to that end, so once they are available, I will give them a try over 2.6.35-rcX.

> The counters patches will divert the code: for iboe it will not issue a MAD to the firmware. It will use another command.

Can you be more specific about the origin of this new design? Is it a HW limitation, a firmware limitation, or something else? If it's not a hardware limitation, I don't think we need to go for a non-MAD-based scheme, at least not for mlx4.

Or.
RE: [PATCH] Handling busy responses from the SA
Roland Dreier said:

> I don't have a strong opinion on this but it seems a bit odd. If we're just going to drop the response anyway, why did the SA send it in the first place? On the other hand, if the SA told us it's busy, it does seem we could do something more sensible than retrying immediately.

The spec provides for the SA to return a BUSY response. When that happens, this patch causes us to wait for the original request to time out before retrying, rather than trying again immediately. In effect, we pretend we never got the BUSY response and allow the request to time out instead.

Roland Dreier said:

> The indentation of values seems pretty crazy here. Also I'm not sure what most of these defines are for? They seem unused in this patch.

The indentation is probably from the conversion of tabs to spaces when the patch was pasted into the email; correcting it is no problem. The value IB_MGMT_MAD_STATUS_BUSY is used in the patch; the others are defined because they are the other possible values for the same status field. We might as well define them all, for completeness.
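
A minimal sketch of the behaviour described above, assuming the busy indication is bit 0 of the common MAD status field. The patch itself is not shown in this thread, so the constant, enum and function names below are hypothetical stand-ins, and testing the bit with a mask (rather than comparing the whole field) is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to correspond to the IB_MGMT_MAD_STATUS_BUSY value the
 * patch refers to; defined locally so the sketch stands alone. */
#define SKETCH_MAD_STATUS_BUSY 0x0001u

enum response_action {
	DELIVER_TO_CLIENT,        /* normal path: hand the reply to the caller */
	DROP_AND_WAIT_FOR_TIMEOUT /* act as if no reply arrived */
};

/* status is the MAD status field already converted to host byte order. */
static enum response_action classify_sa_response(uint16_t status)
{
	/* A BUSY reply is discarded, so the pending request stays queued
	 * and the existing timeout/retry logic handles it later. */
	if (status & SKETCH_MAD_STATUS_BUSY)
		return DROP_AND_WAIT_FOR_TIMEOUT;

	return DELIVER_TO_CLIENT;
}
```

The point of the design is that no new retry path is needed: the request is retried by whatever timeout handling the MAD layer already has.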
RE: [PATCH] Handling busy responses from the SA
Sean said:

> I don't object to the concept of treating a busy response as a timeout, but how does this help prevent overwhelming the SA? It continues to retry the queries, even if the SA says that it's too busy to respond without adjusting the timeout specified by the user. I would think that you'd at least want to adjust the timeout (double it or use some random backoff).

Well, the current behavior is to simply return the BUSY to the client or ULP, which is either treated as a permanent error or causes an immediate retry. This can be a big problem with, for example, ipoib, which sets retries to 15 and (as I understand it) immediately retries to connect when getting an error response from the SA. Other ULPs have similar settings. Without some kind of delay, starting up ipoib on a large fabric (at boot time, for example) can cause a real packet storm. By treating BUSY replies identically to timeouts, this patch at least introduces a delay between attempts. In the case of the ULPs, the delay is typically 4 seconds.

Sean said:

> The general guideline that we've been using for adjusting timeouts has been to report the failures and let the caller make the necessary adjustments. As far as I know, the only way for user space applications to query the SA are through the librdmacm, which sets retries to 0, or through the libibumad interface directly. I would expect any application using the latter to be intelligent enough to handle a busy response.

And this approach encourages applications to adjust their timeouts appropriately by treating BUSY responses as non-events and forcing the applications to wait for their request to time out. Depending on the application developers to take BUSY responses into account seems to be asking for trouble: it allows one rogue app to bring the SA to its knees, for example. By enforcing this timeout model in the kernel, we guarantee that there will be at least some delay between each message when the SA is reporting a busy status. And, as I previously mentioned, this patch also affects kernel code, much of which does use retries.

Sean said:

> Maybe we should re-think that guideline and allow users to simply indicate that the MAD layer should use reasonable defaults. This would enable the ib_mad module to adjust the timeout values for all consumers based on actual destination response times. It could also back off retrying multiple requests that were initiated around the same time, instead only retrying the first request, while simply increasing the timeout values for the others. This is more complex, but we should be able to start with something fairly simple.

It's an interesting idea, but in the meantime this is a problem that affects large clusters today.
Re: [Patch] infiniband: check local reserved ports
> So this patch looks good for you? :)

Yes, will queue it up, thanks.

--
Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
RE: [ANNOUNCE] librdmacm 1.0.12: release notes text
On Jun 01, 2010 05:18 PM, Hefty, Sean sean.he...@intel.com wrote:

>> gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -g -Wall -D_GNU_SOURCE -O2 -g -pipe -m64 -MT src_librdmacm_la-acm.lo -MD -MP -MF .deps/src_librdmacm_la-acm.Tpo -c src/acm.c -fPIC -DPIC -o .libs/src_librdmacm_la-acm.o
>> In file included from src/acm.c:44:
>> ./include/infiniband/ib.h:49: error: syntax error before __be16
>
> At least the following definitions are missing from types.h in redhat 4.x, which is based on 2.6.9. (These are from RH 5.x.)
>
> #ifdef __CHECKER__
> #define __bitwise__ __attribute__((bitwise))
> #else
> #define __bitwise__
> #endif
> #ifdef __CHECK_ENDIAN__
> #define __bitwise __bitwise__
> #else
> #define __bitwise
> #endif
>
> typedef __u16 __bitwise __le16;
> typedef __u16 __bitwise __be16;
> typedef __u32 __bitwise __le32;
> typedef __u32 __bitwise __be32;
> #if defined(__GNUC__) && !defined(__STRICT_ANSI__)
> typedef __u64 __bitwise __le64;
> typedef __u64 __bitwise __be64;
> #endif
>
> Has OFED handled a similar problem in the past? If so, how? (I can think of a couple ways to deal with this.)
>
> - Sean

Hi Sean,

I'm sorry, but I have no information about the release notes text of librdmacm version 1.0.12. It's possible that I have lost the announcement email. Could you please send me the release notes, if they are ready?

Thank you very much.

Regards,
Andrea

Andrea Gozzelino
INFN - Laboratori Nazionali di Legnaro (LNL)
Viale dell'Universita' 2
I-35020 - Legnaro (PD) - ITALIA
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it
RE: Handling busy responses from the SA
Hal said:

> Should a busy be retried at all at the mad layer? Is a special (longer) timeout policy for busy needed? Also, should this be done for all MADs classified by ib_response_mad (e.g. trap represses)?

Hal,

The idea of processing BUSY responses in the MAD layer is to treat BUSY responses like timeouts, which are currently handled by the MAD layer. Right now there is an issue where various apps and ULPs either treat BUSY as a cause to immediately retry or as a permanent error. This doesn't seem to affect users of the OpenSM so much because (as I understand it) the OpenSM seems to discard requests when it gets too busy. For other SA/SMs, though, it can cause a major packet storm or, worse, a simple loss of connectivity where MPI jobs or kernel ULPs simply assume the SA is broken because they got a BUSY reply.

By treating the BUSY reply as a timeout, we're actually simplifying matters by fitting into existing practice.

As for needing a longer timeout: in our old proprietary stack, QLogic did have a longer timeout for retrying busy replies than for normal timeouts. But we should try to get this in now so we can get some relief before we begin the long-term discussion of the best way to handle this issue overall.
RE: [PATCH] Handling busy responses from the SA
> But, I also agree with Roland.. having the SA return busy when it is under load seems insane :)

In that case, what is the purpose of the BUSY response?

-----Original Message-----
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
Sent: Friday, June 04, 2010 6:58 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-rdma@vger.kernel.org; e...@openfabrics.org
Subject: Re: [PATCH] Handling busy responses from the SA

On Fri, Jun 04, 2010 at 02:05:10PM -0700, Hefty, Sean wrote:

> Maybe we should re-think that guideline and allow users to simply indicate that the MAD layer should use reasonable defaults. This would enable the ib_mad module to adjust the timeout values for all consumers based on actual destination response times. It could also back off retrying multiple requests that were initiated around the same time, instead only retrying the first request, while simply increasing the timeout values for the others. This is more complex, but we should be able to start with something fairly simple.

A common method for handling this sort of thing is to randomize the retry timeout. It would be a good idea to randomize all timeouts, but the BUSY replies should probably randomize over a longer time period. Randomization prevents nodes in the cluster from self-synchronizing and making the load on the SA worse.

But, I also agree with Roland.. having the SA return busy when it is under load seems insane :) But if you really want to do this then I think a different, larger, timeout should be used than the standard mad timeout.

Also, I guess, it would be a good API choice if the caller could say 'get me a reply for this mad or error within 60s' rather than specify details like retry counts, etc. The timeout values should be globally set and derived from the usual SA provided data for network transits...

Jason
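
Jason's suggestion of randomized retry timeouts, with a wider window after a BUSY reply, could look roughly like this. It is a sketch only: the function names, the small generator, and the factor of 4 for the BUSY window are all assumptions made for illustration, and nothing here is taken from ib_mad.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Small linear congruential generator so the sketch has no external
 * dependencies; real code would use the platform's random source. */
static uint32_t next_random(uint32_t *state)
{
	*state = *state * 1103515245u + 12345u;
	return (*state >> 16) & 0x7fffu;
}

/*
 * Return base_ms plus a random offset in [0, window), where the window
 * is spread_ms for ordinary timeouts and 4 * spread_ms after a BUSY
 * reply (the factor 4 is an arbitrary illustrative choice).
 */
static uint32_t retry_timeout_ms(uint32_t base_ms, uint32_t spread_ms,
				 int got_busy, uint32_t *rng_state)
{
	uint32_t window;

	assert(rng_state != NULL);

	window = got_busy ? spread_ms * 4u : spread_ms;
	if (window == 0)
		return base_ms;

	return base_ms + next_random(rng_state) % window;
}
```

For example, with a 4000 ms base (the typical ULP delay mentioned earlier in the thread) and a 1000 ms spread, ordinary retries land between 4000 and 4999 ms, while retries after BUSY land between 4000 and 7999 ms, so nodes that were refused at the same moment do not all come back together.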
RE: [PATCH v2] allow passthrough of rmpp packets to user mad clients
We also have some management applications that need these capabilities. For those applications, application-level RMPP control allows the application to pace the RMPP transactions, permits some parts of the RMPP response to be built on the fly, and permits a degree of sharing of the response data between multiple requestors.

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com]
Sent: Friday, June 04, 2010 3:54 PM
To: Mike Heinz
Cc: linux-rdma@vger.kernel.org; Hal Rosenstock; Hefty, Sean
Subject: Re: [PATCH v2] allow passthrough of rmpp packets to user mad clients

> This patch changes this behavior so that rmpp_version of 255 causes incoming rmpp packets to be passed through without alteration, instead.
>
> There are IB users who have requested the ability to perform RMPP transaction handling in user space. This was an option in old proprietary stacks and this is useful to migrate old applications to OFED while containing the scope of their application changes.

I'm a little dubious about this. We have an RMPP implementation in the kernel, and it seems worthwhile to focus on stability and features there. Allowing alternate RMPP implementations in userspace seems a bit iffy -- we don't have a socket option that lets us do TCP in userspace for a given connection, for example.

--
Roland Dreier rola...@cisco.com
Re: InfiniBand/RDMA merge plans for 2.6.35
On 05/17/2010 09:36 AM, Roland Dreier wrote:

> Since 2.6.34 is here, it's probably a good time to talk about 2.6.35 merge plans. All the pending things that I'm aware of are listed below.
>
> Boilerplate: If something isn't already in my tree and it isn't listed below, I probably missed it or dropped it unintentionally. Please remind me.

Hi Roland,

Could you please also pick up the 3-patch set "Least attached vector support" from Yevgeny, posted on 2010-5-13? The RDS changes depend on these.

Thanks

--
Regards
-- Andy
Re: [Patch] infiniband: check local reserved ports
On 06/07/10 23:45, Roland Dreier wrote:

>> So this patch looks good for you? :)
>
> Yes, will queue it up, thanks.

Thanks!