Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-07 Thread Eli Cohen
Or, counters should work as usual with the upstream kernel patches for the IB
link layer. For IBoE, they will not work since the SMA does not
support them. I have patches that expose the counters through sysfs,
but those have not yet been sent upstream. They are available in the
latest OFED-1.5.2 builds.

On Thu, Jun 3, 2010 at 4:33 PM, Or Gerlitz ogerl...@voltaire.com wrote:
 Moni Shoua wrote:

 Did you try OFED-1.5.1 or even better, OFED-1.5.2? I know patches for
 counters with RoCEE were submitted since OFED-1.5 and I saw it working

 Mony, I'm not using ofed, sorry... I am interested in a clarification in the
 context of the upstream submission, e.g does the problem exist in the latest
 patch-set, is there a bz case tracking this, etc.

 Or.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] opensm: Add a rate based mechanism for SMP transactions

2010-06-07 Thread Hal Rosenstock

In order to better handle non responsive SMAs (when link is physically up
but the SMA does not respond), a rate based mechanism for SMPs is added
to better enable forward progress in a more timely fashion. So rather than
wait for timeouts and outstanding wire SMPs to drop below some configured
value, there is also a periodic rate for transaction based SMPs. These
rate based SMPs are capped at a configured maximum value.

Two new options are added for this:
rate_based_smp_usecs indicates the number of microseconds between rate
based SMPs.
max_rate_based_smps indicates the maximum number of rate based SMPs
supported. When this limit is reached, rate based SMPs are no longer
sent (until the number of outstanding ones drops below this limit).

The rate based SMP mechanism can be disabled by setting rate_based_smp_usecs
to 0. This is equivalent to the (current) algorithm prior to this change.

By default, this mechanism is disabled.

Also, the maximum max_wire_smps is reduced to 0x3FFF from 0x7FFF so
the sum of max_wire_smps and max_rate_based_smps does not wrap.

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
Changes from v1:
Algorithm change is isolated to vl15_poller rather than involving
the vendor layer.
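
As a rough illustration of the description above (not the code from this
patch), the send decision could be sketched as follows. The struct, the enum
and smp_gate_next() are hypothetical names; only max_wire_smps,
max_rate_based_smps and rate_based_smp_usecs come from the patch, and the way
outstanding SMPs are counted here is an assumption about vl15_poller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical poller state. The three option fields mirror the patch;
 * the counters and timestamp are assumptions made for this sketch. */
struct smp_gate {
	uint32_t max_wire_smps;        /* cap on ordinary outstanding SMPs */
	uint32_t max_rate_based_smps;  /* cap on outstanding rate based SMPs */
	uint32_t rate_based_smp_usecs; /* 0 disables the rate based path */
	uint32_t wire_outstanding;
	uint32_t rate_outstanding;
	uint64_t last_rate_send_usecs;
};

enum smp_send { SMP_HOLD, SMP_SEND_WIRE, SMP_SEND_RATE };

/* Decide whether the next queued SMP may go out at time now_usecs. */
static enum smp_send smp_gate_next(struct smp_gate *g, uint64_t now_usecs)
{
	/* The patch lowers the ceiling so that the two caps cannot wrap. */
	assert(g->max_wire_smps <= 0x3FFF);

	/* Existing behaviour: send while below the on-wire limit. */
	if (g->wire_outstanding < g->max_wire_smps) {
		g->wire_outstanding++;
		return SMP_SEND_WIRE;
	}
	/* Rate based path disabled: same as the algorithm before the patch. */
	if (g->rate_based_smp_usecs == 0)
		return SMP_HOLD;
	/* Rate based SMPs are capped as well. */
	if (g->rate_outstanding >= g->max_rate_based_smps)
		return SMP_HOLD;
	/* Send one extra SMP only once the configured interval has elapsed. */
	if (now_usecs - g->last_rate_send_usecs < g->rate_based_smp_usecs)
		return SMP_HOLD;
	g->rate_outstanding++;
	g->last_rate_send_usecs = now_usecs;
	return SMP_SEND_RATE;
}
```

In this sketch a caller would decrement the matching counter when a response
or timeout arrives; that bookkeeping is left out to keep the example short.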

diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h
index e0d6c66..16241ec 100644
--- a/opensm/include/opensm/osm_base.h
+++ b/opensm/include/opensm/osm_base.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
  *
@@ -440,6 +440,30 @@ BEGIN_C_DECLS
 */
 #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
 /***/
+/****d* OpenSM: Base/OSM_DEFAULT_SMP_RATE_MAX
+* NAME
+*  OSM_DEFAULT_SMP_RATE_MAX
+*
+* DESCRIPTION
+*  Specifies the default maximum number of outstanding rate based SMPs.
+*
+* SYNOPSIS
+*/
+#define OSM_DEFAULT_SMP_RATE_MAX 100
+/***/
+/****d* OpenSM: Base/OSM_DEFAULT_SMP_RATE
+* NAME
+*  OSM_DEFAULT_SMP_RATE
+*
+* DESCRIPTION
+*  Specifies the default rate (in usec) for rate based SMPs.
+*  The default rate is 0. A value of 0 (or EVENT_NO_TIMEOUT)
+*  disables the rate based SMP mechanism.
+*
+* SYNOPSIS
+*/
+#define OSM_DEFAULT_SMP_RATE 0
+/***/
 /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
 * NAME
 *  OSM_SM_DEFAULT_QP0_RCV_SIZE
diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 4e8c862..49acf2d 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
@@ -149,6 +149,8 @@ typedef struct osm_subn_opt {
ib_net16_t m_key_lease_period;
uint32_t sweep_interval;
uint32_t max_wire_smps;
+   uint32_t max_rate_based_smps;
+   uint32_t rate_based_smp_usecs;
uint32_t transaction_timeout;
uint32_t transaction_retries;
uint8_t sm_priority;
@@ -264,6 +266,14 @@ typedef struct osm_subn_opt {
 *  max_wire_smps
 *  The maximum number of SMPs sent in parallel.  Default is 4.
 *
+*  max_rate_based_smps
+*  The maximum number of rate based SMPs allowed to be outstanding.
+*  Default is 1000.
+*
+*  rate_based_smp_usecs
+*  The wait time in usec for rate based SMPs.  Default is 0
+*  (disabled). 
+*
 *  transaction_timeout
 *  The maximum time in milliseconds allowed for a transaction
 *  to complete.  Default is 200.
diff --git a/opensm/include/opensm/osm_vl15intf.h b/opensm/include/opensm/osm_vl15intf.h
index 15ed56c..887733f 100644
--- a/opensm/include/opensm/osm_vl15intf.h
+++ b/opensm/include/opensm/osm_vl15intf.h
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2010 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -117,6 +117,8 @@ typedef struct osm_vl15 {
osm_thread_state_t thread_state;
osm_vl15_state_t state;
uint32_t max_wire_smps;
+   uint32_t max_rate_based_smps;
+   uint32_t rate_based_smp_usecs;
cl_event_t signal;

Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-07 Thread Or Gerlitz
Eli Cohen wrote:
 counter should work as regular in upstream kernel patches for IB link layer. 

Okay, good. Can you validate that? Basically, I can set aside some time to
clone Roland's tree and use the iboe branch as a basis for testing that the IB
stack is alive and kicking as it was before the patches. I just need an updated
copy of the rest of the patch set (Roland has three patches so far) to that
end. During the review there were a bunch of comments but no new posting; how
are you planning to proceed with the review/merge process?

 for IBoE, they will not work since the SMA does not support them
 I have patches that allow to show counters using sysfs 

I don't follow. The counters are read using a MAD sent to the firmware PMA
(QP1); this applies to both perfquery and sysfs, doesn't it?

Or.


Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-07 Thread Or Gerlitz

Eli Cohen wrote:

 Why are you asking me to validate that? Did you actually encounter a problem
 with this?

Yes, I did. It didn't work with some OFED drop I was using. Anyway, as I
said, I can do some validation that IBoE doesn't break upstream IB; I just
need the patches to that end, so once they are available, I will give
them a try over 2.6.35-rcX.

 The counters patches will divert the code: for iboe it will not issue a MAD to
 the firmware. It will use another command.

Can you be more specific about the origin of this new design? Is it a
HW limitation, a firmware limitation, or something else? If it's not a
hardware limitation, I don't think we need to go for a non-MAD-based
scheme, at least not for mlx4.


Or.



RE: [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
Roland Dreier said:

 I don't have a strong opinion on this but it seems a bit odd.  If we're just 
 going to drop the response anyway, why did the SA send it in the first place? 
  On the other hand, if the SA told us it's busy, it does seem we could do 
 something more sensible than retrying immediately.

The spec provides for the SA to return a BUSY response. When that happens, this 
patch causes us to wait for the original request to time out before retrying, 
not trying again immediately. In effect, we are pretending we never got the 
BUSY response and allowing the request to time out, instead.
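
To make the intended behaviour concrete, here is a minimal sketch of that
policy. It is not the patch itself: the struct, the enums, on_response() and
on_timeout() are hypothetical names, the status is assumed to be already
converted to host byte order, and the numeric value given for
IB_MGMT_MAD_STATUS_BUSY is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for this sketch; BUSY is taken to be the low status bit. */
#define IB_MGMT_MAD_STATUS_BUSY 0x0001

/* Hypothetical per-request state kept by the sender. */
struct pending_req {
	int retries_left;
};

enum resp_action { REQ_COMPLETE, REQ_WAIT_FOR_TIMEOUT };
enum timeout_action { REQ_RETRY, REQ_FAIL };

/* A BUSY response is dropped: the request stays pending, so the normal
 * timeout timer keeps running as if nothing had been received. */
static enum resp_action on_response(uint16_t status_host_order)
{
	if (status_host_order & IB_MGMT_MAD_STATUS_BUSY)
		return REQ_WAIT_FOR_TIMEOUT;
	return REQ_COMPLETE;
}

/* The existing timeout path then decides between a retry and a failure,
 * which is what places a full timeout period between attempts. */
static enum timeout_action on_timeout(struct pending_req *req)
{
	assert(req->retries_left >= 0);
	if (req->retries_left == 0)
		return REQ_FAIL;
	req->retries_left--;
	return REQ_RETRY;
}
```

With this shape, a caller that asked for N retries sees at most N resends
spaced by its own timeout, whether the SA stayed silent or answered BUSY.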

Roland Dreier said:

 The indentation of values seems pretty crazy here.  Also I'm not sure what 
 most of these defines are for?  They seem unused in this patch.

The indentation is probably from the conversion of tabs to spaces when the 
patch was pasted into the email - correcting it is no problem.  The value 
IB_MGMT_MAD_STATUS_BUSY is used in the patch, the others are defined because 
they are the other possible values for the same status field. We might as well 
define them all, for completeness.



RE: [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
Sean said:
 I don't object to the concept of treating a busy response as a timeout, but 
 how does this help prevent overwhelming the SA?  It continues to retry the 
 queries, even if the SA says that it's too busy to respond without adjusting 
 the timeout specified by the user.  I would think that you'd at least want to 
 adjust the timeout (double it or use some random backoff).


Well, the current behavior is to simply return the BUSY to the client or ULP,
which is either treated as a permanent error or causes an immediate retry.
This can be a big problem with, for example, ipoib, which sets retries to 15
and (as I understand it) immediately retries to connect when getting an error
response from the SA. Other ULPs have similar settings. Without some kind of
delay, starting up ipoib on a large fabric (at boot time, for example) can
cause a real packet storm.

By treating BUSY replies identically to timeouts, this patch at least 
introduces a delay between attempts. In the case of the ULPs, the delay is 
typically 4 seconds.

Sean said:
 The general guideline that we've been using for adjusting timeouts has been 
 to report the failures and let the caller make the necessary adjustments.  
 As far as I know, the only way for user space applications to query the SA 
 are through the librdmacm, which sets retries to 0, or through the libibumad 
 interface directly.  I would expect any application using the latter to be 
 intelligent enough to handle a busy response.


And this approach encourages applications to adjust their timeouts 
appropriately by treating BUSY responses as non-events and forcing the 
applications to wait for their request to time out.

Depending on the application developers to take BUSY responses into account 
seems to be asking for trouble - it allows one rogue app to bring the SA to its 
knees, for example. By enforcing this timeout model in the kernel, we guarantee 
that there will be at least some delay between each message when the SA is 
reporting a busy status. And as I previously mentioned this patch also affects 
kernel code, much of which does use retries.

Sean said:
 Maybe we should re-think that guideline and allow users to simply indicate 
 that the MAD layer should use reasonable defaults.  This would enable the 
 ib_mad module to adjust the timeout values for all consumers based on actual 
 destination response times.  It could also back off retrying multiple 
 requests that were initiated around the same time, instead only retrying the 
 first request, while simply increasing the timeout values for the others.  
 This is more complex, but we should be able to start with something fairly 
 simple.

It's an interesting idea, but in the meantime this is a problem that affects 
large clusters today.


Re: [Patch] infiniband: check local reserved ports

2010-06-07 Thread Roland Dreier
  So this patch looks good for you? :)

Yes, will queue it up, thanks.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


RE: [ANNOUNCE] librdmacm 1.0.12: release notes text

2010-06-07 Thread Andrea Gozzelino
On Jun 01, 2010 05:18 PM, Hefty, Sean sean.he...@intel.com wrote:

  gcc -DHAVE_CONFIG_H -I. -I. -I. -I./include -g -Wall -D_GNU_SOURCE -O2 -g
  -pipe -m64 -MT src_librdmacm_la-acm.lo -MD -MP -MF
  .deps/src_librdmacm_la-acm.Tpo -c src/acm.c  -fPIC -DPIC -o
  .libs/src_librdmacm_la-acm.o
  In file included from src/acm.c:44:
  ./include/infiniband/ib.h:49: error: syntax error before __be16
 
 At least the following definitions are missing from types.h in redhat
 4.x, which is based on 2.6.9. (These are from RH 5.x.)
 
 #ifdef __CHECKER__
 #define __bitwise__ __attribute__((bitwise))
 #else
 #define __bitwise__
 #endif
 #ifdef __CHECK_ENDIAN__
 #define __bitwise __bitwise__
 #else
 #define __bitwise
 #endif
 
 typedef __u16 __bitwise __le16;
 typedef __u16 __bitwise __be16;
 typedef __u32 __bitwise __le32;
 typedef __u32 __bitwise __be32;
 #if defined(__GNUC__) && !defined(__STRICT_ANSI__)
 typedef __u64 __bitwise __le64;
 typedef __u64 __bitwise __be64;
 #endif
 
 Has OFED handled a similar problem in the past? If so, how? (I can
 think of a couple ways to deal with this.)
 
 - Sean
 

Hi Sean,

I'm sorry, but I have no information about the release notes text for
librdmacm version 1.0.12.
I may have missed the announcement email.
Could you please send me the release notes, if they are ready?

Thank you very much.
Regards,
Andrea

Andrea Gozzelino

INFN - Laboratori Nazionali di Legnaro  (LNL)
Viale dell'Universita' 2
I-35020 - Legnaro (PD)- ITALIA
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it  



RE: Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
Hal said:
Should a busy be retried at all at the MAD layer? Is a special (longer)
timeout policy for busy needed?

Also, should this be done for all MADs classified by ib_response_mad (e.g. trap 
represses) ?

Hal, 

The idea of processing BUSY responses in the MAD layer is to treat BUSY responses 
like timeouts - which are currently handled by the MAD layer. Right now there 
is an issue where various apps and ULPs either treat BUSY as a cause to 
immediately retry or as a permanent error. This doesn't seem to affect users of 
the OpenSM so much because (as I understand it) the OpenSM seems to discard 
requests when it gets too busy - but for other SA/SMs, it can cause a major 
packet storm or, worse, a simple loss of connectivity where MPI jobs or kernel 
ULPs simply assume the SA is broken because they got a BUSY reply.

By treating the BUSY reply as a timeout, we're actually simplifying matters by 
fitting into existing practice.

As for needing a longer timeout - in our old proprietary stack, QLogic did have 
a longer timeout for retrying busy replies than for normal timeouts - but we 
should try to get this in now so we can get some relief before we begin the 
long term discussion of the best way to handle this issue overall.



RE: [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz
 But, I also agree with Roland.. having the SA return busy when it is
 under load seems insane :) 

In that case, what is the purpose of the BUSY response? 

-----Original Message-----
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] 
Sent: Friday, June 04, 2010 6:58 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-rdma@vger.kernel.org; e...@openfabrics.org
Subject: Re: [PATCH] Handling busy responses from the SA

On Fri, Jun 04, 2010 at 02:05:10PM -0700, Hefty, Sean wrote:

 Maybe we should re-think that guideline and allow users to simply
 indicate that the MAD layer should use reasonable defaults.  This
 would enable the ib_mad module to adjust the timeout values for all
 consumers based on actual destination response times.  It could also
 back off retrying multiple requests that were initiated around the
 same time, instead only retrying the first request, while simply
 increasing the timeout values for the others.  This is more complex,
 but we should be able to start with something fairly simple.

A common method for handling this sort of thing is to randomize
the retry timeout. It would be a good idea to randomize all timeouts,
but the BUSY replies should probably randomize over a longer time
period.

Randomization prevents nodes in the cluster from self-synchronizing
and making the load on the SA worse.

But, I also agree with Roland.. having the SA return busy when it is
under load seems insane :) But if you really want to do this then I
think a different, larger, timeout should be used than the standard
mad timeout.
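
The randomization suggested here could be sketched roughly as below. This is
only an illustration: retry_delay_ms() is a hypothetical helper, the factor of
4 for BUSY replies is an arbitrary assumption, and the caller is assumed to
supply the random value (for example from rand()) so the function stays
deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the delay before the next retry. The delay is drawn uniformly from
 * [base, 2 * base], where base is the normal timeout, or a larger value
 * when the previous reply was BUSY. rnd is any random 32-bit number. */
static uint32_t retry_delay_ms(int was_busy, uint32_t timeout_ms, uint32_t rnd)
{
	uint32_t base;
	uint32_t spread;

	/* Keep 2 * 4 * timeout_ms within 32 bits. */
	assert(timeout_ms <= UINT32_MAX / 8);

	/* Assumption: BUSY replies back off over a window four times longer. */
	base = was_busy ? 4 * timeout_ms : timeout_ms;
	spread = base;

	/* Random jitter keeps nodes from retrying in lockstep. */
	return base + rnd % (spread + 1);
}
```

Because each node draws its own jitter, retries that started together spread
out over the window instead of arriving at the SA at the same instant.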

Also, I guess, it would be a good API choice if the caller could say
'get me a reply for this mad or error within 60s' rather than specify
details like retry counts, etc. The timeout values should be globally
set and derived from the usual SA provided data for network transits...

Jason


RE: [PATCH v2] allow passthrough of rmpp packets to user mad clients

2010-06-07 Thread Mike Heinz
We also have some management applications that need these capabilities.
For those applications, application-level RMPP control allows the
application to pace the RMPP transactions, permits some parts of the RMPP
response to be built on the fly, and permits a degree of sharing of the
response data between multiple requestors.

-----Original Message-----
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Friday, June 04, 2010 3:54 PM
To: Mike Heinz
Cc: linux-rdma@vger.kernel.org; Hal Rosenstock; Hefty, Sean
Subject: Re: [PATCH v2] allow passthrough of rmpp packets to user mad clients

  This patch changes this behavior so that rmpp_version of 255 causes incoming 
  rmpp packets to be passed through without alteration, instead.
  
  There are IB users who have requested the ability to perform RMPP 
  transaction handling in user space.  This was an option in old proprietary 
  stacks and this is useful to migrate old applications to OFED while 
  containing the scope of their application changes.  

I'm a little dubious about this.  We have an RMPP implementation in the
kernel, and it seems worthwhile to focus on stability and features
there.  Allowing alternate RMPP implementations in userspace seems a bit
iffy -- we don't have a socket option that lets us do TCP in userspace
for a given connection, for example.
-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: InfiniBand/RDMA merge plans for 2.6.35

2010-06-07 Thread Andy Grover

On 05/17/2010 09:36 AM, Roland Dreier wrote:

Since 2.6.34 is here, it's probably a good time to talk about 2.6.35
merge plans.  All the pending things that I'm aware of are listed below.

Boilerplate:

If something isn't already in my tree and it isn't listed below, I
probably missed it or dropped it unintentionally.  Please remind me.


Hi Roland,

Could you please also pick up the 3-patch set "Least attached vector support"
posted by Yevgeny on 2010-05-13? The RDS changes depend on it.


Thanks -- Regards -- Andy


Re: [Patch] infiniband: check local reserved ports

2010-06-07 Thread Cong Wang

On 06/07/10 23:45, Roland Dreier wrote:

So this patch looks good for you? :)

Yes, will queue it up, thanks.


Thanks!
