[PATCH] MAINTAINERS: update NetEffect entry

2010-11-02 Thread Chien Tung

Correct web link as www.neteffect.com is no longer valid.
Remove Chien Tung as maintainer.  I am moving on to other
responsibilities at Intel.  Thanks for all the fish.

Signed-off-by: Chien Tung <chien.tin.t...@intel.com>
---
 MAINTAINERS |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index debde01..e067aa9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4043,9 +4043,8 @@ F: drivers/scsi/NCR_D700.*
 
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M: Faisal Latif <faisal.la...@intel.com>
-M: Chien Tung <chien.tin.t...@intel.com>
 L: linux-rdma@vger.kernel.org
-W: http://www.neteffect.com
+W: http://www.intel.com/Products/Server/Adapters/Server-Cluster/Server-Cluster-overview.htm
 S: Supported
 F: drivers/infiniband/hw/nes/
 
-- 
1.6.4.2



[PATCH] opensm/osm_qos.c: Make offset of VL in VLarb block element match IBA spec

2010-11-02 Thread Jim Schutt
According to IBA 1.2.1, Table 152, page 845, the VL in a VLArbitration Table
Block Element is 4 bits long, starting at offset 4 in the 16-bit
Block Element.

Currently, the data being sent to the switches has the VL starting at
offset 0 in the 16-bit Block Element.

Fix things up to match the spec.
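
To make the effect concrete (an illustration only, not part of the patch):
for a configured VL of 3, the old and new encodings of the VL byte differ
as follows.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned val = 3;
	uint8_t vl_before = val % 15;        /* 0x03: VL at offset 0 */
	uint8_t vl_after  = (val % 15) << 4; /* 0x30: VL at offset 4, per IBA Table 152 */

	printf("before: 0x%02x  after: 0x%02x\n", vl_before, vl_after);
	return 0;
}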

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---
 opensm/opensm/osm_qos.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index c90073e..cc38151 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -365,7 +365,7 @@ static int parse_vlarb_entry(char *str, ib_vl_arb_element_t * e)
 	unsigned val;
 	char *p = str;
 	p += parse_one_unsigned(p, ':', &val);
-	e->vl = val % 15;
+	e->vl = (val % 15) << 4;
 	p += parse_one_unsigned(p, ',', &val);
 	e->weight = (uint8_t) val;
 	return (int)(p - str);
-- 
1.6.2.2




[PATCH v2] Add exponential backoff + random delay to MADs when retrying after timeout.

2010-11-02 Thread Mike Heinz
Hal,

At the bottom of this is a slight rewrite of my previous email (and a tweak to
the patch) to address your concerns and make things clearer. Other items are
answered inline.

> What experience/confidence is there in this (specific) randomization
> policy? On what (how large) IB cluster sizes has this policy been tried?
> Is this specific policy modeled from other policies in use elsewhere?

To explicitly discuss this: the old Infinicon stack added 1 second on each
successive retry, but didn't randomize. I modeled this algorithm on the
Ethernet backoff scheme, but chose the terms to be on the same order of
magnitude as we typically use for MAD timeouts. I can't claim any special
experience showing this particular policy is best, except to say that the
principles are sound.

> Also, is this randomized timeout used on RMPP packets if this parameter
> is not 0?

If the module parameter is non-zero then yes, it will coerce all timeouts for
all MAD requests to randomize. Keep in mind that this code doesn't change how
packets are processed when they time out; it just changes how the timeout is
calculated.

>> Finally, I've added a module parameter to coerce all mad work requests to
>> use this feature if desired.
>
> On one hand, I don't want to introduce unneeded parameters/complexity,
> but I'm wondering whether more granularity is useful on which requests
> (classes?) this applies to. For example, should SM requests be
> randomized? This feature is primarily an SA thing, although busy can be
> used for other management classes; its use is mainly GS related.

First, I think we should separate this from the BUSY handling issue - not 
because they aren't connected but because every time I start focusing on these 
things I promptly get yanked onto something else. Hopefully we can focus on 
just the randomization aspect and bring it to a satisfactory agreement first, 
then I'll re-submit the BUSY handling patch based on that. 

That said, there's been some argument over whether the best place for choosing 
the retry policy is in ib_mad or in the individual ulps and apps. The intent of 
the module parameter is to provide relief on larger clusters while waiting for 
the authors of other components to modify their models. I do also think 
randomizing on retry is just as applicable for SM requests as for SA - if 
requests are timing out, then the SA/SM is getting overloaded, regardless of 
the type of request.


-
Design notes:

This patch builds upon a discussion we had earlier this year on adding a
backoff function when retrying MAD sends after a timeout.

The current behavior is to retry MAD requests at a fixed interval, specified by
the caller, and no more than the number of times specified by the caller.

The problem with this approach is that if the same application or ulp is
installed on many hundreds (or thousands) of nodes, all using the same retry
interval, they could all end up retrying at roughly the same time, causing
repeatable packet storms. On a large cluster, these storms can effectively act
as a denial of service attack. To get around this, the retry timer should have
a randomization component of a similar order of magnitude as the retries
themselves. Since retries are usually on the order of one second, the patch
defines the randomization component as between zero and roughly 1/2 second
(511 ms), although the upper limit can be tuned by changing a #define.

The other standard method for preventing storms of retries is to implement an
exponential backoff, such as is used in the Ethernet protocol. However, because
the user has also explicitly specified a timeout value, I chose to treat
that value as a minimum delay, then I add an exponential value on top of that,
defined as BASE*2^c, where 'c' is the number of retries already attempted,
minus 1.

Currently, the base value is defined as 511 ms (1/2 second), so that the
retry interval is defined as:

(minimum timeout) + 511*2^c ms - (random value between 0 and 511 ms)

This causes the following retry times:

0:  minimum timeout
1:  minimum timeout + (random value between 0 and 511)
2:  minimum timeout + 1 second - (random value between 0 and 511)
3:  minimum timeout + 2 seconds - (random value between 0 and 511)
4:  minimum timeout + 4 seconds - (random value between 0 and 511)
.
.
.
c:  minimum timeout + (1/2 second)*2^(c-1) - (random value between 0 and 511)

(For comparison, the old Silverstorm/Infinicon stack waited 1 second *
the number of retries.)
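
As a concrete sketch of the above (illustration only, not the patch code;
the function name and the use of user-space rand() are mine):

#include <stdlib.h>

#define MAD_BACKOFF_BASE_MS 511	/* 1/2 second; also the jitter upper bound */

/*
 * 'retries' is the number of retries already attempted (>= 1).
 * Returns the delay before the next retry, in milliseconds.
 */
static unsigned long retry_interval_ms(unsigned long min_timeout_ms,
				       unsigned int retries)
{
	unsigned long backoff = (unsigned long)MAD_BACKOFF_BASE_MS << (retries - 1);
	unsigned long jitter = (unsigned long)rand() % (MAD_BACKOFF_BASE_MS + 1);

	return min_timeout_ms + backoff - jitter; /* min + 511*2^(c-1) - [0..511] */
}

The first send (retry 0) still uses just the caller's minimum timeout.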



Implementation:

This patch does NOT implement the ABI/API changes that would be needed to take
advantage of the new features, but it lays the groundwork for doing so. In
addition, it provides a new module parameter that allows the administrator to
coerce existing code into using the new capability:

parm: randomized_wait: When true, use a randomized backoff algorithm to control
retries for timeouts. (int)

Note that this parameter will not force 

RE: [PATCH v2] Add exponential backoff + random delay to MADs when retrying after timeout.

2010-11-02 Thread Hefty, Sean
> The problem with this approach is that if the same application or ulp is
> installed on many hundreds (or thousands) of nodes, all using the same
> retry interval, they could all end up retrying at roughly the same time,
> causing repeatable packet storms. On a large cluster, these storms can
> effectively act as a denial of service attack. To get around this, the
> retry timer should have a randomization component of a similar order of
> magnitude as the retries themselves. Since retries are usually on the
> order of one second, the patch defines the randomization component as
> between zero and roughly 1/2 second (511 ms), although the upper limit
> can be tuned by changing a #define.
>
> The other standard method for preventing storms of retries is to
> implement an exponential backoff, such as is used in the Ethernet
> protocol. However, because the user has also explicitly specified a
> timeout value, I chose to treat that value as a minimum delay, then I
> add an exponential value on top of that, defined as BASE*2^c, where 'c'
> is the number of retries already attempted, minus 1.
>
> Currently, the base value is defined as 511 ms (1/2 second), so that the
> retry interval is defined as:
>
> (minimum timeout) + 511*2^c ms - (random value between 0 and 511 ms)
>
> This causes the following retry times:
>
> 0:  minimum timeout
> 1:  minimum timeout + (random value between 0 and 511)
> 2:  minimum timeout + 1 second - (random value between 0 and 511)
> 3:  minimum timeout + 2 seconds - (random value between 0 and 511)
> 4:  minimum timeout + 4 seconds - (random value between 0 and 511)

When you consider RMPP, the timeout/retry values specified by the user are not
straightforward in their meaning.  I haven't looked at this patch in detail yet,
but how do the timeout changes work with RMPP MADs?  Is the timeout reset to
the minimum after an ACK is received?

My personal preference at this time is to push more intelligence into the 
timeout/retry algorithm used by the MAD layer, but restricted to SA clients.  
I'd like to see even more randomization in the retry time, coupled with a 
TCP-like congestion windowing implementation when issuing SA queries.

For example: Never allow more than, say, 8 SA queries outstanding at a time.
If an SA query times out, reduce the number of outstanding queries to 1 until
we get a response, then double the number of queries allowed to be outstanding
until we reach the max.  Have the MAD layer calculate the SA query timeout
based on the actual SA response time, with randomization based on that.  The
user-specified timeout value can basically be ignored.
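
A minimal sketch of what that windowing could look like (illustrative only;
these names and this structure are not an existing MAD-layer API):

#define SA_MAX_OUTSTANDING 8	/* the cap suggested above */

struct sa_query_window {
	unsigned int limit;		/* queries currently allowed in flight */
	unsigned int outstanding;	/* queries in flight right now */
};

/* A timeout collapses the window to 1, as in TCP congestion control. */
static void sa_window_timeout(struct sa_query_window *w)
{
	w->limit = 1;
}

/* Each response doubles the window until the cap is reached again. */
static void sa_window_response(struct sa_query_window *w)
{
	w->limit *= 2;
	if (w->limit > SA_MAX_OUTSTANDING)
		w->limit = SA_MAX_OUTSTANDING;
}

/* Gate new queries on the current window. */
static int sa_window_can_send(const struct sa_query_window *w)
{
	return w->outstanding < w->limit;
}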

The only reason I'm suggesting we restrict the algorithm to SA queries is to
avoid storing per-endpoint information.  That may be better handled by the CM
(since CM responses are sends).

Given all this, I think it would be okay to accept the patch to drop busy
responses from the SA until this framework is in place, which wouldn't be until
2.6.38 or 39.

- Sean


ibv_cmd_create_cq failed with ret=c

2010-11-02 Thread Guanglei Li
Hi,
  I ran into an error when trying to run a job on my cluster. I am not sure
if this mailing list is the right place to ask for help.
  My cluster has 248 nodes. Each node is Power7 + RedHat 6, with 32GB of
memory. When I run 13000 tasks across the cluster, I see the following
errors:

PID5b42 ehca0 EHCA_ERR:ehcau_create_cq ibv_cmd_create_cq() failed
ret=c context=0x1001f252ef0 cqe=80
PID5b42 ehca0 EHCA_ERR:ehcau_create_cq An error has occured
context=0x1001f252ef0 cqe=80

  'ret=c' should correspond to ENOMEM. But while the job is running, I
found the free memory is around 24GB on each node.

  I found that if the total task number is <= 11000, i.e. ~40 tasks per
node, it could succeed.
  Could someone give me a hint about the possible reason?

  The /etc/security/limits.conf is:
*    soft    core       unlimited
*    hard    core       unlimited
*    soft    memlock    unlimited
*    hard    memlock    unlimited
*    hard    nofile     65535
*    soft    nofile     65535
*    hard    stack      16000
*    soft    stack      16000
*    soft    nproc      65535
*    hard    nproc      65535
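
  If it helps, here is a small libibverbs program I can run to dump the
limits the HCA itself reports (a minimal sketch, error handling trimmed;
built with 'gcc -o devlimits devlimits.c -libverbs'). I am wondering whether
the failure reflects adapter resource limits rather than host memory.

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
	int num;
	struct ibv_device **list = ibv_get_device_list(&num);
	struct ibv_context *ctx;
	struct ibv_device_attr attr;

	if (!list || num == 0) {
		fprintf(stderr, "no RDMA devices found\n");
		return 1;
	}
	ctx = ibv_open_device(list[0]);
	if (ctx && !ibv_query_device(ctx, &attr))
		printf("max_cq=%d max_cqe=%d max_qp=%d max_mr=%d\n",
		       attr.max_cq, attr.max_cqe, attr.max_qp, attr.max_mr);
	if (ctx)
		ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}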

  Thanks in advance for your help.