[PATCH] MAINTAINERS: update NetEffect entry
Correct web link as www.neteffect.com is no longer valid. Remove Chien Tung
as maintainer. I am moving on to other responsibilities at Intel. Thanks
for all the fish.

Signed-off-by: Chien Tung <chien.tin.t...@intel.com>
---
 MAINTAINERS | 3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index debde01..e067aa9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4043,9 +4043,8 @@ F:	drivers/scsi/NCR_D700.*
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M:	Faisal Latif <faisal.la...@intel.com>
-M:	Chien Tung <chien.tin.t...@intel.com>
 L:	linux-rdma@vger.kernel.org
-W:	http://www.neteffect.com
+W:	http://www.intel.com/Products/Server/Adapters/Server-Cluster/Server-Cluster-overview.htm
 S:	Supported
 F:	drivers/infiniband/hw/nes/
-- 
1.6.4.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] opensm/osm_qos.c: Make offset of VL in VLarb block element match IBA spec
According to IBA 1.2.1, Table 152, page 845, the VL in a VLArbitration
Table Block Element has length 4 bits, starting at offset 4 in the 16-bit
Block Element. Currently, the data being sent to the switches has the VL
starting at offset 0 in the 16-bit Block Element. Fix things up to match
the spec.

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---
 opensm/opensm/osm_qos.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c
index c90073e..cc38151 100644
--- a/opensm/opensm/osm_qos.c
+++ b/opensm/opensm/osm_qos.c
@@ -365,7 +365,7 @@ static int parse_vlarb_entry(char *str, ib_vl_arb_element_t * e)
 	unsigned val;
 	char *p = str;
 	p += parse_one_unsigned(p, ':', &val);
-	e->vl = val % 15;
+	e->vl = (val % 15) << 4;
 	p += parse_one_unsigned(p, ',', &val);
 	e->weight = (uint8_t) val;
 	return (int)(p - str);
-- 
1.6.2.2
[PATCH v2] Add exponential backoff + random delay to MADs when retrying after timeout.
Hal,

At the bottom of this is a slight rewrite of my previous email (and a tweak
to the patch) to address your concerns and to make things more clear. Other
items are answered inline.

> What experience/confidence is there in this (specific) randomization
> policy? On what (how large) IB cluster sizes has this policy been tried?
> Is this specific policy modeled from other policies in use elsewhere?

To explicitly discuss this: the old Infinicon stack added 1 second on each
successive retry, but didn't randomize. I modeled this algorithm after the
Ethernet model, but I chose the terms to be on the same order of magnitude
as we typically use for MAD timeouts. I can't claim to have any special
experience showing this particular policy is best, except to say that the
principles are sound.

> Also, is this randomized timeout used on RMPP packets if this parameter
> is not 0?

If the module parameter is non-zero then yes, it will coerce all timeouts
for all MAD requests to randomize. Keep in mind that this code doesn't
change how packets are processed when they time out; it just changes how
the timeout is calculated.

> > Finally, I've added a module parameter to coerce all mad work requests
> > to use this feature if desired.
>
> On one hand, I don't want to introduce unneeded parameters/complexity
> but I'm wondering whether more granularity is useful on which requests
> (classes?) this applies to. For example, should SM requests be
> randomized?

This feature is primarily an SA thing; although BUSY can be used for other
management classes, its use is mainly GS related.

First, I think we should separate this from the BUSY handling issue - not
because they aren't connected, but because every time I start focusing on
these things I promptly get yanked onto something else. Hopefully we can
focus on just the randomization aspect and bring it to a satisfactory
agreement first, then I'll re-submit the BUSY handling patch based on that.
That said, there's been some argument over whether the best place for
choosing the retry policy is in ib_mad or in the individual ULPs and apps.
The intent of the module parameter is to provide relief on larger clusters
while waiting for the authors of other components to modify their models.
I also think randomizing on retry is just as applicable for SM requests as
for SA - if requests are timing out, then the SA/SM is getting overloaded,
regardless of the type of request.

Design notes:

This patch builds upon a discussion we had earlier this year about adding
a backoff function when retrying MAD sends after a timeout. The current
behavior is to retry MAD requests at a fixed interval, specified by the
caller, and no more than the number of times specified by the caller.

The problem with this approach is that if the same application or ULP is
installed on many hundreds (or thousands) of nodes, all using the same
retry interval, they could all end up retrying at roughly the same time,
causing repeatable packet storms. On a large cluster, these storms can
effectively act as a denial of service attack.

To get around this, the retry timer should have a randomization component
of a similar order of magnitude as the retries themselves. Since retries
are usually on the order of one second, the patch defines the randomization
component as between zero and roughly 1/2 second (511 ms), although the
upper limit can be tuned by changing a #define.

The other standard method for preventing storms of retries is to implement
an exponential backoff, such as is used in the Ethernet protocol. However,
because the user has also explicitly specified a timeout value, I chose to
treat that value as a minimum delay, then add an exponential value on top
of that, defined as BASE*2^c, where 'c' is the number of retries already
attempted, minus 1.
Currently, the base value is defined as 511 ms (1/2 second), so that the
retry interval is:

    (minimum timeout) + 511 * 2^(c-1) - (random value between 0 and 511)

This causes the following retry times:

    0: minimum timeout
    1: minimum timeout + (random value between 0 and 511)
    2: minimum timeout + 1 second - (random value between 0 and 511)
    3: minimum timeout + 2 seconds - (random value between 0 and 511)
    4: minimum timeout + 4 seconds - (random value between 0 and 511)
    ...
    c: minimum timeout + (1/2 second) * 2^(c-1) - (random value between 0 and 511)

(For comparison, the old Silverstorm/Infinicon stack waited 1 second * the
number of retries.)

Implementation:

This patch does NOT implement the ABI/API changes that would be needed to
take advantage of the new features, but it lays the groundwork for doing
so. In addition, it provides a new module parameter that allows the
administrator to coerce existing code into using the new capability:

    parm: randomized_wait: When true, use a randomized backoff algorithm
          to control retries for timeouts. (int)

Note that this parameter will not force
RE: [PATCH v2] Add exponential backoff + random delay to MADs when retrying after timeout.
> The problem with this approach is that if the same application or ulp is
> installed on many hundreds (or thousands) of nodes, all using the same
> retry interval, they could all end up retrying at roughly the same time,
> causing repeatable packet storms. On a large cluster, these storms can
> effectively act as a denial of service attack.
>
> To get around this, the retry timer should have a randomization component
> of a similar order of magnitude as the retries themselves. Since retries
> are usually on the order of one second, the patch defines the
> randomization component as between zero and roughly 1/2 second (511 ms),
> although the upper limit can be tuned by changing a #define.
>
> The other standard method for preventing storms of retries is to
> implement an exponential backoff, such as is used in the Ethernet
> protocol. However, because the user has also explicitly specified a
> timeout value, I chose to treat that value as a minimum delay, then add
> an exponential value on top of that, defined as BASE*2^c, where 'c' is
> the number of retries already attempted, minus 1.
>
> Currently, the base value is defined as 511 ms (1/2 second), so that the
> retry interval is:
>
>     (minimum timeout) + 511 * 2^(c-1) - (random value between 0 and 511)
>
> This causes the following retry times:
>
>     0: minimum timeout
>     1: minimum timeout + (random value between 0 and 511)
>     2: minimum timeout + 1 second - (random value between 0 and 511)
>     3: minimum timeout + 2 seconds - (random value between 0 and 511)
>     4: minimum timeout + 4 seconds - (random value between 0 and 511)

When you consider RMPP, the timeout/retry values specified by the user are
not straightforward in their meaning. I haven't looked at this patch in
detail yet, but how do the timeout changes work with RMPP MADs? Is the
timeout reset to the minimum after an ACK is received?

My personal preference at this time is to push more intelligence into the
timeout/retry algorithm used by the MAD layer, but restricted to SA
clients.
I'd like to see even more randomization in the retry time, coupled with a
TCP-like congestion windowing implementation when issuing SA queries. For
example:

- Never allow more than, say, 8 SA queries outstanding at a time.
- If an SA query times out, reduce the number of outstanding queries to 1
  until we get a response, then double the number of queries allowed to be
  outstanding until we reach the max.
- Have the mad layer calculate the SA query timeout based on the actual SA
  response time, with randomization based on that. The user-specified
  timeout value can basically be ignored.

The only reason I'm suggesting we restrict the algorithm to SA queries is
to avoid storing per-endpoint information. That may be better handled by
the CM (since CM responses are sends).

Given all this, then I think it would be okay to accept the patch to drop
busy responses from the SA until this framework is in place, which
wouldn't be until 2.6.38 or 39.

- Sean
ibv_cmd_create_cq failed with ret=c
Hi,

I met an error when trying to run a job on my cluster. I am not sure if
this mailing list is the right place to ask for help.

My cluster has 248 nodes. Each node is Power7 + RedHat 6, with 32GB memory.
When I run 13000 tasks across the cluster, I get the following errors:

    PID5b42 ehca0 EHCA_ERR:ehcau_create_cq ibv_cmd_create_cq() failed ret=c context=0x1001f252ef0 cqe=80
    PID5b42 ehca0 EHCA_ERR:ehcau_create_cq An error has occured context=0x1001f252ef0 cqe=80

'ret=c' should correspond to ENOMEM, but while the job is running the free
memory is around 24GB on each node. I found that if the total task number
is <= 11000 (i.e., ~40 tasks per node), it could succeed. Could someone
give me a hint about the possible reason?

The /etc/security/limits.conf is:

    * soft core unlimited
    * hard core unlimited
    * soft memlock unlimited
    * hard memlock unlimited
    * hard nofile 65535
    * soft nofile 65535
    * hard stack 16000
    * soft stack 16000
    * soft nproc 65535
    * hard nproc 65535

Thanks in advance for your help.