[PATCH] opensm: Fix Q_Key, TClass and limited keys parsing warnings in partitions.conf
Parsing these parameters caused an 'unrecognized mgroup flag' warning.
Also fix the man page and doc entries for the Q_Key/TClass parameters.

Signed-off-by: Alex Netes ale...@mellanox.com
---
 doc/partition-config.txt | 4 ++--
 man/opensm.8.in          | 4 ++--
 opensm/osm_prtn_config.c | 4 +++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/partition-config.txt b/doc/partition-config.txt
index f49d473..3581ef6 100644
--- a/doc/partition-config.txt
+++ b/doc/partition-config.txt
@@ -94,11 +94,11 @@ General file format:
 mgid. Furthermore specifying multiple scope settings will
 result in multiple MC groups being created.
-qkey=val - specifies the Q_Key for this MC group
+Q_Key=val - specifies the Q_Key for this MC group
 (default: 0x0b1b for IP groups, 0 for other groups)
 WARNING: changing this for the broadcast group may
 break IPoIB on client nodes!!!
-tclass=val - specifies tclass for this MC group
+TClass=val - specifies tclass for this MC group
 (default is 0)
 FlowLabel=val - specifies FlowLabel for this MC group
 (default is 0)
diff --git a/man/opensm.8.in b/man/opensm.8.in
index 37e2eee..4ab7b30 100644
--- a/man/opensm.8.in
+++ b/man/opensm.8.in
@@ -619,9 +619,9 @@ General file format:
 mgid. Furthermore specifying multiple scope settings will
 result in multiple MC groups being created.
-qkey=val - specifies the Q_Key for this MC group
+Q_Key=val - specifies the Q_Key for this MC group
 (default: 0x0b1b for IP groups, 0 for other groups)
-tclass=val - specifies tclass for this MC group
+TClass=val - specifies tclass for this MC group
 (default is 0)
 FlowLabel=val - specifies FlowLabel for this MC group
 (default is 0)
diff --git a/opensm/osm_prtn_config.c b/opensm/osm_prtn_config.c
index a3524b1..8f4a673 100644
--- a/opensm/osm_prtn_config.c
+++ b/opensm/osm_prtn_config.c
@@ -296,12 +296,14 @@ static int parse_group_flag(unsigned lineno, osm_log_t * p_log,
         else
             flags->scope_mask |= (1 << scope);
     } else if (!strncmp(flag, "Q_Key", strlen(flag))) {
+        rc = 1;
         if (!val || (flags->Q_Key = strtoul(val, NULL, 0)) == 0)
             OSM_LOG(p_log, OSM_LOG_VERBOSE,
                 "PARSE WARN: line %d: "
                 "flag \'Q_Key\' requires valid value"
                 " - using '0'\n", lineno);
     } else if (!strncmp(flag, "TClass", strlen(flag))) {
+        rc =1;
         if (!val || (flags->TClass = strtoul(val, NULL, 0)) == 0)
             OSM_LOG(p_log, OSM_LOG_VERBOSE,
                 "PARSE WARN: line %d: "
@@ -406,7 +408,7 @@ static int partition_add_port(unsigned lineno, struct part_conf *conf,
             membership = FULL;
         else if (!strncmp(flag, "both", strlen(flag)))
             membership = BOTH;
-        else if (!strncmp(flag, "limited", strlen(flag))) {
+        else if (strncmp(flag, "limited", strlen(flag))) {
             OSM_LOG(conf->p_log, OSM_LOG_VERBOSE,
                 "PARSE WARN: line %d: "
                 "unrecognized port flag \'%s\'."
--
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
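Editorial note on the osm_prtn_config.c hunk above: the commit message and the added `rc = 1` lines suggest that parse_group_flag() reports through its return value whether it recognized a flag, and that the Q_Key and TClass branches stored the value without setting that result. The following is a minimal user-space sketch of that pattern, not the opensm code itself. The names `group_flags` and `parse_flag` are hypothetical, and the matching rule (the flag is a case-sensitive prefix of the full name) is taken from the `strncmp(flag, ..., strlen(flag))` calls in the diff.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the per-group settings. */
struct group_flags {
    unsigned long q_key;
    unsigned long tclass;
};

/*
 * Returns 1 if `flag` was recognized and 0 otherwise. As in the diff,
 * a flag matches when it is a (case-sensitive) prefix of the full name.
 */
static int parse_flag(const char *flag, const char *val,
                      struct group_flags *out)
{
    size_t len = strlen(flag);
    int rc = 0;

    if (len == 0)
        return 0;

    if (!strncmp(flag, "Q_Key", len)) {
        rc = 1; /* the kind of assignment the patch adds */
        out->q_key = val ? strtoul(val, NULL, 0) : 0;
    } else if (!strncmp(flag, "TClass", len)) {
        rc = 1;
        out->tclass = val ? strtoul(val, NULL, 0) : 0;
    }
    return rc; /* 0 makes the caller warn about an unrecognized flag */
}
```

In this sketch `parse_flag("qkey", ...)` returns 0 because the comparison is case-sensitive, which is consistent with the documentation being changed from `qkey`/`tclass` to `Q_Key`/`TClass`.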
[PATCH] iw_cm: Don't touch cmid after dereferencing it.
Function cm_work_handler() cannot touch the cm_id after it derefs it,
because the cm_id might be freed by another concurrent thread. If more
work items are queued for this cm_id, then there must be more references,
because a reference is added whenever a work item is queued. So in the
while loop inside cm_work_handler(), after derefing, exit the function if
the queue is empty. Otherwise it is safe to re-acquire the lock.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/core/iwcm.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
index 0bb99bb..c47c203 100644
--- a/drivers/infiniband/core/iwcm.c
+++ b/drivers/infiniband/core/iwcm.c
@@ -878,6 +878,8 @@ static void cm_work_handler(struct work_struct *_work)
             }
             return;
         }
+        if (empty)
+            return;
         spin_lock_irqsave(&cm_id_priv->lock, flags);
     }
     spin_unlock_irqrestore(&cm_id_priv->lock, flags);
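Editorial note: the rule described in this patch (every queued work item holds a reference, so after dropping a reference the handler may only touch the object again if it saw more work queued) can be modelled in user space. The sketch below is single-threaded and uses hypothetical names (`conn`, `conn_queue_work`, `conn_put`, `conn_work_handler`); it only illustrates the control flow, with comments marking where the kernel code would hold the spinlock.

```c
#include <stdbool.h>
#include <stdlib.h>

struct work_item {
    struct work_item *next;
};

/* Hypothetical stand-in for the cm_id private structure. */
struct conn {
    int refcount;            /* one reference per queued work item, plus owners */
    struct work_item *head;  /* pending work */
};

static int conns_freed;      /* counts frees, for illustration only */

/* Queue one work item; the item takes a reference on the connection. */
static int conn_queue_work(struct conn *c)
{
    struct work_item *w = malloc(sizeof(*w));

    if (!w)
        return -1;
    w->next = c->head;
    c->head = w;
    c->refcount++;
    return 0;
}

/* Drop one reference; returns true if this freed the connection. */
static bool conn_put(struct conn *c)
{
    if (--c->refcount == 0) {
        free(c);
        conns_freed++;
        return true;
    }
    return false;
}

/* Returns the number of work items handled. */
static int conn_work_handler(struct conn *c)
{
    int handled = 0;
    bool empty = (c->head == NULL);        /* kernel: read under the lock */

    while (!empty) {
        struct work_item *w = c->head;

        c->head = w->next;
        empty = (c->head == NULL);         /* remember this before unlocking */
        free(w);
        handled++;                         /* kernel: process event, unlocked */

        if (conn_put(c))
            return handled;                /* freed: must not touch c again */
        if (empty)
            return handled;                /* no more items, so no guaranteed reference */
        /* kernel: re-acquire the lock here; safe because more work is queued */
    }
    return handled;
}
```

The early `return` on `empty` corresponds to the two lines the patch adds: once the handler's own reference is gone and nothing else was queued, another thread may free the object at any time.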
Re: [PATCH] ipoib: fix hard_header return value
On 04/01/13 17:25, Or Gerlitz wrote:
> Doug Ledford dledf...@redhat.com wrote:
>> If you have a patched up dhcp server (and dhclient),
>
> Could you be more specific, I assume you refer to the ISC dhcp bits, which version and which patches?

Any version of dhcp server, and the improved-xid and lpf-ib patches primarily. We've carried those patches forever, but as far as I know, they still have not been taken by ISC. Without them, dhcp server will only work with a cooked socket interface. You can't use raw as the socket type when compiling or else it won't work on IB. With the patches, you can enable the raw socket method, and on IB it will fall back to use PACKET instead.

> AFAIK they don't give you access to their source repo but rather only to drops plus possibly patches which is a bit more tedious to follow, oh wel...

>> they will use AF_PACKET/SOCK_DGRAM pair to send dhcp packets over IPoIB. This has worked since forever if you use OFED kernels or one of the distribution kernels. However, when testing an upstream kernel, it has been broken for a very long time (I tested 2.6.34, 2.6.38, 3.0, 3.1, 3.8, HEAD).
>
> IMO doesn't seem relevant to the upstream commit message

I disagree. I don't buy the whole "we are upstream, nothing else matters or is relevant" philosophy. The truth of the matter is that there is essentially a fork between upstream and OFED. I plan on spending some time bringing some of the relevant fixes present in OFED and not upstream back to upstream. In the context of attempting to manually merge some of that fork back together, I see no reason whatsoever to ignore relevant historical information during that process.

>> It turns out that the hard_header routine in ipoib is not following the API and is returning 0 even when it pushed data onto the skb. This then causes af_packet.c to overwrite the header just pushed with data from user space. This header is immediately referenced in the ipoib_start_xmit routine
>
> cool, I assume we want this fix to go for -stable after spending some time upstream, e.g. probably by the time 3.9 is released and some more testing is done on the patch (I'll advocate for that @ MLNX, copied some folks now) get that to -stable too.

Yes, it can go to -stable. But, given that no one has noticed before now that it doesn't work, I'm guessing not many people are using straight upstream (which is something that needs to be fixed IMO).

> Erez, in the code we use internally which is based on upstream 3.7 do we have DHCP/IPoIB working without this patch?

>> so I'm wondering how this ever worked in distro/ofed kernels that also have this bug, but fixing the bug here makes things work in upstream kernels.
>
> same for the last three lines

>> Signed-off-by: Doug Ledford dledf...@redhat.com
>> ---
>>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> index 8534afd..31dd2a7 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> @@ -828,7 +828,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
>>       */
>>      memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
>>
>> -    return 0;
>> +    return sizeof *header;
>>  }
>>
>>  static void ipoib_set_mcast_list(struct net_device *dev)

--
Doug Ledford dledf...@redhat.com
GPG KeyID: 0E572FDD
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
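Editorial note: the failure described in this thread (a header routine that pushes a header but returns 0, so the sender copies user data over it) can be shown with a small user-space model. This is a sketch under stated assumptions, not the kernel code: `packet`, `push_header`, `build_packet` and the 4-byte `HDR_LEN` are hypothetical, and the `report_len` switch exists only to reproduce both the fixed and the buggy behaviour.

```c
#include <stddef.h>
#include <string.h>

#define HDR_LEN 4   /* hypothetical link-layer header size */

struct packet {
    unsigned char data[64];
};

/*
 * Stand-in for a hard_header-style routine: writes the header at the
 * front of the packet and returns how many bytes it pushed. Passing
 * report_len = 0 reproduces the bug (header pushed, 0 returned).
 */
static int push_header(struct packet *p, int report_len)
{
    static const unsigned char hdr[HDR_LEN] = { 0x08, 0x00, 0x00, 0x00 };

    memcpy(p->data, hdr, HDR_LEN);
    return report_len ? HDR_LEN : 0;
}

/*
 * Stand-in for the sender: copies the payload at the offset the header
 * routine reported. Returns that offset, or -1 on error.
 */
static int build_packet(struct packet *p, const void *payload, size_t n,
                        int report_len)
{
    int offset = push_header(p, report_len);

    if (offset < 0 || (size_t)offset + n > sizeof(p->data))
        return -1;
    memcpy(p->data + offset, payload, n);
    return offset;
}
```

With `report_len = 1` the payload lands after the header; with `report_len = 0` the first payload bytes replace the header, which matches the overwrite described above.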
Re: NFS over RDMA benchmark
On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
> -----Original Message-----
> From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
> Sent: Wednesday, April 17, 2013 21:06
> To: Atchley, Scott
> Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
> Subject: Re: NFS over RDMA benchmark
>
> On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov wrote:
> On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
>
> Hi.
>
> I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers, each with 16 cores, 32 GB of memory, and a Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7.
>
> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over RDMA-mounted NFS, I get 260-2200MB/sec for the same block sizes (4-512K). Running over IPoIB-CM, I get 200-980MB/sec.
>
> Yan,
>
> Are you trying to optimize single client performance or server performance with multiple clients?
>
> I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost. I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from a single client PC.
>
> What I did see is that the server is sweating a lot more than the clients and, more than that, it has 1 core (CPU5) in 100% softirq tasklet:
>
> cat /proc/softirqs

Would any profiling help figure out which code it's spending time in? (E.g. something as simple as perf top might have useful output.)

--b.

> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
> HI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> TIMER: 418767 46596 43515 44547 50099 34815 40634 40337 39551 93442 73733 42631 42509 41592 40351 61793
> NET_TX: 28719309 1421 1294 1730 1243832937 11 44 41 20 26 19 15 29
> NET_RX: 612070 19 22 21 6 235 3 2 9 6 17 16 20 13 16 10
> BLOCK: 5941 0 0 0 0 0 0 0519259 1238272 253174215 2618
> BLOCK_IOPOLL: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> TASKLET: 28 1 1 1 1 1540653 1 1 29 1 1 1 1 1 1 2
> SCHED: 364965 26547 16807 18403 22919 8678 14358 14091 16981 64903 47141 18517 19179 18036 17037 38261
> HRTIMER: 13 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1
> RCU: 945823 841546 715281 892762 823564 42663 863063 841622 333577 389013 393501 239103 221524 258159 313426 234030
>
> Remember there are always gaps between wire speed (that ib_send_bw measures) and real world applications.
>
> I realize that, but I don't expect the difference to be more than twice.
>
> That being said, does your server use the default export (sync) option? Exporting the share with the async option can bring you closer to wire speed. However, the practice (async) is generally not recommended in a real production system, as it can cause data integrity issues, e.g. you have more chances to lose data when the boxes crash.
>
> I am running with the async export option, but that should not matter too much, since my backing storage is tmpfs mounted with noatime.
>
> -- Wendy
>
> Wendy,
>
> It has been a few years since I looked at RPCRDMA, but I seem to remember that RPCs were limited to 32KB, which means that you have to pipeline them to get linerate. In addition to requiring
Re: [PATCH -next] iser-target: fix error return code in isert_connect_request()
On Fri, 2013-04-19 at 13:13 +0800, Wei Yongjun wrote:
> From: Wei Yongjun yongjun_...@trendmicro.com.cn
>
> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
>
> Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn
> ---

Merged into the initial iser-target commit in for-next-merge.

Thanks Wei!

--nab

>  drivers/infiniband/ulp/isert/ib_isert.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
> index f6f4f58..803b949 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.c
> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> @@ -329,6 +329,7 @@ static struct isert_device *
>  isert_device_find_by_ib_dev(struct rdma_cm_id *cma_id)
>  {
>      struct isert_device *device;
> +    int ret;
>
>      mutex_lock(&device_list_mutex);
>      list_for_each_entry(device, &device_list, dev_node) {
> @@ -342,16 +343,17 @@ isert_device_find_by_ib_dev(struct rdma_cm_id *cma_id)
>      device = kzalloc(sizeof(struct isert_device), GFP_KERNEL);
>      if (!device) {
>          mutex_unlock(&device_list_mutex);
> -        return NULL;
> +        return ERR_PTR(-ENOMEM);
>      }
>
>      INIT_LIST_HEAD(&device->dev_node);
>
>      device->ib_device = cma_id->device;
> -    if (isert_create_device_ib_res(device)) {
> +    ret = isert_create_device_ib_res(device);
> +    if (ret) {
>          kfree(device);
>          mutex_unlock(&device_list_mutex);
> -        return NULL;
> +        return ERR_PTR(ret);
>      }
>
>      device->refcount++;
> @@ -434,8 +436,10 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
>      }
>
>      device = isert_device_find_by_ib_dev(cma_id);
> -    if (!device)
> +    if (IS_ERR(device)) {
> +        ret = PTR_ERR(device);
>          goto out_rsp_dma_map;
> +    }
>
>      isert_conn->conn_device = device;
>      isert_conn->conn_pd = device->dev_pd;
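Editorial note: the patch above switches a lookup function from returning NULL on failure to returning an error encoded in the pointer, so that the caller can propagate a specific negative error code. The sketch below reproduces that idiom in user space. The three helpers are simplified stand-ins written for this example, not the kernel's own definitions, and `device_find` / `connect_request` with their `fail` parameter are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified user-space stand-ins for the kernel helpers of the same name. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static int IS_ERR(const void *ptr)
{
    /* Error values occupy the top MAX_ERRNO addresses. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct device {
    int refcount;
};

/* Hypothetical lookup: `fail` is an errno value to simulate, or 0. */
static struct device *device_find(int fail)
{
    struct device *dev;

    if (fail)
        return ERR_PTR(-fail);
    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return ERR_PTR(-ENOMEM);
    dev->refcount++;
    return dev;
}

/* Caller: propagates the encoded error instead of returning 0. */
static int connect_request(int fail)
{
    struct device *dev = device_find(fail);

    if (IS_ERR(dev))
        return (int)PTR_ERR(dev);
    free(dev);
    return 0;
}
```

Note that an error pointer is not NULL, so a caller that still tested `!device` would treat the failure as success; that is why the patch changes the check in isert_connect_request() together with the return values.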
Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote:
> Sean, Tzahi -- I understand now that there have been a few discussions @ the OFA meeting re this patch set. So what's the way to move forward, any concrete feedback that needs to be addressed here? This series has been hanging since May 2012 and I'd like to see it get in for 3.10, now if indeed Sean is OK with the general framework, please suggest.

Sean, I understand that following some conversations held at the OFA meetings you kind of took back the concerns you raised regarding the concept of the verbs-level QP group which is used by this series to implement RSS and TSS, can you acknowledge that?

Roland, this series has been around for about a year now, any feedback or comments from your side that we need to address for it to get accepted?

Or.
RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that
> On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote:
>> Sean, Tzahi -- I understand now that there have been a few discussions @ the OFA meeting re this patch set. So what's the way to move forward, any concrete feedback that needs to be addressed here? This series has been hanging since May 2012 and I'd like to see it get in for 3.10, now if indeed Sean is OK with the general framework, please suggest.
>
> Sean, I understand that following some conversations held at the OFA meetings you kind of took back the concerns you raised regarding the concept of the verbs-level QP group which is used by this series to implement RSS and TSS, can you acknowledge that?

No - I agree with the RSS/TSS concept. That I've never had an issue with. My issue is that the current verbs API appears to be a poor fit. I don't have a good answer for an alternative.

Conceptually, RSS/TSS are a set of send/receive work queues all belonging to the same transport level address. There's no parent-child relationship or needed pairing of send and receive queues together in order to form a group. Personally, I'd like to see a way that better captures the notion of a 'set of work queues with the same address'. For example, it makes more sense to me if a user created/destroyed the work queues together, and if the WQs were viewed as being in a single state (INIT, RTR, RTS...).

I'm just thinking out loud here, hoping that it spurs ideas, but if we added a call like:

    struct ib_qp *ib_create_wq_array/set/group(...);

then added the ability to specify which WQ a specific send or receive should be posted to, it may do a better job of capturing RSS/TSS concepts, but still make use of the existing calls. (Underneath this, the driver can allocate actual QPs with sequential QPNs or whatever is required, but that's not exposed.)

Obviously, I haven't thought through specifics. I'll try to meet up with Diego and Tzahi tonight or tomorrow to discuss this further.

- Sean
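Editorial note: the idea floated in this message (a set of work queues that share one address, are created and destroyed together, carry a single state, and are selected by index when posting) can be pictured with a small data-structure sketch. Everything below is hypothetical and exists only for illustration: `wq_set`, `wq_set_create`, `wq_set_destroy` and `wq_set_post_send` are not part of the verbs API or of the patch series under discussion, and the "post" operation just counts requests.

```c
#include <stdlib.h>

/* One state for the whole set, mirroring the INIT/RTR/RTS idea. */
enum wq_state { WQ_RESET, WQ_INIT, WQ_RTR, WQ_RTS };

struct wq {
    unsigned posted;            /* number of requests posted (illustration) */
};

/* Hypothetical "set of work queues with the same address". */
struct wq_set {
    unsigned address;           /* single transport-level address for all WQs */
    enum wq_state state;        /* shared by every WQ in the set */
    unsigned nr_send, nr_recv;
    struct wq *send, *recv;
};

/* Create all send and receive queues together. */
static struct wq_set *wq_set_create(unsigned address, unsigned nr_send,
                                    unsigned nr_recv)
{
    struct wq_set *s;

    if (!nr_send || !nr_recv)
        return NULL;
    s = calloc(1, sizeof(*s));
    if (!s)
        return NULL;
    s->send = calloc(nr_send, sizeof(*s->send));
    s->recv = calloc(nr_recv, sizeof(*s->recv));
    if (!s->send || !s->recv) {
        free(s->send);
        free(s->recv);
        free(s);
        return NULL;
    }
    s->address = address;
    s->nr_send = nr_send;
    s->nr_recv = nr_recv;
    s->state = WQ_RESET;
    return s;
}

/* Destroy all queues together. */
static void wq_set_destroy(struct wq_set *s)
{
    if (!s)
        return;
    free(s->send);
    free(s->recv);
    free(s);
}

/* Post a send to the WQ chosen by index; needs the set to be in RTS. */
static int wq_set_post_send(struct wq_set *s, unsigned idx)
{
    if (s->state != WQ_RTS || idx >= s->nr_send)
        return -1;
    s->send[idx].posted++;
    return 0;
}
```

The sketch has no parent or child queue and no pairing of a send queue with a receive queue, which is the property the message argues a better-fitting interface should have.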