[PATCH] opensm: Fix Q_Key, TClass and limited keys parsing warnings in partitions.conf

2013-04-23 Thread Alex Netes
Parsing these parameters caused an 'unrecognized mgroup flag' warning.

Moreover, fixed the man page/doc for the Q_Key/TClass parameters.

Signed-off-by: Alex Netes ale...@mellanox.com
---
 doc/partition-config.txt | 4 ++--
 man/opensm.8.in  | 4 ++--
 opensm/osm_prtn_config.c | 4 +++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/doc/partition-config.txt b/doc/partition-config.txt
index f49d473..3581ef6 100644
--- a/doc/partition-config.txt
+++ b/doc/partition-config.txt
@@ -94,11 +94,11 @@ General file format:
mgid.  Furthermore specifying multiple scope
settings will result in multiple MC groups
being created.
-qkey=val  - specifies the Q_Key for this MC group
+Q_Key=val - specifies the Q_Key for this MC group
   (default: 0x0b1b for IP groups, 0 for other groups)
   WARNING: changing this for the broadcast group may
   break IPoIB on client nodes!!!
-tclass=val- specifies tclass for this MC group
+TClass=val- specifies tclass for this MC group
   (default is 0)
 FlowLabel=val - specifies FlowLabel for this MC group
   (default is 0)
diff --git a/man/opensm.8.in b/man/opensm.8.in
index 37e2eee..4ab7b30 100644
--- a/man/opensm.8.in
+++ b/man/opensm.8.in
@@ -619,9 +619,9 @@ General file format:
 mgid.  Furthermore specifying multiple scope
 settings will result in multiple MC groups
 being created.
-qkey=val  - specifies the Q_Key for this MC group
+Q_Key=val  - specifies the Q_Key for this MC group
   (default: 0x0b1b for IP groups, 0 for other groups)
-tclass=val- specifies tclass for this MC group
+TClass=val- specifies tclass for this MC group
   (default is 0)
 FlowLabel=val - specifies FlowLabel for this MC group
   (default is 0)
diff --git a/opensm/osm_prtn_config.c b/opensm/osm_prtn_config.c
index a3524b1..8f4a673 100644
--- a/opensm/osm_prtn_config.c
+++ b/opensm/osm_prtn_config.c
@@ -296,12 +296,14 @@ static int parse_group_flag(unsigned lineno, osm_log_t * p_log,
 		else
 			flags->scope_mask |= (1<<scope);
 	} else if (!strncmp(flag, "Q_Key", strlen(flag))) {
+		rc = 1;
 		if (!val || (flags->Q_Key = strtoul(val, NULL, 0)) == 0)
 			OSM_LOG(p_log, OSM_LOG_VERBOSE,
 				"PARSE WARN: line %d: "
 				"flag \'Q_Key\' requires valid value"
 				" - using '0'\n", lineno);
 	} else if (!strncmp(flag, "TClass", strlen(flag))) {
+		rc =1;
 		if (!val || (flags->TClass = strtoul(val, NULL, 0)) == 0)
 			OSM_LOG(p_log, OSM_LOG_VERBOSE,
 				"PARSE WARN: line %d: "
@@ -406,7 +408,7 @@ static int partition_add_port(unsigned lineno, struct part_conf *conf,
 			membership = FULL;
 		else if (!strncmp(flag, "both", strlen(flag)))
 			membership = BOTH;
-		else if (!strncmp(flag, "limited", strlen(flag))) {
+		else if (strncmp(flag, "limited", strlen(flag))) {
 			OSM_LOG(conf->p_log, OSM_LOG_VERBOSE,
 				"PARSE WARN: line %d: "
 				"unrecognized port flag \'%s\'."
-- 
1.7.11.7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
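
Editor's note on the opensm patch above: the fix relies on the parser reporting whether a flag name was recognized. The following is a minimal user-space sketch of that idea, not the opensm code itself. The names parse_group_flag_sketch and struct group_flags are hypothetical, the empty-name guard is an assumption of this sketch, and the link between a zero return value and the 'unrecognized mgroup flag' warning is inferred from the commit message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-group flag storage. */
struct group_flags {
	unsigned long q_key;
	unsigned long tclass;
};

/*
 * Returns 1 when the flag name is recognized and 0 otherwise.
 * As in the patch, the comparison length is strlen(flag), so an
 * abbreviated name such as "Q_K" also matches "Q_Key".
 */
static int parse_group_flag_sketch(const char *flag, const char *val,
				   struct group_flags *flags)
{
	int rc = 0;

	assert(flag != NULL && flags != NULL);
	if (*flag == '\0')
		return 0;	/* sketch-only guard: an empty name matches nothing */

	if (!strncmp(flag, "Q_Key", strlen(flag))) {
		rc = 1;	/* without this, the caller sees "not recognized" */
		flags->q_key = val ? strtoul(val, NULL, 0) : 0;
	} else if (!strncmp(flag, "TClass", strlen(flag))) {
		rc = 1;
		flags->tclass = val ? strtoul(val, NULL, 0) : 0;
	}
	return rc;
}
```

Before the patch, the Q_Key and TClass branches stored the value but left rc at 0, which would explain why a valid flag still produced the warning.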


[PATCH] iw_cm: Don't touch cmid after dereferencing it.

2013-04-23 Thread Steve Wise
Function cm_work_handler() cannot touch the cm_id after it derefs
it because it might be freed on another concurrent thread.  If there
are more work items queued for this cm_id, then we know there must be
more references because they are added when the work items are queued.
So in the while loop inside cm_work_handler(), after derefing, if the
queue is empty, then exit the function.  Otherwise you know it's safe to
re-acquire the lock.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---

 drivers/infiniband/core/iwcm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
index 0bb99bb..c47c203 100644
--- a/drivers/infiniband/core/iwcm.c
+++ b/drivers/infiniband/core/iwcm.c
@@ -878,6 +878,8 @@ static void cm_work_handler(struct work_struct *_work)
 			}
 			return;
 		}
+		if (empty)
+			return;
 		spin_lock_irqsave(&cm_id_priv->lock, flags);
 	}
 	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
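
Editor's note on the iw_cm patch above: the rule it enforces can be shown with a small single-threaded sketch. This is not the kernel code. struct conn, conn_put, conn_queue_work and work_handler are hypothetical names, and locking is left out. The assumption carried over from the commit message is that every queued work item holds its own reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical miniature of a connection object with queued work. */
struct conn {
	int refcount;	/* the owner's reference plus one per queued item */
	int queued;	/* work items still waiting to be handled */
};

static int conns_freed;
static int items_handled;

/* Drop one reference; the object is freed when the last one goes. */
static void conn_put(struct conn *c)
{
	if (--c->refcount == 0) {
		free(c);
		conns_freed++;
	}
}

/* Queueing a work item takes a reference on the object. */
static void conn_queue_work(struct conn *c)
{
	c->refcount++;
	c->queued++;
}

static void work_handler(struct conn *c)
{
	while (c->queued > 0) {
		bool empty;

		c->queued--;
		empty = (c->queued == 0);	/* sampled before the put */
		items_handled++;
		conn_put(c);	/* may free c if this was the last reference */
		if (empty)
			return;	/* no queued item pins c any more: do not touch it */
		/* Not empty: a remaining item still holds a reference. */
	}
}
```

The emptiness check is taken before the reference is dropped, so the loop condition is only evaluated again when a remaining work item still pins the object.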



Re: [PATCH] ipoib: fix hard_header return value

2013-04-23 Thread Doug Ledford
On 04/01/13 17:25, Or Gerlitz wrote:
 Doug Ledford dledf...@redhat.com wrote:
 If you have a patched up dhcp server (and dhclient),
 
 Could you be more specific, I assume you refer to the ISC dhcp bits,
 which version and which patches?

Any version of dhcp server, and the improved-xid and lpf-ib patches
primarily.  We've carried those patches forever, but as far as I know,
they still have not been taken by ISC.  Without them, dhcp server will
only work with a cooked socket interface.  You can't use raw as the
socket type when compiling or else it won't work on IB.  With the
patches, you can enable the raw socket method, and on IB it will fall
back to use PACKET instead.

 AFAIK they don't give you access to
 their source repo but rather only to drops plus possibly patches which
 is a bit more tedious to follow, oh wel...
 
 they will use AF_PACKET/SOCK_DGRAM pair to send dhcp packets over IPoIB.
 
 
 This has worked since forever if you use OFED kernels or one of the 
 distribution
 kernels.  However, when testing an upstream kernel, it has been broken
 for a very long time (I tested 2.6.34, 2.6.38, 3.0, 3.1, 3.8, HEAD).
 
 IMO doesn't seem relevant to the upstream commit message

I disagree.  I don't buy the whole "we are upstream, nothing else
matters or is relevant" philosophy.  The truth of the matter is that
there is essentially a fork between upstream and OFED.  I plan on
spending some time bringing some of the relevant fixes present in OFED
and not upstream back to upstream.  In the context of attempting to
manually merge some of that fork back together, I see no reason
whatsoever to ignore relevant historical information during that process.

 It turns out that the hard_header routine in ipoib is not following
 the API and is returning 0 even when it pushed data onto the skb.  This
 then causes af_packet.c to overwrite the header just pushed with data
 from user space.  This header is immediately referenced in the
 ipoib_start_xmit routine
 
 cool, I assume we want this fix to go for -stable after spending some
 time upstream, e.g probably by the time 3.9 is released and some more
 testing is done on the patch (I'll advocate for that @ MLNX, copied
 some folks now) get that to -stable too.

Yes, it can go to -stable.  But, given that no one has noticed before
now that it doesn't work, I'm guessing not many people are using
straight upstream (which is something that needs fixed IMO).

 Erez, in the code we use internally which is based on upstream 3.7 do
 we have DHCP/IPoIB working without this patch?
 
 so I'm wondering how this ever worked in
 distro/ofed kernels that also have this bug, but fixing the bug here
 makes things work in upstream kernels.
 
 same for the last three lines
 
 
 Signed-off-by: Doug Ledford dledf...@redhat.com
 ---
  drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
 b/drivers/infiniband/ulp/ipoib/ipoib_main.c
 index 8534afd..31dd2a7 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
 @@ -828,7 +828,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
  */
 memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);
 
 -   return 0;
 +   return sizeof *header;
  }
 
  static void ipoib_set_mcast_list(struct net_device *dev)
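
Editor's note on the quoted ipoib patch: the problem described in the thread is a contract mismatch between a header-building callback and its caller. The sketch below is a user-space model of that contract, not the kernel code. struct pkt, HDR_LEN, header_ok, header_buggy and build_packet are all hypothetical. The only assumption taken from the thread is that the caller places the payload at the offset the callback returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_LEN 4	/* hypothetical link-layer header size */

struct pkt {
	unsigned char buf[64];
	size_t len;
};

typedef int (*header_fn)(struct pkt *p, const unsigned char *hdr);

/* Follows the contract: reports how many header bytes were written. */
static int header_ok(struct pkt *p, const unsigned char *hdr)
{
	memcpy(p->buf, hdr, HDR_LEN);
	return HDR_LEN;
}

/* Models the bug: writes the header but reports 0 bytes. */
static int header_buggy(struct pkt *p, const unsigned char *hdr)
{
	memcpy(p->buf, hdr, HDR_LEN);
	return 0;
}

/* Caller in the style of a datagram send path. */
static int build_packet(struct pkt *p, header_fn create,
			const unsigned char *hdr,
			const void *payload, size_t n)
{
	int offset = create(p, hdr);

	if (offset < 0)
		return offset;
	assert((size_t)offset + n <= sizeof p->buf);
	/* The payload lands at the reported offset. */
	memcpy(p->buf + offset, payload, n);
	p->len = (size_t)offset + n;
	return 0;
}
```

With header_buggy the payload is copied over the header that was just written, which matches the overwrite described in the commit message.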
 


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


Re: NFS over RDMA benchmark

2013-04-23 Thread J. Bruce Fields
On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
 
 
  -Original Message-
  From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
  Sent: Wednesday, April 17, 2013 21:06
  To: Atchley, Scott
  Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
  linux-...@vger.kernel.org
  Subject: Re: NFS over RDMA benchmark
  
  On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov
  wrote:
   On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
  wrote:
  
   On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
  wrote:
   Hi.
  
   I've been trying to do some benchmarks for NFS over RDMA and I seem to
  only get about half of the bandwidth that the HW can give me.
   My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
  Mellanox ConnectX3 QDR card over PCI-e gen3.
   These servers are connected to a QDR IB switch. The backing storage on
  the server is tmpfs mounted with noatime.
   I am running kernel 3.5.7.
  
   When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
   When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
  same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
  
   Yan,
  
   Are you trying to optimize single client performance or server performance
  with multiple clients?
  
 
 I am trying to get maximum performance from a single server - I used 2 
 processes in fio test - more than 2 did not show any performance boost.
 I tried running fio from 2 different PCs on 2 different files, but the sum of 
 the two is more or less the same as running from single client PC.
 
 What I did see is that server is sweating a lot more than the clients and 
 more than that, it has 1 core (CPU5) in 100% softirq tasklet:
 cat /proc/softirqs

Would any profiling help figure out which code it's spending time in?
(E.g. something simple as perf top might have useful output.)

--b.

 CPU0   CPU1   CPU2   CPU3   CPU4   
 CPU5   CPU6   CPU7   CPU8   CPU9   CPU10  CPU11  
 CPU12  CPU13  CPU14  CPU15
   HI:  0  0  0  0  0  
 0  0  0  0  0  0  0  
 0  0  0  0
TIMER: 418767  46596  43515  44547  50099  
 34815  40634  40337  39551  93442  73733  42631  
 42509  41592  40351  61793
   NET_TX:  28719309   1421   1294   1730   
 1243832937 11 44 41 20
  26 19 15 29
   NET_RX: 612070 19 22 21  6
 235  3  2  9  6 17 16 
 20 13 16 10
BLOCK:   5941  0  0  0  0  
 0  0  0519259   1238272
 253174215   2618
 BLOCK_IOPOLL:  0  0  0  0  0  
 0  0  0  0  0  0  0  
 0  0  0  0
  TASKLET: 28  1  1  1  1
 1540653  1  1 29  1  1  1 
  1  1  1  2
SCHED: 364965  26547  16807  18403  22919   
 8678  14358  14091  16981  64903  47141  18517  
 19179  18036  17037  38261
  HRTIMER: 13  0  1  1  0  
 0  0  0  0  0  0  0  
 1  1  0  1
  RCU: 945823 841546 715281 892762 823564  
 42663 863063 841622 333577 389013 393501 239103 
 221524 258159 313426 234030
  
   Remember there are always gaps between wire speed (that ib_send_bw
   measures) and real world applications.
 
 I realize that, but I don't expect the difference to be more than twice.
 
  
   That being said, does your server use default export (sync) option ?
   Export the share with async option can bring you closer to wire
   speed. However, the practice (async) is generally not recommended in
   a real production system - as it can cause data integrity issues, e.g.
   you have more chances to lose data when the boxes crash.
 
 I am running with async export option, but that should not matter too much, 
 since my backing storage is tmpfs mounted with noatime.
 
  
   -- Wendy
  
  
   Wendy,
  
   It has been a few years since I looked at RPCRDMA, but I seem to
  remember that RPCs were limited to 32KB which means that you have to
  pipeline them to get linerate. In addition to requiring 

Re: [PATCH -next] iser-target: fix error return code in isert_connect_request()

2013-04-23 Thread Nicholas A. Bellinger
On Fri, 2013-04-19 at 13:13 +0800, Wei Yongjun wrote:
 From: Wei Yongjun yongjun_...@trendmicro.com.cn
 
 Fix to return a negative error code from the error handling
 case instead of 0, as done elsewhere in this function.
 
 Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn
 ---

Merged into the initial iser-target commit in for-next-merge.

Thanks Wei!

--nab

  drivers/infiniband/ulp/isert/ib_isert.c | 12 
  1 file changed, 8 insertions(+), 4 deletions(-)
 
 diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
 b/drivers/infiniband/ulp/isert/ib_isert.c
 index f6f4f58..803b949 100644
 --- a/drivers/infiniband/ulp/isert/ib_isert.c
 +++ b/drivers/infiniband/ulp/isert/ib_isert.c
 @@ -329,6 +329,7 @@ static struct isert_device *
  isert_device_find_by_ib_dev(struct rdma_cm_id *cma_id)
  {
   struct isert_device *device;
 + int ret;
  
   mutex_lock(&device_list_mutex);
   list_for_each_entry(device, &device_list, dev_node) {
 @@ -342,16 +343,17 @@ isert_device_find_by_ib_dev(struct rdma_cm_id *cma_id)
   device = kzalloc(sizeof(struct isert_device), GFP_KERNEL);
   if (!device) {
   mutex_unlock(&device_list_mutex);
 - return NULL;
 + return ERR_PTR(-ENOMEM);
   }
  
   INIT_LIST_HEAD(&device->dev_node);
  
   device->ib_device = cma_id->device;
 - if (isert_create_device_ib_res(device)) {
 + ret = isert_create_device_ib_res(device);
 + if (ret) {
   kfree(device);
   mutex_unlock(&device_list_mutex);
 - return NULL;
 + return ERR_PTR(ret);
   }
  
   device->refcount++;
 @@ -434,8 +436,10 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
   }
  
   device = isert_device_find_by_ib_dev(cma_id);
 - if (!device)
 + if (IS_ERR(device)) {
 + ret = PTR_ERR(device);
   goto out_rsp_dma_map;
 + }
  
   isert_conn->conn_device = device;
   isert_conn->conn_pd = device->dev_pd;
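
Editor's note on the quoted iser-target patch: it switches a lookup helper from returning NULL on failure to returning an error code encoded in the pointer, so the caller can propagate the real reason. Below is a user-space sketch of that convention. The lowercase helpers err_ptr, ptr_err and is_err are stand-ins written for this sketch that imitate the kernel's ERR_PTR, PTR_ERR and IS_ERR. struct device_sketch and both functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
static inline void *err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int is_err(const void *p)
{
	/* Error codes occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct device_sketch {
	int refcount;
};

/* Returns a valid device or an encoded negative errno, never NULL. */
static struct device_sketch *device_find_sketch(int simulate_errno)
{
	struct device_sketch *d;

	if (simulate_errno)
		return err_ptr(-simulate_errno);

	d = calloc(1, sizeof *d);
	if (!d)
		return err_ptr(-ENOMEM);
	d->refcount++;
	return d;
}

/* The caller propagates the specific error instead of returning 0. */
static int connect_request_sketch(int simulate_errno)
{
	struct device_sketch *d = device_find_sketch(simulate_errno);

	if (is_err(d))
		return (int)ptr_err(d);
	free(d);
	return 0;
}
```

A plain NULL check cannot distinguish an allocation failure from a resource setup failure, which is why the patch changes both the helper and its caller together.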
 




Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-23 Thread Or Gerlitz
On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote:
 Sean, Tzahi -- I understand now that there have been a few talks @
 the OFA meeting re this patch set. So what's the way to move forward,
 any concrete feedback that needs to be addressed here?  This series has
 been hanging since May 2012 and I'd like to see it get in for 3.10, now if
 indeed Sean is OK with the general framework, please suggest.

Sean,

I understand that following some conversations held at the OFA
meetings you kind of took back the concerns you raised regarding the
concept of the verbs level QP group which is used by this series to
implement RSS and TSS, can you acknowledge that?

Roland, this series has been around for about a year now, any feedback
or comments from your side that we need to address for it to get
accepted?

Or.


RE: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups - suggesting BOF during OFA conf to further discuss that

2013-04-23 Thread Hefty, Sean
 On Mon, Apr 22, 2013 at 7:46 PM, Or Gerlitz or.gerl...@gmail.com wrote:
  Sean, Tzahi -- I understand now that there have been a few talks @
  the OFA meeting re this patch set. So what's the way to move forward,
  any concrete feedback that needs to be addressed here?  This series has
  been hanging since May 2012 and I'd like to see it get in for 3.10, now if
  indeed Sean is OK with the general framework, please suggest.
 
 Sean,
 
 I understand that following some conversations held at the OFA
 meetings you kind of took back the concerns you raised regarding the
 concept of the verbs level QP group which is used by this series to
 implement RSS and TSS, can you acknowledge that?

No - I agree with the RSS/TSS concept.  That I've never had an issue with.  My 
issue is that the current verbs API appears to be a poor fit.  I don't have a 
good answer for an alternative.

Conceptually, RSS/TSS are a set of send/receive work queues all belonging to 
the same transport level address.  There's no parent-child relationship or 
needed pairing of send and receive queues together in order to form a group.

Personally, I'd like to see a way that better captures the notion of a 'set of 
work queues with the same address'.  For example, it makes more sense to me if 
a user created/destroyed the work queues together, and if the WQs were viewed 
as being in a single state (INIT, RTR, RTS...).

I'm just thinking out loud here, hoping that it spurs ideas, but if we added a 
call like:

struct ib_qp *ib_create_wq_array/set/group(...);

then added the ability to specify which WQ a specific send or receive should be 
posted to, it may do a better job of capturing RSS/TSS concepts, but still make 
use of the existing calls.  (Underneath this, the driver can allocate actual 
QPs  with sequential QPNs or whatever is required, but that's not exposed.)  
Obviously, I haven't thought through specifics.
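
Editor's note: one way to picture the "set of work queues with the same address" idea above is the user-space sketch below. It is purely illustrative and is not a proposed or existing verbs API. Every name in it (struct wq_set, wq_set_create, wq_set_post, wq_set_destroy) is hypothetical. It only models three properties mentioned in the message: the queues are created and destroyed together, they share a single state, and a post names the queue it targets.

```c
#include <assert.h>
#include <stdlib.h>

enum set_state { SET_RESET, SET_INIT, SET_RTR, SET_RTS };

struct wq {
	unsigned posted;	/* work requests posted to this queue */
};

struct wq_set {
	enum set_state state;	/* one state shared by every queue in the set */
	unsigned num_wqs;
	struct wq *wqs;
};

/* All queues of the set are allocated in one call. */
static struct wq_set *wq_set_create(unsigned num_wqs)
{
	struct wq_set *set;

	if (num_wqs == 0)
		return NULL;
	set = calloc(1, sizeof *set);
	if (!set)
		return NULL;
	set->wqs = calloc(num_wqs, sizeof *set->wqs);
	if (!set->wqs) {
		free(set);
		return NULL;
	}
	set->num_wqs = num_wqs;
	set->state = SET_RESET;
	return set;
}

/* Post to one specific queue of the set; returns 0 or -1. */
static int wq_set_post(struct wq_set *set, unsigned index)
{
	if (!set || index >= set->num_wqs)
		return -1;
	set->wqs[index].posted++;
	return 0;
}

/* All queues of the set are released in one call. */
static void wq_set_destroy(struct wq_set *set)
{
	if (!set)
		return;
	free(set->wqs);
	free(set);
}
```

The sketch has no parent or child object: the set itself is the only handle, which is the contrast with a parent-child QP group drawn in the message.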

I'll try to meet up with Diego and Tzahi tonight or tomorrow to discuss this 
further.

- Sean