RE: NFS over RDMA benchmark

2013-04-24 Thread Yan Burman


 -Original Message-
 From: J. Bruce Fields [mailto:bfie...@fieldses.org]
 Sent: Wednesday, April 24, 2013 00:06
 To: Yan Burman
 Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz
 Subject: Re: NFS over RDMA benchmark
 
 On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
 
 
   -Original Message-
   From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
   Sent: Wednesday, April 17, 2013 21:06
   To: Atchley, Scott
   Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
   linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
   Subject: Re: NFS over RDMA benchmark
  
    On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchle...@ornl.gov> wrote:
     On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:

     On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
Hi.
   
 I've been trying to do some benchmarks for NFS over RDMA and I seem to
 only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers, each with 16 cores, 32GB of memory, and a
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on
 the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma-mounted nfs, I get 260-2200 MB/sec for the
 same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
   
Yan,
   
 Are you trying to optimize single client performance or server
 performance with multiple clients?
   
 
 I am trying to get maximum performance from a single server - I used 2
 processes in the fio test - more than 2 did not show any performance boost.
 I tried running fio from 2 different PCs on 2 different files, but the sum
 of the two is more or less the same as running from a single client PC.
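
 (For reference, a typical invocation for this kind of test; every
 parameter here is an assumption, since the actual fio job isn't shown
 in the thread:

	fio --name=nfstest --directory=/mnt/nfs --rw=read --bs=256k \
	    --size=4g --numjobs=2 --direct=1 --ioengine=libaio --iodepth=32

 run against the RDMA-mounted export.)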
 
 What I did see is that the server is sweating a lot more than the clients;
 more than that, it has 1 core (CPU5) at 100% in softirq tasklet:
  cat /proc/softirqs
 
 Would any profiling help figure out which code it's spending time in?
 (E.g. something as simple as "perf top" might have useful output.)
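 (A minimal invocation for this case, assuming perf's -C option to
 restrict sampling to a single CPU, would be something like:

	perf top -C 5

 so that only the core spinning in softirq gets sampled.)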
 


Perf top for the CPU with high tasklet count gives:

 samples  pcnt RIP              function                 DSO
 _______ _____ ________________ ________________________ ____________________

 2787.00 24.1% 81062a00 mutex_spin_on_owner      /root/vmlinux
  978.00  8.4% 810297f0 clflush_cache_range      /root/vmlinux
  445.00  3.8% 812ea440 __domain_mapping         /root/vmlinux
  441.00  3.8% 00018c30 svc_recv                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  344.00  3.0% 813a1bc0 _raw_spin_lock_bh        /root/vmlinux
  333.00  2.9% 813a19e0 _raw_spin_lock_irqsave   /root/vmlinux
  288.00  2.5% 813a07d0 __schedule               /root/vmlinux
  249.00  2.1% 811a87e0 rb_prev                  /root/vmlinux
  242.00  2.1% 813a19b0 _raw_spin_lock           /root/vmlinux
  184.00  1.6% 2e90     svc_rdma_sendto          /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
  177.00  1.5% 810ac820 get_page_from_freelist   /root/vmlinux
  174.00  1.5% 812e6da0 alloc_iova               /root/vmlinux
  165.00  1.4% 810b1390 put_page                 /root/vmlinux
  148.00  1.3% 00014760 sunrpc_cache_lookup      /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  128.00  1.1% 00017f20 svc_xprt_enqueue         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  126.00  1.1% 8139f820 __mutex_lock_slowpath    /root/vmlinux
  108.00  0.9% 811a81d0 rb_insert_color          /root/vmlinux
  107.00  0.9% 4690     svc_rdma_recvfrom        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
  102.00  0.9% 2640     send_reply               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   99.00  0.9% 810e6490 kmem_cache_alloc         /root/vmlinux
   96.00  0.8% 810e5840 __slab_alloc             /root/vmlinux
   91.00  0.8% 6d30     mlx4_ib_post_send        /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
   88.00  0.8% 0dd0     svc_rdma_get_context     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   86.00  0.7% 813a1a10 _raw_spin_lock_irq
[PATCH for-next 3/9] net/mlx4_core: Change a few DMFS field names to match firmware spec

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Change struct mlx4_net_trans_rule_hw_eth :: vlan_id name to vlan_tag

Change struct mlx4_net_trans_rule_hw_ib :: r_u_qpn name to l3_qpn

The patch doesn't introduce any functional change or API change
towards the firmware.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c |    6 +++---
 include/linux/mlx4/device.h              |    8 ++++----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index d1f01dc..3cfd372 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -714,12 +714,12 @@ static int parse_trans_rule(struct mlx4_dev *dev, struct mlx4_spec_list *spec,
 		rule_hw->eth.ether_type_enable = 1;
 		rule_hw->eth.ether_type = spec->eth.ether_type;
 	}
-	rule_hw->eth.vlan_id = spec->eth.vlan_id;
-	rule_hw->eth.vlan_id_msk = spec->eth.vlan_id_msk;
+	rule_hw->eth.vlan_tag = spec->eth.vlan_id;
+	rule_hw->eth.vlan_tag_msk = spec->eth.vlan_id_msk;
break;
 
case MLX4_NET_TRANS_RULE_ID_IB:
-	rule_hw->ib.qpn = spec->ib.r_qpn;
+	rule_hw->ib.l3_qpn = spec->ib.l3_qpn;
 	rule_hw->ib.qpn_mask = spec->ib.qpn_msk;
 	memcpy(rule_hw->ib.dst_gid, spec->ib.dst_gid, 16);
 	memcpy(rule_hw->ib.dst_gid_msk, spec->ib.dst_gid_msk, 16);
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index b2fe59d..a69bda7 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -930,7 +930,7 @@ struct mlx4_spec_ipv4 {
 };
 
 struct mlx4_spec_ib {
-   __be32  r_qpn;
+   __be32  l3_qpn;
__be32  qpn_msk;
u8  dst_gid[16];
u8  dst_gid_msk[16];
@@ -978,7 +978,7 @@ struct mlx4_net_trans_rule_hw_ib {
u8 rsvd1;
__be16 id;
u32 rsvd2;
-   __be32 qpn;
+   __be32 l3_qpn;
__be32 qpn_mask;
u8 dst_gid[16];
u8 dst_gid_msk[16];
@@ -999,8 +999,8 @@ struct mlx4_net_trans_rule_hw_eth {
u8  rsvd5;
u8  ether_type_enable;
__be16  ether_type;
-   __be16  vlan_id_msk;
-   __be16  vlan_id;
+   __be16  vlan_tag_msk;
+   __be16  vlan_tag;
 } __packed;
 
 struct mlx4_net_trans_rule_hw_tcp_udp {
-- 
1.7.1



[PATCH for-next 4/9] net/mlx4_core: Directly expose fields of DMFS HW rule control segment

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Some of struct mlx4_net_trans_rule_hw_ctrl fields were packed into u32
and accessed through bit field operations. Expose and access them
directly as u8 or u16.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c |   14 +++++++-------
 include/linux/mlx4/device.h              |    4 +++-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 3cfd372..07712f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -656,15 +656,15 @@ static void trans_rule_ctrl_to_hw(struct mlx4_net_trans_rule *ctrl,
[MLX4_FS_MC_SNIFFER]= 0x5,
};
 
-	u32 dw = 0;
+	u8 flags = 0;
 
-	dw = ctrl->queue_mode == MLX4_NET_TRANS_Q_LIFO ? 1 : 0;
-	dw |= ctrl->exclusive ? (1 << 2) : 0;
-	dw |= ctrl->allow_loopback ? (1 << 3) : 0;
-	dw |= __promisc_mode[ctrl->promisc_mode] << 8;
-	dw |= ctrl->priority << 16;
+	flags = ctrl->queue_mode == MLX4_NET_TRANS_Q_LIFO ? 1 : 0;
+	flags |= ctrl->exclusive ? (1 << 2) : 0;
+	flags |= ctrl->allow_loopback ? (1 << 3) : 0;
 
-	hw->ctrl = cpu_to_be32(dw);
+	hw->flags = flags;
+	hw->type = __promisc_mode[ctrl->promisc_mode];
+	hw->prio = cpu_to_be16(ctrl->priority);
 	hw->port = ctrl->port;
 	hw->qpn = cpu_to_be32(ctrl->qpn);
 }
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index a69bda7..08e5bc1 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -964,7 +964,9 @@ struct mlx4_net_trans_rule {
 };
 
 struct mlx4_net_trans_rule_hw_ctrl {
-   __be32 ctrl;
+   __be16 prio;
+   u8 type;
+   u8 flags;
u8 rsvd1;
u8 funcid;
u8 vep;
-- 
1.7.1



[PATCH for-next 5/9] net/mlx4_core: Expose a few helpers to fill DMFS HW structures

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Re-arrange some of the code which fills DMFS HW structures, so it can be
used both from within the core driver and from the IB driver, e.g. when
verbs DMFS structures are transformed into mlx4 hardware structs.

Also, add a struct mlx4_flow_handle which will be of use to the DMFS
verbs flow in the IB driver.
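
For illustration, a sketch of how a consumer such as the IB driver could
use the newly exported helpers (hypothetical caller; it mirrors what
parse_trans_rule() does below):

	int sz = mlx4_hw_rule_sz(dev, MLX4_NET_TRANS_RULE_ID_ETH);

	if (sz < 0)
		return sz;		/* -EINVAL for an unsupported rule type */
	memset(rule_hw, 0, sz);
	rule_hw->size = sz >> 2;	/* the HW size field counts 4-byte units */
	rule_hw->id = cpu_to_be16(mlx4_map_sw_to_hw_steering_id(dev,
					MLX4_NET_TRANS_RULE_ID_ETH));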

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c |   85 +-
 include/linux/mlx4/device.h  |   12 -
 2 files changed, 70 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 07712f9..00b4e7b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -645,17 +645,28 @@ static int find_entry(struct mlx4_dev *dev, u8 port,
return err;
 }
 
+static const u8 __promisc_mode[] = {
+   [MLX4_FS_REGULAR]   = 0x0,
+   [MLX4_FS_ALL_DEFAULT] = 0x1,
+   [MLX4_FS_MC_DEFAULT] = 0x3,
+   [MLX4_FS_UC_SNIFFER] = 0x4,
+   [MLX4_FS_MC_SNIFFER] = 0x5,
+};
+
+int mlx4_map_sw_to_hw_steering_mode(struct mlx4_dev *dev,
+   enum mlx4_net_trans_promisc_mode flow_type)
+{
+	if (flow_type >= MLX4_FS_MODE_NUM || flow_type < 0) {
+		mlx4_err(dev, "Invalid flow type. type = %d\n", flow_type);
+   return -EINVAL;
+   }
+   return __promisc_mode[flow_type];
+}
+EXPORT_SYMBOL_GPL(mlx4_map_sw_to_hw_steering_mode);
+
 static void trans_rule_ctrl_to_hw(struct mlx4_net_trans_rule *ctrl,
  struct mlx4_net_trans_rule_hw_ctrl *hw)
 {
-   static const u8 __promisc_mode[] = {
-   [MLX4_FS_REGULAR]   = 0x0,
-   [MLX4_FS_ALL_DEFAULT]   = 0x1,
-   [MLX4_FS_MC_DEFAULT]= 0x3,
-   [MLX4_FS_UC_SNIFFER]= 0x4,
-   [MLX4_FS_MC_SNIFFER]= 0x5,
-   };
-
u8 flags = 0;
 
 	flags = ctrl->queue_mode == MLX4_NET_TRANS_Q_LIFO ? 1 : 0;
@@ -678,29 +689,51 @@ const u16 __sw_id_hw[] = {
[MLX4_NET_TRANS_RULE_ID_UDP] = 0xE006
 };
 
+int mlx4_map_sw_to_hw_steering_id(struct mlx4_dev *dev,
+ enum mlx4_net_trans_rule_id id)
+{
+	if (id >= MLX4_NET_TRANS_RULE_NUM || id < 0) {
+		mlx4_err(dev, "Invalid network rule id. id = %d\n", id);
+   return -EINVAL;
+   }
+   return __sw_id_hw[id];
+}
+EXPORT_SYMBOL_GPL(mlx4_map_sw_to_hw_steering_id);
+
+static const int __rule_hw_sz[] = {
+   [MLX4_NET_TRANS_RULE_ID_ETH] =
+   sizeof(struct mlx4_net_trans_rule_hw_eth),
+   [MLX4_NET_TRANS_RULE_ID_IB] =
+   sizeof(struct mlx4_net_trans_rule_hw_ib),
+   [MLX4_NET_TRANS_RULE_ID_IPV6] = 0,
+   [MLX4_NET_TRANS_RULE_ID_IPV4] =
+   sizeof(struct mlx4_net_trans_rule_hw_ipv4),
+   [MLX4_NET_TRANS_RULE_ID_TCP] =
+   sizeof(struct mlx4_net_trans_rule_hw_tcp_udp),
+   [MLX4_NET_TRANS_RULE_ID_UDP] =
+   sizeof(struct mlx4_net_trans_rule_hw_tcp_udp)
+};
+
+int mlx4_hw_rule_sz(struct mlx4_dev *dev,
+  enum mlx4_net_trans_rule_id id)
+{
+	if (id >= MLX4_NET_TRANS_RULE_NUM || id < 0) {
+		mlx4_err(dev, "Invalid network rule id. id = %d\n", id);
+   return -EINVAL;
+   }
+
+   return __rule_hw_sz[id];
+}
+EXPORT_SYMBOL_GPL(mlx4_hw_rule_sz);
+
 static int parse_trans_rule(struct mlx4_dev *dev, struct mlx4_spec_list *spec,
struct _rule_hw *rule_hw)
 {
-   static const size_t __rule_hw_sz[] = {
-   [MLX4_NET_TRANS_RULE_ID_ETH] =
-   sizeof(struct mlx4_net_trans_rule_hw_eth),
-   [MLX4_NET_TRANS_RULE_ID_IB] =
-   sizeof(struct mlx4_net_trans_rule_hw_ib),
-   [MLX4_NET_TRANS_RULE_ID_IPV6] = 0,
-   [MLX4_NET_TRANS_RULE_ID_IPV4] =
-   sizeof(struct mlx4_net_trans_rule_hw_ipv4),
-   [MLX4_NET_TRANS_RULE_ID_TCP] =
-   sizeof(struct mlx4_net_trans_rule_hw_tcp_udp),
-   [MLX4_NET_TRANS_RULE_ID_UDP] =
-   sizeof(struct mlx4_net_trans_rule_hw_tcp_udp)
-   };
-	if (spec->id >= MLX4_NET_TRANS_RULE_NUM) {
-		mlx4_err(dev, "Invalid network rule id. id = %d\n", spec->id);
+	if (mlx4_hw_rule_sz(dev, spec->id) < 0)
 		return -EINVAL;
-	}
-	memset(rule_hw, 0, __rule_hw_sz[spec->id]);
+	memset(rule_hw, 0, mlx4_hw_rule_sz(dev, spec->id));
 	rule_hw->id = cpu_to_be16(__sw_id_hw[spec->id]);
-	rule_hw->size = __rule_hw_sz[spec->id] >> 2;
+	rule_hw->size = mlx4_hw_rule_sz(dev, spec->id) >> 2;
 
 	switch (spec->id) {
case MLX4_NET_TRANS_RULE_ID_ETH:
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 

[PATCH for-next 1/9] net/mlx4_core: Move DMFS HW structs to common header file

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Move flow steering HW structures to be on the public mlx4 include
directory, as a pre-step for the mlx4 IB driver to use them too.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |   79 -
 include/linux/mlx4/device.h   |   79 +
 2 files changed, 79 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index d738454..d5fdb19 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -701,85 +701,6 @@ struct mlx4_steer {
struct list_head steer_entries[MLX4_NUM_STEERS];
 };
 
-struct mlx4_net_trans_rule_hw_ctrl {
-   __be32 ctrl;
-   u8 rsvd1;
-   u8 funcid;
-   u8 vep;
-   u8 port;
-   __be32 qpn;
-   __be32 rsvd2;
-};
-
-struct mlx4_net_trans_rule_hw_ib {
-   u8 size;
-   u8 rsvd1;
-   __be16 id;
-   u32 rsvd2;
-   __be32 qpn;
-   __be32 qpn_mask;
-   u8 dst_gid[16];
-   u8 dst_gid_msk[16];
-} __packed;
-
-struct mlx4_net_trans_rule_hw_eth {
-   u8  size;
-   u8  rsvd;
-   __be16  id;
-   u8  rsvd1[6];
-   u8  dst_mac[6];
-   u16 rsvd2;
-   u8  dst_mac_msk[6];
-   u16 rsvd3;
-   u8  src_mac[6];
-   u16 rsvd4;
-   u8  src_mac_msk[6];
-   u8  rsvd5;
-   u8  ether_type_enable;
-   __be16  ether_type;
-   __be16  vlan_id_msk;
-   __be16  vlan_id;
-} __packed;
-
-struct mlx4_net_trans_rule_hw_tcp_udp {
-   u8  size;
-   u8  rsvd;
-   __be16  id;
-   __be16  rsvd1[3];
-   __be16  dst_port;
-   __be16  rsvd2;
-   __be16  dst_port_msk;
-   __be16  rsvd3;
-   __be16  src_port;
-   __be16  rsvd4;
-   __be16  src_port_msk;
-} __packed;
-
-struct mlx4_net_trans_rule_hw_ipv4 {
-   u8  size;
-   u8  rsvd;
-   __be16  id;
-   __be32  rsvd1;
-   __be32  dst_ip;
-   __be32  dst_ip_msk;
-   __be32  src_ip;
-   __be32  src_ip_msk;
-} __packed;
-
-struct _rule_hw {
-   union {
-   struct {
-   u8 size;
-   u8 rsvd;
-   __be16 id;
-   };
-   struct mlx4_net_trans_rule_hw_eth eth;
-   struct mlx4_net_trans_rule_hw_ib ib;
-   struct mlx4_net_trans_rule_hw_ipv4 ipv4;
-   struct mlx4_net_trans_rule_hw_tcp_udp tcp_udp;
-   };
-};
-
 enum {
 	MLX4_PCI_DEV_IS_VF		= 1 << 0,
 	MLX4_PCI_DEV_FORCE_SENSE_PORT	= 1 << 1,
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 811f91c..9fbf416 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -962,6 +962,85 @@ struct mlx4_net_trans_rule {
u32 qpn;
 };
 
+struct mlx4_net_trans_rule_hw_ctrl {
+   __be32 ctrl;
+   u8 rsvd1;
+   u8 funcid;
+   u8 vep;
+   u8 port;
+   __be32 qpn;
+   __be32 rsvd2;
+};
+
+struct mlx4_net_trans_rule_hw_ib {
+   u8 size;
+   u8 rsvd1;
+   __be16 id;
+   u32 rsvd2;
+   __be32 qpn;
+   __be32 qpn_mask;
+   u8 dst_gid[16];
+   u8 dst_gid_msk[16];
+} __packed;
+
+struct mlx4_net_trans_rule_hw_eth {
+   u8  size;
+   u8  rsvd;
+   __be16  id;
+   u8  rsvd1[6];
+   u8  dst_mac[6];
+   u16 rsvd2;
+   u8  dst_mac_msk[6];
+   u16 rsvd3;
+   u8  src_mac[6];
+   u16 rsvd4;
+   u8  src_mac_msk[6];
+   u8  rsvd5;
+   u8  ether_type_enable;
+   __be16  ether_type;
+   __be16  vlan_id_msk;
+   __be16  vlan_id;
+} __packed;
+
+struct mlx4_net_trans_rule_hw_tcp_udp {
+   u8  size;
+   u8  rsvd;
+   __be16  id;
+   __be16  rsvd1[3];
+   __be16  dst_port;
+   __be16  rsvd2;
+   __be16  dst_port_msk;
+   __be16  rsvd3;
+   __be16  src_port;
+   __be16  rsvd4;
+   __be16  src_port_msk;
+} __packed;
+
+struct mlx4_net_trans_rule_hw_ipv4 {
+   u8  size;
+   u8  rsvd;
+   __be16  id;
+   __be32  rsvd1;
+   __be32  dst_ip;
+   __be32  dst_ip_msk;
+   __be32  src_ip;
+   __be32  src_ip_msk;
+} __packed;
+
+struct _rule_hw {
+   union {
+   struct {
+   u8 size;
+   u8 rsvd;
+   __be16 id;
+   };
+   struct mlx4_net_trans_rule_hw_eth eth;
+   struct mlx4_net_trans_rule_hw_ib ib;
+   struct mlx4_net_trans_rule_hw_ipv4 ipv4;
+   struct mlx4_net_trans_rule_hw_tcp_udp tcp_udp;
+   };
+};
+
 int mlx4_flow_steer_promisc_add(struct mlx4_dev *dev, u8 port, u32 qpn,
   

[PATCH for-next 9/9] IB/mlx4: Add receive Flow Steering support

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Implement the ib_create_flow and ib_destroy_flow verbs.

Translate the verbs structures provided by the user to HW structures
and call the MLX4_QP_FLOW_STEERING_ATTACH/DETACH firmware commands.

On ATTACH command completion, the firmware provides a 64-bit registration
ID, which is returned to the caller within struct ib_flow and used
later for detaching that flow.
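
For context, the expected pairing at the verbs level, as a sketch based on
the ib_create_flow/ib_destroy_flow API introduced earlier in this series
(error handling trimmed):

	struct ib_flow *flow;

	flow = ib_create_flow(qp, flow_attr, domain);	/* fires ATTACH; reg ID kept behind *flow */
	if (IS_ERR(flow))
		return PTR_ERR(flow);
	...
	ib_destroy_flow(flow);				/* fires DETACH using the stored reg ID */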

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c |  247 +
 1 files changed, 247 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 23d7343..e72584f 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -54,6 +54,8 @@
 #define DRV_VERSION	"1.0"
 #define DRV_RELDATE	"April 4, 2008"
 
+#define MLX4_IB_FLOW_MAX_PRIO 0xFFF
+
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("Mellanox ConnectX HCA InfiniBand driver");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -88,6 +90,25 @@ static void init_query_mad(struct ib_smp *mad)
 
 static union ib_gid zgid;
 
+static int check_flow_steering_support(struct mlx4_dev *dev)
+{
+   int ib_num_ports = 0;
+   int i;
+
+   mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
+   ib_num_ports++;
+
+	if (dev->caps.steering_mode == MLX4_STEERING_MODE_DEVICE_MANAGED) {
+		if (ib_num_ports || mlx4_is_mfunc(dev)) {
+			pr_warn("Device managed flow steering is unavailable "
+				"for IB ports or in multifunction env.\n");
+   return 0;
+   }
+   return 1;
+   }
+   return 0;
+}
+
 static int mlx4_ib_query_device(struct ib_device *ibdev,
struct ib_device_attr *props)
 {
@@ -144,6 +165,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 		props->device_cap_flags |= IB_DEVICE_MEM_WINDOW_TYPE_2B;
 	else
 		props->device_cap_flags |= IB_DEVICE_MEM_WINDOW_TYPE_2A;
+	if (check_flow_steering_support(dev->dev))
+		props->device_cap_flags |= IB_DEVICE_MANAGED_FLOW_STEERING;
}
 
 	props->vendor_id	   = be32_to_cpup((__be32 *) (out_mad->data + 36)) &
 		0xffffff;
@@ -798,6 +821,221 @@ struct mlx4_ib_steering {
union ib_gid gid;
 };
 
+static int parse_flow_attr(struct mlx4_dev *dev,
+  struct _ib_flow_spec *ib_spec,
+  struct _rule_hw *mlx4_spec)
+{
+   enum mlx4_net_trans_rule_id type;
+
+	switch (ib_spec->type) {
+   case IB_FLOW_SPEC_ETH:
+   type = MLX4_NET_TRANS_RULE_ID_ETH;
+		memcpy(mlx4_spec->eth.dst_mac, ib_spec->eth.val.dst_mac,
+		       ETH_ALEN);
+		memcpy(mlx4_spec->eth.dst_mac_msk, ib_spec->eth.mask.dst_mac,
+		       ETH_ALEN);
+		mlx4_spec->eth.vlan_tag = ib_spec->eth.val.vlan_tag;
+		mlx4_spec->eth.vlan_tag_msk = ib_spec->eth.mask.vlan_tag;
+		break;
+
+	case IB_FLOW_SPEC_IB:
+		type = MLX4_NET_TRANS_RULE_ID_IB;
+		mlx4_spec->ib.l3_qpn = ib_spec->ib.val.l3_type_qpn;
+		mlx4_spec->ib.qpn_mask = ib_spec->ib.mask.l3_type_qpn;
+		memcpy(mlx4_spec->ib.dst_gid, ib_spec->ib.val.dst_gid, 16);
+		memcpy(mlx4_spec->ib.dst_gid_msk,
+		       ib_spec->ib.mask.dst_gid, 16);
+   break;
+
+   case IB_FLOW_SPEC_IPV4:
+   type = MLX4_NET_TRANS_RULE_ID_IPV4;
+		mlx4_spec->ipv4.src_ip = ib_spec->ipv4.val.src_ip;
+		mlx4_spec->ipv4.src_ip_msk = ib_spec->ipv4.mask.src_ip;
+		mlx4_spec->ipv4.dst_ip = ib_spec->ipv4.val.dst_ip;
+		mlx4_spec->ipv4.dst_ip_msk = ib_spec->ipv4.mask.dst_ip;
+		break;
+
+	case IB_FLOW_SPEC_TCP:
+	case IB_FLOW_SPEC_UDP:
+		type = ib_spec->type == IB_FLOW_SPEC_TCP ?
+			MLX4_NET_TRANS_RULE_ID_TCP :
+			MLX4_NET_TRANS_RULE_ID_UDP;
+		mlx4_spec->tcp_udp.dst_port = ib_spec->tcp_udp.val.dst_port;
+		mlx4_spec->tcp_udp.dst_port_msk = ib_spec->tcp_udp.mask.dst_port;
+		mlx4_spec->tcp_udp.src_port = ib_spec->tcp_udp.val.src_port;
+		mlx4_spec->tcp_udp.src_port_msk = ib_spec->tcp_udp.mask.src_port;
+   break;
+
+   default:
+   return -EINVAL;
+   }
+	if (mlx4_map_sw_to_hw_steering_id(dev, type) < 0 ||
+	    mlx4_hw_rule_sz(dev, type) < 0)
+		return -EINVAL;
+	mlx4_spec->id = cpu_to_be16(mlx4_map_sw_to_hw_steering_id(dev, type));
+	mlx4_spec->size = mlx4_hw_rule_sz(dev, type) >> 2;
+	return mlx4_hw_rule_sz(dev, type);
+}
+
+static int __mlx4_ib_create_flow(struct ib_qp *qp, struct 

[PATCH for-next 6/9] IB/core: Add receive Flow Steering support

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

The RDMA stack allows for applications to create IB_QPT_RAW_PACKET QPs,
for which plain Ethernet packets are used, specifically packets which
don't carry any QPN to be matched by the receiving side.

Applications using these QPs must be provided with a method to
program some steering rule with the HW so packets arriving at
the local port can be routed to them.

This patch adds ib_create_flow, which allows providing a flow specification
for a QP, such that when there's a match between the specification and a
received packet, the packet can be forwarded to that QP, in a manner similar
to how ib_attach_multicast is used for IB UD multicast handling.

Flow specifications are provided as instances of struct ib_flow_spec_yyy,
which describe L2, L3 and L4 headers; currently, specs for Ethernet, IPv4,
TCP, UDP and IB are defined. Flow specs are made of values and masks.

The input to ib_create_flow is an instance of struct ib_flow_attr, which
contains a few mandatory control elements and optional flow specs.

struct ib_flow_attr {
enum ib_flow_attr_type type;
u16  size;
u16  priority;
u8   num_of_specs;
u8   port;
u32  flags;
/* Following are the optional layers according to user request
 * struct ib_flow_spec_yyy
 * struct ib_flow_spec_zzz
 */
};

As these specs are eventually coming from user space, they are defined and
used in a way which allows adding new spec types without kernel/user ABI
change, and with a little API enhancement which defines the newly added spec.

The flow spec structures are defined in a TLV (Type-Length-Value) manner,
which allows calling ib_create_flow with a variable-length list of
optional specs.

For the actual processing of ib_flow_attr, the driver uses the num_of_specs
and size mandatory fields, along with the TLV nature of the specs.
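
As an illustration, a sketch of how a caller could lay out one such TLV
message in a single buffer (the field values are made up; the member names
follow the definitions added by this patch):

	struct {
		struct ib_flow_attr	attr;
		struct ib_flow_spec_eth	eth;
	} __attribute__((packed)) flow = {
		.attr = {
			.type		= IB_FLOW_ATTR_NORMAL,
			.size		= sizeof(flow),
			.num_of_specs	= 1,
			.port		= 1,
		},
		.eth = {
			.type		= IB_FLOW_SPEC_ETH,
			.size		= sizeof(struct ib_flow_spec_eth),
			.val.dst_mac	= { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 },
			.mask.dst_mac	= { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
		},
	};
	struct ib_flow *flow_id = ib_create_flow(qp, &flow.attr, domain);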

Steering rules processing order is according to rules priority. The user
sets the 12 low-order bits from the priority field and the remaining
4 high-order bits are set by the kernel according to a domain the
application or the layer that created the rule belongs to. Lower
priority numerical value means higher priority.

The value returned from ib_create_flow is an instance of struct ib_flow,
which contains a database pointer (handle) provided by the HW driver
to be used when calling ib_destroy_flow.

Applications that offload TCP/IP traffic could also be written over IB UD QPs.
As such, the ib_create_flow / ib_destroy_flow API is designed to support UD QPs
too; the HW driver sets IB_DEVICE_MANAGED_FLOW_STEERING to denote support
for flow steering.

The ib_flow_attr enum type relates to usage of flow steering for promiscuous
and sniffer purposes:

IB_FLOW_ATTR_NORMAL - regular rule, steering according to rule specification

IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
all Ethernet traffic which isn't steered to any QP

IB_FLOW_ATTR_MC_DEFAULT - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast

IB_FLOW_ATTR_SNIFFER - sniffer rule, receive all port traffic

ALL_DEFAULT and MC_DEFAULT rules options are valid only for Ethernet link type.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/core/verbs.c |   30 +
 include/rdma/ib_verbs.h |  136 ++-
 2 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 22192de..932f4a7 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1254,3 +1254,33 @@ int ib_dealloc_xrcd(struct ib_xrcd *xrcd)
 	return xrcd->device->dealloc_xrcd(xrcd);
 }
 EXPORT_SYMBOL(ib_dealloc_xrcd);
+
+struct ib_flow *ib_create_flow(struct ib_qp *qp,
+  struct ib_flow_attr *flow_attr,
+  int domain)
+{
+   struct ib_flow *flow_id;
+
+	if (!qp->device->create_flow)
+		return ERR_PTR(-ENOSYS);
+
+	flow_id = qp->device->create_flow(qp, flow_attr, domain);
+	if (!IS_ERR(flow_id))
+		atomic_inc(&qp->usecnt);
+   return flow_id;
+}
+EXPORT_SYMBOL(ib_create_flow);
+
+int ib_destroy_flow(struct ib_flow *flow_id)
+{
+   int err;
+	struct ib_qp *qp = flow_id->qp;
+
+	if (!flow_id->qp->device->destroy_flow)
+		return -ENOSYS;
+
+	err = qp->device->destroy_flow(flow_id);
+	if (!err)
+		atomic_dec(&qp->usecnt);
+   return err;
+}
+EXPORT_SYMBOL(ib_destroy_flow);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..6f76d62 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -116,7 +116,8 @@ enum ib_device_cap_flags {
 	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<21),
 	IB_DEVICE_BLOCK_MULTICAST_LOOPBACK = (1<<22),
 	IB_DEVICE_MEM_WINDOW_TYPE_2A	= (1<<23),
-   

[PATCH for-next 2/9] net/mlx4: Match DMFS promiscuous field names to firmware spec

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Align the names used by enum mlx4_net_trans_promisc_mode with the actual
firmware specification. The patch doesn't introduce any functional change
or API change towards the firmware.

Remove MLX4_FS_PROMISC_FUNCTION_PORT, which isn't in use. Add new enums
MLX4_FS_{UC/MC}_SNIFFER as a preparation step for sniffer support.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   16 ++++++++--------
 drivers/net/ethernet/mellanox/mlx4/mcg.c|   21 ++---
 include/linux/mlx4/device.h |   11 ++-
 4 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 00f25b5..2047684 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -889,7 +889,7 @@ static int mlx4_en_flow_replace(struct net_device *dev,
.queue_mode = MLX4_NET_TRANS_Q_FIFO,
.exclusive = 0,
.allow_loopback = 1,
-   .promisc_mode = MLX4_FS_PROMISC_NONE,
+   .promisc_mode = MLX4_FS_REGULAR,
};
 
rule.port = priv-port;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 30d78f8..0860130 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -127,7 +127,7 @@ static void mlx4_en_filter_work(struct work_struct *work)
.queue_mode = MLX4_NET_TRANS_Q_LIFO,
.exclusive = 1,
.allow_loopback = 1,
-   .promisc_mode = MLX4_FS_PROMISC_NONE,
+   .promisc_mode = MLX4_FS_REGULAR,
.port = priv-port,
.priority = MLX4_DOMAIN_RFS,
};
@@ -446,7 +446,7 @@ static int mlx4_en_uc_steer_add(struct mlx4_en_priv *priv,
.queue_mode = MLX4_NET_TRANS_Q_FIFO,
.exclusive = 0,
.allow_loopback = 1,
-   .promisc_mode = MLX4_FS_PROMISC_NONE,
+   .promisc_mode = MLX4_FS_REGULAR,
.priority = MLX4_DOMAIN_NIC,
};
 
@@ -793,7 +793,7 @@ static void mlx4_en_set_promisc_mode(struct mlx4_en_priv *priv,
 			err = mlx4_flow_steer_promisc_add(mdev->dev,
 							  priv->port,
 							  priv->base_qpn,
-							  MLX4_FS_PROMISC_UPLINK);
+							  MLX4_FS_ALL_DEFAULT);
 			if (err)
 				en_err(priv, "Failed enabling promiscuous mode\n");
 			priv->flags |= MLX4_EN_FLAG_MC_PROMISC;
@@ -856,7 +856,7 @@ static void mlx4_en_clear_promisc_mode(struct mlx4_en_priv *priv,
case MLX4_STEERING_MODE_DEVICE_MANAGED:
 		err = mlx4_flow_steer_promisc_remove(mdev->dev,
 						     priv->port,
-						     MLX4_FS_PROMISC_UPLINK);
+						     MLX4_FS_ALL_DEFAULT);
 		if (err)
 			en_err(priv, "Failed disabling promiscuous mode\n");
 		priv->flags &= ~MLX4_EN_FLAG_MC_PROMISC;
@@ -917,7 +917,7 @@ static void mlx4_en_do_multicast(struct mlx4_en_priv *priv,
 			err = mlx4_flow_steer_promisc_add(mdev->dev,
 							  priv->port,
 							  priv->base_qpn,
-							  MLX4_FS_PROMISC_ALL_MULTI);
+							  MLX4_FS_MC_DEFAULT);
break;
 
case MLX4_STEERING_MODE_B0:
@@ -940,7 +940,7 @@ static void mlx4_en_do_multicast(struct mlx4_en_priv *priv,
case MLX4_STEERING_MODE_DEVICE_MANAGED:
 			err = mlx4_flow_steer_promisc_remove(mdev->dev,
 							     priv->port,
-							     MLX4_FS_PROMISC_ALL_MULTI);
+							     MLX4_FS_MC_DEFAULT);
break;
 
case MLX4_STEERING_MODE_B0:
@@ -1598,10 +1598,10 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
 					 MLX4_EN_FLAG_MC_PROMISC);

[PATCH for-next 7/9] IB/core: Infrastructure to support verbs extensions through uverbs

2013-04-24 Thread Or Gerlitz
From: Igor Ivanov <igor.iva...@itseez.com>

Add infrastructure to support extended uverbs capabilities in a
forward/backward compatible manner. Uverbs command opcodes which are based
on the verbs extensions approach should be greater than or equal to
IB_USER_VERBS_CMD_THRESHOLD. They have a new header format and are
processed a bit differently.
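
For illustration, a sketch of how userspace would frame such an extended
command (the opcode and the exact word accounting are assumptions,
following the new header definition below):

	struct ib_uverbs_cmd_hdr_ex hdr_ex = {
		.command	   = cmd,	/* >= IB_USER_VERBS_CMD_THRESHOLD */
		.in_words	   = in_len / 4,
		.out_words	   = out_len / 4,
		.provider_in_words = 0,		/* no vendor-specific payload */
	};

	memcpy(buf, &hdr_ex, sizeof(hdr_ex));
	/* the command body follows the header in the same write() buffer */
	if (write(uverbs_fd, buf, in_len) != in_len)
		return -errno;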

Signed-off-by: Igor Ivanov <igor.iva...@itseez.com>
Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/core/uverbs_main.c |   29 -
 include/uapi/rdma/ib_user_verbs.h |   10 ++
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 2c6f0f2..e4e7b24 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -583,9 +583,6 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (copy_from_user(&hdr, buf, sizeof hdr))
return -EFAULT;
 
-   if (hdr.in_words * 4 != count)
-   return -EINVAL;
-
 	if (hdr.command >= ARRAY_SIZE(uverbs_cmd_table) ||
!uverbs_cmd_table[hdr.command])
return -EINVAL;
@@ -597,8 +594,30 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command)))
return -ENOSYS;
 
-	return uverbs_cmd_table[hdr.command](file, buf + sizeof hdr,
-					     hdr.in_words * 4, hdr.out_words * 4);
+	if (hdr.command >= IB_USER_VERBS_CMD_THRESHOLD) {
+		struct ib_uverbs_cmd_hdr_ex hdr_ex;
+
+		if (copy_from_user(&hdr_ex, buf, sizeof(hdr_ex)))
+			return -EFAULT;
+
+		if (((hdr_ex.in_words + hdr_ex.provider_in_words) * 4) != count)
+			return -EINVAL;
+
+		return uverbs_cmd_table[hdr.command](file,
+						     buf + sizeof(hdr_ex),
+						     (hdr_ex.in_words +
+						      hdr_ex.provider_in_words) * 4,
+						     (hdr_ex.out_words +
+						      hdr_ex.provider_out_words) * 4);
+   } else {
+   if (hdr.in_words * 4 != count)
+   return -EINVAL;
+
+   return uverbs_cmd_table[hdr.command](file,
+buf + sizeof(hdr),
+hdr.in_words * 4,
+hdr.out_words * 4);
+   }
 }
 
 static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 805711e..61535aa 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -43,6 +43,7 @@
  * compatibility are made.
  */
 #define IB_USER_VERBS_ABI_VERSION  6
+#define IB_USER_VERBS_CMD_THRESHOLD	50
 
 enum {
IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -123,6 +124,15 @@ struct ib_uverbs_cmd_hdr {
__u16 out_words;
 };
 
+struct ib_uverbs_cmd_hdr_ex {
+   __u32 command;
+   __u16 in_words;
+   __u16 out_words;
+   __u16 provider_in_words;
+   __u16 provider_out_words;
+   __u32 cmd_hdr_reserved;
+};
+
 struct ib_uverbs_get_context {
__u64 response;
__u64 driver_data[0];
-- 
1.7.1

Cc: Tzahi Oved <tza...@mellanox.com>
Cc: Sean Hefty <sean.he...@intel.com>
Cc: Yishai Hadas <yish...@mellanox.com>


[PATCH for-next 8/9] IB/core: Export ib_create/destroy_flow through uverbs

2013-04-24 Thread Or Gerlitz
From: Hadar Hen Zion <had...@mellanox.com>

Implement ib_uverbs_create_flow and ib_uverbs_destroy_flow to
support flow steering for user space applications.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/core/uverbs.h  |3 +
 drivers/infiniband/core/uverbs_cmd.c  |  209 +
 drivers/infiniband/core/uverbs_main.c |   13 ++-
 include/rdma/ib_verbs.h   |1 +
 include/uapi/rdma/ib_user_verbs.h |  108 +-
 5 files changed, 332 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 0fcd7aa..ad9d102 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -155,6 +155,7 @@ extern struct idr ib_uverbs_cq_idr;
 extern struct idr ib_uverbs_qp_idr;
 extern struct idr ib_uverbs_srq_idr;
 extern struct idr ib_uverbs_xrcd_idr;
+extern struct idr ib_uverbs_rule_idr;
 
 void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj);
 
@@ -215,5 +216,7 @@ IB_UVERBS_DECLARE_CMD(destroy_srq);
 IB_UVERBS_DECLARE_CMD(create_xsrq);
 IB_UVERBS_DECLARE_CMD(open_xrcd);
 IB_UVERBS_DECLARE_CMD(close_xrcd);
+IB_UVERBS_DECLARE_CMD(create_flow);
+IB_UVERBS_DECLARE_CMD(destroy_flow);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a7d00f6..29c340e 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -54,6 +54,7 @@ static struct uverbs_lock_class qp_lock_class = { .name = "QP-uobj" };
 static struct uverbs_lock_class ah_lock_class  = { .name = "AH-uobj" };
 static struct uverbs_lock_class srq_lock_class = { .name = "SRQ-uobj" };
 static struct uverbs_lock_class xrcd_lock_class = { .name = "XRCD-uobj" };
+static struct uverbs_lock_class rule_lock_class = { .name = "RULE-uobj" };
 
 #define INIT_UDATA(udata, ibuf, obuf, ilen, olen)  \
do {\
@@ -330,6 +331,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
INIT_LIST_HEAD(ucontext-srq_list);
INIT_LIST_HEAD(ucontext-ah_list);
INIT_LIST_HEAD(ucontext-xrcd_list);
+   INIT_LIST_HEAD(ucontext-rule_list);
ucontext-closing = 0;
 
resp.num_comp_vectors = file-device-num_comp_vectors;
@@ -2587,6 +2589,213 @@ out_put:
return ret ? ret : in_len;
 }
 
+static int kern_spec_to_ib_spec(struct ib_kern_spec *kern_spec,
+   struct _ib_flow_spec *ib_spec)
+{
+	ib_spec->type = kern_spec->type;
+
+	switch (ib_spec->type) {
+	case IB_FLOW_SPEC_ETH:
+		ib_spec->eth.size = sizeof(struct ib_flow_spec_eth);
+		memcpy(&ib_spec->eth.val, &kern_spec->eth.val,
+		       sizeof(struct ib_flow_eth_filter));
+		memcpy(&ib_spec->eth.mask, &kern_spec->eth.mask,
+		       sizeof(struct ib_flow_eth_filter));
+		break;
+	case IB_FLOW_SPEC_IB:
+		ib_spec->ib.size = sizeof(struct ib_flow_spec_ib);
+		memcpy(&ib_spec->ib.val, &kern_spec->ib.val,
+		       sizeof(struct ib_flow_ib_filter));
+		memcpy(&ib_spec->ib.mask, &kern_spec->ib.mask,
+		       sizeof(struct ib_flow_ib_filter));
+		break;
+	case IB_FLOW_SPEC_IPV4:
+		ib_spec->ipv4.size = sizeof(struct ib_flow_spec_ipv4);
+		memcpy(&ib_spec->ipv4.val, &kern_spec->ipv4.val,
+		       sizeof(struct ib_flow_ipv4_filter));
+		memcpy(&ib_spec->ipv4.mask, &kern_spec->ipv4.mask,
+		       sizeof(struct ib_flow_ipv4_filter));
+		break;
+	case IB_FLOW_SPEC_TCP:
+	case IB_FLOW_SPEC_UDP:
+		ib_spec->tcp_udp.size = sizeof(struct ib_flow_spec_tcp_udp);
+		memcpy(&ib_spec->tcp_udp.val, &kern_spec->tcp_udp.val,
+		       sizeof(struct ib_flow_tcp_udp_filter));
+		memcpy(&ib_spec->tcp_udp.mask, &kern_spec->tcp_udp.mask,
+		       sizeof(struct ib_flow_tcp_udp_filter));
+		break;
+   default:
+   return -EINVAL;
+   }
+   return 0;
+}
+
+ssize_t ib_uverbs_create_flow(struct ib_uverbs_file *file,
+ const char __user *buf, int in_len,
+ int out_len)
+{
+   struct ib_uverbs_create_flow  cmd;
+   struct ib_uverbs_create_flow_resp resp;
+   struct ib_uobject *uobj;
+   struct ib_flow*flow_id;
+   struct ib_kern_flow_attr  *kern_flow_attr;
+   struct ib_flow_attr   *flow_attr;
+   struct ib_qp  *qp;
+   int err = 0;
+   void *kern_spec;
+   void *ib_spec;
+   int i;
+
+	if (out_len < sizeof(resp))
+   return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf, 

Re: NFS over RDMA benchmark

2013-04-24 Thread J. Bruce Fields
On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
 
 
  -Original Message-
  From: J. Bruce Fields [mailto:bfie...@fieldses.org]
  Sent: Wednesday, April 24, 2013 00:06
  To: Yan Burman
  Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
  linux-...@vger.kernel.org; Or Gerlitz
  Subject: Re: NFS over RDMA benchmark
  
  On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
  
  
-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Wednesday, April 17, 2013 21:06
To: Atchley, Scott
Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: NFS over RDMA benchmark
   
 On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchle...@ornl.gov> wrote:
  On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:

  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I
 seem to
only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of
 memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing
 storage on
the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
 the
same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

 Yan,

 Are you trying to optimize single client performance or server
 performance
with multiple clients?

  
   I am trying to get maximum performance from a single server - I used 2
  processes in fio test - more than 2 did not show any performance boost.
   I tried running fio from 2 different PCs on 2 different files, but the 
   sum of
  the two is more or less the same as running from single client PC.
  
   What I did see is that server is sweating a lot more than the clients and
  more than that, it has 1 core (CPU5) in 100% softirq tasklet:
   cat /proc/softirqs
  
  Would any profiling help figure out which code it's spending time in?
  (E.g. something simple as perf top might have useful output.)
  
 
 
 Perf top for the CPU with high tasklet count gives:
 
  samples  pcnt RIPfunctionDSO
  ___ _  ___ 
 ___
 
  2787.00 24.1% 81062a00 mutex_spin_on_owner 
 /root/vmlinux

I guess that means lots of contention on some mutex?  If only we knew
which one... perf should also be able to collect stack statistics, I
forget how.

--b.


Re: NFS over RDMA benchmark

2013-04-24 Thread J. Bruce Fields
On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
  
  
   -Original Message-
   From: J. Bruce Fields [mailto:bfie...@fieldses.org]
   Sent: Wednesday, April 24, 2013 00:06
   To: Yan Burman
   Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
   linux-...@vger.kernel.org; Or Gerlitz
   Subject: Re: NFS over RDMA benchmark
   
   On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
   
   
 -Original Message-
 From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
 Sent: Wednesday, April 17, 2013 21:06
 To: Atchley, Scott
 Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
 linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
 Subject: Re: NFS over RDMA benchmark

 On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchle...@ornl.gov> wrote:
  On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:

  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
  Hi.
 
  I've been trying to do some benchmarks for NFS over RDMA and I
  seem to
 only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers each with 16 cores, 32Gb of
  memory, and
 Mellanox ConnectX3 QDR card over PCI-e gen3.
  These servers are connected to a QDR IB switch. The backing
  storage on
 the server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
  4-512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
  the
 same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 
  Yan,
 
  Are you trying to optimize single client performance or server
  performance
 with multiple clients?
 
   
I am trying to get maximum performance from a single server - I used 2
   processes in fio test - more than 2 did not show any performance boost.
I tried running fio from 2 different PCs on 2 different files, but the 
sum of
   the two is more or less the same as running from single client PC.
   
What I did see is that server is sweating a lot more than the clients 
and
   more than that, it has 1 core (CPU5) in 100% softirq tasklet:
cat /proc/softirqs
   
   Would any profiling help figure out which code it's spending time in?
   (E.g. something simple as perf top might have useful output.)
   
  
  
  Perf top for the CPU with high tasklet count gives:
  
   samples  pcnt RIPfunction
  DSO
   ___ _  ___ 
  ___
  
   2787.00 24.1% 81062a00 mutex_spin_on_owner 
  /root/vmlinux
 
 I guess that means lots of contention on some mutex?  If only we knew
 which one... perf should also be able to collect stack statistics, I
 forget how.

Googling around ... I think we want:

perf record -a --call-graph
(give it a chance to collect some samples, then ^C)
perf report --call-graph --stdio

--b.

 

Re: Infiniband use of get_user_pages()

2013-04-24 Thread Christoph Lameter
On Wed, 24 Apr 2013, Jan Kara wrote:

   Hello,

   when checking users of get_user_pages() (I'm doing some cleanups in that
 area to fix filesystem's issues with mmap_sem locking) I've noticed that
 infiniband drivers add the number of pages obtained from get_user_pages() to
 the mm->pinned_vm counter. Although this makes some sense, it doesn't match
 any other user of get_user_pages() (e.g. direct IO), so does infiniband have
 some special reason why it does so?

get_user_pages typically is used to temporarily increase the refcount. The
Infiniband layer needs to permanently pin the pages for memory
registration.


Re: NFS over RDMA benchmark

2013-04-24 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfie...@fieldses.org> wrote:
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
 
 
 
  Perf top for the CPU with high tasklet count gives:
 
   samples  pcnt RIPfunction
  DSO
   ___ _  ___ 
  ___
 
   2787.00 24.1% 81062a00 mutex_spin_on_owner 
  /root/vmlinux

  I guess that means lots of contention on some mutex?  If only we knew
  which one... perf should also be able to collect stack statistics, I
  forget how.

 Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio


I have not looked at the NFS RDMA (and 3.x kernel) source yet. But see
that rb_prev up in the #7 spot? Do we have a red-black tree somewhere
in the paths? Trees like that require extensive locking.

-- Wendy

Re: linux-next: manual merge of the net-next tree with the infiniband tree

2013-04-24 Thread Thadeu Lima de Souza Cascardo
On Thu, Apr 18, 2013 at 01:18:43PM +1000, Stephen Rothwell wrote:
 Hi all,
 
 Today's linux-next merge of the net-next tree got a conflict in
 drivers/infiniband/hw/cxgb4/qp.c between commit 5b0c275926b8
 ("RDMA/cxgb4: Fix SQ allocation when on-chip SQ is disabled") from the
 infiniband tree and commit 9919d5bd01b9 ("RDMA/cxgb4: Fix onchip queue
 support for T5") from the net-next tree.
 
 I think that they are 2 different fixes for the same problem, so I just
 used the net-next version and can carry the fix as necessary (no action
 is required).
 
 -- 
 Cheers,
 Stephen Rothwell <s...@canb.auug.org.au>


Commit 5b0c275926b8 also keeps the intention of the original patch which
broke it, which was to return an error code in case the allocation fails.
The fix in commit 9919d5bd01b9 will return 0 in case the allocation fails.

We should keep the other fix, or fix the code again to return the proper
error code.

Regards.
Cascardo.



Re: NFS over RDMA benchmark

2013-04-24 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:
 On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfie...@fieldses.org> wrote:
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
 
 
 
  Perf top for the CPU with high tasklet count gives:
 
   samples  pcnt RIPfunction
  DSO
   ___ _  ___ 
  ___
 
   2787.00 24.1% 81062a00 mutex_spin_on_owner 
  /root/vmlinux

  I guess that means lots of contention on some mutex?  If only we knew
  which one... perf should also be able to collect stack statistics, I
  forget how.

 Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio


 I have not looked at the NFS RDMA (and 3.x kernel) source yet. But see
 that rb_prev up in the #7 spot? Do we have a red-black tree somewhere
 in the paths? Trees like that require extensive locking.


So I did a quick read of the sunrpc/xprtrdma source (based on the OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries, which is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help
with the bandwidth number if we pump it up, say to 64 instead? Not sure
whether the FMR pool size needs to be adjusted accordingly though.

In short, if anyone has benchmark setup handy, bumping up the slot
table size as the following might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h	2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h	2013-04-24 10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
 #define RPCRDMA_MAX_SLOT_TABLE (256U)

 #define RPCRDMA_DEF_INLINE  (1024) /* default inline max */

-- Wendy


Re: NFS over RDMA benchmark

2013-04-24 Thread Tom Talpey

On 4/24/2013 2:04 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:

On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfie...@fieldses.org> wrote:

On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:

On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:




Perf top for the CPU with high tasklet count gives:

  samples  pcnt RIPfunctionDSO
  ___ _  ___ 
___

  2787.00 24.1% 81062a00 mutex_spin_on_owner 
/root/vmlinux


I guess that means lots of contention on some mutex?  If only we knew
which one... perf should also be able to collect stack statistics, I
forget how.


Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio



I have not looked at the NFS RDMA (and 3.x kernel) source yet. But see
that rb_prev up in the #7 spot? Do we have a red-black tree somewhere
in the paths? Trees like that require extensive locking.



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.
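
(For reference, assuming the standard sysctl knob exposed by xprtrdma,
that would be something along the lines of:

	echo 64 > /proc/sys/sunrpc/rdma_slot_table_entries

followed by a fresh mount of the share.)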

2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.









Re: Infiniband use of get_user_pages()

2013-04-24 Thread Roland Dreier
On Wed, Apr 24, 2013 at 8:38 AM, Jan Kara <j...@suse.cz> wrote:
   when checking users of get_user_pages() (I'm doing some cleanups in that
 area to fix filesystem's issues with mmap_sem locking) I've noticed that
 infiniband drivers add the number of pages obtained from get_user_pages() to
 the mm->pinned_vm counter. Although this makes some sense, it doesn't match
 any other user of get_user_pages() (e.g. direct IO), so does infiniband have
 some special reason why it does so?

Direct IO mappings are in some sense ephemeral -- they only need to
last while the IO is in flight.  In contrast the IB memory pinning is
controlled by (possibly unprivileged) userspace and might last the
whole lifetime of a long-lived application.  So we want some
accounting and resource control.
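
A simplified sketch of the pattern in question (close to what ib_umem_get()
does, though the exact flow in the driver differs):

	down_write(&current->mm->mmap_sem);	/* exclusive: we update the counter */

	locked = npages + current->mm->pinned_vm;
	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
		ret = -ENOMEM;
		goto out;
	}

	ret = get_user_pages(current, current->mm, cur_base, npages,
			     1, !umem->writable, page_list, vma_list);
	if (ret >= 0)
		current->mm->pinned_vm = locked;
out:
	up_write(&current->mm->mmap_sem);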

   Also that seems to be the only real reason why mmap_sem has to be grabbed
 in exclusive mode, am I right?

Most likely that is true.

   Another suspicious thing (at least in drivers/infiniband/core/umem.c:
 ib_umem_get()) is that the arguments of get_user_pages() are like:
	ret = get_user_pages(current, current->mm, cur_base,
			     min_t(unsigned long, npages,
				   PAGE_SIZE / sizeof (struct page *)),
			     1, !umem->writable, page_list, vma_list);
 So we always have the write argument set to 1 and the force argument set to
 !umem->writable. Is that really intentional? My naive guess would be that
 the arguments should be switched... Although even in that case I fail to see
 why the 'force' argument should be set. Can someone please explain?

This confused even me recently.  We had a long discussion (read the
whole thread starting here: https://lkml.org/lkml/2012/1/26/7) but in
short the current parameters seem to be needed to trigger COW even
when the kernel/hardware want to read the memory, to avoid problems
where we get stale data if userspace triggers COW.

I think I better add a comment explaining this.

   Finally (and here I may show my ignorance ;), I'd like to ask whether
 there's any reason why ib_umem_get() checks for is_vm_hugetlb_page() and
 not just whether a page is a huge page?

I'm not sure of the history here.  How would one check directly if a
page is a huge page?  get_user_pages() actually goes to some trouble
to return all small pages, even when it has to split a single huge
page into many entries in the page array.  (Which is actually a bit
unfortunate for our use here)

 - R.