RE: NFS over RDMA benchmark
-----Original Message-----
From: J. Bruce Fields [mailto:bfie...@fieldses.org]
Sent: Wednesday, April 24, 2013 00:06
To: Yan Burman
Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:

> -----Original Message-----
> From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
> Sent: Wednesday, April 17, 2013 21:06
> To: Atchley, Scott
> Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
> Subject: Re: NFS over RDMA benchmark
>
> On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchle...@ornl.gov> wrote:
> > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:
> > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
> > > > Hi.
> > > >
> > > > I've been trying to do some benchmarks for NFS over RDMA and I seem
> > > > to only get about half of the bandwidth that the HW can give me.
> > > > My setup consists of 2 servers, each with 16 cores, 32GB of memory,
> > > > and a Mellanox ConnectX-3 QDR card over PCIe gen3. These servers are
> > > > connected to a QDR IB switch. The backing storage on the server is
> > > > tmpfs mounted with noatime. I am running kernel 3.5.7.
> > > >
> > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
> > > > When I run fio over rdma-mounted nfs, I get 260-2200 MB/sec for the
> > > > same block sizes (4-512K). Running over IPoIB-CM, I get 200-980 MB/sec.
> >
> > Yan, are you trying to optimize single client performance or server
> > performance with multiple clients?
>
> I am trying to get maximum performance from a single server - I used 2
> processes in the fio test - more than 2 did not show any performance boost.
> I tried running fio from 2 different PCs on 2 different files, but the sum
> of the two is more or less the same as running from a single client PC.
What I did see is that the server is sweating a lot more than the clients and, more than that, it has 1 core (CPU5) at 100% in softirq (tasklet), per cat /proc/softirqs.

Would any profiling help figure out which code it's spending time in? (E.g. something as simple as perf top might have useful output.)

Perf top for the CPU with the high tasklet count gives:

 samples  pcnt   RIP       function                DSO
 _______  _____  ________  ______________________  ____________________________________________________________________
 2787.00  24.1%  81062a00  mutex_spin_on_owner     /root/vmlinux
  978.00   8.4%  810297f0  clflush_cache_range     /root/vmlinux
  445.00   3.8%  812ea440  __domain_mapping        /root/vmlinux
  441.00   3.8%  00018c30  svc_recv                /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  344.00   3.0%  813a1bc0  _raw_spin_lock_bh       /root/vmlinux
  333.00   2.9%  813a19e0  _raw_spin_lock_irqsave  /root/vmlinux
  288.00   2.5%  813a07d0  __schedule              /root/vmlinux
  249.00   2.1%  811a87e0  rb_prev                 /root/vmlinux
  242.00   2.1%  813a19b0  _raw_spin_lock          /root/vmlinux
  184.00   1.6%  2e90      svc_rdma_sendto         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
  177.00   1.5%  810ac820  get_page_from_freelist  /root/vmlinux
  174.00   1.5%  812e6da0  alloc_iova              /root/vmlinux
  165.00   1.4%  810b1390  put_page                /root/vmlinux
  148.00   1.3%  00014760  sunrpc_cache_lookup     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  128.00   1.1%  00017f20  svc_xprt_enqueue        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
  126.00   1.1%  8139f820  __mutex_lock_slowpath   /root/vmlinux
  108.00   0.9%  811a81d0  rb_insert_color         /root/vmlinux
  107.00   0.9%  4690      svc_rdma_recvfrom       /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
  102.00   0.9%  2640      send_reply              /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   99.00   0.9%  810e6490  kmem_cache_alloc        /root/vmlinux
   96.00   0.8%  810e5840  __slab_alloc            /root/vmlinux
   91.00   0.8%  6d30      mlx4_ib_post_send       /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
   88.00   0.8%  0dd0      svc_rdma_get_context    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   86.00   0.7%  813a1a10  _raw_spin_lock_irq
[PATCH for-next 3/9] net/mlx4_core: Change few DMFS fields names to match firmware spec
From: Hadar Hen Zion <had...@mellanox.com>

Change struct mlx4_net_trans_rule_hw_eth :: vlan_id name to vlan_tag
Change struct mlx4_net_trans_rule_hw_ib :: r_u_qpn name to l3_qpn

The patch doesn't introduce any functional change or API change
towards the firmware.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c |    6 +++---
 include/linux/mlx4/device.h              |    8
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index d1f01dc..3cfd372 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -714,12 +714,12 @@ static int parse_trans_rule(struct mlx4_dev *dev, struct mlx4_spec_list *spec,
 			rule_hw->eth.ether_type_enable = 1;
 			rule_hw->eth.ether_type = spec->eth.ether_type;
 		}
-		rule_hw->eth.vlan_id = spec->eth.vlan_id;
-		rule_hw->eth.vlan_id_msk = spec->eth.vlan_id_msk;
+		rule_hw->eth.vlan_tag = spec->eth.vlan_id;
+		rule_hw->eth.vlan_tag_msk = spec->eth.vlan_id_msk;
 		break;

 	case MLX4_NET_TRANS_RULE_ID_IB:
-		rule_hw->ib.qpn = spec->ib.r_qpn;
+		rule_hw->ib.l3_qpn = spec->ib.l3_qpn;
 		rule_hw->ib.qpn_mask = spec->ib.qpn_msk;
 		memcpy(rule_hw->ib.dst_gid, spec->ib.dst_gid, 16);
 		memcpy(rule_hw->ib.dst_gid_msk, spec->ib.dst_gid_msk, 16);
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index b2fe59d..a69bda7 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -930,7 +930,7 @@ struct mlx4_spec_ipv4 {
 };

 struct mlx4_spec_ib {
-	__be32 r_qpn;
+	__be32 l3_qpn;
 	__be32 qpn_msk;
 	u8 dst_gid[16];
 	u8 dst_gid_msk[16];
@@ -978,7 +978,7 @@ struct mlx4_net_trans_rule_hw_ib {
 	u8 rsvd1;
 	__be16 id;
 	u32 rsvd2;
-	__be32 qpn;
+	__be32 l3_qpn;
 	__be32 qpn_mask;
 	u8 dst_gid[16];
 	u8 dst_gid_msk[16];
@@ -999,8 +999,8 @@ struct mlx4_net_trans_rule_hw_eth {
 	u8 rsvd5;
 	u8 ether_type_enable;
 	__be16 ether_type;
-	__be16 vlan_id_msk;
-	__be16 vlan_id;
+	__be16 vlan_tag_msk;
+	__be16 vlan_tag;
 } __packed;

 struct mlx4_net_trans_rule_hw_tcp_udp {
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next 4/9] net/mlx4_core: Directly expose fields of DMFS HW rule control segment
From: Hadar Hen Zion <had...@mellanox.com>

Some of struct mlx4_net_trans_rule_hw_ctrl fields were packed into u32
and accessed through bit field operations. Expose and access them
directly as u8 or u16.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c |   14 +++---
 include/linux/mlx4/device.h              |    4 +++-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 3cfd372..07712f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -656,15 +656,15 @@ static void trans_rule_ctrl_to_hw(struct mlx4_net_trans_rule *ctrl,
 		[MLX4_FS_MC_SNIFFER]	= 0x5,
 	};

-	u32 dw = 0;
+	u8 flags = 0;

-	dw = ctrl->queue_mode == MLX4_NET_TRANS_Q_LIFO ? 1 : 0;
-	dw |= ctrl->exclusive ? (1 << 2) : 0;
-	dw |= ctrl->allow_loopback ? (1 << 3) : 0;
-	dw |= __promisc_mode[ctrl->promisc_mode] << 8;
-	dw |= ctrl->priority << 16;
+	flags = ctrl->queue_mode == MLX4_NET_TRANS_Q_LIFO ? 1 : 0;
+	flags |= ctrl->exclusive ? (1 << 2) : 0;
+	flags |= ctrl->allow_loopback ? (1 << 3) : 0;

-	hw->ctrl = cpu_to_be32(dw);
+	hw->flags = flags;
+	hw->type = __promisc_mode[ctrl->promisc_mode];
+	hw->prio = cpu_to_be16(ctrl->priority);
 	hw->port = ctrl->port;
 	hw->qpn = cpu_to_be32(ctrl->qpn);
 }
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index a69bda7..08e5bc1 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -964,7 +964,9 @@ struct mlx4_net_trans_rule {
 };

 struct mlx4_net_trans_rule_hw_ctrl {
-	__be32 ctrl;
+	__be16 prio;
+	u8 type;
+	u8 flags;
 	u8 rsvd1;
 	u8 funcid;
 	u8 vep;
--
1.7.1
[PATCH for-next 5/9] net/mlx4_core: Expose a few helpers to fill DMFS HW structures
From: Hadar Hen Zion <had...@mellanox.com>

Re-arrange some of the code which fills DMFS HW structures so we can
use it from within the core driver and from the IB driver too, e.g.
when verbs DMFS structures are transformed into mlx4 hardware structs.
Also, add struct mlx4_flow_handle, which will be of use to the DMFS
verbs flow in the IB driver.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mcg.c |   85 +-
 include/linux/mlx4/device.h              |   12 -
 2 files changed, 70 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 07712f9..00b4e7b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -645,17 +645,28 @@ static int find_entry(struct mlx4_dev *dev, u8 port,
 	return err;
 }

+static const u8 __promisc_mode[] = {
+	[MLX4_FS_REGULAR]	= 0x0,
+	[MLX4_FS_ALL_DEFAULT]	= 0x1,
+	[MLX4_FS_MC_DEFAULT]	= 0x3,
+	[MLX4_FS_UC_SNIFFER]	= 0x4,
+	[MLX4_FS_MC_SNIFFER]	= 0x5,
+};
+
+int mlx4_map_sw_to_hw_steering_mode(struct mlx4_dev *dev,
+				    enum mlx4_net_trans_promisc_mode flow_type)
+{
+	if (flow_type >= MLX4_FS_MODE_NUM || flow_type < 0) {
+		mlx4_err(dev, "Invalid flow type. type = %d\n", flow_type);
+		return -EINVAL;
+	}
+	return __promisc_mode[flow_type];
+}
+EXPORT_SYMBOL_GPL(mlx4_map_sw_to_hw_steering_mode);
+
 static void trans_rule_ctrl_to_hw(struct mlx4_net_trans_rule *ctrl,
 				  struct mlx4_net_trans_rule_hw_ctrl *hw)
 {
-	static const u8 __promisc_mode[] = {
-		[MLX4_FS_REGULAR]	= 0x0,
-		[MLX4_FS_ALL_DEFAULT]	= 0x1,
-		[MLX4_FS_MC_DEFAULT]	= 0x3,
-		[MLX4_FS_UC_SNIFFER]	= 0x4,
-		[MLX4_FS_MC_SNIFFER]	= 0x5,
-	};
-
 	u8 flags = 0;

 	flags = ctrl->queue_mode == MLX4_NET_TRANS_Q_LIFO ? 1 : 0;
@@ -678,29 +689,51 @@ const u16 __sw_id_hw[] = {
 	[MLX4_NET_TRANS_RULE_ID_UDP]     = 0xE006
 };

+int mlx4_map_sw_to_hw_steering_id(struct mlx4_dev *dev,
+				  enum mlx4_net_trans_rule_id id)
+{
+	if (id >= MLX4_NET_TRANS_RULE_NUM || id < 0) {
+		mlx4_err(dev, "Invalid network rule id. id = %d\n", id);
+		return -EINVAL;
+	}
+	return __sw_id_hw[id];
+}
+EXPORT_SYMBOL_GPL(mlx4_map_sw_to_hw_steering_id);
+
+static const int __rule_hw_sz[] = {
+	[MLX4_NET_TRANS_RULE_ID_ETH] =
+		sizeof(struct mlx4_net_trans_rule_hw_eth),
+	[MLX4_NET_TRANS_RULE_ID_IB] =
+		sizeof(struct mlx4_net_trans_rule_hw_ib),
+	[MLX4_NET_TRANS_RULE_ID_IPV6] = 0,
+	[MLX4_NET_TRANS_RULE_ID_IPV4] =
+		sizeof(struct mlx4_net_trans_rule_hw_ipv4),
+	[MLX4_NET_TRANS_RULE_ID_TCP] =
+		sizeof(struct mlx4_net_trans_rule_hw_tcp_udp),
+	[MLX4_NET_TRANS_RULE_ID_UDP] =
+		sizeof(struct mlx4_net_trans_rule_hw_tcp_udp)
+};
+
+int mlx4_hw_rule_sz(struct mlx4_dev *dev,
+		    enum mlx4_net_trans_rule_id id)
+{
+	if (id >= MLX4_NET_TRANS_RULE_NUM || id < 0) {
+		mlx4_err(dev, "Invalid network rule id. id = %d\n", id);
+		return -EINVAL;
+	}
+
+	return __rule_hw_sz[id];
+}
+EXPORT_SYMBOL_GPL(mlx4_hw_rule_sz);
+
 static int parse_trans_rule(struct mlx4_dev *dev, struct mlx4_spec_list *spec,
 			    struct _rule_hw *rule_hw)
 {
-	static const size_t __rule_hw_sz[] = {
-		[MLX4_NET_TRANS_RULE_ID_ETH] =
-			sizeof(struct mlx4_net_trans_rule_hw_eth),
-		[MLX4_NET_TRANS_RULE_ID_IB] =
-			sizeof(struct mlx4_net_trans_rule_hw_ib),
-		[MLX4_NET_TRANS_RULE_ID_IPV6] = 0,
-		[MLX4_NET_TRANS_RULE_ID_IPV4] =
-			sizeof(struct mlx4_net_trans_rule_hw_ipv4),
-		[MLX4_NET_TRANS_RULE_ID_TCP] =
-			sizeof(struct mlx4_net_trans_rule_hw_tcp_udp),
-		[MLX4_NET_TRANS_RULE_ID_UDP] =
-			sizeof(struct mlx4_net_trans_rule_hw_tcp_udp)
-	};
-	if (spec->id >= MLX4_NET_TRANS_RULE_NUM) {
-		mlx4_err(dev, "Invalid network rule id. id = %d\n", spec->id);
+	if (mlx4_hw_rule_sz(dev, spec->id) < 0)
 		return -EINVAL;
-	}
-	memset(rule_hw, 0, __rule_hw_sz[spec->id]);
+	memset(rule_hw, 0, mlx4_hw_rule_sz(dev, spec->id));
 	rule_hw->id = cpu_to_be16(__sw_id_hw[spec->id]);
-	rule_hw->size = __rule_hw_sz[spec->id] >> 2;
+	rule_hw->size = mlx4_hw_rule_sz(dev, spec->id) >> 2;
 	switch (spec->id) {
 	case MLX4_NET_TRANS_RULE_ID_ETH:
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index
[PATCH for-next 1/9] net/mlx4_core: Move DMFS HW structs to common header file
From: Hadar Hen Zion <had...@mellanox.com>

Move flow steering HW structures to be in the public mlx4 include
directory, as a pre-step for the mlx4 IB driver to use them too.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |   79 -
 include/linux/mlx4/device.h               |   79 +
 2 files changed, 79 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index d738454..d5fdb19 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -701,85 +701,6 @@ struct mlx4_steer {
 	struct list_head steer_entries[MLX4_NUM_STEERS];
 };

-struct mlx4_net_trans_rule_hw_ctrl {
-	__be32 ctrl;
-	u8 rsvd1;
-	u8 funcid;
-	u8 vep;
-	u8 port;
-	__be32 qpn;
-	__be32 rsvd2;
-};
-
-struct mlx4_net_trans_rule_hw_ib {
-	u8 size;
-	u8 rsvd1;
-	__be16 id;
-	u32 rsvd2;
-	__be32 qpn;
-	__be32 qpn_mask;
-	u8 dst_gid[16];
-	u8 dst_gid_msk[16];
-} __packed;
-
-struct mlx4_net_trans_rule_hw_eth {
-	u8 size;
-	u8 rsvd;
-	__be16 id;
-	u8 rsvd1[6];
-	u8 dst_mac[6];
-	u16 rsvd2;
-	u8 dst_mac_msk[6];
-	u16 rsvd3;
-	u8 src_mac[6];
-	u16 rsvd4;
-	u8 src_mac_msk[6];
-	u8 rsvd5;
-	u8 ether_type_enable;
-	__be16 ether_type;
-	__be16 vlan_id_msk;
-	__be16 vlan_id;
-} __packed;
-
-struct mlx4_net_trans_rule_hw_tcp_udp {
-	u8 size;
-	u8 rsvd;
-	__be16 id;
-	__be16 rsvd1[3];
-	__be16 dst_port;
-	__be16 rsvd2;
-	__be16 dst_port_msk;
-	__be16 rsvd3;
-	__be16 src_port;
-	__be16 rsvd4;
-	__be16 src_port_msk;
-} __packed;
-
-struct mlx4_net_trans_rule_hw_ipv4 {
-	u8 size;
-	u8 rsvd;
-	__be16 id;
-	__be32 rsvd1;
-	__be32 dst_ip;
-	__be32 dst_ip_msk;
-	__be32 src_ip;
-	__be32 src_ip_msk;
-} __packed;
-
-struct _rule_hw {
-	union {
-		struct {
-			u8 size;
-			u8 rsvd;
-			__be16 id;
-		};
-		struct mlx4_net_trans_rule_hw_eth eth;
-		struct mlx4_net_trans_rule_hw_ib ib;
-		struct mlx4_net_trans_rule_hw_ipv4 ipv4;
-		struct mlx4_net_trans_rule_hw_tcp_udp tcp_udp;
-	};
-};
-
 enum {
 	MLX4_PCI_DEV_IS_VF		= 1 << 0,
 	MLX4_PCI_DEV_FORCE_SENSE_PORT	= 1 << 1,
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 811f91c..9fbf416 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -962,6 +962,85 @@ struct mlx4_net_trans_rule {
 	u32 qpn;
 };

+struct mlx4_net_trans_rule_hw_ctrl {
+	__be32 ctrl;
+	u8 rsvd1;
+	u8 funcid;
+	u8 vep;
+	u8 port;
+	__be32 qpn;
+	__be32 rsvd2;
+};
+
+struct mlx4_net_trans_rule_hw_ib {
+	u8 size;
+	u8 rsvd1;
+	__be16 id;
+	u32 rsvd2;
+	__be32 qpn;
+	__be32 qpn_mask;
+	u8 dst_gid[16];
+	u8 dst_gid_msk[16];
+} __packed;
+
+struct mlx4_net_trans_rule_hw_eth {
+	u8 size;
+	u8 rsvd;
+	__be16 id;
+	u8 rsvd1[6];
+	u8 dst_mac[6];
+	u16 rsvd2;
+	u8 dst_mac_msk[6];
+	u16 rsvd3;
+	u8 src_mac[6];
+	u16 rsvd4;
+	u8 src_mac_msk[6];
+	u8 rsvd5;
+	u8 ether_type_enable;
+	__be16 ether_type;
+	__be16 vlan_id_msk;
+	__be16 vlan_id;
+} __packed;
+
+struct mlx4_net_trans_rule_hw_tcp_udp {
+	u8 size;
+	u8 rsvd;
+	__be16 id;
+	__be16 rsvd1[3];
+	__be16 dst_port;
+	__be16 rsvd2;
+	__be16 dst_port_msk;
+	__be16 rsvd3;
+	__be16 src_port;
+	__be16 rsvd4;
+	__be16 src_port_msk;
+} __packed;
+
+struct mlx4_net_trans_rule_hw_ipv4 {
+	u8 size;
+	u8 rsvd;
+	__be16 id;
+	__be32 rsvd1;
+	__be32 dst_ip;
+	__be32 dst_ip_msk;
+	__be32 src_ip;
+	__be32 src_ip_msk;
+} __packed;
+
+struct _rule_hw {
+	union {
+		struct {
+			u8 size;
+			u8 rsvd;
+			__be16 id;
+		};
+		struct mlx4_net_trans_rule_hw_eth eth;
+		struct mlx4_net_trans_rule_hw_ib ib;
+		struct mlx4_net_trans_rule_hw_ipv4 ipv4;
+		struct mlx4_net_trans_rule_hw_tcp_udp tcp_udp;
+	};
+};
+
 int mlx4_flow_steer_promisc_add(struct mlx4_dev *dev, u8 port, u32 qpn,
[PATCH for-next 9/9] IB/mlx4: Add receive Flow Steering support
From: Hadar Hen Zion <had...@mellanox.com>

Implement the ib_create_flow and ib_destroy_flow verbs. Translate the
verbs structures provided by the user to HW structures and call the
MLX4_QP_FLOW_STEERING_ATTACH/DETACH firmware commands. On ATTACH
command completion, the firmware provides a 64-bit registration ID,
which is returned to the caller within struct ib_flow and used later
for detaching that flow.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c |  247 +
 1 files changed, 247 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 23d7343..e72584f 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -54,6 +54,8 @@
 #define DRV_VERSION	"1.0"
 #define DRV_RELDATE	"April 4, 2008"

+#define MLX4_IB_FLOW_MAX_PRIO 0xFFF
+
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("Mellanox ConnectX HCA InfiniBand driver");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -88,6 +90,25 @@ static void init_query_mad(struct ib_smp *mad)

 static union ib_gid zgid;

+static int check_flow_steering_support(struct mlx4_dev *dev)
+{
+	int ib_num_ports = 0;
+	int i;
+
+	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
+		ib_num_ports++;
+
+	if (dev->caps.steering_mode == MLX4_STEERING_MODE_DEVICE_MANAGED) {
+		if (ib_num_ports || mlx4_is_mfunc(dev)) {
+			pr_warn("Device managed flow steering is unavailable "
+				"for IB ports or in multifunction env.\n");
+			return 0;
+		}
+		return 1;
+	}
+	return 0;
+}
+
 static int mlx4_ib_query_device(struct ib_device *ibdev,
 				struct ib_device_attr *props)
 {
@@ -144,6 +165,8 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 			props->device_cap_flags |= IB_DEVICE_MEM_WINDOW_TYPE_2B;
 		else
 			props->device_cap_flags |= IB_DEVICE_MEM_WINDOW_TYPE_2A;
+		if (check_flow_steering_support(dev->dev))
+			props->device_cap_flags |= IB_DEVICE_MANAGED_FLOW_STEERING;
 	}

 	props->vendor_id = be32_to_cpup((__be32 *) (out_mad->data + 36))
@@ -798,6 +821,221 @@ struct mlx4_ib_steering {
 	union ib_gid gid;
 };

+static int parse_flow_attr(struct mlx4_dev *dev,
+			   struct _ib_flow_spec *ib_spec,
+			   struct _rule_hw *mlx4_spec)
+{
+	enum mlx4_net_trans_rule_id type;
+
+	switch (ib_spec->type) {
+	case IB_FLOW_SPEC_ETH:
+		type = MLX4_NET_TRANS_RULE_ID_ETH;
+		memcpy(mlx4_spec->eth.dst_mac, ib_spec->eth.val.dst_mac,
+		       ETH_ALEN);
+		memcpy(mlx4_spec->eth.dst_mac_msk, ib_spec->eth.mask.dst_mac,
+		       ETH_ALEN);
+		mlx4_spec->eth.vlan_tag = ib_spec->eth.val.vlan_tag;
+		mlx4_spec->eth.vlan_tag_msk = ib_spec->eth.mask.vlan_tag;
+		break;
+
+	case IB_FLOW_SPEC_IB:
+		type = MLX4_NET_TRANS_RULE_ID_IB;
+		mlx4_spec->ib.l3_qpn = ib_spec->ib.val.l3_type_qpn;
+		mlx4_spec->ib.qpn_mask = ib_spec->ib.mask.l3_type_qpn;
+		memcpy(mlx4_spec->ib.dst_gid, ib_spec->ib.val.dst_gid, 16);
+		memcpy(mlx4_spec->ib.dst_gid_msk,
+		       ib_spec->ib.mask.dst_gid, 16);
+		break;
+
+	case IB_FLOW_SPEC_IPV4:
+		type = MLX4_NET_TRANS_RULE_ID_IPV4;
+		mlx4_spec->ipv4.src_ip = ib_spec->ipv4.val.src_ip;
+		mlx4_spec->ipv4.src_ip_msk = ib_spec->ipv4.mask.src_ip;
+		mlx4_spec->ipv4.dst_ip = ib_spec->ipv4.val.dst_ip;
+		mlx4_spec->ipv4.dst_ip_msk = ib_spec->ipv4.mask.dst_ip;
+		break;
+
+	case IB_FLOW_SPEC_TCP:
+	case IB_FLOW_SPEC_UDP:
+		type = ib_spec->type == IB_FLOW_SPEC_TCP ?
+					MLX4_NET_TRANS_RULE_ID_TCP :
+					MLX4_NET_TRANS_RULE_ID_UDP;
+		mlx4_spec->tcp_udp.dst_port = ib_spec->tcp_udp.val.dst_port;
+		mlx4_spec->tcp_udp.dst_port_msk = ib_spec->tcp_udp.mask.dst_port;
+		mlx4_spec->tcp_udp.src_port = ib_spec->tcp_udp.val.src_port;
+		mlx4_spec->tcp_udp.src_port_msk = ib_spec->tcp_udp.mask.src_port;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+	if (mlx4_map_sw_to_hw_steering_id(dev, type) < 0 ||
+	    mlx4_hw_rule_sz(dev, type) < 0)
+		return -EINVAL;
+	mlx4_spec->id = cpu_to_be16(mlx4_map_sw_to_hw_steering_id(dev, type));
+	mlx4_spec->size = mlx4_hw_rule_sz(dev, type) >> 2;
+	return mlx4_hw_rule_sz(dev, type);
+}
+
+static int __mlx4_ib_create_flow(struct ib_qp *qp, struct
[PATCH for-next 6/9] IB/core: Add receive Flow Steering support
From: Hadar Hen Zion <had...@mellanox.com>

The RDMA stack allows applications to create IB_QPT_RAW_PACKET QPs,
which carry plain Ethernet packets, specifically packets that don't
carry any QPN to be matched by the receiving side. Applications using
these QPs must be provided with a method to program steering rules
into the HW so packets arriving at the local port can be routed to
them.

This patch adds ib_create_flow, which allows providing a flow
specification for a QP such that when there's a match between the
specification and a received packet, the packet can be forwarded to
that QP, in a manner similar to using ib_attach_multicast for IB UD
multicast handling.

Flow specifications are provided as instances of struct
ib_flow_spec_yyy, which describe L2, L3 and L4 headers; currently,
specs for Ethernet, IPv4, TCP, UDP and IB are defined. Flow specs are
made of values and masks.

The input to ib_create_flow is an instance of struct ib_flow_attr,
which contains a few mandatory control elements and optional flow
specs:

struct ib_flow_attr {
	enum ib_flow_attr_type type;
	u16 size;
	u16 priority;
	u8  num_of_specs;
	u8  port;
	u32 flags;
	/* Following are the optional layers according to user request
	 * struct ib_flow_spec_yyy
	 * struct ib_flow_spec_zzz
	 */
};

As these specs eventually come from user space, they are defined and
used in a way which allows adding new spec types without a kernel/user
ABI change, and with only a little API enhancement which defines the
newly added spec. The flow spec structures are defined in a TLV
(Type-Length-Value) manner, which allows calling ib_create_flow with a
list of a variable number of optional specs. For the actual processing
of ib_flow_attr, the driver uses the mandatory num_of_specs and size
fields along with the TLV nature of the specs.

Steering rules are processed in order of priority. The user sets the
12 low-order bits of the priority field, and the remaining 4
high-order bits are set by the kernel according to the domain to which
the application, or the layer that created the rule, belongs. A lower
numerical priority value means higher priority.

The value returned from ib_create_flow is an instance of struct
ib_flow, which contains a database pointer (handle) provided by the HW
driver to be used when calling ib_destroy_flow.

Applications that offload TCP/IP traffic could also be written over IB
UD QPs; as such, the ib_create_flow / ib_destroy_flow API is designed
to support UD QPs too. The HW driver sets
IB_DEVICE_MANAGED_FLOW_STEERING to denote support for flow steering.

The ib_flow_attr enum type relates to the usage of flow steering for
promiscuous and sniffer purposes:

IB_FLOW_ATTR_NORMAL      - regular rule, steering according to rule specification
IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
                           all Ethernet traffic which isn't steered to any QP
IB_FLOW_ATTR_MC_DEFAULT  - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast
IB_FLOW_ATTR_SNIFFER     - sniffer rule, receive all port traffic

The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the
Ethernet link type.
Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/core/verbs.c |   30 +
 include/rdma/ib_verbs.h         |  136 ++-
 2 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 22192de..932f4a7 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1254,3 +1254,33 @@ int ib_dealloc_xrcd(struct ib_xrcd *xrcd)
 	return xrcd->device->dealloc_xrcd(xrcd);
 }
 EXPORT_SYMBOL(ib_dealloc_xrcd);
+
+struct ib_flow *ib_create_flow(struct ib_qp *qp,
+			       struct ib_flow_attr *flow_attr,
+			       int domain)
+{
+	struct ib_flow *flow_id;
+
+	if (!qp->device->create_flow)
+		return ERR_PTR(-ENOSYS);
+
+	flow_id = qp->device->create_flow(qp, flow_attr, domain);
+	if (!IS_ERR(flow_id))
+		atomic_inc(&qp->usecnt);
+	return flow_id;
+}
+EXPORT_SYMBOL(ib_create_flow);
+
+int ib_destroy_flow(struct ib_flow *flow_id)
+{
+	int err;
+	struct ib_qp *qp = flow_id->qp;
+
+	if (!flow_id->qp->device->destroy_flow)
+		return -ENOSYS;
+
+	err = qp->device->destroy_flow(flow_id);
+	if (!err)
+		atomic_dec(&qp->usecnt);
+	return err;
+}
+EXPORT_SYMBOL(ib_destroy_flow);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..6f76d62 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -116,7 +116,8 @@ enum ib_device_cap_flags {
 	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1 << 21),
 	IB_DEVICE_BLOCK_MULTICAST_LOOPBACK = (1 << 22),
 	IB_DEVICE_MEM_WINDOW_TYPE_2A	= (1 << 23),
-
[PATCH for-next 2/9] net/mlx4: Match DMFS promiscuous field names to firmware spec
From: Hadar Hen Zion <had...@mellanox.com>

Align the names used by enum mlx4_net_trans_promisc_mode with the
actual firmware specification. The patch doesn't introduce any
functional change or API change towards the firmware.

Remove MLX4_FS_PROMISC_FUNCTION_PORT, which isn't of use. Add new
enums MLX4_FS_{UC/MC}_SNIFFER as a preparation step for sniffer
support.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |   16
 drivers/net/ethernet/mellanox/mlx4/mcg.c        |   21 ++---
 include/linux/mlx4/device.h                     |   11 ++-
 4 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 00f25b5..2047684 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -889,7 +889,7 @@ static int mlx4_en_flow_replace(struct net_device *dev,
 		.queue_mode = MLX4_NET_TRANS_Q_FIFO,
 		.exclusive = 0,
 		.allow_loopback = 1,
-		.promisc_mode = MLX4_FS_PROMISC_NONE,
+		.promisc_mode = MLX4_FS_REGULAR,
 	};

 	rule.port = priv->port;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 30d78f8..0860130 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -127,7 +127,7 @@ static void mlx4_en_filter_work(struct work_struct *work)
 		.queue_mode = MLX4_NET_TRANS_Q_LIFO,
 		.exclusive = 1,
 		.allow_loopback = 1,
-		.promisc_mode = MLX4_FS_PROMISC_NONE,
+		.promisc_mode = MLX4_FS_REGULAR,
 		.port = priv->port,
 		.priority = MLX4_DOMAIN_RFS,
 	};
@@ -446,7 +446,7 @@ static int mlx4_en_uc_steer_add(struct mlx4_en_priv *priv,
 			.queue_mode = MLX4_NET_TRANS_Q_FIFO,
 			.exclusive = 0,
 			.allow_loopback = 1,
-			.promisc_mode = MLX4_FS_PROMISC_NONE,
+			.promisc_mode = MLX4_FS_REGULAR,
 			.priority = MLX4_DOMAIN_NIC,
 		};
@@ -793,7 +793,7 @@ static void mlx4_en_set_promisc_mode(struct mlx4_en_priv *priv,
 			err = mlx4_flow_steer_promisc_add(mdev->dev,
 							  priv->port,
 							  priv->base_qpn,
-							  MLX4_FS_PROMISC_UPLINK);
+							  MLX4_FS_ALL_DEFAULT);
 			if (err)
 				en_err(priv, "Failed enabling promiscuous mode\n");
 			priv->flags |= MLX4_EN_FLAG_MC_PROMISC;
@@ -856,7 +856,7 @@ static void mlx4_en_clear_promisc_mode(struct mlx4_en_priv *priv,
 	case MLX4_STEERING_MODE_DEVICE_MANAGED:
 		err = mlx4_flow_steer_promisc_remove(mdev->dev,
 						     priv->port,
-						     MLX4_FS_PROMISC_UPLINK);
+						     MLX4_FS_ALL_DEFAULT);
 		if (err)
 			en_err(priv, "Failed disabling promiscuous mode\n");
 		priv->flags &= ~MLX4_EN_FLAG_MC_PROMISC;
@@ -917,7 +917,7 @@ static void mlx4_en_do_multicast(struct mlx4_en_priv *priv,
 			err = mlx4_flow_steer_promisc_add(mdev->dev,
 							  priv->port,
 							  priv->base_qpn,
-							  MLX4_FS_PROMISC_ALL_MULTI);
+							  MLX4_FS_MC_DEFAULT);
 			break;

 		case MLX4_STEERING_MODE_B0:
@@ -940,7 +940,7 @@ static void mlx4_en_do_multicast(struct mlx4_en_priv *priv,
 		case MLX4_STEERING_MODE_DEVICE_MANAGED:
 			err = mlx4_flow_steer_promisc_remove(mdev->dev,
 							     priv->port,
-							     MLX4_FS_PROMISC_ALL_MULTI);
+							     MLX4_FS_MC_DEFAULT);
 			break;

 		case MLX4_STEERING_MODE_B0:
@@ -1598,10 +1598,10 @@ void mlx4_en_stop_port(struct net_device *dev, int detach)
 				 MLX4_EN_FLAG_MC_PROMISC);
[PATCH for-next 7/9] IB/core: Infrastructure to support verbs extensions through uverbs
From: Igor Ivanov <igor.iva...@itseez.com>

Add infrastructure to support extended uverbs capabilities in a
forward/backward compatible manner. Uverbs command opcodes which are
based on the verbs extensions approach should be greater than or equal
to IB_USER_VERBS_CMD_THRESHOLD. They have a new header format and are
processed a bit differently.

Signed-off-by: Igor Ivanov <igor.iva...@itseez.com>
Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/core/uverbs_main.c |   29 -
 include/uapi/rdma/ib_user_verbs.h     |   10 ++
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 2c6f0f2..e4e7b24 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -583,9 +583,6 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (copy_from_user(&hdr, buf, sizeof hdr))
 		return -EFAULT;

-	if (hdr.in_words * 4 != count)
-		return -EINVAL;
-
 	if (hdr.command >= ARRAY_SIZE(uverbs_cmd_table) ||
 	    !uverbs_cmd_table[hdr.command])
 		return -EINVAL;
@@ -597,8 +594,30 @@ static ssize_t ib_uverbs_write(struct file *filp, const char __user *buf,
 	if (!(file->device->ib_dev->uverbs_cmd_mask & (1ull << hdr.command)))
 		return -ENOSYS;

-	return uverbs_cmd_table[hdr.command](file, buf + sizeof hdr,
-					     hdr.in_words * 4, hdr.out_words * 4);
+	if (hdr.command >= IB_USER_VERBS_CMD_THRESHOLD) {
+		struct ib_uverbs_cmd_hdr_ex hdr_ex;
+
+		if (copy_from_user(&hdr_ex, buf, sizeof(hdr_ex)))
+			return -EFAULT;
+
+		if (((hdr_ex.in_words + hdr_ex.provider_in_words) * 4) != count)
+			return -EINVAL;
+
+		return uverbs_cmd_table[hdr.command](file,
+						     buf + sizeof(hdr_ex),
+						     (hdr_ex.in_words +
+						      hdr_ex.provider_in_words) * 4,
+						     (hdr_ex.out_words +
+						      hdr_ex.provider_out_words) * 4);
+	} else {
+		if (hdr.in_words * 4 != count)
+			return -EINVAL;
+
+		return uverbs_cmd_table[hdr.command](file,
+						     buf + sizeof(hdr),
+						     hdr.in_words * 4,
+						     hdr.out_words * 4);
+	}
 }
 static int ib_uverbs_mmap(struct file *filp, struct vm_area_struct *vma)
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 805711e..61535aa 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -43,6 +43,7 @@
  * compatibility are made.
  */
 #define IB_USER_VERBS_ABI_VERSION	6
+#define IB_USER_VERBS_CMD_THRESHOLD	50

 enum {
 	IB_USER_VERBS_CMD_GET_CONTEXT,
@@ -123,6 +124,15 @@ struct ib_uverbs_cmd_hdr {
 	__u16 out_words;
 };

+struct ib_uverbs_cmd_hdr_ex {
+	__u32 command;
+	__u16 in_words;
+	__u16 out_words;
+	__u16 provider_in_words;
+	__u16 provider_out_words;
+	__u32 cmd_hdr_reserved;
+};
+
 struct ib_uverbs_get_context {
 	__u64 response;
 	__u64 driver_data[0];
--
1.7.1

Cc: Tzahi Oved <tza...@mellanox.com>
Cc: Sean Hefty <sean.he...@intel.com>
Cc: Yishai Hadas <yish...@mellanox.com>
[PATCH for-next 8/9] IB/core: Export ib_create/destroy_flow through uverbs
From: Hadar Hen Zion <had...@mellanox.com>

Implement ib_uverbs_create_flow and ib_uverbs_destroy_flow to support
flow steering for user space applications.

Signed-off-by: Hadar Hen Zion <had...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
 drivers/infiniband/core/uverbs.h      |   3 +
 drivers/infiniband/core/uverbs_cmd.c  | 209 +++++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_main.c |  13 ++-
 include/rdma/ib_verbs.h               |   1 +
 include/uapi/rdma/ib_user_verbs.h     | 108 +++++++++++++-
 5 files changed, 332 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 0fcd7aa..ad9d102 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -155,6 +155,7 @@ extern struct idr ib_uverbs_cq_idr;
 extern struct idr ib_uverbs_qp_idr;
 extern struct idr ib_uverbs_srq_idr;
 extern struct idr ib_uverbs_xrcd_idr;
+extern struct idr ib_uverbs_rule_idr;
 
 void idr_remove_uobj(struct idr *idp, struct ib_uobject *uobj);
 
@@ -215,5 +216,7 @@ IB_UVERBS_DECLARE_CMD(destroy_srq);
 IB_UVERBS_DECLARE_CMD(create_xsrq);
 IB_UVERBS_DECLARE_CMD(open_xrcd);
 IB_UVERBS_DECLARE_CMD(close_xrcd);
+IB_UVERBS_DECLARE_CMD(create_flow);
+IB_UVERBS_DECLARE_CMD(destroy_flow);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a7d00f6..29c340e 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -54,6 +54,7 @@ static struct uverbs_lock_class qp_lock_class = { .name = "QP-uobj" };
 static struct uverbs_lock_class ah_lock_class = { .name = "AH-uobj" };
 static struct uverbs_lock_class srq_lock_class = { .name = "SRQ-uobj" };
 static struct uverbs_lock_class xrcd_lock_class = { .name = "XRCD-uobj" };
+static struct uverbs_lock_class rule_lock_class = { .name = "RULE-uobj" };
 
 #define INIT_UDATA(udata, ibuf, obuf, ilen, olen)			\
 	do {								\
@@ -330,6 +331,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&ucontext->srq_list);
 	INIT_LIST_HEAD(&ucontext->ah_list);
 	INIT_LIST_HEAD(&ucontext->xrcd_list);
+	INIT_LIST_HEAD(&ucontext->rule_list);
 	ucontext->closing = 0;
 
 	resp.num_comp_vectors = file->device->num_comp_vectors;
@@ -2587,6 +2589,213 @@ out_put:
 	return ret ? ret : in_len;
 }
 
+static int kern_spec_to_ib_spec(struct ib_kern_spec *kern_spec,
+				struct _ib_flow_spec *ib_spec)
+{
+	ib_spec->type = kern_spec->type;
+
+	switch (ib_spec->type) {
+	case IB_FLOW_SPEC_ETH:
+		ib_spec->eth.size = sizeof(struct ib_flow_spec_eth);
+		memcpy(&ib_spec->eth.val, &kern_spec->eth.val,
+		       sizeof(struct ib_flow_eth_filter));
+		memcpy(&ib_spec->eth.mask, &kern_spec->eth.mask,
+		       sizeof(struct ib_flow_eth_filter));
+		break;
+	case IB_FLOW_SPEC_IB:
+		ib_spec->ib.size = sizeof(struct ib_flow_spec_ib);
+		memcpy(&ib_spec->ib.val, &kern_spec->ib.val,
+		       sizeof(struct ib_flow_ib_filter));
+		memcpy(&ib_spec->ib.mask, &kern_spec->ib.mask,
+		       sizeof(struct ib_flow_ib_filter));
+		break;
+	case IB_FLOW_SPEC_IPV4:
+		ib_spec->ipv4.size = sizeof(struct ib_flow_spec_ipv4);
+		memcpy(&ib_spec->ipv4.val, &kern_spec->ipv4.val,
+		       sizeof(struct ib_flow_ipv4_filter));
+		memcpy(&ib_spec->ipv4.mask, &kern_spec->ipv4.mask,
+		       sizeof(struct ib_flow_ipv4_filter));
+		break;
+	case IB_FLOW_SPEC_TCP:
+	case IB_FLOW_SPEC_UDP:
+		ib_spec->tcp_udp.size = sizeof(struct ib_flow_spec_tcp_udp);
+		memcpy(&ib_spec->tcp_udp.val, &kern_spec->tcp_udp.val,
+		       sizeof(struct ib_flow_tcp_udp_filter));
+		memcpy(&ib_spec->tcp_udp.mask, &kern_spec->tcp_udp.mask,
+		       sizeof(struct ib_flow_tcp_udp_filter));
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+ssize_t ib_uverbs_create_flow(struct ib_uverbs_file *file,
+			      const char __user *buf, int in_len,
+			      int out_len)
+{
+	struct ib_uverbs_create_flow	  cmd;
+	struct ib_uverbs_create_flow_resp resp;
+	struct ib_uobject		 *uobj;
+	struct ib_flow			 *flow_id;
+	struct ib_kern_flow_attr	 *kern_flow_attr;
+	struct ib_flow_attr		 *flow_attr;
+	struct ib_qp			 *qp;
+	int err = 0;
+	void *kern_spec;
+	void *ib_spec;
+	int i;
+
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, buf,
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchle...@ornl.gov> wrote:
> > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:
> > > > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
> > > > > > > Hi.
> > > > > > >
> > > > > > > I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > > > seem to only get about half of the bandwidth that the HW can give
> > > > > > > me. My setup consists of 2 servers, each with 16 cores, 32 GB of
> > > > > > > memory, and a Mellanox ConnectX-3 QDR card over PCIe gen3. These
> > > > > > > servers are connected to a QDR IB switch. The backing storage on
> > > > > > > the server is tmpfs mounted with noatime. I am running kernel
> > > > > > > 3.5.7.
> > > > > > >
> > > > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > > > 4-512K. When I run fio over RDMA-mounted NFS, I get 260-2200
> > > > > > > MB/sec for the same block sizes (4-512K). Running over IPoIB-CM,
> > > > > > > I get 200-980 MB/sec.
> > > > > >
> > > > > > Yan, are you trying to optimize single-client performance or
> > > > > > server performance with multiple clients?
> > >
> > > I am trying to get maximum performance from a single server - I used 2
> > > processes in the fio test; more than 2 did not show any performance
> > > boost. I tried running fio from 2 different PCs on 2 different files,
> > > but the sum of the two is more or less the same as running from a
> > > single client PC.
> > >
> > > What I did see is that the server is sweating a lot more than the
> > > clients, and more than that, it has 1 core (CPU5) at 100% in softirq
> > > tasklet (cat /proc/softirqs).
> >
> > Would any profiling help figure out which code it's spending time in?
> > (E.g. something as simple as perf top might have useful output.)
>
> Perf top for the CPU with high tasklet count gives:
>
>  samples  pcnt      RIP      function                DSO
>  _______  ____  ________  _____________________  ___________________
>  2787.00  24.1%  81062a00  mutex_spin_on_owner    /root/vmlinux

I guess that means lots of contention on some mutex?  If only we knew
which one...  perf should also be able to collect stack statistics, I
forget how.

--b.

>   978.00   8.4%  810297f0  clflush_cache_range    /root/vmlinux
>   445.00   3.8%  812ea440  __domain_mapping       /root/vmlinux
>   441.00   3.8%  00018c30  svc_recv               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>   344.00   3.0%  813a1bc0  _raw_spin_lock_bh      /root/vmlinux
>   333.00   2.9%  813a19e0  _raw_spin_lock_irqsave /root/vmlinux
>   288.00   2.5%  813a07d0  __schedule             /root/vmlinux
>   249.00   2.1%  811a87e0  rb_prev                /root/vmlinux
>   242.00   2.1%  813a19b0  _raw_spin_lock         /root/vmlinux
>   184.00   1.6%  2e90      svc_rdma_sendto        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>   177.00   1.5%  810ac820  get_page_from_freelist /root/vmlinux
>   174.00   1.5%  812e6da0  alloc_iova             /root/vmlinux
>   165.00   1.4%  810b1390  put_page               /root/vmlinux
>   148.00   1.3%  00014760  sunrpc_cache_lookup    /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>   128.00   1.1%  00017f20  svc_xprt_enqueue       /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>   126.00   1.1%  8139f820  __mutex_lock_slowpath  /root/vmlinux
>   108.00   0.9%  811a81d0  rb_insert_color        /root/vmlinux
>   107.00   0.9%  4690      svc_rdma_recvfrom      /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>   102.00   0.9%  2640      send_reply             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>    99.00   0.9%  810e6490  kmem_cache_alloc       /root/vmlinux
>    96.00   0.8%  810e5840  __slab_alloc           /root/vmlinux
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > Perf top for the CPU with high tasklet count gives:
> >
> >  samples  pcnt      RIP      function               DSO
> >  2787.00  24.1%  81062a00  mutex_spin_on_owner    /root/vmlinux
>
> I guess that means lots of contention on some mutex?  If only we knew
> which one...  perf should also be able to collect stack statistics, I
> forget how.

Googling around, I think we want:

	perf record -a --call-graph
	(give it a chance to collect some samples, then ^C)
	perf report --call-graph --stdio

--b.
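While waiting for call-graph data, the flat profile in this thread can at least be sorted and filtered mechanically. A minimal sketch (the sample rows below are a hand-copied subset of the profile quoted above, not a live capture):

```shell
#!/bin/sh
# Find the top consumer in a flat "perf top"-style dump.
# Columns: samples, pcnt, RIP, function, DSO.
cat > /tmp/profile.txt <<'EOF'
2787.00 24.1% 81062a00 mutex_spin_on_owner /root/vmlinux
978.00 8.4% 810297f0 clflush_cache_range /root/vmlinux
441.00 3.8% 00018c30 svc_recv /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
EOF

# Sort numerically by the samples column (descending) and print the
# function name of the hottest entry.
sort -rn /tmp/profile.txt | head -1 | awk '{print $4}'
```

This prints `mutex_spin_on_owner` for the sample data, matching the profile's top entry; with a real capture, the same pipeline works on whatever `perf` emits in this five-column layout.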
Re: Infiniband use of get_user_pages()
On Wed, 24 Apr 2013, Jan Kara wrote:
> Hello,
>
> when checking users of get_user_pages() (I'm doing some cleanups in that
> area to fix filesystem issues with mmap_sem locking) I've noticed that
> infiniband drivers add the number of pages obtained from
> get_user_pages() to the mm->pinned_vm counter. Although this makes some
> sense, it doesn't match any other user of get_user_pages() (e.g. direct
> IO), so does infiniband have some special reason why it does so?

get_user_pages() is typically used to temporarily increase the refcount.
The Infiniband layer needs to permanently pin the pages for memory
registration.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > > Perf top for the CPU with high tasklet count gives:
> > >
> > >  2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux
> > >  ...
> > >   249.00   2.1%  811a87e0  rb_prev              /root/vmlinux
> >
> > I guess that means lots of contention on some mutex?  If only we knew
> > which one...
>
> Googling around, I think we want:
>
>	perf record -a --call-graph
>	(give it a chance to collect some samples, then ^C)
>	perf report --call-graph --stdio

I have not looked at the NFS RDMA (and 3.x kernel) source yet. But see
that rb_prev up in the #7 spot? Do we have a red-black tree somewhere in
the paths? Trees like that require extensive locking.

-- Wendy
Re: linux-next: manual merge of the net-next tree with the infiniband tree
On Thu, Apr 18, 2013 at 01:18:43PM +1000, Stephen Rothwell wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got a conflict in
> drivers/infiniband/hw/cxgb4/qp.c between commit 5b0c275926b8
> ("RDMA/cxgb4: Fix SQ allocation when on-chip SQ is disabled") from the
> infiniband tree and commit 9919d5bd01b9 ("RDMA/cxgb4: Fix onchip queue
> support for T5") from the net-next tree.
>
> I think that they are 2 different fixes for the same problem, so I just
> used the net-next version and can carry the fix as necessary (no action
> is required).
> --
> Cheers,
> Stephen Rothwell    s...@canb.auug.org.au

Commit 5b0c275926b8 also keeps the intention of the original patch that
broke this, which was to return an error code in case the allocation
fails. The fix in commit 9919d5bd01b9 will return 0 if the allocation
fails. We should keep the other fix, or fix the code again to return the
proper error code.

Regards.
Cascardo.
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:
> I have not looked at the NFS RDMA (and 3.x kernel) source yet. But see
> that rb_prev up in the #7 spot? Do we have a red-black tree somewhere
> in the paths? Trees like that require extensive locking.

So I did a quick read of the sunrpc/xprtrdma source (based on the OFA
1.5.4.1 tarball)... Here is a random thought (not related to the rb tree
comment):

The in-flight packet count seems to be controlled by
xprt_rdma_slot_table_entries, which is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help
with the bandwidth number if we pump it up, say to 64 instead? Not sure
whether the FMR pool size needs to be adjusted accordingly, though.

In short, if anyone has a benchmark setup handy, bumping up the slot
table size as follows might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h	2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h	2013-04-24 10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE	(2U)
-#define RPCRDMA_DEF_SLOT_TABLE	(32U)
+#define RPCRDMA_DEF_SLOT_TABLE	(64U)
 #define RPCRDMA_MAX_SLOT_TABLE	(256U)
 #define RPCRDMA_DEF_INLINE	(1024)	/* default inline max */

-- Wendy
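Whether 32 slots is actually the bottleneck can be estimated from a bandwidth-delay calculation: the number of requests that must be in flight to keep the link busy is roughly bandwidth x round-trip time / request size. A sketch with illustrative numbers (the 100 us server turnaround is an assumption for the example, not a measurement from this thread):

```python
import math

def slots_needed(bandwidth_bytes_per_s: float,
                 rtt_s: float,
                 request_bytes: int) -> int:
    """Minimum concurrent RPC slots to saturate the link (ceiling of
    the bandwidth-delay product divided by the request size)."""
    return math.ceil(bandwidth_bytes_per_s * rtt_s / request_bytes)

# QDR IB payload rate ~4 GB/s, hypothetical 100 us round trip,
# 64 KB NFS requests:
print(slots_needed(4e9, 100e-6, 64 * 1024))  # -> 7
```

On these assumed numbers the default of 32 slots already covers the bandwidth-delay product several times over; the slot table would only become the limit with much higher per-request latency or much smaller requests, which is consistent with the doubts raised in the reply below about raising it.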
Re: NFS over RDMA benchmark
On 4/24/2013 2:04 PM, Wendy Cheng wrote:
> The in-flight packet count seems to be controlled by
> xprt_rdma_slot_table_entries, which is currently hard-coded as
> RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help
> with the bandwidth number if we pump it up, say to 64 instead? Not sure
> whether the FMR pool size needs to be adjusted accordingly, though.

1) The client slot count is not hard-coded; it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, spindle-limited workload.

2) The observation appears to be that the bandwidth is server-CPU
limited. Increasing the load offered by the client probably won't move
the needle until that's addressed.
Re: Infiniband use of get_user_pages()
On Wed, Apr 24, 2013 at 8:38 AM, Jan Kara <j...@suse.cz> wrote:
> When checking users of get_user_pages() (I'm doing some cleanups in
> that area to fix filesystem issues with mmap_sem locking) I've noticed
> that infiniband drivers add the number of pages obtained from
> get_user_pages() to the mm->pinned_vm counter. Although this makes some
> sense, it doesn't match any other user of get_user_pages() (e.g. direct
> IO), so does infiniband have some special reason why it does so?

Direct IO mappings are in some sense ephemeral -- they only need to last
while the IO is in flight. In contrast, the IB memory pinning is
controlled by (possibly unprivileged) userspace and might last the whole
lifetime of a long-lived application. So we want some accounting and
resource control.

> Also, that seems to be the only real reason why mmap_sem has to be
> grabbed in exclusive mode, am I right?

Most likely that is true.

> Another suspicious thing (at least in drivers/infiniband/core/umem.c:
> ib_umem_get()) is that the arguments of get_user_pages() are:
>
>	ret = get_user_pages(current, current->mm, cur_base,
>			     min_t(unsigned long, npages,
>				   PAGE_SIZE / sizeof (struct page *)),
>			     1, !umem->writable, page_list, vma_list);
>
> So we always have the write argument set to 1, and the force argument
> is set to !umem->writable. Is that really intentional? My naive guess
> would be that the arguments should be switched... Although even in
> that case I fail to see why the 'force' argument should be set. Can
> someone please explain?

This confused even me recently. We had a long discussion (read the whole
thread starting here: https://lkml.org/lkml/2012/1/26/7), but in short
the current parameters seem to be needed to trigger COW even when the
kernel/hardware want to read the memory, to avoid problems where we get
stale data if userspace triggers COW. I think I'd better add a comment
explaining this.

> Finally (and here I may show my ignorance ;), I'd like to ask whether
> there's any reason why ib_umem_get() checks for is_vm_hugetlb_page()
> and not just whether a page is a huge page?

I'm not sure of the history here. How would one check directly whether a
page is a huge page? get_user_pages() actually goes to some trouble to
return all small pages, even when it has to split a single huge page
into many entries in the page array. (Which is actually a bit
unfortunate for our use here.)

- R.