[ovs-dev] Re: [PATCH 0/3] userspace-tso: Improve L4 csum offload support.
Ethtool can turn off TSO on its own:

$ ethtool -K vethXXX tso off

whereas "$ ethtool -K vethXXX tx off" turns off tx checksumming, TSO, and scatter-gather together. TSO depends on tx checksumming and scatter-gather, so if you want to turn off only TSO and keep tx checksumming on, you can do it this way:

$ ethtool -K vethXXX tx on
$ ethtool -K vethXXX tso off

-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of William Tu
Sent: April 16, 2020 7:49
To: Ilya Maximets
Cc: ; Flavio Leitner
Subject: Re: [ovs-dev] [PATCH 0/3] userspace-tso: Improve L4 csum offload support.

On Fri, Feb 28, 2020 at 7:34 AM Ilya Maximets wrote:
>
> On 2/14/20 2:03 PM, Flavio Leitner wrote:
> > This patchset disables unsupported offload features for vhost devices,
> > such as UFO and ECN.
> >
> > Then it includes UDP checksum offload as a must-have to enable
> > userspace TSO, but leaves SCTP as optional. Only a few drivers
> > support SCTP checksum offload and the protocol is not widely used.

Hi Flavio and Ilya,

I have a question about this. Setting "other_config:userspace-tso-enable=true" enables both TSO and CSUM offload. Can we enable only the CSUM offload, but not TSO, by making them separate configuration options? Currently, every "make check-system-userspace" test has to add "$ ethtool -K $1 tx off" because there is no checksum support. If we made CSUM offload enabled by default, we wouldn't need to turn off tx offload.

Regards,
William

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] Re: [PATCH v4 0/3] Add support for TSO with DPDK
Hi, Flavio

Are you Red Hat folks working on VXLAN TSO support for OVS DPDK? This would be a great feature: many NICs can do VXLAN TSO, and even the very old Intel 82599ES controller supports it. Per our tests, between two VMs across two compute nodes, iperf3 TCP and UDP performance can reach line speed (about 10 Gbps). As far as we understand, DPDK and the hardware are ready for VXLAN TSO. How much extra effort is needed to enable VXLAN TSO support, and what is the roadblock?

-----Original Message-----
From: Flavio Leitner [mailto:f...@sysclose.org]
Sent: March 10, 2020 21:10
To: txfh2007
Cc: William Tu ; Yi Yang - Cloud Service Group ; d...@openvswitch.org; i.maxim...@ovn.org
Subject: Re: Re:[ovs-dev] Re: Re: [PATCH v4 0/3] Add support for TSO with DPDK

On Tue, Mar 10, 2020 at 04:08:43PM +0800, txfh2007 wrote:
> Hi Flavio and all:
>
> Is there a way to support software TSO for a DPDK tunnel network? I have tried the userspace TSO function on a tunnel network and got the following error:
> "Tunneling packets with HW offload flags is not supported: packet dropped"
> Is there a way to work around this if we want to support both VLAN and tunnel networks on the same compute node?

No, there is no support for tunneling at this point.

fbl

>
> Thanks
> Timo
>
> On Fri, Feb 28, 2020 at 9:56 AM Flavio Leitner wrote:
> >
> > Hi Yi Yang,
> >
> > This is the bug fix required to make veth TSO work in OvS:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9d2f67e43b73e8af7438be219b66a5de0cfa8bd9
> >
> > commit 9d2f67e43b73e8af7438be219b66a5de0cfa8bd9
> > Author: Jianfeng Tan
> > Date: Sat Sep 29 15:41:27 2018 +
> >
> >     net/packet: fix packet drop as of virtio gso
> >
> >     When we use a raw socket as the vhost backend, a packet from virtio
> >     with gso offloading information cannot be sent out in later
> >     validation at the xmit path, as we did not set the correct
> >     skb->protocol, which is further used for looking up the gso function.
> >
> >     To fix this, we set this field according to the virtio hdr information.
> >
> >     Fixes: e858fae2b0b8f4 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
> >     Signed-off-by: Jianfeng Tan
> >     Signed-off-by: David S. Miller
> >
> > So, the minimum kernel version is 4.19.
>
> Thanks,
> I sent a patch to update the documentation. Please take a look.
> William

--
fbl
[ovs-dev] Re: [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath
Ilya, the raw socket for interfaces whose type is "system" has been set to non-blocking mode; can you explain which syscall would lead to sleeping? Yes, a pmd thread consumes CPU even when it has nothing to do, but all type=dpdk ports are handled by pmd threads; here we just make system interfaces look like DPDK interfaces. I didn't see any problem in my tests; it would help if you could tell me what would cause a problem and how I can reproduce it. By the way, type=tap/internal interfaces are still handled by the ovs-vswitchd thread. In addition, only a one-line change is involved: ".is_pmd = true,". Keeping ".is_pmd = false," will leave the handling in ovs-vswitchd if there is any other concern. We can change the non-thread-safe parts to support pmd.

-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
Sent: March 18, 2020 19:45
To: yang_y...@163.com; ovs-dev@openvswitch.org
Cc: i.maxim...@ovn.org
Subject: Re: [ovs-dev] [PATCH v7] Use TPACKET_V3 to accelerate veth for userspace datapath

On 3/18/20 10:02 AM, yang_y...@163.com wrote:
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V3 and using
> DPDK-like poll to receive and send packets (Note: send still needs to
> call sendto to trigger final packet transmission).
>
> From Linux kernel 3.10 on, TPACKET_V3 has been supported, so all the
> Linux kernels current OVS supports can run TPACKET_V3 without any
> problem.
>
> I can see about 50% performance improvement for veth compared to the
> last recvmmsg optimization if I use TPACKET_V3; it is about 2.21 Gbps,
> but it was 1.47 Gbps before.
>
> After is_pmd is set to true, performance improves much more; it is
> about a 180% performance improvement.
>
> TPACKET_V3 can support TSO, but its performance isn't good because of
> a TPACKET_V3 kernel implementation issue, so it falls back to recvmmsg
> in case userspace-tso-enable is set to true. Its performance is better
> than recvmmsg in case userspace-tso-enable is set to false, so
> TPACKET_V3 is used only in that case.
>
> Note: how much performance improves depends on your platform; some
> platforms see a huge improvement, on others it isn't so noticeable.
> But if is_pmd is set to true, you can see a big performance
> improvement; the prerequisite is that your tested veth interfaces are
> attached to different pmd threads.
>
> Signed-off-by: Yi Yang
> Co-authored-by: William Tu
> Signed-off-by: William Tu
> ---
>  acinclude.m4                     |  12 ++
>  configure.ac                     |   1 +
>  include/sparse/linux/if_packet.h | 111 +++
>  lib/dp-packet.c                  |  18 ++
>  lib/dp-packet.h                  |   9 +
>  lib/netdev-linux-private.h       |  26 +++
>  lib/netdev-linux.c               | 419 +--
>  7 files changed, 579 insertions(+), 17 deletions(-)
>
> Changelog:
> - v6->v7
>   * is_pmd is set to true for system interfaces

This cannot be done that simply and should not be done unconditionally anyway. netdev-linux is not thread safe in many ways. At least, stats accounting will be messed up.

The second thing is that this change will harm all the usual DPDK-based setups, since PMD threads will start making a lot of syscalls and sleeping inside the kernel, missing packets from the fast DPDK interfaces.

The third thing is that this change will fire up at least one PMD thread consuming 100% of a CPU constantly, even on setups where it's not needed.

So, this version is definitely not acceptable.

Best regards, Ilya Maximets.
[ovs-dev] Re: [PATCH v5] Use TPACKET_V3 to accelerate veth for userspace datapath
In the same environment, but using tap instead of veth, the retr number is 0 for the case without this patch (of course, I applied Flavio's tap enable patch):

vagrant@ubuntu1804:~$ sudo ./run-iperf3.sh
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 54572 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  12.6 GBytes  10.9 Gbits/sec    0   3.14 MBytes
[  4]  10.00-20.00  sec  12.8 GBytes  11.0 Gbits/sec    0   3.14 MBytes
[  4]  20.00-30.00  sec  10.2 GBytes  8.76 Gbits/sec    0   3.14 MBytes
[  4]  30.00-40.00  sec  10.0 GBytes  8.63 Gbits/sec    0   3.14 MBytes
[  4]  40.00-50.00  sec  10.4 GBytes  8.94 Gbits/sec    0   3.14 MBytes
[  4]  50.00-60.00  sec  10.8 GBytes  9.31 Gbits/sec    0   3.14 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  67.0 GBytes  9.59 Gbits/sec    0   sender
[  4]   0.00-60.00  sec  67.0 GBytes  9.59 Gbits/sec        receiver

Server output:
Accepted connection from 10.15.1.2, port 54570
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 54572
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  12.6 GBytes  10.9 Gbits/sec
[  5]  10.00-20.00  sec  12.8 GBytes  11.0 Gbits/sec
[  5]  20.00-30.00  sec  10.2 GBytes  8.76 Gbits/sec
[  5]  30.00-40.00  sec  10.0 GBytes  8.63 Gbits/sec
[  5]  40.00-50.00  sec  10.4 GBytes  8.94 Gbits/sec
[  5]  50.00-60.00  sec  10.8 GBytes  9.31 Gbits/sec
[  5]  60.00-60.00  sec  1.75 MBytes  9.25 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-60.00  sec  0.00 Bytes   0.00 bits/sec   sender
[  5]   0.00-60.00  sec  67.0 GBytes  9.59 Gbits/sec  receiver

iperf Done.
vagrant@ubuntu1804:~$

-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of William Tu
Sent: February 26, 2020 6:32
To: yang_y...@126.com
Cc: yang_y_yi ; ovs-dev
Subject: Re: [ovs-dev] [PATCH v5] Use TPACKET_V3 to accelerate veth for userspace datapath

On Mon, Feb 24, 2020 at 5:01 AM wrote:
>
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V3 and using
> DPDK-like poll to receive and send packets (Note: send still needs to
> call sendto to trigger final packet transmission).
>
> From Linux kernel 3.10 on, TPACKET_V3 has been supported, so all the
> Linux kernels current OVS supports can run TPACKET_V3 without any
> problem.
>
> I can see about 30% performance improvement for veth compared to the
> last recvmmsg optimization if I use TPACKET_V3; it is about 1.98 Gbps,
> but it was 1.47 Gbps before.
>
> TPACKET_V3 can support TSO; it works only if your kernel supports it.
> This has been verified on Ubuntu 18.04 5.3.0-40-generic. If you find
> the performance is very poor, please turn off TSO for veth interfaces
> in case userspace-tso-enable is set to true.

Did you test the performance with TSO enabled, using veth (like your run-iperf3.sh) and kernel 5.3? Without your patch, with TSO enabled, I can get around 6 Gbps. But with this patch, with TSO enabled, the performance drops to 1.9 Gbps.

Regards,
William
[ovs-dev] Re: [PATCHv2 1/2] userspace: Enable TSO support for non-DPDK.
William, which kernel version did you use to test this patch? I don't want to build a kernel if the Ubuntu 16.04 kernel works.

-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of William Tu
Sent: February 21, 2020 3:00
To: d...@openvswitch.org
Cc: f...@sysclose.org; i.maxim...@ovn.org
Subject: [ovs-dev] [PATCHv2 1/2] userspace: Enable TSO support for non-DPDK.

This patch enables TSO support for non-DPDK use cases, and also adds a check-system-tso testsuite. Before TSO, we had to disable checksum offload, allowing the kernel to calculate the TCP/UDP packet checksum. With TSO, we can skip the checksum validation by enabling checksum offload, and with large packet sizes we see better performance.

Consider the container-to-container use case:
  iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
I got around 6 Gbps, similar to TSO with DPDK enabled.

Tested-at: https://travis-ci.org/williamtu/ovs-travis/builds/653109097
Signed-off-by: William Tu
---
v2:
  - add make check-system-tso test
  - combine logging for dpdk and non-dpdk
  - I'm surprised that most of the test cases passed. This is due to few
    tests using tcp/udp, so TSO is not triggered. I saw only geneve/vxlan
    fail randomly; maybe we can check it later.
---
 lib/dp-packet.h               | 95 ++-
 lib/userspace-tso.c           |  5 ---
 tests/.gitignore              |  3 ++
 tests/automake.mk             | 15 +++
 tests/system-tso-macros.at    | 42 +++
 tests/system-tso-testsuite.at | 26 
 6 files changed, 143 insertions(+), 43 deletions(-)
 create mode 100644 tests/system-tso-macros.at
 create mode 100644 tests/system-tso-testsuite.at

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 9f8991faad52..6b90cec2afb4 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -53,7 +53,25 @@ enum OVS_PACKED_ENUM dp_packet_source {
 enum dp_packet_offload_mask {
     DP_PACKET_OL_RSS_HASH_MASK  = 0x1, /* Is the 'rss_hash' valid? */
     DP_PACKET_OL_FLOW_MARK_MASK = 0x2, /* Is the 'flow_mark' valid? */
+    DP_PACKET_OL_RX_L4_CKSUM_BAD = 1 << 3,
+    DP_PACKET_OL_RX_IP_CKSUM_BAD = 1 << 4,
+    DP_PACKET_OL_RX_L4_CKSUM_GOOD = 1 << 5,
+    DP_PACKET_OL_RX_IP_CKSUM_GOOD = 1 << 6,
+    DP_PACKET_OL_TX_TCP_SEG = 1 << 7,
+    DP_PACKET_OL_TX_IPV4 = 1 << 8,
+    DP_PACKET_OL_TX_IPV6 = 1 << 9,
+    DP_PACKET_OL_TX_TCP_CKSUM = 1 << 10,
+    DP_PACKET_OL_TX_UDP_CKSUM = 1 << 11,
+    DP_PACKET_OL_TX_SCTP_CKSUM = 1 << 12,
 };
+
+#define DP_PACKET_OL_TX_L4_MASK (DP_PACKET_OL_TX_TCP_CKSUM | \
+                                 DP_PACKET_OL_TX_UDP_CKSUM | \
+                                 DP_PACKET_OL_TX_SCTP_CKSUM)
+#define DP_PACKET_OL_RX_IP_CKSUM_MASK (DP_PACKET_OL_RX_IP_CKSUM_GOOD | \
+                                       DP_PACKET_OL_RX_IP_CKSUM_BAD)
+#define DP_PACKET_OL_RX_L4_CKSUM_MASK (DP_PACKET_OL_RX_L4_CKSUM_GOOD | \
+                                       DP_PACKET_OL_RX_L4_CKSUM_BAD)
 #else
 /* DPDK mbuf ol_flags that are not really an offload flags. These are mostly
  * related to mbuf memory layout and OVS should not touch/clear them. */
@@ -739,82 +757,79 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
     b->allocated_ = s;
 }

-/* There are no implementation when not DPDK enabled datapath. */
 static inline bool
-dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
+dp_packet_hwol_is_tso(const struct dp_packet *b)
 {
-    return false;
+    return !!(b->ol_flags & DP_PACKET_OL_TX_TCP_SEG);
 }

-/* There are no implementation when not DPDK enabled datapath. */
 static inline bool
-dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
+dp_packet_hwol_is_ipv4(const struct dp_packet *b)
 {
-    return false;
+    return !!(b->ol_flags & DP_PACKET_OL_TX_IPV4);
 }

-/* There are no implementation when not DPDK enabled datapath. */
 static inline uint64_t
-dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
+dp_packet_hwol_l4_mask(const struct dp_packet *b)
 {
-    return 0;
+    return b->ol_flags & DP_PACKET_OL_TX_L4_MASK;
 }

-/* There are no implementation when not DPDK enabled datapath. */
 static inline bool
-dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
+dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
 {
-    return false;
+    return (b->ol_flags & DP_PACKET_OL_TX_L4_MASK) ==
+            DP_PACKET_OL_TX_TCP_CKSUM;
 }

-/* There are no implementation when not DPDK enabled datapath. */
 static inline bool
-dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
+dp_packet_hwol_l4_is_udp(const struct dp_packet *b)
 {
-    return false;
+    return (b->ol_flags & DP_PACKET_OL_TX_L4_MASK) ==
+            DP_PACKET_OL_TX_UDP_CKSUM;
 }

-/* There are no implementation when not DPDK enabled datapath. */
 static inline bool
-dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
+dp_packet_hwol_l4_is_sctp(const struct dp_packet *b)
 {
-    return false;
+    return
[ovs-dev] Re: [PATCH v4 0/3] Add support for TSO with DPDK
Hi, Flavio

I find this TSO feature doesn't work normally on my Ubuntu 16.04; here is my result. My kernel version is:

$ uname -a
Linux cmp008 4.15.0-55-generic #60~16.04.2-Ubuntu SMP Thu Jul 4 09:03:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ ./run-iperf3.sh
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 56466 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  7.05 MBytes  5.91 Mbits/sec  2212   5.66 KBytes
[  4]  10.00-20.00  sec  7.67 MBytes  6.44 Mbits/sec  2484   5.66 KBytes
[  4]  20.00-30.00  sec  7.77 MBytes  6.52 Mbits/sec  2500   5.66 KBytes
[  4]  30.00-40.00  sec  7.77 MBytes  6.52 Mbits/sec  2490   5.66 KBytes
[  4]  40.00-50.00  sec  7.76 MBytes  6.51 Mbits/sec  2500   5.66 KBytes
[  4]  50.00-60.00  sec  7.79 MBytes  6.54 Mbits/sec  2504   5.66 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  45.8 MBytes  6.40 Mbits/sec  14690   sender
[  4]   0.00-60.00  sec  45.7 MBytes  6.40 Mbits/sec          receiver

Server output:
Accepted connection from 10.15.1.2, port 56464
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 56466
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  6.90 MBytes  5.79 Mbits/sec
[  5]  10.00-20.00  sec  7.71 MBytes  6.47 Mbits/sec
[  5]  20.00-30.00  sec  7.73 MBytes  6.48 Mbits/sec
[  5]  30.00-40.00  sec  7.79 MBytes  6.53 Mbits/sec
[  5]  40.00-50.00  sec  7.79 MBytes  6.53 Mbits/sec
[  5]  50.00-60.00  sec  7.79 MBytes  6.54 Mbits/sec

iperf Done.
$

But it does work for tap, so I'm not sure if it is a kernel issue. Which kernel version are you using? I didn't use the tpacket_v3 patch. Here is my local OVS info:

$ git log
commit 1223cf123ed141c0a0110ebed17572bdb2e3d0f4
Author: Ilya Maximets
Date: Thu Feb 6 14:24:23 2020 +0100

    netdev-dpdk: Don't enable offloading on HW device if not requested.

    DPDK drivers have different implementations of transmit functions.
    Enabled offloading may cause the driver to choose a slower variant,
    significantly affecting performance if userspace TSO wasn't requested.

    Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
    Reported-by: David Marchand
    Acked-by: David Marchand
    Acked-by: Flavio Leitner
    Acked-by: Kevin Traynor
    Signed-off-by: Ilya Maximets

commit 73858f9dbe83daf8cc8d4b604acc23eb62cc3f52
Author: Flavio Leitner
Date: Mon Feb 3 18:45:50 2020 -0300

    netdev-linux: Prepend the std packet in the TSO packet

    Usually TSO packets are close to 50k, 60k bytes long, so to copy
    fewer bytes when receiving a packet from the kernel, change the
    approach. Instead of extending the MTU sized packet received and
    appending the remaining TSO data from the TSO buffer, allocate a
    TSO packet with enough headroom to prepend the std packet data.

    Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
    Suggested-by: Ben Pfaff
    Signed-off-by: Flavio Leitner
    Signed-off-by: Ben Pfaff

commit 2297cbe6cc25b6b1862c499ce8f16f52f75d9e5f
Author: Flavio Leitner
Date: Mon Feb 3 11:22:22 2020 -0300

    netdev-linux-private: fix max length to be 16 bits

    The dp_packet length is limited to 16 bits, so document that and fix
    the length value accordingly.

    Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
    Signed-off-by: Flavio Leitner
    Signed-off-by: Ben Pfaff

commit 3d6a6f450af5b7eaf4b532983cb14458ae792b72
Author: David Marchand
Date: Tue Feb 4 22:28:26 2020 +0100

    netdev-dpdk: Fix port init when lacking Tx offloads for TSO.

    The check on TSO capability did not ensure ip checksum, tcp checksum
    and TSO tx offloads were available, which resulted in a port init
    failure (example below with a ena device):

    2020-02-04T17:42:52.976Z|00084|dpdk|ERR|Ethdev port_id=0 requested
    Tx offloads 0x2a doesn't match Tx offloads capabilities 0xe in
    rte_eth_dev_configure()

    Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
    Reported-by: Ravi Kerur
    Signed-off-by: David Marchand
    Acked-by: Kevin Traynor
    Acked-by: Flavio Leitner
    Signed-off-by: Ilya Maximets

commit 8e371aa497aa95e3562d53f566c2d634b4b0f589
Author: Kirill A. Kornilov
Date: Mon Jan 13 12:29:10 2020 +0300

    vswitchd: Add serial number configuration.

    Signed-off-by: Kirill A. Kornilov
    Signed-off-by: Ben Pfaff

I applied your tap patch:

$ git diff
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index c6f3d27..74a5728 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1010,6 +1010,23 @@ netdev_linux_construct_tap(struct netdev *netdev_)
         goto error_close;
     }

+    if (userspace_tso_enabled()) {
+        /* Old kernels don't support TUNSETOFFLOAD. If TUNSETOFFLOAD is
+         * available, it will return EINVAL when a flag is unknown.
+
[ovs-dev] Re: [PATCH v4] Use TPACKET_V3 to accelerate veth for userspace datapath
Ilya, thank you so much for your comments; I'll fix them in the next version. As for TSO support, this patch works functionally, but I checked the tpacket_v3 kernel code, and I don't think the tpacket_v3 kernel part can support it; my test result also showed very bad performance when userspace-tso-enable is set to true. For the case you describe, the current tpacket_v3 can't reach very good performance.

I have a question: in an OpenStack and VXLAN scenario, can userspace-tso-enable be set to true? Per the TSO patch series comments, it can't support such a case, so I'm thinking about how we can trade off TSO and tpacket_v3. From my perspective, tpacket_v3 is the best choice for the cases where TSO can't work. I hope OpenStack can get good performance when it uses the new OVS DPDK version without needing any change on the Neutron and OVS agent side. It would be better if we could work out a good way to cover both use cases.

-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
Sent: February 17, 2020 21:12
To: ovs-dev@openvswitch.org
Cc: yang_y...@163.com; yang_y...@126.com; i.maxim...@ovn.org
Subject: Re: [ovs-dev] [PATCH v4] Use TPACKET_V3 to accelerate veth for userspace datapath

On 2/16/20 2:10 AM, yang_y...@126.com wrote:
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V3 and using
> DPDK-like poll to receive and send packets (Note: send still needs to
> call sendto to trigger final packet transmission).
>
> From Linux kernel 3.10 on, TPACKET_V3 has been supported, so all the
> Linux kernels current OVS supports can run TPACKET_V3 without any
> problem.
>
> I can see about 30% performance improvement for veth compared to the
> last recvmmsg optimization if I use TPACKET_V3; it is about 1.98 Gbps,
> but it was 1.47 Gbps before.
>
> Note: Linux kernel TPACKET_V3 can't support TSO, so the performance is
> very poor; please turn off TSO for veth interfaces in case
> userspace-tso-enable is set to true.

So, does this patch support TSO or not? What if I want to have TSO support AND good performance?

I didn't review the code, but I have some patch-wide style comments:

1. Comments in code should be complete sentences, i.e. start with a capital letter and end with a period.
2. Don't parenthesize arguments of sizeof if possible.

Best regards, Ilya Maximets.
[ovs-dev] Re: [PATCH v3] Use TPACKET_V3 to accelerate veth for userspace datapath
Hi, William

I checked the sparse check errors on my local machine; the new v4 version should fix them, so please use v4, thanks a lot. https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367883.html

diff --git a/include/sparse/linux/if_packet.h b/include/sparse/linux/if_packet.h
index 503bade..d6a9fb0 100644
--- a/include/sparse/linux/if_packet.h
+++ b/include/sparse/linux/if_packet.h
@@ -5,6 +5,7 @@
 #error "Use this header only with sparse. It is not a correct implementation."
 #endif

+#include
 #include_next

 /* Fix endianness of 'spkt_protocol' and 'sll_protocol' members. */
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 49b6aa4..c275a64 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1139,7 +1139,7 @@ tpacket_mmap_rx_tx_ring(int sock, struct tpacket_ring *rx_ring,
 {
     int i;

-    rx_ring->mm_space = mmap(0, rx_ring->mm_len + tx_ring->mm_len,
+    rx_ring->mm_space = mmap(NULL, rx_ring->mm_len + tx_ring->mm_len,
                              PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock, 0);
     if (rx_ring->mm_space == MAP_FAILED) {
@@ -1194,7 +1194,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     };

     /* Create file descriptor. */
-    rx->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+    rx->fd = socket(PF_PACKET, SOCK_RAW, (OVS_FORCE int) htons(ETH_P_ALL));
     if (rx->fd < 0) {
         error = errno;
         VLOG_ERR("failed to create raw socket (%s)", ovs_strerror(error));
@@ -1282,7 +1282,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
     sll.sll_halen = 0;
 #endif
     sll.sll_ifindex = ifindex;
-    sll.sll_protocol = htons(ETH_P_ALL);
+    sll.sll_protocol = (OVS_FORCE ovs_be16) htons(ETH_P_ALL);
     if (bind(rx->fd, (struct sockaddr *) , sizeof sll) < 0) {
         error = errno;
         VLOG_ERR("%s: failed to bind raw socket (%s)",

-----Original Message-----
From: Yi Yang - Cloud Service Group
Sent: February 15, 2020 12:09
To: 'yang_y...@126.com' ; 'ovs-dev@openvswitch.org'
Cc: 'b...@ovn.org' ; 'ian.sto...@intel.com' ; 'yang_y...@163.com'
Subject: Re: [PATCH v3] Use TPACKET_V3 to accelerate veth for userspace datapath
Importance: High

William, I don't know why I can't receive your comments in my Outlook: https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367860.html. I don't know how to check the Travis build issue; can you help provide a quick guide so that I can fix it?

-----Original Message-----
From: yang_y...@126.com [mailto:yang_y...@126.com]
Sent: February 11, 2020 18:22
To: ovs-dev@openvswitch.org
Cc: b...@ovn.org; ian.sto...@intel.com; Yi Yang - Cloud Service Group ; yang_y...@163.com
Subject: [PATCH v3] Use TPACKET_V3 to accelerate veth for userspace datapath

From: Yi Yang

We can avoid high system call overhead by using TPACKET_V3 and using
DPDK-like poll to receive and send packets (Note: send still needs to
call sendto to trigger final packet transmission).

From Linux kernel 3.10 on, TPACKET_V3 has been supported, so all the
Linux kernels current OVS supports can run TPACKET_V3 without any
problem.

I can see about 30% performance improvement for veth compared to the
last recvmmsg optimization if I use TPACKET_V3; it is about 1.98 Gbps,
but it was 1.47 Gbps before.

Note: Linux kernel TPACKET_V3 can't support TSO, so the performance is
very poor; please turn off TSO for veth interfaces in case
userspace-tso-enable is set to true.

Signed-off-by: Yi Yang
Co-authored-by: William Tu
Signed-off-by: William Tu
---
 acinclude.m4                     |  12 ++
 configure.ac                     |   1 +
 include/linux/automake.mk        |   1 +
 include/linux/if_packet.h        | 126 +++++
 include/sparse/linux/if_packet.h | 108 +++
 lib/netdev-linux-private.h       |  22 +++
 lib/netdev-linux.c               | 375 ++-
 7 files changed, 640 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/if_packet.h

Changelog:
- v2->v3
  * Fix build issues in case HAVE_TPACKET_V3 is not defined
  * Add tso-related support code
  * Make sure it can work normally in case userspace-tso-enable is true
- v1->v2
  * Remove TPACKET_V1 and TPACKET_V2, which are obsolete
  * Add include/linux/if_packet.h
  * Change include/sparse/linux/if_packet.h

diff --git a/acinclude.m4 b/acinclude.m4
index 1212a46..b39bbb9 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1093,6 +1093,18 @@ AC_DEFUN([OVS_CHECK_IF_DL],
      AC_SEARCH_LIBS([pcap_open_live], [pcap])
    fi])

+dnl OVS_CHECK_LINUX_TPACKET
+dnl
+dnl Configure Linux TPACKET.
+AC_DEFUN([OVS_CHECK_LINUX_TPACKET], [
+  AC_COMPILE_IFELSE([
+    AC_LANG_PROGRAM([#include ], [
+        struct tpacket3_hdr x = { 0 };
+    ])],
+    [AC_DEFINE([HAVE_TPACKET_V3], [1],
+    [Define to 1 if struct tpacket3_hdr is available.])])
+])
+
 dnl Checks for buggy strtok_r.
 dnl
 dnl Some versions of glibc 2.7
[ovs-dev] Re: [PATCH v3] Use TPACKET_V3 to accelerate veth for userspace datapath
William, I don't know why I can't receive your comments in my Outlook: https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367860.html. I don't know how to check the Travis build issue; can you help provide a quick guide so that I can fix it?

-----Original Message-----
From: yang_y...@126.com [mailto:yang_y...@126.com]
Sent: February 11, 2020 18:22
To: ovs-dev@openvswitch.org
Cc: b...@ovn.org; ian.sto...@intel.com; Yi Yang - Cloud Service Group ; yang_y...@163.com
Subject: [PATCH v3] Use TPACKET_V3 to accelerate veth for userspace datapath

From: Yi Yang

We can avoid high system call overhead by using TPACKET_V3 and using
DPDK-like poll to receive and send packets (Note: send still needs to
call sendto to trigger final packet transmission).

From Linux kernel 3.10 on, TPACKET_V3 has been supported, so all the
Linux kernels current OVS supports can run TPACKET_V3 without any
problem.

I can see about 30% performance improvement for veth compared to the
last recvmmsg optimization if I use TPACKET_V3; it is about 1.98 Gbps,
but it was 1.47 Gbps before.

Note: Linux kernel TPACKET_V3 can't support TSO, so the performance is
very poor; please turn off TSO for veth interfaces in case
userspace-tso-enable is set to true.

Signed-off-by: Yi Yang
Co-authored-by: William Tu
Signed-off-by: William Tu
---
 acinclude.m4                     |  12 ++
 configure.ac                     |   1 +
 include/linux/automake.mk        |   1 +
 include/linux/if_packet.h        | 126 +++++
 include/sparse/linux/if_packet.h | 108 +++
 lib/netdev-linux-private.h       |  22 +++
 lib/netdev-linux.c               | 375 ++-
 7 files changed, 640 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/if_packet.h

Changelog:
- v2->v3
  * Fix build issues in case HAVE_TPACKET_V3 is not defined
  * Add tso-related support code
  * Make sure it can work normally in case userspace-tso-enable is true
- v1->v2
  * Remove TPACKET_V1 and TPACKET_V2, which are obsolete
  * Add include/linux/if_packet.h
  * Change include/sparse/linux/if_packet.h

diff --git a/acinclude.m4 b/acinclude.m4
index 1212a46..b39bbb9 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1093,6 +1093,18 @@ AC_DEFUN([OVS_CHECK_IF_DL],
      AC_SEARCH_LIBS([pcap_open_live], [pcap])
    fi])

+dnl OVS_CHECK_LINUX_TPACKET
+dnl
+dnl Configure Linux TPACKET.
+AC_DEFUN([OVS_CHECK_LINUX_TPACKET], [
+  AC_COMPILE_IFELSE([
+    AC_LANG_PROGRAM([#include ], [
+        struct tpacket3_hdr x = { 0 };
+    ])],
+    [AC_DEFINE([HAVE_TPACKET_V3], [1],
+    [Define to 1 if struct tpacket3_hdr is available.])])
+])
+
 dnl Checks for buggy strtok_r.
 dnl
 dnl Some versions of glibc 2.7 has a bug in strtok_r when compiling
diff --git a/configure.ac b/configure.ac
index 1877aae..b61a1f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -89,6 +89,7 @@ OVS_CHECK_VISUAL_STUDIO_DDK
 OVS_CHECK_COVERAGE
 OVS_CHECK_NDEBUG
 OVS_CHECK_NETLINK
+OVS_CHECK_LINUX_TPACKET
 OVS_CHECK_OPENSSL
 OVS_CHECK_LIBCAPNG
 OVS_CHECK_LOGDIR
diff --git a/include/linux/automake.mk b/include/linux/automake.mk
index 8f063f4..a659e65 100644
--- a/include/linux/automake.mk
+++ b/include/linux/automake.mk
@@ -1,4 +1,5 @@
 noinst_HEADERS += \
+	include/linux/if_packet.h \
 	include/linux/netlink.h \
 	include/linux/netfilter/nf_conntrack_sctp.h \
 	include/linux/pkt_cls.h \
diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
new file mode 100644
index 000..34c5747
--- /dev/null
+++ b/include/linux/if_packet.h
@@ -0,0 +1,126 @@
+#ifndef __LINUX_IF_PACKET_WRAPPER_H
+#define __LINUX_IF_PACKET_WRAPPER_H 1
+
+#ifdef HAVE_TPACKET_V3
+#include_next
+#else
+#define HAVE_TPACKET_V3 1
+
+struct sockaddr_pkt {
+    unsigned short spkt_family;
+    unsigned char  spkt_device[14];
+    uint16_t       spkt_protocol;
+};
+
+struct sockaddr_ll {
+    unsigned short sll_family;
+    uint16_t       sll_protocol;
+    int            sll_ifindex;
+    unsigned short sll_hatype;
+    unsigned char  sll_pkttype;
+    unsigned char  sll_halen;
+    unsigned char  sll_addr[8];
+};
+
+/* Packet types */
+#define PACKET_HOST 0 /* To us */
+
+/* Packet socket options */
+#define PACKET_RX_RING  5
+#define PACKET_VERSION  10
+#define PACKET_TX_RING  13
+#define PACKET_VNET_HDR 15
+
+/* Rx ring - header status */
+#define TP_STATUS_KERNEL          0
+#define TP_STATUS_USER            (1 << 0)
+#define TP_STATUS_VLAN_VALID      (1 << 4) /* auxdata has valid tp_vlan_tci */
+#define TP_STATUS_VLAN_TPID_VALID (1 << 6) /* auxdata has valid tp_vlan_tpid */
+
+/* Tx ring - header status */
+#define TP_STATUS_SEND_REQUEST (1 << 0)
+#define TP_STATUS_SENDING      (1 << 1)
+
+struct tpacket_hdr {
+    unsigned long tp_status;
+    unsigned int  tp_len;
+    unsigned int  tp_snaplen;
+    unsigned s
[ovs-dev] Re: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet
Hi, Flavio

With this one patch and the previously merged TSO-related patches, can veth work with "ethtool -K vethX tx on"? I could never figure out why veth works in the DPDK datapath when tx offload features are on; it looks like you're fixing this big issue, right? The tap interface can't support TSO; do you Red Hat folks have plans to enable it on the kernel side?

-----Original Message-----
From: Flavio Leitner [mailto:f...@sysclose.org]
Sent: February 4, 2020 5:46
To: d...@openvswitch.org
Cc: Stokes Ian ; Loftus Ciara ; Ilya Maximets ; Yi Yang - Cloud Service Group ; txfh2007 ; Ben Pfaff ; Flavio Leitner
Subject: [PATCH v2] netdev-linux: Prepend the std packet in the TSO packet

Usually TSO packets are close to 50k, 60k bytes long, so to copy fewer
bytes when receiving a packet from the kernel, change the approach.
Instead of extending the MTU sized packet received and appending the
remaining TSO data from the TSO buffer, allocate a TSO packet with
enough headroom to prepend the std packet data.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Suggested-by: Ben Pfaff
Signed-off-by: Flavio Leitner
---
 lib/dp-packet.c            |   8 +--
 lib/dp-packet.h            |   2 +
 lib/netdev-linux-private.h |   3 +-
 lib/netdev-linux.c         | 117 ++---
 4 files changed, 78 insertions(+), 52 deletions(-)

V2:
  - tso packets tailroom depends on headroom in netdev_linux_rxq_recv()
  - iov_len uses packet's tailroom.

This patch depends on a previously posted patch to work:
  Subject: netdev-linux-private: fix max length to be 16 bits
  https://mail.openvswitch.org/pipermail/ovs-dev/2020-February/367469.html

With both patches applied, I can run iperf3 and scp in both directions
with good performance and no issues.

diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 8dfedcb7c..cd2623500 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -243,8 +243,8 @@ dp_packet_copy__(struct dp_packet *b, uint8_t *new_base,

 /* Reallocates 'b' so that it has exactly 'new_headroom' and 'new_tailroom'
  * bytes of headroom and tailroom, respectively. */
-static void
-dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
+void
+dp_packet_resize(struct dp_packet *b, size_t new_headroom, size_t new_tailroom)
 {
     void *new_base, *new_data;
     size_t new_allocated;
@@ -297,7 +297,7 @@ void
 dp_packet_prealloc_tailroom(struct dp_packet *b, size_t size)
 {
     if (size > dp_packet_tailroom(b)) {
-        dp_packet_resize__(b, dp_packet_headroom(b), MAX(size, 64));
+        dp_packet_resize(b, dp_packet_headroom(b), MAX(size, 64));
     }
 }

@@ -308,7 +308,7 @@ void
 dp_packet_prealloc_headroom(struct dp_packet *b, size_t size)
 {
     if (size > dp_packet_headroom(b)) {
-        dp_packet_resize__(b, MAX(size, 64), dp_packet_tailroom(b));
+        dp_packet_resize(b, MAX(size, 64), dp_packet_tailroom(b));
     }
 }

diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index 69ae5dfac..9a9d35183 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -152,6 +152,8 @@ struct dp_packet *dp_packet_clone_with_headroom(const struct dp_packet *,
 struct dp_packet *dp_packet_clone_data(const void *, size_t);
 struct dp_packet *dp_packet_clone_data_with_headroom(const void *, size_t,
                                                      size_t headroom);
+void dp_packet_resize(struct dp_packet *b, size_t new_headroom,
+                      size_t new_tailroom);
 static inline void dp_packet_delete(struct dp_packet *);
 static inline void *dp_packet_at(const struct dp_packet *, size_t offset,
diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
index be2d7b10b..c7c515f70 100644
--- a/lib/netdev-linux-private.h
+++ b/lib/netdev-linux-private.h
@@ -45,7 +45,8 @@ struct netdev_rxq_linux {
     struct netdev_rxq up;
     bool is_tap;
     int fd;
-    char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
+    struct dp_packet *aux_bufs[NETDEV_MAX_BURST]; /* Preallocated TSO
+                                                     packets. */
 };

 int netdev_linux_construct(struct netdev *);
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6add3e2fc..c6f3d2740 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1052,15 +1052,6 @@ static struct netdev_rxq *
 netdev_linux_rxq_alloc(void)
 {
     struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
-    if (userspace_tso_enabled()) {
-        int i;
-
-        /* Allocate auxiliay buffers to receive TSO packets. */
-        for (i = 0; i < NETDEV_MAX_BURST; i++) {
-            rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
-        }
-    }
-
     return &rx->up;
 }

@@ -1172,7 +1163,7 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
     }

     for (i = 0; i < NETDEV_MAX_BURST; i++) {
-        free(rx->aux_bufs[i]);
+        dp_packet_delete(rx->aux_bufs[i]);
     }
 }

@@ -1238,13 +1229,18 @@ netdev_linux_batch_rxq_re
[ovs-dev] Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
William, sorry for the late reply. About your question in https://mail.openvswitch.org/pipermail/ovs-dev/2020-January/367133.html, the af_packet I was referring to is DPDK af_packet; its interface type is dpdk, so its performance can be up to 4.00 Gbps. For a non-DPDK interface the handling thread is ovs-vswitchd rather than pmd_thread, so TPACKET_V3 can only reach 1.98 Gbps; 1.47 Gbps is for my last patch that Ben merged (recvmmsg for batch receiving).

>Hi Yiyang,
>
>I don't understand these three numbers.
>Don't you also use af_packet for veth for 1.47 Gbps and 1.98 Gbps?
>What's the difference between your 4.00 Gbps and 1.98 Gbps?
>
>William

-----Original Message-----
From: Yi Yang (Cloud Service Group)
Sent: 2020-02-03 12:06
To: 'u9012...@gmail.com'; 'b...@ovn.org'; 'yang_y...@163.com'
Cc: 'ovs-dev@openvswitch.org'; 'ian.sto...@intel.com'
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
Importance: High

Hi, William

Sorry for the late reply, I don't know why I can never get your comment emails in my Outlook; Ben's comments arrive fine, and I can't see your comments in the Outlook junk box either. About your comments in https://mail.openvswitch.org/pipermail/ovs-dev/2020-January/367146.html, I checked on my CentOS 7, which has a 3.10.0 kernel, and the TPACKET_V3 sample code works, so I'm OK with removing the V1 code.

>Hi Yiyang,
>
>Can we just implement TPACKET v3, and drop v2 and v1?
>V3 is supported since kernel 3.10,
>commit f6fb8f100b807378fda19e83e5ac6828b638603a
>Author: chetan loke
>Date: Fri Aug 19 10:18:16 2011
>
>af-packet: TPACKET_V3 flexible buffer implementation.
>
>and based on OVS release
>http://docs.openvswitch.org/en/latest/faq/releases/
>after OVS 2.12, the minimum kernel requirement is 3.10.
>
>Regards,
>William

-----Original Message-----
From: Yi Yang (Cloud Service Group)
Sent: 2020-02-03 10:36
To: 'b...@ovn.org'; 'yang_y...@163.com'
Cc: 'ovs-dev@openvswitch.org'; 'ian.sto...@intel.com'
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
Importance: High

Hi, all

The current tap, internal, and system interfaces aren't handled by pmd_thread, so their performance can't be boosted very high. I ran a very simple test just by setting is_pmd to true for them; below is my data for veth (using TPACKET_V3). You can see pmd_thread is obviously much better than ovs-vswitchd, compared with my previous 1.98 Gbps. My question is whether we can set is_pmd to true by default; I'll set is_pmd to true in the next version if there is no objection.

$ sudo ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 59590 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  3.59 GBytes  3.09 Gbits/sec    0   3.04 MBytes
[  4]  10.00-20.00  sec  3.57 GBytes  3.06 Gbits/sec    0   3.04 MBytes
[  4]  20.00-30.00  sec  3.60 GBytes  3.09 Gbits/sec    0   3.04 MBytes
[  4]  30.00-40.00  sec  3.56 GBytes  3.06 Gbits/sec    0   3.04 MBytes
[  4]  40.00-50.00  sec  3.64 GBytes  3.12 Gbits/sec    0   3.04 MBytes
[  4]  50.00-60.00  sec  3.62 GBytes  3.11 Gbits/sec    0   3.04 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  21.6 GBytes  3.09 Gbits/sec    0   sender
[  4]   0.00-60.00  sec  21.6 GBytes  3.09 Gbits/sec        receiver

Server output:
-----------------------------------------------------------
Accepted connection from 10.15.1.2, port 59588
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 59590
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  3.57 GBytes  3.07 Gbits/sec
[  5]  10.00-20.00  sec  3.57 GBytes  3.06 Gbits/sec
[  5]  20.00-30.00  sec  3.60 GBytes  3.09 Gbits/sec
[  5]  30.00-40.00  sec  3.56 GBytes  3.06 Gbits/sec
[  5]  40.00-50.00  sec  3.64 GBytes  3.12 Gbits/sec
[  5]  50.00-60.00  sec  3.62 GBytes  3.11 Gbits/sec

iperf Done.
eipadmin@cmp008:~$

-----Original Message-----
From: Ben Pfaff [mailto:b...@ovn.org]
Sent: 2020-01-22 3:26
To: yang_y...@163.com
Cc: ovs-dev@openvswitch.org; ian.sto...@intel.com; Yi Yang (Cloud Service Group)
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

On Tue, Jan 21, 2020 at 02:49:47AM -0500, yang_y...@163.com wrote:
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V1/V2/V3 and
> use DPDK-like poll to receive and send packets (Note: send still needs
> to call sendto to trigger final packet transmission).
>
> I can see about 30% improvement compared to the last recvmmsg
> optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> TPACKET_V3, but it still can improve about 20%.
>
> For veth, it is 1.47 Gbps before this patch and about 1.98 Gbps
> after applying this patch. But it is about 4.00 Gbps if we use
> af_packet for veth; the bottleneck lies in the ovs-vswitchd thread, it
[ovs-dev] Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
Hi, William

Sorry for the late reply, I don't know why I can never get your comment emails in my Outlook; Ben's comments arrive fine, and I can't see your comments in the Outlook junk box either. About your comments in https://mail.openvswitch.org/pipermail/ovs-dev/2020-January/367146.html, I checked on my CentOS 7, which has a 3.10.0 kernel, and the TPACKET_V3 sample code works, so I'm OK with removing the V1 code.

>Hi Yiyang,
>
>Can we just implement TPACKET v3, and drop v2 and v1?
>V3 is supported since kernel 3.10,
>commit f6fb8f100b807378fda19e83e5ac6828b638603a
>Author: chetan loke
>Date: Fri Aug 19 10:18:16 2011
>
>af-packet: TPACKET_V3 flexible buffer implementation.
>
>and based on OVS release
>http://docs.openvswitch.org/en/latest/faq/releases/
>after OVS 2.12, the minimum kernel requirement is 3.10.
>
>Regards,
>William

-----Original Message-----
From: Yi Yang (Cloud Service Group)
Sent: 2020-02-03 10:36
To: 'b...@ovn.org'; 'yang_y...@163.com'
Cc: 'ovs-dev@openvswitch.org'; 'ian.sto...@intel.com'
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
Importance: High

Hi, all

The current tap, internal, and system interfaces aren't handled by pmd_thread, so their performance can't be boosted very high. I ran a very simple test just by setting is_pmd to true for them; below is my data for veth (using TPACKET_V3). You can see pmd_thread is obviously much better than ovs-vswitchd, compared with my previous 1.98 Gbps. My question is whether we can set is_pmd to true by default; I'll set is_pmd to true in the next version if there is no objection.
$ sudo ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 59590 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  3.59 GBytes  3.09 Gbits/sec    0   3.04 MBytes
[  4]  10.00-20.00  sec  3.57 GBytes  3.06 Gbits/sec    0   3.04 MBytes
[  4]  20.00-30.00  sec  3.60 GBytes  3.09 Gbits/sec    0   3.04 MBytes
[  4]  30.00-40.00  sec  3.56 GBytes  3.06 Gbits/sec    0   3.04 MBytes
[  4]  40.00-50.00  sec  3.64 GBytes  3.12 Gbits/sec    0   3.04 MBytes
[  4]  50.00-60.00  sec  3.62 GBytes  3.11 Gbits/sec    0   3.04 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  21.6 GBytes  3.09 Gbits/sec    0   sender
[  4]   0.00-60.00  sec  21.6 GBytes  3.09 Gbits/sec        receiver

Server output:
-----------------------------------------------------------
Accepted connection from 10.15.1.2, port 59588
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 59590
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  3.57 GBytes  3.07 Gbits/sec
[  5]  10.00-20.00  sec  3.57 GBytes  3.06 Gbits/sec
[  5]  20.00-30.00  sec  3.60 GBytes  3.09 Gbits/sec
[  5]  30.00-40.00  sec  3.56 GBytes  3.06 Gbits/sec
[  5]  40.00-50.00  sec  3.64 GBytes  3.12 Gbits/sec
[  5]  50.00-60.00  sec  3.62 GBytes  3.11 Gbits/sec

iperf Done.

eipadmin@cmp008:~$

-----Original Message-----
From: Ben Pfaff [mailto:b...@ovn.org]
Sent: 2020-01-22 3:26
To: yang_y...@163.com
Cc: ovs-dev@openvswitch.org; ian.sto...@intel.com; Yi Yang (Cloud Service Group)
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

On Tue, Jan 21, 2020 at 02:49:47AM -0500, yang_y...@163.com wrote:
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V1/V2/V3 and
> use DPDK-like poll to receive and send packets (Note: send still needs
> to call sendto to trigger final packet transmission).
>
> I can see about 30% improvement compared to the last recvmmsg
> optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> TPACKET_V3, but it still can improve about 20%.
>
> For veth, it is 1.47 Gbps before this patch and about 1.98 Gbps
> after applying this patch. But it is about 4.00 Gbps if we use
> af_packet for veth; the bottleneck lies in the ovs-vswitchd thread,
> which handles too many things in every loop (as below), so it can't
> work as efficiently as pmd_thread.
>
>     memory_run();
>     bridge_run();
>     unixctl_server_run(unixctl);
>     netdev_run();
>
>     memory_wait();
>     bridge_wait();
>     unixctl_server_wait(unixctl);
>     netdev_wait();
>     poll_block();
>
> In the next step, it will be better to let pmd_thread handle tap
> and veth interfaces.
>
> Signed-off-by: Yi Yang
> Co-authored-by: William Tu
> Signed-off-by: William Tu

Thanks for the patch!

I am a bit concerned about version compatibility issues here. There are two relevant kinds of versions. The first is the version of the kernel/library headers. This patch works pretty hard to adapt to the headers that are available at compile time, only dealing with the versions of the
[ovs-dev] Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
Hi, all

The current tap, internal, and system interfaces aren't handled by pmd_thread, so their performance can't be boosted very high. I ran a very simple test just by setting is_pmd to true for them; below is my data for veth (using TPACKET_V3). You can see pmd_thread is obviously much better than ovs-vswitchd, compared with my previous 1.98 Gbps. My question is whether we can set is_pmd to true by default; I'll set is_pmd to true in the next version if there is no objection.

$ sudo ip netns exec ns01 iperf3 -t 60 -i 10 -c 10.15.1.3 --get-server-output
Connecting to host 10.15.1.3, port 5201
[  4] local 10.15.1.2 port 59590 connected to 10.15.1.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.00  sec  3.59 GBytes  3.09 Gbits/sec    0   3.04 MBytes
[  4]  10.00-20.00  sec  3.57 GBytes  3.06 Gbits/sec    0   3.04 MBytes
[  4]  20.00-30.00  sec  3.60 GBytes  3.09 Gbits/sec    0   3.04 MBytes
[  4]  30.00-40.00  sec  3.56 GBytes  3.06 Gbits/sec    0   3.04 MBytes
[  4]  40.00-50.00  sec  3.64 GBytes  3.12 Gbits/sec    0   3.04 MBytes
[  4]  50.00-60.00  sec  3.62 GBytes  3.11 Gbits/sec    0   3.04 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  21.6 GBytes  3.09 Gbits/sec    0   sender
[  4]   0.00-60.00  sec  21.6 GBytes  3.09 Gbits/sec        receiver

Server output:
-----------------------------------------------------------
Accepted connection from 10.15.1.2, port 59588
[  5] local 10.15.1.3 port 5201 connected to 10.15.1.2 port 59590
[ ID] Interval           Transfer     Bandwidth
[  5]   0.00-10.00  sec  3.57 GBytes  3.07 Gbits/sec
[  5]  10.00-20.00  sec  3.57 GBytes  3.06 Gbits/sec
[  5]  20.00-30.00  sec  3.60 GBytes  3.09 Gbits/sec
[  5]  30.00-40.00  sec  3.56 GBytes  3.06 Gbits/sec
[  5]  40.00-50.00  sec  3.64 GBytes  3.12 Gbits/sec
[  5]  50.00-60.00  sec  3.62 GBytes  3.11 Gbits/sec

iperf Done.
eipadmin@cmp008:~$

-----Original Message-----
From: Ben Pfaff [mailto:b...@ovn.org]
Sent: 2020-01-22 3:26
To: yang_y...@163.com
Cc: ovs-dev@openvswitch.org; ian.sto...@intel.com; Yi Yang (Cloud Service Group)
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

On Tue, Jan 21, 2020 at 02:49:47AM -0500, yang_y...@163.com wrote:
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V1/V2/V3 and
> use DPDK-like poll to receive and send packets (Note: send still needs
> to call sendto to trigger final packet transmission).
>
> I can see about 30% improvement compared to the last recvmmsg
> optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> TPACKET_V3, but it still can improve about 20%.
>
> For veth, it is 1.47 Gbps before this patch and about 1.98 Gbps
> after applying this patch. But it is about 4.00 Gbps if we use
> af_packet for veth; the bottleneck lies in the ovs-vswitchd thread,
> which handles too many things in every loop (as below), so it can't
> work as efficiently as pmd_thread.
>
>     memory_run();
>     bridge_run();
>     unixctl_server_run(unixctl);
>     netdev_run();
>
>     memory_wait();
>     bridge_wait();
>     unixctl_server_wait(unixctl);
>     netdev_wait();
>     poll_block();
>
> In the next step, it will be better to let pmd_thread handle tap
> and veth interfaces.
>
> Signed-off-by: Yi Yang
> Co-authored-by: William Tu
> Signed-off-by: William Tu

Thanks for the patch!

I am a bit concerned about version compatibility issues here. There are two relevant kinds of versions.

The first is the version of the kernel/library headers. This patch works pretty hard to adapt to the headers that are available at compile time, only dealing with the versions of the protocols that are available from the headers. This approach is sometimes fine, but an approach that can be better is to simply declare the structures or constants that the headers lack. This is often pretty easy for Linux data structures.
OVS does this for some structures that it cares about with the headers in ovs/include/linux. This approach has two advantages: the OVS code (outside these special declarations) doesn't have to care whether particular structures are declared, because they are always declared, and the OVS build always supports a particular feature regardless of the headers of the system on which it was built.

The second kind of version is the version of the system that OVS runs on. Unless a given feature is one that is supported by every version that OVS cares about, OVS needs to test at runtime whether the feature is supported and, if not, fall back to the older feature. I don't see that in this code. Instead, it looks to me like it assumes that if the feature was available at build time, then it is available at runtime. This is not a good way to do things, since we want people to be able to get builds from distributors such as Red Hat or Debian and then run those builds on a diverse collection of kernels.
[ovs-dev] Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath
Ben, thank you so much for your quick comments. Yes, using some code to check TPACKET features would be better, but I'm not familiar with the AC_CHECK* machinery; it would help if you could show me a good example for reference. I'll fix the issues you mentioned in the next version. BTW, I'm taking the Chinese New Year long holiday, so the next version will be sent out after a week at least. More comments from other folks are welcome.

-----Original Message-----
From: Ben Pfaff [mailto:b...@ovn.org]
Sent: 2020-01-22 3:26
To: yang_y...@163.com
Cc: ovs-dev@openvswitch.org; ian.sto...@intel.com; Yi Yang (Cloud Service Group)
Subject: Re: [PATCH] Use TPACKET_V1/V2/V3 to accelerate veth for DPDK datapath

On Tue, Jan 21, 2020 at 02:49:47AM -0500, yang_y...@163.com wrote:
> From: Yi Yang
>
> We can avoid high system call overhead by using TPACKET_V1/V2/V3 and
> use DPDK-like poll to receive and send packets (Note: send still needs
> to call sendto to trigger final packet transmission).
>
> I can see about 30% improvement compared to the last recvmmsg
> optimization if I use TPACKET_V3. TPACKET_V1/V2 is worse than
> TPACKET_V3, but it still can improve about 20%.
>
> For veth, it is 1.47 Gbps before this patch and about 1.98 Gbps
> after applying this patch. But it is about 4.00 Gbps if we use
> af_packet for veth; the bottleneck lies in the ovs-vswitchd thread,
> which handles too many things in every loop (as below), so it can't
> work as efficiently as pmd_thread.
>
>     memory_run();
>     bridge_run();
>     unixctl_server_run(unixctl);
>     netdev_run();
>
>     memory_wait();
>     bridge_wait();
>     unixctl_server_wait(unixctl);
>     netdev_wait();
>     poll_block();
>
> In the next step, it will be better to let pmd_thread handle tap
> and veth interfaces.
>
> Signed-off-by: Yi Yang
> Co-authored-by: William Tu
> Signed-off-by: William Tu

Thanks for the patch!

I am a bit concerned about version compatibility issues here. There are two relevant kinds of versions. The first is the version of the kernel/library headers.
This patch works pretty hard to adapt to the headers that are available at compile time, only dealing with the versions of the protocols that are available from the headers. This approach is sometimes fine, but an approach that can be better is to simply declare the structures or constants that the headers lack. This is often pretty easy for Linux data structures.

OVS does this for some structures that it cares about with the headers in ovs/include/linux. This approach has two advantages: the OVS code (outside these special declarations) doesn't have to care whether particular structures are declared, because they are always declared, and the OVS build always supports a particular feature regardless of the headers of the system on which it was built.

The second kind of version is the version of the system that OVS runs on. Unless a given feature is one that is supported by every version that OVS cares about, OVS needs to test at runtime whether the feature is supported and, if not, fall back to the older feature. I don't see that in this code. Instead, it looks to me like it assumes that if the feature was available at build time, then it is available at runtime. This is not a good way to do things, since we want people to be able to get builds from distributors such as Red Hat or Debian and then run those builds on a diverse collection of kernels.

One specific comment I have here is that, in acinclude.m4, it would be better to use AC_CHECK_TYPE or AC_CHECK_TYPES than OVS_GREP_IFELSE. The latter is for testing for kernel builds only; we can't use the normal AC_* tests for those because we often can't successfully build kernel headers using the compiler and flags that Autoconf sets up for building OVS.

Thanks,

Ben.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] Re: [PATCH RFC] WIP: netdev-tpacket: Add AF_PACKET v3 support.
Thanks William. af_packet can only open a tap interface; it can't create one. A tap interface can only be created in the below way:

ovs-vsctl add-port <bridge> tapX -- set interface tapX type=internal

This tap is very special; it is still a mystery to me. "ip tuntap add tapX mode tap" can't create such a tap interface. Can anybody tell me how to create such a tap interface without using "ovs-vsctl add-port tapX"?

By the way, I tried af_packet for veth; the performance is very good, about 4 Gbps on my machine, but it used TPACKET_V2.

-----Original Message-----
From: William Tu [mailto:u9012...@gmail.com]
Sent: 2019-12-21 1:50
To: Ben Pfaff
Cc: d...@openvswitch.org; i.maxim...@ovn.org; Yi Yang (Cloud Service Group); echau...@redhat.com
Subject: Re: [PATCH RFC] WIP: netdev-tpacket: Add AF_PACKET v3 support.

On Thu, Dec 19, 2019 at 08:44:30PM -0800, Ben Pfaff wrote:
> On Thu, Dec 19, 2019 at 04:41:25PM -0800, William Tu wrote:
> > Currently the performance of sending packets from userspace ovs to
> > kernel veth device is pretty bad as reported from YiYang[1].
> > The patch adds AF_PACKET v3, tpacket v3, as another way to tx/rx
> > packets to a linux device, hopefully showing better performance.
> >
> > AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However, my
> > current patch using iperf tcp shows only 1.4 Gbps, maybe I'm doing
> > something wrong. Also DPDK has a similar implementation using
> > AF_PACKET v2[3]. This is still work-in-progress but any feedback
> > is welcome.
>
> Is there a good reason that this is implemented as a new kind of
> netdev rather than just a new way for the existing netdev
> implementation to do packet i/o?

AF_PACKET v3 is more like a PMD mode driver (netdev-afxdp and the other dpdk netdevs), which has its own memory management and ring structure and polls the descriptors, so I implemented it as a new kind. I feel it's pretty different from the tap or existing af_packet netdev. But integrating it into the existing netdev (lib/netdev-linux.c) is also OK.
William
[ovs-dev] Re: [PATCH RFC] WIP: netdev-tpacket: Add AF_PACKET v3 support.
Hi, William

What kernel version can support AF_PACKET v3? I can try it with your patch.

-----Original Message-----
From: William Tu [mailto:u9012...@gmail.com]
Sent: 2019-12-20 8:41
To: d...@openvswitch.org
Cc: i.maxim...@ovn.org; Yi Yang (Cloud Service Group); b...@ovn.org; echau...@redhat.com
Subject: [PATCH RFC] WIP: netdev-tpacket: Add AF_PACKET v3 support.

Currently the performance of sending packets from userspace ovs to a kernel veth device is pretty bad, as reported by YiYang[1]. The patch adds AF_PACKET v3, tpacket v3, as another way to tx/rx packets to a linux device, hopefully showing better performance.

AF_PACKET v3 should get close to 1 Mpps, as shown[2]. However, my current patch using iperf tcp shows only 1.4 Gbps, maybe I'm doing something wrong. Also DPDK has a similar implementation using AF_PACKET v2[3]. This is still work-in-progress but any feedback is welcome.

[1] https://patchwork.ozlabs.org/patch/1204939/
[2] slide 18, https://www.netdevconf.info/2.2/slides/karlsson-afpacket-talk.pdf
[3] dpdk/drivers/net/af_packet/rte_eth_af_packet.c
---
 lib/automake.mk            |   2 +
 lib/netdev-linux-private.h |  23 +++
 lib/netdev-linux.c         |  24 ++-
 lib/netdev-provider.h      |   1 +
 lib/netdev-tpacket.c       | 487 +
 lib/netdev-tpacket.h       |  43
 lib/netdev.c               |   1 +
 7 files changed, 580 insertions(+), 1 deletion(-)
 create mode 100644 lib/netdev-tpacket.c
 create mode 100644 lib/netdev-tpacket.h

diff --git a/lib/automake.mk b/lib/automake.mk index 17b36b43d9d7..0c635404cb43 100644 --- a/lib/automake.mk +++ b/lib/automake.mk @@ -398,6 +398,8 @@ lib_libopenvswitch_la_SOURCES += \ lib/netdev-linux.c \ lib/netdev-linux.h \ lib/netdev-linux-private.h \ + lib/netdev-tpacket.c \ + lib/netdev-tpacket.h \ lib/netdev-offload-tc.c \ lib/netlink-conntrack.c \ lib/netlink-conntrack.h \ diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h index f08159aa7b53..99a2c03bb2a6 100644 --- a/lib/netdev-linux-private.h +++ b/lib/netdev-linux-private.h @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include
@@ -37,6 +38,24 @@ struct netdev; +/* tpacket rx and tx ring structure. */ struct tp_ring { +struct iovec *rd; /* rd[n] points to mmap area. */ +int rd_len; +int rd_num; +char *mm; /* mmap address. */ +size_t mm_len; +unsigned int next_avail_block; +int frame_len; +}; + +struct tpacket_info { +int fd; +struct tpacket_req3 req; +struct tp_ring rxring; +struct tp_ring txring; +}; + struct netdev_rxq_linux { struct netdev_rxq up; bool is_tap; @@ -110,6 +129,10 @@ struct netdev_linux { struct netdev_afxdp_tx_lock *tx_locks; /* Array of locks for TX queues. */ #endif + +/* tpacket v3 information. */ +struct tpacket_info **tps; +int n_tps; }; static bool diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index f8e59bacfb13..edfc389ee6f2 100644 --- a/lib/netdev-linux.c +++ b/lib/netdev-linux.c @@ -36,9 +36,10 @@ #include #include #include +#include #include #include -#include +//#include #include #include #include @@ -57,6 +58,7 @@ #include "openvswitch/hmap.h" #include "netdev-afxdp.h" #include "netdev-provider.h" +#include "netdev-tpacket.h" #include "netdev-vport.h" #include "netlink-notifier.h" #include "netlink-socket.h" @@ -3315,6 +3317,26 @@ const struct netdev_class netdev_afxdp_class = { .rxq_recv = netdev_afxdp_rxq_recv, }; #endif + +const struct netdev_class netdev_tpacket_class = { +NETDEV_LINUX_CLASS_COMMON, +.type = "tpacket", +.is_pmd = true, +.construct = netdev_linux_construct, +.destruct = netdev_linux_destruct, +.get_stats = netdev_linux_get_stats, +.get_features = netdev_linux_get_features, +.get_status = netdev_linux_get_status, +.set_config = netdev_tpacket_set_config, +.get_config = netdev_tpacket_get_config, +.reconfigure = netdev_tpacket_reconfigure, +.get_block_id = netdev_linux_get_block_id, +.get_numa_id = netdev_afxdp_get_numa_id, +.send = netdev_tpacket_batch_send, +.rxq_construct = netdev_linux_rxq_construct, +.rxq_destruct = netdev_linux_rxq_destruct, +.rxq_recv = netdev_tpacket_rxq_recv, }; #define CODEL_N_QUEUES 0x diff --git 
a/lib/netdev-provider.h b/lib/netdev-provider.h index f109c4e66f0d..518d1dc6e02c 100644 --- a/lib/netdev-provider.h +++ b/lib/netdev-provider.h @@ -833,6 +833,7 @@ extern const struct netdev_class netdev_bsd_class; extern const struct netdev_class netdev_windows_class; #else extern const struct netdev_class netdev_linux_class; +extern const struct netdev_class netdev_tpacket_class; #endif extern const struct netdev_class netdev_internal_class; extern const struct netdev_class netdev_tap_class; dif
[ovs-dev] Re: [PATCH] socket-util: Introduce emulation and wrapper for recvmmsg().
Current ovs master has already included the sendmmsg declaration in include/sparse/sys/socket.h:

int sendmmsg(int, struct mmsghdr *, unsigned int, unsigned int);

I saw "+^L" in your patch:

--- a/lib/socket-util.c +++ b/lib/socket-util.c @@ -1283,3 +1283,59 @@ wrap_sendmmsg(int fd, struct mmsghdr *msgs, unsigned int n, unsigned int flags) } #endif #endif +^L +#ifndef _WIN32 /* Avoid using recvmsg on Windows entirely. */ +#undef recvmmsg +int +wrap_recvmmsg(int fd, struct mmsghdr *msgs, unsigned int n, + int flags, struct timespec *timeout) +{ +ovs_assert(!timeout); /* XXX not emulated */ + +static bool recvmmsg_broken = false; +if (!recvmmsg_broken) { +int save_errno = errno; +int retval = recvmmsg(fd, msgs, n, flags, timeout); +if (retval >= 0 || errno != ENOSYS) { +return retval; +} +recvmmsg_broken = true; +errno = save_errno; +} +return emulate_recvmmsg(fd, msgs, n, flags, timeout); +} +#endif

I don't understand why recvmmsg is called here when we already know recvmmsg isn't defined, and I don't think "static bool recvmmsg_broken" is thread-safe. I think we can completely remove the below part if we know recvmmsg isn't defined (I think autoconf can detect it very precisely; we needn't do a runtime check for this):

+static bool recvmmsg_broken = false; +if (!recvmmsg_broken) { +int save_errno = errno; +int retval = recvmmsg(fd, msgs, n, flags, timeout); +if (retval >= 0 || errno != ENOSYS) { +return retval; +} +recvmmsg_broken = true; +errno = save_errno; +}

-----Original Message-----
From: Ben Pfaff [mailto:b...@ovn.org]
Sent: 2019-12-18 4:39
To: d...@openvswitch.org
Cc: Ben Pfaff; Yi Yang (Cloud Service Group)
Subject: [PATCH] socket-util: Introduce emulation and wrapper for recvmmsg().

Not every system will have recvmmsg(), so introduce compatibility code that will allow it to be used blindly from the rest of the tree. This assumes that recvmmsg() and sendmmsg() are either both present or both absent in system libraries and headers.

CC: Yi Yang
Signed-off-by: Ben Pfaff
---
I haven't actually tested this!
include/sparse/sys/socket.h | 7 - lib/socket-util.c | 56 + lib/socket-util.h | 24 +--- 3 files changed, 76 insertions(+), 11 deletions(-) diff --git a/include/sparse/sys/socket.h b/include/sparse/sys/socket.h index 4178f57e2bda..6ff245ae939b 100644 --- a/include/sparse/sys/socket.h +++ b/include/sparse/sys/socket.h @@ -27,6 +27,7 @@ typedef unsigned short int sa_family_t; typedef __socklen_t socklen_t; +struct timespec; struct sockaddr { sa_family_t sa_family; @@ -126,7 +127,8 @@ enum { MSG_PEEK, MSG_TRUNC, MSG_WAITALL, -MSG_DONTWAIT +MSG_DONTWAIT, +MSG_WAITFORONE }; enum { @@ -171,4 +173,7 @@ int sockatmark(int); int socket(int, int, int); int socketpair(int, int, int, int[2]); +int sendmmsg(int, struct mmsghdr *, unsigned int, int); int +recvmmsg(int, struct mmsghdr *, unsigned int, int, struct timespec *); + #endif /* for sparse */ diff --git a/lib/socket-util.c b/lib/socket-util.c index 6b7378de934b..f6f6f3b0a33f 100644 --- a/lib/socket-util.c +++ b/lib/socket-util.c @@ -1283,3 +1283,59 @@ wrap_sendmmsg(int fd, struct mmsghdr *msgs, unsigned int n, unsigned int flags) } #endif #endif + +#ifndef _WIN32 /* Avoid using recvmsg on Windows entirely. */ static +int emulate_recvmmsg(int fd, struct mmsghdr *msgs, unsigned int n, + int flags, struct timespec *timeout OVS_UNUSED) { +ovs_assert(!timeout); /* XXX not emulated */ + +bool waitforone = flags & MSG_WAITFORONE; +flags &= ~MSG_WAITFORONE; + +for (unsigned int i = 0; i < n; i++) { +ssize_t retval = recvmsg(fd, [i].msg_hdr, flags); +if (retval < 0) { +return i ? 
i : retval; +} +msgs[i].msg_len = retval; + +if (waitforone) { +flags |= MSG_DONTWAIT; +} +} +return n; +} + +#ifndef HAVE_SENDMMSG +int +recvmmsg(int fd, struct mmsghdr *msgs, unsigned int n, + int flags, struct timespec *timeout) { +return emulate_recvmmsg(fd, msgs, n, flags, timeout); } #else +/* recvmmsg was redefined in lib/socket-util.c, should undef recvmmsg +here + * to avoid recursion */ +#undef recvmmsg +int +wrap_recvmmsg(int fd, struct mmsghdr *msgs, unsigned int n, + int flags, struct timespec *timeout) { +ovs_assert(!timeout); /* XXX not emulated */ + +static bool recvmmsg_broken = false; +if (!recvmmsg_broken) { +int save_errno = errno; +int retval = recvmmsg(fd, msgs, n, flags, timeout); +if (retval >= 0 || errno != ENOSYS) { +return retval; +} +recvmmsg_broken = true; +errno = save_errno; +} +return emulate_recvmmsg(fd, msgs, n, fl
[ovs-dev] Re: [PATCH] Use batch process recv for tap and raw socket in netdev datapath
Ben, thanks for your review. For recvmmsg, we have to prepare some buffers for it, but we have no way to know how many packets are waiting on the socket, so these mallocs are unavoidable overhead. Maybe a self-adaptive mechanism would be better: for example, the first receive mallocs just 4 buffers; if all 4 are filled, increase to 8, and so on up to 32; if a batch comes back partially filled, halve the count. But this would complicate the code a bit. Your fix is right, `i` should be set to 0 when retval < 0. Thanks for your review again; I'll update the patch with your fix and send another version.

-----Original Message-----
From: Ben Pfaff [mailto:b...@ovn.org]
Sent: 2019-12-18 4:14
To: yang_y...@163.com
Cc: ovs-dev@openvswitch.org; ian.sto...@intel.com; Yi Yang (Cloud Service Group)
Subject: Re: [PATCH] Use batch process recv for tap and raw socket in netdev datapath

On Fri, Dec 06, 2019 at 02:09:24AM -0500, yang_y...@163.com wrote:
> From: Yi Yang
>
> Current netdev_linux_rxq_recv_tap and netdev_linux_rxq_recv_sock just
> receive a single packet; that is very inefficient, per my test case,
> which adds two tap ports or veth ports into an OVS bridge
> (datapath_type=netdev) and uses iperf3 to do a performance test between
> two ports (they are set into different network namespaces).

Thanks for the patch! This is an impressive performance improvement!

Each call to netdev_linux_batch_rxq_recv_sock() now calls malloc() 32 times. This is expensive if only a few packets (or none) are received. Maybe it doesn't matter, but I wonder whether it affects performance.

I think that no packets are freed on error.
Fix:

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 9cb45d5c7d29..3414a6495ced 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1198,6 +1198,7 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
     if (retval < 0) {
         /* Save -errno to retval temporarily */
         retval = -errno;
+        i = 0;
         goto free_buffers;
     }

To get sparse to work, one must fold in the following:

diff --git a/include/sparse/sys/socket.h b/include/sparse/sys/socket.h
index 4178f57e2bda..e954ade714b5 100644
--- a/include/sparse/sys/socket.h
+++ b/include/sparse/sys/socket.h
@@ -27,6 +27,7 @@
 typedef unsigned short int sa_family_t;
 typedef __socklen_t socklen_t;
 
+struct timespec;
 
 struct sockaddr {
     sa_family_t sa_family;
@@ -171,4 +172,7 @@
 int sockatmark(int);
 int socket(int, int, int);
 int socketpair(int, int, int, int[2]);
+int sendmmsg(int, struct mmsghdr *, unsigned int, int);
+int recvmmsg(int, struct mmsghdr *, unsigned int, int, struct timespec *);
+
 #endif /* for sparse */

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
[ovs-dev] Re: [PATCH] Use batch process recv for tap and raw socket in netdev datapath
William, thank you for your test. It is one of the solutions to the OVS DPDK issues raised at OVS Conference :-). This is a very cheap way of improving things; the performance isn't that bad, and it is basically acceptable for common use cases that don't expect high network performance.

-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of William Tu
Sent: December 7, 2019 12:19
To: yang_y...@163.com
Cc: ovs-dev
Subject: Re: [ovs-dev] [PATCH] Use batch process recv for tap and raw socket in netdev datapath

On Thu, Dec 5, 2019 at 11:09 PM wrote:
>
> From: Yi Yang
>
> Current netdev_linux_rxq_recv_tap and netdev_linux_rxq_recv_sock just
> receive a single packet, which is very inefficient, per my test case,
> which adds two tap ports or veth ports into an OVS bridge
> (datapath_type=netdev) and uses iperf3 to test performance between the
> two ports (they are set into different network namespaces).
>
> The result is as below:
>
>     tap:  295 Mbits/sec
>     veth: 207 Mbits/sec
>
> After I changed netdev_linux_rxq_recv_tap and
> netdev_linux_rxq_recv_sock to use batch processing, the performance was
> boosted by about 7 times; here is the result:
>
>     tap:  1.96 Gbits/sec
>     veth: 1.47 Gbits/sec
>
> Undoubtedly this is a huge improvement, although it can't match the OVS
> kernel datapath yet.
>
> FYI: here is the result for the OVS kernel datapath:
>
>     tap:  37.2 Gbits/sec
>     veth: 36.3 Gbits/sec
>
> Note: performance results are highly related to your test machine;
> you shouldn't expect the same results on your own machine.

Hi Yi Yang,

Thanks for the patch, it's amazing with so much performance improvement. I haven't reviewed the code, but Yifeng and I applied and tested this patch. Using netdev-afxdp + tap port, we do see performance improve from 300Mbps to 2Gbps in our testbed! Will add more feedback next week.
William
[ovs-dev] Re: [PATCH v2] netdev-afxdp: Best-effort configuration of XDP mode.
Hi, Ilya

Can you explain what the kernel limitations are for TCP on veth? I can't understand why veth has such limitations only for TCP. I saw a veth bug (https://tech.vijayp.ca/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19) but it was fixed in 2016.

-----Original Message-----
From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
Sent: November 7, 2019 19:37
To: ovs-dev@openvswitch.org
Cc: Ilya Maximets
Subject: [ovs-dev] [PATCH v2] netdev-afxdp: Best-effort configuration of XDP mode.

Until now there were only two options for the XDP mode in OVS: SKB or DRV, i.e. 'generic XDP' or 'native XDP with zero-copy enabled'. Devices like 'veth' interfaces in Linux support native XDP, but don't support zero-copy mode. This case cannot be covered by the existing API, and we have to use the slower generic XDP for such devices. There are a few more issues, e.g. TCP is not supported in generic XDP mode for veth interfaces due to kernel limitations, however it is supported in native mode.

This change introduces the ability to use native XDP without zero-copy, along with a best-effort configuration option that is enabled by default. In the best-effort case OVS will sequentially try different modes starting from the fastest one and will choose the first one acceptable for the current interface. This guarantees the best possible performance. If a user wants to choose a specific mode, that is still possible by setting 'options:xdp-mode'.

This change additionally changes the API by renaming the configuration knob from 'xdpmode' to 'xdp-mode' and also renaming the modes themselves to be more user-friendly. The full list of currently supported modes:

 * native-with-zerocopy - former DRV
 * native               - new one, DRV without zero-copy
 * generic              - former SKB
 * best-effort          - new one, chooses the best available of the 3 modes above

Since 'best-effort' is the default mode, users will not need to explicitly set 'xdp-mode' in most cases.
TCP related tests are enabled back in the system afxdp testsuite, because 'best-effort' will choose 'native' mode for veth interfaces and this mode has no issues with TCP.

Signed-off-by: Ilya Maximets
---
With this patch I modified the user-visible API, but I think it's OK since it's still an experimental netdev. Comments are welcome.

Version 2:
  * Rebased on current master.

 Documentation/intro/install/afxdp.rst |  54 ---
 NEWS                                  |  12 +-
 lib/netdev-afxdp.c                    | 223 --
 lib/netdev-afxdp.h                    |   9 ++
 lib/netdev-linux-private.h            |   8 +-
 tests/system-afxdp-macros.at          |   7 -
 vswitchd/vswitch.xml                  |  38 +++--
 7 files changed, 227 insertions(+), 124 deletions(-)

diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
index a136db0c9..937770ad0 100644
--- a/Documentation/intro/install/afxdp.rst
+++ b/Documentation/intro/install/afxdp.rst
@@ -153,9 +153,8 @@ To kick start end-to-end autotesting::
     make check-afxdp TESTSUITEFLAGS='1'
 
 .. note::
-   Not all test cases pass at this time. Currenly all TCP related
-   tests, ex: using wget or http, are skipped due to XDP limitations
-   on veth. cvlan test is also skipped.
+   Not all test cases pass at this time. Currenly all cvlan tests are skipped
+   due to kernel issues.
 
 If a test case fails, check the log at::
 
@@ -177,33 +176,35 @@ in :doc:`general`::
 
     ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
 
 Make sure your device driver support AF_XDP, netdev-afxdp supports
-the following additional options (see man ovs-vswitchd.conf.db for
+the following additional options (see ``man ovs-vswitchd.conf.db`` for
 more details):
 
- * **xdpmode**: use "drv" for driver mode, or "skb" for skb mode.
+ * ``xdp-mode``: ``best-effort``, ``native-with-zerocopy``,
+   ``native`` or ``generic``.  Defaults to ``best-effort``, i.e. best of
+   supported modes, so in most cases you don't need to change it.
 
- * **use-need-wakeup**: default "true" if libbpf supports it, otherwise false.
+ * ``use-need-wakeup``: default ``true`` if libbpf supports it,
+   otherwise ``false``.
 
 For example, to use 1 PMD (on core 4) on 1 queue (queue 0) device,
-configure these options: **pmd-cpu-mask, pmd-rxq-affinity, and n_rxq**.
-The **xdpmode** can be "drv" or "skb"::
+configure these options: ``pmd-cpu-mask``, ``pmd-rxq-affinity``, and
+``n_rxq``::
 
     ethtool -L enp2s0 combined 1
     ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
     ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
-        options:n_rxq=1 options:xdpmode=drv \
-        other_config:pmd-rxq-affinity="0:4"
+        other_config:pmd-rxq-affinity="0:4"
 
 Or, use 4 pmds/cores and 4 queues by doing::
 
     ethtool -L enp2s0 combined 4
     ovs-vsctl
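For completeness, pinning a specific mode instead of relying on best-effort should look like the following sketch, based on the renamed `options:xdp-mode` knob described in the patch above; the interface name `enp2s0` is only an example:

```shell
# Request native XDP without zero-copy explicitly; per the patch
# description, port setup fails if the driver does not support the
# requested mode (no silent fallback when a mode is pinned).
ovs-vsctl set interface enp2s0 options:xdp-mode=native

# Inspect the requested mode afterwards.
ovs-vsctl get interface enp2s0 options:xdp-mode
```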
[ovs-dev] can OVS conntrack support IP list like this: actions=ct(commit, table=0, zone=1, nat(dst=220.0.0.3, 220.0.0.7, 220.0.0.123))?
Hi, folks

We need to do SNAT for many internal IPs using just a few public IPs, and we also need to do DNAT from some other public IPs to expose web services. The openflow rules look like the below:

table=0,ip,nw_src=172.17.0.0/16,…,actions=ct(commit,table=0,zone=1,nat(src=220.0.0.3,220.0.0.7,220.0.0.123))
table=0,ip,nw_src=172.18.0.67,…,actions=ct(commit,table=0,zone=1,nat(src=220.0.0.3,220.0.0.7,220.0.0.123))
table=0,ip,tcp,nw_dst=220.0.0.11,tp_dst=80,…,actions=ct(commit,table=0,zone=2,nat(dst=172.16.0.100:80))
table=0,ip,tcp,nw_dst=220.0.0.11,tp_dst=443,…,actions=ct(commit,table=0,zone=2,nat(dst=172.16.0.100:443))

From the ct documentation, it seems nat can't take a list of IPs. Does anybody know a feasible way to handle such cases?

In addition, is it OK if multiple openflow rules use the same NAT IP:PORT combination? I'm not sure whether it will result in conflicts for SNAT, because all of them need dynamic source port mapping; per my test, it seems this isn't a problem.

Thank you all in advance; I sincerely appreciate your help.
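As far as I know, the ct() nat action does accept a contiguous address range (nat(src=addr1-addr2)) even though it does not accept a comma-separated list, so if the public IPs can be allocated contiguously, a workaround could look like the sketch below. The addresses come from the rules above; the contiguous-range assumption and the bridge name br-int are mine:

```shell
# SNAT the internal subnet onto a contiguous public range instead of
# a list of individual addresses (range syntax: addr1-addr2).
ovs-ofctl -O OpenFlow13 add-flow br-int \
    'table=0,ip,nw_src=172.17.0.0/16,actions=ct(commit,table=0,zone=1,nat(src=220.0.0.3-220.0.0.7))'
```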
[ovs-dev] Re: Why are iperf3 udp packets out of order in OVS DPDK case?
Ian, here is my configuration; sorry, I can't show the flow details because they are confidential. By the way, iperf3 tcp is OK and performance is good enough, so I'm really confused: udp was OK but tcp was not in my VM environment before, which broke my intuition :-). I can avoid the out-of-order issue if I limit the udp bandwidth to 1G with -b 1G. The traffic doesn't reach the vlan ports; this ovs node acts as a NAT gateway, steering the traffic back and forth between the iperf3 client and server, which are other physical machines IP-reachable from this ovs node.

$ sudo ovs-vsctl show
4135a1ed-2bcb-449a-bb07-ed907d6c265f
    Bridge br-int
        Port br-int
            Interface br-int
                type: internal
        Port "vlan151"
            tag: 151
            Interface "vlan151"
                type: internal
        Port "vlan12"
            tag: 12
            Interface "vlan12"
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs=":07:00.1", n_rxq="7"}
        Port "vlan11"
            tag: 11
            Interface "vlan11"
                type: internal
        Port "vlan153"
            tag: 153
            Interface "vlan153"
                type: internal
    ovs_version: "2.11.1"

$ sudo ovs-vsctl list Open_vSwitch
_uuid            : 4135a1ed-2bcb-449a-bb07-ed907d6c265f
bridges          : [778ea619-496c-417c-ac08-92d7784f1660]
cur_cfg          : 46
datapath_types   : [netdev, system]
db_version       : "7.16.1"
dpdk_initialized : true
dpdk_version     : "DPDK 18.11.1"
external_ids     : {hostname="eip01", rundir="/var/run/openvswitch", system-id="f331dcc0-8ae7-4f2b-aa30-10ae4c8a7b11"}
iface_types      : [dpdk, dpdkr, dpdkvhostuser, dpdkvhostuserclient, erspan, geneve, gre, internal, "ip6erspan", "ip6gre", lisp, patch, stt, system, tap, vxlan]
manager_options  : []
next_cfg         : 46
other_config     : {dpdk-init="true", dpdk-socket-mem="4096", pmd-cpu-mask="0xfe"}
ovs_version      : "2.11.1"
ssl              : []
statistics       : {}
system_type      : ubuntu
system_version   : "16.04"

inspur@eip01:~$ sudo ovs-vsctl -- get Interface dpdk0 mtu_request
9000

-----Original Message-----
From: Stokes, Ian [mailto:ian.sto...@intel.com]
Sent: August 27, 2019 18:02
To: Yi Yang - Cloud Services Group ; ovs-disc...@openvswitch.org
Cc: ovs-dev@openvswitch.org
Subject: Re:
[ovs-dev] Why are iperf3 udp packets out of order in OVS DPDK case?

On 8/27/2019 9:35 AM, Yi Yang - Cloud Services Group wrote:
> Hi, all
>
> I'm doing experiments with OVS and OVS DPDK, only one bridge is there,
> ports and flows are same for OVS and OVS DPDK, in OVS case, everything
> works well, but in OVS DPDK case, iperf udp performance data are very
> poor, udp packets are out of order, I have limited MTU and send buffer
> by -l 1410 -M 1410, anybody knows why and how to fix it? Thank you in advance.
>

Hi, can you provide more detail of your deployment? OVS version, DPDK version, configuration commands for ports/flows etc.

Thanks
Ian
[ovs-dev] Why are iperf3 udp packets out of order in OVS DPDK case?
Hi, all

I'm doing experiments with OVS and OVS DPDK, only one bridge is there, ports and flows are same for OVS and OVS DPDK, in OVS case, everything works well, but in OVS DPDK case, iperf udp performance data are very poor, udp packets are out of order, I have limited MTU and send buffer by -l 1410 -M 1410, anybody knows why and how to fix it? Thank you in advance.

iperf3: OUT OF ORDER - incoming packet = 65 and received packet = 352 AND SP = 5
iperf3: OUT OF ORDER - incoming packet = 66 and received packet = 352 AND SP = 5
iperf3: OUT OF ORDER - incoming packet = 67 and received packet = 352 AND SP = 5
[... the same message repeats for incoming packets 68 through 120, all with "received packet = 352 AND SP = 5" ...]
[ovs-dev] Why can the "meter" action be specified only once?
Hi, all

I was told meter can only be specified once per flow, but there is a real use case for more: multiple flows share a total bandwidth while every flow also has its own bandwidth limit. With two meters we could get not only per-flow stats but also total stats; I think this is a very reasonable user scenario.

ovs-ofctl: instruction meter may be specified only once
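Since each flow may carry the meter instruction only once, one possible workaround is to chain tables: apply the per-flow meter in one table and the shared meter in a later table. A sketch follows; the bridge name br0, the meter IDs, rates and port are all hypothetical, and I have not verified this against every OVS version:

```shell
# Per-flow limit (meter 1) in table 0, shared total limit (meter 2) in table 1.
ovs-ofctl -O OpenFlow13 add-meter br0 'meter=1,kbps,band=type=drop,rate=100000'
ovs-ofctl -O OpenFlow13 add-meter br0 'meter=2,kbps,band=type=drop,rate=500000'

# Apply the per-flow meter, then continue to the table holding the shared meter.
ovs-ofctl -O OpenFlow13 add-flow br0 'table=0,in_port=1,actions=meter:1,goto_table:1'
ovs-ofctl -O OpenFlow13 add-flow br0 'table=1,actions=meter:2,normal'
```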
[ovs-dev] How can we improve veth and tap performance in OVS DPDK?
Hi, all

We're trying OVS DPDK in an openstack cloud, but a big concern makes us hesitate. Floating IP and qrouter use tap interfaces which are attached to br-int, and SNAT works in a similar way, so OVS DPDK will impact VM network performance significantly. I believe many cloud providers have deployed OVS DPDK; my questions are:

1. Do we have some known ways to improve this?
2. Is there any existing effort for this?

Veth in kubernetes should have the same performance issue in the OVS DPDK case.

I also found a very weird issue. I added two veth pairs into an ovs bridge and an ovs DPDK bridge; in the ovs case iperf3 works well, but it doesn't in the OVS DPDK case. What's wrong?

$ sudo ./my-ovs-vsctl show
2a67c1d9-51dc-4728-bb3e-405f2f49e2b1
    Bridge br-int
        Port "veth3-br"
            Interface "veth3-br"
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs=":00:08.0"}
        Port br-int
            Interface br-int
                type: internal
        Port "veth2-br"
            Interface "veth2-br"
        Port "dpdk1"
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs=":00:09.0"}
        Port "veth4-br"
            Interface "veth4-br"
        Port "veth1-br"
            Interface "veth1-br"

$ sudo ip netns exec ns1 ifconfig veth1
veth1     Link encap:Ethernet  HWaddr 26:32:e8:f3:1e:2a
          inet addr:20.1.1.1  Bcast:20.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2432:e8ff:fef3:1e2a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:809 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:66050 (66.0 KB)  TX bytes:1580 (1.5 KB)

$ sudo ip netns exec ns2 ifconfig veth2
veth2     Link encap:Ethernet  HWaddr 82:71:3b:41:d1:ec
          inet addr:20.1.1.2  Bcast:20.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::8071:3bff:fe41:d1ec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:862 errors:0 dropped:0 overruns:0 frame:0
          TX packets:26 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:70436 (70.4 KB)  TX bytes:2024 (2.0 KB)

$ sudo ip netns exec ns2 ping 20.1.1.1
PING 20.1.1.1 (20.1.1.1) 56(84) bytes of data.
64 bytes from 20.1.1.1: icmp_seq=1 ttl=64 time=0.353 ms
64 bytes from 20.1.1.1: icmp_seq=2 ttl=64 time=0.322 ms
64 bytes from 20.1.1.1: icmp_seq=3 ttl=64 time=0.333 ms
64 bytes from 20.1.1.1: icmp_seq=4 ttl=64 time=0.329 ms
64 bytes from 20.1.1.1: icmp_seq=5 ttl=64 time=0.340 ms
^C
--- 20.1.1.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4099ms
rtt min/avg/max/mdev = 0.322/0.335/0.353/0.019 ms

$ sudo ip netns exec ns1 iperf3 -s -i 10 &
[2] 2851
[1]   Exit 1    sudo ip netns exec ns1 iperf3 -s -i 10
$
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
$ sudo ip netns exec ns2 iperf3 -t 60 -i 10 -c 20.1.1.1
iperf3: error - unable to connect to server: Connection timed out
$

iperf3 always hangs there and then exits with a timeout. What's wrong here?

$ sudo ./my-ovs-ofctl -Oopenflow13 dump-flows br-int
 cookie=0x0, duration=1076.396s, table=0, n_packets=1522, n_bytes=124264, priority=0 actions=NORMAL
$

The below is the Redhat OSP document for your reference:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/network_functions_virtualization_planning_and_configuration_guide/part-dpdk-configure

8.8. Known limitations

There are certain limitations when configuring OVS-DPDK with Red Hat OpenStack Platform for the NFV use case:

* Use Linux bonds for control plane networks. Ensure both PCI devices used in the bond are on the same NUMA node for optimum performance. Neutron Linux bridge configuration is not supported by Red Hat.
* Huge pages are required for every instance running on the hosts with OVS-DPDK. If huge pages are not present in the guest, the interface appears but does not function.
* There is a performance degradation of services that use tap devices, because these devices do not support DPDK. For example, services such as DVR, FWaaS, and LBaaS use tap devices.
* With OVS-DPDK, you can enable DVR with the netdev datapath, but this has poor performance and is not suitable for a production environment. DVR uses kernel namespaces and tap devices to perform the routing.
* To ensure the DVR routing performs well with OVS-DPDK, you need to use a controller such as ODL which implements routing as OpenFlow rules. With OVS-DPDK, OpenFlow routing removes the bottleneck introduced by the Linux kernel
[ovs-dev] How can I delete flows which match a given cookie value?
Hi, all

I need to add and delete flows according to user operations. I know the openflowplugin in Opendaylight can do this, but it seems "ovs-ofctl del-flows" can't work this way. Why can't a cookie value be used for "ovs-ofctl del-flows"?

sudo ovs-ofctl -Oopenflow13 --strict del-flows br-int "table=2,cookie=12345"
ovs-ofctl: cannot set cookie

mod-flows can use a cookie to modify flows; can anybody tell me a way to do this for del-flows? I have a unique cookie value for every user's flows, and I want to delete those flows by ID/cookie when the user is deleted.
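For what it's worth, I believe the error comes from the fact that a bare cookie=value in del-flows is interpreted as an attempt to set the cookie rather than match it; adding a mask turns it into a match. A sketch using the cookie=value/mask syntax from the ovs-ofctl man page, with the table and cookie taken from the failing command above:

```shell
# Delete all flows in table 2 whose cookie is exactly 12345;
# the /-1 mask means every cookie bit must match.
ovs-ofctl -O OpenFlow13 del-flows br-int "table=2,cookie=12345/-1"
```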
[ovs-dev] Why is ovs DPDK much worse than ovs in my test case?
Hi, all

I just use ovs as a static router in my test case. ovs runs in a vagrant VM, the ethernet interfaces use the virtio driver, I create two ovs bridges, each one adds one ethernet interface, and the two bridges are connected by a patch port. Only the default openflow rule is there:

table=0, priority=0 actions=NORMAL

    Bridge br-int
        Port patch-br-ex
            Interface patch-br-ex
                type: patch
                options: {peer=patch-br-int}
        Port br-int
            Interface br-int
                type: internal
        Port "dpdk0"
            Interface "dpdk0"
                type: dpdk
                options: {dpdk-devargs=":00:08.0"}
    Bridge br-ex
        Port "dpdk1"
            Interface "dpdk1"
                type: dpdk
                options: {dpdk-devargs=":00:09.0"}
        Port patch-br-int
            Interface patch-br-int
                type: patch
                options: {peer=patch-br-ex}
        Port br-ex
            Interface br-ex
                type: internal

But when I ran iperf to benchmark performance, the result shocked me.

For ovs non-dpdk, the result is:

vagrant@client1:~$ iperf -t 60 -i 10 -c 192.168.230.101
------------------------------------------------------------
Client connecting to 192.168.230.101, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.200.101 port 53900 connected with 192.168.230.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.05 GBytes   905 Mbits/sec
[  3] 10.0-20.0 sec  1.02 GBytes   877 Mbits/sec
[  3] 20.0-30.0 sec  1.07 GBytes   922 Mbits/sec
[  3] 30.0-40.0 sec  1.08 GBytes   927 Mbits/sec
[  3] 40.0-50.0 sec  1.06 GBytes   914 Mbits/sec
[  3] 50.0-60.0 sec  1.07 GBytes   922 Mbits/sec
[  3]  0.0-60.0 sec  6.37 GBytes   911 Mbits/sec
vagrant@client1:~$

For ovs dpdk, the bandwidth is just about 45 Mbits/sec. Why? I really don't understand what happened.
vagrant@client1:~$ iperf -t 60 -i 10 -c 192.168.230.101
------------------------------------------------------------
Client connecting to 192.168.230.101, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.200.101 port 53908 connected with 192.168.230.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  54.6 MBytes  45.8 Mbits/sec
[  3] 10.0-20.0 sec  55.5 MBytes  46.6 Mbits/sec
[  3] 20.0-30.0 sec  52.5 MBytes  44.0 Mbits/sec
[  3] 30.0-40.0 sec  53.6 MBytes  45.0 Mbits/sec
[  3] 40.0-50.0 sec  54.0 MBytes  45.3 Mbits/sec
[  3] 50.0-60.0 sec  53.9 MBytes  45.2 Mbits/sec
[  3]  0.0-60.0 sec   324 MBytes  45.3 Mbits/sec
vagrant@client1:~$

By the way, I tried to pin physical cores to the qemu processes which correspond to the ovs pmd threads, but it hardly affects performance.

  PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND P
16303 yangyi 20   0 9207120 209700 107500 R  99.9  0.1  63:02.37 EMT-1   1
16304 yangyi 20   0 9207120 209700 107500 R  99.9  0.1  69:16.16 EMT-2   2
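When OVS DPDK throughput is unexpectedly low like this, a useful first diagnostic is to check whether the PMD threads are actually busy and how the rx queues are distributed across them. These are standard ovs-appctl commands, though the exact output fields vary by OVS version:

```shell
# Per-PMD cycle statistics: processing vs. idle cycles.
ovs-appctl dpif-netdev/pmd-stats-show

# Which rx queue is polled by which PMD core.
ovs-appctl dpif-netdev/pmd-rxq-show
```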