On Fri, Jan 24, 2020 at 03:06:47PM -0800, William Tu wrote:
> On Fri, Jan 24, 2020 at 1:38 PM Flavio Leitner <f...@sysclose.org> wrote:
> >
> > On Fri, Jan 24, 2020 at 10:17:10AM -0800, William Tu wrote:
> > > On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner <f...@sysclose.org> wrote:
> > > >
> > > > On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > > > > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner <f...@sysclose.org> 
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Hi Ben,
> > > > > >
> > > > > > Thanks for reviewing it!
> > > > > >
> > > > > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > > > > Thanks all for review/testing, pushed to master.
> > > > > > > >
> > > > > > > > OK, thanks Ian.
> > > > > > > >
> > > > > > > > @Ben, even though this patch is already merged, I'd ask you to
> > > > > > > > take a look at the code in case you spot some issues, especially
> > > > > > > > in the non-DPDK related parts.
> > > > > > >
> > > > > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.
> > > > > > > The name suggested to me "test whether the packet is IPv4", not
> > > > > > > "test whether the packet has an offloaded IPv4 checksum".  I guess
> > > > > > > the "hwol" is offload related, but...  I like the name
> > > > > > > dp_packet_hwol_tx_l4_checksum() much more; it makes it obvious at a
> > > > > > > glance that it's a checksum-offloading check.
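> > > > > > >
> > > > > > > For example, at a call site it reads like this (illustrative
> > > > > > > snippet only, not a quote from the tree):
> > > > > > >
> > > > > > >     if (dp_packet_hwol_is_ipv4(packet)) {
> > > > > > >         /* Reads as "the packet is IPv4", but what it actually
> > > > > > >          * means is "IPv4 checksum offload was requested for
> > > > > > >          * this packet". */
> > > > > > >     }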
> > > > > >
> > > > > > hwol = hardware offloading. I hear that all the time, but maybe
> > > > > > there is a better name. I will improve that if no one gets to it
> > > > > > first.
> > > > > >
> > > > > > > In the case where we actually receive a 64 kB packet, I think that
> > > > > > > this code is going to be relatively inefficient.  If I'm reading
> > > > > > > the code correctly (I did it quickly), then this is what happens:
> > > > > > >
> > > > > > >         - The first 1500 bytes of the packet land in the first
> > > > > > >           dp_packet.
> > > > > > >
> > > > > > >         - The remaining 64000-ish bytes land in the second
> > > > > > >           dp_packet.
> > > > >
> > > > > It's not a dp_packet; it's a preallocated buffer per rxq (aux_bufs):
> > > > >
> > > > > struct netdev_rxq_linux {
> > > > >     struct netdev_rxq up;
> > > > >     bool is_tap;
> > > > >     int fd;
> > > > >     char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> > > > > };
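> > > > >
> > > > > The preallocation is roughly like this (just a sketch; the constant
> > > > > and function names here are made up, the real code in netdev-linux.c
> > > > > may differ):
> > > > >
> > > > >     /* Hypothetical size for one preallocated TSO buffer. */
> > > > >     #define AUX_BUF_SIZE (64 * 1024)
> > > > >
> > > > >     static void
> > > > >     rxq_alloc_aux_bufs(struct netdev_rxq_linux *rx)
> > > > >     {
> > > > >         for (int i = 0; i < NETDEV_MAX_BURST; i++) {
> > > > >             /* One buffer per batch slot, allocated up front so the
> > > > >              * receive fast path never calls malloc per packet. */
> > > > >             rx->aux_bufs[i] = xmalloc(AUX_BUF_SIZE);
> > > > >         }
> > > > >     }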
> > > > >
> > > > > > >
> > > > > > >         - Then we expand the first dp_packet to the needed size
> > > > > > >           and copy the remaining 64000 bytes into it.
> > > > > >
> > > > > > That's correct.
> > > > > >
> > > > > > > An alternative would be:
> > > > > > >
> > > > > > >         - Set up the first dp_packet as currently.
> > > > > > >
> > > > > > >         - Set up the second dp_packet so that the bytes are
> > > > > > >           received into it starting at offset (mtu + headroom).
> > > > > > >
> > > > > > >         - If more than mtu bytes are received, then copy those
> > > > > > >           bytes into the headroom of the second dp_packet and
> > > > > > >           return it to the caller instead of the first dp_packet.
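> > > > > > >
> > > > > > > In rough C, the receive part of that could look like this (a
> > > > > > > sketch only: helper names are approximate, error handling is
> > > > > > > omitted, and pkt2 is assumed to be backed by the large per-rxq
> > > > > > > buffer of 'large_len' bytes):
> > > > > > >
> > > > > > >     struct iovec iov[2];
> > > > > > >     ssize_t retval;
> > > > > > >
> > > > > > >     /* The first mtu bytes land in the ordinary dp_packet. */
> > > > > > >     iov[0].iov_base = dp_packet_data(pkt1);
> > > > > > >     iov[0].iov_len = mtu;
> > > > > > >
> > > > > > >     /* Any remainder lands in the big buffer, after a hole of
> > > > > > >      * (headroom + mtu) bytes reserved at its front. */
> > > > > > >     iov[1].iov_base = (char *) dp_packet_base(pkt2) + headroom + mtu;
> > > > > > >     iov[1].iov_len = large_len - headroom - mtu;
> > > > > > >
> > > > > > >     retval = readv(rx->fd, iov, 2);
> > > > > > >     if (retval <= (ssize_t) mtu) {
> > > > > > >         /* Common case: everything fit in pkt1, no copy at all. */
> > > > > > >         dp_packet_set_size(pkt1, retval);
> > > > > > >         return pkt1;
> > > > > > >     }
> > > > > > >
> > > > > > >     /* Large packet: copy only the first mtu bytes, into the hole
> > > > > > >      * at the front of pkt2, and hand pkt2 to the caller. */
> > > > > > >     memcpy((char *) dp_packet_base(pkt2) + headroom,
> > > > > > >            dp_packet_data(pkt1), mtu);
> > > > > > >     dp_packet_set_data(pkt2, (char *) dp_packet_base(pkt2) + headroom);
> > > > > > >     dp_packet_set_size(pkt2, retval);
> > > > > > >     return pkt2;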
> > > > > >
> > > > > > I wanted to avoid doing more extensive processing if it's not a TSO
> > > > > > packet, to avoid performance regressions, since it's very sensitive.
> > > > > > Right now the 64k buffer is preallocated and is static for each
> > > > > > queue to avoid the malloc performance issue. For the TSO case, we
> > > > > > have more time per packet for processing.
> > > > >
> > > > > Can we implement Ben's idea by:
> > > > > 1) setting the size of aux_buf to 64k + mtu
> > > > > 2) creating the 2nd dp_packet using this aux_buf and copying the first
> > > > >    packet into the first mtu bytes of aux_buf
> > > > > 3) since we steal this aux_buf, allocating a new one with
> > > > >    rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > > > > 4) freeing the first dp_packet and using the second dp_packet?
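> > > > >
> > > > > Roughly (just a sketch; 'i', 'mtu' and 'total_len' come from the
> > > > > receive loop, and the helper names are approximate):
> > > > >
> > > > >     /* Steps 2)-4): the stolen aux_buf becomes the backing store of
> > > > >      * the packet we return, and the slot is refilled so the next
> > > > >      * large packet still has a buffer available. */
> > > > >     char *stolen = rx->aux_bufs[i];
> > > > >     rx->aux_bufs[i] = xmalloc(65536 + mtu);      /* 3) refill slot */
> > > > >
> > > > >     memcpy(stolen, dp_packet_data(pkt1), mtu);   /* 2) first mtu bytes */
> > > > >
> > > > >     struct dp_packet *pkt2 = xmalloc(sizeof *pkt2);
> > > > >     dp_packet_use(pkt2, stolen, 65536 + mtu);    /* pkt2 now owns it */
> > > > >     dp_packet_set_size(pkt2, total_len);
> > > > >
> > > > >     dp_packet_delete(pkt1);                      /* 4) drop pkt1 */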
> > > >
> > > > I did a quick experiment while at the conference and Ben's idea is
> > > > indeed a bit faster (2.7%) when the packet is not resized due to #1.
> > > >
> > > > If the buffer gets resized to what's actually used, then it becomes
> > > > a bit slower (1.8%).
> > >
> > > Do we have to resize it?
> >
> > Well, if there is congestion, the packets will be sized proportionally
> > to the TCP window, which can be something like ~4k or ~9k, while we are
> > allocating 64k for each. Assuming many ports and many packets in
> > parallel, that might be a waste of memory.
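> >
> > Just as a rough example: a thousand such packets sitting in queues with
> > ~4k of real data each is about 4 MB of payload, but roughly 64 MB of
> > allocations if every one of them keeps its full 64k buffer.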
> >
> > > > Anyways, feel free to have a look at the code[1]. Perhaps it could
> > > > be changed to be more efficient. Just send me a patch and I will be
> > > > happy to test again.
> > > >
> > > > [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
> > >
> > > Thanks!
> > >
> > > I tested it by applying
> > > https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
> > > ("Implemented Ben suggestion.",
> > >  Signed-off-by: Flavio Leitner <f...@sysclose.org>).
> > >
> > > Using
> > >     iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
> > >
> > > 100-second TCP test:
> > >
> > > Without the patch:
> > > [  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec
> > >
> > > With the patch:
> > > [  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec
> > >
> > > I think it's a pretty good improvement!
> >
> OK, I applied "Resize the large packet to the exact size"
> 
> The performance is
> [  3]  0.0-100.0 sec  93.4 GBytes  8.02 Gbits/sec
> 
> Still pretty good. If you want to give it a try:
> 
> #!/bin/bash
> ovs-vswitchd --no-chdir --pidfile \
>     --log-file=/root/ovs/ovs-vswitchd.log --disable-system --detach
> ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3
> ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
> 
> ip netns add at_ns0
> ip link add p0 type veth peer name afxdp-p0
> ip link set p0 netns at_ns0
> ip link set dev afxdp-p0 up
> ovs-vsctl add-port br0 afxdp-p0
> 
> ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> ip addr add "10.1.1.1/24" dev p0
> ip link set dev p0 up
> NS_EXEC_HEREDOC
> 
> ip netns add at_ns1
> ip link add p1 type veth peer name afxdp-p1
> ip link set p1 netns at_ns1
> ip link set dev afxdp-p1 up
> ovs-vsctl add-port br0 afxdp-p1
> 
> ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> ip addr add "10.1.1.2/24" dev p1
> ip link set dev p1 up
> NS_EXEC_HEREDOC
> 
> Then iperf....

I was testing VM-netns, and moving to netns-netns helped a bit more:
it went from 5.6 Gbits/sec to 6.3 Gbits/sec (~ +12%). Using the
resizing patch didn't change much, just as in your test.

Ok, I will finish up the patch to move to Ben's idea plus resizing.

Thanks!
-- 
fbl