Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-24 Thread Flavio Leitner
On Fri, Jan 24, 2020 at 10:17:10AM -0800, William Tu wrote:
> On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner  wrote:
> >
> > On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner  wrote:
> > > >
> > > >
> > > > Hi Ben,
> > > >
> > > > Thanks for reviewing it!
> > > >
> > > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > > Thanks all for review/testing, pushed to master.
> > > > > >
> > > > > > OK, thanks Ian.
> > > > > >
> > > > > > @Ben, even though this patch already merged, I'd ask you to take a 
> > > > > > look
> > > > > > at the code in case you'll spot some issues especially in non-DPDK 
> > > > > > related
> > > > > > parts.
> > > > >
> > > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  
> > > > > The
> > > > > name suggested to me "test whether the packet is IPv4" not "test 
> > > > > whether
> > > > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > > > offload related but...  I like the name 
> > > > > dp_packet_hwol_tx_l4_checksum()
> > > > > much more, it makes it obvious at a glance that it's a
> > > > > checksum-offloading check.
> > > >
> > > > hwol = hardware offloading. I hear that all the time, but maybe there 
> > > > is a
> > > > better name. I will improve that if no one gets on it first.
> > > >
> > > > > In the case where we actually receive a 64 kB packet, I think that 
> > > > > this
> > > > > code is going to be relatively inefficient.  If I'm reading the code
> > > > > correctly (I did it quickly), then this is what happens:
> > > > >
> > > > > - The first 1500 bytes of the packet land in the first
> > > > >   dp_packet.
> > > > >
> > > > > - The remaining 64000ish bytes land in the second dp_packet.
> > >
> > > It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> > >
> > > struct netdev_rxq_linux {
> > > struct netdev_rxq up;
> > > bool is_tap;
> > > int fd;
> > > char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO 
> > > buffers. */
> > > };
> > >
> > > > >
> > > > > - Then we expand the first dp_packet to the needed size and 
> > > > > copy
> > > > >   the remaining 64000 bytes into it.
> > > >
> > > > That's correct.
> > > >
> > > > > An alternative would be:
> > > > >
> > > > > - Set up the first dp_packet as currently.
> > > > >
> > > > > - Set up the second dp_packet so that the bytes are received
> > > > >   into it starting at offset (mtu + headroom).
> > > > >
> > > > > - If more than mtu bytes are received, then copy those bytes
> > > > >   into the headroom of the second dp_packet and return it to 
> > > > > the
> > > > >   caller instead of the first dp_packet.
> > > >
> > > > I wanted to avoid doing more extensive processing if it's not a TSO 
> > > > packet
> > > > to avoid performance regressions since it's very sensitive. Right now
> > > > the 64k
> > > > buffer is preallocated and is static for each queue to avoid the malloc
> > > > performance issue. Now for TSO case, we have more time per packet for
> > > > processing.
> > >
> > > Can we implement Ben's idea by
> > > 1) set size of aux_buf to 64k + mtu
> > > 2) create 2nd dp_packet using this aux_buf and copy first packet to
> > > first mtu bytes of aux_buf
> > > 3) since we steal this aux_bufs, allocate a new aux_buf by
> > > rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > > 4) free the first dp_packet, and use the second dp_packet
> >
> > I did a quick experiment while at the conference and Ben's idea is
> > indeed a bit faster (2.7%) when the packet is not resized due to #1.
> >
> > If the buffer gets resized to what's actually used, then it becomes
> > a bit slower (1.8%).
> 
> Do we have to resize it?

Well, if there is congestion, the packet sizes will be proportional to
the TCP window size, which can be around ~4k or ~9k, while it is allocating
64k. Assuming many ports and many packets in parallel, that might be
a waste of memory.
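
One possible reading of "resized to what's actually used" is a shrink-to-fit
step once the frame length is known, trading the extra cycles measured above
for not pinning a full 64k per packet. A toy sketch with invented names (this
is not the code from the branch):

#include <stdlib.h>

struct toy_pkt {
    char *data;
    size_t len;      /* bytes actually used by the frame */
    size_t alloc;    /* bytes currently allocated (up to 64k + mtu) */
};

/* Give back the unused tail of the oversized TSO buffer.  If realloc cannot
 * shrink in place it may copy, which could account for the extra cycles. */
static void
toy_shrink_to_fit(struct toy_pkt *pkt)
{
    if (pkt->alloc > pkt->len && pkt->len > 0) {
        char *p = realloc(pkt->data, pkt->len);
        if (p) {
            pkt->data = p;
            pkt->alloc = pkt->len;
        }
    }
}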

> > Anyways, feel free to have a look at the code[1]. Perhaps it could
> > be changed to be more efficient. Just send me a patch and I will be
> > happy to test again.
> >
> > [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
> 
> Thanks!
> 
> I tested it by applying
> https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
> Implemented Ben suggestion.
> Signed-off-by: Flavio Leitner 
> 
> Using
> iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
> 
> Test 100 second TCP
> 
> without the patch
> [  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec
> 
> with the patch
> [  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec
> 
> I think it's a pretty good improvement!

I agree. Could you test with the resize also (next patch in that
branch) and see how the numbers 

Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-24 Thread William Tu
On Fri, Jan 24, 2020 at 6:40 AM Flavio Leitner  wrote:
>
> On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> > On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner  wrote:
> > >
> > >
> > > Hi Ben,
> > >
> > > Thanks for reviewing it!
> > >
> > > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > > Thanks all for review/testing, pushed to master.
> > > > >
> > > > > OK, thanks Ian.
> > > > >
> > > > > @Ben, even though this patch already merged, I'd ask you to take a 
> > > > > look
> > > > > at the code in case you'll spot some issues especially in non-DPDK 
> > > > > related
> > > > > parts.
> > > >
> > > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > > much more, it makes it obvious at a glance that it's a
> > > > checksum-offloading check.
> > >
> > > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > > better name. I will improve that if no one gets on it first.
> > >
> > > > In the case where we actually receive a 64 kB packet, I think that this
> > > > code is going to be relatively inefficient.  If I'm reading the code
> > > > correctly (I did it quickly), then this is what happens:
> > > >
> > > > - The first 1500 bytes of the packet land in the first
> > > >   dp_packet.
> > > >
> > > > - The remaining 64000ish bytes land in the second dp_packet.
> >
> > It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> >
> > struct netdev_rxq_linux {
> > struct netdev_rxq up;
> > bool is_tap;
> > int fd;
> > char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. 
> > */
> > };
> >
> > > >
> > > > - Then we expand the first dp_packet to the needed size and copy
> > > >   the remaining 64000 bytes into it.
> > >
> > > That's correct.
> > >
> > > > An alternative would be:
> > > >
> > > > - Set up the first dp_packet as currently.
> > > >
> > > > - Set up the second dp_packet so that the bytes are received
> > > >   into it starting at offset (mtu + headroom).
> > > >
> > > > - If more than mtu bytes are received, then copy those bytes
> > > >   into the headroom of the second dp_packet and return it to the
> > > >   caller instead of the first dp_packet.
> > >
> > > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > > to avoid performance regressions since it's very sensitive. Right now the
> > > 64k
> > > buffer is preallocated and is static for each queue to avoid the malloc
> > > performance issue. Now for TSO case, we have more time per packet for
> > > processing.
> >
> > Can we implement Ben's idea by
> > 1) set size of aux_buf to 64k + mtu
> > 2) create 2nd dp_packet using this aux_buf and copy first packet to
> > first mtu bytes of aux_buf
> > 3) since we steal this aux_bufs, allocate a new aux_buf by
> > rxq->aux_bufs[i] = xmalloc(64k + mtu)
> > 4) free the first dp_packet, and use the second dp_packet
>
> I did a quick experiment while at the conference and Ben's idea is
> indeed a bit faster (2.7%) when the packet is not resized due to #1.
>
> If the buffer gets resized to what's actually used, then it becomes
> a bit slower (1.8%).

Do we have to resize it?

>
> Anyways, feel free to have a look at the code[1]. Perhaps it could
> be changed to be more efficient. Just send me a patch and I will be
> happy to test again.
>
> [1] https://github.com/fleitner/ovs/tree/tso-cycles-ben

Thanks!

I tested it by applying
https://github.com/fleitner/ovs/commit/f0f5f630645134bf3c46201de8ce3f44e4fd2c03
Implemented Ben suggestion.
Signed-off-by: Flavio Leitner 

Using
iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)

Test 100 second TCP

without the patch
[  3]  0.0-100.0 sec  78.8 GBytes  6.77 Gbits/sec

with the patch
[  3]  0.0-100.0 sec  94.5 GBytes  8.11 Gbits/sec

I think it's a pretty good improvement!
Regards,
William


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-24 Thread Flavio Leitner
On Wed, Jan 22, 2020 at 10:33:59AM -0800, William Tu wrote:
> On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner  wrote:
> >
> >
> > Hi Ben,
> >
> > Thanks for reviewing it!
> >
> > On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > > Thanks all for review/testing, pushed to master.
> > > >
> > > > OK, thanks Ian.
> > > >
> > > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > > at the code in case you'll spot some issues especially in non-DPDK 
> > > > related
> > > > parts.
> > >
> > > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > > name suggested to me "test whether the packet is IPv4" not "test whether
> > > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > > much more, it makes it obvious at a glance that it's a
> > > checksum-offloading check.
> >
> > hwol = hardware offloading. I hear that all the time, but maybe there is a
> > better name. I will improve that if no one gets on it first.
> >
> > > In the case where we actually receive a 64 kB packet, I think that this
> > > code is going to be relatively inefficient.  If I'm reading the code
> > > correctly (I did it quickly), then this is what happens:
> > >
> > > - The first 1500 bytes of the packet land in the first
> > >   dp_packet.
> > >
> > > - The remaining 64000ish bytes land in the second dp_packet.
> 
> It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).
> 
> struct netdev_rxq_linux {
> struct netdev_rxq up;
> bool is_tap;
> int fd;
> char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> };
> 
> > >
> > > - Then we expand the first dp_packet to the needed size and copy
> > >   the remaining 64000 bytes into it.
> >
> > That's correct.
> >
> > > An alternative would be:
> > >
> > > - Set up the first dp_packet as currently.
> > >
> > > - Set up the second dp_packet so that the bytes are received
> > >   into it starting at offset (mtu + headroom).
> > >
> > > - If more than mtu bytes are received, then copy those bytes
> > >   into the headroom of the second dp_packet and return it to the
> > >   caller instead of the first dp_packet.
> >
> > I wanted to avoid doing more extensive processing if it's not a TSO packet
> > to avoid performance regressions since it's very sensitive. Right now the 64k
> > buffer is preallocated and is static for each queue to avoid the malloc
> > performance issue. Now for TSO case, we have more time per packet for
> > processing.
> 
> Can we implement Ben's idea by
> 1) set size of aux_buf to 64k + mtu
> 2) create 2nd dp_packet using this aux_buf and copy first packet to
> first mtu bytes of aux_buf
> 3) since we steal this aux_bufs, allocate a new aux_buf by
> rxq->aux_bufs[i] = xmalloc(64k + mtu)
> 4) free the first dp_packet, and use the second dp_packet

I did a quick experiment while at the conference and Ben's idea is
indeed a bit faster (2.7%) when the packet is not resized due to #1.

If the buffer gets resized to what's actually used, then it becomes
a bit slower (1.8%).

Anyways, feel free to have a look at the code[1]. Perhaps it could
be changed to be more efficient. Just send me a patch and I will be
happy to test again.

[1] https://github.com/fleitner/ovs/tree/tso-cycles-ben
Thanks,
fbl


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-22 Thread William Tu
On Wed, Jan 22, 2020 at 12:54 AM Flavio Leitner  wrote:
>
>
> Hi Ben,
>
> Thanks for reviewing it!
>
> On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> > On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > > Thanks all for review/testing, pushed to master.
> > >
> > > OK, thanks Ian.
> > >
> > > @Ben, even though this patch already merged, I'd ask you to take a look
> > > at the code in case you'll spot some issues especially in non-DPDK related
> > > parts.
> >
> > I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> > name suggested to me "test whether the packet is IPv4" not "test whether
> > the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> > offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> > much more, it makes it obvious at a glance that it's a
> > checksum-offloading check.
>
> hwol = hardware offloading. I hear that all the time, but maybe there is a
> better name. I will improve that if no one gets on it first.
>
> > In the case where we actually receive a 64 kB packet, I think that this
> > code is going to be relatively inefficient.  If I'm reading the code
> > correctly (I did it quickly), then this is what happens:
> >
> > - The first 1500 bytes of the packet land in the first
> >   dp_packet.
> >
> > - The remaining 64000ish bytes land in the second dp_packet.

It's not a dp_packet, it's a preallocated buffer per rxq (aux_bufs).

struct netdev_rxq_linux {
struct netdev_rxq up;
bool is_tap;
int fd;
char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
};
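
To illustrate how a frame larger than the MTU ends up split between the
packet's own buffer and one of these aux_bufs, here is a minimal sketch of a
two-element iovec receive; the names are invented and this is not the actual
receive path in lib/netdev-linux.c:

#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* The first 'mtu' bytes land in the packet's buffer, anything beyond that
 * spills into the preallocated 64 kB aux_buf.  MSG_TRUNC makes the kernel
 * report the full frame length even if it did not all fit. */
static ssize_t
toy_recv_split(int fd, void *pkt_data, size_t mtu,
               void *aux_buf, size_t aux_len)
{
    struct iovec iov[2] = {
        { .iov_base = pkt_data, .iov_len = mtu },
        { .iov_base = aux_buf,  .iov_len = aux_len },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    return recvmsg(fd, &msg, MSG_TRUNC);
}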

> >
> > - Then we expand the first dp_packet to the needed size and copy
> >   the remaining 64000 bytes into it.
>
> That's correct.
>
> > An alternative would be:
> >
> > - Set up the first dp_packet as currently.
> >
> > - Set up the second dp_packet so that the bytes are received
> >   into it starting at offset (mtu + headroom).
> >
> > - If more than mtu bytes are received, then copy those bytes
> >   into the headroom of the second dp_packet and return it to the
> >   caller instead of the first dp_packet.
>
> I wanted to avoid doing more extensive processing if it's not a TSO packet
> to avoid performance regressions since it's very sensitive. Right now the 64k
> buffer is preallocated and is static for each queue to avoid the malloc
> performance issue. Now for TSO case, we have more time per packet for
> processing.

Can we implement Ben's idea by
1) set size of aux_buf to 64k + mtu
2) create 2nd dp_packet using this aux_buf and copy first packet to
first mtu bytes of aux_buf
3) since we steal this aux_bufs, allocate a new aux_buf by
rxq->aux_bufs[i] = xmalloc(64k + mtu)
4) free the first dp_packet, and use the second dp_packet
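
A rough, self-contained sketch of steps 1-4 (invented names and sizes; the
real structures live in lib/netdev-linux-private.h). It assumes the TSO tail
was already received into the aux_buf at offset mtu, so only the first mtu
bytes still have to be copied in front of it, and it omits error handling:

#include <stdlib.h>
#include <string.h>

#define TOY_TSO_BUF_SIZE (64 * 1024)

struct toy_rxq {
    char *aux_bufs[32];   /* one preallocated TSO buffer per batch slot */
    size_t mtu;
};

/* Copy the first packet into the head of the aux_buf (step 2), hand the
 * buffer to the caller to wrap in the second dp_packet (step 4), and
 * replenish the slot so the next receive still has a 64k + mtu buffer
 * available (step 3). */
static char *
toy_steal_aux_buf(struct toy_rxq *rxq, int slot,
                  const char *first_pkt, size_t first_len)
{
    char *buf = rxq->aux_bufs[slot];

    memcpy(buf, first_pkt, first_len);
    rxq->aux_bufs[slot] = malloc(TOY_TSO_BUF_SIZE + rxq->mtu);
    return buf;
}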

Regards,
William
>
> > The advantage is that we do a 1500-byte copy instead of a 64000-byte
> > copy.  The disadvantage is that we waste any memory leftover in the
> > second dp_packet, e.g. 32 kB if it's only a 32 kB packet instead of 64
> > kB.  Also we need slightly more sophisticated dp_packet allocation (we
> > will need to replenish the supply of aux_bufs).
>
> I also tried to avoid waste of memory, which becomes a problem with multiple
> ports and queues working in parallel.
>
> --
> fbl


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-22 Thread Flavio Leitner


Hi Ben,

Thanks for reviewing it!

On Tue, Jan 21, 2020 at 01:35:39PM -0800, Ben Pfaff wrote:
> On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> > On 18.01.2020 00:03, Stokes, Ian wrote:
> > > Thanks all for review/testing, pushed to master.
> > 
> > OK, thanks Ian.
> > 
> > @Ben, even though this patch already merged, I'd ask you to take a look
> > at the code in case you'll spot some issues especially in non-DPDK related
> > parts.
> 
> I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
> name suggested to me "test whether the packet is IPv4" not "test whether
> the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
> offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
> much more, it makes it obvious at a glance that it's a
> checksum-offloading check.

hwol = hardware offloading. I hear that all the time, but maybe there is a
better name. I will improve that if no one gets on it first.

> In the case where we actually receive a 64 kB packet, I think that this
> code is going to be relatively inefficient.  If I'm reading the code
> correctly (I did it quickly), then this is what happens:
> 
> - The first 1500 bytes of the packet land in the first
>   dp_packet.
> 
> - The remaining 64000ish bytes land in the second dp_packet.
> 
> - Then we expand the first dp_packet to the needed size and copy
>   the remaining 64000 bytes into it.

That's correct.

> An alternative would be:
> 
> - Set up the first dp_packet as currently.
> 
> - Set up the second dp_packet so that the bytes are received
>   into it starting at offset (mtu + headroom).
> 
> - If more than mtu bytes are received, then copy those bytes
>   into the headroom of the second dp_packet and return it to the
>   caller instead of the first dp_packet.

I wanted to avoid doing more extensive processing if it's not a TSO packet
to avoid performance regressions since it's very sensitive. Right now the 64k
buffer is preallocated and is static for each queue to avoid the malloc
performance issue. Now for TSO case, we have more time per packet for
processing.
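
For reference, a self-contained sketch of the current path being described:
the tail that landed in the per-queue buffer is appended to the first packet
after growing it to the full frame size. Names are invented and error
handling is omitted; this is not the code from lib/netdev-linux.c:

#include <stdlib.h>
#include <string.h>

struct toy_pkt {
    char *data;
    size_t len;       /* packet bytes currently stored */
    size_t alloc;     /* bytes allocated */
};

/* Grow the first packet and append the TSO tail from the aux_buf.  For a
 * 64 kB frame this is the ~64000-byte copy discussed in this thread. */
static void
toy_append_tail(struct toy_pkt *pkt, const char *aux_buf, size_t tail_len)
{
    if (pkt->len + tail_len > pkt->alloc) {
        pkt->alloc = pkt->len + tail_len;
        pkt->data = realloc(pkt->data, pkt->alloc);
    }
    memcpy(pkt->data + pkt->len, aux_buf, tail_len);
    pkt->len += tail_len;
}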

> The advantage is that we do a 1500-byte copy instead of a 64000-byte
> copy.  The disadvantage is that we waste any memory leftover in the
> second dp_packet, e.g. 32 kB if it's only a 32 kB packet instead of 64
> kB.  Also we need slightly more sophisticated dp_packet allocation (we
> will need to replenish the supply of aux_bufs).

I also tried to avoid waste of memory, which becomes a problem with multiple
ports and queues working in parallel.

-- 
fbl


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-21 Thread Ben Pfaff
On Sat, Jan 18, 2020 at 12:08:06AM +0100, Ilya Maximets wrote:
> On 18.01.2020 00:03, Stokes, Ian wrote:
> > Thanks all for review/testing, pushed to master.
> 
> OK, thanks Ian.
> 
> @Ben, even though this patch already merged, I'd ask you to take a look
> at the code in case you'll spot some issues especially in non-DPDK related
> parts.

I found the name dp_packet_hwol_is_ipv4(), and similar, confusing.  The
name suggested to me "test whether the packet is IPv4" not "test whether
the packet has an offloaded IPv4 checksum".  I guess the "hwol" is
offload related but...  I like the name dp_packet_hwol_tx_l4_checksum()
much more, it makes it obvious at a glance that it's a
checksum-offloading check.

In the case where we actually receive a 64 kB packet, I think that this
code is going to be relatively inefficient.  If I'm reading the code
correctly (I did it quickly), then this is what happens:

- The first 1500 bytes of the packet land in the first
  dp_packet.

- The remaining 64000ish bytes land in the second dp_packet.

- Then we expand the first dp_packet to the needed size and copy
  the remaining 64000 bytes into it.

An alternative would be:

- Set up the first dp_packet as currently.

- Set up the second dp_packet so that the bytes are received
  into it starting at offset (mtu + headroom).

- If more than mtu bytes are received, then copy those bytes
  into the headroom of the second dp_packet and return it to the
  caller instead of the first dp_packet.

The advantage is that we do a 1500-byte copy instead of a 64000-byte
copy.  The disadvantage is that we waste any memory leftover in the
second dp_packet, e.g. 32 kB if it's only a 32 kB packet instead of 64
kB.  Also we need slightly more sophisticated dp_packet allocation (we
will need to replenish the supply of aux_bufs).
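
A minimal sketch of that alternative, assuming the second buffer was set up so
the kernel wrote the tail starting at offset (headroom + mtu); every name is
invented for illustration and the real implementation would live in
lib/netdev-linux.c:

#include <stddef.h>
#include <string.h>

struct toy_buf {
    char *base;          /* start of the allocation */
    size_t headroom;     /* reserved bytes before the packet data */
    size_t len;          /* packet bytes currently stored */
};

/* 'first' holds the initial mtu bytes; 'second' already holds the tail at
 * offset (headroom + mtu).  For a short frame keep 'first'; for a TSO frame
 * do only an mtu-sized copy into the space reserved in front of the tail
 * and return 'second' instead. */
static struct toy_buf *
toy_pick_result(struct toy_buf *first, struct toy_buf *second,
                size_t mtu, size_t frame_len)
{
    if (frame_len <= mtu) {
        first->len = frame_len;
        return first;                               /* no extra copy at all */
    }
    memcpy(second->base + second->headroom,
           first->base + first->headroom, mtu);     /* 1500-byte copy */
    second->len = frame_len;
    return second;
}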

Thanks,

Ben.


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-17 Thread Ilya Maximets
On 18.01.2020 00:03, Stokes, Ian wrote:
> Thanks all for review/testing, pushed to master.

OK, thanks Ian.

@Ben, even though this patch already merged, I'd ask you to take a look
at the code in case you'll spot some issues especially in non-DPDK related
parts.

Thanks.

Best regards, Ilya Maximets.


> 
> Regards
> Ian
> 
> -Original Message-
> From: dev  On Behalf Of Stokes, Ian
> Sent: Friday, January 17, 2020 10:56 PM
> To: Flavio Leitner ; d...@openvswitch.org
> Cc: Ilya Maximets ; txfh2007 
> Subject: Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload 
> support
> 
> 
> 
> On 1/17/2020 9:54 PM, Stokes, Ian wrote:
>>
>>
>> On 1/17/2020 9:47 PM, Flavio Leitner wrote:
>>> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
>>> the network stack to delegate the TCP segmentation to the NIC reducing
>>> the per packet CPU overhead.
>>>
>>> A guest using vhostuser interface with TSO enabled can send TCP packets
>>> much bigger than the MTU, which saves CPU cycles normally used to break
>>> the packets down to MTU size and to calculate checksums.
>>>
>>> It also saves CPU cycles used to parse multiple packets/headers during
>>> the packet processing inside virtual switch.
>>>
>>> If the destination of the packet is another guest in the same host, then
>>> the same big packet can be sent through a vhostuser interface skipping
>>> the segmentation completely. However, if the destination is not local,
>>> the NIC hardware is instructed to do the TCP segmentation and checksum
>>> calculation.
>>>
>>> It is recommended to check if NIC hardware supports TSO before enabling
>>> the feature, which is off by default. For additional information please
>>> check the tso.rst document.
>>>
>>> Signed-off-by: Flavio Leitner 
>>
>> Fantastic work here Flavio, quick turnaround when needed.
>>
>> Acked
> 
> Are there any objections to merging this?
> 
> There's been nothing so far.
> 
> If there are no further objections, I will merge this at the end of the hour.
> 
> BR
> Ian
>>
>> BR
>> Ian


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-17 Thread Stokes, Ian
Thanks all for review/testing, pushed to master.

Regards
Ian

-Original Message-
From: dev  On Behalf Of Stokes, Ian
Sent: Friday, January 17, 2020 10:56 PM
To: Flavio Leitner ; d...@openvswitch.org
Cc: Ilya Maximets ; txfh2007 
Subject: Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload 
support



On 1/17/2020 9:54 PM, Stokes, Ian wrote:
> 
> 
> On 1/17/2020 9:47 PM, Flavio Leitner wrote:
>> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
>> the network stack to delegate the TCP segmentation to the NIC reducing
>> the per packet CPU overhead.
>>
>> A guest using vhostuser interface with TSO enabled can send TCP packets
>> much bigger than the MTU, which saves CPU cycles normally used to break
>> the packets down to MTU size and to calculate checksums.
>>
>> It also saves CPU cycles used to parse multiple packets/headers during
>> the packet processing inside virtual switch.
>>
>> If the destination of the packet is another guest in the same host, then
>> the same big packet can be sent through a vhostuser interface skipping
>> the segmentation completely. However, if the destination is not local,
>> the NIC hardware is instructed to do the TCP segmentation and checksum
>> calculation.
>>
>> It is recommended to check if NIC hardware supports TSO before enabling
>> the feature, which is off by default. For additional information please
>> check the tso.rst document.
>>
>> Signed-off-by: Flavio Leitner 
> 
> Fantastic work here Flavio, quick turnaround when needed.
> 
> Acked

Are there any objections to merging this?

There's been nothing so far.

If there are no further objections, I will merge this at the end of the hour.

BR
Ian
> 
> BR
> Ian
>> ---
>>   Documentation/automake.mk  |   1 +
>>   Documentation/topics/index.rst |   1 +
>>   Documentation/topics/userspace-tso.rst |  98 +++
>>   NEWS   |   1 +
>>   lib/automake.mk    |   2 +
>>   lib/conntrack.c    |  29 +-
>>   lib/dp-packet.h    | 176 ++-
>>   lib/ipf.c  |  32 +-
>>   lib/netdev-dpdk.c  | 348 +++---
>>   lib/netdev-linux-private.h |   5 +
>>   lib/netdev-linux.c | 386 ++---
>>   lib/netdev-provider.h  |   9 +
>>   lib/netdev.c   |  78 -
>>   lib/userspace-tso.c    |  53 
>>   lib/userspace-tso.h    |  23 ++
>>   vswitchd/bridge.c  |   2 +
>>   vswitchd/vswitch.xml   |  20 ++
>>   17 files changed, 1140 insertions(+), 124 deletions(-)
>>   create mode 100644 Documentation/topics/userspace-tso.rst
>>   create mode 100644 lib/userspace-tso.c
>>   create mode 100644 lib/userspace-tso.h
>>
>> Testing:
>>   - Travis, Cirrus, AppVeyor, testsuite passed OK.
>>   - notice no changes since v4 with regards to performance.
>>
>> Changelog:
>> - v5
>>   * rebased on top of master (NEWS conflict)
>>   * added missing periods at the end of comments
>>   * mention DPDK requirement at vswitch.xml
>>   * restricted tso feature to OvS built with dpdk
>>   * headers in alphabetical order
>>   * removed unneeded call to initialize pkt
>>   * used OVS_UNLIKELY instead of unlikely
>>   * removed parenthesis from sizeof()
>>   * removed blank line at dp_packet_hwol_tx_l4_checksum()
>>   * removed redundant dp_packet_hwol_tx_ipv4_checksum()
>>   * updated function comments as suggested
>>
>> - v4
>>   * rebased on top of master (recvmmsg)
>>   * fixed URL in doc to point to 19.11
>>   * renamed tso to userspace-tso
>>   * renamed the option to userspace-tso-enable
>>   * removed prototype that left over from v2
>>   * fixed function style declaration
>>   * renamed dp_packet_hwol_tx_ip_checksum to 
>> dp_packet_hwol_tx_ipv4_checksum
>>   * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
>>   * account for drops while prepping the batch for TX.
>>   * don't prep the batch for TX if TSO is disabled.
>>   * simplified setsockopt error checking
>>   * fixed af_packet_sock error checking to not call setsockopt on
>>    closed sockets.
>>   * fixed ol_flags comment.
>>   * used VLOG_ERR_BUF() to pass error messages.
>>   * fixed packet leak at netdev_send_prepare_batch()
>>   * added a coverage counter to account drops while preparin

Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-17 Thread Stokes, Ian



On 1/17/2020 9:54 PM, Stokes, Ian wrote:



On 1/17/2020 9:47 PM, Flavio Leitner wrote:

Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.

A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.

If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.

Signed-off-by: Flavio Leitner 


Fantastic work here Flavio, quick turnaround when needed.

Acked


Are there any objections to merging this?

There's been nothing so far.

If there are no further objections, I will merge this at the end of the hour.

BR
Ian


BR
Ian

---
  Documentation/automake.mk  |   1 +
  Documentation/topics/index.rst |   1 +
  Documentation/topics/userspace-tso.rst |  98 +++
  NEWS   |   1 +
  lib/automake.mk    |   2 +
  lib/conntrack.c    |  29 +-
  lib/dp-packet.h    | 176 ++-
  lib/ipf.c  |  32 +-
  lib/netdev-dpdk.c  | 348 +++---
  lib/netdev-linux-private.h |   5 +
  lib/netdev-linux.c | 386 ++---
  lib/netdev-provider.h  |   9 +
  lib/netdev.c   |  78 -
  lib/userspace-tso.c    |  53 
  lib/userspace-tso.h    |  23 ++
  vswitchd/bridge.c  |   2 +
  vswitchd/vswitch.xml   |  20 ++
  17 files changed, 1140 insertions(+), 124 deletions(-)
  create mode 100644 Documentation/topics/userspace-tso.rst
  create mode 100644 lib/userspace-tso.c
  create mode 100644 lib/userspace-tso.h

Testing:
  - Travis, Cirrus, AppVeyor, testsuite passed OK.
  - notice no changes since v4 with regards to performance.

Changelog:
- v5
  * rebased on top of master (NEWS conflict)
  * added missing periods at the end of comments
  * mention DPDK requirement at vswitch.xml
  * restricted tso feature to OvS built with dpdk
  * headers in alphabetical order
  * removed unneeded call to initialize pkt
  * used OVS_UNLIKELY instead of unlikely
  * removed parenthesis from sizeof()
  * removed blank line at dp_packet_hwol_tx_l4_checksum()
  * removed redundant dp_packet_hwol_tx_ipv4_checksum()
  * updated function comments as suggested

- v4
  * rebased on top of master (recvmmsg)
  * fixed URL in doc to point to 19.11
  * renamed tso to userspace-tso
  * renamed the option to userspace-tso-enable
  * removed prototype that left over from v2
  * fixed function style declaration
  * renamed dp_packet_hwol_tx_ip_checksum to 
dp_packet_hwol_tx_ipv4_checksum

  * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
  * account for drops while prepping the batch for TX.
  * don't prep the batch for TX if TSO is disabled.
  * simplified setsockopt error checking
  * fixed af_packet_sock error checking to not call setsockopt on
   closed sockets.
  * fixed ol_flags comment.
  * used VLOG_ERR_BUF() to pass error messages.
  * fixed packet leak at netdev_send_prepare_batch()
  * added a coverage counter to account drops while preparing a batch
    at netdev.c
  * fixed netdev_send() to not call ->send() if the batch is empty.
  * fixed packet leak at netdev_push_header and account for the drops.
  * removed DPDK requirement to enable userspace TSO support.
  * fixed parameter documentation in vswitch.xml.
  * renamed tso.rst to userspace-tso.rst and moved to topics/
  * added comments documenting the functions in dp-packet.h
  * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG

- v3
  * Improved the documentation.
  * Updated copyright year to 2020.
  * TSO offloaded msg now includes the netdev's name.
  * Added period at the end of all code comments.
  * Warn and drop encapsulation of TSO packets.
  * Fixed travis issue with restricted virtio types.
  * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
    which caused packet corruption.
  * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
    PKT_TX_IP_CKSUM only for IPv4 packets.

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index f2ca17bad..22976a3cd 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.m

Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-17 Thread 0-day Robot
Bleep bloop.  Greetings Flavio Leitner, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 80 characters long (recommended limit is 79)
#1901 FILE: lib/userspace-tso.c:41:
VLOG_WARN("Userspace TCP Segmentation Offloading can not be enabled"

Lines checked: 1996, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot


Re: [ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-17 Thread Stokes, Ian




On 1/17/2020 9:47 PM, Flavio Leitner wrote:

Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.

A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.

If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.

Signed-off-by: Flavio Leitner 


Fantastic work here Flavio, quick turnaround when needed.

Acked

BR
Ian

---
  Documentation/automake.mk  |   1 +
  Documentation/topics/index.rst |   1 +
  Documentation/topics/userspace-tso.rst |  98 +++
  NEWS   |   1 +
  lib/automake.mk|   2 +
  lib/conntrack.c|  29 +-
  lib/dp-packet.h| 176 ++-
  lib/ipf.c  |  32 +-
  lib/netdev-dpdk.c  | 348 +++---
  lib/netdev-linux-private.h |   5 +
  lib/netdev-linux.c | 386 ++---
  lib/netdev-provider.h  |   9 +
  lib/netdev.c   |  78 -
  lib/userspace-tso.c|  53 
  lib/userspace-tso.h|  23 ++
  vswitchd/bridge.c  |   2 +
  vswitchd/vswitch.xml   |  20 ++
  17 files changed, 1140 insertions(+), 124 deletions(-)
  create mode 100644 Documentation/topics/userspace-tso.rst
  create mode 100644 lib/userspace-tso.c
  create mode 100644 lib/userspace-tso.h

Testing:
  - Travis, Cirrus, AppVeyor, testsuite passed OK.
  - notice no changes since v4 with regards to performance.

Changelog:
- v5
  * rebased on top of master (NEWS conflict)
  * added missing periods at the end of comments
  * mention DPDK requirement at vswitch.xml
  * restricted tso feature to OvS built with dpdk
  * headers in alphabetical order
  * removed unneeded call to initialize pkt
  * used OVS_UNLIKELY instead of unlikely
  * removed parenthesis from sizeof()
  * removed blank line at dp_packet_hwol_tx_l4_checksum()
  * removed redundant dp_packet_hwol_tx_ipv4_checksum()
  * updated function comments as suggested

- v4
  * rebased on top of master (recvmmsg)
  * fixed URL in doc to point to 19.11
  * renamed tso to userspace-tso
  * renamed the option to userspace-tso-enable
  * removed prototype that left over from v2
  * fixed function style declaration
  * renamed dp_packet_hwol_tx_ip_checksum to dp_packet_hwol_tx_ipv4_checksum
  * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
  * account for drops while prepping the batch for TX.
  * don't prep the batch for TX if TSO is disabled.
  * simplified setsockopt error checking
  * fixed af_packet_sock error checking to not call setsockopt on
   closed sockets.
  * fixed ol_flags comment.
  * used VLOG_ERR_BUF() to pass error messages.
  * fixed packet leak at netdev_send_prepare_batch()
  * added a coverage counter to account drops while preparing a batch
at netdev.c
  * fixed netdev_send() to not call ->send() if the batch is empty.
  * fixed packet leak at netdev_push_header and account for the drops.
  * removed DPDK requirement to enable userspace TSO support.
  * fixed parameter documentation in vswitch.xml.
  * renamed tso.rst to userspace-tso.rst and moved to topics/
  * added comments documenting the functions in dp-packet.h
  * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG

- v3
  * Improved the documentation.
  * Updated copyright year to 2020.
  * TSO offloaded msg now includes the netdev's name.
  * Added period at the end of all code comments.
  * Warn and drop encapsulation of TSO packets.
  * Fixed travis issue with restricted virtio types.
  * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
which caused packet corruption.
  * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
PKT_TX_IP_CKSUM only for IPv4 packets.

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index f2ca17bad..22976a3cd 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -57,6 +57,7 @@ DOC_SOURCE = \
Documentation/topics/ovsdb-replication.rst \
Documentation/topics/porting.rst \
Documentation/topics/tracing.rst \
+   Documentation/t

[ovs-dev] [PATCH v5] userspace: Add TCP Segmentation Offload support

2020-01-17 Thread Flavio Leitner
Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.

A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.

If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.

Signed-off-by: Flavio Leitner 
---
 Documentation/automake.mk  |   1 +
 Documentation/topics/index.rst |   1 +
 Documentation/topics/userspace-tso.rst |  98 +++
 NEWS   |   1 +
 lib/automake.mk|   2 +
 lib/conntrack.c|  29 +-
 lib/dp-packet.h| 176 ++-
 lib/ipf.c  |  32 +-
 lib/netdev-dpdk.c  | 348 +++---
 lib/netdev-linux-private.h |   5 +
 lib/netdev-linux.c | 386 ++---
 lib/netdev-provider.h  |   9 +
 lib/netdev.c   |  78 -
 lib/userspace-tso.c|  53 
 lib/userspace-tso.h|  23 ++
 vswitchd/bridge.c  |   2 +
 vswitchd/vswitch.xml   |  20 ++
 17 files changed, 1140 insertions(+), 124 deletions(-)
 create mode 100644 Documentation/topics/userspace-tso.rst
 create mode 100644 lib/userspace-tso.c
 create mode 100644 lib/userspace-tso.h
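
As a conceptual illustration of the forwarding decision described in the
commit message above (this is not code from the patch; every name below is
invented): a TSO-marked frame headed to a local vhostuser port is passed
through whole, while one leaving the host has segmentation and checksumming
delegated to the NIC.

#include <stdbool.h>
#include <stddef.h>

enum toy_dest { TOY_DEST_LOCAL_VHOST, TOY_DEST_PHYS_PORT };

struct toy_frame {
    size_t len;          /* may be much larger than the MTU */
    bool tso_requested;  /* guest marked the frame for segmentation offload */
    bool nic_tso;        /* ask the NIC to segment on transmit */
    bool nic_l4_csum;    /* ask the NIC to fill in the TCP checksum */
};

static void
toy_prepare_tx(struct toy_frame *f, enum toy_dest dest)
{
    if (!f->tso_requested || dest == TOY_DEST_LOCAL_VHOST) {
        /* Ordinary frame, or guest-to-guest on the same host: the frame is
         * handed over as-is and no segmentation happens in the switch. */
        return;
    }
    /* Leaving the host: let the hardware do segmentation and checksums. */
    f->nic_tso = true;
    f->nic_l4_csum = true;
}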

Testing:
 - Travis, Cirrus, AppVeyor, testsuite passed OK.
 - notice no changes since v4 with regards to performance.

Changelog:
- v5
 * rebased on top of master (NEWS conflict)
 * added missing periods at the end of comments
 * mention DPDK requirement at vswitch.xml
 * restricted tso feature to OvS built with dpdk
 * headers in alphabetical order
 * removed unneeded call to initialize pkt
 * used OVS_UNLIKELY instead of unlikely
 * removed parenthesis from sizeof()
 * removed blank line at dp_packet_hwol_tx_l4_checksum()
 * removed redundant dp_packet_hwol_tx_ipv4_checksum()
 * updated function comments as suggested

- v4
 * rebased on top of master (recvmmsg)
 * fixed URL in doc to point to 19.11
 * renamed tso to userspace-tso
 * renamed the option to userspace-tso-enable
 * removed prototype that left over from v2
 * fixed function style declaration
 * renamed dp_packet_hwol_tx_ip_checksum to dp_packet_hwol_tx_ipv4_checksum
 * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
 * account for drops while prepping the batch for TX.
 * don't prep the batch for TX if TSO is disabled.
 * simplified setsockopt error checking
 * fixed af_packet_sock error checking to not call setsockopt on
  closed sockets.
 * fixed ol_flags comment.
 * used VLOG_ERR_BUF() to pass error messages.
 * fixed packet leak at netdev_send_prepare_batch()
 * added a coverage counter to account drops while preparing a batch
   at netdev.c
 * fixed netdev_send() to not call ->send() if the batch is empty.
 * fixed packet leak at netdev_push_header and account for the drops.
 * removed DPDK requirement to enable userspace TSO support.
 * fixed parameter documentation in vswitch.xml.
 * renamed tso.rst to userspace-tso.rst and moved to topics/
 * added comments documenting the functions in dp-packet.h
 * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG

- v3
 * Improved the documentation.
 * Updated copyright year to 2020.
 * TSO offloaded msg now includes the netdev's name.
 * Added period at the end of all code comments.
 * Warn and drop encapsulation of TSO packets.
 * Fixed travis issue with restricted virtio types.
 * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
   which caused packet corruption.
 * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
   PKT_TX_IP_CKSUM only for IPv4 packets.

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index f2ca17bad..22976a3cd 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -57,6 +57,7 @@ DOC_SOURCE = \
Documentation/topics/ovsdb-replication.rst \
Documentation/topics/porting.rst \
Documentation/topics/tracing.rst \
+   Documentation/topics/userspace-tso.rst \
Documentation/topics/windows.rst \
Documentation/howto/index.rst \
Documentation/howto/dpdk.rst \
diff --git a/Documentation/topics/index.rst b/Docu