Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Alexander Duyck
On Fri, Mar 11, 2016 at 2:55 PM, Tom Herbert  wrote:
> On Fri, Mar 11, 2016 at 2:31 PM, Alexander Duyck
>  wrote:
>> On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree  wrote:
>>> On 11/03/16 21:09, Alexander Duyck wrote:
 The only real issue with the "generic" TSO is that it isn't going to
 be so generic.  We have different devices that will support different
 levels of stuff.  For example the ixgbe drivers will need to treat the
 outer tunnel header as one giant L2 header.  As a result we will need
 to populate all the fields in the outer header including the outer IP
 ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
 i40e I think this gets a bit simpler as they already handle the outer
 IPv4 ID and checksum.  I think there we may need to only populate the
 checksum for it to work out correctly.  As such I may look at coming
 up with a number of functions so that we can mix and match based on
 what is needed in order to assemble a partially segmented frame.
>>> AIUI, the point of the design is that we _can_ populate everything,
>>> because we're keeping lengths and outer IP ID fixed, so outer checksums
>>> stay the same and the outer tunnel header _is_ just one giant L2 header
>>> which is bit-for-bit identical for each generated segment.  So every
>>> devicegets to just be dumb and treat it as opaque.
>>
>> This works so long as the device isn't trying to do anything like
>> insert VLAN tags.  Then I think we might have an issue since we don't
>> want to confuse the device and have it trying to insert the tag on the
>> inner frame's Ethernet header.
>>
> In Edward's giant L2 header mode, couldn't VLAN tags just be part of that?

The problem is things like VFs which aren't allowed to insert their
own tags.  Having them try to lie about where the network header
actually starts may trigger things like anti-spoof events.

>> I suspect we may have differing levels of "dumb" that we have to deal
>> with.  That is all I am saying.  By default we could just populate all
>> of the length and checksum fields in the outer header, we would just
>> have to be consistent about what is provided then.  In addition there
>> will be the matter of sorting out the IP ID bits.  For example some of
>> the i40e parts support tunnel offloads, but not tunnel offloads with
>> checksums enabled.  I suspect those parts will end up wanting to
>> handle the outer IP header and UDP length values.  As a result there
>> trying to do a "dumb" send may result in us really messing up the IP
>> ID values if we don't take steps to make it a bit smarter.
>>
 The other issue I am working on at the moment to enable all this is to
 fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
 terms of handling packet lengths greater than 65535.  Currently we are
 messing up the checksum in relation to IPv6 since we are using the
 truncated uh->len value.  I'll be submitting some patches later today
 that will hopefully get that fixed and that in turn should make the
 rest of the segmentation work easier.
>>> Again, in the superpacket we want to calculate the checksum based on the
>>> subsegment length, rather than the length of the superpacket.  The idea
>>> is to create the header for an MSS-sized segment, then follow it with an
>>> inner IP & TCP header, and n*MSS bytes of payload.  (This of course
>>> produces a superpacket that's not what you'd send over a link with a 64k
>>> MTU, unlike how we do it today.)
>>
>> The question is at what point do we do the chopping.  Should we be
>> doing this in the drivers or somewhere higher in the stack like we do
>> for standard GSO segmentation.  I would think we would need to add
>> another bit that says we can do GSO with custom outer headers since I
>> can see VLANs being a possible issue otherwise.
>>
>>> Then hw just chops up the payload, fixes up the inner headers, and glues
>>> the "L2" header on each packet.
>>
>> Yea, it sounds really straight forward and easy.  It isn't till you
>> start digging into the actual code that it gets a bit hairy.
>>
>> What this effectively is is another form of TSO where each driver will
>> want to do things a little differently.  Alot of it has to do with the
>> fact that this is kind of a nasty hack that we are trying to add since
>> many devices won't like the fact that we are lying about the size of
>> our actual L2 header so things like VLAN tag insertion are going to
>> end up blowing back on us.
>>
> Right, the point is that we're trying to get out of the model where
> every driver/device implements TSO differently, supports ad hoc
> protocols, etc. Do you see any other common invasive technique that we
> need to deal with other than VLAN insertion and IP ID?

Well that is the thing.  Before we can actually start tinkering with
the outer header we probably need to make sure we set the DF bit and

Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Tom Herbert
On Fri, Mar 11, 2016 at 2:31 PM, Alexander Duyck
 wrote:
> On Fri, Mar 11, 2016 at 1:29 PM, Edward Cree  wrote:
>> On 11/03/16 21:09, Alexander Duyck wrote:
>>> The only real issue with the "generic" TSO is that it isn't going to
>>> be so generic.  We have different devices that will support different
>>> levels of stuff.  For example the ixgbe drivers will need to treat the
>>> outer tunnel header as one giant L2 header.  As a result we will need
>>> to populate all the fields in the outer header including the outer IP
>>> ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
>>> i40e I think this gets a bit simpler as they already handle the outer
>>> IPv4 ID and checksum.  I think there we may need to only populate the
>>> checksum for it to work out correctly.  As such I may look at coming
>>> up with a number of functions so that we can mix and match based on
>>> what is needed in order to assemble a partially segmented frame.
>> AIUI, the point of the design is that we _can_ populate everything,
>> because we're keeping lengths and outer IP ID fixed, so outer checksums
>> stay the same and the outer tunnel header _is_ just one giant L2 header
>> which is bit-for-bit identical for each generated segment.  So every
>> devicegets to just be dumb and treat it as opaque.
>
> This works so long as the device isn't trying to do anything like
> insert VLAN tags.  Then I think we might have an issue since we don't
> want to confuse the device and have it trying to insert the tag on the
> inner frame's Ethernet header.
>
In Edward's giant L2 header mode, couldn't VLAN tags just be part of that?

> I suspect we may have differing levels of "dumb" that we have to deal
> with.  That is all I am saying.  By default we could just populate all
> of the length and checksum fields in the outer header, we would just
> have to be consistent about what is provided then.  In addition there
> will be the matter of sorting out the IP ID bits.  For example some of
> the i40e parts support tunnel offloads, but not tunnel offloads with
> checksums enabled.  I suspect those parts will end up wanting to
> handle the outer IP header and UDP length values.  As a result there
> trying to do a "dumb" send may result in us really messing up the IP
> ID values if we don't take steps to make it a bit smarter.
>
>>> The other issue I am working on at the moment to enable all this is to
>>> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
>>> terms of handling packet lengths greater than 65535.  Currently we are
>>> messing up the checksum in relation to IPv6 since we are using the
>>> truncated uh->len value.  I'll be submitting some patches later today
>>> that will hopefully get that fixed and that in turn should make the
>>> rest of the segmentation work easier.
>> Again, in the superpacket we want to calculate the checksum based on the
>> subsegment length, rather than the length of the superpacket.  The idea
>> is to create the header for an MSS-sized segment, then follow it with an
>> inner IP & TCP header, and n*MSS bytes of payload.  (This of course
>> produces a superpacket that's not what you'd send over a link with a 64k
>> MTU, unlike how we do it today.)
>
> The question is at what point do we do the chopping.  Should we be
> doing this in the drivers or somewhere higher in the stack like we do
> for standard GSO segmentation.  I would think we would need to add
> another bit that says we can do GSO with custom outer headers since I
> can see VLANs being a possible issue otherwise.
>
>> Then hw just chops up the payload, fixes up the inner headers, and glues
>> the "L2" header on each packet.
>
> Yea, it sounds really straight forward and easy.  It isn't till you
> start digging into the actual code that it gets a bit hairy.
>
> What this effectively is is another form of TSO where each driver will
> want to do things a little differently.  Alot of it has to do with the
> fact that this is kind of a nasty hack that we are trying to add since
> many devices won't like the fact that we are lying about the size of
> our actual L2 header so things like VLAN tag insertion are going to
> end up blowing back on us.
>
Right, the point is that we're trying to get out of the model where
every driver/device implements TSO differently, supports ad hoc
protocols, etc. Do you see any other common invasive technique that we
need to deal with other than VLAN insertion and IP ID?

> Really my preference in the case of ixgbe would have been to let the
> hardware update the outer IP header and the inner TCP header and then
> do the UDP and inner IP header as the static headers.  That way we
> could still theoretically support fragmentation on the outer headers
> which last I knew is a very real possibility since the DF bit is not
> set on the outer headers for VXLAN I believe.
>
> - Alex


Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Edward Cree
On 11/03/16 21:09, Alexander Duyck wrote:
> The only real issue with the "generic" TSO is that it isn't going to
> be so generic.  We have different devices that will support different
> levels of stuff.  For example the ixgbe drivers will need to treat the
> outer tunnel header as one giant L2 header.  As a result we will need
> to populate all the fields in the outer header including the outer IP
> ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
> i40e I think this gets a bit simpler as they already handle the outer
> IPv4 ID and checksum.  I think there we may need to only populate the
> checksum for it to work out correctly.  As such I may look at coming
> up with a number of functions so that we can mix and match based on
> what is needed in order to assemble a partially segmented frame.
AIUI, the point of the design is that we _can_ populate everything,
because we're keeping lengths and outer IP ID fixed, so outer checksums
stay the same and the outer tunnel header _is_ just one giant L2 header
which is bit-for-bit identical for each generated segment.  So every
devicegets to just be dumb and treat it as opaque.
> The other issue I am working on at the moment to enable all this is to
> fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
> terms of handling packet lengths greater than 65535.  Currently we are
> messing up the checksum in relation to IPv6 since we are using the
> truncated uh->len value.  I'll be submitting some patches later today
> that will hopefully get that fixed and that in turn should make the
> rest of the segmentation work easier.
Again, in the superpacket we want to calculate the checksum based on the
subsegment length, rather than the length of the superpacket.  The idea
is to create the header for an MSS-sized segment, then follow it with an
inner IP & TCP header, and n*MSS bytes of payload.  (This of course
produces a superpacket that's not what you'd send over a link with a 64k
MTU, unlike how we do it today.)
Then hw just chops up the payload, fixes up the inner headers, and glues
the "L2" header on each packet.

-Ed


Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Alexander Duyck
On Fri, Mar 11, 2016 at 12:24 PM, Edward Cree  wrote:
> On 11/03/16 20:16, Tom Herbert wrote:
>> On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree  wrote:
>>> On 11/03/16 19:57, Tom Herbert wrote:
 On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree  wrote:
> Tom,
>
> Are you planning to / working on implementing this?  If not, I might have 
> a
> crack at it; I've talked to our firmware guys and (provisionally) we think
> we can support it in current sfc hardware.
> Or were there any blocking problems raised in the thread?  My 
> understanding
> of the IP ID issue was that it only matters for the inner frame, because
> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I 
> may
> have missed something.
>
 Right, then the interface would need to just include the offset of the
 IP ID. But doesn't this break using LCO with GSO though-- i.e. the
 outer checksum and inner checksum still need to be updated per packet
 so we need to tell device where outer checksum(s) is.
>>> No, outer checksum shouldn't change: IP ID is protected by inner IP header
>>> checksum, which device will edit.  No?
>> Right, the interface would probably still need offset to the IPv4 hdr?
> Yes; I'm assuming the interface could just be "offset to inner IP header",
> and the hardware knows well enough what IP and TCP headers look like that
> it can figure out the rest (including skipping over options if e.g. ihl>5).
> So, do you want to try and implement it or shall I?

I've already started looking into this and was waiting for feedback
from Dave about the IPv4 ID issue which is looks like he is okay with
a static ID value as long as the DF bit is set.

The only real issue with the "generic" TSO is that it isn't going to
be so generic.  We have different devices that will support different
levels of stuff.  For example the ixgbe drivers will need to treat the
outer tunnel header as one giant L2 header.  As a result we will need
to populate all the fields in the outer header including the outer IP
ID, checksum, udp->len, and UDP or GRE checksum if requested.  For
i40e I think this gets a bit simpler as they already handle the outer
IPv4 ID and checksum.  I think there we may need to only populate the
checksum for it to work out correctly.  As such I may look at coming
up with a number of functions so that we can mix and match based on
what is needed in order to assemble a partially segmented frame.

The other issue I am working on at the moment to enable all this is to
fix the differents between csum_tcpudp_magic and csum_ipv6_magic in
terms of handling packet lengths greater than 65535.  Currently we are
messing up the checksum in relation to IPv6 since we are using the
truncated uh->len value.  I'll be submitting some patches later today
that will hopefully get that fixed and that in turn should make the
rest of the segmentation work easier.

- Alex


Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Edward Cree
On 11/03/16 20:16, Tom Herbert wrote:
> On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree  wrote:
>> On 11/03/16 19:57, Tom Herbert wrote:
>>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree  wrote:
 Tom,

 Are you planning to / working on implementing this?  If not, I might have a
 crack at it; I've talked to our firmware guys and (provisionally) we think
 we can support it in current sfc hardware.
 Or were there any blocking problems raised in the thread?  My understanding
 of the IP ID issue was that it only matters for the inner frame, because
 the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
 have missed something.

>>> Right, then the interface would need to just include the offset of the
>>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
>>> outer checksum and inner checksum still need to be updated per packet
>>> so we need to tell device where outer checksum(s) is.
>> No, outer checksum shouldn't change: IP ID is protected by inner IP header
>> checksum, which device will edit.  No?
> Right, the interface would probably still need offset to the IPv4 hdr?
Yes; I'm assuming the interface could just be "offset to inner IP header",
and the hardware knows well enough what IP and TCP headers look like that
it can figure out the rest (including skipping over options if e.g. ihl>5).
So, do you want to try and implement it or shall I?

-Ed


Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Tom Herbert
On Fri, Mar 11, 2016 at 11:59 AM, Edward Cree  wrote:
> On 11/03/16 19:57, Tom Herbert wrote:
>> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree  wrote:
>>> Tom,
>>>
>>> Are you planning to / working on implementing this?  If not, I might have a
>>> crack at it; I've talked to our firmware guys and (provisionally) we think
>>> we can support it in current sfc hardware.
>>> Or were there any blocking problems raised in the thread?  My understanding
>>> of the IP ID issue was that it only matters for the inner frame, because
>>> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
>>> have missed something.
>>>
>> Right, then the interface would need to just include the offset of the
>> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
>> outer checksum and inner checksum still need to be updated per packet
>> so we need to tell device where outer checksum(s) is.
> No, outer checksum shouldn't change: IP ID is protected by inner IP header
> checksum, which device will edit.  No?

Right, the interface would probably still need offset to the IPv4 hdr?

>
> -Ed


Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Edward Cree
On 11/03/16 19:57, Tom Herbert wrote:
> On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree  wrote:
>> Tom,
>>
>> Are you planning to / working on implementing this?  If not, I might have a
>> crack at it; I've talked to our firmware guys and (provisionally) we think
>> we can support it in current sfc hardware.
>> Or were there any blocking problems raised in the thread?  My understanding
>> of the IP ID issue was that it only matters for the inner frame, because
>> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
>> have missed something.
>>
> Right, then the interface would need to just include the offset of the
> IP ID. But doesn't this break using LCO with GSO though-- i.e. the
> outer checksum and inner checksum still need to be updated per packet
> so we need to tell device where outer checksum(s) is.
No, outer checksum shouldn't change: IP ID is protected by inner IP header
checksum, which device will edit.  No?

-Ed


Re: Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Tom Herbert
On Fri, Mar 11, 2016 at 11:20 AM, Edward Cree  wrote:
> On 20/02/16 19:51, Tom Herbert wrote:
>> Right. To use LCO with TSO we would need to ensure that all packets
>> are the same size so that the UDP length field and thus checksum are
>> constant for all created segments. But this property this would also
>> make any payload lengths in headers constant for all packets so that
>> the only fields that need be set per generated packet would be the TCP
>> sequence number and checksum. This simplifying assumption could be
>> used to make a very protocol-generic GSO/TSO (up to the transport
>> header)!
>>
>> Conceptually, a device would just need to know the start of the
>> packet, the offset of the transport header, and the size of each
>> segment. Any bits from the start of the packet to the beginning of the
>> transport header are just copied to each segment, so any combination
>> of encapsulation/network protocols is  supported as long as they are
>> constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for
>> needing TSO support).
> Tom,
>
> Are you planning to / working on implementing this?  If not, I might have a
> crack at it; I've talked to our firmware guys and (provisionally) we think
> we can support it in current sfc hardware.
> Or were there any blocking problems raised in the thread?  My understanding
> of the IP ID issue was that it only matters for the inner frame, because
> the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
> have missed something.
>
Right, then the interface would need to just include the offset of the
IP ID. But doesn't this break using LCO with GSO though-- i.e. the
outer checksum and inner checksum still need to be updated per packet
so we need to tell device where outer checksum(s) is.

Thanks,
Tom

> -Ed


Generic TSO (was Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default)

2016-03-11 Thread Edward Cree
On 20/02/16 19:51, Tom Herbert wrote:
> Right. To use LCO with TSO we would need to ensure that all packets
> are the same size so that the UDP length field and thus checksum are
> constant for all created segments. But this property this would also
> make any payload lengths in headers constant for all packets so that
> the only fields that need be set per generated packet would be the TCP
> sequence number and checksum. This simplifying assumption could be
> used to make a very protocol-generic GSO/TSO (up to the transport
> header)!
>
> Conceptually, a device would just need to know the start of the
> packet, the offset of the transport header, and the size of each
> segment. Any bits from the start of the packet to the beginning of the
> transport header are just copied to each segment, so any combination
> of encapsulation/network protocols is  supported as long as they are
> constant for each segment (e.g. MPLS, NSH, etc. are on the horizon for
> needing TSO support).
Tom,

Are you planning to / working on implementing this?  If not, I might have a
crack at it; I've talked to our firmware guys and (provisionally) we think
we can support it in current sfc hardware.
Or were there any blocking problems raised in the thread?  My understanding
of the IP ID issue was that it only matters for the inner frame, because
the rest aren't TCP (so hopefully no-one is doing SLHC on them).  But I may
have missed something.

-Ed


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-25 Thread David Miller
From: Tom Herbert 
Date: Tue, 23 Feb 2016 08:47:30 -0800

> Right, GRO should probably not coalesce packets with non-zero IP
> identifiers due to the loss of information.

If they are monotonically increasing, which is the only case worth
caring about, it absolutely should.

Because that can be done without any loss of information.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-24 Thread Edward Cree
On 24/02/16 00:53, Tom Herbert wrote:
> That's an interesting idea. This should work in IPv6 now and nearly
> all encapsulation protocols (GRE w/ csum doesn't work this way for
> instance)
You mean GRE with sequence numbers?  csum should be fine, it'sjust the
usual LCO setup (i.e. only depends on headers).

-Ed


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-24 Thread David Miller
From: David Laight 
Date: Wed, 24 Feb 2016 09:58:03 +

> From: David Miller
>> Sent: 23 February 2016 18:25
>> 
>> From: Jesse Gross 
>> Date: Tue, 23 Feb 2016 09:31:09 -0800
>> 
>> > Most OSs (including Linux with connected TCP sockets) use non-zero IP
>> > IDs so requiring this would effectively disable GRO.
>> 
>> +1
>> 
>> Any OS that wants to work with SLHC, as I mentioned, has to emit
>> monotonically increasing IP ID values in all packets, even those with
>> DF set.
> 
> Doesn't that leak a lot of info about the sending system?
> ISTR one OS deliberately randomising the IP ID values in order
> to avoid giving out information about the number of packets being sent.

The ID generater is per-flow, therefore I don't think this is an
issue.

And if it is an issue, then it exists for fragmented traffic on
every machine on the planet.


RE: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-24 Thread David Laight
From: David Miller
> Sent: 23 February 2016 18:25
> 
> From: Jesse Gross 
> Date: Tue, 23 Feb 2016 09:31:09 -0800
> 
> > Most OSs (including Linux with connected TCP sockets) use non-zero IP
> > IDs so requiring this would effectively disable GRO.
> 
> +1
> 
> Any OS that wants to work with SLHC, as I mentioned, has to emit
> monotonically increasing IP ID values in all packets, even those with
> DF set.

Doesn't that leak a lot of info about the sending system?
ISTR one OS deliberately randomising the IP ID values in order
to avoid giving out information about the number of packets being sent.

David



Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Tom Herbert
> [ I really hope we can figure out a way not to change IP IDs, because I think
> an inverted version of Tom's generic TSO could also work as a generic h/w GRO
> accelerator.  In its simplest form, the hardware just remembers the previous
> packet, then gives us the length of the common prefix.  If that ends four 
> bytes
> into a TCP header, then the packet is a candidate for GRO; otherwise not.  I
> haven't worked out all the details yet, but it's clear that non-constant IP 
> IDs
> would cause problems.  (If it's only the innermost IP ID that's changing, we
> can probably still cope, but the performance gain will be less and the
> implementation could start to get ugly.) ]

That's an interesting idea. This should work in IPv6 now and nearly
all encapsulation protocols (GRE w/ csum doesn't work this way for
instance), the logic to just match to a GRO state would be as simple
as comparing packet hashes and then do a memcmp over the headers. For
IPv4 maybe do a masked compare-- we still need to validate the header
checksum but that is easily checked without checksumming over the
whole header by just adding in the diff in the dynamic fields (i.e.
something like C_new == C_Old + (IP_ID_new - IP_ID_old)

Tom

>
> --
> -Ed
>
> [1] http://lxr.free-electrons.com/source/drivers/net/slip/slhc.c#L425


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread David Miller
From: Edward Cree 
Date: Tue, 23 Feb 2016 20:20:50 +

> So VJ implementations that can't handle it really are buggy, not
> just exhibiting a difference of opinion.

Buggy or not, they exist, and we have to cope with them.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Edward Cree
On 23/02/16 18:08, David Miller wrote:
> From: Edward Cree 
> Date: Tue, 23 Feb 2016 17:38:28 +
>
>> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation
>>  and reassembly."(§4.1)
>> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any
>>  value."(§4.1)
>> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
>>  atomic datagrams."(§4.1)
>> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).
>>
>> So it's OK to coalesce packets with different identifiers, as long as they
>> have DFset (and aren't fragmented already).  Also, the RFC takes pains to
>> point out that it "does not reserve any IPv4 ID values, including 0, as
>> distinguished" (§4.1), so one cannot rely on the ID always being zero.
> Just a reminder that a very long time ago we tried setting the IP ID
> field to zero for DF packets, and this broke SLHC because that expects
> a monotonically increasing IP ID field.
If I'm reading it right, Linux's SLHC implementation can cope [1] with
unchanging IP IDs (or IDs that do something other than increment), which it
encodes with the NEW_I bit.  RFC1144 supports this, remarking "note that it
may be zero or negative" (§3.2.3) of the ID delta.  So VJ implementations that
can't handle it really are buggy, not just exhibiting a difference of opinion.

So it seems like the problem case is where some sort of SLIP gateway lies
between a Linux system (say, a webserver sending with TSO) and a buggy client;
in that case, even if the client can't be fixed it seems like the gateway
could be (either to send NEW_I or just lie about the incoming IP IDs and claim
they're incrementing - the latter is entirely safe for a DF packet).

[ I really hope we can figure out a way not to change IP IDs, because I think
an inverted version of Tom's generic TSO could also work as a generic h/w GRO
accelerator.  In its simplest form, the hardware just remembers the previous
packet, then gives us the length of the common prefix.  If that ends four bytes
into a TCP header, then the packet is a candidate for GRO; otherwise not.  I
haven't worked out all the details yet, but it's clear that non-constant IP IDs
would cause problems.  (If it's only the innermost IP ID that's changing, we
can probably still cope, but the performance gain will be less and the
implementation could start to get ugly.) ]

--
-Ed

[1] http://lxr.free-electrons.com/source/drivers/net/slip/slhc.c#L425


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Tom Herbert
On Tue, Feb 23, 2016 at 10:26 AM, David Miller  wrote:
> From: Tom Herbert 
> Date: Tue, 23 Feb 2016 09:42:00 -0800
>
>> Why not just fix the stack to conform to RFC6864? As Edward pointed
>> out we lose the actual IP ID's in GRO anyway, so attempts to set them
>> in GSO may be wildly incorrect from the source point of view-- even in
>> that case were probably better off changing the IP identifier to zero
>> (okay since we're already breaking the E2E model anyway :-) ).
>
> Tom, you can't, you'll break TCP header compression schemes which
> expect a monotonically increasing IP ID value.
>
> We tried setting the IP ID to zero for frames with DF set, it broke
> things.

All the more reason to move to IPv6 ;-)


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread David Miller
From: Tom Herbert 
Date: Tue, 23 Feb 2016 09:42:00 -0800

> Why not just fix the stack to conform to RFC6864? As Edward pointed
> out we lose the actual IP ID's in GRO anyway, so attempts to set them
> in GSO may be wildly incorrect from the source point of view-- even in
> that case were probably better off changing the IP identifier to zero
> (okay since we're already breaking the E2E model anyway :-) ).

Tom, you can't, you'll break TCP header compression schemes which
expect a monotonically increasing IP ID value.

We tried setting the IP ID to zero for frames with DF set, it broke
things.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread David Miller
From: Jesse Gross 
Date: Tue, 23 Feb 2016 09:31:09 -0800

> Most OSs (including Linux with connected TCP sockets) use non-zero IP
> IDs so requiring this would effectively disable GRO.

+1

Any OS that wants to work with SLHC, as I mentioned, has to emit
monotonically increasing IP ID values in all packets, even those with
DF set.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Alexander Duyck
On Tue, Feb 23, 2016 at 9:42 AM, Tom Herbert  wrote:
> On Tue, Feb 23, 2016 at 9:31 AM, Jesse Gross  wrote:
>> On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert  wrote:
>>> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree  wrote:
 On 23/02/16 03:31, Jesse Gross wrote:
> The only issue that I see is that making TSO completely unaware of
> outer headers will likely cause performance regressions in some cases.
> Imagine if we have an incoming TCP stream with incrementing IP IDs
> that we aggregate through GRO and forward. Today's TSO would be able
> to recreate the stream by incrementing the ID as new segments are
> created. However, if the outgoing NIC is truly only dealing with the
> L4 header then it wouldn't be able to do this.
 Perhaps TSO should force setting the DF bit, so that the IP ID can be
 ignored.  After all, if your network is going to cause fragmentation and
 reassembly, your performance will probably be bad enough that TSO won't
 help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
 Arguably, as soon as we perform GRO on traffic to be forwarded, we've
 already violated the end-to-end principle (there are always imaginable
 situations in which a different packet stream comes out than went in),
 so it doesn't really matter if we go on to change the network layer
 parameters in this way - it's not really the same IP datagram any more
 so it's OK for its identification to change.
 And of course this problem doesn't present itself for IPv6 :)
>>>
>>> Right, GRO should probably not coalesce packets with non-zero IP
>>> identifiers due to the loss of information. Besides that, RFC6848 says
>>> the IP identifier should only be set for fragmentation anyway so there
>>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>>> support that.
>>
>> Most OSs (including Linux with connected TCP sockets) use non-zero IP
>> IDs so requiring this would effectively disable GRO.
>>
>> I think the practical way to go about this is to introduce a new GSO
>> type for L4-only offload. There are some existing types that we could
>> immediately convert and kill off with no impact (such as GRE) and some
>> new protocols that would come for free (such as MPLS) so it would be a
>> net win. Once the infrastructure is in, it will be easier to evaluate
>> what else can be simplified on a case by case basis. (i.e. Even
>> UDP_TUNNEL will have some potential adverse impact from this compared
>> to explicit support since we'd need to break off the last segment from
>> a TSO burst where the size isn't an even multiple of the MSS. I guess
>> the impact is probably small but it would be good to know.)
>
> Why not just fix the stack to conform to RFC6864? As Edward pointed
> out we lose the actual IP ID's in GRO anyway, so attempts to set them
> in GSO may be wildly incorrect from the source point of view-- even in
> that case were probably better off changing the IP identifier to zero
> (okay since we're already breaking the E2E model anyway :-) ).

The wording of RFC6864 seems to imply that we can ignore the IP ID
field in the case of DF being set and the MF and fragment offset bits
being cleared.  It states we can use an arbitrary value for the IP ID
on "atomic" frames so we can probably just leave it at whatever the
initial value is for the frame to be segmented, not need to force it
to 0.

The problem as I see it is that we will need to update GRO so that it
is willing to accept frames with an inner IP ID that is not
incrementing for atomic frames before we can really get into the GSO
side of the equation.  From what I can tell it looks like currently it
doesn't honor that and requires IP ID to increment in order to
coalesce frames.

I will look into trying to setup TSO for these devices like I did
NETIF_F_HW_CSUM for the Intel parts.  Basically we will leave the
outer IP header and the inner transport header to be handled by the
hardware, and then we can compute the length and checksum for the UDP
header and inner IP header.  That way we can deal with things like
VLAN tags that need to be inserted before the outer network header
while maintaining the IP ID for the outer IP header as well since most
devices seem to handle that correctly.

- Alex


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Tom Herbert
On Tue, Feb 23, 2016 at 9:38 AM, Edward Cree  wrote:
> On 23/02/16 17:20, Rick Jones wrote:
>> On 02/23/2016 08:47 AM, Tom Herbert wrote:
>>> Right, GRO should probably not coalesce packets with non-zero IP
>>> identifiers due to the loss of information. Besides that, RFC6848 says
>>> the IP identifier should only be set for fragmentation anyway so there
>>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>>> support that.
>>
>> You sure that is RFC 6848 "Specifying Civic Address Extensions in the 
>> Presence Information Data Format Location Object (PIDF-LO)" ?
> PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field".

Yes RFC6864.

>> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many 
>> "other" stacks might be setting a non-zero IP ID on fragments with DF set?
> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation
>  and reassembly."(§4.1)
> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any
>  value."(§4.1)
> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
>  atomic datagrams."(§4.1)
> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).
>
> So it's OK to coalesce packets with different identifiers, as long as they
> have DFset (and aren't fragmented already).  Also, the RFC takes pains to
> point out that it "does not reserve any IPv4 ID values, including 0, as
> distinguished" (§4.1), so one cannot rely on the ID always being zero.

Right, receive side is straightforward, just ignore IP IDs. Transmit
is more interesting. Operative code is in ip_select_ident_segs. The
comment as to why non-zero IDs are sent is:

  /* This is only to work around buggy Windows95/2000
   * VJ compression implementations.  If the ID field
   * does not change, they drop every other packet in
   * a TCP stream using header compression.
   */

> --
> -Ed


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread David Miller
From: Edward Cree 
Date: Tue, 23 Feb 2016 17:38:28 +

> On 23/02/16 17:20, Rick Jones wrote:
>> On 02/23/2016 08:47 AM, Tom Herbert wrote:
>>> Right, GRO should probably not coalesce packets with non-zero IP
>>> identifiers due to the loss of information. Besides that, RFC6848 says
>>> the IP identifier should only be set for fragmentation anyway so there
>>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>>> support that.
>>
>> You sure that is RFC 6848 "Specifying Civic Address Extensions in the 
>> Presence Information Data Format Location Object (PIDF-LO)" ?
> PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field".
>> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many 
>> "other" stacks might be setting a non-zero IP ID on fragments with DF set?
> "The IPv4 ID field MUST NOT be used for purposes other than fragmentation
>  and reassembly."(§4.1)
> "Originating sources MAY set the IPv4 ID field of atomic datagrams to any
>  value."(§4.1)
> "All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
>  atomic datagrams."(§4.1)
> Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).
> 
> So it's OK to coalesce packets with different identifiers, as long as they
> have DFset (and aren't fragmented already).  Also, the RFC takes pains to
> point out that it "does not reserve any IPv4 ID values, including 0, as
> distinguished" (§4.1), so one cannot rely on the ID always being zero.

Just a reminder that a very long time ago we tried setting the IP ID
field to zero for DF packets, and this broke SLHC because that expects
a monotonically increasing IP ID field.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Edward Cree
On 23/02/16 17:20, Rick Jones wrote:
> On 02/23/2016 08:47 AM, Tom Herbert wrote:
>> Right, GRO should probably not coalesce packets with non-zero IP
>> identifiers due to the loss of information. Besides that, RFC6848 says
>> the IP identifier should only be set for fragmentation anyway so there
>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>> support that.
>
> You sure that is RFC 6848 "Specifying Civic Address Extensions in the 
> Presence Information Data Format Location Object (PIDF-LO)" ?
PossiblyRFC 6864 "Updated Specification of the IPv4 ID Field".
> In whichever RFC that may be, is it a SHOULD or a MUST, and just how many 
> "other" stacks might be setting a non-zero IP ID on fragments with DF set?
"The IPv4 ID field MUST NOT be used for purposes other than fragmentation
 and reassembly."(§4.1)
"Originating sources MAY set the IPv4 ID field of atomic datagrams to any
 value."(§4.1)
"All devices that examine IPv4 headers MUST ignore the IPv4 ID field of
 atomic datagrams."(§4.1)
Atomic datagrams are defined by "(DF==1)&&(MF==0)&&(frag_offset==0)" (§4).

So it's OK to coalesce packets with different identifiers, as long as they
have DFset (and aren't fragmented already).  Also, the RFC takes pains to
point out that it "does not reserve any IPv4 ID values, including 0, as
distinguished" (§4.1), so one cannot rely on the ID always being zero.
--
-Ed


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Tom Herbert
On Tue, Feb 23, 2016 at 9:31 AM, Jesse Gross  wrote:
> On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert  wrote:
>> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree  wrote:
>>> On 23/02/16 03:31, Jesse Gross wrote:
 The only issue that I see is that making TSO completely unaware of
 outer headers will likely cause performance regressions in some cases.
 Imagine if we have an incoming TCP stream with incrementing IP IDs
 that we aggregate through GRO and forward. Today's TSO would be able
 to recreate the stream by incrementing the ID as new segments are
 created. However, if the outgoing NIC is truly only dealing with the
 L4 header then it wouldn't be able to do this.
>>> Perhaps TSO should force setting the DF bit, so that the IP ID can be
>>> ignored.  After all, if your network is going to cause fragmentation and
>>> reassembly, your performance will probably be bad enough that TSO won't
>>> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
>>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
>>> already violated the end-to-end principle (there are always imaginable
>>> situations in which a different packet stream comes out than went in),
>>> so it doesn't really matter if we go on to change the network layer
>>> parameters in this way - it's not really the same IP datagram any more
>>> so it's OK for its identification to change.
>>> And of course this problem doesn't present itself for IPv6 :)
>>
>> Right, GRO should probably not coalesce packets with non-zero IP
>> identifiers due to the loss of information. Besides that, RFC6848 says
>> the IP identifier should only be set for fragmentation anyway so there
>> shouldn't be any issue and really no need for HW TSO (or LRO) to
>> support that.
>
> Most OSs (including Linux with connected TCP sockets) use non-zero IP
> IDs so requiring this would effectively disable GRO.
>
> I think the practical way to go about this is to introduce a new GSO
> type for L4-only offload. There are some existing types that we could
> immediately convert and kill off with no impact (such as GRE) and some
> new protocols that would come for free (such as MPLS) so it would be a
> net win. Once the infrastructure is in, it will be easier to evaluate
> what else can be simplified on a case by case basis. (i.e. Even
> UDP_TUNNEL will have some potential adverse impact from this compared
> to explicit support since we'd need to break off the last segment from
> a TSO burst where the size isn't an even multiple of the MSS. I guess
> the impact is probably small but it would be good to know.)

Why not just fix the stack to conform to RFC6864? As Edward pointed
out we lose the actual IP ID's in GRO anyway, so attempts to set them
in GSO may be wildly incorrect from the source point of view-- even in
that case were probably better off changing the IP identifier to zero
(okay since we're already breaking the E2E model anyway :-) ).


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Jesse Gross
On Tue, Feb 23, 2016 at 8:47 AM, Tom Herbert  wrote:
> On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree  wrote:
>> On 23/02/16 03:31, Jesse Gross wrote:
>>> The only issue that I see is that making TSO completely unaware of
>>> outer headers will likely cause performance regressions in some cases.
>>> Imagine if we have an incoming TCP stream with incrementing IP IDs
>>> that we aggregate through GRO and forward. Today's TSO would be able
>>> to recreate the stream by incrementing the ID as new segments are
>>> created. However, if the outgoing NIC is truly only dealing with the
>>> L4 header then it wouldn't be able to do this.
>> Perhaps TSO should force setting the DF bit, so that the IP ID can be
>> ignored.  After all, if your network is going to cause fragmentation and
>> reassembly, your performance will probably be bad enough that TSO won't
>> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
>> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
>> already violated the end-to-end principle (there are always imaginable
>> situations in which a different packet stream comes out than went in),
>> so it doesn't really matter if we go on to change the network layer
>> parameters in this way - it's not really the same IP datagram any more
>> so it's OK for its identification to change.
>> And of course this problem doesn't present itself for IPv6 :)
>
> Right, GRO should probably not coalesce packets with non-zero IP
> identifiers due to the loss of information. Besides that, RFC6848 says
> the IP identifier should only be set for fragmentation anyway so there
> shouldn't be any issue and really no need for HW TSO (or LRO) to
> support that.

Most OSs (including Linux with connected TCP sockets) use non-zero IP
IDs so requiring this would effectively disable GRO.

I think the practical way to go about this is to introduce a new GSO
type for L4-only offload. There are some existing types that we could
immediately convert and kill off with no impact (such as GRE) and some
new protocols that would come for free (such as MPLS) so it would be a
net win. Once the infrastructure is in, it will be easier to evaluate
what else can be simplified on a case by case basis. (i.e. Even
UDP_TUNNEL will have some potential adverse impact from this compared
to explicit support since we'd need to break off the last segment from
a TSO burst where the size isn't an even multiple of the MSS. I guess
the impact is probably small but it would be good to know.)


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Rick Jones

On 02/23/2016 08:47 AM, Tom Herbert wrote:

Right, GRO should probably not coalesce packets with non-zero IP
identifiers due to the loss of information. Besides that, RFC6848 says
the IP identifier should only be set for fragmentation anyway so there
shouldn't be any issue and really no need for HW TSO (or LRO) to
support that.


You sure that is RFC 6848 "Specifying Civic Address Extensions in the 
Presence Information Data Format Location Object (PIDF-LO)" ?


In whichever RFC that may be, is it a SHOULD or a MUST, and just how 
many "other" stacks might be setting a non-zero IP ID on fragments with 
DF set?


rick jones


We need to do increment IP identifier in UFO, but I only see one
device (neterion) that advertises NETIF_F_UFO-- honestly, removing
that feature might be another good simplification!

Tom


--
-Ed




Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Tom Herbert
On Tue, Feb 23, 2016 at 7:18 AM, Edward Cree  wrote:
> On 23/02/16 03:31, Jesse Gross wrote:
>> The only issue that I see is that making TSO completely unaware of
>> outer headers will likely cause performance regressions in some cases.
>> Imagine if we have an incoming TCP stream with incrementing IP IDs
>> that we aggregate through GRO and forward. Today's TSO would be able
>> to recreate the stream by incrementing the ID as new segments are
>> created. However, if the outgoing NIC is truly only dealing with the
>> L4 header then it wouldn't be able to do this.
> Perhaps TSO should force setting the DF bit, so that the IP ID can be
> ignored.  After all, if your network is going to cause fragmentation and
> reassembly, your performance will probably be bad enough that TSO won't
> help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
> Arguably, as soon as we perform GRO on traffic to be forwarded, we've
> already violated the end-to-end principle (there are always imaginable
> situations in which a different packet stream comes out than went in),
> so it doesn't really matter if we go on to change the network layer
> parameters in this way - it's not really the same IP datagram any more
> so it's OK for its identification to change.
> And of course this problem doesn't present itself for IPv6 :)

Right, GRO should probably not coalesce packets with non-zero IP
identifiers due to the loss of information. Besides that, RFC6848 says
the IP identifier should only be set for fragmentation anyway so there
shouldn't be any issue and really no need for HW TSO (or LRO) to
support that.

We need to do increment IP identifier in UFO, but I only see one
device (neterion) that advertises NETIF_F_UFO-- honestly, removing
that feature might be another good simplification!

Tom

> --
> -Ed


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Edward Cree
On 23/02/16 03:31, Jesse Gross wrote:
> The only issue that I see is that making TSO completely unaware of
> outer headers will likely cause performance regressions in some cases.
> Imagine if we have an incoming TCP stream with incrementing IP IDs
> that we aggregate through GRO and forward. Today's TSO would be able
> to recreate the stream by incrementing the ID as new segments are
> created. However, if the outgoing NIC is truly only dealing with the
> L4 header then it wouldn't be able to do this.
Perhaps TSO should force setting the DF bit, so that the IP ID can be
ignored.  After all, if your network is going to cause fragmentation and
reassembly, your performance will probably be bad enough that TSO won't
help you much.  (And TCP usually wants DF anyway so it can do PMTUD.)
Arguably, as soon as we perform GRO on traffic to be forwarded, we've
already violated the end-to-end principle (there are always imaginable
situations in which a different packet stream comes out than went in),
so it doesn't really matter if we go on to change the network layer
parameters in this way - it's not really the same IP datagram any more
so it's OK for its identification to change.
And of course this problem doesn't present itself for IPv6 :)
--
-Ed


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-22 Thread Jesse Gross
On Sat, Feb 20, 2016 at 11:51 AM, Tom Herbert  wrote:
> On Fri, Feb 19, 2016 at 6:18 PM, Jesse Gross  wrote:
>> On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert  wrote:
>>> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross  wrote:
 On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  
>> wrote:
>>> This patch series makes it so that we enable the outer Tx checksum for 
>>> IPv4
>>> tunnels by default.  This makes the behavior consistent with how we were
>>> handling this for IPv6.  In addition I have updated the internal flags 
>>> for
>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>> IPv6.
>>>
>>> For most network devices this should be a net gain in terms of 
>>> performance
>>> as having the outer header checksum present allows for devices to report
>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in 
>>> order
>>> to determine if the inner header checksum is valid.
>>>
>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>> this.  I located two PFs connected back to back in two different name
>>> spaces and then setup a pair of tunnels on each, one with checksum 
>>> enabled
>>> and one without.
>>>
>>> Recv   SendSend  Utilization
>>> Socket Socket  Message  Elapsed  Send
>>> Size   SizeSize Time Throughput  local
>>> bytes  bytes   bytessecs.10^6bits/s  % S
>>>
>>> noudpcsum:
>>>  87380  16384  1638430.00  8898.67   12.80
>>> udpcsum:
>>>  87380  16384  1638430.00  9088.47   5.69
>>>
>>> The one spot where this may cause a performance regression is if the
>>> environment contains devices that can parse the inner headers and a 
>>> device
>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>> the case of such a device we have to fall back to using GSO to segment 
>>> the
>>> tunnel instead of TSO and as a result we may take a performance hit as 
>>> seen
>>> below with i40e.
>>
>> Do you have any numbers from 40G links? Obviously, at 10G the links
>> are basically saturated and while I can see a difference in the
>> utilization rate, I suspect that the change will be much more apparent
>> at higher speeds.
>
> Unfortunately I don't have any true 40G links to test with.  The
> closest I can get is to run PF to VF on an i40e.  Running that I have
> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
> difference being related to the fact that we are having to
> allocate/free more skbs and make more trips through the
> i40e_lan_xmit_frame function resulting in more descriptors.

 OK, I guess that is more or less in line with what I would expect off
 the top my head. There is a reasonably significant drop in the worst
 case.

>> I'm concerned about the drop in performance for devices that currently
>> support offloads (almost none of which expose
>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>> care most about tunnel performance are the ones that already have
>> these NICs and will be the most impacted by the drop.
>
> The problem is being able to transmit fast is kind of pointless if the
> receiving end cannot handle it.  We hadn't gotten around to really
> getting the Rx checksum bits working until the 3.18 kernel which I
> don't suspect many people are running so at this point messing with
> the TSO bits isn't really making much of a difference.  Then on top of
> that most devices have certain limitations on how many ports they can
> handle and such.  I know the i40e is supposed to support something
> like 10 port numbers, but the fm10k and ixgbe are limited to one port
> as I recall.  So this whole thing is already really brittle as it is.
> My goal with this change is to make the behavior more consistent
> across the board.

 That's true to some degree but there are certainly plenty of cases
 where TSO makes a difference - lower CPU usage, transmitting to
 multiple receivers, people will upgrade their kernels, etc. It's
 clearly good to make things more consistent but hopefully not by
 reducing existing performance. :)

>> My hope is that we can continue to use TSO on devices that only
>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>> length field may vary across segments. However, in practice this is

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-21 Thread David Miller
From: Alexander Duyck 
Date: Fri, 19 Feb 2016 11:26:17 -0800

> This patch series makes it so that we enable the outer Tx checksum
> for IPv4 tunnels by default.

Series applied, thanks Alex.


Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-20 Thread Tom Herbert
On Fri, Feb 19, 2016 at 6:18 PM, Jesse Gross  wrote:
> On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert  wrote:
>> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross  wrote:
>>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
 On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  
> wrote:
>> This patch series makes it so that we enable the outer Tx checksum for 
>> IPv4
>> tunnels by default.  This makes the behavior consistent with how we were
>> handling this for IPv6.  In addition I have updated the internal flags 
>> for
>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>> IPv6.
>>
>> For most network devices this should be a net gain in terms of 
>> performance
>> as having the outer header checksum present allows for devices to report
>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in 
>> order
>> to determine if the inner header checksum is valid.
>>
>> Below is some data I collected with ixgbe with an X540 that demonstrates
>> this.  I located two PFs connected back to back in two different name
>> spaces and then setup a pair of tunnels on each, one with checksum 
>> enabled
>> and one without.
>>
>> Recv   SendSend  Utilization
>> Socket Socket  Message  Elapsed  Send
>> Size   SizeSize Time Throughput  local
>> bytes  bytes   bytessecs.10^6bits/s  % S
>>
>> noudpcsum:
>>  87380  16384  1638430.00  8898.67   12.80
>> udpcsum:
>>  87380  16384  1638430.00  9088.47   5.69
>>
>> The one spot where this may cause a performance regression is if the
>> environment contains devices that can parse the inner headers and a 
>> device
>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>> the case of such a device we have to fall back to using GSO to segment 
>> the
>> tunnel instead of TSO and as a result we may take a performance hit as 
>> seen
>> below with i40e.
>
> Do you have any numbers from 40G links? Obviously, at 10G the links
> are basically saturated and while I can see a difference in the
> utilization rate, I suspect that the change will be much more apparent
> at higher speeds.

 Unfortunately I don't have any true 40G links to test with.  The
 closest I can get is to run PF to VF on an i40e.  Running that I have
 seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
 difference being related to the fact that we are having to
 allocate/free more skbs and make more trips through the
 i40e_lan_xmit_frame function resulting in more descriptors.
>>>
>>> OK, I guess that is more or less in line with what I would expect off
>>> the top my head. There is a reasonably significant drop in the worst
>>> case.
>>>
> I'm concerned about the drop in performance for devices that currently
> support offloads (almost none of which expose
> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
> care most about tunnel performance are the ones that already have
> these NICs and will be the most impacted by the drop.

 The problem is being able to transmit fast is kind of pointless if the
 receiving end cannot handle it.  We hadn't gotten around to really
 getting the Rx checksum bits working until the 3.18 kernel which I
 don't suspect many people are running so at this point messing with
 the TSO bits isn't really making much of a difference.  Then on top of
 that most devices have certain limitations on how many ports they can
 handle and such.  I know the i40e is supposed to support something
 like 10 port numbers, but the fm10k and ixgbe are limited to one port
 as I recall.  So this whole thing is already really brittle as it is.
 My goal with this change is to make the behavior more consistent
 across the board.
>>>
>>> That's true to some degree but there are certainly plenty of cases
>>> where TSO makes a difference - lower CPU usage, transmitting to
>>> multiple receivers, people will upgrade their kernels, etc. It's
>>> clearly good to make things more consistent but hopefully not by
>>> reducing existing performance. :)
>>>
> My hope is that we can continue to use TSO on devices that only
> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
> length field may vary across segments. However, in practice this is
> the only on the final segment and only in cases where the total length
> is not a multiple of the MSS. If we could detect cases where those
> conditions are 

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 4:14 PM, Tom Herbert  wrote:
> On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross  wrote:
>> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
>>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
 On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  
 wrote:
> This patch series makes it so that we enable the outer Tx checksum for 
> IPv4
> tunnels by default.  This makes the behavior consistent with how we were
> handling this for IPv6.  In addition I have updated the internal flags for
> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
> match up will with the ZERO_CSUM6_TX flag which was already in use for
> IPv6.
>
> For most network devices this should be a net gain in terms of performance
> as having the outer header checksum present allows for devices to report
> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in 
> order
> to determine if the inner header checksum is valid.
>
> Below is some data I collected with ixgbe with an X540 that demonstrates
> this.  I located two PFs connected back to back in two different name
> spaces and then setup a pair of tunnels on each, one with checksum enabled
> and one without.
>
> Recv   SendSend  Utilization
> Socket Socket  Message  Elapsed  Send
> Size   SizeSize Time Throughput  local
> bytes  bytes   bytessecs.10^6bits/s  % S
>
> noudpcsum:
>  87380  16384  1638430.00  8898.67   12.80
> udpcsum:
>  87380  16384  1638430.00  9088.47   5.69
>
> The one spot where this may cause a performance regression is if the
> environment contains devices that can parse the inner headers and a device
> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
> the case of such a device we have to fall back to using GSO to segment the
> tunnel instead of TSO and as a result we may take a performance hit as 
> seen
> below with i40e.

 Do you have any numbers from 40G links? Obviously, at 10G the links
 are basically saturated and while I can see a difference in the
 utilization rate, I suspect that the change will be much more apparent
 at higher speeds.
>>>
>>> Unfortunately I don't have any true 40G links to test with.  The
>>> closest I can get is to run PF to VF on an i40e.  Running that I have
>>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>>> difference being related to the fact that we are having to
>>> allocate/free more skbs and make more trips through the
>>> i40e_lan_xmit_frame function resulting in more descriptors.
>>
>> OK, I guess that is more or less in line with what I would expect off
>> the top my head. There is a reasonably significant drop in the worst
>> case.
>>
 I'm concerned about the drop in performance for devices that currently
 support offloads (almost none of which expose
 NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
 care most about tunnel performance are the ones that already have
 these NICs and will be the most impacted by the drop.
>>>
>>> The problem is being able to transmit fast is kind of pointless if the
>>> receiving end cannot handle it.  We hadn't gotten around to really
>>> getting the Rx checksum bits working until the 3.18 kernel which I
>>> don't suspect many people are running so at this point messing with
>>> the TSO bits isn't really making much of a difference.  Then on top of
>>> that most devices have certain limitations on how many ports they can
>>> handle and such.  I know the i40e is supposed to support something
>>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>>> as I recall.  So this whole thing is already really brittle as it is.
>>> My goal with this change is to make the behavior more consistent
>>> across the board.
>>
>> That's true to some degree but there are certainly plenty of cases
>> where TSO makes a difference - lower CPU usage, transmitting to
>> multiple receivers, people will upgrade their kernels, etc. It's
>> clearly good to make things more consistent but hopefully not by
>> reducing existing performance. :)
>>
 My hope is that we can continue to use TSO on devices that only
 support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
 length field may vary across segments. However, in practice this is
 the only on the final segment and only in cases where the total length
 is not a multiple of the MSS. If we could detect cases where those
 conditions are met, we could continue to use TSO with the UDP checksum
 field pre-populated. A possible step even further would be to break
 off the final segment into a separate packet to make things conform 

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Tom Herbert
On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross  wrote:
> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  
>>> wrote:
 This patch series makes it so that we enable the outer Tx checksum for IPv4
 tunnels by default.  This makes the behavior consistent with how we were
 handling this for IPv6.  In addition I have updated the internal flags for
 these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
 match up will with the ZERO_CSUM6_TX flag which was already in use for
 IPv6.

 For most network devices this should be a net gain in terms of performance
 as having the outer header checksum present allows for devices to report
 CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in 
 order
 to determine if the inner header checksum is valid.

 Below is some data I collected with ixgbe with an X540 that demonstrates
 this.  I located two PFs connected back to back in two different name
 spaces and then setup a pair of tunnels on each, one with checksum enabled
 and one without.

 Recv   SendSend  Utilization
 Socket Socket  Message  Elapsed  Send
 Size   SizeSize Time Throughput  local
 bytes  bytes   bytessecs.10^6bits/s  % S

 noudpcsum:
  87380  16384  1638430.00  8898.67   12.80
 udpcsum:
  87380  16384  1638430.00  9088.47   5.69

 The one spot where this may cause a performance regression is if the
 environment contains devices that can parse the inner headers and a device
 supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
 the case of such a device we have to fall back to using GSO to segment the
 tunnel instead of TSO and as a result we may take a performance hit as seen
 below with i40e.
>>>
>>> Do you have any numbers from 40G links? Obviously, at 10G the links
>>> are basically saturated and while I can see a difference in the
>>> utilization rate, I suspect that the change will be much more apparent
>>> at higher speeds.
>>
>> Unfortunately I don't have any true 40G links to test with.  The
>> closest I can get is to run PF to VF on an i40e.  Running that I have
>> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
>> difference being related to the fact that we are having to
>> allocate/free more skbs and make more trips through the
>> i40e_lan_xmit_frame function resulting in more descriptors.
>
> OK, I guess that is more or less in line with what I would expect off
> the top my head. There is a reasonably significant drop in the worst
> case.
>
>>> I'm concerned about the drop in performance for devices that currently
>>> support offloads (almost none of which expose
>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>>> care most about tunnel performance are the ones that already have
>>> these NICs and will be the most impacted by the drop.
>>
>> The problem is being able to transmit fast is kind of pointless if the
>> receiving end cannot handle it.  We hadn't gotten around to really
>> getting the Rx checksum bits working until the 3.18 kernel which I
>> don't suspect many people are running so at this point messing with
>> the TSO bits isn't really making much of a difference.  Then on top of
>> that most devices have certain limitations on how many ports they can
>> handle and such.  I know the i40e is supposed to support something
>> like 10 port numbers, but the fm10k and ixgbe are limited to one port
>> as I recall.  So this whole thing is already really brittle as it is.
>> My goal with this change is to make the behavior more consistent
>> across the board.
>
> That's true to some degree but there are certainly plenty of cases
> where TSO makes a difference - lower CPU usage, transmitting to
> multiple receivers, people will upgrade their kernels, etc. It's
> clearly good to make things more consistent but hopefully not by
> reducing existing performance. :)
>
>>> My hope is that we can continue to use TSO on devices that only
>>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>>> length field may vary across segments. However, in practice this is
>>> the only on the final segment and only in cases where the total length
>>> is not a multiple of the MSS. If we could detect cases where those
>>> conditions are met, we could continue to use TSO with the UDP checksum
>>> field pre-populated. A possible step even further would be to break
>>> off the final segment into a separate packet to make things conform if
>>> necessary. This would avoid a performance regression and I think make
>>> this more palatable to a lot of people.
>>
>> I think Tom and I had discussed this possibility a 

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck  wrote:
> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  
>> wrote:
>>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>>> tunnels by default.  This makes the behavior consistent with how we were
>>> handling this for IPv6.  In addition I have updated the internal flags for
>>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>>> IPv6.
>>>
>>> For most network devices this should be a net gain in terms of performance
>>> as having the outer header checksum present allows for devices to report
>>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>>> to determine if the inner header checksum is valid.
>>>
>>> Below is some data I collected with ixgbe with an X540 that demonstrates
>>> this.  I located two PFs connected back to back in two different name
>>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>>> and one without.
>>>
>>> Recv   SendSend  Utilization
>>> Socket Socket  Message  Elapsed  Send
>>> Size   SizeSize Time Throughput  local
>>> bytes  bytes   bytessecs.10^6bits/s  % S
>>>
>>> noudpcsum:
>>>  87380  16384  1638430.00  8898.67   12.80
>>> udpcsum:
>>>  87380  16384  1638430.00  9088.47   5.69
>>>
>>> The one spot where this may cause a performance regression is if the
>>> environment contains devices that can parse the inner headers and a device
>>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>>> the case of such a device we have to fall back to using GSO to segment the
>>> tunnel instead of TSO and as a result we may take a performance hit as seen
>>> below with i40e.
>>
>> Do you have any numbers from 40G links? Obviously, at 10G the links
>> are basically saturated and while I can see a difference in the
>> utilization rate, I suspect that the change will be much more apparent
>> at higher speeds.
>
> Unfortunately I don't have any true 40G links to test with.  The
> closest I can get is to run PF to VF on an i40e.  Running that I have
> seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
> difference being related to the fact that we are having to
> allocate/free more skbs and make more trips through the
> i40e_lan_xmit_frame function resulting in more descriptors.

OK, I guess that is more or less in line with what I would expect off
the top my head. There is a reasonably significant drop in the worst
case.

>> I'm concerned about the drop in performance for devices that currently
>> support offloads (almost none of which expose
>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
>> care most about tunnel performance are the ones that already have
>> these NICs and will be the most impacted by the drop.
>
> The problem is being able to transmit fast is kind of pointless if the
> receiving end cannot handle it.  We hadn't gotten around to really
> getting the Rx checksum bits working until the 3.18 kernel which I
> don't suspect many people are running so at this point messing with
> the TSO bits isn't really making much of a difference.  Then on top of
> that most devices have certain limitations on how many ports they can
> handle and such.  I know the i40e is supposed to support something
> like 10 port numbers, but the fm10k and ixgbe are limited to one port
> as I recall.  So this whole thing is already really brittle as it is.
> My goal with this change is to make the behavior more consistent
> across the board.

That's true to some degree but there are certainly plenty of cases
where TSO makes a difference - lower CPU usage, transmitting to
multiple receivers, people will upgrade their kernels, etc. It's
clearly good to make things more consistent but hopefully not by
reducing existing performance. :)

>> My hope is that we can continue to use TSO on devices that only
>> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
>> length field may vary across segments. However, in practice this is
>> the only on the final segment and only in cases where the total length
>> is not a multiple of the MSS. If we could detect cases where those
>> conditions are met, we could continue to use TSO with the UDP checksum
>> field pre-populated. A possible step even further would be to break
>> off the final segment into a separate packet to make things conform if
>> necessary. This would avoid a performance regression and I think make
>> this more palatable to a lot of people.
>
> I think Tom and I had discussed this possibility a bit at netconf.
> The GSO logic is something I planned on looking at over the next
> several weeks as I suspect there is probably room for improvement
> there.

That sounds 

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Alex Duyck
On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross  wrote:
> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
>> This patch series makes it so that we enable the outer Tx checksum for IPv4
>> tunnels by default.  This makes the behavior consistent with how we were
>> handling this for IPv6.  In addition I have updated the internal flags for
>> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
>> match up will with the ZERO_CSUM6_TX flag which was already in use for
>> IPv6.
>>
>> For most network devices this should be a net gain in terms of performance
>> as having the outer header checksum present allows for devices to report
>> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
>> to determine if the inner header checksum is valid.
>>
>> Below is some data I collected with ixgbe with an X540 that demonstrates
>> this.  I located two PFs connected back to back in two different name
>> spaces and then setup a pair of tunnels on each, one with checksum enabled
>> and one without.
>>
>> Recv   SendSend  Utilization
>> Socket Socket  Message  Elapsed  Send
>> Size   SizeSize Time Throughput  local
>> bytes  bytes   bytessecs.10^6bits/s  % S
>>
>> noudpcsum:
>>  87380  16384  1638430.00  8898.67   12.80
>> udpcsum:
>>  87380  16384  1638430.00  9088.47   5.69
>>
>> The one spot where this may cause a performance regression is if the
>> environment contains devices that can parse the inner headers and a device
>> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
>> the case of such a device we have to fall back to using GSO to segment the
>> tunnel instead of TSO and as a result we may take a performance hit as seen
>> below with i40e.
>
> Do you have any numbers from 40G links? Obviously, at 10G the links
> are basically saturated and while I can see a difference in the
> utilization rate, I suspect that the change will be much more apparent
> at higher speeds.

Unfortunately I don't have any true 40G links to test with.  The
closest I can get is to run PF to VF on an i40e.  Running that I have
seen the numbers go from about 20Gb/s to 15Gb/s with almost all the
difference being related to the fact that we are having to
allocate/free more skbs and make more trips through the
i40e_lan_xmit_frame function resulting in more descriptors.

> I'm concerned about the drop in performance for devices that currently
> support offloads (almost none of which expose
> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
> care most about tunnel performance are the ones that already have
> these NICs and will be the most impacted by the drop.

The problem is being able to transmit fast is kind of pointless if the
receiving end cannot handle it.  We hadn't gotten around to really
getting the Rx checksum bits working until the 3.18 kernel which I
don't suspect many people are running so at this point messing with
the TSO bits isn't really making much of a difference.  Then on top of
that most devices have certain limitations on how many ports they can
handle and such.  I know the i40e is supposed to support something
like 10 port numbers, but the fm10k and ixgbe are limited to one port
as I recall.  So this whole thing is already really brittle as it is.
My goal with this change is to make the behavior more consistent
across the board.

> My hope is that we can continue to use TSO on devices that only
> support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
> length field may vary across segments. However, in practice this is
> the only on the final segment and only in cases where the total length
> is not a multiple of the MSS. If we could detect cases where those
> conditions are met, we could continue to use TSO with the UDP checksum
> field pre-populated. A possible step even further would be to break
> off the final segment into a separate packet to make things conform if
> necessary. This would avoid a performance regression and I think make
> this more palatable to a lot of people.

I think Tom and I had discussed this possibility a bit at netconf.
The GSO logic is something I planned on looking at over the next
several weeks as I suspect there is probably room for improvement
there.

>> I also haven't investigated the effect this will have on OVS.  However I
>> suspect the impact should be minimal as the worst case scenario should be
>> that Tx checksumming will become enabled by default which should be
>> consistent with the existing behavior for IPv6.
>
> I don't think that it should cause any problems.

Good to hear.

Do you know if OVS has some way to control the VXLAN configuration so
that it could disable Tx checksums?  If so that would probably be a
good way to address the 40G issues assuming someone is running an
environment hat had nothing but NICs that can support the TSO and Rx
checksum on inner 

Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Jesse Gross
On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck  wrote:
> This patch series makes it so that we enable the outer Tx checksum for IPv4
> tunnels by default.  This makes the behavior consistent with how we were
> handling this for IPv6.  In addition I have updated the internal flags for
> these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
> match up will with the ZERO_CSUM6_TX flag which was already in use for
> IPv6.
>
> For most network devices this should be a net gain in terms of performance
> as having the outer header checksum present allows for devices to report
> CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
> to determine if the inner header checksum is valid.
>
> Below is some data I collected with ixgbe with an X540 that demonstrates
> this.  I located two PFs connected back to back in two different name
> spaces and then setup a pair of tunnels on each, one with checksum enabled
> and one without.
>
> Recv   SendSend  Utilization
> Socket Socket  Message  Elapsed  Send
> Size   SizeSize Time Throughput  local
> bytes  bytes   bytessecs.10^6bits/s  % S
>
> noudpcsum:
>  87380  16384  1638430.00  8898.67   12.80
> udpcsum:
>  87380  16384  1638430.00  9088.47   5.69
>
> The one spot where this may cause a performance regression is if the
> environment contains devices that can parse the inner headers and a device
> supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
> the case of such a device we have to fall back to using GSO to segment the
> tunnel instead of TSO and as a result we may take a performance hit as seen
> below with i40e.

Do you have any numbers from 40G links? Obviously, at 10G the links
are basically saturated and while I can see a difference in the
utilization rate, I suspect that the change will be much more apparent
at higher speeds.

I'm concerned about the drop in performance for devices that currently
support offloads (almost none of which expose
NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature). Presumably the people that
care most about tunnel performance are the ones that already have
these NICs and will be the most impacted by the drop.

My hope is that we can continue to use TSO on devices that only
support NETIF_F_GSO_UDP_TUNNEL. The main problem is that the UDP
length field may vary across segments. However, in practice this is
the only on the final segment and only in cases where the total length
is not a multiple of the MSS. If we could detect cases where those
conditions are met, we could continue to use TSO with the UDP checksum
field pre-populated. A possible step even further would be to break
off the final segment into a separate packet to make things conform if
necessary. This would avoid a performance regression and I think make
this more palatable to a lot of people.

> I also haven't investigated the effect this will have on OVS.  However I
> suspect the impact should be minimal as the worst case scenario should be
> that Tx checksumming will become enabled by default which should be
> consistent with the existing behavior for IPv6.

I don't think that it should cause any problems.


[net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-19 Thread Alexander Duyck
This patch series makes it so that we enable the outer Tx checksum for IPv4
tunnels by default.  This makes the behavior consistent with how we were
handling this for IPv6.  In addition I have updated the internal flags for
these tunnels so that we use a ZERO_CSUM_TX flag for IPv4 which should
match up will with the ZERO_CSUM6_TX flag which was already in use for
IPv6.

For most network devices this should be a net gain in terms of performance
as having the outer header checksum present allows for devices to report
CHECKSUM_UNNECESSARY which we can then convert to CHECKSUM_COMPLETE in order
to determine if the inner header checksum is valid.

Below is some data I collected with ixgbe with an X540 that demonstrates
this.  I located two PFs connected back to back in two different name
spaces and then setup a pair of tunnels on each, one with checksum enabled
and one without.

Recv   SendSend  Utilization
Socket Socket  Message  Elapsed  Send
Size   SizeSize Time Throughput  local
bytes  bytes   bytessecs.10^6bits/s  % S

noudpcsum:
 87380  16384  1638430.00  8898.67   12.80
udpcsum:
 87380  16384  1638430.00  9088.47   5.69

The one spot where this may cause a performance regression is if the
environment contains devices that can parse the inner headers and a device
supports NETIF_F_GSO_UDP_TUNNEL but not NETIF_F_GSO_UDP_TUNNEL_CSUM.  In
the case of such a device we have to fall back to using GSO to segment the
tunnel instead of TSO and as a result we may take a performance hit as seen
below with i40e.

Recv   SendSend  Utilization
Socket Socket  Message  Elapsed  Send
Size   SizeSize Time Throughput  local
bytes  bytes   bytessecs.10^6bits/s  % S

noudpcsum:
 87380  16384  1638430.00  9085.21   3.32
udpcsum:
 87380  16384  1638430.00  9089.23   5.54

In addition it will be necessary to update iproute2 so that we don't
provide the checksum attribute unless specified.  This way on older kernels
which don't have local checksum offload we will default to disabling the
outer checksum, and on newer kernels that have LCO we can default to
enabling it.

I also haven't investigated the effect this will have on OVS.  However I
suspect the impact should be minimal as the worst case scenario should be
that Tx checksumming will become enabled by default which should be
consistent with the existing behavior for IPv6.

---

Alexander Duyck (2):
  GENEVE: Support outer IPv4 Tx checksums by default
  VXLAN: Support outer IPv4 Tx checksums by default


 drivers/net/geneve.c |   16 
 drivers/net/vxlan.c  |   19 +--
 include/net/vxlan.h  |2 +-
 3 files changed, 18 insertions(+), 19 deletions(-)

--