Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 21.06.2016 11:42, Tom Herbert wrote:
>> There is also some argument to be had for theory versus application.
>> Arguably it is the customers that are leading to some of the dirty
>> hacks as I think vendors are building NICs based on customer use cases
>> versus following any specifications. In most data centers the tunnel
>> underlays will be deployed throughout the network and UDP will likely
>> be blocked for anything that isn't being used explicitly for
>> tunneling. As such we seem to be seeing a lot of NICs that are only
>> supporting one port for things like this instead of designing them to
>> handle whatever we can throw at them.
>
> Actually, I don't believe that's true. It is not typical to deploy
> firewalls within a data center fabric, nor do we restrict applications
> from binding to any UDP ports, and they can pretty much transmit to
> any port on any host without cost using an unconnected UDP socket. I
> think it's more likely that NIC (and switch) vendors simply assumed
> that port numbers can be treated as global values. That's expedient,
> and at small scale we can probably get away with it, but at large
> scale this will eventually bite someone.

I do have access to relatively normal expensive switches that can
basically be used to realize a scenario like the one Alex described. No
firewalls necessary.

If you can guarantee that your customers never have access to your
hypervisors or container management namespace, this is actually a
pretty solid assumption.

Bye,
Hannes
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On Tue, Jun 21, 2016 at 11:17 AM, Alexander Duyck wrote:
> On Tue, Jun 21, 2016 at 10:40 AM, Hannes Frederic Sowa wrote:
>> On 21.06.2016 10:27, Edward Cree wrote:
>>> On 21/06/16 18:05, Alexander Duyck wrote:
>>>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller wrote:
>>>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>>>> out of network cards, because proper support of "sockets" requires
>>>>> access to information the card simply does not and should not have.
>>>> Right. Really what I would like to see for most of these devices is a
>>>> 2-tuple filter where you specify the UDP port number and the PF/VF ID
>>>> that the traffic is received on.
>>> But that doesn't make sense - the traffic is received on a physical network
>>> port, and it's the headers (i.e. flow) at that point that determine whether
>>> the traffic is encap or not. After all, those headers are all that can
>>> determine which PF or VF it's sent to; and if it's multicast and goes to
>>> more than one of them, it seems odd for one to treat it as encap and the
>>> other to treat it as normal UDP - one of them must be misinterpreting it
>>> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
>>> that complication for now).
>>
>> Disabling offloading of packets is never going to cause data corruptions
>> or misinterpretations. In some cases we can hint the network card to do
>> even more (RSS+checksumming). We always have a safe choice, namely not
>> doing hw offloading.
>
> Agreed. Also we need to keep in mind that in many cases things like
> RSS and checksumming can be very easily made port specific since what
> we are talking about is just what is reported in the Rx descriptor and
> not any sort of change to the packet data.
>
>> Multicast is often scoped; in some cases we have different multicast
>> scopes but the same addresses. In case of scoped traffic, we must verify
>> the device as well and can't install the same flow on every NIC.
>
> Right. Hopefully the NIC vendors are thinking ahead and testing to
> validate such cases where multicast or broadcast traffic doesn't do
> anything weird to their NICs in terms of offloads.
>
>>> At a given physical point in the network, a given UDP flow either is or is
>>> not carrying encapsulated traffic, and if it tries to be both then things
>>> are certain to break, just as much as if two different applications try to
>>> use the same UDP flow for two different application protocols.
>>
>> I think the example Tom was hinting at initially is like that:
>>
>> A net namespace acts as a router and has a vxlan endpoint active. The
>> vxlan endpoint enables vxlan offloading on all net_devices in the same
>> namespace. Because we only identify the tunnel endpoint by UDP port
>> number, traffic which should actually just be forwarded and should never
>> be processed locally suddenly can become processed by the offloading hw
>> units. Because UDP ports only form a contract between the end points and
>> not with the router in between, it would be illegal for the router to
>> treat those not locally designated packets as vxlan.
>
> Yes. The problem is I am sure there are some vendors out there
> wanting to tout their product as being excellent at routing VXLAN
> traffic so they are probably exploiting this to try and claim
> performance gains.
>
> There is also some argument to be had for theory versus application.
> Arguably it is the customers that are leading to some of the dirty
> hacks, as I think vendors are building NICs based on customer use cases
> versus following any specifications. In most data centers the tunnel
> underlays will be deployed throughout the network and UDP will likely
> be blocked for anything that isn't being used explicitly for
> tunneling. As such we seem to be seeing a lot of NICs that are only
> supporting one port for things like this instead of designing them to
> handle whatever we can throw at them.

Actually, I don't believe that's true. It is not typical to deploy
firewalls within a data center fabric, nor do we restrict applications
from binding to any UDP ports, and they can pretty much transmit to any
port on any host without cost using an unconnected UDP socket. I think
it's more likely that NIC (and switch) vendors simply assumed that port
numbers can be treated as global values. That's expedient, and at small
scale we can probably get away with it, but at large scale this will
eventually bite someone.

> I really think it may be a few more years before we hit the point
> where the vendors start to catch a clue about the fact that they need
> to have a generic approach that works in all cases versus what we have
> now, where they are supporting whatever the buzzword of the day is and
> not looking much further down the road than that. The fact is in a
> few years time we might even have to start dealing with
> tunnel-in-tunnel type
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On Tue, Jun 21, 2016 at 10:40 AM, Hannes Frederic Sowa wrote:
> On 21.06.2016 10:27, Edward Cree wrote:
>> On 21/06/16 18:05, Alexander Duyck wrote:
>>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller wrote:
>>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>>> out of network cards, because proper support of "sockets" requires
>>>> access to information the card simply does not and should not have.
>>> Right. Really what I would like to see for most of these devices is a
>>> 2-tuple filter where you specify the UDP port number and the PF/VF ID
>>> that the traffic is received on.
>> But that doesn't make sense - the traffic is received on a physical network
>> port, and it's the headers (i.e. flow) at that point that determine whether
>> the traffic is encap or not. After all, those headers are all that can
>> determine which PF or VF it's sent to; and if it's multicast and goes to
>> more than one of them, it seems odd for one to treat it as encap and the
>> other to treat it as normal UDP - one of them must be misinterpreting it
>> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
>> that complication for now).
>
> Disabling offloading of packets is never going to cause data corruptions
> or misinterpretations. In some cases we can hint the network card to do
> even more (RSS+checksumming). We always have a safe choice, namely not
> doing hw offloading.

Agreed. Also we need to keep in mind that in many cases things like
RSS and checksumming can be very easily made port specific since what
we are talking about is just what is reported in the Rx descriptor and
not any sort of change to the packet data.

> Multicast is often scoped; in some cases we have different multicast
> scopes but the same addresses. In case of scoped traffic, we must verify
> the device as well and can't install the same flow on every NIC.

Right. Hopefully the NIC vendors are thinking ahead and testing to
validate such cases where multicast or broadcast traffic doesn't do
anything weird to their NICs in terms of offloads.

>> At a given physical point in the network, a given UDP flow either is or is
>> not carrying encapsulated traffic, and if it tries to be both then things
>> are certain to break, just as much as if two different applications try to
>> use the same UDP flow for two different application protocols.
>
> I think the example Tom was hinting at initially is like that:
>
> A net namespace acts as a router and has a vxlan endpoint active. The
> vxlan endpoint enables vxlan offloading on all net_devices in the same
> namespace. Because we only identify the tunnel endpoint by UDP port
> number, traffic which should actually just be forwarded and should never
> be processed locally suddenly can become processed by the offloading hw
> units. Because UDP ports only form a contract between the end points and
> not with the router in between, it would be illegal for the router to
> treat those not locally designated packets as vxlan.

Yes. The problem is I am sure there are some vendors out there
wanting to tout their product as being excellent at routing VXLAN
traffic so they are probably exploiting this to try and claim
performance gains.

There is also some argument to be had for theory versus application.
Arguably it is the customers that are leading to some of the dirty
hacks, as I think vendors are building NICs based on customer use cases
versus following any specifications. In most data centers the tunnel
underlays will be deployed throughout the network and UDP will likely
be blocked for anything that isn't being used explicitly for
tunneling. As such we seem to be seeing a lot of NICs that are only
supporting one port for things like this instead of designing them to
handle whatever we can throw at them.

I really think it may be a few more years before we hit the point
where the vendors start to catch a clue about the fact that they need
to have a generic approach that works in all cases versus what we have
now, where they are supporting whatever the buzzword of the day is and
not looking much further down the road than that. The fact is in a
few years' time we might even have to start dealing with
tunnel-in-tunnel type workloads to address the use of containers inside
of KVM guests. I'm pretty sure we don't have support for recursive
tunnel offloads in hardware and likely never will. To that end all I
would really need is support for CHECKSUM_COMPLETE or outer Rx
checksums enabled, RSS based on the outer source port assuming the
destination port is recognized as a tunnel, the ability to have the DF
bit set for any of the inner tunnel headers, and GSO partial extended
to support tunnel-in-tunnel scenarios.

- Alex
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 21/06/16 18:40, Hannes Frederic Sowa wrote:
> On 21.06.2016 10:27, Edward Cree wrote:
>> At a given physical point in the network, a given UDP flow either is or is
>> not carrying encapsulated traffic, and if it tries to be both then things
>> are certain to break, just as much as if two different applications try to
>> use the same UDP flow for two different application protocols.
>
> I think the example Tom was hinting at initially is like that:
>
> A net namespace acts as a router and has a vxlan endpoint active. The
> vxlan endpoint enables vxlan offloading on all net_devices in the same
> namespace. Because we only identify the tunnel endpoint by UDP port
> number, traffic which should actually just be forwarded and should never
> be processed locally suddenly can become processed by the offloading hw
> units. Because UDP ports only form a contract between the end points and
> not with the router in between it would be illegal to treat those not
> locally designated packets as vxlan by the router.

Oh indeed, what we currently do is broken. We would have to identify,
for each interface, which (if any) UDP flows on that interface
correspond to our vxlan endpoints, rather than (as now) saying that any
UDP port that matches any endpoint must be vxlan on all interfaces.

But as long as a vxlan endpoint is a device built on top of another
device, it can just ask that device to enable offloads for the
corresponding flow; and if that device is also a software device, it
might modify the flow spec before passing on the request to _its_
underlying device, or it might just drop the request because it doesn't
support it.

The problem is, AIUI the device is currently only used for
transmitting; anything received on any device that makes it through the
IP stack to the vxlan UDP socket is treated as vxlan. And determining
which interfaces' traffic will get delivered locally and which will get
routed is not necessarily trivial. Perhaps vxlan devices need to only
receive traffic that came through their underlying device? Then the
mapping to offload becomes much simpler.

-Ed

> Also multicast traffic is always scoped, so the flow has to include the
> ifindex at least to allow differentiation between different scopes.
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 21.06.2016 10:27, Edward Cree wrote:
> On 21/06/16 18:05, Alexander Duyck wrote:
>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller wrote:
>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>> out of network cards, because proper support of "sockets" requires
>>> access to information the card simply does not and should not have.
>> Right. Really what I would like to see for most of these devices is a
>> 2-tuple filter where you specify the UDP port number and the PF/VF ID
>> that the traffic is received on.
> But that doesn't make sense - the traffic is received on a physical network
> port, and it's the headers (i.e. flow) at that point that determine whether
> the traffic is encap or not. After all, those headers are all that can
> determine which PF or VF it's sent to; and if it's multicast and goes to
> more than one of them, it seems odd for one to treat it as encap and the
> other to treat it as normal UDP - one of them must be misinterpreting it
> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
> that complication for now).

Disabling offloading of packets is never going to cause data
corruptions or misinterpretations. In some cases we can hint the
network card to do even more (RSS+checksumming). We always have a safe
choice, namely not doing hw offloading.

Multicast is often scoped; in some cases we have different multicast
scopes but the same addresses. In case of scoped traffic, we must
verify the device as well and can't install the same flow on every NIC.

> At a given physical point in the network, a given UDP flow either is or is
> not carrying encapsulated traffic, and if it tries to be both then things
> are certain to break, just as much as if two different applications try to
> use the same UDP flow for two different application protocols.

I think the example Tom was hinting at initially is like that:

A net namespace acts as a router and has a vxlan endpoint active. The
vxlan endpoint enables vxlan offloading on all net_devices in the same
namespace. Because we only identify the tunnel endpoint by UDP port
number, traffic which should actually just be forwarded and should
never be processed locally suddenly can become processed by the
offloading hw units. Because UDP ports only form a contract between the
end points and not with the router in between, it would be illegal for
the router to treat those not locally designated packets as vxlan.

Also multicast traffic is always scoped, so the flow has to include the
ifindex at least to allow differentiation between different scopes.

Bye,
Hannes
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 21/06/16 18:05, Alexander Duyck wrote:
> On Tue, Jun 21, 2016 at 1:22 AM, David Miller wrote:
>> But anyways, the vastness of the key is why we want to keep "sockets"
>> out of network cards, because proper support of "sockets" requires
>> access to information the card simply does not and should not have.
> Right. Really what I would like to see for most of these devices is a
> 2-tuple filter where you specify the UDP port number and the PF/VF ID
> that the traffic is received on.

But that doesn't make sense - the traffic is received on a physical
network port, and it's the headers (i.e. flow) at that point that
determine whether the traffic is encap or not. After all, those headers
are all that can determine which PF or VF it's sent to; and if it's
multicast and goes to more than one of them, it seems odd for one to
treat it as encap and the other to treat it as normal UDP - one of them
must be misinterpreting it (unless the UDP is going to a userspace
tunnel endpoint, but I'm ignoring that complication for now).

At a given physical point in the network, a given UDP flow either is or
is not carrying encapsulated traffic, and if it tries to be both then
things are certain to break, just as much as if two different
applications try to use the same UDP flow for two different application
protocols.

-Ed
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On Tue, Jun 21, 2016 at 1:22 AM, David Miller wrote:
> From: Tom Herbert
> Date: Mon, 20 Jun 2016 10:05:01 -0700
>
>> Generally, this means it needs to at least match by local addresses
>> and port for an unconnected/unbound socket, the source address for
>> an unconnected/bound socket, or the full 4-tuple for a connected
>> socket.
>
> These lookup keys are all insufficient.
>
> At the very least the network namespace must be in the lookup key as
> well if you want to match "sockets". And this is just the tip of the
> iceberg in my opinion.
>
> The namespace bypassing to me is the biggest flaw in the UDP tunnel
> offloads. That is creating real dangers right now.

I agree. Fortunately this only really becomes an issue if SR-IOV is
enabled. Otherwise the port based offloads only affect the PF as long
as no VFs are present.

> But anyways, the vastness of the key is why we want to keep "sockets"
> out of network cards, because proper support of "sockets" requires
> access to information the card simply does not and should not have.

Right. Really what I would like to see for most of these devices is a
2-tuple filter where you specify the UDP port number and the PF/VF ID
that the traffic is received on. In order to get that we wouldn't need
any additional information from the API. Then we at least have indirect
namespace isolation, and if someone really wanted to they could do
offloads on the VFs for different traffic.
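[Editor's sketch] The 2-tuple filter Alex describes could look roughly like this in C. Everything here is illustrative - `struct tunnel_filter`, `tunnel_match` and the PF = -1 convention are made-up names, not any driver's actual API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-tuple tunnel filter entry: a UDP destination port
 * plus the PF/VF the traffic is received on. */
struct tunnel_filter {
    uint16_t udp_port;  /* tunnel UDP destination port */
    int      vf_id;     /* PF = -1, VFs numbered from 0 */
};

/* Return true if a packet arriving on `vf_id` with UDP destination
 * port `dport` matches a configured tunnel filter.  The point of the
 * second tuple member is that a port offloaded for one function does
 * not leak onto traffic received by another. */
bool tunnel_match(const struct tunnel_filter *tbl, size_t n,
                  uint16_t dport, int vf_id)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].udp_port == dport && tbl[i].vf_id == vf_id)
            return true;
    return false;
}
```

This gives the indirect namespace isolation Alex mentions: a VF assigned to a guest only sees offloads explicitly configured for its own ID.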
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 21.06.2016 01:22, David Miller wrote:
> From: Tom Herbert
> Date: Mon, 20 Jun 2016 10:05:01 -0700
>
>> Generally, this means it needs to at least match by local addresses
>> and port for an unconnected/unbound socket, the source address for
>> an unconnected/bound socket, or the full 4-tuple for a connected
>> socket.
>
> These lookup keys are all insufficient.
>
> At the very least the network namespace must be in the lookup key as
> well if you want to match "sockets". And this is just the tip of the
> iceberg in my opinion.
>
> The namespace bypassing to me is the biggest flaw in the UDP tunnel
> offloads. That is creating real dangers right now.
>
> But anyways, the vastness of the key is why we want to keep "sockets"
> out of network cards, because proper support of "sockets" requires
> access to information the card simply does not and should not have.

We can't look up UDP sockets without having the namespace. We inherit
the namespace from the device the packet showed up on. So I think in
regards to namespaces this looks more or less fine.

Not being able to match on the destination IP, or even not considering
the routing infrastructure, is kind of dangerous (e.g. we might want to
make the socket not reachable from an interface via ip rules or
blackhole routes, but they wouldn't be considered either).

Bye,
Hannes
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 21/06/16 09:22, David Miller wrote:
> From: Tom Herbert
> Date: Mon, 20 Jun 2016 10:05:01 -0700
>> Generally, this means it needs to at least match by local addresses
>> and port for an unconnected/unbound socket, the source address for an
>> unconnected/bound socket, or the full 4-tuple for a connected socket.
> These lookup keys are all insufficient. At the very least the network
> namespace must be in the lookup key as well if you want to match
> "sockets".

But the card doesn't have to be told that; instead, only push a socket
to a device offload if the device is in the same ns as the socket.
Wouldn't that work? Anything beyond that - i.e. supporting cross-ns
offloads - would require knowing how packets / addresses get
transformed in bridging them from one ns to another, and in general
that's quite a wide set of possibilities; it doesn't seem worthwhile.
Especially since the likely use-case of tunnels plus containers is that
the host does the decapsulation and transparently gives the container a
virtual ethernet device, which keeps the hardware and the "socket" in
the same ns.

> But anyways, the vastness of the key is why we want to keep "sockets"
> out of network cards, because proper support of "sockets" requires
> access to information the card simply does not and should not have.

I think Tom's talk of "sockets" is a red herring; it's more a question
of "flows". If we think of our host as a black box, its decisions ("is
this traffic encapsulated?") necessarily depend upon the 5-tuple plus
the (implicit) information that the traffic is being received on a
particular interface.

Netns are another red herring: even without them, what if our host is a
router with NAT, forwarding traffic to another host? Now you're trying
to match a "socket" on another host (in, perhaps, another IP-address
namespace), but the "flow" is still the same: it's defined in terms of
the addresses on the incoming traffic, not what they might get NATted
to by the time the packets hit an actual socket.

So AFAICT, flow matching up to and including the 5-tuple is both
necessary and sufficient for correct UDP tunnel detection in HW. Sadly
most HW (including our latest here at sfc) thinks it only needs the UDP
dest port :( and for such HW, Tom is right that we can't mix it with
forwarding, and have to reserve the port in all ns.

-Ed
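[Editor's sketch] The contrast between dest-port-only HW and full 5-tuple matching can be illustrated with a tiny C model. All names and the zero-as-wildcard convention are invented for this example; it is not any device's filter format:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 5-tuple flow key.  A zero field in a *filter* acts as
 * a wildcard; addresses are IPv4, host byte order for brevity. */
struct flow_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;          /* 17 = UDP */
};

/* Match a packet's key against a configured filter, honoring
 * wildcards.  A dest-port-only device is simply a filter with every
 * field except dport wildcarded. */
bool flow_match(const struct flow_key *flt, const struct flow_key *pkt)
{
    return (flt->proto == 0 || flt->proto == pkt->proto) &&
           (flt->saddr == 0 || flt->saddr == pkt->saddr) &&
           (flt->daddr == 0 || flt->daddr == pkt->daddr) &&
           (flt->sport == 0 || flt->sport == pkt->sport) &&
           (flt->dport == 0 || flt->dport == pkt->dport);
}
```

A port-only filter will claim transit traffic that merely happens to use port 4789, while a filter that also checks the local destination address rejects it - which is Ed's point about why the 5-tuple (plus interface) is the right granularity.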
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
From: Hannes Frederic Sowa
Date: Mon, 20 Jun 2016 11:11:32 -0700

> I am not sure if this is necessary. The devices actually having the
> ndo-ops, used to configure offloading, should only be visible inside one
> namespace. If those ndo-ops actually have side effects on other
> namespaces, they violate the contract and should be removed from the
> driver. :/

This is most if not all drivers right now, to the best of my
understanding.
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
From: Tom Herbert
Date: Mon, 20 Jun 2016 10:05:01 -0700

> Generally, this means it needs to at least match by local addresses
> and port for an unconnected/unbound socket, the source address for
> an unconnected/bound socket, or the full 4-tuple for a connected
> socket.

These lookup keys are all insufficient.

At the very least the network namespace must be in the lookup key as
well if you want to match "sockets". And this is just the tip of the
iceberg in my opinion.

The namespace bypassing to me is the biggest flaw in the UDP tunnel
offloads. That is creating real dangers right now.

But anyways, the vastness of the key is why we want to keep "sockets"
out of network cards, because proper support of "sockets" requires
access to information the card simply does not and should not have.
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On 20.06.2016 12:27, Tom Herbert wrote:
>> Do we?
>>
>> We look up the socket in a proper way, inclusive of the namespace
>> belonging to the device we received the packet on. That said, I don't
>> see a problem with that right now. But maybe I miss something?
>>
> When we do the socket lookup in udp_rcv this is done after the IP
> layer, and the packet has been determined to be locally addressed. The
> packet is for the host, and if a UDP socket is matched, even if just
> based on destination port, the socket is matched and receive functions
> are called. There is no ambiguity.
>
> When the lookup is performed in GRO this is before we have processed
> IP and determined that the packet is local. So a packet with any
> address (local or not) will match a listener socket with the same UDP
> port. The GRO can be performed and the packet is subsequently
> forwarded, maybe having been modified. If this packet turns out to not
> be the protocol we thought it was (e.g. VXLAN) then we have now
> potentially silently corrupted someone else's packets. Granted,
> there's probably a lot of things that are required to make corruption
> happen, but it does allow the possibility of systematic data
> corruption and we haven't discounted this to become a security
> vulnerability.

Agreed.

Maybe we must switch to always use connected sockets for
unicast+UDP+GRO?

Bye,
Hannes
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
On Mon, Jun 20, 2016 at 2:36 PM, Hannes Frederic Sowa wrote:
> On 20.06.2016 12:27, Tom Herbert wrote:
>>> Do we?
>>>
>>> We look up the socket in a proper way, inclusive of the namespace
>>> belonging to the device we received the packet on. That said, I don't
>>> see a problem with that right now. But maybe I miss something?
>>>
>> When we do the socket lookup in udp_rcv this is done after the IP
>> layer, and the packet has been determined to be locally addressed. The
>> packet is for the host, and if a UDP socket is matched, even if just
>> based on destination port, the socket is matched and receive functions
>> are called. There is no ambiguity.
>>
>> When the lookup is performed in GRO this is before we have processed
>> IP and determined that the packet is local. So a packet with any
>> address (local or not) will match a listener socket with the same UDP
>> port. The GRO can be performed and the packet is subsequently
>> forwarded, maybe having been modified. If this packet turns out to not
>> be the protocol we thought it was (e.g. VXLAN) then we have now
>> potentially silently corrupted someone else's packets. Granted,
>> there's probably a lot of things that are required to make corruption
>> happen, but it does allow the possibility of systematic data
>> corruption and we haven't discounted this to become a security
>> vulnerability.
>
> Agreed.
>
> Maybe we must switch to always use connected sockets for
> unicast+UDP+GRO?

No, I don't think we need to go that far. We should be able to enable
GRO on pretty much any sort of UDP socket--the other reason to move GRO
into udp_rcv is that this will be a more suitable environment for user
programmable GRO, which seems to be on the horizon now.

Tom

> Bye,
> Hannes
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
> Do we?
>
> We look up the socket in a proper way, inclusive of the namespace
> belonging to the device we received the packet on. That said, I don't
> see a problem with that right now. But maybe I miss something?

When we do the socket lookup in udp_rcv this is done after the IP
layer, and the packet has been determined to be locally addressed. The
packet is for the host, and if a UDP socket is matched, even if just
based on destination port, the socket is matched and receive functions
are called. There is no ambiguity.

When the lookup is performed in GRO this is before we have processed IP
and determined that the packet is local. So a packet with any address
(local or not) will match a listener socket with the same UDP port. The
GRO can be performed and the packet is subsequently forwarded, maybe
having been modified. If this packet turns out to not be the protocol
we thought it was (e.g. VXLAN) then we have now potentially silently
corrupted someone else's packets. Granted, there's probably a lot of
things that are required to make corruption happen, but it does allow
the possibility of systematic data corruption and we haven't discounted
this to become a security vulnerability.

Tom

> Thanks,
> Hannes
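[Editor's sketch] The ordering problem Tom describes can be reduced to a toy model in C. `is_local_addr` and the single configured port stand in for the real FIB lookup and UDP socket hash; none of this is kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed local address and tunnel port for this toy model. */
static const uint32_t local_addr = 0x0a000001;  /* 10.0.0.1 */
static const uint16_t vxlan_port = 4789;

static bool is_local_addr(uint32_t daddr)
{
    return daddr == local_addr;
}

/* udp_rcv-time check: only reached after the IP layer decided the
 * packet is locally addressed, so matching by port is unambiguous. */
bool matches_after_ip(uint32_t daddr, uint16_t dport)
{
    return is_local_addr(daddr) && dport == vxlan_port;
}

/* GRO-time check: runs before the local/forward decision, so it can
 * only look at the port - and will claim transit traffic too. */
bool matches_in_gro(uint16_t dport)
{
    return dport == vxlan_port;
}
```

The divergence between the two checks on a forwarded packet is exactly the window for silent corruption: GRO may coalesce and modify a flow that udp_rcv would never have seen.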
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
Hello,

On 20.06.2016 10:05, Tom Herbert wrote:
>> And finally, he added a big comment explaining that new tunnel types
>> should not be added to the tunnel type list, and those that exist
>> should only be used for RX.
>>
>> Therefore, this isn't opening the door for new random offloads, quite
>> the contrary. Instead, it is making clearer what the existing
>> facilities support, and putting an explicit cap on them.
>>
> We have no reason to believe there won't be more offloaded UDP
> protocols. There are plenty of other UDP encapsulations, transport
> protocols, and application layer protocols that could be offloaded,
> and some of these may provide protocol specific features that have no
> generic analogue. The problem is that this interface is wrong.

Certainly not. And you have a hand in this as well. ;)

When we see more and more offloads on networking cards, we will
certainly have to redesign this interface. New cards seem to be
available which offer new kinds of offloads, but I think we should wait
until we have 2-3 vendors supporting such interfaces and try to come up
with either a new version/extension of this interface or move over to a
new interface. We simply don't know yet how and what we actually might
need to support and what abstraction is best to communicate with those
drivers. I think time will simply solve this problem. The Linux kernel
is a living organism anyway, so it will find a way to deal with that.

> The interface and current NIC implementations are based on the
> assumption that UDP port number is sufficient to decode a UDP payload.
> This is incorrect; UDP port numbers do not have global meaning, they
> can only be correctly interpreted by the endpoints in the
> communication (RFC 7605). Just because port number 4789 is assigned to
> VXLAN does not mean that any packet a device sees that has destination
> port 4789 is VXLAN, it may very well be something else completely
> different. This opens up the very real possibility that devices in the
> network, including NICs, misinterpret packets because they are looking
> only at port number. Furthermore, if a device modifies packets based
> on just port number, say like in GRO, this opens the door to silent
> data corruption. This will be a real problem with a router that is
> trying to UDP offload packets just being forwarded, but even on just a
> host this can be an issue when using ns with ipvlan or macvlan.

The Linux kernel does it correctly: for every tunnel we open, we create
a UDP socket and wildcard bind it to the configured port number, so
nothing in the current namespace can bind to this port number at the
same time.

The problem is the actual hardware to date: Intel networking cards seem
to only configure the UDP port in use, not even the address family nor
the destination IP. Other networking cards seem to follow this scheme.
We currently can't do anything else. If later NICs support better
filtering schemes, we can easily add it with Alex's series. The only
improvement I see right now is that we basically have to bind to both
IP protocol versions, because NICs don't care about the version.

> The design of a UDP offload interface should be predicated on the fact
> that we are offloading the receive path of a UDP socket. It follows
> that the requirement of the offload device is to match packets that
> will be received by the UDP socket. Generally, this means it needs to
> at least match by local addresses and port for an unconnected/unbound
> socket, the source address for an unconnected/bound socket, or the
> full 4-tuple for a connected socket. VLAN, macvlan, or anything else
> that could affect the selection of the receive socket would also need
> to be taken into account. The typical driver interface that provides
> for fine grained packet matching is n-tuple filtering.

Agreed. But I fear that we have to deal with the hardware at hand.
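[Editor's sketch] The wildcard-bind reservation Hannes describes can be demonstrated from userspace: holding an unconnected UDP socket bound to INADDR_ANY on the tunnel port makes any further bind in the same namespace fail. `reserve_udp_port` and the port number in the usage below are invented for this illustration:

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Wildcard-bind a UDP socket to `port` (host byte order); returns the
 * fd on success or -1 on failure (e.g. EADDRINUSE).  Holding the fd
 * reserves the port in this network namespace, which is what the
 * kernel tunnel drivers rely on. */
int reserve_udp_port(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* wildcard bind */
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note Hannes's caveat still applies: this reserves the port for IPv4 only, so a complete reservation for port-only NICs would need a second AF_INET6 socket as well.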
> To address the immediate problem with the current existing UDP HW
> offloads I suggest:
>
> 1) Disallow UDP HW offload when forwarding is enabled. Applying
> offloads to UDP encapsulated packets that are only being forwarded is
> not correct.

Strictly speaking, this is the only correct way to do it right now. A
forwarding host can apply the wrong optimizations to received UDP
packets if a tunnel is created and the same host also acts as a router.
Unfortunately, this basically kills vxlan offloading for most, if not
all, virtualization cases and probably for most containers, too.

> 2) Open a socket for the encapsulation port in each created ns
> (similar to how RDS does this). This prevents a UDP port being used
> for encapsulation from being bound for other purposes in the ns.

I am not sure this is necessary. The devices actually having the
ndo-ops used to configure offloading should only be visible inside one
namespace. If those ndo-ops actually have side effects on other
namespaces, they violate the contract and should be removed from the
driver. :/ I would
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
> And finally, he added a big comment explaining that new tunnel types
> should not be added to the tunnel type list, and those that exist
> should only be used for RX.
>
> Therefore, this isn't opening the door for new random offloads, quite
> the contrary. Instead, it is making clearer what the existing
> facilities support, and putting an explicit cap on them.
>

We have no reason to believe there won't be more offloaded UDP
protocols. There are plenty of other UDP encapsulations, transport
protocols, and application layer protocols that could be offloaded, and
some of these may provide protocol-specific features that have no
generic analogue. The problem is that this interface is wrong.

The interface and current NIC implementations are based on the
assumption that a UDP port number is sufficient to decode a UDP
payload. This is incorrect: UDP port numbers do not have global
meaning; they can only be correctly interpreted by the endpoints in the
communication (RFC 7605). Just because port number 4789 is assigned to
VXLAN does not mean that any packet a device sees with destination port
4789 is VXLAN; it may very well be something completely different.

This opens up the very real possibility that devices in the network,
including NICs, misinterpret packets because they are looking only at
the port number. Furthermore, if a device modifies packets based on
just the port number, say in GRO, this opens the door to silent data
corruption. This will be a real problem for a router that applies UDP
offloads to packets it is merely forwarding, but even on a plain host
this can be an issue when using namespaces with ipvlan or macvlan.

The design of a UDP offload interface should be predicated on the fact
that we are offloading the receive path of a UDP socket. It follows
that the requirement of the offload device is to match packets that
will be received by the UDP socket.
Generally, this means it needs to match at least the local address and
port for an unconnected/unbound socket, the source address for an
unconnected/bound socket, and the full 4-tuple for a connected socket.
VLAN, macvlan, or anything else that could affect the selection of the
receive socket would also need to be taken into account. The typical
driver interface that provides for fine-grained packet matching is
n-tuple filtering.

To address the immediate problem with the current existing UDP HW
offloads I suggest:

1) Disallow UDP HW offload when forwarding is enabled. Applying
offloads to UDP encapsulated packets that are only being forwarded is
not correct.

2) Open a socket for the encapsulation port in each created ns
(similar to how RDS does this). This prevents a UDP port being used
for encapsulation from being bound for other purposes in the ns.

Btw, we also have this problem in the stack. GRO for UDP encapsulation
is performed before we have verified the packet is for us ("us" being
the encapsulation socket in the default namespace). The solution here,
I believe, is to move UDP GRO into udp_receive, although if the above
are implemented that would also address the issue in SW GRO, at least
for the short term. flow_dissector does not attempt to crack open UDP
encapsulation, so we don't have the issue there.

Thanks,
Tom
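The matching rules listed above can be made concrete with a small sketch: a device should only claim a packet for socket-specific offload if the socket would actually receive it, and the amount of the tuple that must match grows with how bound/connected the socket is. This is an illustrative model, not any driver API; all field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Pkt:
    saddr: str   # source IP
    sport: int   # source port
    daddr: str   # destination IP
    dport: int   # destination port

@dataclass
class Sock:
    local_port: int
    local_addr: Optional[str] = None    # set iff bound to an address
    remote_addr: Optional[str] = None   # set iff connected
    remote_port: Optional[int] = None   # set iff connected

def matches(pkt: Pkt, sk: Sock) -> bool:
    """Would this packet be delivered to this UDP socket?  Only then
    may a device safely apply socket-specific offload to it."""
    if pkt.dport != sk.local_port:
        return False
    if sk.local_addr is not None and pkt.daddr != sk.local_addr:
        return False  # bound socket: local address must match too
    if sk.remote_addr is not None and (pkt.saddr != sk.remote_addr or
                                       pkt.sport != sk.remote_port):
        return False  # connected socket: full 4-tuple must match
    return True

vxlan = Sock(local_port=4789)  # wildcard-bound encapsulation socket
pkt = Pkt("10.0.0.1", 40000, "10.0.0.2", 4789)
print(matches(pkt, vxlan))  # True: port matches, no further constraints
print(matches(pkt, Sock(4789, remote_addr="10.0.0.9",
                        remote_port=40000)))  # False: wrong peer
```

A port-only NIC filter implements just the first check; everything after it is what the n-tuple filtering mentioned above would have to supply.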
Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers
From: Alexander Duyck
Date: Thu, 16 Jun 2016 12:20:35 -0700

> This patch is meant to address two things. First we are currently
> using the ndo_add/del_vxlan_port calls with VXLAN-GPE tunnels and we
> cannot really support that as it is likely to cause more harm than
> good since VXLAN-GPE can support tunnels without a MAC address on the
> inner header.
>
> As such we need to add a new offload to advertise this, but in doing
> so it would mean introducing 3 new functions for the driver to request
> the ports, and then for the tunnel to push the changes to add and
> delete the ports to the device. However instead of taking that
> approach I think it would be much better if we just made one common
> function for fetching the ports, and provided a generic means to push
> the tunnels to the device. So in order to make this work this patch
> set does several things.
 ...

Series applied, thanks Alexander.

Tom, I've heard your arguments, but I think your fears are unfounded.
Look at what Alexander's patches are actually doing.

First, he's fixing a bug. Hardware that supports VXLAN offloading
doesn't support VXLAN-GPE, yet we were sending such things to the
VXLAN offload paths.

Second, he's eliminating a metric ton of Kconfig garbage, as drivers
had to have all kinds of contrived dependencies to support UDP tunnel
offloads.

Third, he's consolidating several driver NDO methods into just two.

And finally, he added a big comment explaining that new tunnel types
should not be added to the tunnel type list, and those that exist
should only be used for RX.

Therefore, this isn't opening the door for new random offloads, quite
the contrary. Instead, it is making clearer what the existing
facilities support, and putting an explicit cap on them.

Thanks.