Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Hannes Frederic Sowa
On 21.06.2016 11:42, Tom Herbert wrote:
>> > There is also some argument to be had for theory versus application.
>> > Arguably it is the customers that are leading to some of the dirty
>> > hacks as I think vendors are building NICs based on customer use cases
>> > versus following any specifications.  In most data centers the tunnel
>> > underlays will be deployed throughout the network and UDP will likely
>> > be blocked for anything that isn't being used explicitly for
>> > tunneling.  As such we seem to be seeing a lot of NICs that are only
>> > supporting one port for things like this instead of designing them to
>> > handle whatever we can throw at them.
>> >
> Actually, I don't believe that's true. It is not typical to deploy
> firewalls within a data center fabric, nor do we restrict
> applications from binding to any UDP ports; they can pretty much
> transmit to any port on any host, without cost, using an unconnected UDP
> socket. I think it's more likely that NIC (and switch) vendors simply
> assumed that port numbers can be treated as global values. That's
> expedient, and at small scale we can probably get away with it, but at
> large scale this will eventually bite someone.

I do have access to relatively normal expensive switches that can
basically be used to realize a scenario like the one Alex described. No
firewalls necessary. If you can guarantee that your customers never have
access to your hypervisors or container management namespace, this is
actually a pretty solid assumption.

Bye,
Hannes



Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Tom Herbert
On Tue, Jun 21, 2016 at 11:17 AM, Alexander Duyck
 wrote:
> On Tue, Jun 21, 2016 at 10:40 AM, Hannes Frederic Sowa
>  wrote:
>> On 21.06.2016 10:27, Edward Cree wrote:
>>> On 21/06/16 18:05, Alexander Duyck wrote:
>>>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller  wrote:
>>>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>>>> out of network cards, because proper support of "sockets" requires
>>>>> access to information the card simply does not and should not have.
>>>> Right.  Really what I would like to see for most of these devices is a
>>>> 2 tuple filter where you specify the UDP port number, and the PF/VF ID
>>>> that the traffic is received on.
>>> But that doesn't make sense - the traffic is received on a physical network
>>> port, and it's the headers (i.e. flow) at that point that determine whether
>>> the traffic is encap or not.  After all, those headers are all that can
>>> determine which PF or VF it's sent to; and if it's multicast and goes to
>>> more than one of them, it seems odd for one to treat it as encap and the
>>> other to treat it as normal UDP - one of them must be misinterpreting it
>>> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
>>> that complication for now).
>>
>> Disabling offloading of packets is never going to cause data corruptions
>> or misinterpretations. In some cases we can hint the network card to do
>> even more (RSS+checksumming). We always have a safe choice, namely not
>> doing hw offloading.
>
> Agreed.  Also we need to keep in mind that in many cases things like
> RSS and checksumming can be very easily made port specific since what
> we are talking about is just what is reported in the Rx descriptor and
> not any sort of change to the packet data.
>
>> Multicast is often scoped, in some cases we have different multicast
>> scopes but the same addresses. In case of scoped traffic, we must verify
>> the device as well and can't install the same flow on every NIC.
>
> Right.  Hopefully the NIC vendors are thinking ahead and testing to
> validate such cases where multicast or broadcast traffic doesn't do
> anything weird to their NICs in terms of offloads.
>
>>> At a given physical point in the network, a given UDP flow either is or is
>>> not carrying encapsulated traffic, and if it tries to be both then things
>>> are certain to break, just as much as if two different applications try to
>>> use the same UDP flow for two different application protocols.
>>
>> I think the example Tom was hinting at initially is like that:
>>
>> A net namespace acts as a router and has a vxlan endpoint active. The
>> vxlan endpoint enables vxlan offloading on all net_devices in the same
>> namespace. Because we only identify the tunnel endpoint by UDP port
>> number, traffic which should actually just be forwarded and should never
>> be processed locally suddenly can become processed by the offloading hw
>> units. Because UDP ports only form a contract between the end points and
>> not with the router in between it would be illegal to treat those not
>> locally designated packets as vxlan by the router.
>
> Yes.  The problem is I am sure there are some vendors out there
> wanting to tout their product as being excellent at routing VXLAN
> traffic so they are probably exploiting this to try and claim
> performance gains.
>
> There is also some argument to be had for theory versus application.
> Arguably it is the customers that are leading to some of the dirty
> hacks as I think vendors are building NICs based on customer use cases
> versus following any specifications.  In most data centers the tunnel
> underlays will be deployed throughout the network and UDP will likely
> be blocked for anything that isn't being used explicitly for
> tunneling.  As such we seem to be seeing a lot of NICs that are only
> supporting one port for things like this instead of designing them to
> handle whatever we can throw at them.
>
Actually, I don't believe that's true. It is not typical to deploy
firewalls within a data center fabric, nor do we restrict
applications from binding to any UDP ports; they can pretty much
transmit to any port on any host, without cost, using an unconnected UDP
socket. I think it's more likely that NIC (and switch) vendors simply
assumed that port numbers can be treated as global values. That's
expedient, and at small scale we can probably get away with it, but at
large scale this will eventually bite someone.

> I really think it may be a few more years before we hit the point
> where the vendors start to catch a clue about the fact that they need
> to have a generic approach that works in all cases versus what we have
> now where they are supporting whatever the buzzword of the day is and
> not looking much further down the road than that.  The fact is, in a
> few years' time we might even have to start dealing with
> tunnel-in-tunnel type 

Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Alexander Duyck
On Tue, Jun 21, 2016 at 10:40 AM, Hannes Frederic Sowa
 wrote:
> On 21.06.2016 10:27, Edward Cree wrote:
>> On 21/06/16 18:05, Alexander Duyck wrote:
>>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller  wrote:
>>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>>> out of network cards, because proper support of "sockets" requires
>>>> access to information the card simply does not and should not have.
>>> Right.  Really what I would like to see for most of these devices is a
>>> 2 tuple filter where you specify the UDP port number, and the PF/VF ID
>>> that the traffic is received on.
>> But that doesn't make sense - the traffic is received on a physical network
>> port, and it's the headers (i.e. flow) at that point that determine whether
>> the traffic is encap or not.  After all, those headers are all that can
>> determine which PF or VF it's sent to; and if it's multicast and goes to
>> more than one of them, it seems odd for one to treat it as encap and the
>> other to treat it as normal UDP - one of them must be misinterpreting it
>> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
>> that complication for now).
>
> Disabling offloading of packets is never going to cause data corruptions
> or misinterpretations. In some cases we can hint the network card to do
> even more (RSS+checksumming). We always have a safe choice, namely not
> doing hw offloading.

Agreed.  Also we need to keep in mind that in many cases things like
RSS and checksumming can be very easily made port specific since what
we are talking about is just what is reported in the Rx descriptor and
not any sort of change to the packet data.

> Multicast is often scoped, in some cases we have different multicast
> scopes but the same addresses. In case of scoped traffic, we must verify
> the device as well and can't install the same flow on every NIC.

Right.  Hopefully the NIC vendors are thinking ahead and testing to
validate such cases where multicast or broadcast traffic doesn't do
anything weird to their NICs in terms of offloads.

>> At a given physical point in the network, a given UDP flow either is or is
>> not carrying encapsulated traffic, and if it tries to be both then things
>> are certain to break, just as much as if two different applications try to
>> use the same UDP flow for two different application protocols.
>
> I think the example Tom was hinting at initially is like that:
>
> A net namespace acts as a router and has a vxlan endpoint active. The
> vxlan endpoint enables vxlan offloading on all net_devices in the same
> namespace. Because we only identify the tunnel endpoint by UDP port
> number, traffic which should actually just be forwarded and should never
> be processed locally suddenly can become processed by the offloading hw
> units. Because UDP ports only form a contract between the end points and
> not with the router in between it would be illegal to treat those not
> locally designated packets as vxlan by the router.

Yes.  The problem is I am sure there are some vendors out there
wanting to tout their product as being excellent at routing VXLAN
traffic so they are probably exploiting this to try and claim
performance gains.

There is also some argument to be had for theory versus application.
Arguably it is the customers that are leading to some of the dirty
hacks as I think vendors are building NICs based on customer use cases
versus following any specifications.  In most data centers the tunnel
underlays will be deployed throughout the network and UDP will likely
be blocked for anything that isn't being used explicitly for
tunneling.  As such we seem to be seeing a lot of NICs that are only
supporting one port for things like this instead of designing them to
handle whatever we can throw at them.

I really think it may be a few more years before we hit the point
where the vendors start to catch a clue about the fact that they need
to have a generic approach that works in all cases versus what we have
now where they are supporting whatever the buzzword of the day is and
not looking much further down the road than that.  The fact is, in a
few years' time we might even have to start dealing with
tunnel-in-tunnel type workloads to address the use of containers
inside of KVM guests.  I'm pretty sure we don't have support for
recursive tunnel offloads in hardware and likely never will.  To that
end all I would really need is support for CHECKSUM_COMPLETE or outer
Rx checksums enabled, RSS based on the outer source port assuming the
destination port is recognized as a tunnel, the ability to have the DF bit
set for any of the inner tunnel headers, and GSO partial extended to
support tunnel-in-tunnel scenarios.
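
As a rough sketch of what the CHECKSUM_COMPLETE piece asks of a driver
(hypothetical function name; exactly what span the hardware sum covers
is device specific):

#include <linux/skbuff.h>

/* Hypothetical RX completion helper.  With CHECKSUM_COMPLETE the device
 * never has to parse tunnel headers at all: it hands the stack a
 * ones-complement sum over the received data and the stack can then
 * validate however many inner checksums it finds.
 */
static void mydrv_rx_set_csum(struct sk_buff *skb, __wsum hw_csum)
{
	skb->csum = hw_csum;			/* sum reported by the NIC */
	skb->ip_summed = CHECKSUM_COMPLETE;	/* stack verifies the rest */
}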

- Alex


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Edward Cree
On 21/06/16 18:40, Hannes Frederic Sowa wrote:
> On 21.06.2016 10:27, Edward Cree wrote:
>> At a given physical point in the network, a given UDP flow either is or is
>> not carrying encapsulated traffic, and if it tries to be both then things
>> are certain to break, just as much as if two different applications try to
>> use the same UDP flow for two different application protocols.
> I think the example Tom was hinting at initially is like that:
>
> A net namespace acts as a router and has a vxlan endpoint active. The
> vxlan endpoint enables vxlan offloading on all net_devices in the same
> namespace. Because we only identify the tunnel endpoint by UDP port
> number, traffic which should actually just be forwarded and should never
> be processed locally suddenly can become processed by the offloading hw
> units. Because UDP ports only form a contract between the end points and
> not with the router in between it would be illegal to treat those not
> locally designated packets as vxlan by the router.
Oh indeed, what we currently do is broken.  We would have to identify, for
each interface, which (if any) UDP flows on that interface correspond to
our vxlan endpoints, rather than (as now) saying that any UDP port that
matches any endpoint must be vxlan on all interfaces.
But as long as a vxlan endpoint is a device built on top of another device,
it can just ask that device to enable offloads for the corresponding flow;
and if that device is also a software device, it might modify the flow spec
before passing on the request to _its_ underlying device, or it might just
drop the request because it doesn't support it.
The problem is, AIUI the device is currently only used for transmitting;
anything received on any device that makes it through the IP stack to the
vxlan UDP socket is treated as vxlan.  And determining which interfaces'
traffic will get delivered locally and which will get routed is not
necessarily trivial.  Perhaps vxlan devices need to only receive traffic
that came through their underlying device?  Then the mapping to offload
becomes much simpler.
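
A purely illustrative sketch of that idea (none of these types or ops
exist today): each layer either rewrites the flow spec for its lower
device or refuses the request, and only the real NIC programs it.

#include <errno.h>
#include <stdint.h>

struct encap_flow_spec {
	int		family;		/* AF_INET / AF_INET6 */
	uint8_t		daddr[16];	/* tunnel endpoint (local) address */
	uint16_t	dst_port;	/* the endpoint's UDP port */
};

struct encap_dev {
	struct encap_dev *lower;	/* underlying device, NULL at the bottom */
	/* rewrite the spec for this layer (VLAN push, address rewrite, ...) */
	int (*xlate)(struct encap_dev *dev, struct encap_flow_spec *spec);
	/* program the flow; only set on the real hardware device */
	int (*program)(struct encap_dev *dev, const struct encap_flow_spec *spec);
};

static int request_encap_offload(struct encap_dev *dev,
				 struct encap_flow_spec *spec)
{
	if (dev->program)			/* reached the hardware */
		return dev->program(dev, spec);
	if (!dev->lower || !dev->xlate)
		return -EOPNOTSUPP;		/* cannot express the flow */
	if (dev->xlate(dev, spec))
		return -EOPNOTSUPP;
	return request_encap_offload(dev->lower, spec);
}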

-Ed
> Also multicast traffic is always scoped, so the flow has to include the
> ifindex at least to allow differentiation between different scopes.


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Hannes Frederic Sowa
On 21.06.2016 10:27, Edward Cree wrote:
> On 21/06/16 18:05, Alexander Duyck wrote:
>> On Tue, Jun 21, 2016 at 1:22 AM, David Miller  wrote:
>>> But anyways, the vastness of the key is why we want to keep "sockets"
>>> out of network cards, because proper support of "sockets" requires
>>> access to information the card simply does not and should not have.
>> Right.  Really what I would like to see for most of these devices is a
>> 2 tuple filter where you specify the UDP port number, and the PF/VF ID
>> that the traffic is received on.
> But that doesn't make sense - the traffic is received on a physical network
> port, and it's the headers (i.e. flow) at that point that determine whether
> the traffic is encap or not.  After all, those headers are all that can
> determine which PF or VF it's sent to; and if it's multicast and goes to
> more than one of them, it seems odd for one to treat it as encap and the
> other to treat it as normal UDP - one of them must be misinterpreting it
> (unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
> that complication for now).

Disabling offloading of packets is never going to cause data corruptions
or misinterpretations. In some cases we can hint the network card to do
even more (RSS+checksumming). We always have a safe choice, namely not
doing hw offloading.

Multicast is often scoped, in some cases we have different multicast
scopes but the same addresses. In case of scoped traffic, we must verify
the device as well and can't install the same flow on every NIC.

> At a given physical point in the network, a given UDP flow either is or is
> not carrying encapsulated traffic, and if it tries to be both then things
> are certain to break, just as much as if two different applications try to
> use the same UDP flow for two different application protocols.

I think the example Tom was hinting at initially is like that:

A net namespace acts as a router and has a vxlan endpoint active. The
vxlan endpoint enables vxlan offloading on all net_devices in the same
namespace. Because we only identify the tunnel endpoint by UDP port
number, traffic which should actually just be forwarded and should never
be processed locally suddenly can become processed by the offloading hw
units. Because UDP ports only form a contract between the end points and
not with the router in between it would be illegal to treat those not
locally designated packets as vxlan by the router.

Also multicast traffic is always scoped, so the flow has to include the
ifindex at least to allow differentiation between different scopes.

Bye,
Hannes




Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Edward Cree
On 21/06/16 18:05, Alexander Duyck wrote:
> On Tue, Jun 21, 2016 at 1:22 AM, David Miller  wrote:
>> But anyways, the vastness of the key is why we want to keep "sockets"
>> out of network cards, because proper support of "sockets" requires
>> access to information the card simply does not and should not have.
> Right.  Really what I would like to see for most of these devices is a
> 2 tuple filter where you specify the UDP port number, and the PF/VF ID
> that the traffic is received on.
But that doesn't make sense - the traffic is received on a physical network
port, and it's the headers (i.e. flow) at that point that determine whether
the traffic is encap or not.  After all, those headers are all that can
determine which PF or VF it's sent to; and if it's multicast and goes to
more than one of them, it seems odd for one to treat it as encap and the
other to treat it as normal UDP - one of them must be misinterpreting it
(unless the UDP is going to a userspace tunnel endpoint, but I'm ignoring
that complication for now).
At a given physical point in the network, a given UDP flow either is or is
not carrying encapsulated traffic, and if it tries to be both then things
are certain to break, just as much as if two different applications try to
use the same UDP flow for two different application protocols.

-Ed


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Alexander Duyck
On Tue, Jun 21, 2016 at 1:22 AM, David Miller  wrote:
> From: Tom Herbert 
> Date: Mon, 20 Jun 2016 10:05:01 -0700
>
>> Generally, this means it needs to at least match by local addresses
>> and port for an unconnected/unbound socket, the source address for
>> an unconnected/bound socket, and the full 4-tuple for a connected
>> socket.
>
> These lookup keys are all insufficient.
>
> At the very least the network namespace must be in the lookup key as
> well if you want to match "sockets".  And this is just the tip of the
> iceberg in my opinion.
>
> The namespace bypassing to me is the biggest flaw in the UDP tunnel
> offloads.  That is creating real dangers right now.

I agree.  Fortunately this only really becomes an issue if SR-IOV is
enabled.  Otherwise the port based offloads only affect the PF as long
as no VFs are present.

> But anyways, the vastness of the key is why we want to keep "sockets"
> out of network cards, because proper support of "sockets" requires
> access to information the card simply does not and should not have.

Right.  Really what I would like to see for most of these devices is a
2 tuple filter where you specify the UDP port number, and the PF/VF ID
that the traffic is received on.  In order to get that we wouldn't
need any additional information from the API.  Then we at least have
indirect namespace isolation, and if someone really wanted to they
could do offloads on the VFs for different traffic.
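
As a sketch, the whole filter would be nothing more than this
(hypothetical structure, not an existing API):

#include <linux/types.h>

/* Hypothetical 2-tuple filter as described above.  The PF/VF index gives
 * indirect namespace isolation because each function's netdev can only
 * live in one namespace at a time.
 */
struct udp_encap_2tuple {
	u16	func_id;	/* 0 for the PF, 1..n for VFs (assumed convention) */
	__be16	port;		/* UDP destination port to treat as encapsulation */
};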


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Hannes Frederic Sowa
On 21.06.2016 01:22, David Miller wrote:
> From: Tom Herbert 
> Date: Mon, 20 Jun 2016 10:05:01 -0700
> 
>> Generally, this means it needs to at least match by local addresses
>> and port for an unconnected/unbound socket, the source address for
>> an unconnected/bound socket, and the full 4-tuple for a connected
>> socket.
> 
> These lookup keys are all insufficient.
> 
> At the very least the network namespace must be in the lookup key as
> well if you want to match "sockets".  And this is just the tip of the
> iceberg in my opinion.
> 
> The namespace bypassing to me is the biggest flaw in the UDP tunnel
> offloads.  That is creating real dangers right now.
> 
> But anyways, the vastness of the key is why we want to keep "sockets"
> out of network cards, because proper support of "sockets" requires
> access to information the card simply does not and should not have.

We can't look up UDP sockets without having the namespace. We inherit
the namespace from the device the packet showed up on. So with regard
to namespaces I think this looks more or less fine.

Not being able to match on the destination IP, or not considering
the routing infrastructure at all, is kind of dangerous (e.g. we might
want to make the socket unreachable from an interface via ip rules or
blackhole routes, but those wouldn't be considered either).

Bye,
Hannes



Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread Edward Cree
On 21/06/16 09:22, David Miller wrote:
> From: Tom Herbert  Date: Mon, 20 Jun 2016 10:05:01 -0700
>> Generally, this means it needs to at least match by local addresses and port 
>> for an unconnected/unbound socket, the source address for an 
>> unconnected/bound socket, a the full 4-tuple for a connected socket. 
> These lookup keys are all insufficient. At the very least the network 
> namespace must be in the lookup key as well if you want to match "sockets".
But the card doesn't have to be told that; instead, only push a socket to
a device offload if the device is in the same ns as the socket.  Wouldn't
that work?
Anything beyond that - i.e. supporting cross-ns offloads - would require
knowing how packets / addresses get transformed in bridging them from one
ns to another and in general that's quite a wide set of possibilities; it
doesn't seem worthwhile.  Especially since the likely use-case of tunnels
plus containers is that the host does the decapsulation and transparently
gives the container a virtual ethernet device, which keeps the hardware
and the "socket" in the same ns.
> But anyways, the vastness of the key is why we want to keep "sockets"
> out of network cards, because proper support of "sockets" requires
> access to information the card simply does not and should not have.

I think Tom's talk of "sockets" is a red herring; it's more a question of
"flows".  If we think of our host as a black box, its decisions ("is this
traffic encapsulated?") necessarily depend upon the 5-tuple plus the
(implicit) information that the traffic is being received on a particular
interface.
Netns are another red herring: even without them, what if our host is a
router with NAT, forwarding traffic to another host?  Now you're trying to
match a "socket" on another host (in, perhaps, another IP-address
namespace), but the "flow" is still the same: it's defined in terms of the
addresses on the incoming traffic, not what they might get NATted to by
the time the packets hit an actual socket.

So AFAICT, flow matching up to and including 5-tuple is both necessary and
sufficient for correct UDP tunnel detection in HW.  Sadly most HW
(including our latest here at sfc) thinks it only needs UDP dest port :(
and for such HW, Tom is right that we can't mix it with forwarding, and
have to reserve the port in all ns.
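
To make that concrete, the key would carry something like the following
(hypothetical structure); port-only hardware is just the degenerate case
where every field except dport is wildcarded:

#include <linux/types.h>
#include <linux/in6.h>

/* Hypothetical flow key for encap detection at one point in the network. */
struct encap_flow_key {
	int		ifindex;	/* interface the traffic arrives on */
	int		family;		/* AF_INET / AF_INET6 */
	struct in6_addr	saddr, daddr;	/* v4 addresses mapped into v6 */
	__be16		sport, dport;
};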

-Ed



Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread David Miller
From: Hannes Frederic Sowa 
Date: Mon, 20 Jun 2016 11:11:32 -0700

> I am not sure if this is necessary. The devices actually having the
> ndo-ops, used to configure offloading, should only be visible inside one
> namespace. If those ndo-ops actually have side effects on other
> namespaces, they violate the contract and should be removed from the
> driver. :/

This is most if not all drivers right now, to the best of my
understanding.


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-21 Thread David Miller
From: Tom Herbert 
Date: Mon, 20 Jun 2016 10:05:01 -0700

> Generally, this means it needs to at least match by local addresses
> and port for an unconnected/unbound socket, the source address for
> an unconnected/bound socket, and the full 4-tuple for a connected
> socket.

These lookup keys are all insufficient.

At the very least the network namespace must be in the lookup key as
well if you want to match "sockets".  And this is just the tip of the
iceberg in my opinion.

The namespace bypassing to me is the biggest flaw in the UDP tunnel
offloads.  That is creating real dangers right now.

But anyways, the vastness of the key is why we want to keep "sockets"
out of network cards, because proper support of "sockets" requires
access to information the card simply does not and should not have.


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-20 Thread Hannes Frederic Sowa
On 20.06.2016 12:27, Tom Herbert wrote:
>> Do we?
>>
>> We look up the socket in a proper way, inclusive the namespace belonging
>> to the device we received the packet on. That said, I don't see a
>> problem with that right now. But maybe I miss something?
>>
> When we do the socket lookup in udp_rcv this is done after the IP layer
> and the packet has been determined to be locally addressed. The packet is
> for the host, and if a UDP socket is matched, even if just based on
> destination port, the socket is matched and the receive functions are
> called. There is no ambiguity.
> 
> When the lookup is performed in GRO this is before we have processed IP
> and determined that the packet is local. So a packet with any address
> (local or not) will match a listener socket with the same UDP port.
> GRO can be performed and the packet subsequently forwarded, maybe
> having been modified. If this packet turns out not to be the protocol
> we thought it was (e.g. VXLAN) then we have now potentially silently
> corrupted someone else's packets. Granted, there's probably a lot of
> things required to make corruption happen, but it does allow
> the possibility of systematic data corruption and we haven't discounted
> this becoming a security vulnerability.

Agreed.

Maybe we must switch to always use connected sockets for unicast+UDP+GRO?

Bye,
Hannes




Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-20 Thread Tom Herbert
On Mon, Jun 20, 2016 at 2:36 PM, Hannes Frederic Sowa
 wrote:
> On 20.06.2016 12:27, Tom Herbert wrote:
>>> Do we?
>>>
>>> We look up the socket in a proper way, inclusive the namespace belonging
>>> to the device we received the packet on. That said, I don't see a
>>> problem with that right now. But maybe I miss something?
>>>
>> When we do the socket lookup in udp_rcv this is done after the IP layer
>> and the packet has been determined to be locally addressed. The packet is
>> for the host, and if a UDP socket is matched, even if just based on
>> destination port, the socket is matched and the receive functions are
>> called. There is no ambiguity.
>>
>> When the lookup is performed in GRO this is before we have processed IP
>> and determined that the packet is local. So a packet with any address
>> (local or not) will match a listener socket with the same UDP port.
>> GRO can be performed and the packet subsequently forwarded, maybe
>> having been modified. If this packet turns out not to be the protocol
>> we thought it was (e.g. VXLAN) then we have now potentially silently
>> corrupted someone else's packets. Granted, there's probably a lot of
>> things required to make corruption happen, but it does allow
>> the possibility of systematic data corruption and we haven't discounted
>> this becoming a security vulnerability.
>
> Agreed.
>
> Maybe we must switch to always use connected sockets for unicast+UDP+GRO?
>
No, I don't think we need to go that far. We should be able to enable
GRO on pretty much any sort of UDP socket--the other reason to move
GRO into udp_rcv is that this will be a more suitable environment for
user-programmable GRO, which seems to be on the horizon now.
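
Roughly the ordering I have in mind (all helpers here are made up for
illustration; this is not actual kernel code):

#include <linux/skbuff.h>
#include <net/sock.h>

/* Sketch: the socket lookup -- and therefore any GRO decision -- happens
 * only after the IP layer has decided the packet is locally destined, so
 * forwarded traffic is never touched.
 */
static int udp_rcv_sketch(struct sk_buff *skb)
{
	struct sock *sk = lookup_local_udp_socket(skb);	/* hypothetical */

	if (!sk)
		return handle_unmatched_udp(skb);	/* ICMP port unreach, drop */
	if (udp_socket_wants_gro(sk))			/* per-socket opt-in */
		skb = socket_gro_receive(sk, skb);	/* hypothetical */
	return deliver_udp_to_socket(sk, skb);
}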

Tom

> Bye,
> Hannes
>
>


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-20 Thread Tom Herbert
> Do we?
>
> We look up the socket in a proper way, inclusive the namespace belonging
> to the device we received the packet on. That said, I don't see a
> problem with that right now. But maybe I miss something?
>
When we do the socket lookup in udp_rcv this is done after the IP layer
and the packet has been determined to be locally addressed. The packet is
for the host, and if a UDP socket is matched, even if just based on
destination port, the socket is matched and the receive functions are
called. There is no ambiguity.

When the lookup is performed in GRO this is before we have processed IP
and determined that the packet is local. So a packet with any address
(local or not) will match a listener socket with the same UDP port.
GRO can be performed and the packet subsequently forwarded, maybe
having been modified. If this packet turns out not to be the protocol
we thought it was (e.g. VXLAN) then we have now potentially silently
corrupted someone else's packets. Granted, there's probably a lot of
things required to make corruption happen, but it does allow
the possibility of systematic data corruption and we haven't discounted
this becoming a security vulnerability.

Tom

> Thanks,
> Hannes
>


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-20 Thread Hannes Frederic Sowa
Hello,

On 20.06.2016 10:05, Tom Herbert wrote:
>> And finally, he added a big comment explaining that new tunnel types
>> should not be added to the tunnel type list, and those that exist
>> should only be used for RX.
>>
>> Therefore, this isn't opening the door for new random offloads, quite
>> the contrary.  Instead, it is making clearer what the existing
>> facilities support, and putting an explicit cap on them.
>>
> We have no reason to believe there won't be more offloaded UDP
> protocols. There are plenty of other UDP encapsulations, transport
> protocols, and application layer protocols that could be offloaded and
> some of these may provide protocol specific features that have no
> generic analogue. The problem is that this interface is wrong.

Certainly not. And you have a hand in this as well. ;)

When we see more and more offloads on networking cards we will certainly
have to redesign this interface. New cards seem to be available which
offer new kinds of offloads, but I think we should wait until we have 2-3
vendors supporting such interfaces before trying to come up with either a
new version/extension of this interface or moving over to a new one.
We simply don't know yet how and what we actually might need to support
and what abstraction is best for communicating with those drivers. I think
time will simply solve this problem. The Linux kernel is a living organism
anyway, so it will find a way to deal with this.

> The interface and current NIC implementations are based on the
> assumption that the UDP port number is sufficient to decode a UDP payload.
> This is incorrect: UDP port numbers do not have global meaning; they
> can only be correctly interpreted by the endpoints in the
> communication (RFC 7605).  Just because port number 4789 is assigned to
> VXLAN does not mean that any packet a device sees with destination
> port 4789 is VXLAN; it may very well be something completely
> different. This opens up the very real possibility that devices in the
> network, including NICs, misinterpret packets because they are looking
> only at the port number. Furthermore, if a device modifies packets based
> on just the port number, say as in GRO, this opens the door to silent
> data corruption. This will be a real problem for a router that is
> trying to apply UDP offloads to packets that are just being forwarded,
> but even on just a host this can be an issue when using namespaces with
> ipvlan or macvlan.

The Linux kernel does it correctly: for every tunnel we open, we create
a UDP socket and wildcard bind it to the configured port number, so
nothing in the current namespace can bind to this port number at the
same time.

The problem is the actual hardware to date: Intel networking cards seem
to only configure the UDP port in use, not even the address family nor
the destination ip. Other networking cards seem to follow this schema.
We currently can't do anything else. If later NICs support better
filtering schemes, we can easily add it with Alex's series.
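
For what it's worth, the driver-side handler shape from Alex's series, as
I read it, is roughly the following; a richer filter (the address family
is already there, a local address could follow) would extend
udp_tunnel_info without churning every driver again.
mydrv_program_vxlan_port() is a made-up driver function.

#include <linux/netdevice.h>
#include <net/udp_tunnel.h>

/* Rough sketch of a driver handler for the new UDP tunnel port ndo. */
static void mydrv_udp_tunnel_add(struct net_device *dev,
				 struct udp_tunnel_info *ti)
{
	if (ti->type != UDP_TUNNEL_TYPE_VXLAN)	/* only offload what we parse */
		return;
	/* today only the port (and sa_family) is available to the driver */
	mydrv_program_vxlan_port(netdev_priv(dev), ti->port);
}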

The only improvement I see right now is that we basically have to bind
to both IP protocol versions, because NICs don't care about the version.

> The design of a UDP offload interface should be predicated on the fact
> that we are offloading the receive path of a UDP socket. It follows
> that the requirement of the offload device is to match packets that
> will be received by the UDP socket. Generally, this means it needs to
> at least match by local addresses and port for an unconnected/unbound
> socket, the source address for an unconnected/bound socket, and the full
> 4-tuple for a connected socket. VLAN, macvlan, or anything else that
> could affect the selection of receive socket would also need to be
> taken in to account. The typical driver interface that provides for
> fine grained packet matching is n-tuple filtering.

Agreed. But I fear that we have to deal with the hardware at hand.

> To address the immediate problem with the current existing UDP HW
> offloads I suggest:
> 
> 1) Disallow UDP HW offload when forwarding is enabled. Applying
> offloads to UDP encapsulated packets that are only being forwarded is
> not correct.

Strictly speaking, this is the only correct way to do it right now. A
forwarding host can apply the wrong optimizations to received UDP
packets if a tunnel is being created and the same host is used to act as
a router. Unfortunately this basically kills vxlan offloading for most,
if not all, virtualization cases and probably for most containers, too.

> 2) Open a socket for the encapsulation port in each created ns
> (similar to how RDS does this). This prevents a UDP port being used
> for encapsulation from being bound for other purposes in ns.

I am not sure if this is necessary. The devices actually having the
ndo-ops, used to configure offloading, should only be visible inside one
namespace. If those ndo-ops actually have side effects on other
namespaces, they violate the contract and should be removed from the
driver. :/

I would 

Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-20 Thread Tom Herbert
> And finally, he added a big comment explaining that new tunnel types
> should not be added to the tunnel type list, and those that exist
> should only be used for RX.
>
> Therefore, this isn't opening the door for new random offloads, quite
> the contrary.  Instead, it is making clearer what the existing
> facilities support, and putting an explicit cap on them.
>
We have no reason to believe there won't be more offloaded UDP
protocols. There are plenty of other UDP encapsulations, transport
protocols, and application layer protocols that could be offloaded and
some of these may provide protocol specific features that have no
generic analogue. The problem is that this interface is wrong.

The interface and current NIC implementations are based on the
assumption that the UDP port number is sufficient to decode a UDP payload.
This is incorrect: UDP port numbers do not have global meaning; they
can only be correctly interpreted by the endpoints in the
communication (RFC 7605).  Just because port number 4789 is assigned to
VXLAN does not mean that any packet a device sees with destination
port 4789 is VXLAN; it may very well be something completely
different. This opens up the very real possibility that devices in the
network, including NICs, misinterpret packets because they are looking
only at the port number. Furthermore, if a device modifies packets based
on just the port number, say as in GRO, this opens the door to silent
data corruption. This will be a real problem for a router that is
trying to apply UDP offloads to packets that are just being forwarded,
but even on just a host this can be an issue when using namespaces with
ipvlan or macvlan.

The design of a UDP offload interface should be predicated on the fact
that we are offloading the receive path of a UDP socket. It follows
that the requirement of the offload device is to match packets that
will be received by the UDP socket. Generally, this means it needs to
at least match by local addresses and port for an unconnected/unbound
socket, the source address for an unconnected/bound socket, and the full
4-tuple for a connected socket. VLAN, macvlan, or anything else that
could affect the selection of receive socket would also need to be
taken in to account. The typical driver interface that provides for
fine grained packet matching is n-tuple filtering.
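
A hypothetical n-tuple key mirroring those matching rules, with which
fields are exact versus wildcarded depending on the socket's state:

#include <linux/types.h>
#include <linux/in6.h>

/* Not an existing kernel structure; zeroed fields mean "wildcard". */
struct udp_offload_ntuple {
	struct in6_addr	laddr;		/* local address; zero if wildcard bound */
	struct in6_addr	raddr;		/* remote address; set only if connected */
	__be16		lport;		/* always set: the socket's bound port */
	__be16		rport;		/* set only if connected */
	u16		vlan_id;	/* anything that influences socket selection */
	int		ifindex;	/* stands in for macvlan/ipvlan/netns */
};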

To address the immediate problem with the current existing UDP HW
offloads I suggest:

1) Disallow UDP HW offload when forwarding is enabled. Applying
offloads to UDP encapsulated packets that are only being forwarded is
not correct.
2) Open a socket for the encapsulation port in each created ns
(similar to how RDS does this). This prevents a UDP port being used
for encapsulation from being bound for other purposes in ns.
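
A sketch of (2), borrowing the pernet init/exit pattern and the kernel's
UDP tunnel socket helper; VXLAN's default port is assumed here, and
storing/releasing the per-net socket is elided:

#include <net/net_namespace.h>
#include <net/udp_tunnel.h>

static int encap_port_net_init(struct net *net)
{
	struct udp_port_cfg cfg = {
		.family         = AF_INET,
		.local_udp_port = htons(4789),	/* IANA VXLAN port, as an example */
	};
	struct socket *sock;

	/* wildcard-bind a kernel socket so nothing else in this ns can
	 * claim the port
	 */
	return udp_sock_create(net, &cfg, &sock);
}

static struct pernet_operations encap_port_net_ops = {
	.init = encap_port_net_init,
	/* .exit would release the per-net socket again */
};
/* registered once at module init with register_pernet_subsys() */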

btw, we also have this problem in the stack. GRO for UDP encapsulation
is performed before we have verified the packet is for us ("us" being
the encapsulation socket in the default name space).  The solution
here I believe is to move UDP GRO to be in udp_receive, although if
the above are implemented that would also address the issue in SW GRO
at least for the short term. flow_dissector does not attempt to crack
open UDP encapsulation, so we don't have the issue there.

Thanks,
Tom


Re: [net-next PATCH v3 00/17] Future-proof tunnel offload handlers

2016-06-17 Thread David Miller
From: Alexander Duyck 
Date: Thu, 16 Jun 2016 12:20:35 -0700

> This patch is meant to address two things.  First, we are currently using
> the ndo_add/del_vxlan_port calls with VXLAN-GPE tunnels and we cannot
> really support that as it is likely to cause more harm than good since
> VXLAN-GPE can support tunnels without a MAC address on the inner header.
> 
> As such we need to add a new offload to advertise this, but in doing so it
> would mean introducing 3 new functions for the driver to request the ports,
> and then for the tunnel to push the changes to add and delete the ports to
> the device.  However instead of taking that approach I think it would be
> much better if we just made one common function for fetching the ports, and
> provided a generic means to push the tunnels to the device.  So in order to
> make this work this patch set does several things.
 ...

Series applied, thanks Alexander.

Tom, I've heard your arguments, but I think your fears are unfounded.
Look at what Alexander's patches are actually doing.

First, he's fixing a bug.  Hardware that supports VXLAN offloading
doesn't support VXLAN-GPE, yet we were sending such things to the
VXLAN offload paths.

Second, he's eliminating a metric ton of Kconfig garbage, as drivers
had to have all kinds of contrived dependencies to support UDP tunnel
offloads.

Third, he's consolidating several driver NDO methods into just two.

And finally, he added a big comment explaining that new tunnel types
should not be added to the tunnel type list, and those that exist
should only be used for RX.

Therefore, this isn't opening the door for new random offloads, quite
the contrary.  Instead, it is making clearer what the existing
facilities support, and putting an explicit cap on them.

Thanks.