Hi all,
(For those of you who read openstack-dev@: you may notice some duplication
between this email and the related thread:
http://lists.openstack.org/pipermail/openstack-dev/2016-June/097189.html If
that’s the case, sorry!)
tl;dr: many Open vSwitch based SDN controllers plug devices that are meant to
have different MTUs into the same ‘integration’ bridge (usually called br-int),
which sometimes makes the MTU settings for those devices ineffective. The
Neutron team seeks guidance from Open vSwitch folks on how to proceed.
First, I’d like to note that when speaking about ‘Neutron’ below, I implicitly
mean the ‘Neutron ML2/Open vSwitch reference implementation’. That said, I
believe the same issues affect other SDN solutions (OVN? Dragonflow?) built
on top of Open vSwitch that use a single integration bridge.
Now, let’s try to scope the problem. Neutron consistently uses a single bridge
to plug all devices managed by a node. Those devices may belong to the same
layer 2 domain (‘network’ in neutron-speak) or to different layer 2
domains. Those domains may be implemented using different encapsulation
technologies, which in the Neutron ML2 plugin case results in different MTU
values being calculated for those networks. All devices that belong to
a single network are supposed to use the network MTU. Those include virtualized
interfaces inside VMs, as well as devices on the data path from VMs to the
integration bridge. Meaning, for a typical Neutron Open vSwitch setup, the
following devices are meant to carry the network MTU:
VM interface - tap device - ‘hybrid’ Linux bridge* - VETH pair => plugged into
br-int.
(* used for iptables based firewall)
Now, the Neutron (OpenStack Networking) and Nova (OpenStack Compute) components
set the relevant MTUs on all of those devices (except the VM interface, which
is usually configured by the guest OS itself, based on information provided
through DHCP/RA responses or other means).
It all works as long as all devices we plug into br-int belong to networks with
identical MTUs. But since Neutron allows for different MTUs, the assumption
does not hold.
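To make the differing MTUs concrete, here is a minimal sketch of how a
per-network MTU falls out of the encapsulation overhead. The 1500-byte underlay
MTU and the overhead constants (IPv4 underlay: 50 bytes for VXLAN, 42 for GRE)
are assumptions chosen to match the 1450 and 1458 values discussed later in
this email; the function name is hypothetical and this is not Neutron's code.

```python
# Assumed encapsulation overheads in bytes for an IPv4 underlay.
# Illustrative constants only; they are not read from any Neutron config.
ENCAP_OVERHEAD = {
    "flat": 0,
    "gre": 42,
    "vxlan": 50,
}


def network_mtu(network_type, underlay_mtu=1500):
    """Return the largest MTU a network of this type can offer to its ports."""
    return underlay_mtu - ENCAP_OVERHEAD[network_type]


for net_type in ("vxlan", "gre", "flat"):
    print(net_type, network_mtu(net_type))
```

With these assumptions, a VXLAN network gets 1450 and a GRE network gets 1458,
so two networks on the same node can legitimately want different MTUs.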
While Neutron does plug devices that belong to different broadcast domains
into the same switch, it does not intend to allow traffic that belongs to
different domains to be switched. (All inter-domain communication is handled by
virtual routers that are implemented as network namespaces.) Isolation is
achieved through local VLAN tagging. Quoting:
"All VM VIFs are plugged into the integration bridge. VM VIFs on a given
virtual network share a common “local” VLAN (i.e. not propagated externally).
The VLAN id of this local VLAN is mapped to the physical networking details
realizing that virtual network."
http://docs.openstack.org/developer/neutron/devref/openvswitch_agent.html#bridge-management
What it means is that while devices are plugged into the same bridge, due to
the additional layer of isolation, Neutron effectively uses a single bridge as
a set of switches, one per network participating in the bridge setup.
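The 'one bridge, many logical switches' idea can be sketched as a toy model.
This is only an illustration of the isolation rule described above, not how
the Open vSwitch agent is implemented; the class, the port names, and the VLAN
ids are all hypothetical.

```python
from collections import defaultdict


class IntegrationBridge:
    """Toy model of br-int: every port carries the local VLAN of its network."""

    def __init__(self):
        self.local_vlan = {}  # port name -> local VLAN id

    def plug(self, port, vlan):
        self.local_vlan[port] = vlan

    def can_switch(self, src, dst):
        # L2 forwarding is allowed only between ports sharing a local VLAN.
        return self.local_vlan[src] == self.local_vlan[dst]

    def logical_switches(self):
        # Group ports by local VLAN: each group behaves like its own switch.
        groups = defaultdict(list)
        for port, vlan in self.local_vlan.items():
            groups[vlan].append(port)
        return dict(groups)


br_int = IntegrationBridge()
br_int.plug("tap-vm1", 1)  # hypothetical port on network A
br_int.plug("tap-vm2", 1)  # hypothetical port on network A
br_int.plug("tap-vm3", 2)  # hypothetical port on network B
print(br_int.can_switch("tap-vm1", "tap-vm2"))
print(br_int.can_switch("tap-vm1", "tap-vm3"))
print(br_int.logical_switches())
```

Ports on the same local VLAN can reach each other at L2; ports on different
local VLANs cannot, even though they sit on the same bridge.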
So back to MTU. When I boot a VM using a VXLAN backed network, the tap device
with MTU=1450 is plugged into the br-int bridge, which lowers the bridge MTU to
1450. Then when I plug a device that belongs to a GRE network (MTU=1458) into
that same bridge, the GRE network backed device also gets its MTU reduced to
1450, and no ‘ip link’ command can raise it to the intended MTU=1458.
Curiously, when I move the latter device into a network namespace and try to
set the MTU on that same device, it works. (Jiri Benc told me that a missing
validation in the vswitchd code is what allows this.) We actually exploited
that in a Neutron fix to make router devices (which live in a namespace) get
their intended MTU values: https://review.openstack.org/#/c/327651/ where we
now first move the device into a namespace, and only then set its MTU.
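The clamping behaviour described above can be modelled in a few lines. This is
a toy model of the observed ‘least of all device MTUs’ effect, assuming a
1500-byte default, and not the actual kernel or vswitchd logic; the class and
the port names are hypothetical.

```python
class Bridge:
    """Toy model: the bridge MTU is the minimum MTU requested by any port."""

    def __init__(self, default_mtu=1500):
        self.default_mtu = default_mtu
        self.requested = {}  # port name -> MTU the controller asked for

    def plug(self, port, mtu):
        self.requested[port] = mtu

    def set_mtu(self, port, mtu):
        # Models an 'ip link set ... mtu' attempt on an already plugged port.
        self.requested[port] = mtu
        return self.effective_mtu(port)

    @property
    def mtu(self):
        return min(self.requested.values(), default=self.default_mtu)

    def effective_mtu(self, port):
        # A port can never exceed the bridge-wide minimum in this model.
        return min(self.requested[port], self.mtu)


br_int = Bridge()
br_int.plug("tap-vxlan", 1450)  # hypothetical VXLAN network device
br_int.plug("tap-gre", 1458)    # hypothetical GRE network device
print(br_int.mtu)
print(br_int.effective_mtu("tap-gre"))
print(br_int.set_mtu("tap-gre", 1458))
```

In this model the GRE device stays at 1450 however often its MTU is set to
1458, because the VXLAN device pins the bridge-wide minimum.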
There are several issues with the Neutron patch. First, it relies on a bug in
Open vSwitch. Second, it does not solve the problem for other devices that are
plugged into br-int and don’t live in separate namespaces (which is the case
for all VM VIFs in OpenStack).
One idea that Jiri Benc mentioned to me is to reimplement the Neutron bridge
setup to use multiple bridges, one per network. That way, there would be no
need to have devices with different MTUs on the same integration bridge.
Isolation between domains would also be simplified, because we would no longer
need to maintain any local VLAN tagging rules to isolate domains from each
other; isolation would happen naturally, since all connection paths between
domains would have an L3 layer (namespace) on their way.
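A rough comparison of the two layouts, under the same ‘bridge MTU is the
minimum of its ports’ assumption as in the scenario above. The function, the
bridge naming scheme, and the port and network names are hypothetical and only
serve the illustration.

```python
from collections import defaultdict


def bridge_mtus(ports, per_network):
    """Map each bridge to its resulting MTU.

    ports is an iterable of (port, network, mtu) tuples. With
    per_network=False everything lands on a single br-int.
    """
    groups = defaultdict(list)
    for _port, network, mtu in ports:
        bridge = f"br-{network}" if per_network else "br-int"
        groups[bridge].append(mtu)
    return {bridge: min(mtus) for bridge, mtus in groups.items()}


ports = [
    ("tap-vm1", "vxlan-net", 1450),
    ("tap-vm2", "gre-net", 1458),
]
print(bridge_mtus(ports, per_network=False))  # {'br-int': 1450}
print(bridge_mtus(ports, per_network=True))   # {'br-vxlan-net': 1450, 'br-gre-net': 1458}
```

With one bridge per network, the minimum is only taken over devices that share
an MTU anyway, so the GRE network keeps its 1458.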
If we were starting from scratch, this would probably be the best approach,
with few drawbacks. Sadly, we are looking at a huge number of setups that rely
on a single bridge for multiple domains, and as I said before, it’s not just
Neutron. Migrating those existing workloads to a new, better bridge setup would
be a huge pain, and I am not even sure whether it’s possible to do so
without a full migration of workloads to other nodes. That’s a huge engineering
effort, and something that would need to happen in all affected SDN solutions.
One alternative could be for the kernel/vSwitch layer to allow relaxing the
‘least of all device MTUs’ rule for setups that explicitly ask for it.
If such an option were available to SDN controllers, they could use it to keep
their existing single bridge setup.
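As a sketch of what such an opt-in could mean semantically: the
`relax_min_rule` parameter below is purely hypothetical and does not
correspond to any existing kernel or Open vSwitch option; it only illustrates
the behaviour being asked for.

```python
def effective_mtu(port_mtu, all_port_mtus, relax_min_rule=False):
    """Return the MTU a port ends up with on a shared bridge.

    relax_min_rule is a hypothetical opt-in: when set, the port keeps its
    own MTU instead of being clamped to the bridge-wide minimum.
    """
    if relax_min_rule:
        return port_mtu
    return min(all_port_mtus)


mtus = [1450, 1458]  # VXLAN and GRE devices on the same br-int
print(effective_mtu(1458, mtus))                       # 1450
print(effective_mtu(1458, mtus, relax_min_rule=True))  # 1458
```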
And that’s the end of the story. So, what do you think of the problem? Is
the proposed alternative viable? If so, what’s the proper place for such
configuration to live: the kernel or OVS?
I would be glad to find a solution that is acceptable to both the Neutron and
Open vSwitch communities, and that we can both support in the long run.
Cheers,
Ihar