I'd like to chime in and add that this is something that we've been missing in our Juju and MAAS deployment of OpenStack.
One of our problem areas, for example, is properly routing management, storage and public traffic for our OpenStack deployment without complex static rules and a lot of annoying workarounds. I hope that both the Juju and MAAS teams take a proper look at supporting these use cases.

-- 
https://bugs.launchpad.net/bugs/1737428

Title:
  VRF support to solve routing problems associated with multi-homing

Status in juju:
  Incomplete
Status in MAAS:
  Incomplete
Status in linux package in Ubuntu:
  Incomplete

Bug description:
  Problem description:

  * a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces; natural to OpenStack regardless of IPv4/IPv6, and to IPv6 in general); see 3.3.4 Local Multihoming (https://tools.ietf.org/html/rfc1122#page-60) and 3.3.4.2 Multihoming Requirements;
  * if all hosts that need to participate in L3 communication are located on the same L2 network there is no need for a routing device to be present.
    ARP/NDP and auto-created directly connected routes are enough;
  * multi-homing with hosts located on different L2 networks requires more intelligent routing:
    - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space;
    - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to an L3 ToR switch but may go via different bonds);
    - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic, although both interfaces are connected to the same ToR);
    - there is no longer a single "default gateway", as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y). Leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space;
    - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention the routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes; for multi-L2, no solution is provided or discussed;
  * existing solutions to multi-homing require routing table management on a given host: complex static routing rules or dynamic routing (e.g. running an OSPF or BGP daemon on a host);
  * using static routes is rigid and requires network planning (i.e. working with network engineers who may have varying degrees of experience, doing VLSM planning etc.);
  * using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations, and it is a security and operational burden to integrate with a company's routing infrastructure.

  Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway, and to make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing, and with minimal or no modifications to the applications themselves.

  Goals:

  * avoid turning individual hosts into routers;
  * avoid complex static rules;
  * better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure);
  * reduce operational complexity (custom L3 infrastructure integration for each deployment);
  * reduce delivery risks (L3 infrastructure, L3 department responsiveness varies);
  * avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons.

  NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend reading this post to understand the suggestions below.

  How to solve it?

  What does it mean for Juju to support VRF devices?
  * enslave certain devices on provisioning based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similarly to bridges but work differently (on L3, not L2);
  * the above is per network namespace, so it will work equally well in an LXD container;

  Conceptually:

  # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # sysctl -p
  # # create additional routing tables
  # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
  1 mgmt
  10 pub
  20 storacc
  30 storrepl
  EOF
  # # add and bring up VRF devices
  # ip link add mgmt type vrf table 1 && ip link set dev mgmt up
  # ip link add pub type vrf table 10 && ip link set dev pub up
  # ip link add storacc type vrf table 20 && ip link set dev storacc up
  # ip link add storrepl type vrf table 30 && ip link set dev storrepl up
  # # enslave actual devices to VRF devices
  # ip link set mgmtbr0 master mgmt
  # ip link set pubbr0 master pub
  # ip link set storaccbr0 master storacc
  # ip link set storreplbr0 master storrepl
  # # populate per-routing table default gateways; done after enslaving because
  # # routes that depend on a device are dropped when it is enslaved to a VRF
  # ip route add default via 192.168.0.1 table mgmt
  # ip route add default via 172.16.0.1 table pub
  # ip route add default via 10.10.4.1 table storacc
  # ip route add default via 10.10.5.1 table storrepl
  # # make your services use INADDR_ANY for listening sockets in charms if not done already (use 0.0.0.0)

  charm-related:

  * (no-op) services with listening sockets on INADDR_ANY will not need any modifications, either on the charm side or at the application level - this is the cheapest way to solve multi-homing problems;
  * (later) more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires `ip vrf exec` functionality in iproute2 or application modifications.

  Notes:

  * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments.
    Juju deploy "router" is a different scenario, which should reside on a model separate from IAAS;
  * We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc.
  * Linux kernel functionality was mostly upstreamed in 4.4;
  * This is Linux kernel only, while a unit agent can run on Windows too (nothing we can do here).

  Implementation description:

  1. Kernel 4.4 (GA xenial)

  * CONFIG_NET_VRF=m - present in xenial GA kernels
    http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
  * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels
    http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109

  backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY:
  6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
  63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)

  only `ip vrf exec` related - NOT required for baseline functionality:
  * http://man7.org/linux/man-pages/man8/ip-vrf.8.html
    CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge)

  2. User space (iproute2)

  iproute2 supports the vrf keyword in the version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions:
  https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0

  git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
  v4.10.0
  v4.11.0
  ...

  3. MAAS - already hands over per-subnet default gateways
  https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
  https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378

  4. Juju and/or MAAS:

  * create per-network-space routing tables (default gateways must be taken from subnets in MAAS - subnets related to the same space will have different default gateways);
  * create VRF devices relevant to network spaces;
  * enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers).

  5. Nothing for baseline functionality other than configuring software to use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.

  (future work) configure software to use `ip vrf exec` even if it doesn't support VRFs directly when INADDR_ANY is not used.

  See https://www.kernel.org/doc/Documentation/networking/vrf.txt; note that the setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application):

  "Applications that are to work within a VRF need to bind their socket to the VRF device:

      setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);

  or to specify the output device using cmsg and IP_PKTINFO.

  TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across ***all VRF domains*** by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:

      sysctl -w net.ipv4.tcp_l3mdev_accept=1
      sysctl -w net.ipv4.udp_l3mdev_accept=1"

  http://man7.org/linux/man-pages/man8/ip-vrf.8.html

  "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association ***inherited parent to child***."
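
To make step 4 a bit more concrete, here is a rough dry-run sketch of how the per-space setup could be derived from a list of spaces. It only prints the `ip` commands and does not execute anything. The `vrf_plan` function name and the "space table-id gateway device" input format are made up for illustration; the sample values are the ones from the conceptual example in the bug description, and nothing here reflects how Juju or MAAS actually represent this data.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the per-space VRF setup commands.
# Input on stdin, one space per line (hypothetical format):
#   <space> <table-id> <gateway> <device>
vrf_plan() {
    while read -r space table gw dev; do
        # skip blank lines and comment lines
        case "$space" in ''|'#'*) continue ;; esac
        # create the VRF device bound to its routing table and bring it up
        echo "ip link add $space type vrf table $table"
        echo "ip link set dev $space up"
        # enslave the real device (NIC, bond, VLAN or bridge) to the VRF
        echo "ip link set dev $dev master $space"
        # add the default route last, since routes depending on a device
        # are dropped when that device is enslaved
        echo "ip route add default via $gw table $table"
    done
}

# sample values taken from the conceptual example in the bug description
vrf_plan <<'EOF'
mgmt 1 192.168.0.1 mgmtbr0
pub 10 172.16.0.1 pubbr0
storacc 20 10.10.4.1 storaccbr0
storrepl 30 10.10.5.1 storreplbr0
EOF
```

The numeric table id is used in the route command so that the sketch does not depend on the names in /etc/iproute2/rt_tables.d. After reviewing the printed plan, the output could be fed to a root shell, but that step is left out on purpose.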
  References:
  https://en.wikipedia.org/wiki/Multihoming
  http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
  http://blog.ipspace.net/2010/09/ribs-and-fibs.html
  https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read
  https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
  http://netdevconf.org/1.2/session.html?david-ahern-talk
  https://www.kernel.org/doc/Documentation/networking/vrf.txt
  https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29
  http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
  https://tools.ietf.org/html/rfc7938
  http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)