[Bug 1737428] Re: VRF support to solve routing problems associated with multi-homing

Dmitrii Shcherbakov Sun, 10 Dec 2017 13:11:27 -0800

** Description changed:

  Problem description:
  
  * a host is multi-homed if it has multiple network interfaces with L3
  addresses configured (physical or virtual interfaces, natural to
  OpenStack regardless of IPv4/IPv6 and IPv6 in general);
  
  * if all hosts that need to participate in L3 communication are located
  on the same L2 network there is no need for a routing device to be
  present. ARP/NDP and auto-created directly connected routes are enough;
  
  * multi-homing with hosts located on different L2 networks requires more 
intelligent routing:
-   - "directly connected" routes are no longer enough to talk to all relevant 
hosts in the same network space;
-   - a default gateway in the main routing table may not be the correct 
routing device that knows where to forward traffic (management network traffic 
goes to a management switch and router, other traffic goes to L3 ToR switch but 
may go via different bonds);
-   - even if a default gateway knows where to forward traffic, it may not be 
the intended physical path (storage replication traffic must go through a 
specific outgoing interface, not the same interface as storage access traffic 
although both interfaces are connected to the same ToR);
-   - there is no longer a single "default gateway" as applications need either 
per-logical-direction routers or to become routers themselves (if destination 
== X, forward to next-hop Y). Leaf-spine architecture is a good example of how 
multiple L2 networks force you to use spaces that have VLANs in different 
switch fabrics => one or more hops between hosts with interfaces associated 
with the same network space;
-   - while network spaces implicitly require L3 reachability between each host 
that has a NIC associated with a network space, the current definition does not 
mention routing infrastructure required for that. For a single L2 this problem 
is hidden by directly connected routes, for multi-L2, no solution is provided 
or discussed;
+   - "directly connected" routes are no longer enough to talk to all relevant 
hosts in the same network space;
+   - a default gateway in the main routing table may not be the correct 
routing device that knows where to forward traffic (management network traffic 
goes to a management switch and router, other traffic goes to L3 ToR switch but 
may go via different bonds);
+   - even if a default gateway knows where to forward traffic, it may not be 
the intended physical path (storage replication traffic must go through a 
specific outgoing interface, not the same interface as storage access traffic 
although both interfaces are connected to the same ToR);
+   - there is no longer a single "default gateway" as applications need either 
per-logical-direction routers or to become routers themselves (if destination 
== X, forward to next-hop Y). Leaf-spine architecture is a good example of how 
multiple L2 networks force you to use spaces that have VLANs in different 
switch fabrics => one or more hops between hosts with interfaces associated 
with the same network space;
+   - while network spaces implicitly require L3 reachability between each host 
that has a NIC associated with a network space, the current definition does not 
mention routing infrastructure required for that. For a single L2 this problem 
is hidden by directly connected routes, for multi-L2, no solution is provided 
or discussed;
  
  * existing solutions to multi-homing require routing table management on
  a given host: complex static routing rules, dynamic routing (e.g.
  running an OSPF or BGP daemon on a host);
  
  * using static routes is rigid and requires network planning (i.e.
  working with network engineers who may have varying degrees of
  experience, doing VLSM planning etc.);
  
  * using dynamic routing requires a broader integration into an
  organization's L3 network infrastructure. Routing can be implemented
  differently across different organizations and it is a security and
  operational burden to integrate with a company's routing infrastructure.
  
  Summary: a mechanism is needed to associate an interface with a
  forwarding table (FIB) which has its own default gateway and make an
  application with a listen(2)ing socket(2) return connected sockets
  associated with different FIBs. In other words, applications need to
  implicitly get source/destination-based routing capabilities without the
  need to use static routing schemes or dynamic routing and with minimum
  or no modifications to the applications themselves.
  
  Goals:
  
  * avoid turning individual hosts into routers;
  * avoid complex static rules;
  * better support multi-fabric deployments with minimum effort (Juju, charms, 
MAAS, applications, network infrastructure);
  * reduce operational complexity (custom L3 infrastructure integration for 
each deployment);
  * reduce delivery risks (L3 infrastructure, L3 department responsiveness 
varies);
  * avoid any form of L2 stretching at the infrastructure level - this is 
inefficient for various reasons.
  
  NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend
  reading this post to understand the suggestions below.
  
  How to solve it?
  
  What does it mean for Juju to support VRF devices?
  
  * enslave certain devices on provisioning based on network space information 
(physical NICs, VLAN devices, bonds AND bridges created for containers must be 
considered) - VRF devices logically enslave devices similar to bridges but work 
differently (on L3, not L2);
  * the above is per network namespace so it will work equally well in a LXD 
container;
  
  Conceptually:
  
  # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # sysctl -p
  
+ # # create additional routing tables
+ # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
+ 1  mgmt
+ 10 pub
+ 20 storacc
+ 30 storrepl
+ EOF
+ 
+ # # populate per-routing table default gateways
+ # ip route add default via 192.168.0.1 table mgmt
+ # ip route add default via 172.16.0.1 table pub
+ # ip route add default via 10.10.4.1 table storacc
+ # ip route add default via 10.10.5.1 table storrepl
+ 
+ # # add and bring up VRF devices
  # ip link add mgmt type vrf table 1 && ip link set dev mgmt up
  # ip link add pub type vrf table 2 && ip link set dev pub up
- 
- # ip link set mgmtbr0 master management
- # ip link set pubbr0 master public
+ # ip link add storacc type vrf table 1 && ip link set dev mgmt up
+ # ip link add storrepl type vrf table 2 && ip link set dev pub up
+ 
+ # # enslave actual devices to VRF devices
+ # ip link set mgmtbr0 master mgmt
+ # ip link set pubbr0 master pub
+ # ip link set storaccbr0 master storacc
+ # ip link set storreplbr0 master storrepl
  
  # make your services use INADDR_ANY for listening sockets in charms if
  not done already (use 0.0.0.0)
  
  charm-related:
  
  * (no-op) services with listening sockets on INADDR_ANY will not need
  any modifications either on the charm side or at the application level -
  this is the cheapest way to solve multi-homing problems;
  
  * (later) a more advanced functionality for applications that do not use
  INADDR_ANY but bind a listening socket to a specific address - this
  requires `ip vrf exec` functionality in iproute2 or application
  modifications.
  
  Notes:
  
  * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move 
routing problems to L3 departments. Juju deploy "router" is a different 
scenario which should reside on a model separate from IAAS;
  * We are not turning hosts into routers with this - this is a way to move 
routing decisions to the next hop which is available on a directly connected 
route. The problem we are solving here is N next hops instead of just one. 
Those hops can worry about administrative distance/different routing protocols, 
route costs/metrics, routing protocol peer authentication etc.
  * Linux kernel functionality was mostly upstreamed in 4.4;
  * Linux kernel only while a unit agent can run on Windows too (nothing we can 
do here).
  
  Implementation description:
  
  1. Kernel
  
  4.4 (GA xenial)
  
  * CONFIG_NET_VRF=m - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
  
  * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109
  
  backports needed from 4.5 - required for VRF-unaware applications that
  use INADDR_ANY:
  
  6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
  63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)
- 
  
  only `ip vrf exec` related - NOT required for baseline functionality:
  
  * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and
  CGROUP_BPF enabled - xenial HWE only (not HWE-edge)
  
  2. User space (iproute2)
  
  iproute2 supports the vrf keyword in a version packaged with Ubuntu
  16.04.
  
  More specific functionality like `ip vrf exec <vrf-name>` is available
  in later versions:
  
  
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0
  git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
  v4.10.0
  v4.11.0
  ...
  
  3. MAAS - already hands over per-subnet default gateways
  
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378
  
  4. Juju and/or MAAS:
  
  * create VRF devices relevant to network spaces;
  * enslave interfaces to VRF devices (this includes Linux bridges created by 
Juju for containers).
  
  5. Nothing for baseline functionality other than configuring software to
  use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.
  
  (future work) configure software to use `ip vrf exec` even if it doesn't
  support VRFs directly when INADDR_ANY is not used.
  
  See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note
  that setsockopt requirement is worked around via `ip vrf exec` in
  iproute2 (no need to rewrite every application):
  
  "Applications that are to work within a VRF need to bind their socket to
  the VRF device:
  
  setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
  
  or to specify the output device using cmsg and IP_PKTINFO.
  
  TCP & UDP services running in the default VRF context (ie., not bound to
  any VRF device) can work across ***all VRF domains*** by enabling the
  tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
  
  sysctl -w net.ipv4.tcp_l3mdev_accept=1
  sysctl -w net.ipv4.udp_l3mdev_accept=1"
  
  http://man7.org/linux/man-pages/man8/ip-vrf.8.html
  "This ip-vrf command is a helper to run a command against a specific VRF with 
the VRF association ***inherited parent to child***."
  
  References:
  
  https://en.wikipedia.org/wiki/Multihoming
  http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
  http://blog.ipspace.net/2010/09/ribs-and-fibs.html
  
  https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read
  
  
https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
  
  http://netdevconf.org/1.2/session.html?david-ahern-talk
  
  https://www.kernel.org/doc/Documentation/networking/vrf.txt
  
  https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-
  Forwarding-%28VRF%29
  
  http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
  https://tools.ietf.org/html/rfc7938
  
  http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage
  example on 16.04)


** Description changed:

  Problem description:
  
  * a host is multi-homed if it has multiple network interfaces with L3
  addresses configured (physical or virtual interfaces, natural to
  OpenStack regardless of IPv4/IPv6 and IPv6 in general);
  
  * if all hosts that need to participate in L3 communication are located
  on the same L2 network there is no need for a routing device to be
  present. ARP/NDP and auto-created directly connected routes are enough;
  
  * multi-homing with hosts located on different L2 networks requires more 
intelligent routing:
    - "directly connected" routes are no longer enough to talk to all relevant 
hosts in the same network space;
    - a default gateway in the main routing table may not be the correct 
routing device that knows where to forward traffic (management network traffic 
goes to a management switch and router, other traffic goes to L3 ToR switch but 
may go via different bonds);
    - even if a default gateway knows where to forward traffic, it may not be 
the intended physical path (storage replication traffic must go through a 
specific outgoing interface, not the same interface as storage access traffic 
although both interfaces are connected to the same ToR);
    - there is no longer a single "default gateway" as applications need either 
per-logical-direction routers or to become routers themselves (if destination 
== X, forward to next-hop Y). Leaf-spine architecture is a good example of how 
multiple L2 networks force you to use spaces that have VLANs in different 
switch fabrics => one or more hops between hosts with interfaces associated 
with the same network space;
    - while network spaces implicitly require L3 reachability between each host 
that has a NIC associated with a network space, the current definition does not 
mention routing infrastructure required for that. For a single L2 this problem 
is hidden by directly connected routes, for multi-L2, no solution is provided 
or discussed;
  
  * existing solutions to multi-homing require routing table management on
  a given host: complex static routing rules, dynamic routing (e.g.
  running an OSPF or BGP daemon on a host);
  
  * using static routes is rigid and requires network planning (i.e.
  working with network engineers who may have varying degrees of
  experience, doing VLSM planning etc.);
  
  * using dynamic routing requires a broader integration into an
  organization's L3 network infrastructure. Routing can be implemented
  differently across different organizations and it is a security and
  operational burden to integrate with a company's routing infrastructure.
  
  Summary: a mechanism is needed to associate an interface with a
  forwarding table (FIB) which has its own default gateway and make an
  application with a listen(2)ing socket(2) return connected sockets
  associated with different FIBs. In other words, applications need to
  implicitly get source/destination-based routing capabilities without the
  need to use static routing schemes or dynamic routing and with minimum
  or no modifications to the applications themselves.
  
  Goals:
  
  * avoid turning individual hosts into routers;
  * avoid complex static rules;
  * better support multi-fabric deployments with minimum effort (Juju, charms, 
MAAS, applications, network infrastructure);
  * reduce operational complexity (custom L3 infrastructure integration for 
each deployment);
  * reduce delivery risks (L3 infrastructure, L3 department responsiveness 
varies);
  * avoid any form of L2 stretching at the infrastructure level - this is 
inefficient for various reasons.
  
  NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend
  reading this post to understand the suggestions below.
  
  How to solve it?
  
  What does it mean for Juju to support VRF devices?
  
  * enslave certain devices on provisioning based on network space information 
(physical NICs, VLAN devices, bonds AND bridges created for containers must be 
considered) - VRF devices logically enslave devices similar to bridges but work 
differently (on L3, not L2);
  * the above is per network namespace so it will work equally well in a LXD 
container;
  
  Conceptually:
  
  # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # sysctl -p
  
  # # create additional routing tables
  # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
  1  mgmt
  10 pub
  20 storacc
  30 storrepl
  EOF
  
  # # populate per-routing table default gateways
  # ip route add default via 192.168.0.1 table mgmt
  # ip route add default via 172.16.0.1 table pub
  # ip route add default via 10.10.4.1 table storacc
  # ip route add default via 10.10.5.1 table storrepl
  
  # # add and bring up VRF devices
  # ip link add mgmt type vrf table 1 && ip link set dev mgmt up
  # ip link add pub type vrf table 2 && ip link set dev pub up
  # ip link add storacc type vrf table 1 && ip link set dev mgmt up
  # ip link add storrepl type vrf table 2 && ip link set dev pub up
  
  # # enslave actual devices to VRF devices
  # ip link set mgmtbr0 master mgmt
  # ip link set pubbr0 master pub
  # ip link set storaccbr0 master storacc
  # ip link set storreplbr0 master storrepl
  
  # make your services use INADDR_ANY for listening sockets in charms if
  not done already (use 0.0.0.0)
  
  charm-related:
  
  * (no-op) services with listening sockets on INADDR_ANY will not need
  any modifications either on the charm side or at the application level -
  this is the cheapest way to solve multi-homing problems;
  
  * (later) a more advanced functionality for applications that do not use
  INADDR_ANY but bind a listening socket to a specific address - this
  requires `ip vrf exec` functionality in iproute2 or application
  modifications.
  
  Notes:
  
  * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move 
routing problems to L3 departments. Juju deploy "router" is a different 
scenario which should reside on a model separate from IAAS;
  * We are not turning hosts into routers with this - this is a way to move 
routing decisions to the next hop which is available on a directly connected 
route. The problem we are solving here is N next hops instead of just one. 
Those hops can worry about administrative distance/different routing protocols, 
route costs/metrics, routing protocol peer authentication etc.
  * Linux kernel functionality was mostly upstreamed in 4.4;
  * Linux kernel only while a unit agent can run on Windows too (nothing we can 
do here).
  
  Implementation description:
  
  1. Kernel
  
  4.4 (GA xenial)
  
  * CONFIG_NET_VRF=m - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
  
  * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109
  
  backports needed from 4.5 - required for VRF-unaware applications that
  use INADDR_ANY:
  
  6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
  63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)
  
  only `ip vrf exec` related - NOT required for baseline functionality:
  
  * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and
  CGROUP_BPF enabled - xenial HWE only (not HWE-edge)
  
  2. User space (iproute2)
  
  iproute2 supports the vrf keyword in a version packaged with Ubuntu
  16.04.
  
  More specific functionality like `ip vrf exec <vrf-name>` is available
  in later versions:
  
  
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0
  git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
  v4.10.0
  v4.11.0
  ...
  
  3. MAAS - already hands over per-subnet default gateways
  
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378
  
  4. Juju and/or MAAS:
  
+ * create per-network-space routing tables (default gateways must be taken 
from subnets in MAAS - subnets related to the same space will have different 
default gateways)
  * create VRF devices relevant to network spaces;
  * enslave interfaces to VRF devices (this includes Linux bridges created by 
Juju for containers).
  
  5. Nothing for baseline functionality other than configuring software to
  use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.
  
  (future work) configure software to use `ip vrf exec` even if it doesn't
  support VRFs directly when INADDR_ANY is not used.
  
  See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note
  that setsockopt requirement is worked around via `ip vrf exec` in
  iproute2 (no need to rewrite every application):
  
  "Applications that are to work within a VRF need to bind their socket to
  the VRF device:
  
  setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
  
  or to specify the output device using cmsg and IP_PKTINFO.
  
  TCP & UDP services running in the default VRF context (ie., not bound to
  any VRF device) can work across ***all VRF domains*** by enabling the
  tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
  
  sysctl -w net.ipv4.tcp_l3mdev_accept=1
  sysctl -w net.ipv4.udp_l3mdev_accept=1"
  
  http://man7.org/linux/man-pages/man8/ip-vrf.8.html
  "This ip-vrf command is a helper to run a command against a specific VRF with 
the VRF association ***inherited parent to child***."
  
  References:
  
  https://en.wikipedia.org/wiki/Multihoming
  http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
  http://blog.ipspace.net/2010/09/ribs-and-fibs.html
  
  https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read
  
  
https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
  
  http://netdevconf.org/1.2/session.html?david-ahern-talk
  
  https://www.kernel.org/doc/Documentation/networking/vrf.txt
  
  https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-
  Forwarding-%28VRF%29
  
  http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
  https://tools.ietf.org/html/rfc7938
  
  http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage
  example on 16.04)

** Description changed:

  Problem description:
  
  * a host is multi-homed if it has multiple network interfaces with L3
  addresses configured (physical or virtual interfaces, natural to
  OpenStack regardless of IPv4/IPv6 and IPv6 in general);
  
  * if all hosts that need to participate in L3 communication are located
  on the same L2 network there is no need for a routing device to be
  present. ARP/NDP and auto-created directly connected routes are enough;
  
  * multi-homing with hosts located on different L2 networks requires more 
intelligent routing:
    - "directly connected" routes are no longer enough to talk to all relevant 
hosts in the same network space;
    - a default gateway in the main routing table may not be the correct 
routing device that knows where to forward traffic (management network traffic 
goes to a management switch and router, other traffic goes to L3 ToR switch but 
may go via different bonds);
    - even if a default gateway knows where to forward traffic, it may not be 
the intended physical path (storage replication traffic must go through a 
specific outgoing interface, not the same interface as storage access traffic 
although both interfaces are connected to the same ToR);
    - there is no longer a single "default gateway" as applications need either 
per-logical-direction routers or to become routers themselves (if destination 
== X, forward to next-hop Y). Leaf-spine architecture is a good example of how 
multiple L2 networks force you to use spaces that have VLANs in different 
switch fabrics => one or more hops between hosts with interfaces associated 
with the same network space;
    - while network spaces implicitly require L3 reachability between each host 
that has a NIC associated with a network space, the current definition does not 
mention routing infrastructure required for that. For a single L2 this problem 
is hidden by directly connected routes, for multi-L2, no solution is provided 
or discussed;
  
  * existing solutions to multi-homing require routing table management on
  a given host: complex static routing rules, dynamic routing (e.g.
  running an OSPF or BGP daemon on a host);
  
  * using static routes is rigid and requires network planning (i.e.
  working with network engineers who may have varying degrees of
  experience, doing VLSM planning etc.);
  
  * using dynamic routing requires a broader integration into an
  organization's L3 network infrastructure. Routing can be implemented
  differently across different organizations and it is a security and
  operational burden to integrate with a company's routing infrastructure.
  
  Summary: a mechanism is needed to associate an interface with a
  forwarding table (FIB) which has its own default gateway and make an
  application with a listen(2)ing socket(2) return connected sockets
  associated with different FIBs. In other words, applications need to
  implicitly get source/destination-based routing capabilities without the
  need to use static routing schemes or dynamic routing and with minimum
  or no modifications to the applications themselves.
  
  Goals:
  
  * avoid turning individual hosts into routers;
  * avoid complex static rules;
  * better support multi-fabric deployments with minimum effort (Juju, charms, 
MAAS, applications, network infrastructure);
  * reduce operational complexity (custom L3 infrastructure integration for 
each deployment);
  * reduce delivery risks (L3 infrastructure, L3 department responsiveness 
varies);
  * avoid any form of L2 stretching at the infrastructure level - this is 
inefficient for various reasons.
  
  NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend
  reading this post to understand the suggestions below.
  
  How to solve it?
  
  What does it mean for Juju to support VRF devices?
  
  * enslave certain devices on provisioning based on network space information 
(physical NICs, VLAN devices, bonds AND bridges created for containers must be 
considered) - VRF devices logically enslave devices similar to bridges but work 
differently (on L3, not L2);
  * the above is per network namespace so it will work equally well in a LXD 
container;
  
  Conceptually:
  
  # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # sysctl -p
  
  # # create additional routing tables
  # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
  1  mgmt
  10 pub
  20 storacc
  30 storrepl
  EOF
  
  # # populate per-routing table default gateways
  # ip route add default via 192.168.0.1 table mgmt
  # ip route add default via 172.16.0.1 table pub
  # ip route add default via 10.10.4.1 table storacc
  # ip route add default via 10.10.5.1 table storrepl
  
  # # add and bring up VRF devices
  # ip link add mgmt type vrf table 1 && ip link set dev mgmt up
- # ip link add pub type vrf table 2 && ip link set dev pub up
- # ip link add storacc type vrf table 1 && ip link set dev mgmt up
- # ip link add storrepl type vrf table 2 && ip link set dev pub up
+ # ip link add pub type vrf table 10 && ip link set dev pub up
+ # ip link add storacc type vrf table 20 && ip link set dev storacc up
+ # ip link add storrepl type vrf table 30 && ip link set dev storrepl up
  
  # # enslave actual devices to VRF devices
  # ip link set mgmtbr0 master mgmt
  # ip link set pubbr0 master pub
  # ip link set storaccbr0 master storacc
  # ip link set storreplbr0 master storrepl
  
  # make your services use INADDR_ANY for listening sockets in charms if
  not done already (use 0.0.0.0)
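
  The conceptual steps above can be sketched as a small dry-run helper that
  only prints the commands for one network space, so that the VRF name, table
  id, gateway and enslaved device cannot drift apart. The function name
  vrf_commands is hypothetical. The ordering (create, bring up, enslave, then
  add the default route) and the use of numeric table ids instead of names
  from rt_tables.d are assumptions, not something Juju or MAAS does today.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the per-space VRF commands.
# Arguments: VRF name, routing table id, default gateway, device to enslave.
vrf_commands() {
    name=$1; table=$2; gateway=$3; dev=$4
    # Create the VRF device bound to its own routing table and bring it up.
    echo "ip link add $name type vrf table $table"
    echo "ip link set dev $name up"
    # Enslave the actual device (bridge, bond, VLAN device or NIC) to the VRF.
    echo "ip link set dev $dev master $name"
    # Install the default gateway in the VRF's table, not in the main table.
    echo "ip route add default via $gateway table $table"
}

# Values taken from the conceptual example in this description.
vrf_commands mgmt     1  192.168.0.1 mgmtbr0
vrf_commands pub      10 172.16.0.1  pubbr0
vrf_commands storacc  20 10.10.4.1   storaccbr0
vrf_commands storrepl 30 10.10.5.1   storreplbr0
```

  Piping the output to a root shell would apply it; printing first makes it
  easy to review what would change on a host.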
  
  charm-related:
  
  * (no-op) services with listening sockets on INADDR_ANY will not need
  any modifications either on the charm side or at the application level -
  this is the cheapest way to solve multi-homing problems;
  
  * (later) a more advanced functionality for applications that do not use
  INADDR_ANY but bind a listening socket to a specific address - this
  requires `ip vrf exec` functionality in iproute2 or application
  modifications.
  
  Notes:
  
  * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move 
routing problems to L3 departments. Juju deploy "router" is a different 
scenario which should reside on a model separate from IAAS;
  * We are not turning hosts into routers with this - this is a way to move 
routing decisions to the next hop which is available on a directly connected 
route. The problem we are solving here is N next hops instead of just one. 
Those hops can worry about administrative distance/different routing protocols, 
route costs/metrics, routing protocol peer authentication etc.
  * Linux kernel functionality was mostly upstreamed in 4.4;
  * Linux kernel only while a unit agent can run on Windows too (nothing we can 
do here).
  
  Implementation description:
  
  1. Kernel
  
  4.4 (GA xenial)
  
  * CONFIG_NET_VRF=m - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
  
  * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109
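
  A quick way to confirm both options on a given host is to look them up in
  the installed kernel config. This is a minimal sketch: the function name
  check_vrf_config is hypothetical, and it assumes the config file is
  available as plain text (on Ubuntu, /boot/config-$(uname -r)).

```shell
#!/bin/sh
# check_vrf_config FILE: report whether the two baseline VRF options are
# built in (=y) or built as a module (=m) in the given kernel config file.
# Returns non-zero if either option is missing.
check_vrf_config() {
    config=$1
    status=0
    for opt in CONFIG_NET_VRF CONFIG_NET_L3_MASTER_DEV; do
        if grep -Eq "^${opt}=(y|m)\$" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
            status=1
        fi
    done
    return $status
}

# Usage (not run here): check_vrf_config "/boot/config-$(uname -r)"
```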
  
  backports needed from 4.5 - required for VRF-unaware applications that
  use INADDR_ANY:
  
  6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
  63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)
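
  Whether a running kernel carries these two changes can be probed by
  checking that the sysctl entries exist. A minimal sketch, assuming procfs
  is mounted in the usual place; the function name and the optional root
  argument (used only to make the check testable) are hypothetical.

```shell
#!/bin/sh
# l3mdev_accept_available [ROOT]: succeed only if both sysctl entries exist.
# ROOT defaults to /proc/sys and can be pointed elsewhere for testing.
l3mdev_accept_available() {
    root=${1:-/proc/sys}
    [ -e "$root/net/ipv4/tcp_l3mdev_accept" ] &&
        [ -e "$root/net/ipv4/udp_l3mdev_accept" ]
}

# Usage (not run here):
#   l3mdev_accept_available || echo "kernel lacks tcp/udp_l3mdev_accept"
```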
  
  only `ip vrf exec` related - NOT required for baseline functionality:
  
  * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and
  CGROUP_BPF enabled - xenial HWE only (not HWE-edge)
  
  2. User space (iproute2)
  
  iproute2 supports the vrf keyword in a version packaged with Ubuntu
  16.04.
  
  More specific functionality like `ip vrf exec <vrf-name>` is available
  in later versions:
  
  
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0
  git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
  v4.10.0
  v4.11.0
  ...
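
  Given that the first tag containing the commit is v4.10.0, a provisioning
  script could gate any use of `ip vrf exec` on the installed iproute2
  version. A minimal sketch: version_at_least is a hypothetical helper and
  relies on `sort -V` (GNU coreutils). How the installed version is obtained
  is left to the caller, for example from the package manager.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed if version HAVE is >= version WANT.
# Sorts both versions and checks that WANT comes first (or they are equal).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage with an explicitly supplied version string:
if version_at_least "4.15.0" "4.10.0"; then
    echo "ip vrf exec should be available"
fi
```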
  
  3. MAAS - already hands over per-subnet default gateways
  
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378
  
  4. Juju and/or MAAS:
  
  * create per-network-space routing tables (default gateways must be taken 
from subnets in MAAS - subnets related to the same space will have different 
default gateways)
  * create VRF devices relevant to network spaces;
  * enslave interfaces to VRF devices (this includes Linux bridges created by 
Juju for containers).
  
  5. Nothing for baseline functionality other than configuring software to
  use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.
  
  (future work) configure software to use `ip vrf exec` even if it doesn't
  support VRFs directly when INADDR_ANY is not used.
  
  See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note
  that setsockopt requirement is worked around via `ip vrf exec` in
  iproute2 (no need to rewrite every application):
  
  "Applications that are to work within a VRF need to bind their socket to
  the VRF device:
  
  setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
  
  or to specify the output device using cmsg and IP_PKTINFO.
  
  TCP & UDP services running in the default VRF context (ie., not bound to
  any VRF device) can work across ***all VRF domains*** by enabling the
  tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
  
  sysctl -w net.ipv4.tcp_l3mdev_accept=1
  sysctl -w net.ipv4.udp_l3mdev_accept=1"
  
  http://man7.org/linux/man-pages/man8/ip-vrf.8.html
  "This ip-vrf command is a helper to run a command against a specific VRF with 
the VRF association ***inherited parent to child***."
  
  References:
  
  https://en.wikipedia.org/wiki/Multihoming
  http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
  http://blog.ipspace.net/2010/09/ribs-and-fibs.html
  
  https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read
  
  
https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
  
  http://netdevconf.org/1.2/session.html?david-ahern-talk
  
  https://www.kernel.org/doc/Documentation/networking/vrf.txt
  
  https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29
  
  http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
  https://tools.ietf.org/html/rfc7938
  
  http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage
  example on 16.04)

** Description changed:

  Problem description:
  
  * a host is multi-homed if it has multiple network interfaces with L3
  addresses configured (physical or virtual; this is the normal case in
  OpenStack for both IPv4 and IPv6, and for IPv6 hosts in general);
  
  * if all hosts that need to participate in L3 communication are located
  on the same L2 network there is no need for a routing device to be
  present. ARP/NDP and auto-created directly connected routes are enough;
  
  * multi-homing with hosts located on different L2 networks requires more 
intelligent routing:
    - "directly connected" routes are no longer enough to talk to all relevant 
hosts in the same network space;
    - a default gateway in the main routing table may not be the correct 
routing device that knows where to forward traffic (management network traffic 
goes to a management switch and router, other traffic goes to L3 ToR switch but 
may go via different bonds);
    - even if a default gateway knows where to forward traffic, it may not be 
the intended physical path (storage replication traffic must go through a 
specific outgoing interface, not the same interface as storage access traffic 
although both interfaces are connected to the same ToR);
    - there is no longer a single "default gateway" as applications need either 
per-logical-direction routers or to become routers themselves (if destination 
== X, forward to next-hop Y). Leaf-spine architecture is a good example of how 
multiple L2 networks force you to use spaces that have VLANs in different 
switch fabrics => one or more hops between hosts with interfaces associated 
with the same network space;
    - while network spaces implicitly require L3 reachability between each host 
that has a NIC associated with a network space, the current definition does not 
mention routing infrastructure required for that. For a single L2 this problem 
is hidden by directly connected routes, for multi-L2, no solution is provided 
or discussed;
  
  * existing solutions to multi-homing require routing table management on
  a given host: complex static routing rules, dynamic routing (e.g.
  running an OSPF or BGP daemon on a host);
  
  * using static routes is rigid and requires network planning (i.e.
  working with network engineers who may have varying degrees of
  experience, doing VLSM planning etc.);
  
  * using dynamic routing requires a broader integration into an
  organization's L3 network infrastructure. Routing can be implemented
  differently across different organizations and it is a security and
  operational burden to integrate with a company's routing infrastructure.
  
  Summary: a mechanism is needed to associate an interface with a
  forwarding table (FIB) which has its own default gateway and make an
  application with a listen(2)ing socket(2) return connected sockets
  associated with different FIBs. In other words, applications need to
  implicitly get source/destination-based routing capabilities without the
  need to use static routing schemes or dynamic routing and with minimum
  or no modifications to the applications themselves.
  
  Goals:
  
  * avoid turning individual hosts into routers;
  * avoid complex static rules;
  * better support multi-fabric deployments with minimum effort (Juju, charms, 
MAAS, applications, network infrastructure);
  * reduce operational complexity (custom L3 infrastructure integration for 
each deployment);
  * reduce delivery risks (L3 infrastructure, L3 department responsiveness 
varies);
  * avoid any form of L2 stretching at the infrastructure level - this is 
inefficient for various reasons.
  
  NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend
  reading this post to understand the suggestions below.
  
  How to solve it?
  
  What does it mean for Juju to support VRF devices?
  
  * enslave certain devices on provisioning based on network space information 
(physical NICs, VLAN devices, bonds AND bridges created for containers must be 
considered) - VRF devices logically enslave devices similar to bridges but work 
differently (on L3, not L2);
  * the above is per network namespace so it will work equally well in a LXD 
container;
  
  Conceptually:
  
  # echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
  # sysctl -p
  
  # # create additional routing tables
  # cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
  1  mgmt
  10 pub
  20 storacc
  30 storrepl
  EOF
  
  # # populate per-routing table default gateways
  # ip route add table mgmt default via 192.168.0.1
  # ip route add table pub default via 172.16.0.1
  # ip route add table storacc default via 10.10.4.1
  # ip route add table storrepl default via 10.10.5.1
  
  # # add and bring up VRF devices
  # ip link add mgmt type vrf table 1 && ip link set dev mgmt up
  # ip link add pub type vrf table 10 && ip link set dev pub up
- # ip link add storacc type vrf table 20 && ip link set dev mgmt up
- # ip link add storrepl type vrf table 30 && ip link set dev pub up
+ # ip link add storacc type vrf table 20 && ip link set dev storacc up
+ # ip link add storrepl type vrf table 30 && ip link set dev storrepl up
  
  # # enslave actual devices to VRF devices
  # ip link set mgmtbr0 master mgmt
  # ip link set pubbr0 master pub
  # ip link set storaccbr0 master storacc
  # ip link set storreplbr0 master storrepl
  
  # make your services use INADDR_ANY for listening sockets in charms if
  not done already (use 0.0.0.0)
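
The conceptual steps above could be generated from a per-space table rather than typed by hand. The following is only a dry-run sketch: it prints the `ip` commands instead of running them. The function name `emit_vrf_commands` and the four-column input format (table id, VRF name, gateway, device) are assumptions made for illustration and are not something Juju or MAAS provide. The sketch also enslaves the device before adding the default route, which is a choice of this example, and it refers to tables by number so that it does not depend on entries in rt_tables.d.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for one VRF per network space.
# Input format (hypothetical): <table-id> <vrf-name> <gateway> <device>
emit_vrf_commands() {
    while read -r table vrf gateway dev; do
        # Skip blank lines and comment lines.
        case "$table" in ''|'#'*) continue ;; esac
        echo "ip link add $vrf type vrf table $table"
        echo "ip link set dev $vrf up"
        echo "ip link set dev $dev master $vrf"
        echo "ip route add table $table default via $gateway"
    done
}

# Same example values as in the conceptual listing above.
emit_vrf_commands <<'EOF'
1  mgmt     192.168.0.1 mgmtbr0
10 pub      172.16.0.1  pubbr0
20 storacc  10.10.4.1   storaccbr0
30 storrepl 10.10.5.1   storreplbr0
EOF
```

Reviewing the printed commands before piping them to a root shell keeps the example safe to run on any machine.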
  
  charm-related:
  
  * (no-op) services with listening sockets on INADDR_ANY will not need
  any modifications either on the charm side or at the application level -
  this is the cheapest way to solve multi-homing problems;
  
  * (later) a more advanced functionality for applications that do not use
  INADDR_ANY but bind a listening socket to a specific address - this
  requires `ip vrf exec` functionality in iproute2 or application
  modifications.
  
  Notes:
  
  * Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move 
routing problems to L3 departments. Juju deploy "router" is a different 
scenario which should reside on a model separate from IAAS;
  * We are not turning hosts into routers with this - this is a way to move 
routing decisions to the next hop which is available on a directly connected 
route. The problem we are solving here is N next hops instead of just one. 
Those hops can worry about administrative distance/different routing protocols, 
route costs/metrics, routing protocol peer authentication etc.
  * Linux kernel functionality was mostly upstreamed in 4.4;
  * This functionality is Linux-only, although a unit agent can run on Windows 
too (nothing we can do here).
  
  Implementation description:
  
  1. Kernel
  
  4.4 (GA xenial)
  
  * CONFIG_NET_VRF=m - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
  
  * CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels
  
http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109
  
  backports needed from 4.5 - required for VRF-unaware applications that
  use INADDR_ANY:
  
  6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
  63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)
  
  only `ip vrf exec` related - NOT required for baseline functionality:
  
  * http://man7.org/linux/man-pages/man8/ip-vrf.8.html CGROUPS and
  CGROUP_BPF enabled - xenial HWE only (not HWE-edge)
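
A quick way to see whether a given kernel has the options listed above is to grep its build config. This is a minimal sketch: the helper name `check_kernel_options` is hypothetical, and the `/boot/config-<release>` path is an assumption about where the distribution installs the config (it is skipped if the file is not readable).

```shell
#!/bin/sh
# check_kernel_options FILE OPTION...
# Succeeds only if every OPTION is set to y or m in the kernel config FILE.
check_kernel_options() {
    config_file=$1
    shift
    status=0
    for opt in "$@"; do
        if grep -Eq "^${opt}=[ym]\$" "$config_file"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
            status=1
        fi
    done
    return "$status"
}

# Assumed location of the running kernel's config; adjust as needed.
config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    check_kernel_options "$config" CONFIG_NET_VRF CONFIG_NET_L3_MASTER_DEV ||
        echo "baseline VRF options are incomplete"
fi
```

The same helper can be called with CONFIG_CGROUPS and CONFIG_CGROUP_BPF to check the options that only matter for `ip vrf exec`.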
  
  2. User space (iproute2)
  
  iproute2 supports the vrf keyword in a version packaged with Ubuntu
  16.04.
  
  More specific functionality like `ip vrf exec <vrf-name>` is available
  in later versions:
  
  
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0
  git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
  v4.10.0
  v4.11.0
  ...
  
  3. MAAS - already hands over per-subnet default gateways
  
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
  
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378
  
  4. Juju and/or MAAS:
  
  * create per-network-space routing tables (default gateways must be taken 
from subnets in MAAS - subnets related to the same space will have different 
default gateways)
  * create VRF devices relevant to network spaces;
  * enslave interfaces to VRF devices (this includes Linux bridges created by 
Juju for containers).
  
  5. Nothing for baseline functionality other than configuring software to
  use 0.0.0.0 (INADDR_ANY or "all interfaces") for listening sockets.
  
  (future work) configure software to use `ip vrf exec` even if it doesn't
  support VRFs directly when INADDR_ANY is not used.
  
  See https://www.kernel.org/doc/Documentation/networking/vrf.txt, note
  that setsockopt requirement is worked around via `ip vrf exec` in
  iproute2 (no need to rewrite every application):
  
  "Applications that are to work within a VRF need to bind their socket to
  the VRF device:
  
  setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
  
  or to specify the output device using cmsg and IP_PKTINFO.
  
  TCP & UDP services running in the default VRF context (ie., not bound to
  any VRF device) can work across ***all VRF domains*** by enabling the
  tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
  
  sysctl -w net.ipv4.tcp_l3mdev_accept=1
  sysctl -w net.ipv4.udp_l3mdev_accept=1"
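
Whether the two sysctls quoted above exist on a host (that is, whether the backports are present) and whether they are enabled can be checked by reading them from procfs. This is a sketch only: the function name `report_l3mdev_accept` and its optional root-directory argument are assumptions added so the check can be pointed at a test directory.

```shell
#!/bin/sh
# report_l3mdev_accept [ROOT]
# Prints the l3mdev accept sysctls found under ROOT (default /proc/sys).
# Returns non-zero if either one is absent or not set to 1.
report_l3mdev_accept() {
    root=${1:-/proc/sys}
    status=0
    for name in tcp_l3mdev_accept udp_l3mdev_accept; do
        file="$root/net/ipv4/$name"
        if [ -r "$file" ]; then
            value=$(cat "$file")
            echo "net.ipv4.$name = $value"
            [ "$value" = "1" ] || status=1
        else
            echo "net.ipv4.$name: not available"
            status=1
        fi
    done
    return "$status"
}

report_l3mdev_accept || echo "l3mdev accept is not fully enabled"
```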
  
  http://man7.org/linux/man-pages/man8/ip-vrf.8.html
  "This ip-vrf command is a helper to run a command against a specific VRF with 
the VRF association ***inherited parent to child***."
  
  References:
  
  https://en.wikipedia.org/wiki/Multihoming
  http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
  http://blog.ipspace.net/2010/09/ribs-and-fibs.html
  
  https://cumulusnetworks.com/blog/vrf-for-linux/ <--- this is a must-read
  
  
https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
  
  http://netdevconf.org/1.2/session.html?david-ahern-talk
  
  https://www.kernel.org/doc/Documentation/networking/vrf.txt
  
  https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29
  
  http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
  https://tools.ietf.org/html/rfc7938
  
  http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage
  example on 16.04)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1737428

Title:
  VRF support to solve routing problems associated with multi-homing

To manage notifications about this bug go to:
https://bugs.launchpad.net/juju/+bug/1737428/+subscriptions

