[Yahoo-eng-team] [Bug 1738659] Re: linux bridge assigns mac address to the wrong port
I have investigated this further, and it looks like the issue is in my external network: a device is bouncing the arp request back, and this is why the bridge assigns the mac address to the bond2 interface. So what happens is the following:

    default gw<-->physical switch<-->[bond2 bridge tap]<-->[eth0 cirrosVM]

The arp request goes out on the eth0 interface and enters the bridge on the tap interface. The bridge assigns the eth0 mac address to the tap interface and sends the arp request out on the bond2 interface. Now some device on the left side of the bridge (either the physical switch or the default gw) broadcasts that arp request back, so the same arp request re-enters the bridge on the bond2 interface, and the bridge assigns the source mac address of the arp request (which is still the eth0 mac address) to the bond2 port in the forwarding table, which causes the behavior I have noticed.

This also explains why I see 2 arp requests and a single arp reply when tracing:

    # tcpdump -n -i bond2 arp
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on bond2, link-type EN10MB (Ethernet), capture size 262144 bytes
    06:13:48.581758 ARP, Request who-has 10.20.21.1 tell 10.20.21.114, length 28
    06:13:48.581791 ARP, Request who-has 10.20.21.1 tell 10.20.21.114, length 28
    06:13:48.582221 ARP, Reply 10.20.21.1 is-at 00:17:08:c4:52:80, length 46

I am really sorry for all the trouble.

** Changed in: neutron
       Status: New => Invalid

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1738659

Title:
  linux bridge assigns mac address to the wrong port

Status in neutron:
  Invalid

Bug description:

* High level description:

The linux bridge assigns the mac address to the physical external interface instead of the tap interface, therefore the VM instance that uses the tap interface is not able to communicate over IP.
The workaround I have found is to convert the bridge to a hub by setting ageing to 0 (brctl setageing br-name 0). In this way, the bridge floods all packets on all attached bridge interfaces, and everything starts working.

* Pre-conditions:

I have openstack pike running on the latest centos 7 release (7.4.1708). Neutron was manually installed as described in the neutron installation guide at https://docs.openstack.org/neutron/latest/install/install-rdo.html. I have configured neutron for Network Option 2 (self service networks), however the setup I am testing here is an external flat provider network with a single cirros VM instance attached directly to it (without any router in between). The openstack environment is made of two nodes: a controller and a compute.

The neutron package version is 11.0.2-2.el7 (latest in centos 7), the bridge-utils version is 1.5-9.el7 and the kernel version is 3.10.0-693.11.1.el7.x86_64. I have tested this with the cirros images cirros-0.4.0-x86_64-disk.img and cirros-0.3.5-x86_64-disk.img.

    # rpm -qa | grep neutron-linuxbridge
    openstack-neutron-linuxbridge-11.0.2-2.el7.noarch
    # rpm -qf /usr/sbin/brctl
    bridge-utils-1.5-9.el7.x86_64
    # uname -a
    Linux compute1 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Bridging is configured as below on both the controller and the compute node:

    ml2_conf.ini:
    [ml2_type_flat]
    flat_networks = physnet1

    linuxbridge_agent.ini:
    [linux_bridge]
    physical_interface_mappings = physnet1:bond2

* Step-by-step reproduction steps:

This is how I created the provider network:

    openstack network create \
      --share \
      --external \
      --provider-physical-network physnet1 \
      --provider-network-type flat \
      ExtNet1

This is how I created the provider subnet:

    openstack subnet create \
      --network ExtNet1 \
      --allocation-pool start=10.20.21.96,end=10.20.21.127 \
      --dns-nameserver 10.20.21.1 \
      --gateway 10.20.21.1 \
      --subnet-range 10.20.21.0/24 \
      ExtSubnet1

This is how I launch a cirros instance and attach it to the provider network:

    openstack server create \
      --flavor m1.nano \
      --image cirros-0.4.0-x86_64-disk.img \
      --nic net-id=$(openstack network list | grep ExtNet1 | cut -d' ' -f 2) \
      --security-group default \
      --key-name controller-key \
      cirros1

Based on the above, neutron creates the following bridge on my compute node:

    # brctl show
    bridge name      bridge id           STP enabled   interfaces
    brq75a55ef7-4a   8000.fc15b413e6a3   no            bond2
                                                       tap44bc34bb-e2

bond2 is the physical interface used for the flat provider network (in access mode, no vlans) and tap44bc34bb-e2 is the tap interface attached to my cirros VM instance. In the bridge, bond2 is port 2 and the tap44bc34bb-e2 interface is port 1, and both are in forwarding mode.

    # brctl showstp brq75a
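
The mechanism described in the follow-up comment at the top of this message (a reflected arp request moving the VM mac address from the tap port to the bond2 port) can be illustrated with a toy model. This is only a sketch in Python, not the kernel bridge code: the class `LearningBridge` and its `receive` method are hypothetical names invented for the illustration, and the model assumes a bridge that learns the source mac on every received frame and never sends a frame back out of the port it arrived on. Only the interface names and mac addresses are taken from the report.

```python
from dataclasses import dataclass, field

BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class LearningBridge:
    """Toy learning bridge (hypothetical model, not the kernel implementation)."""
    ports: tuple
    fdb: dict = field(default_factory=dict)  # source mac -> port it was last seen on

    def receive(self, in_port, src, dst):
        """Learn src on in_port, then return the list of ports the frame leaves on."""
        self.fdb[src] = in_port                 # learning step: last writer wins
        out = self.fdb.get(dst)
        if dst == BROADCAST or out is None:     # broadcast or unknown unicast: flood
            return [p for p in self.ports if p != in_port]
        return [] if out == in_port else [out]  # never send back out the ingress port


VM_MAC = "fa:16:3e:cc:dc:ec"   # eth0 of the cirros VM (from the report)
GW_MAC = "00:17:08:c4:52:80"   # default gw (from the tcpdump trace)

bridge = LearningBridge(ports=("tap44bc34bb-e2", "bond2"))

# 1. The VM's arp request enters on the tap port and is flooded to bond2.
step1 = bridge.receive("tap44bc34bb-e2", VM_MAC, BROADCAST)
learned_after_request = bridge.fdb[VM_MAC]

# 2. An upstream device reflects the same broadcast, so it re-enters on bond2.
step2 = bridge.receive("bond2", VM_MAC, BROADCAST)
learned_after_echo = bridge.fdb[VM_MAC]

# 3. The gateway's unicast arp reply arrives on bond2, addressed to the VM.
step3 = bridge.receive("bond2", GW_MAC, VM_MAC)

print(learned_after_request, learned_after_echo, step3)
```

In this model the reply in step 3 is sent nowhere, because the table says the VM lives behind the same port the reply arrived on. That matches the symptom in the report, under the assumptions stated above.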
[Yahoo-eng-team] [Bug 1738659] [NEW] linux bridge assigns mac address to the wrong port
Public bug reported:

* High level description:

The linux bridge assigns the mac address to the physical external interface instead of the tap interface, therefore the VM instance that uses the tap interface is not able to communicate over IP.

The workaround I have found is to convert the bridge to a hub by setting ageing to 0 (brctl setageing br-name 0). In this way, the bridge floods all packets on all attached bridge interfaces, and everything starts working.

* Pre-conditions:

I have openstack pike running on the latest centos 7 release (7.4.1708). Neutron was manually installed as described in the neutron installation guide at https://docs.openstack.org/neutron/latest/install/install-rdo.html. I have configured neutron for Network Option 2 (self service networks), however the setup I am testing here is an external flat provider network with a single cirros VM instance attached directly to it (without any router in between). The openstack environment is made of two nodes: a controller and a compute.

The neutron package version is 11.0.2-2.el7 (latest in centos 7), the bridge-utils version is 1.5-9.el7 and the kernel version is 3.10.0-693.11.1.el7.x86_64. I have tested this with the cirros images cirros-0.4.0-x86_64-disk.img and cirros-0.3.5-x86_64-disk.img.

    # rpm -qa | grep neutron-linuxbridge
    openstack-neutron-linuxbridge-11.0.2-2.el7.noarch
    # rpm -qf /usr/sbin/brctl
    bridge-utils-1.5-9.el7.x86_64
    # uname -a
    Linux compute1 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Bridging is configured as below on both the controller and the compute node:

    ml2_conf.ini:
    [ml2_type_flat]
    flat_networks = physnet1

    linuxbridge_agent.ini:
    [linux_bridge]
    physical_interface_mappings = physnet1:bond2

* Step-by-step reproduction steps:

This is how I created the provider network:

    openstack network create \
      --share \
      --external \
      --provider-physical-network physnet1 \
      --provider-network-type flat \
      ExtNet1

This is how I created the provider subnet:

    openstack subnet create \
      --network ExtNet1 \
      --allocation-pool start=10.20.21.96,end=10.20.21.127 \
      --dns-nameserver 10.20.21.1 \
      --gateway 10.20.21.1 \
      --subnet-range 10.20.21.0/24 \
      ExtSubnet1

This is how I launch a cirros instance and attach it to the provider network:

    openstack server create \
      --flavor m1.nano \
      --image cirros-0.4.0-x86_64-disk.img \
      --nic net-id=$(openstack network list | grep ExtNet1 | cut -d' ' -f 2) \
      --security-group default \
      --key-name controller-key \
      cirros1

Based on the above, neutron creates the following bridge on my compute node:

    # brctl show
    bridge name      bridge id           STP enabled   interfaces
    brq75a55ef7-4a   8000.fc15b413e6a3   no            bond2
                                                       tap44bc34bb-e2

bond2 is the physical interface used for the flat provider network (in access mode, no vlans) and tap44bc34bb-e2 is the tap interface attached to my cirros VM instance. In the bridge, bond2 is port 2 and the tap44bc34bb-e2 interface is port 1, and both are in forwarding mode.

    # brctl showstp brq75a55ef7-4a
    brq75a55ef7-4a
    <...>
     ageing time            300.00
    <...>
    bond2 (2)
     port id                8002    state    forwarding
    <...>
    tap44bc34bb-e2 (1)
     port id                8001    state    forwarding
    <...>

The network flow is like below:

    default gw<-->physical switch<-->[bond2 bridge tap]<-->[eth0 cirrosVM]

The eth0 mac address is fa:16:3e:cc:dc:ec.
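
The `ageing time 300.00` value in the showstp output above is the setting that the workaround (brctl setageing br-name 0) changes. Why ageing 0 makes the bridge behave like a hub can be sketched as follows. This is a simplified Python model and not the kernel code: the function `forward` and its parameters are hypothetical, and it assumes that a forwarding entry older than the ageing time is simply treated as unknown, so the frame is flooded.

```python
def forward(fdb, dst, in_port, ports, now, ageing):
    """Return the output ports for a unicast frame; stale entries count as unknown."""
    entry = fdb.get(dst)                       # (port, last_seen) or None
    if entry is not None:
        port, last_seen = entry
        if now - last_seen < ageing:           # entry still fresh: trust it
            return [] if port == in_port else [port]
    return [p for p in ports if p != in_port]  # unknown or expired: flood


PORTS = ("tap44bc34bb-e2", "bond2")
VM_MAC = "fa:16:3e:cc:dc:ec"

# The VM mac was (wrongly) learned on bond2 at t = 0 s; times are illustrative.
fdb = {VM_MAC: ("bond2", 0.0)}

# A frame for the VM arrives on bond2 shortly afterwards.
with_default_ageing = forward(fdb, VM_MAC, "bond2", PORTS, now=0.39, ageing=300.0)
with_zero_ageing = forward(fdb, VM_MAC, "bond2", PORTS, now=0.39, ageing=0.0)

print(with_default_ageing, with_zero_ageing)
```

With the default ageing the wrong entry is trusted and the frame goes nowhere; with ageing 0 every entry counts as expired, so the frame is flooded and reaches the tap port. This also means the workaround gives up unicast filtering on the bridge entirely.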

After the cirros VM comes up, it is not able to get an IP address from the DHCP agent, and there is no IP communication. Therefore I have to use the console and manually assign the corresponding IP address to the cirros VM eth0 interface, but the IP connectivity still does not work: I cannot ping any external IPs, even in the same 10.20.21.0/24 subnet.

What I have found is that in the bridge forwarding table, the bridge wrongly assigns the eth0 mac address to port 2, which is the bond2 interface, instead of assigning it to port 1, which is the tap interface. This happens only if the arp table in the cirros VM instance does not contain the mac address of the destination IP I am pinging (the default gw in this case), so the cirros VM sends an arp request (Request who-has 10.20.21.1 tell 10.20.21.114). See below the eth0 mac address wrongly assigned to port 2 in the forwarding table:

    # brctl showmacs brq75a55ef7-4a | grep fa:16:3e:cc:dc:ec
      2     fa:16:3e:cc:dc:ec       no                 0.39

However, since the eth0 mac address is wrongly assigned to port 2, the arp reply (Reply 10.20.21.1 is-at 00:17:08:c4:52:80) no longer reaches eth0. Using tcpdump on the compute node, I can see the arp request going through the tap interface and the bridge, however the arp reply does not show up on the tap interface. Since the arp reply back does not reach the eth0, the arp