** Changed in: neutron/icehouse
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1372438

Title: Race condition in l2pop drops tunnels

Status in OpenStack Neutron (virtual network service): Fix Released
Status in neutron icehouse series: Fix Released
Status in neutron juno series: Fix Released

Bug description:

The issue was originally raised by a Red Hat performance engineer (Joe Talerico) here: https://bugzilla.redhat.com/show_bug.cgi?id=1136969 (see starting from comment 4).

Joe created a Fedora instance in his OS cloud based on RHEL7-OSP5 (Icehouse), where he installed the Rally client to run benchmarks against that cloud itself. He assigned a floating IP to that instance to be able to access API endpoints from inside the Rally machine. Then he ran a scenario which basically started up 100+ new instances in parallel, tried to access each of them via ssh, and, once that succeeded, cleaned up each created instance (with its ports).

Once in a while, his Rally instance lost its connection to the outside world. This was because the VXLAN tunnel to the compute node hosting the Rally machine was dropped on the networker node where the DHCP, L3, and Metadata agents were running. Once we restarted the OVS agent, the tunnel was recreated properly.

The scenario failed only if the L2POP mechanism was enabled.

I've looked through the OVS agent logs and found out that the tunnel was dropped due to a legitimate fdb entry removal request coming from the neutron-server side. So the fault is probably on the neutron-server side, in the l2pop mechanism driver.

I've then looked through the patches in Juno to see whether there is something related to the issue already merged, and found the patch that gets rid of the _precommit step when cleaning up fdb entries. Once we applied the patch on the neutron-server node, we stopped experiencing those connectivity failures.
After discussion with Vivekanandan Narasimhan, we came up with the following race condition that may result in tunnels being dropped while legitimate resources are still using them:

(quoting Vivek below)

'''
- port1 delete request comes in;
- port1 delete request acquires the lock;
- port2 create/update request comes in;
- port2 create/update waits because the lock is unavailable;
- precommit phase for port1 determines that the port is the last one, so we should drop the FLOODING_ENTRY;
- port1 delete applied to db;
- port1 transaction releases the lock;
- port2 create/update acquires the lock;
- precommit phase for port2 determines that the port is the first one, so it requests FLOODING_ENTRY + MAC-specific flow creation;
- port2 create/update request applied to db;
- port2 transaction releases the lock.

Now at this point the postcommit of either of them could happen, because those code pieces operate outside the locked zone.

If it happens this way, the tunnel is retained:

- postcommit phase for port1 requests FLOODING_ENTRY deletion due to port1 deletion;
- postcommit phase requests FLOODING_ENTRY + MAC-specific flow creation for port2.

If it happens the way below, the tunnel breaks:

- postcommit phase for create port2 requests FLOODING_ENTRY + MAC-specific flow;
- postcommit phase for delete port1 requests FLOODING_ENTRY deletion.
'''

We considered backporting to Icehouse the patch that gets rid of precommit [1], which seems to eliminate the race. However, we are concerned that Juno reverted to the previous behaviour as part of the DVR work [2]. We have not done any testing to check whether the issue is present in Juno, though a brief analysis of the code suggests that it should fail there too.

Ideally, the fix for Juno should be easily backportable, because the issue is currently present in Icehouse and we would like to have the same fix for both branches (Icehouse and Juno), instead of backporting patch [1] to Icehouse and implementing another patch for Juno.
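The two postcommit orderings in the quoted race can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the l2pop driver's code: the `Agent` class and the `precommit_delete`, `precommit_create`, and `run` helpers are hypothetical names invented for illustration, and the only identifier taken from the report is `FLOODING_ENTRY`. The precommit decisions are made serially (as if under the lock), while the resulting notifications are replayed in whichever order is requested (as if outside the lock).

```python
from dataclasses import dataclass, field

FLOODING_ENTRY = "FLOODING_ENTRY"  # name taken from the bug description


@dataclass
class Agent:
    """Hypothetical stand-in for one tunnel's fdb entries on the OVS agent."""
    fdb: set = field(default_factory=set)

    def add(self, entries):
        self.fdb |= set(entries)

    def remove(self, entries):
        self.fdb -= set(entries)

    def tunnel_up(self):
        # In this toy model the tunnel exists while the flooding entry does.
        return FLOODING_ENTRY in self.fdb


def precommit_delete(db, port):
    """Locked phase: decide what to remove, then apply the delete to db."""
    is_last = db == {port}
    db.discard(port)
    return {port, FLOODING_ENTRY} if is_last else {port}


def precommit_create(db, port):
    """Locked phase: decide what to add, then apply the create to db."""
    is_first = not db
    db.add(port)
    return {port, FLOODING_ENTRY} if is_first else {port}


def run(postcommit_order):
    """Serialize the precommits, then replay postcommits in the given order."""
    db = {"port1"}
    agent = Agent(fdb={"port1", FLOODING_ENTRY})
    # Locked zone: port1 delete runs first, then port2 create.
    pending = {
        "delete_port1": (agent.remove, precommit_delete(db, "port1")),
        "create_port2": (agent.add, precommit_create(db, "port2")),
    }
    # Unlocked zone: notifications reach the agent in arbitrary order.
    for name in postcommit_order:
        action, entries = pending[name]
        action(entries)
    return db, agent


if __name__ == "__main__":
    for order in (["delete_port1", "create_port2"],
                  ["create_port2", "delete_port1"]):
        db, agent = run(order)
        print(order, "ports in db:", sorted(db), "tunnel up:", agent.tunnel_up())
```

In the second ordering the database still contains port2, yet the flooding entry has been removed, which corresponds to the dropped tunnel described in the report. The sketch also suggests why a decision computed at precommit time can be stale by the time it is acted on outside the lock.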
[1]: https://review.openstack.org/#/c/95165/
[2]: https://review.openstack.org/#/c/102398/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1372438/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp