On 2/27/24 20:14, Han Zhou wrote: > For kernel datapath, when a new tunnel port is created in the same > transaction in which an old tunnel port with the same tunnel > configuration is deleted, the new tunnel port creation will fail and > left in an error state. This can be easily reproduced in OVN by > triggering a chassis deletion and addition with the same encap in the > same SB DB transaction, such as: > > ovn-sbctl chassis-add aaaaaa geneve 1.2.3.4 > ovn-sbctl chassis-del aaaaaa -- chassis-add bbbbbb 1.2.3.4 > > ovs-vsctl show | grep error > error: "could not add network device ovn-bbbbbb-0 to ofproto (File exists)" > > Related logs in OVS: > — > 2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted interface > ovn-aaaaaa-0 on port 113 > 2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bbbbbb-0: attempting to add > tunnel port with same config as port 'ovn-aaaaaa-0' (::->1.2.3.4, key=flow, > legacy_l2, dp port=9) > 2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port > ovn-bbbbbb-0 (File exists) > 2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network device > ovn-bbbbbb-0 to ofproto (File exists) > —
Hi, Han. Thanks for the patch! > > Depending on when there are other OVSDB changes, it may take a long time > for the device to be added successfully, triggered by the next OVS > iteration. > > (note: native tunnel ports do not have this problem) I don't think this is correct. The code path is common for both system and native tunnels. I can reproduce the issues in a sandbox with: $ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'" [tutorial]$ ovs-vsctl add-port br0 tunnel_port \ -- set Interface tunnel_port \ type=geneve options:remote_ip=flow options:key=123 [tutorial]$ ovs-vsctl del-port tunnel_port \ -- add-port br0 tunnel_port2 \ -- set Interface tunnel_port2 \ type=geneve options:remote_ip=flow options:key=123 ovs-vsctl: Error detected while setting up 'tunnel_port2': could not add network device tunnel_port2 to ofproto (File exists). See ovs-vswitchd log for details. The same should work in a testsuite as well, i.e. we should be able to create a test for this scenario. Note: The --dummy=system prevents OVS from replacing tunnel ports with dummy ones. > > The problem is how the tunnel port deletion and creation are handled. In > bridge_reconfigure(), port deletion is handled before creation, to avoid > such resource conflict. However, for kernel tunnel ports, the real clean > up is performed at the end of the bridge_reconfigure() in the: > bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct() > > We cannot call bridge_run__() at an earlier point before all > reconfigurations are done, so this patch tries a generic approach to > just re-run the bridge_reconfigure() when there are any port creations > encountered "File exists" error, which indicates a possible resource > conflict may have happened due to a later deleted port, and retry may > succeed. > > Signed-off-by: Han Zhou <hz...@ovn.org> > --- > This is RFC because I am not sure if there is a better way to solve the > problem > more specifically by executing the port_destruct for the old port before > trying > to create the new port. The fix may be more complex though. I don't think re-trying is a good approach in general. We should likely just destroy the tnl_port structure right away, similarly to how we clean up stp, lldp and bundles in ofproto_port_unregister(). Maybe we can add a new callback similar to bundle_remove() and call tnl_port_del() from it? (I didn't try, so I'm not 100% sure this will not cause any issues.) What do you think? Best regards, Ilya Maximets. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev