Re: [ovs-dev] [RFC] bridge: Retry tunnel port addition for conflict.

2024-03-12 Thread Tao Liu
On 03/11  , Ilya Maximets wrote:
> On 3/11/24 06:17, Han Zhou wrote:
> > 
> > 
> > On Fri, Mar 8, 2024 at 12:17 AM Tao Liu  > > wrote:
> >>
> >>
> >> On 3/7/24 5:39 AM, Ilya Maximets wrote:
> >> > On 2/27/24 20:14, Han Zhou wrote:
> >> >> For kernel datapath, when a new tunnel port is created in the same
> >> >> transaction in which an old tunnel port with the same tunnel
> >> >> configuration is deleted, the new tunnel port creation will fail and
> >> >> left in an error state. This can be easily reproduced in OVN by
> >> >> triggering a chassis deletion and addition with the same encap in the
> >> >> same SB DB transaction, such as:
> >> >>
> >> >> ovn-sbctl chassis-add aa geneve 1.2.3.4
> >> >> ovn-sbctl chassis-del aa -- chassis-add bb 1.2.3.4
> >> >>
> >> >> ovs-vsctl show | grep error
> >> >> error: "could not add network device ovn-bb-0 to ofproto (File 
> >> >> exists)"
> >> >>
> >> >> Related logs in OVS:
> >> >> —
> >> >> 2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted 
> >> >> interface ovn-aa-0 on port 113
> >> >> 2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bb-0: attempting to 
> >> >> add tunnel port with same config as port 'ovn-aa-0' (::->1.2.3.4, 
> >> >> key=flow, legacy_l2, dp port=9)
> >> >> 2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port 
> >> >> ovn-bb-0 (File exists)
> >> >> 2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network 
> >> >> device ovn-bb-0 to ofproto (File exists)
> >> >> —
> >> >
> >> > Hi, Han.  Thanks for the patch!
> >> >
> >> >>
> >> >> Depending on when there are other OVSDB changes, it may take a long time
> >> >> for the device to be added successfully, triggered by the next OVS
> >> >> iteration.
> >> >>
> >> >> (note: native tunnel ports do not have this problem)
> >> >
> >> > I don't think this is correct.  The code path is common for both system
> >> > and native tunnels.  I can reproduce the issues in a sandbox with:
> >> >
> >> > $ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'"
> >> > [tutorial]$ ovs-vsctl add-port br0 tunnel_port \
> >> >                  -- set Interface tunnel_port \
> >> >                         type=geneve options:remote_ip=flow 
> >> > options:key=123
> >> > [tutorial]$ ovs-vsctl del-port tunnel_port \
> >> >                  -- add-port br0 tunnel_port2 \
> >> >                  -- set Interface tunnel_port2 \
> >> >                         type=geneve options:remote_ip=flow 
> >> > options:key=123
> >> > ovs-vsctl: Error detected while setting up 'tunnel_port2':
> >> > could not add network device tunnel_port2 to ofproto (File exists).
> >> > See ovs-vswitchd log for details.
> >> >
> >> > The same should work in a testsuite as well, i.e. we should be able to
> >> > create a test for this scenario.
> >> >
> >> > Note: The --dummy=system prevents OVS from replacing tunnel ports with
> >> >        dummy ones.
> >> >
> > 
> > Thanks Ilya for the correction! --dummy=system is very helpful.
> > 
> >> >>
> >> >> The problem is how the tunnel port deletion and creation are handled. In
> >> >> bridge_reconfigure(), port deletion is handled before creation, to avoid
> >> >> such resource conflict. However, for kernel tunnel ports, the real clean
> >> >> up is performed at the end of the bridge_reconfigure() in the:
> >> >> bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct()
> >> >>
> >> >> We cannot call bridge_run__() at an earlier point before all
> >> >> reconfigurations are done, so this patch tries a generic approach to
> >> >> just re-run the bridge_reconfigure() when there are any port creations
> >> >> encountered "File exists" error, which indicates a possible resource
> >> >> conflict may have happened due to a later deleted port, and retry may
> >> >> succeed.
> >> >>
> >> >> Signed-off-by: Han Zhou mailto:hz...@ovn.org>>
> >> >> ---
> >> >> This is RFC because I am not sure if there is a better way to solve the 
> >> >> problem
> >> >> more specifically by executing the port_destruct for the old port 
> >> >> before trying
> >> >> to create the new port. The fix may be more complex though.
> >> >
> >> > I don't think re-trying is a good approach in general.  We should likely
> >> > just destroy the tnl_port structure right away, similarly to how we clean
> >> > up stp, lldp and bundles in ofproto_port_unregister().  Maybe we can add
> >> > a new callback similar to bundle_remove() and call tnl_port_del() from 
> >> > it?
> >> > (I didn't try, so I'm not 100% sure this will not cause any issues.)
> >> >
> >> > What do you think?
> >> >
> >> > Best regards, Ilya Maximets.
> >> > ___
> >> > dev mailing list
> >> > d...@openvswitch.org 
> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev 
> >> > 
> >>
> >> Hi Ilya and Han,
> >>
> >> We hit this issue some days ago

Re: [ovs-dev] [RFC] bridge: Retry tunnel port addition for conflict.

2024-03-11 Thread Ilya Maximets
On 3/11/24 06:17, Han Zhou wrote:
> 
> 
> On Fri, Mar 8, 2024 at 12:17 AM Tao Liu  > wrote:
>>
>>
>> On 3/7/24 5:39 AM, Ilya Maximets wrote:
>> > On 2/27/24 20:14, Han Zhou wrote:
>> >> For kernel datapath, when a new tunnel port is created in the same
>> >> transaction in which an old tunnel port with the same tunnel
>> >> configuration is deleted, the new tunnel port creation will fail and
>> >> left in an error state. This can be easily reproduced in OVN by
>> >> triggering a chassis deletion and addition with the same encap in the
>> >> same SB DB transaction, such as:
>> >>
>> >> ovn-sbctl chassis-add aa geneve 1.2.3.4
>> >> ovn-sbctl chassis-del aa -- chassis-add bb 1.2.3.4
>> >>
>> >> ovs-vsctl show | grep error
>> >> error: "could not add network device ovn-bb-0 to ofproto (File 
>> >> exists)"
>> >>
>> >> Related logs in OVS:
>> >> —
>> >> 2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted 
>> >> interface ovn-aa-0 on port 113
>> >> 2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bb-0: attempting to 
>> >> add tunnel port with same config as port 'ovn-aa-0' (::->1.2.3.4, 
>> >> key=flow, legacy_l2, dp port=9)
>> >> 2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port 
>> >> ovn-bb-0 (File exists)
>> >> 2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network device 
>> >> ovn-bb-0 to ofproto (File exists)
>> >> —
>> >
>> > Hi, Han.  Thanks for the patch!
>> >
>> >>
>> >> Depending on when there are other OVSDB changes, it may take a long time
>> >> for the device to be added successfully, triggered by the next OVS
>> >> iteration.
>> >>
>> >> (note: native tunnel ports do not have this problem)
>> >
>> > I don't think this is correct.  The code path is common for both system
>> > and native tunnels.  I can reproduce the issues in a sandbox with:
>> >
>> > $ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'"
>> > [tutorial]$ ovs-vsctl add-port br0 tunnel_port \
>> >                  -- set Interface tunnel_port \
>> >                         type=geneve options:remote_ip=flow options:key=123
>> > [tutorial]$ ovs-vsctl del-port tunnel_port \
>> >                  -- add-port br0 tunnel_port2 \
>> >                  -- set Interface tunnel_port2 \
>> >                         type=geneve options:remote_ip=flow options:key=123
>> > ovs-vsctl: Error detected while setting up 'tunnel_port2':
>> > could not add network device tunnel_port2 to ofproto (File exists).
>> > See ovs-vswitchd log for details.
>> >
>> > The same should work in a testsuite as well, i.e. we should be able to
>> > create a test for this scenario.
>> >
>> > Note: The --dummy=system prevents OVS from replacing tunnel ports with
>> >        dummy ones.
>> >
> 
> Thanks Ilya for the correction! --dummy=system is very helpful.
> 
>> >>
>> >> The problem is how the tunnel port deletion and creation are handled. In
>> >> bridge_reconfigure(), port deletion is handled before creation, to avoid
>> >> such resource conflict. However, for kernel tunnel ports, the real clean
>> >> up is performed at the end of the bridge_reconfigure() in the:
>> >> bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct()
>> >>
>> >> We cannot call bridge_run__() at an earlier point before all
>> >> reconfigurations are done, so this patch tries a generic approach to
>> >> just re-run the bridge_reconfigure() when there are any port creations
>> >> encountered "File exists" error, which indicates a possible resource
>> >> conflict may have happened due to a later deleted port, and retry may
>> >> succeed.
>> >>
>> >> Signed-off-by: Han Zhou mailto:hz...@ovn.org>>
>> >> ---
>> >> This is RFC because I am not sure if there is a better way to solve the 
>> >> problem
>> >> more specifically by executing the port_destruct for the old port before 
>> >> trying
>> >> to create the new port. The fix may be more complex though.
>> >
>> > I don't think re-trying is a good approach in general.  We should likely
>> > just destroy the tnl_port structure right away, similarly to how we clean
>> > up stp, lldp and bundles in ofproto_port_unregister().  Maybe we can add
>> > a new callback similar to bundle_remove() and call tnl_port_del() from it?
>> > (I didn't try, so I'm not 100% sure this will not cause any issues.)
>> >
>> > What do you think?
>> >
>> > Best regards, Ilya Maximets.
>> > ___
>> > dev mailing list
>> > d...@openvswitch.org 
>> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev 
>> > 
>>
>> Hi Ilya and Han,
>>
>> We hit this issue some days ago. As our analysis, it was introduced by
>> commit fe83f81df977 ("netdev: Remove netdev from global shash when the
>> user is changing interface configuration.")
>>
>> Reproduce:
>>    ovs-vsctl add-port br-int vxlan0 \
>>      -- set interface vxlan0 type=vxlan options:remot

Re: [ovs-dev] [RFC] bridge: Retry tunnel port addition for conflict.

2024-03-10 Thread Han Zhou
On Fri, Mar 8, 2024 at 12:17 AM Tao Liu  wrote:
>
>
> On 3/7/24 5:39 AM, Ilya Maximets wrote:
> > On 2/27/24 20:14, Han Zhou wrote:
> >> For kernel datapath, when a new tunnel port is created in the same
> >> transaction in which an old tunnel port with the same tunnel
> >> configuration is deleted, the new tunnel port creation will fail and
> >> left in an error state. This can be easily reproduced in OVN by
> >> triggering a chassis deletion and addition with the same encap in the
> >> same SB DB transaction, such as:
> >>
> >> ovn-sbctl chassis-add aa geneve 1.2.3.4
> >> ovn-sbctl chassis-del aa -- chassis-add bb 1.2.3.4
> >>
> >> ovs-vsctl show | grep error
> >> error: "could not add network device ovn-bb-0 to ofproto (File
exists)"
> >>
> >> Related logs in OVS:
> >> —
> >> 2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted
interface ovn-aa-0 on port 113
> >> 2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bb-0: attempting
to add tunnel port with same config as port 'ovn-aa-0' (::->1.2.3.4,
key=flow, legacy_l2, dp port=9)
> >> 2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add
port ovn-bb-0 (File exists)
> >> 2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network
device ovn-bb-0 to ofproto (File exists)
> >> —
> >
> > Hi, Han.  Thanks for the patch!
> >
> >>
> >> Depending on when there are other OVSDB changes, it may take a long
time
> >> for the device to be added successfully, triggered by the next OVS
> >> iteration.
> >>
> >> (note: native tunnel ports do not have this problem)
> >
> > I don't think this is correct.  The code path is common for both system
> > and native tunnels.  I can reproduce the issues in a sandbox with:
> >
> > $ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'"
> > [tutorial]$ ovs-vsctl add-port br0 tunnel_port \
> >  -- set Interface tunnel_port \
> > type=geneve options:remote_ip=flow
options:key=123
> > [tutorial]$ ovs-vsctl del-port tunnel_port \
> >  -- add-port br0 tunnel_port2 \
> >  -- set Interface tunnel_port2 \
> > type=geneve options:remote_ip=flow
options:key=123
> > ovs-vsctl: Error detected while setting up 'tunnel_port2':
> > could not add network device tunnel_port2 to ofproto (File exists).
> > See ovs-vswitchd log for details.
> >
> > The same should work in a testsuite as well, i.e. we should be able to
> > create a test for this scenario.
> >
> > Note: The --dummy=system prevents OVS from replacing tunnel ports with
> >dummy ones.
> >

Thanks Ilya for the correction! --dummy=system is very helpful.

> >>
> >> The problem is how the tunnel port deletion and creation are handled.
In
> >> bridge_reconfigure(), port deletion is handled before creation, to
avoid
> >> such resource conflict. However, for kernel tunnel ports, the real
clean
> >> up is performed at the end of the bridge_reconfigure() in the:
> >> bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct()
> >>
> >> We cannot call bridge_run__() at an earlier point before all
> >> reconfigurations are done, so this patch tries a generic approach to
> >> just re-run the bridge_reconfigure() when there are any port creations
> >> encountered "File exists" error, which indicates a possible resource
> >> conflict may have happened due to a later deleted port, and retry may
> >> succeed.
> >>
> >> Signed-off-by: Han Zhou 
> >> ---
> >> This is RFC because I am not sure if there is a better way to solve
the problem
> >> more specifically by executing the port_destruct for the old port
before trying
> >> to create the new port. The fix may be more complex though.
> >
> > I don't think re-trying is a good approach in general.  We should likely
> > just destroy the tnl_port structure right away, similarly to how we
clean
> > up stp, lldp and bundles in ofproto_port_unregister().  Maybe we can add
> > a new callback similar to bundle_remove() and call tnl_port_del() from
it?
> > (I didn't try, so I'm not 100% sure this will not cause any issues.)
> >
> > What do you think?
> >
> > Best regards, Ilya Maximets.
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
> Hi Ilya and Han,
>
> We hit this issue some days ago. As our analysis, it was introduced by
> commit fe83f81df977 ("netdev: Remove netdev from global shash when the
> user is changing interface configuration.")
>
> Reproduce:
>ovs-vsctl add-port br-int vxlan0 \
>  -- set interface vxlan0 type=vxlan options:remote_ip=10.10.10.1
>
>sleep 2
>
>ovs-vsctl --if-exists del-port vxlan0 \
>  -- add-port br-int vxlan1 \
>  -- set interface vxlan1 type=vxlan options:remote_ip=10.10.10.1
>ovs-vsctl: Error detected while setting up 'vxlan1': could not add
>network device vxlan1 to ofproto (File exists).
>
> vswitchd log:
>bridge|INFO|

Re: [ovs-dev] [RFC] bridge: Retry tunnel port addition for conflict.

2024-03-08 Thread Tao Liu


On 3/7/24 5:39 AM, Ilya Maximets wrote:

On 2/27/24 20:14, Han Zhou wrote:

For kernel datapath, when a new tunnel port is created in the same
transaction in which an old tunnel port with the same tunnel
configuration is deleted, the new tunnel port creation will fail and
left in an error state. This can be easily reproduced in OVN by
triggering a chassis deletion and addition with the same encap in the
same SB DB transaction, such as:

ovn-sbctl chassis-add aa geneve 1.2.3.4
ovn-sbctl chassis-del aa -- chassis-add bb 1.2.3.4

ovs-vsctl show | grep error
error: "could not add network device ovn-bb-0 to ofproto (File exists)"

Related logs in OVS:
—
2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted interface 
ovn-aa-0 on port 113
2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bb-0: attempting to add tunnel 
port with same config as port 'ovn-aa-0' (::->1.2.3.4, key=flow, legacy_l2, 
dp port=9)
2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port 
ovn-bb-0 (File exists)
2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network device 
ovn-bb-0 to ofproto (File exists)
—


Hi, Han.  Thanks for the patch!



Depending on when there are other OVSDB changes, it may take a long time
for the device to be added successfully, triggered by the next OVS
iteration.

(note: native tunnel ports do not have this problem)


I don't think this is correct.  The code path is common for both system
and native tunnels.  I can reproduce the issues in a sandbox with:

$ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'"
[tutorial]$ ovs-vsctl add-port br0 tunnel_port \
 -- set Interface tunnel_port \
type=geneve options:remote_ip=flow options:key=123
[tutorial]$ ovs-vsctl del-port tunnel_port \
 -- add-port br0 tunnel_port2 \
 -- set Interface tunnel_port2 \
type=geneve options:remote_ip=flow options:key=123
ovs-vsctl: Error detected while setting up 'tunnel_port2':
could not add network device tunnel_port2 to ofproto (File exists).
See ovs-vswitchd log for details.

The same should work in a testsuite as well, i.e. we should be able to
create a test for this scenario.

Note: The --dummy=system prevents OVS from replacing tunnel ports with
   dummy ones.



The problem is how the tunnel port deletion and creation are handled. In
bridge_reconfigure(), port deletion is handled before creation, to avoid
such resource conflict. However, for kernel tunnel ports, the real clean
up is performed at the end of the bridge_reconfigure() in the:
bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct()

We cannot call bridge_run__() at an earlier point before all
reconfigurations are done, so this patch tries a generic approach to
just re-run the bridge_reconfigure() when there are any port creations
encountered "File exists" error, which indicates a possible resource
conflict may have happened due to a later deleted port, and retry may
succeed.

Signed-off-by: Han Zhou 
---
This is RFC because I am not sure if there is a better way to solve the problem
more specifically by executing the port_destruct for the old port before trying
to create the new port. The fix may be more complex though.


I don't think re-trying is a good approach in general.  We should likely
just destroy the tnl_port structure right away, similarly to how we clean
up stp, lldp and bundles in ofproto_port_unregister().  Maybe we can add
a new callback similar to bundle_remove() and call tnl_port_del() from it?
(I didn't try, so I'm not 100% sure this will not cause any issues.)

What do you think?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Hi Ilya and Han,

We hit this issue some days ago. As our analysis, it was introduced by
commit fe83f81df977 ("netdev: Remove netdev from global shash when the 
user is changing interface configuration.")


Reproduce:
  ovs-vsctl add-port br-int vxlan0 \
-- set interface vxlan0 type=vxlan options:remote_ip=10.10.10.1

  sleep 2

  ovs-vsctl --if-exists del-port vxlan0 \
-- add-port br-int vxlan1 \
-- set interface vxlan1 type=vxlan options:remote_ip=10.10.10.1
  ovs-vsctl: Error detected while setting up 'vxlan1': could not add
  network device vxlan1 to ofproto (File exists).

vswitchd log:
  bridge|INFO|bridge br-int: added interface vxlan0 on port 1106
  bridge|INFO|bridge br-int: deleted interface vxlan0 on port 1106
  tunnel|WARN|vxlan1: attempting to add tunnel port with same config as 
port 'vxlan0' (::->10.10.10.1, key=0, legacy_l2, dp port=122)

  ofproto|WARN|br-int: could not add port vxlan1 (File exists)
  bridge|WARN|could not add network device vxlan1 to ofproto (File exists)

CallTrace:
  bridge_reconfigure
bridge_del_ports
  port_destroy
iface_destroy__
  netdev_remove   

Re: [ovs-dev] [RFC] bridge: Retry tunnel port addition for conflict.

2024-03-06 Thread Ilya Maximets
On 2/27/24 20:14, Han Zhou wrote:
> For kernel datapath, when a new tunnel port is created in the same
> transaction in which an old tunnel port with the same tunnel
> configuration is deleted, the new tunnel port creation will fail and
> left in an error state. This can be easily reproduced in OVN by
> triggering a chassis deletion and addition with the same encap in the
> same SB DB transaction, such as:
> 
> ovn-sbctl chassis-add aa geneve 1.2.3.4
> ovn-sbctl chassis-del aa -- chassis-add bb 1.2.3.4
> 
> ovs-vsctl show | grep error
> error: "could not add network device ovn-bb-0 to ofproto (File exists)"
> 
> Related logs in OVS:
> —
> 2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted interface 
> ovn-aa-0 on port 113
> 2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bb-0: attempting to add 
> tunnel port with same config as port 'ovn-aa-0' (::->1.2.3.4, key=flow, 
> legacy_l2, dp port=9)
> 2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port 
> ovn-bb-0 (File exists)
> 2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network device 
> ovn-bb-0 to ofproto (File exists)
> —

Hi, Han.  Thanks for the patch!

> 
> Depending on when there are other OVSDB changes, it may take a long time
> for the device to be added successfully, triggered by the next OVS
> iteration.
> 
> (note: native tunnel ports do not have this problem)

I don't think this is correct.  The code path is common for both system
and native tunnels.  I can reproduce the issues in a sandbox with:

$ make -j8 sandbox SANDBOXFLAGS="\-\-dummy='system'"
[tutorial]$ ovs-vsctl add-port br0 tunnel_port \
-- set Interface tunnel_port \
   type=geneve options:remote_ip=flow options:key=123
[tutorial]$ ovs-vsctl del-port tunnel_port \
-- add-port br0 tunnel_port2 \
-- set Interface tunnel_port2 \
   type=geneve options:remote_ip=flow options:key=123
ovs-vsctl: Error detected while setting up 'tunnel_port2':
could not add network device tunnel_port2 to ofproto (File exists).
See ovs-vswitchd log for details.

The same should work in a testsuite as well, i.e. we should be able to
create a test for this scenario.

Note: The --dummy=system prevents OVS from replacing tunnel ports with
  dummy ones.

> 
> The problem is how the tunnel port deletion and creation are handled. In
> bridge_reconfigure(), port deletion is handled before creation, to avoid
> such resource conflict. However, for kernel tunnel ports, the real clean
> up is performed at the end of the bridge_reconfigure() in the:
> bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct()
> 
> We cannot call bridge_run__() at an earlier point before all
> reconfigurations are done, so this patch tries a generic approach to
> just re-run the bridge_reconfigure() when there are any port creations
> encountered "File exists" error, which indicates a possible resource
> conflict may have happened due to a later deleted port, and retry may
> succeed.
> 
> Signed-off-by: Han Zhou 
> ---
> This is RFC because I am not sure if there is a better way to solve the 
> problem
> more specifically by executing the port_destruct for the old port before 
> trying
> to create the new port. The fix may be more complex though.

I don't think re-trying is a good approach in general.  We should likely
just destroy the tnl_port structure right away, similarly to how we clean
up stp, lldp and bundles in ofproto_port_unregister().  Maybe we can add
a new callback similar to bundle_remove() and call tnl_port_del() from it?
(I didn't try, so I'm not 100% sure this will not cause any issues.)

What do you think?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [RFC] bridge: Retry tunnel port addition for conflict.

2024-02-27 Thread Han Zhou
For kernel datapath, when a new tunnel port is created in the same
transaction in which an old tunnel port with the same tunnel
configuration is deleted, the new tunnel port creation will fail and
left in an error state. This can be easily reproduced in OVN by
triggering a chassis deletion and addition with the same encap in the
same SB DB transaction, such as:

ovn-sbctl chassis-add aa geneve 1.2.3.4
ovn-sbctl chassis-del aa -- chassis-add bb 1.2.3.4

ovs-vsctl show | grep error
error: "could not add network device ovn-bb-0 to ofproto (File exists)"

Related logs in OVS:
—
2024-02-23T05:41:49.978Z|405933|bridge|INFO|bridge br-int: deleted interface 
ovn-aa-0 on port 113
2024-02-23T05:41:49.989Z|405935|tunnel|WARN|ovn-bb-0: attempting to add 
tunnel port with same config as port 'ovn-aa-0' (::->1.2.3.4, key=flow, 
legacy_l2, dp port=9)
2024-02-23T05:41:49.989Z|405936|ofproto|WARN|br-int: could not add port 
ovn-bb-0 (File exists)
2024-02-23T05:41:49.989Z|405937|bridge|WARN|could not add network device 
ovn-bb-0 to ofproto (File exists)
—

Depending on when there are other OVSDB changes, it may take a long time
for the device to be added successfully, triggered by the next OVS
iteration.

(note: native tunnel ports do not have this problem)

The problem is how the tunnel port deletion and creation are handled. In
bridge_reconfigure(), port deletion is handled before creation, to avoid
such resource conflict. However, for kernel tunnel ports, the real clean
up is performed at the end of the bridge_reconfigure() in the:
bridge_run__()->ofproto_run()->...->ofproto_dpif:port_destruct()

We cannot call bridge_run__() at an earlier point before all
reconfigurations are done, so this patch tries a generic approach to
just re-run the bridge_reconfigure() when there are any port creations
encountered "File exists" error, which indicates a possible resource
conflict may have happened due to a later deleted port, and retry may
succeed.

Signed-off-by: Han Zhou 
---
This is RFC because I am not sure if there is a better way to solve the problem
more specifically by executing the port_destruct for the old port before trying
to create the new port. The fix may be more complex though.

 tests/tunnel.at   |  1 +
 vswitchd/bridge.c | 47 ---
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/tests/tunnel.at b/tests/tunnel.at
index 71e7c2df4eae..d570c78790c7 100644
--- a/tests/tunnel.at
+++ b/tests/tunnel.at
@@ -1059,6 +1059,7 @@ AT_CHECK([ovs-vsctl add-port br0 p2 -- set Interface p2 
type=vxlan \
 
 AT_CHECK([grep 'p2: could not set configuration (File exists)' 
ovs-vswitchd.log | sed "s/^.*\(p2:.*\)$/\1/"], [0],
   [p2: could not set configuration (File exists)
+p2: could not set configuration (File exists)
 ])
 
 OVS_VSWITCHD_STOP(["/p2: VXLAN-GBP, and non-VXLAN-GBP tunnels can't be 
configured on the same dst_port/d
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index 95a65fcdcd5e..9057da98e6c0 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -278,7 +278,7 @@ static void bridge_delete_ofprotos(void);
 static void bridge_delete_or_reconfigure_ports(struct bridge *);
 static void bridge_del_ports(struct bridge *,
  const struct shash *wanted_ports);
-static void bridge_add_ports(struct bridge *,
+static bool bridge_add_ports(struct bridge *,
  const struct shash *wanted_ports);
 
 static void bridge_configure_datapath_id(struct bridge *);
@@ -333,8 +333,8 @@ static void mirror_refresh_stats(struct mirror *);
 
 static void iface_configure_lacp(struct iface *,
  struct lacp_member_settings *);
-static bool iface_create(struct bridge *, const struct ovsrec_interface *,
- const struct ovsrec_port *);
+static int iface_create(struct bridge *, const struct ovsrec_interface *,
+const struct ovsrec_port *);
 static bool iface_is_internal(const struct ovsrec_interface *iface,
   const struct ovsrec_bridge *br);
 static const char *iface_get_type(const struct ovsrec_interface *,
@@ -858,7 +858,9 @@ datapath_reconfigure(const struct ovsrec_open_vswitch *cfg)
 }
 }
 
-static void
+/* Returns true if any ports addition failed and may need retry. Otherwise
+ * return false. */
+static bool
 bridge_reconfigure(const struct ovsrec_open_vswitch *ovs_cfg)
 {
 struct sockaddr_in *managers;
@@ -943,8 +945,11 @@ bridge_reconfigure(const struct ovsrec_open_vswitch 
*ovs_cfg)
 
 config_ofproto_types(&ovs_cfg->other_config);
 
+bool need_retry = false;
 HMAP_FOR_EACH (br, node, &all_bridges) {
-bridge_add_ports(br, &br->wanted_ports);
+if (bridge_add_ports(br, &br->wanted_ports)) {
+need_retry = true;
+}
 shash_destroy(&br->wanted_ports);
 }
 
@@ -1003,6 +1008,7 @@ bridge_reconfigure(const struct ovsrec_open_vsw