On Thu, Jan 29, 2026 at 02:38:14PM +0100, Dumitru Ceara wrote:
> Hi Tiago, Mairtin,
> 
> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
> > > On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
> > > <[email protected]> wrote:
> >>
> >> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis wrote:
> >>> Hi everyone,
> >>>
> >>> I have been working on implementing incremental processing in OVN-IC and
> >>> encountered a design issue regarding how OVN-IC handles multi-AZ writes.
> >>>
> >>> The Issue
> >>> In a scenario where multiple AZs are connected via OVN-IC, certain events
> >>> trigger all AZs to attempt writing the same data to the ISB/INB
> >>> simultaneously. This race condition leads to a constraint violation, which
> >>> causes the transaction to fail and forces a full recompute.
> >>>
> >>> Example:
> >>> A clear example of this can be seen in ovn-ic.c:ts_run:
> >>>
> >>>     if (ctx->ovnisb_txn) {
> >>>         /* Create ISB Datapath_Binding */
> >>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
> >>>             const struct icsbrec_datapath_binding *isb_dp =
> >>>                 shash_find_and_delete(isb_ts_dps, ts->name);
> >>>             if (!isb_dp) {
> >>>                 /* Allocate tunnel key */
> >>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>                                                  "transit switch datapath");
> >>>                 if (!dp_key) {
> >>>                     continue;
> >>>                 }
> >>>
> >>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
> >>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp, ts->name);
> >>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>             } else if (dp_key_refresh) {
> >>>                 /* Refresh tunnel key since encap mode has changed. */
> >>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
> >>>                                                  "transit switch datapath");
> >>>                 if (dp_key) {
> >>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
> >>>                 }
> >>>             }
> >>>
> >>>             if (!isb_dp->type) {
> >>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
> >>>             }
> >>>
> >>>             if (!isb_dp->nb_ic_uuid) {
> >>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
> >>>                                                         &ts->header_.uuid, 1);
> >>>             }
> >>>         }
> >>>
> >>>         struct shash_node *node;
> >>>         SHASH_FOR_EACH (node, isb_ts_dps) {
> >>>             icsbrec_datapath_binding_delete(node->data);
> >>>         }
> >>>     }
> >>>
> >>> When a new transit-switch is created, every AZ attempts to create the same
> >>> datapath_binding on the ISB. Only one request succeeds; the others fail
> >>> with a "constraint-violation."
> >>>
> >>> Impact:
> >>> This behavior negates the performance benefits of implementing incremental
> >>> processing, as the system falls back to a full recompute upon these
> >>> failures.
> >>>
> >>> For development purposes, I am currently ignoring these errors. The
> >>> ideal fix would be a mechanism where only a single AZ handles the
> >>> writes, but this would require implementing some consensus protocol.
> >>>
> >>> Does anyone have any advice on how we can fix this issue?
> >> ovn-ic in each AZ enumerates all existing ISB datapaths in the
> >> enumerate_datapaths function, then attempts to add any missing
> >> datapaths. Since multiple AZs will attempt to add the same missing
> >> entry, all but the first will fail, causing transaction errors.
> >> Currently, ovn-ic will then enumerate the ISB datapaths again, see the
> >> entry that succeeded, and continue to create the NB entry in the local
> >> AZ. This does cause a transaction error on all but one AZ whenever a
> >> transit router is added, but we currently don't have a mechanism to
> >> manage this gracefully across multiple AZs.
> > 
> > Hi Mairtin, thanks for the reply.
> > 
> > Since there is no mechanism to manage which AZ should insert the data,
> > the only good solution I came up with, besides implementing a
> > full-fledged consensus algorithm like Raft to select a leader AZ, is to
> > simply set an option in IC_NB_Global to manually configure a specific AZ
> > as the leader, and to check in the code whether the local AZ is the
> > leader.
> > 
> > Example:
> > $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
> > 
> > In the code:
> > 
> > const struct icnbrec_ic_nb_global *icnb_global =
> >     icnbrec_ic_nb_global_table_first(ic_nb_global_table);
> > 
> > const struct nbrec_nb_global *nb_global =
> >     nbrec_nb_global_table_first(nb_global_table);
> > 
> > const char *leader = smap_get(&icnb_global->options, "leader");
> > if (leader && !strcmp(leader, nb_global->name)) {
> >     /* Insert logic here. */
> > }
> > 
> > Do you have any opinion on this approach?
> > 
> 
> I was thinking of something a bit different (not too different though).
> 
> The hierarchy is:
> 
>              IC-NB
>                |
> ovn-ic (AZ1)  ovn-ic (AZ2)  ...  ovn-ic (AZN)
>                |
>              IC-SB
> 
> Conceptually this is similar to the intra-az hierarchy:
> 
>                       NB
>                       |
> ovn-northd (active)  ovn-northd (backup)  ...  ovn-northd (backup)
>                       |
>                       SB
> 
> The way the instances synchronize is by taking the (single) SB database
> lock.  Only one northd succeeds, so that one becomes the "active".
> 
> What if we do the same for ovn-ic?
> 
> Make all ovn-ic try to take the IC-SB lock.  Only the one that succeeds
> becomes "active" and may write to the IC-SB.
> 
> That has one implication though: the active instance (it can be any
> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
> datapaths for other AZs are up to date.  Today it only takes care of the
> resources for its own AZ.
> 
> Each ovn-ic, whether active or backup, is still responsible for writing to
> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB
> centralized databases.
> 
> I didn't check the code for this in too much detail though, so there
> might be other things to consider.
> 
> What do you think?
> 
> Regards,
> Dumitru
> 
> >>>
> >>> Thanks,
> >>> Tiago Matos
> >>>
> >>
> >>
> >> Hi Tiago,
> >>
> >> I ran into similar issues when adding transit router support and have
> >> added a comment above. I also have been working on OVN-IC related
> >> features, so if you would like to discuss above issue further or other
> >> OVN-IC work I would like to help.
> >>
> >> Regards,
> >> Mairtin
> >>
> > 
> > 
> > Regards,
> > Tiago Matos
> > 
> 

Hi Dumitru,

A lock similar to the one northd uses seems like a good solution. Do you
think serializing access to the ISB might have a significant negative
performance impact?
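
To make sure we are talking about the same thing, here is a minimal,
self-contained sketch of the active/backup idea. It is only a model, not
ovn-ic code: struct ic_sb_lock, struct ic_instance, ic_try_lock(),
ic_release_lock() and ic_run() are all hypothetical names made up for
illustration, and the lock is an in-memory stand-in for the IC-SB database
lock. As far as I know northd does the real thing through the OVSDB IDL
lock calls (ovsdb_idl_set_lock() and ovsdb_idl_has_lock()), so I am
assuming ovn-ic would do something similar.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the single IC-SB database lock. */
struct ic_sb_lock {
    const char *owner;          /* AZ name of the holder, NULL if free. */
};

/* Hypothetical stand-in for one ovn-ic instance. */
struct ic_instance {
    const char *az_name;
    unsigned int n_isb_writes;  /* ISB transactions this instance issued. */
};

/* Tries to take the lock; returns true if 'ic' holds it afterwards. */
static bool
ic_try_lock(struct ic_sb_lock *lock, const struct ic_instance *ic)
{
    assert(ic->az_name);
    if (!lock->owner) {
        lock->owner = ic->az_name;
    }
    return !strcmp(lock->owner, ic->az_name);
}

/* Releases the lock if 'ic' holds it, e.g. when that ovn-ic goes away. */
static void
ic_release_lock(struct ic_sb_lock *lock, const struct ic_instance *ic)
{
    if (lock->owner && !strcmp(lock->owner, ic->az_name)) {
        lock->owner = NULL;
    }
}

/* One main-loop iteration.  Returns true if this instance was active and
 * therefore wrote to the ISB.  A backup skips the ISB writes (it would
 * still sync its local NB, which is not modeled here). */
static bool
ic_run(struct ic_sb_lock *lock, struct ic_instance *ic)
{
    if (!ic_try_lock(lock, ic)) {
        return false;
    }
    ic->n_isb_writes++;
    return true;
}
```

With several instances calling ic_run() against the same lock, only the
first caller ends up writing. Once the holder calls ic_release_lock(), the
next caller takes over as active. The sketch says nothing about the cost
of the active instance handling every AZ's ISB resources, which is the
part I am unsure about.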

Mairtin

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
