On 1/29/26 3:26 PM, Mairtin O'Loingsigh wrote:
> On Thu, Jan 29, 2026 at 02:38:14PM +0100, Dumitru Ceara wrote:
>> Hi Tiago, Mairtin,
>>
>> On 1/29/26 2:24 PM, Tiago Matos Carvalho Reis via discuss wrote:
>>> On Thu, Jan 29, 2026 at 09:11, Mairtin O'Loingsigh
>>> <[email protected]> wrote:
>>>>
>>>> On Wed, Jan 28, 2026 at 03:55:26PM -0300, Tiago Matos Carvalho Reis wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I have been working on implementing incremental processing in OVN-IC and
>>>>> encountered a design issue regarding how OVN-IC handles multi-AZ writes.
>>>>>
>>>>> The Issue
>>>>> In a scenario where multiple AZs are connected via OVN-IC, certain events
>>>>> trigger all AZs to attempt writing the same data to the ISB/INB
>>>>> simultaneously. This race condition leads to a constraint violation, which
>>>>> causes the transaction to fail and forces a full recompute.
>>>>>
>>>>> Example:
>>>>> A clear example of this can be seen in ovn-ic.c:ts_run:
>>>>>
>>>>>     if (ctx->ovnisb_txn) {
>>>>>         /* Create ISB Datapath_Binding */
>>>>>         ICNBREC_TRANSIT_SWITCH_FOR_EACH (ts, ctx->ovninb_idl) {
>>>>>             const struct icsbrec_datapath_binding *isb_dp =
>>>>>                 shash_find_and_delete(isb_ts_dps, ts->name);
>>>>>             if (!isb_dp) {
>>>>>                 /* Allocate tunnel key */
>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>                                                  "transit switch datapath");
>>>>>                 if (!dp_key) {
>>>>>                     continue;
>>>>>                 }
>>>>>
>>>>>                 isb_dp = icsbrec_datapath_binding_insert(ctx->ovnisb_txn);
>>>>>                 icsbrec_datapath_binding_set_transit_switch(isb_dp,
>>>>>                                                             ts->name);
>>>>>                 icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>             } else if (dp_key_refresh) {
>>>>>                 /* Refresh tunnel key since encap mode has changed. */
>>>>>                 int64_t dp_key = allocate_dp_key(dp_tnlids, vxlan_mode,
>>>>>                                                  "transit switch datapath");
>>>>>                 if (dp_key) {
>>>>>                     icsbrec_datapath_binding_set_tunnel_key(isb_dp, dp_key);
>>>>>                 }
>>>>>             }
>>>>>
>>>>>             if (!isb_dp->type) {
>>>>>                 icsbrec_datapath_binding_set_type(isb_dp, "transit-switch");
>>>>>             }
>>>>>
>>>>>             if (!isb_dp->nb_ic_uuid) {
>>>>>                 icsbrec_datapath_binding_set_nb_ic_uuid(isb_dp,
>>>>>                                                         &ts->header_.uuid,
>>>>>                                                         1);
>>>>>             }
>>>>>         }
>>>>>
>>>>>         struct shash_node *node;
>>>>>         SHASH_FOR_EACH (node, isb_ts_dps) {
>>>>>             icsbrec_datapath_binding_delete(node->data);
>>>>>         }
>>>>>     }
>>>>>
>>>>> When a new transit-switch is created, every AZ attempts to create the same
>>>>> datapath_binding on the ISB. Only one request succeeds; the others fail
>>>>> with a "constraint-violation."
>>>>>
>>>>> Impact:
>>>>> This behavior negates the performance benefits of implementing incremental
>>>>> processing, as the system falls back to a full recompute upon these
>>>>> failures.
>>>>>
>>>>> For development purposes, I am currently ignoring these errors, but the
>>>>> ideal fix would be a mechanism where only a single AZ handles the writes,
>>>>> and that would require implementing some consensus protocol.
>>>>>
>>>>> Does anyone have any advice on how we can fix this issue?
>>>> ovn-ic in each AZ enumerates all existing ISB datapaths in the
>>>> enumerate_datapaths function, then attempts to add any missing datapaths.
>>>> Since multiple AZs will attempt to add the same missing entry, all but the
>>>> first will fail, causing transaction errors. Currently, ovn-ic then
>>>> enumerates the ISB datapaths again, sees the entry that succeeded and
>>>> continues to create the NB records in the local AZ. This does cause a
>>>> transaction error on all but one AZ whenever a transit router is added,
>>>> but we currently don't have a mechanism to manage this gracefully across
>>>> multiple AZs.
>>>
>>> Hi Mairtin, thanks for the reply.
>>>
>>> Since there is no mechanism to manage which AZ should insert the data,
>>> the only good solution I came up with, short of implementing a
>>> full-fledged consensus algorithm like Raft to select a leader AZ, is to
>>> set an option in IC_NB_Global that manually configures a specific AZ as
>>> the leader, and to check in the code whether the local AZ is the leader.
>>>
>>> Example:
>>> $ ovn-ic-nbctl set IC_NB_Global . options:leader=az1
>>>
>>> In the code:
>>>
>>>     const struct icnbrec_ic_nb_global *icnb_global =
>>>         icnbrec_ic_nb_global_table_first(ic_nb_global_table);
>>>
>>>     const struct nbrec_nb_global *nb_global =
>>>         nbrec_nb_global_table_first(nb_global_table);
>>>
>>>     /* smap_get() returns NULL when the option is unset. */
>>>     const char *leader = smap_get(&icnb_global->options, "leader");
>>>     if (leader && !strcmp(leader, nb_global->name)) {
>>>         /* Insert logic here. */
>>>     }
>>>
>>> Do you have any opinion on this approach?
>>>
>>
>> I was thinking of something a bit different (not too different, though).
>>
>> The hierarchy is:
>>
>>                        IC-NB
>>                          |
>>   ovn-ic (AZ1)   ovn-ic (AZ2)   ...   ovn-ic (AZN)
>>                          |
>>                        IC-SB
>>
>> Conceptually this is similar to the intra-AZ hierarchy:
>>
>>                          NB
>>                          |
>>   ovn-northd (active)   ovn-northd (backup)   ...   ovn-northd (backup)
>>                          |
>>                          SB
>>
>> The way the instances synchronize is by taking the (single) SB database
>> lock. Only one northd succeeds, so that one becomes the "active" one.
>>
>> What if we do the same for ovn-ic?
>>
>> Make every ovn-ic try to take the IC-SB lock. Only the one that succeeds
>> becomes "active" and may write to the IC-SB.
>>
>> That has one implication, though: the active instance (which can be any
>> ovn-ic in any AZ) must also make sure the IC-SB port bindings and
>> datapaths for other AZs are up to date. Today it only takes care of the
>> resources for its own AZ.
>>
>> Each ovn-ic, both active and backup, is still responsible for writing to
>> the per-AZ OVN NB database based on the contents of the IC-NB and IC-SB
>> centralized databases.
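The difference between the current behavior and the lock-based proposal can be sketched as a small, self-contained C model. It does not use the OVSDB IDL or any ovn-ic code: `struct fake_isb`, `isb_try_lock()`, `run_all_writers()`, `run_lock_holder_only()` and the three-AZ setup are hypothetical, and the lock is a simple first-come model with no failover. It only illustrates the arithmetic: when every AZ inserts a missing transit-switch binding based on the same snapshot, N-1 inserts fail, whereas with a single lock holder doing the IC-SB writes none do.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BINDINGS 16
#define N_AZS 3                 /* Hypothetical number of AZs. */

/* Hypothetical in-memory stand-in for the IC-SB.  This is NOT the OVSDB
 * or ovn-ic API; it only models uniqueness and a single named lock. */
struct fake_isb {
    const char *ts_names[MAX_BINDINGS]; /* Datapath_Binding transit_switch. */
    int n_bindings;
    int lock_owner;             /* AZ holding the lock, or -1 if none. */
};

static bool
isb_has_binding(const struct fake_isb *isb, const char *ts_name)
{
    for (int i = 0; i < isb->n_bindings; i++) {
        if (!strcmp(isb->ts_names[i], ts_name)) {
            return true;
        }
    }
    return false;
}

/* Models the uniqueness constraint: inserting an existing transit switch
 * fails, like the "constraint-violation" reported in the thread. */
static bool
isb_insert_binding(struct fake_isb *isb, const char *ts_name)
{
    if (isb_has_binding(isb, ts_name)) {
        return false;
    }
    assert(isb->n_bindings < MAX_BINDINGS);
    isb->ts_names[isb->n_bindings++] = ts_name;
    return true;
}

/* Models a named database lock: the first requester gets it and keeps it
 * (no failover is modeled). */
static bool
isb_try_lock(struct fake_isb *isb, int az)
{
    if (isb->lock_owner < 0) {
        isb->lock_owner = az;
    }
    return isb->lock_owner == az;
}

/* Current behavior: every AZ reads the same snapshot, sees the binding
 * missing, and inserts it.  Returns the number of failed inserts. */
int
run_all_writers(struct fake_isb *isb, const char *ts_name)
{
    bool missing = !isb_has_binding(isb, ts_name);  /* Shared snapshot. */
    int failures = 0;

    for (int az = 0; az < N_AZS; az++) {
        if (missing && !isb_insert_binding(isb, ts_name)) {
            failures++;
        }
    }
    return failures;
}

/* Proposed behavior: only the lock holder writes to the IC-SB, on behalf
 * of every AZ; the other instances only read it. */
int
run_lock_holder_only(struct fake_isb *isb, const char *ts_name)
{
    int failures = 0;

    for (int az = 0; az < N_AZS; az++) {
        if (!isb_try_lock(isb, az)) {
            continue;           /* Backup: no IC-SB writes. */
        }
        if (!isb_has_binding(isb, ts_name)
            && !isb_insert_binding(isb, ts_name)) {
            failures++;
        }
    }
    return failures;
}
```

For example, with `N_AZS` set to 3, calling `run_all_writers()` on an empty `fake_isb` (with `lock_owner = -1`) reports two failed inserts, while `run_lock_holder_only()` reports none and leaves AZ 0 as the lock holder. In a real deployment the lock would presumably be the same kind of OVSDB lock that, per the thread, ovn-northd takes on the SB database.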
>>
>> I didn't check the code for this in much detail, though, so there
>> might be other things to consider.
>>
>> What do you think?
>>
>> Regards,
>> Dumitru
>>
>>>>>
>>>>> Thanks,
>>>>> Tiago Matos
>>>>
>>>>
>>>> Hi Tiago,
>>>>
>>>> I ran into similar issues when adding transit router support and have
>>>> added a comment above. I have also been working on OVN-IC related
>>>> features, so if you would like to discuss the above issue further, or
>>>> other OVN-IC work, I would be happy to help.
>>>>
>>>> Regards,
>>>> Mairtin
>>>>
>>>
>>>
>>> Regards,
>>> Tiago Matos
>>>
>>
>
> Hi Dumitru,
>

Hi Mairtin,

> A lock similar to northd seems like a good solution. Do you think
> serializing access to ISB might have a significant negative performance
> impact?
>

It's hard to quantify, I guess. On one hand, we'd have a single ovn-ic
instance doing all the IC-SB writes (we're not really serializing: one
instance is read-write and the others are read-only). On the other hand,
we wouldn't have transaction failures every time a datapath is added.

However, I suspect that in most cases the data in the IC databases can be
handled by a single ovn-ic instance without excessive resource usage. It
would be great to test that, though.

Regards,
Dumitru

> Mairtin

_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
