Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo:

Do you mean that the same hash for the nat conn and the conntrack conn will trigger this issue? I am also encountering this assert now, but I have not figured out a clear scenario that reproduces it. Can you provide a test case that triggers it?

Best regards.
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo:

Is it wrong for flows whose 5-tuples hash to the same value to land in the same CT table bucket? And if we "avoid including two keys with the same hash belonging to the same connection, even for the nat case", isn't that a waste of the CT table's resources?
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo, Hi Michael,

I want to confirm that the following patch works on Open vSwitch 3.0.3; the OVS crash no longer happens after patching.

https://patchwork.ozlabs.org/project/openvswitch/patch/168192964823.4031872.3228556334798413886.st...@fed.void/

But currently I see some logs like the following, and I'm not sure whether they are related to the above patch:

2023-05-23T08:35:18.383Z|5|conntrack(pmd-c49/id:104)|WARN|Unable to NAT due to tuple space exhaustion - if DoS attack, use firewalling and/or zone partitioning.

Any ideas?

Best regards.

> Date: Thu, 04 May 2023 19:24:53 +0200
> From: Paolo Valerio
> To: Lazuardi Nasution
> Cc: , ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day
> Message-ID: <871qjwt3fe@fed.void>
> Content-Type: text/plain; charset=utf-8
>
> Lazuardi Nasution writes:
>
> > Hi Paolo,
> >
> > Should we combine this patch too?
> >
> > https://patchwork.ozlabs.org/project/openvswitch/patch/168192964823.4031872.3228556334798413886.st...@fed.void/
>
> Hi,
>
> no, it basically does the same thing in a slightly different way,
> reducing the need for modification in the case of backporting to
> previous versions.
>
> > Best regards.
> >
> > On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio wrote:
> >
> > Hello,
> >
> > thanks for reporting this.
> > [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
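A note on that warning: as the 2.17 patch quoted later in this thread shows, the "Unable to NAT due to tuple space exhaustion" message is emitted from the nat_res_exhaustion: path of conn_not_found(), i.e. (as far as the quoted code shows) when a new connection needs a NAT binding but no free tuple could be picked. It is rate-limited (VLOG_RATE_LIMIT_INIT(5, 5)) and independent of the crash fix itself; persistent occurrences suggest the configured NAT range genuinely ran out of free source address/port combinations, for which the message recommends firewalling or zone partitioning.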
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Lazuardi Nasution writes:

> Hi Paolo,
>
> Should we combine this patch too?
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/168192964823.4031872.3228556334798413886.st...@fed.void/

Hi,

no, it basically does the same thing in a slightly different way, reducing the need for modification in the case of backporting to previous versions.

> Best regards.
>
> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio wrote:
>
> Hello,
>
> thanks for reporting this.
> [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo,

Should we combine this patch too?

https://patchwork.ozlabs.org/project/openvswitch/patch/168192964823.4031872.3228556334798413886.st...@fed.void/

Best regards.

On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio wrote:

> Hello,
>
> thanks for reporting this.
> [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
"Plato, Michael" writes: > Hi Paolo, > I installed the patch for 2.17 on april 6th in our test environment and can > confirm that it works. We haven't had any crashes since then. Many thanks for > the quick solution! > Hi Micheal, Nice! That's helpful. Thanks for testing it. Paolo > Best regards > > Michael > > -Ursprüngliche Nachricht- > Von: Paolo Valerio > Gesendet: Montag, 17. April 2023 10:36 > An: Lazuardi Nasution > Cc: ovs-discuss@openvswitch.org; Plato, Michael > Betreff: Re: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day > > Lazuardi Nasution writes: > >> Hi Paolo, >> >> I'm interested in your statement of "expired connections (but not yet >> reclaimed)". Do you think that shortening conntrack timeout policy will help? >> Or, should we make it larger so there will be fewer conntrack table >> update and flush attempts? >> > > it's hard to say as it depends on the specific use case. > Probably making it larger for the specific case could help, but in general, I > would not rely on that. > Of course, an actual fix is needed. It would be great if the patch sent could > tested, but in any case, I'll work on a formal patch. > >> Best regards. >> >> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio wrote: >> >> Hello, >> >> thanks for reporting this. >> I had a look at it, and, although this needs to be confirmed, I suspect >> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but >> not yet reclaimed). >> >> The nat part does not necessarily perform any actual translation, but >> could still be triggered by ct(nat(src)...) which is the all-zero binding >> to avoid collisions, if any. >> >> Is there any chance to test the following patch (targeted for ovs 2.17)? >> This should help to confirm. >> >> -- >8 -- >> diff --git a/lib/conntrack.c b/lib/conntrack.c >> index 08da4ddf7..ba334afb0 100644 >> --- a/lib/conntrack.c >> +++ b/lib/conntrack.c >> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct >> conn_key *); >> static struct conn *new_conn(struct conntrack *ct, struct dp_packet >> *pkt, >> struct conn_key *, long long now, >> uint32_t tp_id); >> -static void delete_conn_cmn(struct conn *); >> +static void delete_conn__(struct conn *); >> static void delete_conn(struct conn *); >> -static void delete_conn_one(struct conn *conn); >> static enum ct_update_res conn_update(struct conntrack *ct, struct conn >> *conn, >> struct dp_packet *pkt, >> struct conn_lookup_ctx *ctx, >> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t >> zone) >> } >> >> static void >> -conn_clean_cmn(struct conntrack *ct, struct conn *conn) >> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash) >> OVS_REQUIRES(ct->ct_lock) >> { >> if (conn->alg) { >> expectation_clean(ct, &conn->key); >> } >> >> - uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis); >> cmap_remove(&ct->conns, &conn->cm_node, hash); >> >> struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone); >> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn) >> OVS_REQUIRES(ct->ct_lock) >> { >> ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT); >> + uint32_t conn_hash = conn_key_hash(&conn->key, >> ct->hash_basis); >> >> - conn_clean_cmn(ct, conn); >> + conn_clean_cmn(ct, conn, conn_hash); >> if (conn->nat_conn) { >> uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct-> >> hash_basis); >> - cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash); >> + if (conn_hash != hash) { >> + cmap_remove(&ct->conns, &conn->nat_conn->cm_node, 
hash); >> + } >> } >> ovs_list_remove(&conn->exp_node); >> conn->cleaned = true; >> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn) >> atomic_co
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo,

I installed the patch for 2.17 on April 6th in our test environment and can confirm that it works. We haven't had any crashes since then. Many thanks for the quick solution!

Best regards

Michael

-----Original Message-----
From: Paolo Valerio
Sent: Monday, April 17, 2023 10:36
To: Lazuardi Nasution
Cc: ovs-discuss@openvswitch.org; Plato, Michael
Subject: Re: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

[... quoted exchange on timeouts and the 2.17 patch snipped; see the following messages in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Lazuardi Nasution writes:

> Hi Paolo,
>
> I'm interested in your statement of "expired connections (but not yet
> reclaimed)". Do you think that shortening the conntrack timeout policy will
> help? Or, should we make it larger so there will be fewer conntrack table
> update and flush attempts?

it's hard to say as it depends on the specific use case. Probably making it larger for the specific case could help, but in general, I would not rely on that. Of course, an actual fix is needed. It would be great if the patch sent could be tested, but in any case, I'll work on a formal patch.

> Best regards.
>
> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio wrote:
>
> Hello,
>
> thanks for reporting this.
> [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo,

I'm interested in your statement of "expired connections (but not yet reclaimed)". Do you think that shortening the conntrack timeout policy will help? Or, should we make it larger so there will be fewer conntrack table update and flush attempts?

Best regards.

On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio wrote:

> Hello,
>
> thanks for reporting this.
> [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Lazuardi Nasution writes:

> Hi Paolo,
>
> Would you mind explaining this to me? Currently, I'm still looking for the
> compile options of the installed OVS-DPDK from the Ubuntu repo. After that,
> I'll try your patch and compile it with the same options.

the idea is to avoid including two keys with the same hash belonging to the same connection, even for the nat case. Considering a flow like this:

tcp,in_port="ovs-p0" actions=ct(commit,nat(src)),output:"ovs-p1"

and a TCP syn matching this rule, an entry in ct is created. Normally, if no other packets refresh the entry or move the state, it times out in 30s. You can see that with:

ovs-appctl dpctl/dump-conntrack -s

tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=47838,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=47838),timeout=30,protoinfo=(state=SYN_SENT)

There's a timespan between the expiration and the actual clean-up of the connection. If a new connection with the same 5-tuple (or even a retransmission) is received in that timespan, the issue should occur.

In ovs 3.x the patch (intended for testing only) should be slightly different as some things changed there. This should be enough for a quick test:

-- >8 --
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 13c5ab628..7f6f1c2a8 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -481,8 +481,10 @@ conn_clean__(struct conntrack *ct, struct conn *conn)
     cmap_remove(&ct->conns, &conn->cm_node, hash);
 
     if (conn->nat_conn) {
-        hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
-        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+        uint32_t nc_hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
+        if (hash != nc_hash) {
+            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, nc_hash);
+        }
     }
 
     rculist_remove(&conn->node);
@@ -1090,7 +1092,9 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
         nat_conn->alg = NULL;
         nat_conn->nat_conn = NULL;
         uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
-        cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+        if (nat_hash != ctx->hash) {
+            cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+        }
     }
 
     nc->nat_conn = nat_conn;

> Best regards.
>
> On Wed, Apr 5, 2023, 2:51 AM Paolo Valerio wrote:
>
> Hello,
>
> thanks for reporting this.
> [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
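To make the collision Paolo describes concrete: with an all-zero nat(src) binding that performs no actual translation, the nat_conn key is simply the reverse of the conn key, and the conn key hash is evidently symmetric in src and dst (that is what makes nat_hash == ctx->hash possible in the patch above). The following standalone sketch -- not OVS code; the tuple layout and the mixing function are simplified stand-ins for conn_key_hash() -- demonstrates the property with the addresses from the dump-conntrack example:

#include <stdint.h>
#include <stdio.h>

struct tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Stand-in for per-endpoint hashing; the real code uses OVS hash helpers. */
static uint32_t
mix_endpoint(uint32_t ip, uint16_t port, uint32_t basis)
{
    uint32_t h = basis ^ ip;
    return h * 0x85ebca6bU + port;
}

/* Symmetric combination: swapping src and dst yields the same value. */
static uint32_t
tuple_hash(const struct tuple *t, uint32_t basis)
{
    return mix_endpoint(t->src_ip, t->src_port, basis)
           ^ mix_endpoint(t->dst_ip, t->dst_port, basis);
}

int
main(void)
{
    struct tuple fwd = { 0x0a010101, 0x0a010102, 47838, 8080 };
    struct tuple rev = { 0x0a010102, 0x0a010101, 8080, 47838 };

    /* Prints the same value twice: the conn key and the untranslated
     * nat_conn (reverse) key would land in the same cmap bucket. */
    printf("fwd=%08x rev=%08x\n",
           tuple_hash(&fwd, 42), tuple_hash(&rev, 42));
    return 0;
}

With the hashes equal, inserting the nat_conn as a second cmap node creates a duplicate at the very same hash; the guarded cmap_insert()/cmap_remove() in the patches avoids exactly that.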
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo,

many thanks for the patch. I'll try it asap...

Regards

Michael

-----Original Message-----
From: Paolo Valerio
Sent: Tuesday, April 4, 2023 21:51
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael; mrxlazuar...@gmail.com
Subject: Re: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hello,

thanks for reporting this.
I had a look at it, and, although this needs to be confirmed, I suspect it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but not yet reclaimed).

The nat part does not necessarily perform any actual translation, but could still be triggered by ct(nat(src)...) which is the all-zero binding to avoid collisions, if any.

Is there any chance to test the following patch (targeted for ovs 2.17)? This should help to confirm.

[... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Michael,

Great to know that. I will try it on my cluster too. Btw, do you know how to find the compile options of the OVS-DPDK package from the Ubuntu repo?

Best regards.

On Wed, Apr 5, 2023, 1:56 PM Plato, Michael wrote:

> Hi,
>
> yes our k8s cluster is on the same subnet. I stopped one of the etcd nodes
> yesterday which triggers a lot of reconnection attempts from the other
> cluster members. Still no issues so far and no ovs crashes 😊
>
> Regards
>
> Michael
>
> [... rest of the quoted exchange snipped; it appears in full in the next message ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi,

yes our k8s cluster is on the same subnet. I stopped one of the etcd nodes yesterday which triggers a lot of reconnection attempts from the other cluster members. Still no issues so far and no ovs crashes 😊

Regards

Michael

From: Lazuardi Nasution
Sent: Tuesday, April 4, 2023 09:56
To: Plato, Michael
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi Michael,

I assume that your k8s cluster is on the same subnet, right? Would you mind testing it by shutting down one of the etcd instances to see if this bug still exists?

Best regards.

On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael <michael.pl...@tu-berlin.de> wrote:

Hi,

from my perspective the patch works for all cases. My test environment runs with several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Tuesday, April 4, 2023 09:41
To: Plato, Michael <michael.pl...@tu-berlin.de>
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi Michael,

Is your patch working on same-subnet unreachable traffic too? In my case, crashes happen when there are too many unreachable replies, even from the same subnet. For example, when one of the etcd instances is down, there will be huge reconnection attempts and then unreachable replies from the destination VM where the down etcd instance exists.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael <michael.pl...@tu-berlin.de> wrote:

Hi,

I have some news on this topic. Unfortunately I could not find the root cause. But I managed to implement a workaround (see patch in attachment). The basic idea is to mark the nat flows as invalid if there is no longer an associated connection. From my point of view it is a race condition. It can be triggered by many short-lived connections. With the patch we no longer have any crashes. I can't say if it has any negative effects though, as I'm not an expert. So far I haven't found any problems at least. Without this patch we had hundreds of crashes a day :/

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Monday, April 3, 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael <michael.pl...@tu-berlin.de>
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi,

Is this related to the following glibc bug? I'm not so sure about this because when I check the glibc source of the installed version (2.35), the proposed patch has been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use a stateful ACL, which is related to conntrack. The racing situation happens when massive unreachable replies are received. For example, if I run etcd on VMs but one etcd node has been disabled, which causes massive connection attempts and unreachable replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution <mrxlazuar...@gmail.com> wrote:

Hi Michael,

Have you found the solution for this case? I find the same weird problem without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this problem disappeared after I removed some Kubernetes cluster VMs and some DB cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" <michael.pl...@tu-berlin.de>
To: "ovs-discuss@openvswitch.org" <ovs-discuss@openvswitch.org>
Subject: [ovs-discuss] ovs-vswitchd crashes several times a day
Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new openstack infrastructure based on yoga and during our testing we observed that the openvswitch-switch systemd unit restarts several times a day, causing network interruptions for all VMs on the compute node in question. After some research we found that ovs-vswitchd crashes with the following assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095: assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in conn_update_state()"

To get more information about the connection that leads to this assertion failure, I added some debug code to conntrack.c. We have seen that we can trigger this issue when trying to connect from a VM to a destination which is unreachable. For example curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says: conn_type=1
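Two details in that report line up with the NAT hypothesis discussed earlier in the thread. First, conn_type=1 matches CT_CONN_TYPE_UN_NAT (with CT_CONN_TYPE_DEFAULT being 0, judging by the enum usage in the patches above): the lookup evidently returned the auxiliary NAT entry where only a default entry is expected, which is exactly what the ovs_assert() rejects. Second, the trigger is many short-lived connection attempts to an unreachable destination. A minimal traffic-generator sketch along those lines -- hypothetical addresses; it assumes a Linux test VM and a peer at 10.0.0.2 with TCP port 444 closed, mirroring the curl example -- would be:

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(444);          /* closed port, as in the curl test */
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);

    for (int i = 0; i < 100000; i++) {
        int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        if (fd < 0) {
            return 1;
        }
        /* Fire a SYN and close right away: each attempt leaves behind a
         * short-lived conntrack entry that expires in SYN_SENT. */
        connect(fd, (struct sockaddr *) &dst, sizeof dst);
        close(fd);
    }
    return 0;
}

With enough attempts, ephemeral source ports get reused while earlier entries are still in the expired-but-not-yet-reclaimed window, which, per Paolo's description, is the overlap the race needs.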
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Paolo,

Would you mind explaining this to me? Currently, I'm still looking for the compile options of the installed OVS-DPDK from the Ubuntu repo. After that, I'll try your patch and compile it with the same options.

Best regards.

On Wed, Apr 5, 2023, 2:51 AM Paolo Valerio wrote:

> Hello,
>
> thanks for reporting this.
> [... 2.17 patch snipped; quoted in full in Paolo's original message later in this thread ...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hello,

thanks for reporting this.
I had a look at it, and, although this needs to be confirmed, I suspect it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but not yet reclaimed).

The nat part does not necessarily perform any actual translation, but could still be triggered by ct(nat(src)...) which is the all-zero binding to avoid collisions, if any.

Is there any chance to test the following patch (targeted for ovs 2.17)? This should help to confirm.

-- >8 --
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 08da4ddf7..ba334afb0 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct conn_key *);
 static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
                              struct conn_key *, long long now,
                              uint32_t tp_id);
-static void delete_conn_cmn(struct conn *);
+static void delete_conn__(struct conn *);
 static void delete_conn(struct conn *);
-static void delete_conn_one(struct conn *conn);
 static enum ct_update_res conn_update(struct conntrack *ct, struct conn *conn,
                                       struct dp_packet *pkt,
                                       struct conn_lookup_ctx *ctx,
@@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
 }
 
 static void
-conn_clean_cmn(struct conntrack *ct, struct conn *conn)
+conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
     OVS_REQUIRES(ct->ct_lock)
 {
     if (conn->alg) {
         expectation_clean(ct, &conn->key);
     }
 
-    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
     cmap_remove(&ct->conns, &conn->cm_node, hash);
 
     struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
@@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
     OVS_REQUIRES(ct->ct_lock)
 {
     ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
+    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
 
-    conn_clean_cmn(ct, conn);
+    conn_clean_cmn(ct, conn, conn_hash);
     if (conn->nat_conn) {
         uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
-        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+        if (conn_hash != hash) {
+            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+        }
     }
     ovs_list_remove(&conn->exp_node);
     conn->cleaned = true;
@@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
     atomic_count_dec(&ct->n_conn);
 }
 
-static void
-conn_clean_one(struct conntrack *ct, struct conn *conn)
-    OVS_REQUIRES(ct->ct_lock)
-{
-    conn_clean_cmn(ct, conn);
-    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-        ovs_list_remove(&conn->exp_node);
-        conn->cleaned = true;
-        atomic_count_dec(&ct->n_conn);
-    }
-    ovsrcu_postpone(delete_conn_one, conn);
-}
-
 /* Destroys the connection tracker 'ct' and frees all the allocated memory.
  * The caller of this function must already have shut down packet input
  * and PMD threads (which would have been quiesced). */
@@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
 
     ovs_mutex_lock(&ct->ct_lock);
     CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
-        conn_clean_one(ct, conn);
+        if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
+            continue;
+        }
+        conn_clean(ct, conn);
     }
     cmap_destroy(&ct->conns);
 
@@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
         nat_conn->alg = NULL;
         nat_conn->nat_conn = NULL;
         uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
-        cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+
+        if (nat_hash != ctx->hash) {
+            cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+        }
     }
 
     nc->nat_conn = nat_conn;
@@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
 nat_res_exhaustion:
     free(nat_conn);
     ovs_list_remove(&nc->exp_node);
-    delete_conn_cmn(nc);
+    delete_conn__(nc);
     static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
     VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
                  "if DoS attack, use firewalling and/or zone partitioning.");
@@ -2549,7 +2543,7 @@ new_conn(struct conntrack *ct, struct dp_packet *pkt, struct conn_key *key,
 }
 
 static void
-delete_conn_cmn(struct conn *conn)
+delete_conn__(struct conn *conn)
 {
     free(conn->alg);
     free(conn);
@@ -2561,17 +2555,7 @@ delete_conn(struct conn *conn)
     ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
     ovs_mutex_destroy(&conn->lock);
     free(conn->nat_conn);
-    delete_conn_cmn(conn);
+    delete_conn__(conn);
 }
 
-/* Only used by conn_clean_one(). */
-static void
-delete_conn_one(struct conn *conn)
-{
-    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-        ovs_mutex_destroy(&conn
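Reading the patch as a whole, the fix has three coordinated parts: conn_clean_cmn() takes the caller-computed hash instead of recomputing it; the nat_conn node is inserted into (and removed from) the cmap only when its hash differs from its parent's; and conntrack_destroy() skips CT_CONN_TYPE_UN_NAT entries so that each connection is torn down exactly once through its parent. Dropping the same-hash nat_conn is safe because a lookup walk appears to compare each candidate against both the forward and the reverse key of a connection, so the parent entry alone already answers for reply-direction traffic. A simplified sketch of such a bucket walk (stand-in types, not the real conn_key_lookup()):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct key5 {                /* stand-in for struct conn_key */
    unsigned src_ip, dst_ip, src_port, dst_port;
};

struct conn_sk {             /* stand-in for struct conn */
    struct key5 key;         /* original direction */
    struct key5 rev_key;     /* reply direction */
};

/* Walk one hash bucket; a single entry answers for both directions,
 * so a same-hash nat_conn node would be pure duplication. */
static struct conn_sk *
bucket_lookup(struct conn_sk **bucket, size_t n,
              const struct key5 *k, bool *reply)
{
    for (size_t i = 0; i < n; i++) {
        if (!memcmp(&bucket[i]->key, k, sizeof *k)) {
            *reply = false;
            return bucket[i];
        }
        if (!memcmp(&bucket[i]->rev_key, k, sizeof *k)) {
            *reply = true;
            return bucket[i];
        }
    }
    return NULL;
}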
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Michael,

I assume that your k8s cluster is on the same subnet, right? Would you mind testing it by shutting down one of the etcd instances and seeing if this bug still exists?

Best regards.

On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael wrote:
> [...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi,

from my perspective the patch works for all cases. My test environment runs with several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution
Sent: Tuesday, April 4, 2023 09:41
To: Plato, Michael
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

> [...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Michael,

Is your patch working for same-subnet unreachable traffic too? In my case, crashes happen when there are too many unreachable replies, even from the same subnet. For example, when one of the etcd instances is down, there are huge numbers of reconnection attempts and then unreachable replies from the destination VM where the down etcd instance lives.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael wrote:
> [...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi,

I have some news on this topic. Unfortunately I could not find the root cause, but I managed to implement a workaround (see patch in attachment). The basic idea is to mark the nat flows as invalid if there is no longer an associated connection. From my point of view it is a race condition, and it can be triggered by many short-lived connections. With the patch we no longer have any crashes. I can't say whether it has any negative effects, though, as I'm not an expert; so far I haven't found any problems at least. Without this patch we had hundreds of crashes a day :/

Best regards

Michael

From: Lazuardi Nasution
Sent: Monday, April 3, 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

> [...]

ovs-conntrack.patch
Description: ovs-conntrack.patch
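The attached patch itself is not reproduced in the archive, so the following is only a guess at the shape of the workaround Michael describes: treat a nat (UN_NAT) entry as invalid once its owning connection is gone. Every name here except the CT_CONN_TYPE_* constants is hypothetical; this is a sketch of the idea, not the actual attachment.

#include <stdbool.h>
#include <stddef.h>

enum ct_conn_type { CT_CONN_TYPE_DEFAULT, CT_CONN_TYPE_UN_NAT };

struct ct_entry {
    enum ct_conn_type conn_type;
    struct ct_entry *master;   /* owning DEFAULT entry when UN_NAT, else NULL */
    bool cleaned;              /* set once the entry has been removed */
};

/* Hypothetical validity check a lookup path could apply before using an
 * entry: a stale UN_NAT twin no longer counts as a hit. */
static bool
ct_entry_valid(const struct ct_entry *e)
{
    if (e->conn_type == CT_CONN_TYPE_UN_NAT) {
        return e->master != NULL && !e->master->cleaned;
    }
    return !e->cleaned;
}

In this shape, a packet that looks up a stale nat twin is handled as a conntrack miss rather than being passed on to conn_update_state(), which is where the reported assertion fires.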
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi,

Is this related to the following glibc bug? I'm not so sure about this, because when I check the glibc source of the installed version (2.35), the proposed patch has been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which are related to conntrack. The race happens when massive numbers of unreachable replies are received, for example, if I run etcd on VMs but one etcd node has been disabled, which causes massive connection attempts and unreachable replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution wrote:
> [...]
Re: [ovs-discuss] ovs-vswitchd crashes several times a day
Hi Michael,

Have you found the solution for this case? I'm seeing the same weird problem, without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this problem disappears after I remove some Kubernetes cluster VMs and some DB cluster VMs.

Best regards.

> Date: Thu, 29 Sep 2022 07:56:32 +
> From: "Plato, Michael"
> To: "ovs-discuss@openvswitch.org"
> Subject: [ovs-discuss] ovs-vswitchd crashes several times a day
> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> we are about to roll out our new OpenStack infrastructure based on Yoga,
> and during our testing we observed that the openvswitch-switch systemd
> unit restarts several times a day, causing network interruptions for all
> VMs on the compute node in question.
> After some research we found that ovs-vswitchd crashes with the
> following assertion failure:
>
> "2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
> assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in
> conn_update_state()"
>
> To get more information about the connection that leads to this assertion
> failure, I added some debug code to conntrack.c.
> We have seen that we can trigger this issue when trying to connect from a
> VM to a destination which is unreachable, for example:
>
> curl https://www.google.de:444
>
> Shortly after that we get an assertion and the debug code says:
>
> conn_type=1 (may be CT_CONN_TYPE_UN_NAT)?
> src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst
> ip 172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444
> zone/rev zone 2/2 nw_proto/rev nw_proto 6/6
>
> ovs-appctl dpctl/dump-conntrack | grep "444"
>
> tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)
>
> Versions:
> ovs-vsctl --version
> ovs-vsctl (Open vSwitch) 2.17.2
> DB Schema 8.3.0
>
> ovn-controller --version
> ovn-controller 22.03.0
> Open vSwitch Library 2.17.0
> OpenFlow versions 0x6:0x6
> SB DB Schema 20.21.0
>
> DPDK 21.11.2
>
> We are now unsure if this is a misconfiguration or if we hit a bug.
>
> Thanks for any feedback
>
> Michael