Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-10-29 Thread 张同剑
Hi Paolo:

Do you mean that the nat conn and the conntrack conn having the same hash
will trigger this issue? I am also hitting this assert now, but I have not
figured out the exact scenario in which it emerges.

Can you provide a test case that reproduces it?

Best regards.

 





Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-10-13 Thread 张同剑
Hi Paolo:

Is it wrong for flows whose 5-tuples hash to the same value to hit the same
CT table entry?

And if we "avoid including two keys with the same hash belonging to the same
connection, even for the nat case", is that a waste of CT table resources?





Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-05-23 Thread Lazuardi Nasution via discuss
Hi Paolo, Hi Michael,

I want to confirm that the following patch works on Open vSwitch 3.0.3 and
that the OVS crash no longer happens after patching.

https://patchwork.ozlabs.org/project/openvswitch/patch/168192964823.4031872.3228556334798413886.st...@fed.void/

But now I am seeing some logs like the following. I'm not sure whether they
are related to the above patch.

2023-05-23T08:35:18.383Z|5|conntrack(pmd-c49/id:104)|WARN|Unable to NAT
due to tuple space exhaustion - if DoS attack, use firewalling and/or zone
partitioning.
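
As far as I can tell, this warning is logged from the nat_res_exhaustion path
of conn_not_found() in lib/conntrack.c. A trimmed excerpt of that path, as it
appears (post-patch) in the diffs quoted later in this thread:

nat_res_exhaustion:
    free(nat_conn);
    ovs_list_remove(&nc->exp_node);
    delete_conn__(nc);
    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
    VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
                 "if DoS attack, use firewalling and/or zone partitioning.");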

Any ideas?

Best regards.


> Date: Thu, 04 May 2023 19:24:53 +0200
> From: Paolo Valerio 
> To: Lazuardi Nasution 
> Cc: , ovs-discuss@openvswitch.org
> Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
> Message-ID: <871qjwt3fe@fed.void>
> Content-Type: text/plain; charset=utf-8
>
> Lazuardi Nasution  writes:
>
> > Hi Paolo,
> >
> > Should we combine this patch too?
> >
> > https://patchwork.ozlabs.org/project/openvswitch/patch/
> > 168192964823.4031872.3228556334798413886.st...@fed.void/
> >
>
> Hi,
>
> no, it basically does the same thing in a slightly different way
> reducing the need for modification in the case of backporting to
> previous versions.
>
> > Best regards.
> >
> > On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio 
> wrote:
> >
> > Hello,
> >
> > thanks for reporting this.
> > I had a look at it, and, although this needs to be confirmed, I
> suspect
> > it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections
> (but
> > not yet reclaimed).
> >
> > The nat part does not necessarily perform any actual translation, but
> > could still be triggered by ct(nat(src)...) which is the all-zero
> binding
> > to avoid collisions, if any.
> >
> > Is there any chance to test the following patch (targeted for ovs
> 2.17)?
> > This should help to confirm.
> >
> > -- >8 --
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 08da4ddf7..ba334afb0 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> > conn_key *);
> >  static struct conn *new_conn(struct conntrack *ct, struct dp_packet
> *pkt,
> >                               struct conn_key *, long long now,
> >                               uint32_t tp_id);
> > -static void delete_conn_cmn(struct conn *);
> > +static void delete_conn__(struct conn *);
> >  static void delete_conn(struct conn *);
> > -static void delete_conn_one(struct conn *conn);
> >  static enum ct_update_res conn_update(struct conntrack *ct, struct
> conn
> > *conn,
> >                                        struct dp_packet *pkt,
> >                                        struct conn_lookup_ctx *ctx,
> > @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct,
> uint16_t
> > zone)
> >  }
> >
> >  static void
> > -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> > +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t
> hash)
> >      OVS_REQUIRES(ct->ct_lock)
> >  {
> >      if (conn->alg) {
> >          expectation_clean(ct, &conn->key);
> >      }
> >
> > -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
> >      cmap_remove(&ct->conns, &conn->cm_node, hash);
> >
> >      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> > @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn
> *conn)
> >      OVS_REQUIRES(ct->ct_lock)
> >  {
> >      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> > +    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
> >
> > -    conn_clean_cmn(ct, conn);
> > +    conn_clean_cmn(ct, conn, conn_hash);
> >      if (conn->nat_conn) {
> >          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->
> > hash_basis);
> > -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> > +        if (conn_hash != hash) {
> > +            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> > +        }
> >      }
> >      ovs_list_remove(&conn->exp_node);
> >      conn->cleaned = true;
> > @@ -479,19 +480,6 @@ conn_clean(struct

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-05-04 Thread Paolo Valerio via discuss
Lazuardi Nasution  writes:

> Hi Paolo,
>
> Should we combine this patch too?
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/
> 168192964823.4031872.3228556334798413886.st...@fed.void/
>

Hi,

no, it basically does the same thing in a slightly different way, reducing
the need for modifications when backporting to previous versions.

> Best regards.
>
> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:
>
> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>                               struct conn_key *, long long now,
>                               uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>                                        struct dp_packet *pkt,
>                                        struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      if (conn->alg) {
>          expectation_clean(ct, &conn->key);
>      }
>
> -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>      cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -    conn_clean_cmn(ct, conn);
> +    conn_clean_cmn(ct, conn, conn_hash);
>      if (conn->nat_conn) {
>          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->
> hash_basis);
> -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        if (conn_hash != hash) {
> +            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        }
>      }
>      ovs_list_remove(&conn->exp_node);
>      conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -    OVS_REQUIRES(ct->ct_lock)
> -{
> -    conn_clean_cmn(ct, conn);
> -    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -        ovs_list_remove(&conn->exp_node);
> -        conn->cleaned = true;
> -        atomic_count_dec(&ct->n_conn);
> -    }
> -    ovsrcu_postpone(delete_conn_one, conn);
> -}
> -
>  /* Destroys the connection tracker 'ct' and frees all the allocated
> memory.
>   * The caller of this function must already have shut down packet input
>   * and PMD threads (which would have been quiesced).  */
> @@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
>
>      ovs_mutex_lock(&ct->ct_lock);
>      CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
> -        conn_clean_one(ct, conn);
> +        if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
> +            continue;
> +        }
> +        conn_clean(ct, conn);
>      }
>      cmap_destroy(&ct->conns);
>
> @@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>              nat_conn->alg = NULL;
>              nat_conn->nat_conn = NULL;
>              uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->
> hash_basis);
> -            cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +
> +            if (nat_hash != ctx->hash) {
> +                cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +            }
>          }
>
>          nc->nat_conn 

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-05-04 Thread Lazuardi Nasution via discuss
Hi Paolo,

Should we combine this patch too?

https://patchwork.ozlabs.org/project/openvswitch/patch/168192964823.4031872.3228556334798413886.st...@fed.void/

Best regards.

On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:

> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>   struct conn_key *, long long now,
>   uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>struct dp_packet *pkt,
>struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  if (conn->alg) {
>  expectation_clean(ct, &conn->key);
>  }
>
> -uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>  cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>  struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -conn_clean_cmn(ct, conn);
> +conn_clean_cmn(ct, conn, conn_hash);
>  if (conn->nat_conn) {
>  uint32_t hash = conn_key_hash(&conn->nat_conn->key,
> ct->hash_basis);
> -cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +if (conn_hash != hash) {
> +cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +}
>  }
>  ovs_list_remove(&conn->exp_node);
>  conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>  atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -OVS_REQUIRES(ct->ct_lock)
> -{
> -conn_clean_cmn(ct, conn);
> -if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -ovs_list_remove(&conn->exp_node);
> -conn->cleaned = true;
> -atomic_count_dec(&ct->n_conn);
> -}
> -ovsrcu_postpone(delete_conn_one, conn);
> -}
> -
>  /* Destroys the connection tracker 'ct' and frees all the allocated
> memory.
>   * The caller of this function must already have shut down packet input
>   * and PMD threads (which would have been quiesced).  */
> @@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
>
>  ovs_mutex_lock(&ct->ct_lock);
>  CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
> -conn_clean_one(ct, conn);
> +if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
> +continue;
> +}
> +conn_clean(ct, conn);
>  }
>  cmap_destroy(&ct->conns);
>
> @@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>  nat_conn->alg = NULL;
>  nat_conn->nat_conn = NULL;
>  uint32_t nat_hash = conn_key_hash(&nat_conn->key,
> ct->hash_basis);
> -cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +
> +if (nat_hash != ctx->hash) {
> +cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +}
>  }
>
>  nc->nat_conn = nat_conn;
> @@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>  nat_res_exhaustion:
>  free(nat_conn);
>  ovs_list_remove(&nc->exp_node);
> -delete_conn_cmn(nc);
> +delete_conn__(nc);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
>  VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
>   "if DoS attack, use firewalling and/or zone
> partitioning.");
> @@ -2549,7 +2543,7 @@ new_conn(struct conntrack *ct, struct dp_packet
> *pkt, struct conn_key *key,
>  }
>
>  static void
> -delete_conn_cmn(struc

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-17 Thread Paolo Valerio via discuss
"Plato, Michael"  writes:

> Hi Paolo,
> I installed the patch for 2.17 on april 6th in our test environment and can 
> confirm that it works. We haven't had any crashes since then. Many thanks for 
> the quick solution!
>

Hi Michael,

Nice! That's helpful. Thanks for testing it.

Paolo

> Best regards
>
> Michael
>
> -Ursprüngliche Nachricht-
> Von: Paolo Valerio  
> Gesendet: Montag, 17. April 2023 10:36
> An: Lazuardi Nasution 
> Cc: ovs-discuss@openvswitch.org; Plato, Michael 
> Betreff: Re: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
> Lazuardi Nasution  writes:
>
>> Hi Paolo,
>>
>> I'm interested in your statement of "expired connections (but not yet 
>> reclaimed)". Do you think that shortening conntrack timeout policy will help?
>> Or, should we make it larger so there will be fewer conntrack table 
>> update and flush attempts?
>>
>
> it's hard to say as it depends on the specific use case.
> Probably making it larger for the specific case could help, but in general, I 
> would not rely on that.
> Of course, an actual fix is needed. It would be great if the patch sent could 
> tested, but in any case, I'll work on a formal patch.
>
>> Best regards.
>>
>> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:
>>
>> Hello,
>>
>> thanks for reporting this.
>> I had a look at it, and, although this needs to be confirmed, I suspect
>> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
>> not yet reclaimed).
>>
>> The nat part does not necessarily perform any actual translation, but
>> could still be triggered by ct(nat(src)...) which is the all-zero binding
>> to avoid collisions, if any.
>>
>> Is there any chance to test the following patch (targeted for ovs 2.17)?
>> This should help to confirm.
>>
>> -- >8 --
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> index 08da4ddf7..ba334afb0 100644
>> --- a/lib/conntrack.c
>> +++ b/lib/conntrack.c
>> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
>> conn_key *);
>>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet 
>> *pkt,
>>                               struct conn_key *, long long now,
>>                               uint32_t tp_id);
>> -static void delete_conn_cmn(struct conn *);
>> +static void delete_conn__(struct conn *);
>>  static void delete_conn(struct conn *);
>> -static void delete_conn_one(struct conn *conn);
>>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
>> *conn,
>>                                        struct dp_packet *pkt,
>>                                        struct conn_lookup_ctx *ctx,
>> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
>> zone)
>>  }
>>
>>  static void
>> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
>> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>>      OVS_REQUIRES(ct->ct_lock)
>>  {
>>      if (conn->alg) {
>>          expectation_clean(ct, &conn->key);
>>      }
>>
>> -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>>      cmap_remove(&ct->conns, &conn->cm_node, hash);
>>
>>      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
>> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>>      OVS_REQUIRES(ct->ct_lock)
>>  {
>>      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
>> +    uint32_t conn_hash = conn_key_hash(&conn->key, 
>> ct->hash_basis);
>>
>> -    conn_clean_cmn(ct, conn);
>> +    conn_clean_cmn(ct, conn, conn_hash);
>>      if (conn->nat_conn) {
>>          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->
>> hash_basis);
>> -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
>> +        if (conn_hash != hash) {
>> +            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
>> +        }
>>      }
>>      ovs_list_remove(&conn->exp_node);
>>      conn->cleaned = true;
>> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>>      atomic_co

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-17 Thread Plato, Michael via discuss
Hi Paolo,
I installed the patch for 2.17 on April 6th in our test environment and can 
confirm that it works. We haven't had any crashes since then. Many thanks for 
the quick solution!

Best regards

Michael

-----Original Message-----
From: Paolo Valerio 
Sent: Monday, April 17, 2023 10:36
To: Lazuardi Nasution 
Cc: ovs-discuss@openvswitch.org; Plato, Michael 
Subject: Re: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Lazuardi Nasution  writes:

> Hi Paolo,
>
> I'm interested in your statement of "expired connections (but not yet 
> reclaimed)". Do you think that shortening conntrack timeout policy will help?
> Or, should we make it larger so there will be fewer conntrack table 
> update and flush attempts?
>

it's hard to say as it depends on the specific use case.
Probably making it larger for the specific case could help, but in general, I 
would not rely on that.
Of course, an actual fix is needed. It would be great if the patch sent could 
tested, but in any case, I'll work on a formal patch.

> Best regards.
>
> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:
>
> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>                               struct conn_key *, long long now,
>                               uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>                                        struct dp_packet *pkt,
>                                        struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      if (conn->alg) {
>          expectation_clean(ct, &conn->key);
>      }
>
> -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>      cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +    uint32_t conn_hash = conn_key_hash(&conn->key, 
> ct->hash_basis);
>
> -    conn_clean_cmn(ct, conn);
> +    conn_clean_cmn(ct, conn, conn_hash);
>      if (conn->nat_conn) {
>          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->
> hash_basis);
> -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        if (conn_hash != hash) {
> +            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        }
>      }
>      ovs_list_remove(&conn->exp_node);
>      conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -    OVS_REQUIRES(ct->ct_lock)
> -{
> -    conn_clean_cmn(ct, conn);
> -    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -        ovs_list_remove(&conn->exp_node);
> -        conn->cleaned = true;
> -        atomic_count_dec(&ct->n_conn);
> -    }
> -    ovsrcu_postpone(delete_conn_one, conn);
> -}
> -

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-17 Thread Paolo Valerio via discuss
Lazuardi Nasution  writes:

> Hi Paolo,
>
> I'm interested in your statement of "expired connections (but not yet
> reclaimed)". Do you think that shortening conntrack timeout policy will help?
> Or, should we make it larger so there will be fewer conntrack table update and
> flush attempts?
>

it's hard to say as it depends on the specific use case.
Probably making it larger for the specific case could help, but in
general, I would not rely on that.
Of course, an actual fix is needed. It would be great if the patch I sent
could be tested, but in any case, I'll work on a formal patch.

> Best regards.
>
> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:
>
> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>                               struct conn_key *, long long now,
>                               uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>                                        struct dp_packet *pkt,
>                                        struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      if (conn->alg) {
>          expectation_clean(ct, &conn->key);
>      }
>
> -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>      cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -    conn_clean_cmn(ct, conn);
> +    conn_clean_cmn(ct, conn, conn_hash);
>      if (conn->nat_conn) {
>          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->
> hash_basis);
> -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        if (conn_hash != hash) {
> +            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        }
>      }
>      ovs_list_remove(&conn->exp_node);
>      conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -    OVS_REQUIRES(ct->ct_lock)
> -{
> -    conn_clean_cmn(ct, conn);
> -    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -        ovs_list_remove(&conn->exp_node);
> -        conn->cleaned = true;
> -        atomic_count_dec(&ct->n_conn);
> -    }
> -    ovsrcu_postpone(delete_conn_one, conn);
> -}
> -
>  /* Destroys the connection tracker 'ct' and frees all the allocated
> memory.
>   * The caller of this function must already have shut down packet input
>   * and PMD threads (which would have been quiesced).  */
> @@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
>
>      ovs_mutex_lock(&ct->ct_lock);
>      CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
> -        conn_clean_one(ct, conn);
> +        if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
> +            continue;
> +        }
> +        conn_clean(ct, conn);
>      }
>      cmap_destroy(&ct->conns);
>
> @@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>              nat_conn->alg = NULL;
>              nat_conn->nat_conn = NULL;
>              uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->
> hash_basis);
> -            cmap_insert(&ct->conns, 

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-13 Thread Lazuardi Nasution via discuss
Hi Paolo,

I'm interested in your statement about "expired connections (but not yet
reclaimed)". Do you think that shortening the conntrack timeout policy will
help? Or should we make it larger so that there are fewer conntrack table
updates and flush attempts?

Best regards.

On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:

> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>   struct conn_key *, long long now,
>   uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>struct dp_packet *pkt,
>struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  if (conn->alg) {
>  expectation_clean(ct, &conn->key);
>  }
>
> -uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>  cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>  struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -conn_clean_cmn(ct, conn);
> +conn_clean_cmn(ct, conn, conn_hash);
>  if (conn->nat_conn) {
>  uint32_t hash = conn_key_hash(&conn->nat_conn->key,
> ct->hash_basis);
> -cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +if (conn_hash != hash) {
> +cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +}
>  }
>  ovs_list_remove(&conn->exp_node);
>  conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>  atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -OVS_REQUIRES(ct->ct_lock)
> -{
> -conn_clean_cmn(ct, conn);
> -if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -ovs_list_remove(&conn->exp_node);
> -conn->cleaned = true;
> -atomic_count_dec(&ct->n_conn);
> -}
> -ovsrcu_postpone(delete_conn_one, conn);
> -}
> -
>  /* Destroys the connection tracker 'ct' and frees all the allocated
> memory.
>   * The caller of this function must already have shut down packet input
>   * and PMD threads (which would have been quiesced).  */
> @@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
>
>  ovs_mutex_lock(&ct->ct_lock);
>  CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
> -conn_clean_one(ct, conn);
> +if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
> +continue;
> +}
> +conn_clean(ct, conn);
>  }
>  cmap_destroy(&ct->conns);
>
> @@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>  nat_conn->alg = NULL;
>  nat_conn->nat_conn = NULL;
>  uint32_t nat_hash = conn_key_hash(&nat_conn->key,
> ct->hash_basis);
> -cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +
> +if (nat_hash != ctx->hash) {
> +cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +}
>  }
>
>  nc->nat_conn = nat_conn;
> @@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>  nat_res_exhaustion:
>  free(nat_conn);
>  ovs_list_remove(&nc->exp_node);
> -delete_conn_cmn(nc);
> +delete_conn__(nc);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
>  VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
>   "if DoS attack, use firewalling and/or zone
> partitioning.");
> @@ -2549,7 +2543,7 @@ new_conn(struct conntrack *c

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-05 Thread Paolo Valerio via discuss
Lazuardi Nasution  writes:

> Hi Paolo,
>
> Would you mind to explain this to me? Currently, I'm still looking for
> compiling options of installed OVS-DPDK from Ubuntu repo. After that, I'll try
> your patch and compile it with same options.
>

the idea is to avoid including two keys with the same hash belonging to
the same connection, even for the nat case.

Considering a flow like this:

tcp,in_port="ovs-p0" actions=ct(commit,nat(src)),output:"ovs-p1"

and a TCP SYN matching this rule, an entry in ct is created. Normally, if
no other packets refresh the entry or move the state, it times out in 30s.
You can see that with:

ovs-appctl dpctl/dump-conntrack -s

tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=47838,dport=8080),reply=(src=10.1.1.2,dst=10.1.1.1,sport=8080,dport=47838),timeout=30,protoinfo=(state=SYN_SENT)

There's a timespan between the expiration and the actual clean-up of the
connection. If a new connection with the same 5-tuple (or even a
retransmission) is received in that timespan, the issue should occur.
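
To illustrate the collision part, here is a small standalone sketch (toy
types and a toy hash, not the real conn_key_hash(), so treat all names here
as illustrative only). The one property it mimics is that swapping src and
dst yields the same hash, which is why, with an all-zero nat binding, the
conn key and the reversed nat_conn key end up with equal hashes:

#include <stdint.h>
#include <stdio.h>

struct endpoint { uint32_t addr; uint16_t port; };
struct key { struct endpoint src, dst; uint8_t proto; uint16_t zone; };

static uint32_t hash_endpoint(uint32_t basis, const struct endpoint *e)
{
    uint32_t h = basis;
    h = h * 31 + e->addr;
    h = h * 31 + e->port;
    return h;
}

/* Direction-agnostic key hash: hash src and dst separately, then XOR,
 * so swapping the two endpoints does not change the result. */
static uint32_t key_hash(const struct key *k, uint32_t basis)
{
    uint32_t h = hash_endpoint(basis, &k->src) ^ hash_endpoint(basis, &k->dst);
    h = h * 31 + k->proto;
    h = h * 31 + k->zone;
    return h;
}

int main(void)
{
    /* The tuple from the dump-conntrack output above. */
    struct key orig = {
        .src = { .addr = 0x0a010101, .port = 47838 },  /* 10.1.1.1 */
        .dst = { .addr = 0x0a010102, .port = 8080 },   /* 10.1.1.2 */
        .proto = 6, .zone = 0,
    };
    /* ct(nat(src)) with the all-zero binding: nothing is rewritten, so
     * the nat key is just the reversed tuple. */
    struct key nat = { .src = orig.dst, .dst = orig.src,
                       .proto = orig.proto, .zone = orig.zone };
    uint32_t basis = 0x12345678;

    /* Both lines print the same value, so the two entries belonging to
     * this single connection collide on the hash. */
    printf("orig hash: 0x%08x\n", (unsigned) key_hash(&orig, basis));
    printf("nat  hash: 0x%08x\n", (unsigned) key_hash(&nat, basis));
    return 0;
}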

In ovs 3.x the patch (intended for testing only) should be slightly
different as some things changed there.
This should be enough for a quick test:

-- >8 --
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 13c5ab628..7f6f1c2a8 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -481,8 +481,10 @@ conn_clean__(struct conntrack *ct, struct conn *conn)
     cmap_remove(&ct->conns, &conn->cm_node, hash);
 
     if (conn->nat_conn) {
-        hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
-        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+        uint32_t nc_hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
+        if (hash != nc_hash) {
+            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, nc_hash);
+        }
     }
 
     rculist_remove(&conn->node);
@@ -1090,7 +1092,9 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
             nat_conn->alg = NULL;
             nat_conn->nat_conn = NULL;
             uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
-            cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+            if (nat_hash != ctx->hash) {
+                cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+            }
         }
 
         nc->nat_conn = nat_conn;


> Best regards.
>
> On Wed, Apr 5, 2023, 2:51 AM Paolo Valerio  wrote:
>
> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>                               struct conn_key *, long long now,
>                               uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>                                        struct dp_packet *pkt,
>                                        struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      if (conn->alg) {
>          expectation_clean(ct, &conn->key);
>      }
>
> -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>      cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -    conn_clean_cmn(ct, conn);
> +    conn_clean_cmn(ct, conn, conn_hash);
>      if (conn->nat_conn) {
>          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->
> hash_basis);
> -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
>

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-05 Thread Plato, Michael via discuss
Hi Paolo,
many thanks for the patch. I'll try it asap...

Regards

Michael

-----Original Message-----
From: Paolo Valerio 
Sent: Tuesday, April 4, 2023 21:51
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael ; mrxlazuar...@gmail.com
Subject: Re: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hello,

thanks for reporting this.
I had a look at it, and, although this needs to be confirmed, I suspect it's 
related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but not yet 
reclaimed).

The nat part does not necessarily perform any actual translation, but could 
still be triggered by ct(nat(src)...) which is the all-zero binding to avoid 
collisions, if any.

Is there any chance to test the following patch (targeted for ovs 2.17)?
This should help to confirm.

-- >8 --
diff --git a/lib/conntrack.c b/lib/conntrack.c index 08da4ddf7..ba334afb0 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct conn_key 
*);  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
  struct conn_key *, long long now,
  uint32_t tp_id);
-static void delete_conn_cmn(struct conn *);
+static void delete_conn__(struct conn *);
 static void delete_conn(struct conn *); -static void delete_conn_one(struct 
conn *conn);  static enum ct_update_res conn_update(struct conntrack *ct, 
struct conn *conn,
   struct dp_packet *pkt,
   struct conn_lookup_ctx *ctx,
@@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)  }

 static void
-conn_clean_cmn(struct conntrack *ct, struct conn *conn)
+conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
 OVS_REQUIRES(ct->ct_lock)
 {
 if (conn->alg) {
 expectation_clean(ct, &conn->key);
 }

-uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
 cmap_remove(&ct->conns, &conn->cm_node, hash);

 struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone); @@ 
-467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
 OVS_REQUIRES(ct->ct_lock)
 {
 ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
+uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);

-conn_clean_cmn(ct, conn);
+conn_clean_cmn(ct, conn, conn_hash);
 if (conn->nat_conn) {
 uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
-cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+if (conn_hash != hash) {
+cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+}
 }
 ovs_list_remove(&conn->exp_node);
 conn->cleaned = true;
@@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
 atomic_count_dec(&ct->n_conn);
 }

-static void
-conn_clean_one(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
-{
-conn_clean_cmn(ct, conn);
-if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-ovs_list_remove(&conn->exp_node);
-conn->cleaned = true;
-atomic_count_dec(&ct->n_conn);
-}
-ovsrcu_postpone(delete_conn_one, conn);
-}
-
 /* Destroys the connection tracker 'ct' and frees all the allocated memory.
  * The caller of this function must already have shut down packet input
  * and PMD threads (which would have been quiesced).  */ @@ -505,7 +493,10 @@ 
conntrack_destroy(struct conntrack *ct)

 ovs_mutex_lock(&ct->ct_lock);
 CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
-conn_clean_one(ct, conn);
+if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
+continue;
+}
+conn_clean(ct, conn);
 }
 cmap_destroy(&ct->conns);

@@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct dp_packet 
*pkt,
 nat_conn->alg = NULL;
 nat_conn->nat_conn = NULL;
 uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
-cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+
+if (nat_hash != ctx->hash) {
+cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+}
 }

 nc->nat_conn = nat_conn;
@@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct dp_packet 
*pkt,
 nat_res_exhaustion:
 free(nat_conn);
 ovs_list_remove(&nc->exp_node);
-delete_conn_cmn(nc);
+delete_conn__(nc);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
 VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
  "if DoS attack, use firewalling and/or zone partitioning."); 
@@ -25

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-05 Thread Lazuardi Nasution via discuss
Hi Michael,

Great to know that. I will try it on my cluster too. Btw, do you know how to
find the compile options of the OVS-DPDK package from the Ubuntu repo?

Best regards.

On Wed, Apr 5, 2023, 1:56 PM Plato, Michael 
wrote:

> Hi,
>
>
>
> yes our k8s cluster is on the same subnet. I stopped one of the etcd nodes
> yesterday which triggers a lot of reconnection attempts from the other
> cluster members. Stilll no issues so far and no ovs crashes 😊
>
>
>
> Regards
>
>
>
> Michael
>
>
>
> *Von:* Lazuardi Nasution 
> *Gesendet:* Dienstag, 4. April 2023 09:56
> *An:* Plato, Michael 
> *Cc:* ovs-discuss@openvswitch.org
> *Betreff:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi Michael,
>
>
>
> I assume that your k8s cluster is on the same subnet, right? Would you
> mind testing it by shutting down one of etcd instances and see if this bug
> still exists?
>
>
>
> Best regards.
>
>
>
> On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael 
> wrote:
>
> Hi,
>
> from my perspective the patch works for all cases. My test environment
> runs with several k8s clusters and I haven't noticed any etcd failures so
> far.
>
>
>
> Best regards
>
>
>
> Michael
>
>
>
> *Von:* Lazuardi Nasution 
> *Gesendet:* Dienstag, 4. April 2023 09:41
> *An:* Plato, Michael 
> *Cc:* ovs-discuss@openvswitch.org
> *Betreff:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi Michael,
>
>
>
> Is your patch working on the same subnet unreachable traffic too. In my
> case, crashes happen when too many unreachable replies even from the same
> subnet. For example, when one of the etcd instances is down, there will be
> huge reconnection attempts and then unreachable replies from the
> destination VM where the down etcd instance exists.
>
>
>
> Best regards.
>
>
>
> On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael 
> wrote:
>
> Hi,
>
> I have some news on this topic. Unfortunately I could not find the root
> cause. But I managed to implement a workaround (see patch in attachment).
> The basic idea is to mark the nat flows as invalid if there is no longer an
> associated connection. From my point of view it is a race condition. It can
> be triggered by many short-lived connections. With the patch we no longer
> have any crashes. I can't say if it has any negative effects though, as I'm
> not an expert. So far I haven't found any problems at least. Without this
> patch we had hundreds of crashes a day :/
>
>
>
> Best regards
>
>
> Michael
>
>
>
> *Von:* Lazuardi Nasution 
> *Gesendet:* Montag, 3. April 2023 13:50
> *An:* ovs-discuss@openvswitch.org
> *Cc:* Plato, Michael 
> *Betreff:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi,
>
>
>
> Is this related to following glibc bug? I'm not so sure about this because
> when I check the glibc source of installed version (2.35), the proposed
> patch has been applied.
>
>
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=12889
>
>
>
> I can confirm that this problem only happen if I use statefull ACL which
> is related to conntrack. The racing situation happen when massive
> unreachable replies are received. For example, if I run etcd on VMs but one
> etcd node has been disabled which causes massive connection attempts and
> unreachable replies.
>
>
>
> Best regards.
>
>
>
> On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
> wrote:
>
> Hi Michael,
>
>
>
> Have you found the solution for this case? I find the same weird problem
> without any information about which conntrack entries are causing
> this issue.
>
>
>
> I'm using OVS 3.0.1 with DPDK  21.11.2 on Ubuntu 22.04. By the way, this
> problem is disappear after I remove some Kubernutes cluster VMs and some DB
> cluster VMs.
>
>
>
> Best regards.
>
>
>
> Date: Thu, 29 Sep 2022 07:56:32 +
> From: "Plato, Michael" 
> To: "ovs-discuss@openvswitch.org" 
> Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> we are about to roll out our new openstack infrastructure based on yoga
> and during our testing we observered that the openvswitch-switch systemd
> unit restarts several times a day, causing network interruptions for all
> VMs on the compute node in question.
> After some research we found that the ovs-vswitchd crashes with the
> following assertion failure:
>
> "2022-0

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-04 Thread Plato, Michael via discuss
Hi,

yes, our k8s cluster is on the same subnet. I stopped one of the etcd nodes 
yesterday, which triggers a lot of reconnection attempts from the other cluster 
members. Still no issues so far and no OVS crashes 😊

Regards

Michael

From: Lazuardi Nasution 
Sent: Tuesday, April 4, 2023 09:56
To: Plato, Michael 
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi Michael,

I assume that your k8s cluster is on the same subnet, right? Would you mind 
testing it by shutting down one of etcd instances and see if this bug still 
exists?

Best regards.

On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael <michael.pl...@tu-berlin.de> wrote:
Hi,
from my perspective the patch works for all cases. My test environment runs 
with several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Tuesday, April 4, 2023 09:41
To: Plato, Michael <michael.pl...@tu-berlin.de>
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi Michael,

Is your patch working for same-subnet unreachable traffic too? In my case, 
crashes happen when there are too many unreachable replies, even from within the 
same subnet. For example, when one of the etcd instances is down, there are huge 
numbers of reconnection attempts and then unreachable replies from the destination 
VM where the down etcd instance lives.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael <michael.pl...@tu-berlin.de> wrote:
Hi,
I have some news on this topic. Unfortunately I could not find the root cause. 
But I managed to implement a workaround (see patch in attachment). The basic 
idea is to mark the nat flows as invalid if there is no longer an associated 
connection. From my point of view it is a race condition. It can be triggered 
by many short-lived connections. With the patch we no longer have any crashes. 
I can't say if it has any negative effects though, as I'm not an expert. So far 
I haven't found any problems at least. Without this patch we had hundreds of 
crashes a day :/

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Monday, April 3, 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael <michael.pl...@tu-berlin.de>
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi,

Is this related to the following glibc bug? I'm not so sure about this, because 
when I check the glibc source of the installed version (2.35), the proposed patch 
has already been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which are 
related to conntrack. The race happens when massive numbers of unreachable 
replies are received. For example, if I run etcd on VMs and one etcd node has 
been disabled, this causes massive connection attempts and unreachable replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution <mrxlazuar...@gmail.com> wrote:
Hi Michael,

Have you found the solution for this case? I find the same weird problem 
without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this 
problem disappears after I remove some Kubernetes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" 
mailto:michael.pl...@tu-berlin.de>>
To: "ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>" 
mailto:ovs-discuss@openvswitch.org>>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: 
<8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de<mailto:8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new OpenStack infrastructure based on Yoga, and 
during our testing we observed that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c.
We have seen that we can trigger this issue when trying to connect from a VM to 
a destination which is unreachable, for example: curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says:

conn_type=1 
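
For reference, conn_type=1 would correspond to CT_CONN_TYPE_UN_NAT (with
CT_CONN_TYPE_DEFAULT being 0), assuming the enum layout in
lib/conntrack-private.h at the time, which matches the un-NAT suspicion
discussed elsewhere in this thread:

/* Sketch of the assumed enum from lib/conntrack-private.h, shown only to
 * decode the conn_type=1 value in the debug output above. */
enum ct_conn_type {
    CT_CONN_TYPE_DEFAULT,   /* 0: the "real" connection entry. */
    CT_CONN_TYPE_UN_NAT,    /* 1: the companion un-NAT lookup entry. */
};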

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-04 Thread Lazuardi Nasution via discuss
Hi Paolo,

Would you mind explaining this to me? Currently, I'm still looking for the
compile options of the OVS-DPDK package installed from the Ubuntu repo. After
that, I'll try your patch and compile it with the same options.

Best regards.

On Wed, Apr 5, 2023, 2:51 AM Paolo Valerio  wrote:

> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct
> conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>   struct conn_key *, long long now,
>   uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn
> *conn,
>struct dp_packet *pkt,
>struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t
> zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  if (conn->alg) {
>  expectation_clean(ct, &conn->key);
>  }
>
> -uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>  cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>  struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>  OVS_REQUIRES(ct->ct_lock)
>  {
>  ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -conn_clean_cmn(ct, conn);
> +conn_clean_cmn(ct, conn, conn_hash);
>  if (conn->nat_conn) {
>  uint32_t hash = conn_key_hash(&conn->nat_conn->key,
> ct->hash_basis);
> -cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +if (conn_hash != hash) {
> +cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +}
>  }
>  ovs_list_remove(&conn->exp_node);
>  conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>  atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -OVS_REQUIRES(ct->ct_lock)
> -{
> -conn_clean_cmn(ct, conn);
> -if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -ovs_list_remove(&conn->exp_node);
> -conn->cleaned = true;
> -atomic_count_dec(&ct->n_conn);
> -}
> -ovsrcu_postpone(delete_conn_one, conn);
> -}
> -
>  /* Destroys the connection tracker 'ct' and frees all the allocated
> memory.
>   * The caller of this function must already have shut down packet input
>   * and PMD threads (which would have been quiesced).  */
> @@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
>
>  ovs_mutex_lock(&ct->ct_lock);
>  CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
> -conn_clean_one(ct, conn);
> +if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
> +continue;
> +}
> +conn_clean(ct, conn);
>  }
>  cmap_destroy(&ct->conns);
>
> @@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>  nat_conn->alg = NULL;
>  nat_conn->nat_conn = NULL;
>  uint32_t nat_hash = conn_key_hash(&nat_conn->key,
> ct->hash_basis);
> -cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +
> +if (nat_hash != ctx->hash) {
> +cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
> +}
>  }
>
>  nc->nat_conn = nat_conn;
> @@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct
> dp_packet *pkt,
>  nat_res_exhaustion:
>  free(nat_conn);
>  ovs_list_remove(&nc->exp_node);
> -delete_conn_cmn(nc);
> +delete_conn__(nc);
>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
>  VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
>   "if DoS attack, use firewalling and/or zone
> partitioning.");
> @@ -2549,7 +2543,7 @@ new_conn(struct conntrack *ct, struct dp_packet
> *pkt, struct conn_key *key,

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-04 Thread Paolo Valerio via discuss
Hello,

thanks for reporting this.
I had a look at it, and, although this needs to be confirmed, I suspect
it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
not yet reclaimed).

The nat part does not necessarily perform any actual translation, but
could still be triggered by ct(nat(src)...) which is the all-zero binding
to avoid collisions, if any.

Is there any chance to test the following patch (targeted for ovs 2.17)?
This should help to confirm.

-- >8 --
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 08da4ddf7..ba334afb0 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct conn_key 
*);
 static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
  struct conn_key *, long long now,
  uint32_t tp_id);
-static void delete_conn_cmn(struct conn *);
+static void delete_conn__(struct conn *);
 static void delete_conn(struct conn *);
-static void delete_conn_one(struct conn *conn);
 static enum ct_update_res conn_update(struct conntrack *ct, struct conn *conn,
   struct dp_packet *pkt,
   struct conn_lookup_ctx *ctx,
@@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
 }

 static void
-conn_clean_cmn(struct conntrack *ct, struct conn *conn)
+conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
 OVS_REQUIRES(ct->ct_lock)
 {
 if (conn->alg) {
 expectation_clean(ct, &conn->key);
 }

-uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
 cmap_remove(&ct->conns, &conn->cm_node, hash);

 struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
@@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
 OVS_REQUIRES(ct->ct_lock)
 {
 ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
+uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);

-conn_clean_cmn(ct, conn);
+conn_clean_cmn(ct, conn, conn_hash);
 if (conn->nat_conn) {
 uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
-cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+if (conn_hash != hash) {
+cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+}
 }
 ovs_list_remove(&conn->exp_node);
 conn->cleaned = true;
@@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
 atomic_count_dec(&ct->n_conn);
 }

-static void
-conn_clean_one(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
-{
-conn_clean_cmn(ct, conn);
-if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-ovs_list_remove(&conn->exp_node);
-conn->cleaned = true;
-atomic_count_dec(&ct->n_conn);
-}
-ovsrcu_postpone(delete_conn_one, conn);
-}
-
 /* Destroys the connection tracker 'ct' and frees all the allocated memory.
  * The caller of this function must already have shut down packet input
  * and PMD threads (which would have been quiesced).  */
@@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)

 ovs_mutex_lock(&ct->ct_lock);
 CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
-conn_clean_one(ct, conn);
+if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
+continue;
+}
+conn_clean(ct, conn);
 }
 cmap_destroy(&ct->conns);

@@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
 nat_conn->alg = NULL;
 nat_conn->nat_conn = NULL;
 uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
-cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+
+if (nat_hash != ctx->hash) {
+cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+}
 }

 nc->nat_conn = nat_conn;
@@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
 nat_res_exhaustion:
 free(nat_conn);
 ovs_list_remove(&nc->exp_node);
-delete_conn_cmn(nc);
+delete_conn__(nc);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
 VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
  "if DoS attack, use firewalling and/or zone partitioning.");
@@ -2549,7 +2543,7 @@ new_conn(struct conntrack *ct, struct dp_packet *pkt, struct conn_key *key,
 }

 static void
-delete_conn_cmn(struct conn *conn)
+delete_conn__(struct conn *conn)
 {
 free(conn->alg);
 free(conn);
@@ -2561,17 +2555,7 @@ delete_conn(struct conn *conn)
 ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
 ovs_mutex_destroy(&conn->lock);
 free(conn->nat_conn);
-delete_conn_cmn(conn);
-}
-
-/* Only used by conn_clean_one(). */
-static void
-delete_conn_one(struct conn *conn)
-{
-if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-ovs_mutex_destroy(&conn

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-04 Thread Lazuardi Nasution via discuss
Hi Michael,

I assume that your k8s cluster is on the same subnet, right? Would you mind
testing it by shutting down one of the etcd instances and seeing whether this
bug still occurs?

Best regards.

On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael 
wrote:

> Hi,
>
> from my perspective the patch works for all cases. My test environment
> runs with several k8s clusters and I haven't noticed any etcd failures so
> far.
>
>
>
> Best regards
>
>
>
> Michael
>
>
>
> *From:* Lazuardi Nasution 
> *Sent:* Tuesday, 4 April 2023 09:41
> *To:* Plato, Michael 
> *Cc:* ovs-discuss@openvswitch.org
> *Subject:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi Michael,
>
>
>
> Is your patch working on the same subnet unreachable traffic too. In my
> case, crashes happen when too many unreachable replies even from the same
> subnet. For example, when one of the etcd instances is down, there will be
> huge reconnection attempts and then unreachable replies from the
> destination VM where the down etcd instance exists.
>
>
>
> Best regards.
>
>
>
> On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael 
> wrote:
>
> Hi,
>
> I have some news on this topic. Unfortunately I could not find the root
> cause. But I managed to implement a workaround (see patch in attachment).
> The basic idea is to mark the nat flows as invalid if there is no longer an
> associated connection. From my point of view it is a race condition. It can
> be triggered by many short-lived connections. With the patch we no longer
> have any crashes. I can't say if it has any negative effects though, as I'm
> not an expert. So far I haven't found any problems at least. Without this
> patch we had hundreds of crashes a day :/
>
>
>
> Best regards
>
>
> Michael
>
>
>
> *From:* Lazuardi Nasution 
> *Sent:* Monday, 3 April 2023 13:50
> *To:* ovs-discuss@openvswitch.org
> *Cc:* Plato, Michael 
> *Subject:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi,
>
>
>
> Is this related to following glibc bug? I'm not so sure about this because
> when I check the glibc source of installed version (2.35), the proposed
> patch has been applied.
>
>
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=12889
>
>
>
> I can confirm that this problem only happen if I use statefull ACL which
> is related to conntrack. The racing situation happen when massive
> unreachable replies are received. For example, if I run etcd on VMs but one
> etcd node has been disabled which causes massive connection attempts and
> unreachable replies.
>
>
>
> Best regards.
>
>
>
> On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
> wrote:
>
> Hi Michael,
>
>
>
> Have you found the solution for this case? I find the same weird problem
> without any information about which conntrack entries are causing
> this issue.
>
>
>
> I'm using OVS 3.0.1 with DPDK  21.11.2 on Ubuntu 22.04. By the way, this
> problem is disappear after I remove some Kubernutes cluster VMs and some DB
> cluster VMs.
>
>
>
> Best regards.
>
>
>
> Date: Thu, 29 Sep 2022 07:56:32 +
> From: "Plato, Michael" 
> To: "ovs-discuss@openvswitch.org" 
> Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> we are about to roll out our new openstack infrastructure based on yoga
> and during our testing we observered that the openvswitch-switch systemd
> unit restarts several times a day, causing network interruptions for all
> VMs on the compute node in question.
> After some research we found that the ovs-vswitchd crashes with the
> following assertion failure:
>
> "2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
> assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in
> conn_update_state()"
>
> To get more information about the connection that leads to this assertion
> failure, I added some debug code to conntrack.c .
> We have seen that we can trigger this issue when trying to connect from a
> VM to a destination which is unreachable. For example curl
> https://www.google.de:444
>
> Shortly after that we get an assertion and the debug code says:
>
> conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
> src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst
> ip 172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444
> zone/rev zone 2/2 nw_proto/rev nw_proto 6/6
>
> ovs-appctl dpctl/dump-conn

Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-04 Thread Plato, Michael via discuss
Hi,
from my perspective the patch works for all cases. My test environment runs 
with several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution 
Sent: Tuesday, 4 April 2023 09:41
To: Plato, Michael 
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi Michael,

Is your patch working on the same subnet unreachable traffic too. In my case, 
crashes happen when too many unreachable replies even from the same subnet. For 
example, when one of the etcd instances is down, there will be huge 
reconnection attempts and then unreachable replies from the destination VM 
where the down etcd instance exists.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael <michael.pl...@tu-berlin.de> wrote:
Hi,
I have some news on this topic. Unfortunately I could not find the root cause. 
But I managed to implement a workaround (see patch in attachment). The basic 
idea is to mark the nat flows as invalid if there is no longer an associated 
connection. From my point of view it is a race condition. It can be triggered 
by many short-lived connections. With the patch we no longer have any crashes. 
I can't say if it has any negative effects though, as I'm not an expert. So far 
I haven't found any problems at least. Without this patch we had hundreds of 
crashes a day :/

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Monday, 3 April 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael <michael.pl...@tu-berlin.de>
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi,

Is this related to following glibc bug? I'm not so sure about this because when 
I check the glibc source of installed version (2.35), the proposed patch has 
been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happen if I use statefull ACL which is 
related to conntrack. The racing situation happen when massive unreachable 
replies are received. For example, if I run etcd on VMs but one etcd node has 
been disabled which causes massive connection attempts and unreachable replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution <mrxlazuar...@gmail.com> wrote:
Hi Michael,

Have you found the solution for this case? I find the same weird problem 
without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK  21.11.2 on Ubuntu 22.04. By the way, this 
problem is disappear after I remove some Kubernutes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" 
mailto:michael.pl...@tu-berlin.de>>
To: "ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>" 
mailto:ovs-discuss@openvswitch.org>>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new openstack infrastructure based on yoga and 
during our testing we observered that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that the ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c .
We have seen that we can trigger this issue when trying to connect from a VM to 
a destination which is unreachable. For example curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says:

conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst ip 
172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444 zone/rev zone 
2/2 nw_proto/rev nw_proto 6/6

ovs-appctl dpctl/dump-conntrack | grep "444"
tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)

Versions:
ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.17.2
DB Schema 8.3.0

ovn-controller --version
ovn-controller 22.03.0
Open vSwitch Library 2.17.0
OpenFlow versions 0x6:0x6
SB DB Schema 20.21.0

DPDK 21.11.2

We are now unsure if this is a misconfiguration or if we hit a bug.

Thanks for any feedback

Michael
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-04 Thread Lazuardi Nasution via discuss
Hi Michael,

Does your patch also work for unreachable traffic within the same subnet? In
my case, crashes happen when too many unreachable replies arrive, even from
the same subnet. For example, when one of the etcd instances is down, there
is a flood of reconnection attempts and then unreachable replies from the
destination VM hosting the down etcd instance.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael 
wrote:

> Hi,
>
> I have some news on this topic. Unfortunately I could not find the root
> cause. But I managed to implement a workaround (see patch in attachment).
> The basic idea is to mark the nat flows as invalid if there is no longer an
> associated connection. From my point of view it is a race condition. It can
> be triggered by many short-lived connections. With the patch we no longer
> have any crashes. I can't say if it has any negative effects though, as I'm
> not an expert. So far I haven't found any problems at least. Without this
> patch we had hundreds of crashes a day :/
>
>
>
> Best regards
>
>
> Michael
>
>
>
> *From:* Lazuardi Nasution 
> *Sent:* Monday, 3 April 2023 13:50
> *To:* ovs-discuss@openvswitch.org
> *Cc:* Plato, Michael 
> *Subject:* Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>
>
>
> Hi,
>
>
>
> Is this related to following glibc bug? I'm not so sure about this because
> when I check the glibc source of installed version (2.35), the proposed
> patch has been applied.
>
>
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=12889
>
>
>
> I can confirm that this problem only happen if I use statefull ACL which
> is related to conntrack. The racing situation happen when massive
> unreachable replies are received. For example, if I run etcd on VMs but one
> etcd node has been disabled which causes massive connection attempts and
> unreachable replies.
>
>
>
> Best regards.
>
>
>
> On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
> wrote:
>
> Hi Michael,
>
>
>
> Have you found the solution for this case? I find the same weird problem
> without any information about which conntrack entries are causing
> this issue.
>
>
>
> I'm using OVS 3.0.1 with DPDK  21.11.2 on Ubuntu 22.04. By the way, this
> problem is disappear after I remove some Kubernutes cluster VMs and some DB
> cluster VMs.
>
>
>
> Best regards.
>
>
>
> Date: Thu, 29 Sep 2022 07:56:32 +
> From: "Plato, Michael" 
> To: "ovs-discuss@openvswitch.org" 
> Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> we are about to roll out our new openstack infrastructure based on yoga
> and during our testing we observered that the openvswitch-switch systemd
> unit restarts several times a day, causing network interruptions for all
> VMs on the compute node in question.
> After some research we found that the ovs-vswitchd crashes with the
> following assertion failure:
>
> "2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
> assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in
> conn_update_state()"
>
> To get more information about the connection that leads to this assertion
> failure, I added some debug code to conntrack.c .
> We have seen that we can trigger this issue when trying to connect from a
> VM to a destination which is unreachable. For example curl
> https://www.google.de:444
>
> Shortly after that we get an assertion and the debug code says:
>
> conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
> src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst
> ip 172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444
> zone/rev zone 2/2 nw_proto/rev nw_proto 6/6
>
> ovs-appctl dpctl/dump-conntrack | grep "444"
>
> tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)
>
> Versions:
> ovs-vsctl --version
> ovs-vsctl (Open vSwitch) 2.17.2
> DB Schema 8.3.0
>
> ovn-controller --version
> ovn-controller 22.03.0
> Open vSwitch Library 2.17.0
> OpenFlow versions 0x6:0x6
> SB DB Schema 20.21.0
>
> DPDK 21.11.2
>
> We are now unsure if this is a misconfiguration or if we hit a bug.
>
> Thanks for any feedback
>
> Michael
>
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-03 Thread Plato, Michael via discuss
Hi,
I have some news on this topic. Unfortunately I could not find the root cause,
but I managed to implement a workaround (see the patch in the attachment). The
basic idea is to mark the NAT flows as invalid if there is no longer an
associated connection (a rough sketch of the idea is below). From my point of
view it is a race condition that can be triggered by many short-lived
connections. With the patch we no longer have any crashes. I can't say whether
it has any negative side effects, as I'm not an expert, but so far I haven't
found any problems. Without this patch we had hundreds of crashes a day :/
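
The attachment is not inlined in the archive, so purely as a hypothetical
sketch of the described idea (toy types only, not the actual patch): when the
lookup returns an un-nat entry whose master connection has already been
reclaimed, flag the flow as invalid instead of hitting the assertion.

/* Hypothetical, self-contained sketch of the workaround idea; a real change
 * would live in lib/conntrack.c.  The toy types below exist only to show
 * the intended control flow. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum toy_conn_type { TOY_CONN_DEFAULT, TOY_CONN_UN_NAT };

struct toy_conn {
    enum toy_conn_type type;
    struct toy_conn *master;    /* parent connection of an un-nat alias */
};

/* Returns false (treat the packet as +inv) instead of asserting when an
 * un-nat alias has lost its master connection. */
static bool
toy_update_state(const struct toy_conn *conn)
{
    if (conn->type != TOY_CONN_DEFAULT) {
        if (!conn->master) {
            return false;         /* orphaned nat entry: mark invalid */
        }
        conn = conn->master;      /* otherwise continue on the master */
    }
    /* ... normal TCP/UDP state update on 'conn' would go here ... */
    return true;
}

int
main(void)
{
    struct toy_conn master = { .type = TOY_CONN_DEFAULT, .master = NULL };
    struct toy_conn alias  = { .type = TOY_CONN_UN_NAT,  .master = &master };
    struct toy_conn orphan = { .type = TOY_CONN_UN_NAT,  .master = NULL };

    printf("alias tracked: %d, orphan tracked: %d\n",
           toy_update_state(&alias), toy_update_state(&orphan));
    return 0;
}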

Best regards

Michael

From: Lazuardi Nasution 
Sent: Monday, 3 April 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael 
Subject: Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

Hi,

Is this related to following glibc bug? I'm not so sure about this because when 
I check the glibc source of installed version (2.35), the proposed patch has 
been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happen if I use statefull ACL which is 
related to conntrack. The racing situation happen when massive unreachable 
replies are received. For example, if I run etcd on VMs but one etcd node has 
been disabled which causes massive connection attempts and unreachable replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution <mrxlazuar...@gmail.com> wrote:
Hi Michael,

Have you found the solution for this case? I find the same weird problem 
without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK  21.11.2 on Ubuntu 22.04. By the way, this 
problem is disappear after I remove some Kubernutes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" 
mailto:michael.pl...@tu-berlin.de>>
To: "ovs-discuss@openvswitch.org<mailto:ovs-discuss@openvswitch.org>" 
mailto:ovs-discuss@openvswitch.org>>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new openstack infrastructure based on yoga and 
during our testing we observered that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that the ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c .
We have seen that we can trigger this issue when trying to connect from a VM to 
a destination which is unreachable. For example curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says:

conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst ip 
172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444 zone/rev zone 
2/2 nw_proto/rev nw_proto 6/6

ovs-appctl dpctl/dump-conntrack | grep "444"
tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)

Versions:
ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.17.2
DB Schema 8.3.0

ovn-controller --version
ovn-controller 22.03.0
Open vSwitch Library 2.17.0
OpenFlow versions 0x6:0x6
SB DB Schema 20.21.0

DPDK 21.11.2

We are now unsure if this is a misconfiguration or if we hit a bug.

Thanks for any feedback

Michael


ovs-conntrack.patch
Description: ovs-conntrack.patch
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-04-03 Thread Lazuardi Nasution via discuss
Hi,

Is this related to the following glibc bug? I'm not so sure, because when I
check the glibc source for the installed version (2.35), the proposed patch
has already been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which
rely on conntrack. The race occurs when massive numbers of unreachable replies
are received, for example when I run etcd on VMs and one etcd node has been
disabled, which causes massive connection attempts and unreachable replies.
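
For what it's worth, the kind of load that seems to trigger the race can be
generated with a small client that keeps opening short-lived connections to a
refused or unreachable port. The address and port below are placeholders, and
this is only a sketch of the scenario, run from a VM behind the stateful ACL:

/* Minimal sketch of the traffic pattern described above: many short-lived
 * connection attempts to a port that answers with a RST or an ICMP
 * unreachable.  Placeholder address and port, not from the real setup. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(2379);                      /* placeholder etcd port */
    inet_pton(AF_INET, "10.0.0.42", &dst.sin_addr);  /* placeholder VM IP */

    for (int i = 0; i < 100000; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }
        /* connect() fails quickly when the peer answers with a RST or an
         * unreachable; every attempt is a new short-lived flow that goes
         * through the stateful ACL / conntrack. */
        connect(fd, (struct sockaddr *) &dst, sizeof dst);
        close(fd);
    }
    return 0;
}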

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
wrote:

> Hi Michael,
>
> Have you found the solution for this case? I find the same weird problem
> without any information about which conntrack entries are causing
> this issue.
>
> I'm using OVS 3.0.1 with DPDK  21.11.2 on Ubuntu 22.04. By the way, this
> problem is disappear after I remove some Kubernutes cluster VMs and some DB
> cluster VMs.
>
> Best regards.
>
>
>> Date: Thu, 29 Sep 2022 07:56:32 +
>> From: "Plato, Michael" 
>> To: "ovs-discuss@openvswitch.org" 
>> Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
>> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Hi,
>>
>> we are about to roll out our new openstack infrastructure based on yoga
>> and during our testing we observered that the openvswitch-switch systemd
>> unit restarts several times a day, causing network interruptions for all
>> VMs on the compute node in question.
>> After some research we found that the ovs-vswitchd crashes with the
>> following assertion failure:
>>
>> "2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
>> assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in
>> conn_update_state()"
>>
>> To get more information about the connection that leads to this assertion
>> failure, I added some debug code to conntrack.c .
>> We have seen that we can trigger this issue when trying to connect from a
>> VM to a destination which is unreachable. For example curl
>> https://www.google.de:444
>>
>> Shortly after that we get an assertion and the debug code says:
>>
>> conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
>> src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst
>> ip 172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444
>> zone/rev zone 2/2 nw_proto/rev nw_proto 6/6
>>
>> ovs-appctl dpctl/dump-conntrack | grep "444"
>>
>> tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)
>>
>> Versions:
>> ovs-vsctl --version
>> ovs-vsctl (Open vSwitch) 2.17.2
>> DB Schema 8.3.0
>>
>> ovn-controller --version
>> ovn-controller 22.03.0
>> Open vSwitch Library 2.17.0
>> OpenFlow versions 0x6:0x6
>> SB DB Schema 20.21.0
>>
>> DPDK 21.11.2
>>
>> We are now unsure if this is a misconfiguration or if we hit a bug.
>>
>> Thanks for any feedback
>>
>> Michael
>>
>>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] ovs-vswitchd crashes serveral times a day

2023-03-20 Thread Lazuardi Nasution via discuss
Hi Michael,

Have you found a solution for this case? I see the same weird problem, without
any information about which conntrack entries are causing the issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, the problem
disappears after I remove some Kubernetes cluster VMs and some DB cluster VMs.

Best regards.


> Date: Thu, 29 Sep 2022 07:56:32 +
> From: "Plato, Michael" 
> To: "ovs-discuss@openvswitch.org" 
> Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
> Message-ID: <8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi,
>
> we are about to roll out our new openstack infrastructure based on yoga
> and during our testing we observered that the openvswitch-switch systemd
> unit restarts several times a day, causing network interruptions for all
> VMs on the compute node in question.
> After some research we found that the ovs-vswitchd crashes with the
> following assertion failure:
>
> "2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
> assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in
> conn_update_state()"
>
> To get more information about the connection that leads to this assertion
> failure, I added some debug code to conntrack.c .
> We have seen that we can trigger this issue when trying to connect from a
> VM to a destination which is unreachable. For example curl
> https://www.google.de:444
>
> Shortly after that we get an assertion and the debug code says:
>
> conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
> src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst
> ip 172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444
> zone/rev zone 2/2 nw_proto/rev nw_proto 6/6
>
> ovs-appctl dpctl/dump-conntrack | grep "444"
>
> tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)
>
> Versions:
> ovs-vsctl --version
> ovs-vsctl (Open vSwitch) 2.17.2
> DB Schema 8.3.0
>
> ovn-controller --version
> ovn-controller 22.03.0
> Open vSwitch Library 2.17.0
> OpenFlow versions 0x6:0x6
> SB DB Schema 20.21.0
>
> DPDK 21.11.2
>
> We are now unsure if this is a misconfiguration or if we hit a bug.
>
> Thanks for any feedback
>
> Michael
>
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss