On Tue, Jul 18, 2023 at 6:16 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 7/18/23 12:14, Han Zhou wrote:
> > On Mon, Jul 17, 2023 at 9:51 PM Dumitru Ceara <dce...@redhat.com> wrote:
> >>
> >> On 7/14/23 08:41, Ales Musil wrote:
> >>> On Thu, Jul 13, 2023 at 4:39 PM Dumitru Ceara <dce...@redhat.com>
wrote:
> >>>
> >>>> Regular dp-hash is not a canonical L4 hash (at least with the netlink
> >>>> datapath).  If the datapath supports l4 symmetrical dp-hash use that
> > one
> >>>> instead.
> >>>>
> >>>> Reported-at: https://github.com/ovn-org/ovn/issues/112
> >>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2188679
> >>>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
> >>>>
> >>>
> >>> Hi Dumitru,
> >>>
> >>>
> >>>> ---
> >>>>  include/ovn/features.h |  2 ++
> >>>>  lib/actions.c          |  6 ++++++
> >>>>  lib/features.c         | 49
+++++++++++++++++++++++++++++++++---------
> >>>>  3 files changed, 47 insertions(+), 10 deletions(-)
> >>>>
> >>>> diff --git a/include/ovn/features.h b/include/ovn/features.h
> >>>> index de8f1f5489..3bf536127f 100644
> >>>> --- a/include/ovn/features.h
> >>>> +++ b/include/ovn/features.h
> >>>> @@ -34,12 +34,14 @@ enum ovs_feature_support_bits {
> >>>>      OVS_CT_ZERO_SNAT_SUPPORT_BIT,
> >>>>      OVS_DP_METER_SUPPORT_BIT,
> >>>>      OVS_CT_TUPLE_FLUSH_BIT,
> >>>> +    OVS_DP_HASH_L4_SYM_BIT,
> >>>>  };
> >>>>
> >>>>  enum ovs_feature_value {
> >>>>      OVS_CT_ZERO_SNAT_SUPPORT = (1 << OVS_CT_ZERO_SNAT_SUPPORT_BIT),
> >>>>      OVS_DP_METER_SUPPORT = (1 << OVS_DP_METER_SUPPORT_BIT),
> >>>>      OVS_CT_TUPLE_FLUSH_SUPPORT = (1 << OVS_CT_TUPLE_FLUSH_BIT),
> >>>> +    OVS_DP_HASH_L4_SYM_SUPPORT = (1 << OVS_DP_HASH_L4_SYM_BIT),
> >>>>  };
> >>>>
> >>>>  void ovs_feature_support_destroy(void);
> >>>> diff --git a/lib/actions.c b/lib/actions.c
> >>>> index 037172e606..9d52cb75a8 100644
> >>>> --- a/lib/actions.c
> >>>> +++ b/lib/actions.c
> >>>> @@ -1625,6 +1625,12 @@ encode_SELECT(const struct ovnact_select
> > *select,
> >>>>      struct ds ds = DS_EMPTY_INITIALIZER;
> >>>>      ds_put_format(&ds, "type=select,selection_method=dp_hash");
> >>>>
> >>>> +    if (ovs_feature_is_supported(OVS_DP_HASH_L4_SYM_SUPPORT)) {
> >>>> +        /* Select dp-hash l4_symmetric by setting the upper 32bits
of
> >>>> +         * selection_method_param to 1: */
> >>>>
> >>>
> >>> This comment is a bit unfortunate, because it reads like you want to
set
> >>> all bits of the upper half
> >>> to 1 e.g. 0xffffffff. Maybe change it to: "selection_method_param to
> > value
> >>> 1 (1 << 32)." during merge, wdyt?
> >>>
> >>>
> >>
> >> Makes sense, thanks for the suggestion!  I changed it accordingly.
> >>
> >>>> +        ds_put_cstr(&ds, ",selection_method_param=0x100000000");
> >>>> +    }
> >>>> +
> >>>>      struct mf_subfield sf = expr_resolve_field(&select->res_field);
> >>>>
> >>>>      for (size_t bucket_id = 0; bucket_id < select->n_dsts;
> > bucket_id++) {
> >>>> diff --git a/lib/features.c b/lib/features.c
> >>>> index 97c9c86f00..d24e8f6c5c 100644
> >>>> --- a/lib/features.c
> >>>> +++ b/lib/features.c
> >>>> @@ -21,6 +21,7 @@
> >>>>  #include "lib/dirs.h"
> >>>>  #include "socket-util.h"
> >>>>  #include "lib/vswitch-idl.h"
> >>>> +#include "odp-netlink.h"
> >>>>  #include "openvswitch/vlog.h"
> >>>>  #include "openvswitch/ofpbuf.h"
> >>>>  #include "openvswitch/rconn.h"
> >>>> @@ -33,20 +34,48 @@ VLOG_DEFINE_THIS_MODULE(features);
> >>>>
> >>>>  #define FEATURES_DEFAULT_PROBE_INTERVAL_SEC 5
> >>>>
> >>>> +/* Parses 'cap_name' from 'ovs_capabilities' and returns whether the
> >>>> + * type of capability is supported or not. */
> >>>> +typedef bool ovs_feature_parse_func(const struct smap
> > *ovs_capabilities,
> >>>> +                                    const char *cap_name);
> >>>> +
> >>>>  struct ovs_feature {
> >>>>      enum ovs_feature_value value;
> >>>>      const char *name;
> >>>> +    ovs_feature_parse_func *parse;
> >>>>  };
> >>>>
> >>>> +static bool
> >>>> +bool_parser(const struct smap *ovs_capabilities, const char
*cap_name)
> >>>> +{
> >>>> +    return smap_get_bool(ovs_capabilities, cap_name, false);
> >>>> +}
> >>>> +
> >>>> +static bool
> >>>> +dp_hash_l4_sym_support_parser(const struct smap *ovs_capabilities,
> >>>> +                              const char *cap_name OVS_UNUSED)
> >>>> +{
> >>>> +    int max_hash_alg = smap_get_int(ovs_capabilities,
"max_hash_alg",
> > 0);
> >>>> +
> >>>> +    return max_hash_alg == OVS_HASH_ALG_SYM_L4;
> >>>> +}
> >>>> +
> >>>>  static struct ovs_feature all_ovs_features[] = {
> >>>>      {
> >>>>          .value = OVS_CT_ZERO_SNAT_SUPPORT,
> >>>> -        .name = "ct_zero_snat"
> >>>> +        .name = "ct_zero_snat",
> >>>> +        .parse = bool_parser,
> >>>>      },
> >>>>      {
> >>>>          .value = OVS_CT_TUPLE_FLUSH_SUPPORT,
> >>>> -        .name = "ct_flush"
> >>>> -    }
> >>>> +        .name = "ct_flush",
> >>>> +        .parse = bool_parser,
> >>>> +    },
> >>>> +    {
> >>>> +        .value = OVS_DP_HASH_L4_SYM_SUPPORT,
> >>>> +        .name = "dp_hash_l4_sym_support",
> >>>> +        .parse = dp_hash_l4_sym_support_parser,
> >>>> +    },
> >>>>  };
> >>>>
> >>>>  /* A bitmap of OVS features that have been detected as 'supported'.
*/
> >>>> @@ -65,6 +94,7 @@ ovs_feature_is_valid(enum ovs_feature_value
feature)
> >>>>      case OVS_CT_ZERO_SNAT_SUPPORT:
> >>>>      case OVS_DP_METER_SUPPORT:
> >>>>      case OVS_CT_TUPLE_FLUSH_SUPPORT:
> >>>> +    case OVS_DP_HASH_L4_SYM_SUPPORT:
> >>>>          return true;
> >>>>      default:
> >>>>          return false;
> >>>> @@ -183,18 +213,17 @@ ovs_feature_support_run(const struct smap
> >>>> *ovs_capabilities,
> >>>>      }
> >>>>
> >>>>      for (size_t i = 0; i < ARRAY_SIZE(all_ovs_features); i++) {
> >>>> -        enum ovs_feature_value value = all_ovs_features[i].value;
> >>>> -        const char *name = all_ovs_features[i].name;
> >>>> -        bool old_state = supported_ovs_features & value;
> >>>> -        bool new_state = smap_get_bool(ovs_capabilities, name,
false);
> >>>> +        struct ovs_feature *feature = &all_ovs_features[i];
> >>>> +        bool old_state = supported_ovs_features & feature->value;
> >>>> +        bool new_state = feature->parse(ovs_capabilities,
> > feature->name);
> >>>>          if (new_state != old_state) {
> >>>>              updated = true;
> >>>>              if (new_state) {
> >>>> -                supported_ovs_features |= value;
> >>>> +                supported_ovs_features |= feature->value;
> >>>>              } else {
> >>>> -                supported_ovs_features &= ~value;
> >>>> +                supported_ovs_features &= ~feature->value;
> >>>>              }
> >>>> -            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s", name,
> >>>> +            VLOG_INFO_RL(&rl, "OVS Feature: %s, state: %s",
> > feature->name,
> >>>>                           new_state ? "supported" : "not supported");
> >>>>          }
> >>>>      }
> >>>> --
> >>>> 2.31.1
> >>>>
> >>>> _______________________________________________
> >>>> dev mailing list
> >>>> d...@openvswitch.org
> >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>>
> >>>>
> >>> Other than that it looks good.
> >>>
> >>> Acked-by: Ales Musil <amu...@redhat.com>
> >>>
> >>
> >> Thanks, I applied this to main.
> >>
> >> I was wondering if we should backport this though.  It's not exactly a
> >> bug fix but it does avoid using a broken datapath feature (when
> >> possible) so it avoids a broken behavior.
> >>
> >> The changes are quite contained.
> >>
> >> CC-ing some of the maintainers explicitly to get some more eyes on
this.
> >>
> > Thanks Dumitru. Sorry that I was following the reported issue. I tried
to
> > look for some more details, but I am not authorized to access bug
2175716.
>
> Sorry about that, that bug used to be public but some internal policies
> made it private.
>
> > It is not quite clear to me what the impact of this problem is. The OVN
> > issue (112) seems to be a very common configuration of OVN ECMP and I
> > assume the problem is not easily reproduced because I have been
> > testing/deploying the feature and it worked as expected. Do you have
more
> > details about the trigger? Are there any related fixes in OVS and
kernel?
> >
>
> The original bug report came from OpenShift:
> https://issues.redhat.com/browse/OCPBUGS-7406
>
> CC-ing Patryk as he root caused the original issue.
>
> The simplified topology is:
>                               +--- GW1 ---+
> client ---- cluster-router ---+           +--- (out of OVN) --- server
>                               +--- GW2 ---+
>
> On cluster-router there's an ECMP route to reach 'server' via:
> - GW1
> - GW2
>
> On GW1 traffic going out of OVN is SNATed to IP1.
> On GW2 traffic going out of OVN is SNATed to IP2.
>
> The client initiates a TCP session towards the server; a path is chosen
> based on dp-hash, e.g. GW2.
>
> From this point on, during normal operation, traffic on this session
> flows fine via GW2.
>
> If the client closes the connection, then for the closing connection
> there's an exception when computing the dp-hash if the socket is in
> state TIME_WAIT, substate TCP_FIN_WAIT_2:
> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L2218
> In this last case we call:
>
> tcp_v4_timewait_ack(sk, skb)
> |
> tcp_v4_send_ack(sk, skb, ...)
>
> The latter uses a global, per-cpu "ctl_sk" to build the last ACK packet:
> https://elixir.bootlin.com/linux/v6.2.1/source/net/ipv4/tcp_ipv4.c#L925
>
>         ctl_sk = this_cpu_read(ipv4_tcp_sk);
>         sock_net_set(ctl_sk, net);
>         ctl_sk->sk_mark = (sk->sk_state == TCP_TIME_WAIT) ?
>                            inet_twsk(sk)->tw_mark : sk->sk_mark;
>         ctl_sk->sk_priority = (sk->sk_state == TCP_TIME_WAIT) ?
>                            inet_twsk(sk)->tw_priority : sk->sk_priority;
>         transmit_time = tcp_transmit_time(sk);
>         ip_send_unicast_reply(ctl_sk,
>                               skb, &TCP_SKB_CB(skb)->header.h4.opt,
>                               ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>                               &arg, arg.iov[0].iov_len,
>                               transmit_time);
>
> As ctl_sk's sk_txhash is 0, when transmitting this last ack, its hash is
> computed from the packet contents:
>
https://elixir.bootlin.com/linux/v6.2.1/source/include/linux/skbuff.h#L1540
>
> static inline __u32 skb_get_hash(struct sk_buff *skb)
> {
>         if (!skb->l4_hash && !skb->sw_hash)
>                 __skb_get_hash(skb);
>
>         return skb->hash;
> }
> [...]
> void __skb_get_hash(struct sk_buff *skb)
> {
>         struct flow_keys keys;
>         u32 hash;
>
>         __flow_hash_secret_init();
>
>         hash = ___skb_get_hash(skb, &keys, &hashrnd);
>
>         __skb_set_sw_hash(skb, hash, flow_keys_have_l4(&keys));
> }
>
> But the original socket's txhash was actually a random hash:
> https://elixir.bootlin.com/linux/v6.2.1/source/include/net/sock.h#L2128
>
> static inline void sk_set_txhash(struct sock *sk)
> {
>         /* This pairs with READ_ONCE() in skb_set_hash_from_sk() */
>         WRITE_ONCE(sk->sk_txhash, net_tx_rndhash());
> }
>
> This means that the last ack might enter the ovs datapath with a
> different dp_hash than all other packets sent on that session in that
> direction. It's possible that this last ack gets hashed to a different
> path of the ECMP route.
>
> This specific issue was fixed in the kernel through:
>
>
https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#mc84997c4cc755354e5527b41b7a15869536a21f4
>
> However, while debugging we realized the problem is actually visible
> with _any_ retransmit.  That's because a retransmit will trigger a rehash:
>
> https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L4124
>
> int tcp_rtx_synack(const struct sock *sk, struct request_sock *req)
> {
>         const struct tcp_request_sock_ops *af_ops =
tcp_rsk(req)->af_specific;
>         struct flowi fl;
>         int res;
>
>         /* Paired with WRITE_ONCE() in sock_setsockopt() */
>         if (READ_ONCE(sk->sk_txrehash) == SOCK_TXREHASH_ENABLED)
>                 tcp_rsk(req)->txhash = net_tx_rndhash();
>         res = af_ops->send_synack(sk, NULL, &fl, req, NULL,
TCP_SYNACK_NORMAL,
>                                   NULL);
> [...]
>
> So suddenly all subsequent packets (skbs) will have a different 'hash'
> value and might get hashed to a different path.
>
> This last part was not fixed.  After a discussion on the netdev mailing
> list the conclusion was to not change the hash behavior for locally
> terminated TCP sessions:
>
>
https://lore.kernel.org/all/20230511093456.672221-1-aten...@kernel.org/T/#m1fd8da6d912d5acead66c4c397396e6ac4b160dd
>
> Eventually the documentation change patch was dropped and a v2 of the
> other fixes was accepted:
> https://www.spinics.net/lists/netdev/msg906252.html
>
> The rehash on retransmit problem is still present though making the
> kernel's dp-hash implementation unusable.
>
> During some internal discussions Aaron (in CC) pointed out that OVS
> supports L4 symmetric hashing and that should be a stable L4 hash.  OVS
> already reports the maximum supported hashing algorithm through the
> Datapath.Capabilities.max_hash_alg ovsdb field.  The only problem was
> that the kernel datapath didn't support L4 symmetric hashing.  But Aaron
> posted a patch to implement that and the patch got accepted:
>
> https://www.spinics.net/lists/netdev/msg911045.html
>
> Then we moved OVN to use L4 symmetric hashing if available.  That's to
> ensure correctness of the packet processing.  We did run performance
> tests and the hit due to the extra hashing was acceptable, max 3%
> throughput degradation in OVN scenarios that simulate common traffic
> patterns (like ovn-kubernetes ones) - in most cases the performance
> degradation was around 2%.
>
> Sorry for the long story but I hope it provides all the required
> context.  Thinking more about it I should've probably added some of this
> to the original commit log, I apologize.
>
> Regards,
> Dumitru
>

Thanks Dumitru for the details! It is now quite clear and I am not against
backporting it.
But I have a question regarding upgrading. For the existing traffic, does
this mean the packets may choose a different path after OVN upgrading?
Should this be called out?

Regards,
Han
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to