Based on the discussion in the OVN meeting today, I did some more testing,
and here are my findings.

- With ICMP (ping) between the same source and destination, dp_hash always
selects the same bucket.
- With "nc" specifying the same TCP 5-tuple, packets can end up in
different buckets. This is similar to what Numan and Maciej observed.

However, I was testing with the OVN ECMP feature instead of LB. Since the
ECMP feature doesn't use conntrack, here are some more findings. The bucket
selection alternates between only 2 buckets, and the change happens when a
packet's processing moves between userspace and the kernel datapath. Say
the first packet of a flow (megaflow) goes to userspace and hits bucket1;
if I then send the second and subsequent packets immediately, they all hit
bucket2, but if I wait until the flow disappears from the megaflow cache
and then send the next packet, it hits bucket1 again. This behavior is
consistent.

So I think the different buckets are selected because dp_hash is
implemented differently in the userspace and kernel datapaths (thanks Ilya
for this hint in today's meeting).
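
To make this theory concrete, here is a toy C sketch. This is NOT the OVS
code: the two hash functions are made-up stand-ins for whatever hash the
userspace path computes and for the kernel's skb_get_hash(), and the
addresses and ports are arbitrary. It just shows how two different hash
algorithms over the same 5-tuple can easily land in different buckets:

    #include <stdint.h>
    #include <stdio.h>

    struct tuple5 {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t proto;
    };

    /* Stand-in for whatever hash the userspace path computes (FNV-1a
     * style here, purely for illustration). */
    static uint32_t
    hash_a(const uint32_t *w, int n)
    {
        uint32_t h = 2166136261u;
        for (int i = 0; i < n; i++) {
            h = (h ^ w[i]) * 16777619u;
        }
        return h;
    }

    /* Stand-in for the kernel's skb_get_hash() (a different mix, again
     * purely for illustration). */
    static uint32_t
    hash_b(const uint32_t *w, int n)
    {
        uint32_t h = 42;    /* arbitrary basis, as in flow_hash_5tuple() */
        for (int i = 0; i < n; i++) {
            h = h * 0x9e3779b1u + w[i];
        }
        return h;
    }

    int
    main(void)
    {
        const unsigned int n_buckets = 2;

        for (uint16_t sport = 40000; sport < 40004; sport++) {
            struct tuple5 t = { 0x0a000001, 0x0a000002, sport, 80, 6 };
            uint32_t w[4] = {
                t.src_ip, t.dst_ip,
                (uint32_t) t.src_port << 16 | t.dst_port, t.proto,
            };

            /* The same 5-tuple picks bucket1 or bucket2 depending on
             * which hash function computed the value. */
            printf("sport %u: hash_a -> bucket%u, hash_b -> bucket%u\n",
                   (unsigned) sport,
                   hash_a(w, 4) % n_buckets + 1,
                   hash_b(w, 4) % n_buckets + 1);
        }
        return 0;
    }

Whether the two stand-ins agree for a given tuple is essentially a coin
flip, which matches what I observe: at most 2 buckets per 5-tuple,
determined by which path hashed the packet.
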
Numan/Maciej, in your tests did you see more than 2 buckets hit for the
same 5-tuple? If the above theory is right, you should see at most 2
buckets hit. For LB, since it uses CT and only the first packet of a
connection goes through the group, all packets of the same flow would
always be forwarded to the same LB backend. I guess if you wait long enough
between tests, you should see all of them hitting the same backend. It
would be great if you could confirm this.

For ECMP, this behavior will occasionally cause out-of-order packets even
within a single flow (for a burst of packets after some idle time), because
CT is not used (and we can't use it, because when peered with physical ECMP
router groups we can't ensure that the return traffic from the physical
routers hits the same LR).

For LB, it causes the unexpected behavior reported in this thread.

For the fix, I think we should figure out how to make dp_hash always use
the same hash algorithm in both the userspace and kernel implementations,
if possible.
I am OK with Numan's patch adding the capability to configure the desired
hash method instead of always using the default. However, using "hash" may
be a huge performance sacrifice, since the packets are then always handled
in the slow path, especially for ECMP. Even though LB uses CT, for
short-lived flows this is still a big performance penalty (for long-lived
LB flows it may be OK, since the majority of packets are still handled in
the fast path).
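
For reference, here is my rough mental model of dp_hash, based on the
ovs-ofctl documentation Daniel quotes later in this thread (again, a sketch
with made-up numbers, not the actual OVS implementation): the datapath
supplies a packet hash, OVS masks it down to at most 64 slots, and each
slot is pre-assigned to a bucket, so the bucket choice is fully determined
by whichever hash the datapath handed over:

    #include <stdint.h>
    #include <stdio.h>

    #define N_SLOTS 64   /* the doc says at most 64 dp_hash values */

    int
    main(void)
    {
        const unsigned int n_buckets = 2;
        unsigned int slot_to_bucket[N_SLOTS];

        /* Pre-assign slots to buckets round-robin; the real mapping also
         * has to respect bucket weights, which is why the doc mentions
         * falling back to "hash" when the weights can't be represented
         * accurately. */
        for (int i = 0; i < N_SLOTS; i++) {
            slot_to_bucket[i] = i % n_buckets;
        }

        /* Hypothetical hash values for the *same* 5-tuple as computed by
         * the two paths; the numbers are made up. */
        uint32_t hash_from_kernel = 0x3a7f19c2;    /* e.g. skb_get_hash() */
        uint32_t hash_from_userspace = 0x91b04e55; /* e.g. userspace hash */

        printf("kernel hash    -> bucket%u\n",
               slot_to_bucket[hash_from_kernel % N_SLOTS] + 1);
        printf("userspace hash -> bucket%u\n",
               slot_to_bucket[hash_from_userspace % N_SLOTS] + 1);
        return 0;
    }

If that model is right, the slot-to-bucket table itself is deterministic;
the inconsistency comes entirely from which hash value gets fed into it,
which is why making both paths compute the same hash should be enough.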

I am not familiar with the dp_hash implementation. I will study it some
more, but any ideas on how to ensure the consistency of dp_hash are highly
appreciated!

Thanks,
Han

On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <dalva...@redhat.com> wrote:
>
> Thanks Numan for the investigation and the great explanation!
>
> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <num...@ovn.org> wrote:
>>
>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <zhou...@gmail.com> wrote:
>> >
>> >
>> >
>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <mjoze...@redhat.com> wrote:
>> > >
>> > > Hello!
>> > >
>> > > I would like to ask you to clarify how the OVN Load balancing
>> > > algorithm works.
>> > >
>> > > Based on the action [1]:
>> > > 1) If connection is alive the same 'backend' will be chosen,
>> > >
>> > > 2) If it is a new connection the backend will be chosen based on
>> > > selection_method=dp_hash [2].
>> > > Based on changelog the dp_hash uses '5 tuple hash' [3].
>> > > The hash is calculated based on values: source and destination IP,
>> > > source port, protocol and arbitrary value - 42. [4]
>> > > Based on that information we could name it SOURCE_IP_PORT.
>> > >
>> > > Unfortunately we recently got a bug report in the OVN Octavia provider
>> > > driver project, that the load balancing in OVN works differently [5].
>> > > The report shows that even when the test uses the same source ip and
>> > > port, but a new TCP connection, traffic is randomly distributed, but
>> > > based on [2] it shouldn't be?
>> > >
>> > > Is it a bug?  Is something else taken into account while creating a
>> > > hash? Can it be fixed in OVS/OVN?
>> > >
>> > >
>> > >
>> > > Thanks,
>> > > Maciej
>> > >
>> > >
>> > > [1] https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
>> > > [2] https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
>> > > [3] https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
>> > > [4] https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
>> > >
>> > > --
>> > > Best regards,
>> > > Maciej Józefczyk
>> >
>> > Hi Maciej,
>> >
>> > Thanks for reporting. It is definitely strange that the same 5-tuple
>> > flow resulted in hitting different backends. I didn't observe such
>> > behavior before (maybe I should try again myself to confirm). Can you
>> > make sure the group buckets didn't change during the testing? You can
>> > do so by:
>> > # ovs-ofctl dump-groups br-int
>> > and also check the group stats to see whether multiple buckets had
>> > their counters increased during the test:
>> > # ovs-ofctl dump-group-stats br-int [group]
>> >
>> > For the 5-tuple hash function, you are seeing flow_hash_5tuple(); it
>> > uses all of the 5-tuple fields. It adds both ports at once:
>> >         /* Add both ports at once. */
>> >         hash = hash_add(hash,
>> >                         ((const uint32_t *)flow)[offsetof(struct flow, tp_src)
>> >                                                  / sizeof(uint32_t)]);
>> >
>> > The tp_src is the start of the offset, and the size is 32 bits, meaning
>> > both the src and dst ports, each 16 bits. (Although I am not sure whether
>> > the dp_hash method uses this function or not. Need to check more code.)
>> >
>> > BTW, I am not sure why Neutron gives it the name SOURCE_IP_PORT. Shouldn't
>> > it be called just 5-TUPLE, since the protocol, destination IP, and port are
>> > also considered in the hash?
>> >
>>
>>
>> Hi Maciej and Han,
>>
>> I did some testing and I can confirm what you're saying. OVN is not
>> choosing the same backend with the src ip and src port fixed.
>>
>> I think there is an issue with how OVN is programming the group
>> flows.  OVN is setting the selection_method as dp_hash.
>> But when ovs-vswitchd receives the GROUP_MOD OpenFlow message, I
>> noticed that the selection_method is not set.
>> From the code I see that selection_method will be encoded only if
>> ovn-controller uses OpenFlow version 1.5 [1].
>>
>> Since selection_method is NULL, vswitchd uses the dp_hash method [2].
>> dp_hash means it uses the hash calculated by
>> the datapath. In the case of the kernel datapath, from what I understand
>> it uses skb_get_hash().
>>
>> I modified the vswitchd code to use the selection_method "hash" if
>> selection_method is not set. In this case the load balancer
>> works as expected. For a fixed src ip, src port, dst ip and dst port,
>> the group action always selects the same bucket. [3]
>>
>> I think we need to fix a few issues in OVN:
>>   - Use OpenFlow 1.5 so that OVN can set selection_method.
>>   - Use the "hash" method if dp_hash is not choosing the same bucket for
>>     the same 5-tuple.
>>   - Maybe provide the option for the CMS to choose an algorithm, i.e.
>>     to use dp_hash or hash.
>>
> I'd rather not expose this to the CMS as it depends on the datapath
> implementation as per [0], but maybe it makes sense to eventually abstract
> it to the CMS in a more LB-ish way (common algorithm names used in load
> balancing) in case at some point the LB feature is enhanced somehow to
> support more algorithms.
>
> I believe that for OVN LB users, using OF 1.5 to force the use of 'hash'
> would be the best solution now.
>
> My 2 cents as I'm not an LB expert.
>
> I also recall that we tested this in the past and it seemed to be working.
> I have been checking further in the doc [0] and found this paragraph:
>
> "If no selection method is specified, Open vSwitch up to release 2.9
applies the hash method with default fields. From 2.10 onwards Open vSwitch
defaults to the dp_hash method with symmetric L3/L4 hash algorithm, unless
the weighted group buck‐ ets cannot be mapped to a maximum of 64 dp_hash
values with sufficient accuracy. In those rare cases Open vSwitch 2.10 and
later fall back to the hash method with the default set of hash fields."
>
> The explanation seems to be that when we tested the feature we relied on
> OVS 2.9 and hence the confusion.
>
> Thanks a lot again!
> Daniel
>
> [0] http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
>
>>
>> I'll look into how to support this.
>>
>> [1] - https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
>>        https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
>>
>> [2] - https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
>> [3] - https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
>>
>>
>> Thanks
>> Numan
>>
>>
>> > Thanks,
>> > Han
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
