On 20.01.2020 17:09, Ilya Maximets wrote:
> On 08.01.2020 06:18, Vishal Deep Ajmera wrote:
>> Problem:
>> --------
>> In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
>> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
>> "HASH" and "RECIRC" datapath actions. After recirculation, the packet is
>> forwarded to the bond member port based on 8 bits of the datapath hash
>> value computed through dp_hash. This causes performance degradation in the
>> following ways:
>>
>> 1. The recirculation of the packet implies another lookup of the packet’s
>> flow key in the exact match cache (EMC) and potentially Megaflow classifier
>> (DPCLS). This is the biggest cost factor.
>>
>> 2. The recirculated packets have a new “RSS” hash and compete with the
>> original packets for the scarce number of EMC slots. This implies more
>> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
>>
>> 3. The 256 extra megaflow entries per bond for dp_hash bond selection put
>> additional load on the revalidation threads.
>>
>> Owing to this performance degradation, deployments stick to “balance-slb”
>> bond mode even though it does not do active-active load balancing for
>> VXLAN- and GRE-tunnelled traffic, because all tunnel packets have the same
>> source MAC address.
>>
>> Proposed optimization:
>> ----------------------
>> This proposal introduces a new load-balancing output action instead of
>> recirculation.
>>
>> Maintain one table per bond (it could just be an array of uint16's) and
>> program it from the ofproto layer the same way internal flows are created
>> today, with one entry per possible hash value (256 entries). Use this table
>> to load-balance flows as part of output action processing, as sketched below.
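>>
>> A minimal sketch of such a per-bond table (illustrative only; the structure
>> and field names below are hypothetical, not part of the patch, and the usual
>> OVS headers are assumed, e.g. openvswitch/types.h for odp_port_t):
>>
>>     #define BOND_BUCKETS 256    /* One bucket per possible 8-bit hash. */
>>
>>     /* Hypothetical per-bond mapping from hash bucket to member port.
>>      * Programmed from the ofproto layer, read in the datapath. */
>>     struct bond_entry_map {
>>         uint32_t bond_id;                    /* Bond identifier. */
>>         odp_port_t member_map[BOND_BUCKETS]; /* hash & 0xff -> out port. */
>>     };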
>>
>> Currently xlate_normal() -> output_normal() ->
>> bond_update_post_recirc_rules() -> bond_may_recirc() and
>> compose_output_action__() generate the “dp_hash(hash_l4(0))” and
>> “recirc(<RecircID>)” actions.  In this case the RecircID identifies the
>> bond.  For the recirculated packets the ofproto layer installs megaflow
>> entries that match on the RecircID and the masked dp_hash and send them to
>> the corresponding output port.
>>
>> Instead, we will now generate action as
>>     "lb_output(bond,<bond id>)"
>>
>> This combines hash computation (only if needed; otherwise the RSS hash is
>> re-used) and inline load-balancing over the bond.  This action is used
>> *only* for balance-tcp bonds in the OVS-DPDK datapath (the OVS kernel
>> datapath remains unchanged).
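>>
>> A rough sketch of how the datapath could process lb_output(), reusing the
>> hypothetical table above (dp_packet_rss_valid(), dp_packet_get_rss_hash()
>> and flow_hash_5tuple() are existing helpers; the function itself is only
>> an illustration, not the actual patch code):
>>
>>     /* Pick the bond member for 'packet' without recirculation. */
>>     static odp_port_t
>>     lb_output_select_member(const struct bond_entry_map *bond,
>>                             struct dp_packet *packet,
>>                             const struct flow *flow)
>>     {
>>         uint32_t hash;
>>
>>         if (dp_packet_rss_valid(packet)) {
>>             hash = dp_packet_get_rss_hash(packet); /* Re-use RSS hash. */
>>         } else {
>>             hash = flow_hash_5tuple(flow, 0);      /* L4 5-tuple hash. */
>>         }
>>         return bond->member_map[hash & (BOND_BUCKETS - 1)];
>>     }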
>>
>> Example:
>> --------
>> Current scheme:
>> ---------------
>> With 1 IP-UDP flow:
>>
>> flow-dump from pmd on cpu core: 2
>> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
>>  packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
>>
>> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:2828937, bytes:181051968, used:0.000s, actions:2
>>
>> With 8 IP-UDP flows (with random UDP src ports, all hitting the same DPCLS):
>>
>> flow-dump from pmd on cpu core: 2
>> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
>>  packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
>>
>> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:377395, bytes:24153280, used:0.000s, actions:2
>> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:333486, bytes:21343104, used:0.000s, actions:1
>> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:348461, bytes:22301504, used:0.000s, actions:1
>> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:633353, bytes:40534592, used:0.000s, actions:2
>> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:319901, bytes:20473664, used:0.001s, actions:2
>> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:334985, bytes:21439040, used:0.001s, actions:1
>> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>>  packets:326404, bytes:20889856, used:0.001s, actions:1
>>
>> New scheme:
>> -----------
>> We can make do with a single flow entry (for any number of flows):
>>
>> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
>>  packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(bond,1)
>>
>> A new CLI command has been added to dump the datapath bond cache, as shown below.
>>
>> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”
>>
>> root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show
>> Bond cache:
>>         bond-id 1 :
>>                 bucket 0 - slave 2
>>                 bucket 1 - slave 1
>>                 bucket 2 - slave 2
>>                 bucket 3 - slave 1
>>
>> Performance improvement:
>> ------------------------
>> With a prototype of the proposed idea, the following performance improvement
>> is seen with Phy-VM-Phy UDP traffic and a single flow.  With multiple flows
>> the improvement is even greater (due to the reduced number of datapath
>> flows).
>>
>> 1 VM:
>> *****
>> +--------------------------------------+
>> |                 mpps                 |
>> +--------------------------------------+
>> | Flows   master   with-opt.   %delta  |
>> +--------------------------------------+
>> | 1       4.47     5.73        28.21   |
>> | 10      4.17     5.35        28.45   |
>> | 400     3.52     5.39        53.01   |
>> | 1k      3.41     5.25        53.78   |
>> | 10k     2.53     4.57        80.44   |
>> | 100k    2.32     4.27        83.59   |
>> | 500k    2.32     4.27        83.59   |
>> +--------------------------------------+
>> mpps: million packets per second.
>> Packet size: 64 bytes.
>>
>> Signed-off-by: Manohar Krishnappa Chidambaraswamy <man...@gmail.com>
>> Co-authored-by: Manohar Krishnappa Chidambaraswamy <man...@gmail.com>
>> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com>
>>
>> CC: Jan Scheurich <jan.scheur...@ericsson.com>
>> CC: Venkatesan Pradeep <venkatesan.prad...@ericsson.com>
>> CC: Nitin Katiyar <nitin.kati...@ericsson.com>
>> ---
>>  datapath/linux/compat/include/linux/openvswitch.h |   1 +
>>  lib/dpif-netdev.c                                 | 502 ++++++++++++++++++++--
>>  lib/dpif-netlink.c                                |   3 +
>>  lib/dpif-provider.h                               |   9 +
>>  lib/dpif.c                                        |  49 +++
>>  lib/dpif.h                                        |   7 +
>>  lib/odp-execute.c                                 |   2 +
>>  lib/odp-util.c                                    |   4 +
>>  ofproto/bond.c                                    |  56 ++-
>>  ofproto/bond.h                                    |   9 +
>>  ofproto/ofproto-dpif-ipfix.c                      |   1 +
>>  ofproto/ofproto-dpif-sflow.c                      |   1 +
>>  ofproto/ofproto-dpif-xlate.c                      |  42 +-
>>  ofproto/ofproto-dpif.c                            |  24 ++
>>  ofproto/ofproto-dpif.h                            |   9 +-
>>  tests/lacp.at                                     |   9 +
>>  vswitchd/bridge.c                                 |   6 +
>>  vswitchd/vswitch.xml                              |  23 +
>>  18 files changed, 695 insertions(+), 62 deletions(-)
> 
> I started reviewing, but stopped at some point because I realized
> that the current solution is over-complicated (too many unnecessary
> memory copies) and not thread-safe.
> See some comments inline.
>
> I'll follow up with suggestions on how to implement this functionality
> in a different way, since I don't think that it's possible to make the
> current version work correctly.

So, the root cause of all the issues in this patch, in my understanding,
is the fact that you need to collect statistics for all the bond hashes
in order to be able to rebalance traffic.  This forces you to have access
to the PMD-local caches.

The basic idea for overcoming this issue is not to keep a PMD-local bond
cache, but to use an RCU-protected data structure instead.

The memory layout scheme in the current patch consists of 3 layers:
  1. Global hash map for all bonds. One for the whole datapath.
     Protected by dp->bond_mutex.
  2. Hash map of all the bonds. One for each PMD thread.
     Protected by pmd->bond_mutex.
  3. Hash map of all the bonds. A local copy that can be used locklessly,
     but only by the owning PMD thread.

Suggested layout #1:
  Single global concurrent hash map (cmap) for all bonds. One for the whole
  datapath.  Bond addition/deletion protected by the dp->bond_mutex.
  Reads are lockless since protected by RCU.  Statistics updates must be
  fully atomic (i.e. atomic_add_relaxed()).
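
A rough sketch of what a layout #1 entry and its statistics update might
look like (the structure and function names are hypothetical; only the OVS
primitives cmap, atomic_ullong and atomic_add_relaxed() are real):

    #define BOND_BUCKETS 256

    /* Hypothetical datapath-wide bond entry, stored in a single cmap keyed
     * on 'bond_id'.  Readers walk it under RCU; insertion/removal happens
     * under dp->bond_mutex. */
    struct dp_bond {
        struct cmap_node node;       /* In the global bonds cmap. */
        uint32_t bond_id;
        struct {
            odp_port_t member_id;    /* Output port for this hash bucket. */
            atomic_ullong n_bytes;   /* Updated by any PMD thread. */
        } buckets[BOND_BUCKETS];
    };

    /* Per-packet accounting from any PMD thread: the bucket may be hit by
     * several PMDs at once, so the addition must be fully atomic. */
    static inline void
    dp_bond_account(struct dp_bond *bond, uint32_t hash, uint32_t size)
    {
        unsigned long long orig;
        atomic_ullong *n_bytes;

        n_bytes = &bond->buckets[hash & (BOND_BUCKETS - 1)].n_bytes;
        atomic_add_relaxed(n_bytes, size, &orig);
    }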

Suggested layout #2:
  One cmap for each PMD thread (no global one).  Bond addition/deletion
  protected by the pmd->bond_mutex.  Reads are lockless since protected
  by RCU.  Statistics updates should be atomic in terms of reads and writes.
  (non_atomic_ullong_add() function could be used).
  (This is similar to how we store per-PMD flow tables.)
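
  Under layout #2 the same entry would live in a per-PMD cmap, so only the
  owning PMD thread ever writes the counters and the lighter helper that
  already exists in lib/dpif-netdev.c is enough (again just a sketch,
  reusing the hypothetical struct dp_bond above):

    /* Only the owning PMD writes; reads and writes still have to be
     * individually atomic so that other threads can read sane values. */
    static inline void
    pmd_bond_account(struct dp_bond *bond, uint32_t hash, uint32_t size)
    {
        atomic_ullong *n_bytes;

        n_bytes = &bond->buckets[hash & (BOND_BUCKETS - 1)].n_bytes;
        non_atomic_ullong_add(n_bytes, size);
    }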

#1 will consume a lot less memory, but could scale worse if many threads
try to send traffic to the same bond port.  #2 might be a bit faster and
more scalable in terms of performance, but less efficient in memory
consumption, and might be slower to react to slave updates since the hash
maps on all the threads have to be updated.
Neither solution requires any reconfiguration of running PMD threads.

Note that to update a cmap entry, you will likely need to prepare a new
cmap node and use cmap_replace().  Copying the statistics in this case
might be a bit tricky, because you may lose some of the additions made
while PMD threads are still using the old entry.  To avoid that, the
statistics copying should be RCU-postponed.  However, I'm not sure if we
need highly accurate stats there.
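
For illustration, the member-update path under layout #1 could look roughly
like this (again using the hypothetical struct dp_bond; cmap_replace(),
ovsrcu_postpone(), hash_int() and atomic_init() are existing primitives;
for simplicity the new entry starts with zeroed stats instead of an
RCU-postponed copy of the old ones):

    /* Install a new bucket->member mapping for a bond while PMD threads
     * may still be reading the old entry under RCU.  Called with
     * dp->bond_mutex held. */
    static void
    dp_bond_reprogram(struct cmap *bonds, struct dp_bond *old_bond,
                      const odp_port_t members[BOND_BUCKETS])
    {
        struct dp_bond *new_bond = xzalloc(sizeof *new_bond);
        uint32_t hash = hash_int(old_bond->bond_id, 0);
        int i;

        new_bond->bond_id = old_bond->bond_id;
        for (i = 0; i < BOND_BUCKETS; i++) {
            new_bond->buckets[i].member_id = members[i];
            /* Additions made by PMDs still using 'old_bond' are lost here;
             * copying the stats in an RCU callback would avoid that. */
            atomic_init(&new_bond->buckets[i].n_bytes, 0);
        }

        cmap_replace(bonds, &old_bond->node, &new_bond->node, hash);
        ovsrcu_postpone(free, old_bond);  /* Free after a grace period. */
    }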

Any thoughts and suggestions (including alternative solutions) are welcome.

Best regards, Ilya Maximets.