Hi Antonio,

> -----Original Message-----
> From: Fischetti, Antonio
> Sent: Friday, June 23, 2017 10:49 PM
> To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org
> Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert
> for recirculated packets.
> 
> Thanks a lot Billy, really appreciate your feedback.
> My replies inline.
> 
> /Antonio
> 
> > -----Original Message-----
> > From: O Mahony, Billy
> > Sent: Friday, June 23, 2017 6:39 PM
> > To: Fischetti, Antonio <antonio.fische...@intel.com>;
> > d...@openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC
> > lookup/insert for recirculated packets.
> >
> > Hi Antonio,
> >
> > This is a really interesting patch. Comments inline below.
> >
> > Thanks,
> > /Billy.
> >
> > > -----Original Message-----
> > > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev-
> > > boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com
> > > Sent: Monday, June 19, 2017 11:12 AM
> > > To: d...@openvswitch.org
> > > Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC
> > > lookup/insert
> > for
> > > recirculated packets.
> > >
> > > From: Antonio Fischetti <antonio.fische...@intel.com>
> > >
> > > When OVS is configured as a firewall, with thousands of active
> > concurrent
> > > connections, the EMC gets quicly saturated and may come under heavy
> > > thrashing for the reason that original and recirculated packets keep
> > overwrite
> > > existing active EMC entries due to its limited size(8k).
> > >
> > > This thrashing causes the EMC to be less efficient than the dcpls in
> > terms of
> > > lookups and insertions.
> > >
> > > This patch allows to use the EMC efficiently by allowing only the
> > 'original'
> > > packets to hit EMC. All recirculated packets are sent to classifier
> > directly.
> > > An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC
> > > occupancy is set to trigger this logic. By doing so when EMC
> > > utilization exceeds EMC_FULL_THRESHOLD.
> > >  - EMC Insertions are allowed just for original packets. EMC insertion
> > >    and look up is skipped for recirculated packets.
> > >  - Recirculated packets are sent to classifier.
> > >
> > > This patch depends on the previous one in this series. It's based on
> > patch
> > > "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show"
> > at:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.h
> > > tml
> > >
> > > Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com>
> > > Signed-off-by: Bhanuprakash Bodireddy
> > > <bhanuprakash.bodire...@intel.com>
> > > Co-authored-by: Bhanuprakash Bodireddy
> > > <bhanuprakash.bodire...@intel.com>
> > > ---
> > > In our Connection Tracker testbench set up with
> > >
> > >  table=0, priority=1 actions=drop
> > >  table=0, priority=10,arp actions=NORMAL  table=0,
> > priority=100,ct_state=-
> > > trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> > > actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> > > actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2
> > > actions=drop table=1, ct_state=+est+trk,ip,in_port=2
> > > actions=output:1
> > >
> > > we saw the following performance improvement.
> > >
> > > Measured packet Rx rate (regardless of packet loss). Bidirectional
> > > test
> > with
> > > 64B UDP packets.
> > > Each row is a test with a different number of traffic streams. The
> > traffic
> > > generator is set so that each stream establishes one UDP connection.
> > > Mpps columns reports the Rx rates on the 2 sides.
> > >
> > >  Traffic |    Orig    |     Orig      |  +changes  |   +changes
> > >  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> > > ---------+------------+---------------+------------+---------------
> > >      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
> > >     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
> > >   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
> > >   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
> > >   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
> > >   4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
> > >  10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
> > > 100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192
> > >
> >
> > [[BO'M]]
> > A few questions on the test:
> > Are all the pkts rxd being recirculated?
> 
> [AF] Yes, I sent UDP packets with the firewall rules above. Every packets goes
> through the CT module, so after that it is recirculated.
> 
> > Are there any flows present where the pkts do not require recirculation?
> 
> [AF] No. The flow
> table=0,priority=100,ct_state=-trk,ip actions=ct(table=1) implies that any
> received packet goes through CT and then it is recirculated.
> 
> > Was the rxd rss hash calculation offloaded to the NIC?
> 
> [AF] Yes.
> 
> > For the cases with larger numbers of flows (10K , 100K) did you
> > investigate the results when the EMC is simply switched off?
> 
> [AF] No, I just focused on this criteria to check if I could really get some
> performance improvement. It's likely that with - say 100k or more flows - it's
> better to switch off EMC?
[[BO'M]] Quite possibly but I think it would depend on the distribution of the 
packets across flows. If each packet round-robins through each of the 100K 
flows (maybe a typical traffic generator behavior)  then that is the worst 
possible situation for the EMC - nearly all packets would miss the EMC and it'd 
be better off switched off. But in a more realistic situation the probability 
that the next packet belongs to any given flow is random or even better maybe a 
normal distribution, then the EMC would provide many more hits. 
> 
> >
> > Say we have 3000 flows (the lowest figure at which we see a positive
> > effect) that means 6000 flows are contending for places in the emc.
> > Is the effect we see here to do with disabling recirculated packets in
> > particular or just reducing contention on the emc in general.
> 
> [AF] I guess the benefit comes from reducing the continuous recursive
> overwrites in EMC.
> With this approach
>   => less overhead for insertions
>   => it's more likely that original packets do find a matching EMC entry.
> Otherwise whatever we insert into EMC it's likely to be overwritten before a
> new packet of that same flow comes in. When we insert entries for
> recirculated packets for sure we're deleting useful entries referring to some
> active megaflows and this will in turn happen again for the next packets
> recursively.
[[BO'M]] Okay. I think we are agreeing that when the EMC is overloaded then it 
is wasteful to shove even more entries in there! 
> 
> 
> I know that
> > the recirculated pkt hashes require software hashing albeit a small
> > amount so they do make a good category of packet to drop from the emc
> > when contention is severe.
> 
> [AF] Agree, if we skip EMC for recirculated packets  1. we save some cpu
> cycles for hash computation  2. we mitigate the EMC thrashing, and this is the
> main purpose of this patch.
> 
> >
> > Once there are too many entries contending  for any cache it's going
> > to become a liability as the lookup_cost + (miss_rate * cost_of_miss)
> > grows to be greater that the cost_of_miss. And in that case any scheme
> > where a very cheap decision can be made to not insert a certain
> > category of packets and to also not check the emc for that same
> > category will reduce the cache miss rate.
> 
> [AF] agree, thanks for explaining this with clear words, I tried to do the 
> same
> but I was not so successful :-)
> 
> >
> > But I'm not sure that is_recirculated is the best categorization on
> > which to make that decision. Mainly because it is not controllable. In
> > this test case 50% pkts were recirculated so by ruling these packets
> > out of eligibility for the EMC you get a really large reduction in EMC
> > contention. However you can never know ahead of time if any packets
> > will be recirc'd. You may have a situation where the EMC is totally
> > swamped with 200K flows as above but none of them are re-circ'd
> > packets so ruling out those packets will give no benefit.
> 
> [AF] agree, this approach is limited to conntrack, or any tunneling usecases.
> 
[[BO'M]] I'll send an email to ovs-dev to elaborate further on what could be a 
more general scheme of EMC load-shedding.  If people think it's useful we can 
begin to refine it further on the ML.
> >
> >
> > >  lib/dpif-netdev.c | 46
> ++++++++++++++++++++++++++++++++++++++++--
> > > ----
> > >  1 file changed, 40 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > fd2ed52..64a3cd4
> > > 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet
> *pkt,
> > >      packet_batch_per_flow_update(batch, pkt, mf);  }
> > >
> > > +#define EMC_FULL_THRESHOLD 0x0000F000
> > > +
> > [[BO'M]]
> >
> > Use of FULL in #def is a little misleading as it is really indicating
> > the threshold after which recirc pkts do not get inserted to the EMC.
> >
> > EMC_RECIRCT_NO_INSERT_THRESHOLD maybe?
> 
> [AF] you're right, I'll take note of that. Or something like
> EMC_OCCUPANCY_THRESHOLD? Hmm,..
> 
> >
> > As this #def is used with an & operator not a > operator in the patch
> > it needs to be clear that number has the restrictions of
> > a) having exactly one bit set so is limited to power of two values.
> > b) needs to not greater  than 1u << (EM_FLOW_HASH_SHIFT-1)
> >
> > Would suggest that value is based directly off EM_FLOW_HASH_SHIFT
> >
> > e.g
> > #def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1))
> #def
> > EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2))
> #def
> > EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3))
> >
> > #def EMC_RECIRCT_NO_INSERT_THRESHOLD
> EMC_RECIRCT_NO_INSERT_HALF
> >
> 
> [AF] Thanks, that would help also on readability. I used the & operator just
> for performance reasons, I'm not quite sure if the > operator would require
> some more cpu cycles, I didn't check that.
> 
> >
> > >  /* Try to process all ('cnt') the 'packets' using only the exact
> > > match
> > cache
> > >   * 'pmd->flow_cache'. If a flow is not found for a packet
> > > 'packets[i]',
> > the
> > >   * miniflow is copied into 'keys' and the packet pointer is moved
> > > at
> > the @@ -
> > > 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread
> *pmd,
> > >              pkt_metadata_prefetch_init(&packets[i+1]->md);
> > >          }
> > >
> > > +        /*
> > > +         * EMC lookup is skipped when one or both of the following
> > > +         * two cases occurs:
> > > +         *
> > > +         *   - EMC is disabled.  This is detected from cur_min.
> > > +         *
> > > +         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> > > +         *     packet to be classified is being recirculated.  When
> > this
> > > +         *     happens also EMC insertions are skipped for recirculated
> > > +         *     packets.  So that EMC is used just to store entries
> > which
> > > +         *     are hit from the 'original' packets.  This way the EMC
> > > +         *     thrashing is mitigated with a benefit on performance.
> > > +         */
> > >          if (!md_is_valid) {
> > >              pkt_metadata_init(&packet->md, port_no);
> > >              miniflow_extract(packet, &key->mf); @@ -4603,11
> > > +4618,18 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> > >          } else {
> > >              /* Recirculated packets. */
> > >              miniflow_extract(packet, &key->mf);
> > > -            if (OVS_LIKELY(cur_min)) {
> > > -                key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > &key->mf);
> > > -                flow = emc_lookup(flow_cache, key);
> > > -            } else {
> > > +            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
> > > +                /* EMC occupancy is over the threshold.  We skip EMC
> > > +                 * lookup for recirculated packets. */
> > >                  flow = NULL;
> > > +            } else {
> > > +                if (OVS_LIKELY(cur_min)) {
> > > +                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
> > > +                                    &key->mf);
> > > +                    flow = emc_lookup(flow_cache, key);
> > > +                } else {
> > > +                    flow = NULL;
> > > +                }
> > >              }
> > >          }
> > >          key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13
> > > @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
> > >                                               add_actions->size);
> > >          }
> > >          ovs_mutex_unlock(&pmd->flow_mutex);
> > > -        emc_probabilistic_insert(pmd, key, netdev_flow);
> > > +        /* When EMC occupancy goes over a threshold we avoid
> > > + inserting
> > new
> > > +         * entries for recirculated packets. */
> > > +        if (!packet->md.recirc_id) {
> > > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > > +            emc_probabilistic_insert(pmd, key, netdev_flow);
> > > +        }
> > >      }
> > >  }
> > >
> > > @@ -4788,7 +4816,13 @@ fast_path_processing(struct
> > > dp_netdev_pmd_thread *pmd,
> > >
> > >          flow = dp_netdev_flow_cast(rules[i]);
> > >
> > > -        emc_probabilistic_insert(pmd, &keys[i], flow);
> > > +        /* When EMC occupancy goes over a threshold we avoid
> > > + inserting
> > new
> > > +         * entries for recirculated packets. */
> > > +        if (!packet->md.recirc_id) {
> > > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > > +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> > > +            emc_probabilistic_insert(pmd, &keys[i], flow);
> > > +        }
> > >          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> > > n_batches);
> > >      }
> > >
> > > --
> > > 2.4.11
> > >
> > > _______________________________________________
> > > dev mailing list
> > > d...@openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to