Hi Antonio,
> -----Original Message----- > From: Fischetti, Antonio > Sent: Friday, June 23, 2017 10:49 PM > To: O Mahony, Billy <billy.o.mah...@intel.com>; d...@openvswitch.org > Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert > for recirculated packets. > > Thanks a lot Billy, really appreciate your feedback. > My replies inline. > > /Antonio > > > -----Original Message----- > > From: O Mahony, Billy > > Sent: Friday, June 23, 2017 6:39 PM > > To: Fischetti, Antonio <antonio.fische...@intel.com>; > > d...@openvswitch.org > > Subject: RE: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC > > lookup/insert for recirculated packets. > > > > Hi Antonio, > > > > This is a really interesting patch. Comments inline below. > > > > Thanks, > > /Billy. > > > > > -----Original Message----- > > > From: ovs-dev-boun...@openvswitch.org [mailto:ovs-dev- > > > boun...@openvswitch.org] On Behalf Of antonio.fische...@intel.com > > > Sent: Monday, June 19, 2017 11:12 AM > > > To: d...@openvswitch.org > > > Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC > > > lookup/insert > > for > > > recirculated packets. > > > > > > From: Antonio Fischetti <antonio.fische...@intel.com> > > > > > > When OVS is configured as a firewall, with thousands of active > > concurrent > > > connections, the EMC gets quicly saturated and may come under heavy > > > thrashing for the reason that original and recirculated packets keep > > overwrite > > > existing active EMC entries due to its limited size(8k). > > > > > > This thrashing causes the EMC to be less efficient than the dcpls in > > terms of > > > lookups and insertions. > > > > > > This patch allows to use the EMC efficiently by allowing only the > > 'original' > > > packets to hit EMC. All recirculated packets are sent to classifier > > directly. > > > An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC > > > occupancy is set to trigger this logic. By doing so when EMC > > > utilization exceeds EMC_FULL_THRESHOLD. > > > - EMC Insertions are allowed just for original packets. EMC insertion > > > and look up is skipped for recirculated packets. > > > - Recirculated packets are sent to classifier. > > > > > > This patch depends on the previous one in this series. It's based on > > patch > > > "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" > > at: > > > https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.h > > > tml > > > > > > Signed-off-by: Antonio Fischetti <antonio.fische...@intel.com> > > > Signed-off-by: Bhanuprakash Bodireddy > > > <bhanuprakash.bodire...@intel.com> > > > Co-authored-by: Bhanuprakash Bodireddy > > > <bhanuprakash.bodire...@intel.com> > > > --- > > > In our Connection Tracker testbench set up with > > > > > > table=0, priority=1 actions=drop > > > table=0, priority=10,arp actions=NORMAL table=0, > > priority=100,ct_state=- > > > trk,ip actions=ct(table=1) table=1, ct_state=+new+trk,ip,in_port=1 > > > actions=ct(commit),output:2 table=1, ct_state=+est+trk,ip,in_port=1 > > > actions=output:2 table=1, ct_state=+new+trk,ip,in_port=2 > > > actions=drop table=1, ct_state=+est+trk,ip,in_port=2 > > > actions=output:1 > > > > > > we saw the following performance improvement. > > > > > > Measured packet Rx rate (regardless of packet loss). Bidirectional > > > test > > with > > > 64B UDP packets. > > > Each row is a test with a different number of traffic streams. The > > traffic > > > generator is set so that each stream establishes one UDP connection. > > > Mpps columns reports the Rx rates on the 2 sides. > > > > > > Traffic | Orig | Orig | +changes | +changes > > > Streams | [Mpps] | [EMC entries] | [Mpps] | [EMC entries] > > > ---------+------------+---------------+------------+--------------- > > > 10 | 3.4, 3.4 | 20 | 3.4, 3.4 | 20 > > > 100 | 2.6, 2.7 | 200 | 2.6, 2.7 | 201 > > > 1,000 | 2.4, 2.4 | 2009 | 2.4, 2.4 | 1994 > > > 2,000 | 2.2, 2.2 | 3903 | 2.2, 2.2 | 3900 > > > 3,000 | 2.1, 2.1 | 5473 | 2.2, 2.2 | 4798 > > > 4,000 | 2.0, 2.0 | 6478 | 2.2, 2.2 | 5663 > > > 10,000 | 1.8, 1.9 | 8070 | 2.0, 2.0 | 7347 > > > 100,000 | 1.7, 1.7 | 8192 | 1.8, 1.8 | 8192 > > > > > > > [[BO'M]] > > A few questions on the test: > > Are all the pkts rxd being recirculated? > > [AF] Yes, I sent UDP packets with the firewall rules above. Every packets goes > through the CT module, so after that it is recirculated. > > > Are there any flows present where the pkts do not require recirculation? > > [AF] No. The flow > table=0,priority=100,ct_state=-trk,ip actions=ct(table=1) implies that any > received packet goes through CT and then it is recirculated. > > > Was the rxd rss hash calculation offloaded to the NIC? > > [AF] Yes. > > > For the cases with larger numbers of flows (10K , 100K) did you > > investigate the results when the EMC is simply switched off? > > [AF] No, I just focused on this criteria to check if I could really get some > performance improvement. It's likely that with - say 100k or more flows - it's > better to switch off EMC? [[BO'M]] Quite possibly but I think it would depend on the distribution of the packets across flows. If each packet round-robins through each of the 100K flows (maybe a typical traffic generator behavior) then that is the worst possible situation for the EMC - nearly all packets would miss the EMC and it'd be better off switched off. But in a more realistic situation the probability that the next packet belongs to any given flow is random or even better maybe a normal distribution, then the EMC would provide many more hits. > > > > > Say we have 3000 flows (the lowest figure at which we see a positive > > effect) that means 6000 flows are contending for places in the emc. > > Is the effect we see here to do with disabling recirculated packets in > > particular or just reducing contention on the emc in general. > > [AF] I guess the benefit comes from reducing the continuous recursive > overwrites in EMC. > With this approach > => less overhead for insertions > => it's more likely that original packets do find a matching EMC entry. > Otherwise whatever we insert into EMC it's likely to be overwritten before a > new packet of that same flow comes in. When we insert entries for > recirculated packets for sure we're deleting useful entries referring to some > active megaflows and this will in turn happen again for the next packets > recursively. [[BO'M]] Okay. I think we are agreeing that when the EMC is overloaded then it is wasteful to shove even more entries in there! > > > I know that > > the recirculated pkt hashes require software hashing albeit a small > > amount so they do make a good category of packet to drop from the emc > > when contention is severe. > > [AF] Agree, if we skip EMC for recirculated packets 1. we save some cpu > cycles for hash computation 2. we mitigate the EMC thrashing, and this is the > main purpose of this patch. > > > > > Once there are too many entries contending for any cache it's going > > to become a liability as the lookup_cost + (miss_rate * cost_of_miss) > > grows to be greater that the cost_of_miss. And in that case any scheme > > where a very cheap decision can be made to not insert a certain > > category of packets and to also not check the emc for that same > > category will reduce the cache miss rate. > > [AF] agree, thanks for explaining this with clear words, I tried to do the > same > but I was not so successful :-) > > > > > But I'm not sure that is_recirculated is the best categorization on > > which to make that decision. Mainly because it is not controllable. In > > this test case 50% pkts were recirculated so by ruling these packets > > out of eligibility for the EMC you get a really large reduction in EMC > > contention. However you can never know ahead of time if any packets > > will be recirc'd. You may have a situation where the EMC is totally > > swamped with 200K flows as above but none of them are re-circ'd > > packets so ruling out those packets will give no benefit. > > [AF] agree, this approach is limited to conntrack, or any tunneling usecases. > [[BO'M]] I'll send an email to ovs-dev to elaborate further on what could be a more general scheme of EMC load-shedding. If people think it's useful we can begin to refine it further on the ML. > > > > > > > lib/dpif-netdev.c | 46 > ++++++++++++++++++++++++++++++++++++++++-- > > > ---- > > > 1 file changed, 40 insertions(+), 6 deletions(-) > > > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index > > fd2ed52..64a3cd4 > > > 100644 > > > --- a/lib/dpif-netdev.c > > > +++ b/lib/dpif-netdev.c > > > @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet > *pkt, > > > packet_batch_per_flow_update(batch, pkt, mf); } > > > > > > +#define EMC_FULL_THRESHOLD 0x0000F000 > > > + > > [[BO'M]] > > > > Use of FULL in #def is a little misleading as it is really indicating > > the threshold after which recirc pkts do not get inserted to the EMC. > > > > EMC_RECIRCT_NO_INSERT_THRESHOLD maybe? > > [AF] you're right, I'll take note of that. Or something like > EMC_OCCUPANCY_THRESHOLD? Hmm,.. > > > > > As this #def is used with an & operator not a > operator in the patch > > it needs to be clear that number has the restrictions of > > a) having exactly one bit set so is limited to power of two values. > > b) needs to not greater than 1u << (EM_FLOW_HASH_SHIFT-1) > > > > Would suggest that value is based directly off EM_FLOW_HASH_SHIFT > > > > e.g > > #def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1)) > #def > > EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2)) > #def > > EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3)) > > > > #def EMC_RECIRCT_NO_INSERT_THRESHOLD > EMC_RECIRCT_NO_INSERT_HALF > > > > [AF] Thanks, that would help also on readability. I used the & operator just > for performance reasons, I'm not quite sure if the > operator would require > some more cpu cycles, I didn't check that. > > > > > > /* Try to process all ('cnt') the 'packets' using only the exact > > > match > > cache > > > * 'pmd->flow_cache'. If a flow is not found for a packet > > > 'packets[i]', > > the > > > * miniflow is copied into 'keys' and the packet pointer is moved > > > at > > the @@ - > > > 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread > *pmd, > > > pkt_metadata_prefetch_init(&packets[i+1]->md); > > > } > > > > > > + /* > > > + * EMC lookup is skipped when one or both of the following > > > + * two cases occurs: > > > + * > > > + * - EMC is disabled. This is detected from cur_min. > > > + * > > > + * - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the > > > + * packet to be classified is being recirculated. When > > this > > > + * happens also EMC insertions are skipped for recirculated > > > + * packets. So that EMC is used just to store entries > > which > > > + * are hit from the 'original' packets. This way the EMC > > > + * thrashing is mitigated with a benefit on performance. > > > + */ > > > if (!md_is_valid) { > > > pkt_metadata_init(&packet->md, port_no); > > > miniflow_extract(packet, &key->mf); @@ -4603,11 > > > +4618,18 @@ emc_processing(struct dp_netdev_pmd_thread *pmd, > > > } else { > > > /* Recirculated packets. */ > > > miniflow_extract(packet, &key->mf); > > > - if (OVS_LIKELY(cur_min)) { > > > - key->hash = dpif_netdev_packet_get_rss_hash(packet, > > &key->mf); > > > - flow = emc_lookup(flow_cache, key); > > > - } else { > > > + if (flow_cache->n_entries & EMC_FULL_THRESHOLD) { > > > + /* EMC occupancy is over the threshold. We skip EMC > > > + * lookup for recirculated packets. */ > > > flow = NULL; > > > + } else { > > > + if (OVS_LIKELY(cur_min)) { > > > + key->hash = dpif_netdev_packet_get_rss_hash(packet, > > > + &key->mf); > > > + flow = emc_lookup(flow_cache, key); > > > + } else { > > > + flow = NULL; > > > + } > > > } > > > } > > > key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13 > > > @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, > > > add_actions->size); > > > } > > > ovs_mutex_unlock(&pmd->flow_mutex); > > > - emc_probabilistic_insert(pmd, key, netdev_flow); > > > + /* When EMC occupancy goes over a threshold we avoid > > > + inserting > > new > > > + * entries for recirculated packets. */ > > > + if (!packet->md.recirc_id) { > > > + emc_probabilistic_insert(pmd, key, netdev_flow); > > > + } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) { > > > + emc_probabilistic_insert(pmd, key, netdev_flow); > > > + } > > > } > > > } > > > > > > @@ -4788,7 +4816,13 @@ fast_path_processing(struct > > > dp_netdev_pmd_thread *pmd, > > > > > > flow = dp_netdev_flow_cast(rules[i]); > > > > > > - emc_probabilistic_insert(pmd, &keys[i], flow); > > > + /* When EMC occupancy goes over a threshold we avoid > > > + inserting > > new > > > + * entries for recirculated packets. */ > > > + if (!packet->md.recirc_id) { > > > + emc_probabilistic_insert(pmd, &keys[i], flow); > > > + } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) { > > > + emc_probabilistic_insert(pmd, &keys[i], flow); > > > + } > > > dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, > > > n_batches); > > > } > > > > > > -- > > > 2.4.11 > > > > > > _______________________________________________ > > > dev mailing list > > > d...@openvswitch.org > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev