Hi Yipeng and Sameh,

As discussed at the OVS conference, I have just sent my Datapath Flow Cache
patch to the ML:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html
http://patchwork.ozlabs.org/patch/839645/
I would be happy if you could test it in your setup and compare the
performance with your CD implementation. I don't have a test bed at hand
where I could easily test the speedup for many visited subtables.

Thanks,
Jan

> -----Original Message-----
> From: Yipeng Wang [mailto:yipeng1.w...@intel.com]
> Sent: Wednesday, 01 November, 2017 00:40
> To: d...@openvswitch.org
> Cc: yipeng1.w...@intel.com; sameh.gobr...@intel.com;
> antonio.fische...@intel.com; db...@vmware.com; Jan Scheurich
> <jan.scheur...@ericsson.com>
> Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
>
> The Datapath Classifier uses tuple space search for flow classification.
> The rules are arranged into a set of tuples/subtables (each with a
> distinct mask). Each subtable is implemented as a hash table, and lookup
> is done with flow keys formed by selecting the bits from the packet
> header according to each subtable's mask. Tuple space search probes the
> subtables sequentially until a match is found. With a large number of
> subtables, this sequential search can consume a lot of CPU cycles. On a
> testbench with a uniform traffic pattern equally distributed across 20
> subtables, we measured that up to 65% of the total execution time is
> spent in the megaflow cache lookup.
>
> This patch presents the idea of a two-layer hierarchical lookup, where a
> low-overhead first level of indirection is accessed first; we call this
> level the cuckoo distributor (CD). If a flow key has been inserted in the
> flow table, the first level indicates with high probability which
> subtable to look into. A lookup is then performed on the second level
> (the target subtable) to retrieve the result. If the key does not find a
> match there, we revert to the sequential search of the subtables. The
> patch is partially inspired by earlier concepts proposed in "simTable"[1]
> and "Cuckoo Filter"[2], and by DPDK's Cuckoo Hash implementation.
>
> This patch can improve on the existing Subtable Ranking when the traffic
> has high entropy. Subtable Ranking helps minimize the number of traversed
> subtables when most of the traffic hits the same subtable. However, with
> high-entropy traffic, such as traffic coming from a physical port,
> multiple subtables can be hit with similar frequency, and the average
> number of subtable lookups per hit becomes much greater than 1. In
> addition, CD adaptively turns itself off when it finds that the traffic
> mostly hits one subtable, so CD does not add overhead when Subtable
> Ranking already works well.
>
> Scheme:
> CD sits in front of the subtables. On a CD hit, a packet is directed to
> the corresponding subtable instead of searching each subtable
> sequentially.
>
>        -------
>        | CD  |
>        -------
>            \
>             \
>  ------- -------     -------
>  |sub  | |sub  | ... |sub  |
>  |table| |table|     |table|
>  ------- -------     -------
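As a rough illustration of the two-level scheme described above, the sketch
below is only one possible reading of the cover letter, not the patch
itself: all names (cd_entry, cd_assisted_lookup, subtable_lookup), the
16-bit fingerprint and the insert-on-miss policy are invented for the
example.

/* Illustrative sketch only -- these are not the actual symbols in the patch. */
#include <stdint.h>
#include <stddef.h>

struct netdev_flow_key;     /* stands in for the real miniflow-based key */
struct dpcls_subtable;      /* one mask-specific hash table              */
struct dpcls_rule;

/* Assumed helper: hash lookup inside a single subtable. */
struct dpcls_rule *subtable_lookup(struct dpcls_subtable *,
                                   const struct netdev_flow_key *,
                                   uint32_t hash);

struct cd_entry {
    uint16_t sig;           /* short fingerprint derived from the flow hash */
    uint16_t subtable_idx;  /* subtable that most likely holds the rule     */
};

struct dpcls_rule *
cd_assisted_lookup(struct cd_entry cd[], uint32_t cd_mask, /* size - 1 */
                   struct dpcls_subtable *subtables[], uint16_t n_subtables,
                   const struct netdev_flow_key *key, uint32_t hash)
{
    struct cd_entry *e = &cd[hash & cd_mask];
    uint16_t sig = hash >> 16;

    if (e->sig == sig && e->subtable_idx < n_subtables) {
        /* CD hit: probe only the indicated subtable. */
        struct dpcls_rule *rule =
            subtable_lookup(subtables[e->subtable_idx], key, hash);
        if (rule) {
            return rule;
        }
        /* Fingerprint collision or stale entry: fall through. */
    }

    /* CD miss: revert to the sequential tuple space search and remember
     * the winning subtable for subsequent packets of this flow. */
    for (uint16_t i = 0; i < n_subtables; i++) {
        struct dpcls_rule *rule = subtable_lookup(subtables[i], key, hash);
        if (rule) {
            e->sig = sig;
            e->subtable_idx = i;
            return rule;
        }
    }
    return NULL;            /* would trigger an upcall in the real code */
}

A CD hit thus replaces the walk over many subtables with one probe of a
small first-level table plus a single hash lookup in the right subtable.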
> Evaluation:
> ----------
> We create a set of rules with various src IPs and feed traffic containing
> various numbers of flows with varying src and dst IPs. All flows hit
> 10/20/30 rules, creating 10/20/30 subtables. The rule/traffic setup is
> explained in detail below.
>
> The tables below show the preliminary continuous-test results (full line
> rate) collected with a uni-directional phy-to-phy setup. OvS runs with
> 1 PMD. We use Spirent as the hardware traffic generator.
>
> Before v2 rebase:
> ----
> AVX2 data:
>
> 20k flows:
> no. subtables:      10        20        30
> cd-ovs         4267332   3478251   3126763
> orig-ovs       3260883   2174551   1689981
> speedup          1.31x     1.60x     1.85x
>
> 100k flows:
> no. subtables:      10        20        30
> cd-ovs         4015783   3276100   2970645
> orig-ovs       2692882   1711955   1302321
> speedup          1.49x     1.91x     2.28x
>
> 1M flows:
> no. subtables:      10        20        30
> cd-ovs         3895961   3170530   2968555
> orig-ovs       2683455   1646227   1240501
> speedup          1.45x     1.92x     2.39x
>
> Scalar data:
>
> 1M flows:
> no. subtables:      10        20        30
> cd-ovs         3658328   3028111   2863329
> orig-ovs       2683455   1646227   1240501
> speedup          1.36x     1.84x     2.31x
>
> After v2 rebase:
> ----
> After rebasing for v2 we re-tested the 1M flows / 20 subtables case; the
> results still hold.
>
> 1M flows:
> no. subtables:      20
> cd-ovs         3066483
> orig-ovs       1588049
> speedup          1.93x
>
> Test rules/traffic setup:
> ----
> To set up a test case with 20 subtables, we use a rule set like the
> following:
> tcp,nw_src=1.0.0.0/8, actions=output:1
> udp,nw_src=2.0.0.0/9, actions=output:1
> udp,nw_src=3.0.0.0/10,actions=output:1
> udp,nw_src=4.0.0.0/11,actions=output:1
> ...
> udp,nw_src=18.0.0.0/25,actions=output:1
> udp,nw_src=19.0.0.0/26,actions=output:1
> udp,nw_src=20.0.0.0/27,actions=output:1
>
> On the traffic generator we then generate the corresponding traffic with
> src_ip varying from 1.0.0.0 to 20.0.0.0. For each src_ip we vary dst_ip
> over 50000 different values. This effectively generates 1M different
> flows hitting the 20 rules above, and because of the different wildcard
> prefix lengths on nw_src, the 20 rules fall into 20 subtables. We use
> 64-byte packets across all tests.
>
> How to check if CD works or not for your use case:
> ----
> CD cannot improve throughput in all use cases. It targets use cases where
> multiple subtables exist and the top-ranked subtable is not hit by the
> vast majority of the traffic.
>
> One can use the $OVS_DIR/utilities/ovs-appctl dpif-netdev/pmd-stats-show
> command to check the CD statistics (hits/misses).
> Another statistic shown there is "avg. subtable lookups per hit".
> In our test case, the original OvS shows an average of about 10 subtable
> lookups per hit, because there are 20 subtables in total and, on average,
> a hit happens after iterating over half of them. Iterating over 10
> subtables per lookup is very expensive.
>
> With CD this value is close to 1, which means that on average only one
> subtable needs to be searched to hit the rule, which removes a lot of
> overhead.
>
> Other statistics worth watching are "megaflow hits" and "emc hits".
> If most packets hit the EMC, CD does not improve throughput much, since
> CD optimizes the megaflow search rather than the EMC lookup. If your test
> case has fewer than 8k flows, all of them may be EMC hits.
>
> Note that CD is adaptively turned on/off according to the number of
> subtables and how they are iterated. If it finds there is not much
> benefit, CD turns itself off automatically.
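The cover letter does not spell out the exact on/off policy, so the
following is only a guessed illustration of how such a periodic check
could look; the structure, the counter names and the 1.2 threshold are
all invented for the example and are not taken from the patch.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-PMD counters sampled over one measurement period. */
struct cd_adapt_stats {
    uint64_t megaflow_hits;     /* packets resolved in some subtable      */
    uint64_t subtable_lookups;  /* subtables probed to achieve those hits */
    uint16_t n_subtables;       /* subtables currently instantiated       */
    bool     cd_enabled;        /* current CD state                       */
};

/* Called periodically, e.g. every few thousand packet batches. */
static void
cd_adapt(struct cd_adapt_stats *s)
{
    double avg_lookups = s->megaflow_hits
                         ? (double) s->subtable_lookups / s->megaflow_hits
                         : 0.0;

    /* If Subtable Ranking already finds the rule in (nearly) the first
     * subtable, the CD probe and its maintenance would be pure overhead. */
    s->cd_enabled = s->n_subtables > 1 && avg_lookups > 1.2;

    /* Start a fresh measurement period. */
    s->megaflow_hits = 0;
    s->subtable_lookups = 0;
}

Whatever the real policy is, the "avg. subtable lookups per hit" counter
mentioned above is the signal to watch to see whether CD pays off in a
given setup.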
>
> References:
> ----------
> [1] H. Lee and B. Lee, "Approaches for improving tuple space search-based
>     table lookup," ICTC '15
> [2] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher, "Cuckoo
>     Filter: Practically Better Than Bloom," CoNEXT '14
>
> The previous RFC on the mailing list is at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html
>
> v2: Rebase to master head.
>     Add more testing details to the cover letter.
>     Change commit messages.
>     Minor style changes to the code.
>     Fix build errors that occur without the AVX and DPDK libraries.
>
> Yipeng Wang (5):
>   dpif-netdev: Basic CD feature with scalar lookup.
>   dpif-netdev: Add AVX2 implementation for CD lookup.
>   dpif-netdev: Add CD statistics
>   dpif-netdev: Add adaptive CD mechanism
>   unit-test: Add a delay for CD initialization.
>
>  lib/dpif-netdev.c     | 567 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  tests/ofproto-dpif.at |   3 +
>  2 files changed, 560 insertions(+), 10 deletions(-)
>
> --
> 2.7.4

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev