Hi Mark,

Thank you so much for sharing this data. Please see my comments inline.
On Fri, May 18, 2018 at 1:31 PM, Mark Michelson <mmich...@redhat.com> wrote:
> Hi Han, I finally did some tests and looked at the CPU usage between
> master and the ip7 branch.
>
> On the machines running ovn-controller:
> Master branch: Climbs to around 100% over the course of 3 minutes, then
> oscillates close to 100% for about 10 minutes, and then is pegged to 100%
> for the rest of the test. Total test time was about 23 minutes.
> ip7 branch: oscillates between 10 and 25% for the first 10 minutes of the
> test, then hovers around 10% for the rest. Total test time was about 19
> minutes.
>
This is aligned with my observation of a ~90% improvement in CPU cost. For
throughput/total time, the improvement ratio is different (in my test case
the execution time was reduced by ~50%), but I think that can be explained:
total execution time does not accurately reflect processing efficiency,
because when the CPU is at 100%, ovn-controller processing is slowed down,
which may simply result in fewer iterations completing during the test. The
stop-watch profiling mechanism you implemented (also rebased into the
incremental-processing branch) should tell the real story. The real impact
of a pegged CPU is longer latency for handling a change in the control
plane, so I also use latency to evaluate the improvement. The way I test
latency is with ovn-nbctl --wait=hv, together with the nb_cfg improvement
(https://patchwork.ozlabs.org/patch/899608/).

> When I switched over to tests that have ACLs:
> Master branch: Behaves about the same as the master branch when no ACLs
> are used. Total test time was about 28 minutes.
> ip7 branch: CPU usage hovered around 30% for the entirety of the test,
> hitting spikes around 50% a couple of times. Total test time was about 25
> minutes.
>
> Since I had not done it yet, I also ran perf while running the incremental
> branch with ACLs. I am attaching the flame graph here.
> The gist is that, much like the master branch, the majority of CPU time
> is spent processing logical flows.
>
> Seeing the drop in CPU usage between the master branch and the ip7 branch
> makes me think it is worth investigating other areas that may be the
> bottleneck. I monitored memory, disk usage, and network usage on the
> machines, but I didn't see anything that seemed obvious as being the
> cause for delay.
>
The CPU drop between master and ip7 when testing with ACLs is, to my
understanding, most likely because incremental processing avoids
recomputing flows when irrelevant input arrives, such as pinctrl/ofctrl
messages (e.g. probe/echo), while on master any of these inputs would
trigger a recompute.

> CPU-wise, I think the biggest improvements that can be made to the
> incremental processing branch are:
> * Adding a change handler for the Address_Set table.
> * ofctrl_put() improvements we have discussed.
>
> I think this will have noticeable improvements in our test times.
> However, based on how much the CPU usage dropped just from switching to
> the incremental processing branch, I think there are likely some other
> bottlenecks in our tests that would be more impactful to remove. We
> already know that "ovn_network.bind_port" and "ovn_network.wait_port_up"
> in ovn-scale-test terminology are the operations in our test iterations
> that take the longest. If we can break those down into smaller pieces, we
> can potentially zero in on what to target next.
>
I am not sure whether there are any other *big* bottlenecks, but the
address-set/port-group and ofctrl_put() improvements are surely needed :)
The latest patch I provided is from my ip9 branch, which was rebased on
master this week, with some code refactoring. Feel free to try it, but
don't expect any performance difference.
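As a side note, the --wait=hv latency measurement described above can be
sketched roughly as follows (an illustrative Python wrapper, not part of
any test harness; the ovn-nbctl invocation in the docstring is an assumed
example):

```python
import subprocess
import time

def measure_latency(cmd):
    """Return the wall-clock seconds a CLI command takes to complete.

    With cmd = ["ovn-nbctl", "--wait=hv", "lsp-add", "sw0", "p0"] (an
    assumed example invocation, relying on the nb_cfg improvement for
    accuracy), the command returns only once all hypervisors have caught
    up with the change, so the elapsed time approximates the end-to-end
    control-plane latency for that change.
    """
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start
```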
Thanks,
Han

> On 05/11/2018 12:50 PM, Han Zhou wrote:
>>
>> On Fri, May 11, 2018 at 7:58 AM, Mark Michelson <mmich...@redhat.com> wrote:
>>
>> On 05/10/2018 08:33 PM, Han Zhou wrote:
>>
>> On Wed, May 9, 2018 at 12:21 PM, Mark Michelson <mmich...@redhat.com> wrote:
>>
>> On 05/07/2018 08:29 AM, Mark Michelson wrote:
>>
>> On 05/04/2018 04:15 PM, Han Zhou wrote:
>>
>> On Thu, May 3, 2018 at 10:49 AM, Mark Michelson <mmich...@redhat.com> wrote:
>>
>> On 05/03/2018 12:58 PM, Mark Michelson wrote:
>>
>> Hi Han,
>>
>> (cc'ed ovs-discuss)
>>
>> I have some test results here for your incremental branch[0].
>>
>> Browbeat was the test orchestrator for the test, and it uses
>> ovn-scale-test[1] to configure the test parameters and run the test.
>>
>> The test uses one central node on which ovn-northd runs. There are
>> three farm nodes on which sandboxes are run for ovn-controller. Each
>> farm node runs 24 sandboxes, for a total of 72 sandboxes (and thus 72
>> ovn-controller processes).
>>
>> The test uses the datapath context[2] to set up 72 logical switches and
>> one logical router in advance. Then during each test iteration, a
>> logical switch port is added to one of the logical switches and bound
>> on one of the sandboxes. The next iteration does not start until the
>> previous iteration is 100% complete (i.e. we see that the logical
>> switch port is "up" in the northbound db). The total number of logical
>> switch ports added during the tests is 3312.
>>
>> During the test, I ran `perf record` on one of the ovn-controller
>> processes and then created a flame graph[3] from the results. I have
>> attached the flame graph to this e-mail. I think this can give us a
>> good jumping-off point for determining more optimizations to make to
>> ovn-controller.
>>
>> [0] https://github.com/hzhou8/ovs/tree/ip7
>> [1] https://github.com/openvswitch/ovn-scale-test
>> [2] https://github.com/openvswitch/ovn-scale-test/pull/165
>> [3] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs
>>
>> From the IRC meeting, it was requested to see a flame graph of
>> performance on OVS master. I am attaching that to this e-mail.
>>
>> One difference in this test run is that the number of switch ports was
>> fewer (I'm not 100% sure of the exact number), so the number of samples
>> in perf record is less than in the flame graph I previously sent.
>>
>> The vast majority of the time is spent in lflow_run(). Based on this
>> flame graph, our initial take on the matter was that we could get
>> improved performance by reducing the number of logical flows to
>> process. The incremental branch seemed like a good testing target to
>> that end.
>>
>> Thanks Mark for sharing the results!
>> It seems you have sent the wrong attachment perf-master.svg in your
>> second email, which is still the same one as in the first email. Would
>> you mind sending the right one? Also, please share the total CPU cost
>> for incremental vs. master when you have the data.
>>
>> From your text description, it is improved as expected, since the
>> bottleneck moved from lflow_run() to ofctrl_put(). The new bottleneck
>> in ofctrl_put() is a good finding, and I think I have some ideas to
>> improve that, too. Basically, when we are able to do incremental
>> computing, we don't need to go through and compare the installed vs.
>> desired flows every time; instead we can maintain a list of changes
>> made in the iteration and then send them to OVS. This would eliminate a
>> lot of the ovn_flow_lookup() calls which are currently shown as the
>> hotspot in ofctrl_put(). I will try this as soon as I fix some corner
>> cases and get confident about the correctness.
>>
>> Thanks,
>> Han
>>
>> Apologies, Han. Here is the proper file.
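[Editor's note: to make the two ofctrl_put() strategies discussed above
concrete, here is a toy sketch in Python. It is not the actual OVS code
(the real tables are C hmaps keyed on match/priority); all names are
invented for illustration.]

```python
def reconcile(installed, desired):
    """The "classic" full comparison: walk both tables and emit the
    flow_mods that turn `installed` into `desired`. This costs lookups
    proportional to the table sizes every iteration, even for one small
    change."""
    mods = []
    for key, actions in desired.items():
        if key not in installed:
            mods.append(("add", key, actions))
        elif installed[key] != actions:
            mods.append(("mod", key, actions))
    for key in installed:
        if key not in desired:
            mods.append(("del", key, None))
    return mods

class TrackedFlows:
    """Incremental alternative: record each change as it is made, so the
    flush step just replays the pending list with no table comparison."""
    def __init__(self):
        self.flows = {}    # (match, priority) -> actions
        self.pending = []  # per-iteration change list

    def add(self, key, actions):
        self.flows[key] = actions
        self.pending.append(("add", key, actions))

    def delete(self, key):
        if self.flows.pop(key, None) is not None:
            self.pending.append(("del", key, None))

    def flush(self):
        """Analogous to an incremental ofctrl_put(): send only the
        recorded changes, then clear the list."""
        mods, self.pending = self.pending, []
        return mods
```

With one port added per iteration, flush() emits a single change, whereas
reconcile() still walks every installed and desired flow.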
>>
>> Your conclusion is similar to what I thought about when I saw where the
>> new bottleneck was.
>>
>> I don't have a comparison of total CPU cost yet. But I can tell you
>> that from observing CPU usage in real time during the test with the
>> master branch, the graphs appeared to be at 100% across all CPU cores
>> for the duration of the test. I have not observed CPU usage during a
>> test with the incremental branch to see if it improved.
>>
>> Mark
>>
>> Hello Han,
>>
>> Just following up on the previous e-mail. I plan to run a test soon
>> with the incremental branch and observe CPU usage. I won't be able to
>> share a graph, but I can at least tell you if I observe an overall drop
>> in CPU usage compared to using the master branch.
>>
>> I took a look through ofctrl_put() and I think that the incremental
>> processing branch puts it in a good position to be improved
>> dramatically.
>>
>> In master, we currently maintain an hmap of installed flows. When a
>> loop of ovn-controller runs, we use the contents of the southbound
>> database to build up an hmap of our desired flows. We then perform an
>> (expensive) operation where we reconcile the installed flows with the
>> desired flows. The output of all of this is a list of
>> ofputil_flow_mods that we then pass down to vswitchd.
>>
>> In the incremental processing branch [1], this is helped a bit. When we
>> can perform incremental processing, the desired flows are no longer
>> built from scratch on each loop. Instead, the desired flows persist and
>> are modified based on changes observed in the database. ofctrl_put() is
>> mostly unchanged, although it can exit early if it knows that there
>> were no changes observed.
>>
>> In fact, ofctrl_put() (in the ip7 branch) already takes care of the
>> no-change scenario.
>> It returns immediately if it decides no change is needed after some
>> simple evaluation ("flow_changed" is one of the conditions it
>> considers). But yes, when a change is needed, it can be further
>> optimized, as I also mentioned in my last email.
>>
>> I've thought about this and considered two stages to potentially
>> improving this.
>>
>> Stage 1: When performing incremental processing, don't bother dealing
>> with desired flows at all. Instead, when changes are observed in the
>> southbound database, modify the installed flows directly and create
>> ofputil_flow_mods based on the changes. When it is time to run
>> ofctrl_put(), we can pass the generated ofputil_flow_mods directly to
>> vswitchd without having to perform any sort of reconciliation between
>> hmaps. We still have to maintain the installed flows because full runs
>> through the southbound database will need to perform the "classic"
>> version of ofctrl_put() and will need an up-to-date table of installed
>> flows.
>>
>> I was thinking about the same, but then figured out we still need the
>> desired flows because of a corner case: different logical flows can
>> generate the same OVS flow (e.g. when 2 identical ACLs are created for
>> the same OVN port, due to client behavior, which is real at least in
>> Neutron), but duplicated flows can't exist in the installed flows. If
>> we don't maintain a desired-flows table that stores multiple entries
>> for duplicated flows, then when we handle a logical flow deletion we
>> don't know whether the corresponding installed flow(s) is/are still
>> being used by any other logical flow.
>>
>> I'm glad I asked because I would not have realized that distinct
>> logical flows could result in the same OVS flow.
>>
>> One idea about how to address this issue is that the installed_flows
>> hmap could consist of both the OVS flow and a reference count. The
>> reference count is equal to the number of logical flows that resulted
>> in that OVS flow.
>> When doing incremental processing, it's easy to add to and remove from
>> the count based on changes in logical flows. If the count on an
>> installed OVS flow reaches 0, then the flow can be removed from the
>> hmap.
>>
>> This can be addressed by changing the installed-flow table structure to
>> maintain the M:N mapping between logical flows and OVS flows, and then
>> the desired-flow table could be removed. However, keeping the
>> desired-flow table doesn't harm either (apart from the memory
>> consumption), and I am not sure whether there are other reasons we will
>> need it (maybe it helps when a full computation is needed). When
>> incremental processing takes effect, we can still avoid the costly
>> full-table comparison by maintaining a list of the changes while
>> keeping the desired-flow table updated.
>>
>> My thinking behind removing the persistent desired-flows table was
>> mainly just removing something that didn't need to be there. For full
>> computation, we can do what OVS master currently does: build the
>> desired flows from scratch and then delete them at the end of the
>> iteration. Thinking about it more, I think you're right that the main
>> savings we would get would be memory consumption.
>>
>> Stage 2: If the code ever reaches a point where incremental processing
>> is guaranteed to be performed on every loop iteration of
>> ovn-controller, then ovn-controller could become stateless with regard
>> to installed flows. There would be no need to maintain the installed
>> flows since we could just generate ofputil_flow_mods based on how the
>> southbound database has changed. We could then pass those modifications
>> down to vswitchd.
>>
>> This would be the ideal situation, but it seems to me not very
>> realistic for ovn-controller, since there are just too many
>> dependencies. Probably I am just not convinced of the value vs. effort
>> to achieve that. Maybe it is ok to recompute for changes that don't
>> happen very often, isn't it?
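[Editor's note: the reference-count idea discussed above could look
roughly like this toy Python model. It is not the OVS data structure (the
real installed_flows hmap is keyed on match/priority and stores the flow
itself); the class and method names are invented for illustration.]

```python
class InstalledFlows:
    """Installed OVS flows, each with a refcount equal to the number of
    logical flows that produced it."""
    def __init__(self):
        self.refs = {}  # flow key -> number of logical flows producing it

    def lflow_added(self, key):
        """Account for a logical flow producing `key`; return True only
        when the flow must actually be installed (count went 0 -> 1)."""
        self.refs[key] = self.refs.get(key, 0) + 1
        return self.refs[key] == 1

    def lflow_removed(self, key):
        """Account for a logical flow no longer producing `key`; return
        True only when the flow must actually be uninstalled (count hit 0)."""
        self.refs[key] -= 1
        if self.refs[key] == 0:
            del self.refs[key]
            return True
        return False
```

In the duplicated-ACL corner case, the second identical logical flow only
bumps the count, and deleting one of the two leaves the OVS flow installed
until the last reference is gone.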
>>
>> I agree that it is not very realistic, but it would be nice if it were
>> possible to do. I agree that for infrequent changes, performing a
>> recompute is probably fine.
>>
>> I think similar changes could be implemented for the group and meter
>> extend tables, but they are not nearly as important as the flow table.
>>
>> Do my suggestions make sense to you? Are they in line with what you had
>> been thinking about?
>>
>> Of course! Thanks for sharing your good thoughts. I am glad this work
>> helps and let's improve it further :)
>>
>> Thanks,
>> Mark
>>
>> [1] Currently I'm looking at https://github.com/hzhou8/ovs/tree/ip7 . I
>> noticed that there is an ip8 branch as well, but since I have been
>> using ip7 in tests, I decided to look at it as a reference.
>>
>> I will keep the branch ip7 untouched and let you know when I get ip8
>> ready.
>>
>> Thanks. Can you give a brief summary of what is different in ip8? Just
>> a sentence or two is fine.
>>
>> All new updates will go to the ip8 branch, until a new one is needed.
>> It will at least include another rebase on master, and some corner-case
>> handling. Probably it will also include address-set and port-group
>> incremental processing later.
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss