Hi Mark,

Thank you so much for sharing this data. Please see my comments inline.

On Fri, May 18, 2018 at 1:31 PM, Mark Michelson <mmich...@redhat.com> wrote:

> Hi Han, I finally did some tests and looked at the CPU usage between
> master and the ip7 branch.
>
> On the machines running ovn-controller:
> Master branch: Climbs to around 100% over the course of 3 minutes, then
> oscillates close to 100% for about 10 minutes, and then is pegged to 100%
> for the rest of the test. Total test time was about 23 minutes.
> ip7 branch: Oscillates between 10 and 25% for the first 10 minutes of the
> test, then hovers around 10% for the rest. Total test time was about 19
> minutes.
>
This is aligned with my observation of ~90% improvement in CPU cost.

For the throughput/total time, the improvement ratio is different (in my
test case the execution time was reduced by ~50%), but I think it can be
explained. The total execution time does not accurately reflect the
efficiency of the processing, because when the CPU is at 100%,
ovn-controller's processing is slowed down, which may simply result in
fewer iterations completing during the test. I think the stopwatch
profiling mechanism you implemented (also rebased into the incremental
processing branch) will be able to tell the real story. The real impact
of the slowdown is longer latency for handling a change in the control
plane, so I also use latency to evaluate the improvement. The way I test
latency is with ovn-nbctl --wait=hv, together with the nb_cfg improvement
(https://patchwork.ozlabs.org/patch/899608/).
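
For reference, the --wait=hv handshake works roughly as sketched below:
ovn-nbctl bumps nb_cfg together with the transaction and then blocks
until every hypervisor has caught up. This is only a simplified sketch;
the struct layout and the poll_db_update() helper here are hypothetical,
not the actual ovn-nbctl code:

    #include <stdint.h>

    struct nb_global {
        int64_t nb_cfg;   /* Sequence number bumped by ovn-nbctl. */
        int64_t hv_cfg;   /* Lowest nb_cfg acknowledged by any chassis. */
    };

    /* Hypothetical: block until the IDL sees a DB update. */
    void poll_db_update(struct nb_global *);

    /* Latency measurement: returns once all chassis have processed the
     * change committed together with the new nb_cfg value. */
    void
    wait_hv(struct nb_global *nb)
    {
        int64_t target = ++nb->nb_cfg;

        while (nb->hv_cfg < target) {
            poll_db_update(nb);
        }
    }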

> When I switched over to tests that have ACLs:
> Master branch: Behaves about the same as when no ACLs are used. Total
> test time was about 28 minutes.
> ip7 branch: CPU usage hovered around 30% for the entirety of the test,
> hitting spikes around 50% a couple of times. Total test time was about 25
> minutes.
>
> Since I had not done it yet, I also ran perf while running the incremental
> branch with ACLs. I am attaching the flame graph here. The gist is that
> much like the master branch, the majority of CPU time is spent processing
> logical flows.
>
> Seeing the drop in CPU usage between the master branch and the ip7 branch
> makes me think it is worth investigating other areas that may be the
> bottleneck. I monitored memory, disk usage, and network usage on the
> machines, but I didn't see anything that seemed obvious as being the cause
> for delay.
>
The CPU drop between master and ip7 when testing with ACLs is, to my
understanding, most likely because incremental processing avoids
recomputing flows when irrelevant input arrives, such as pinctrl/ofctrl
messages (e.g. probe/echo), while in master any of these inputs would
trigger a recompute.
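
To illustrate the idea (a hypothetical sketch, not the actual engine
code in the branch): each input either has an incremental change handler
or forces a full recompute, and inputs that did not change at all, e.g.
echo/probe wakeups, cost nothing:

    #include <stdbool.h>
    #include <stddef.h>

    struct engine_input {
        bool changed;                       /* Did this input change? */
        bool (*handle_change)(void *ctx);   /* NULL => full recompute. */
    };

    void
    engine_run(struct engine_input *inputs, size_t n, void *ctx,
               void (*full_recompute)(void *ctx))
    {
        for (size_t i = 0; i < n; i++) {
            if (!inputs[i].changed) {
                continue;               /* Irrelevant wakeup: no work. */
            }
            if (!inputs[i].handle_change || !inputs[i].handle_change(ctx)) {
                full_recompute(ctx);    /* No handler, or handler gave up. */
                return;
            }
        }
    }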


> CPU-wise, I think the biggest improvements that can be made to the
> incremental processing branch are:
> * Adding a change handler for the Address_Set table.
> * ofctrl_put() improvements we have discussed.
>
> I think this will have noticeable improvements in our test times. However,
> based on how much the CPU usage dropped just from switching to the
> incremental processing branch, I think there are likely some other
> bottlenecks in our tests that would be more impactful to remove. We already
> know that "ovn_network.bind_port" and "ovn_network.wait_port_up" in
> ovn-scale-test terminology are the operations in our test iterations that
> take the longest. If we can break those down into smaller pieces, we can
> potentially zero in on what to target next.
>

I am not sure whether there are any other *big* bottlenecks, but the
address-set/port-group and ofctrl_put() improvements are surely needed :)
The latest patch I provided is from my ip9 branch, which was rebased on
master this week, with some code refactoring. Feel free to try it, but
don't expect any performance difference.
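
For ofctrl_put(), the direction I have in mind is roughly the sketch
below: record the flow changes as they are made during the iteration,
then emit one flow_mod per recorded change instead of comparing the full
installed vs. desired tables. struct ovn_flow exists in ofctrl.c, and
ovs_list / LIST_FOR_EACH_POP are real; the tracked_* types and the
queue_flow_mod_for() helper are hypothetical:

    #include <stdlib.h>
    #include "openvswitch/list.h"

    struct ovn_flow;                  /* Opaque here; lives in ofctrl.c. */

    enum tracked_op { TRACKED_ADD, TRACKED_MOD, TRACKED_DEL };

    struct tracked_flow_change {
        struct ovs_list node;         /* In the per-iteration change list. */
        enum tracked_op op;
        struct ovn_flow *flow;        /* The desired flow that changed. */
    };

    /* Hypothetical helper: turns one change into an ofputil_flow_mod. */
    void queue_flow_mod_for(enum tracked_op op, struct ovn_flow *flow);

    void
    ofctrl_put_incremental(struct ovs_list *changes)
    {
        struct tracked_flow_change *c;

        LIST_FOR_EACH_POP (c, node, changes) {
            queue_flow_mod_for(c->op, c->flow);  /* No full-table lookups. */
            free(c);
        }
    }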

Thanks,
Han

>
> On 05/11/2018 12:50 PM, Han Zhou wrote:
>
>>
>>
>> On Fri, May 11, 2018 at 7:58 AM, Mark Michelson <mmich...@redhat.com> wrote:
>>
>>     On 05/10/2018 08:33 PM, Han Zhou wrote:
>>
>>
>>
>>         On Wed, May 9, 2018 at 12:21 PM, Mark Michelson
>>         <mmich...@redhat.com> wrote:
>>
>>              On 05/07/2018 08:29 AM, Mark Michelson wrote:
>>
>>                  On 05/04/2018 04:15 PM, Han Zhou wrote:
>>
>>
>>
>>                      On Thu, May 3, 2018 at 10:49 AM, Mark Michelson
>>                      <mmich...@redhat.com> wrote:
>>
>>                           On 05/03/2018 12:58 PM, Mark Michelson wrote:
>>
>>                               Hi Han,
>>
>>                               (cc'ed ovs-discuss)
>>
>>                               I have some test results here for your
>>                               incremental branch[0].
>>
>>                               Browbeat was the test orchestrator for
>>                               the test, and it uses ovn-scale-test[1]
>>                               to configure the test parameters and run
>>                               the test.
>>
>>                               The test uses one central node on which
>>                               ovn-northd runs. There are three farm
>>                               nodes on which sandboxes are run for
>>                               ovn-controller. Each farm node runs 24
>>                               sandboxes, for a total of 72 sandboxes
>>                               (and thus 72 ovn-controller processes).
>>
>>                               The test uses the datapath context[2] to
>>                               set up 72 logical switches and one
>>                               logical router in advance. Then during
>>                               each test iteration, a logical switch
>>                               port is added to one of the logical
>>                               switches and bound on one of the
>>                               sandboxes. The next iteration does not
>>                               start until the previous iteration is
>>                               100% complete (i.e. we see that the
>>                               logical switch port is "up" in the
>>                               northbound db). The total number of
>>                               logical switch ports added during the
>>                               tests is 3312.
>>
>>                               During the test, I ran `perf record` on
>>                               one of the ovn-controller processes and
>>                               then created a flame graph[3] from the
>>                               results. I have attached the flame graph
>>                               to this e-mail. I think this can give us
>>                               a good jumping off point for determining
>>                               more optimizations to make to
>>                               ovn-controller.
>>
>>                               [0] https://github.com/hzhou8/ovs/tree/ip7
>>                               [1] https://github.com/openvswitch/ovn-scale-test
>>                               [2] https://github.com/openvswitch/ovn-scale-test/pull/165
>>                               [3] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs
>>
>>
>>
>>                           From the IRC meeting, it was requested to
>>                           see a flame graph of performance on OVS
>>                           master. I am attaching that to this e-mail.
>>
>>                           One difference in this test run is that the
>>                           number of switch ports was fewer (I'm not
>>                           100% sure of the exact number), so the
>>                           number of samples in perf record is less
>>                           than in the flame graph I previously sent.
>>
>>                           The vast majority of the time is spent in
>>                           lflow_run(). Based on this flame graph, our
>>                           initial take on the matter was that we could
>>                           get improved performance by reducing the
>>                           number of logical flows to process. The
>>                           incremental branch seemed like a good
>>                           testing target to that end.
>>
>>
>>                      Thanks Mark for sharing the results!
>>                      It seems you have sent the wrong attachment
>>                      perf-master.svg in your second email; it is still
>>                      the same one as in the first email. Would you
>>                      mind sending the right one? Also, please share
>>                      the total CPU cost for incremental vs. master
>>                      when you have the data.
>>
>>                      From your text description, it is improved as
>>                      expected, since the bottleneck moved from
>>                      lflow_run() to ofctrl_put(). The new bottleneck
>>                      in ofctrl_put() is a good finding, and I think I
>>                      have some ideas to improve that, too. Basically,
>>                      when we are able to do incremental computing, we
>>                      don't need to go through and compare the
>>                      installed vs. desired flows every time; instead
>>                      we can maintain a list of the changes made in the
>>                      iteration and then send them to OVS. This would
>>                      eliminate a lot of the ovn_flow_lookup() calls
>>                      that are currently shown as the hotspot in
>>                      ofctrl_put(). I will try this as soon as I fix
>>                      some corner cases and get confident about the
>>                      correctness.
>>
>>                      Thanks,
>>                      Han
>>
>>
>>                  Apologies, Han. Here is the proper file.
>>
>>                  Your conclusion is similar to what I thought about
>>                  when I saw where the new bottleneck was.
>>
>>                  I don't have a comparison of total CPU cost yet. But
>>                  I can tell you that, from observing CPU usage in real
>>                  time during the test with the master branch, the
>>                  graphs appeared to be at 100% across all CPU cores
>>                  for the duration of the test. I have not observed CPU
>>                  usage during a test with the incremental branch to
>>                  see if it improved.
>>
>>                  Mark!
>>
>>
>>              Hello Han,
>>
>>              Just following up on the previous e-mail. I plan to run
>>              a test soon with the incremental branch and observe CPU
>>              usage. I won't be able to share a graph, but I can at
>>              least tell you if I observe an overall drop in CPU usage
>>              compared to using the master branch.
>>
>>              I took a look through ofctrl_put() and I think that the
>>              incremental processing branch puts it in a good position
>>              to be improved dramatically.
>>
>>              In master, we currently maintain a hmap of installed
>>              flows. When a loop of ovn-controller runs, we use the
>>              contents of the southbound database to build up a hmap of
>>              our desired flows. We then perform an (expensive)
>>              operation where we reconcile the installed flows with the
>>              desired flows. The output of all of this is a list of
>>              ofputil_flow_mods that we then pass down to vswitchd.
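>>
>>              (Roughly, the reconciliation is two passes over the
>>              hmaps; this is only a sketch with hypothetical queue_*
>>              and actions_equal() helpers, though HMAP_FOR_EACH and
>>              ovn_flow_lookup() are real:)
>>
>>                  struct ovn_flow *i, *d;
>>
>>                  /* Pass 1: delete or modify stale installed flows. */
>>                  HMAP_FOR_EACH (i, hmap_node, &installed_flows) {
>>                      d = ovn_flow_lookup(&desired_flows, i);
>>                      if (!d) {
>>                          queue_delete(i);          /* No longer wanted. */
>>                      } else if (!actions_equal(i, d)) {
>>                          queue_modify(d);          /* Actions changed. */
>>                      }
>>                  }
>>                  /* Pass 2: add desired flows not yet installed. */
>>                  HMAP_FOR_EACH (d, hmap_node, &desired_flows) {
>>                      if (!ovn_flow_lookup(&installed_flows, d)) {
>>                          queue_add(d);             /* Newly desired. */
>>                      }
>>                  }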
>>
>>              In the incremental processing branch [1], this is helped
>>              a bit. When we can perform incremental processing, the
>>              desired flows are no longer built from scratch on each
>>              loop. Instead, the desired flows persist and are modified
>>              based on changes observed in the database. ofctrl_put()
>>              is mostly unchanged, although it can exit early if it
>>              knows that there were no changes observed.
>>
>>         In fact, ofctrl_put() (in the ip7 branch) already takes care
>>         of the no-change scenario. It returns immediately if it
>>         decides no change is needed after some simple evaluation
>>         ("flow_changed" is one of the conditions it considers). But
>>         yes, when a change is needed, it can be further optimized, as
>>         I also mentioned in my last email.
>>
>>
>>              I've thought about this and considered two stages to
>>              potentially improving this.
>>
>>              Stage 1: When performing incremental processing, don't
>>              bother dealing with desired flows at all. Instead, when
>>              changes are observed in the southbound database, modify
>>              the installed flows directly and create ofputil_flow_mods
>>              based on the changes. When it is time to run
>>              ofctrl_put(), we can pass the generated ofputil_flow_mods
>>              directly to vswitchd without having to perform any sort
>>              of reconciliation between hmaps. We still have to
>>              maintain the installed flows because full runs through
>>              the southbound database will need to perform the
>>              "classic" version of ofctrl_put() and will need to have
>>              an up-to-date table of installed flows.
>>
>>         I was thinking about the same, but then figured out that we
>>         still need the desired flows because of a corner case:
>>         different logical flows can generate the same OVS flow (e.g.
>>         when 2 identical ACLs are created for the same OVN port, due
>>         to client behavior, which is real at least in Neutron), but
>>         duplicated flows can't exist in the installed flows. If we
>>         don't maintain desired flows that store multiple entries of
>>         the duplicated flows, then when we handle a logical flow
>>         deletion we don't know whether the corresponding installed
>>         flow(s) is/are still being used by any other logical flow.
>>
>>
>>     I'm glad I asked, because I would not have realized that distinct
>>     logical flows could result in the same OVS flow.
>>
>>     One idea about how to address this issue is that the
>>     installed_flows hmap could consist of both the OVS flow and a
>>     reference count. The reference count is equal to the number of
>>     logical flows that resulted in that OVS flow. When doing
>>     incremental processing, it's easy to add to and subtract from the
>>     count based on changes in logical flows. If the count on an
>>     installed OVS flow reaches 0, then the flow can be removed from
>>     the hmap.
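>>
>>     A rough sketch of what I mean (the fields and the
>>     queue_delete_flow_mod() helper are hypothetical):
>>
>>         /* Installed flow plus a count of how many logical flows
>>          * currently produce it. */
>>         struct installed_flow {
>>             struct hmap_node hmap_node;   /* In installed_flows. */
>>             struct ovn_flow flow;         /* Match + actions. */
>>             size_t n_lflow_refs;
>>         };
>>
>>         /* On logical flow deletion: drop one reference, and only
>>          * delete the OVS flow once nothing produces it anymore. */
>>         if (--f->n_lflow_refs == 0) {
>>             hmap_remove(&installed_flows, &f->hmap_node);
>>             queue_delete_flow_mod(&f->flow);
>>             free(f);
>>         }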
>>
>>
>>         This can be addressed by changing the installed flow table
>>         structure to maintain the M:N mapping between logical flows
>>         and OVS flows, and then the desired flow table could be
>>         removed. However, keeping the desired flow table doesn't hurt
>>         either (apart from the memory consumption), and I am not sure
>>         whether there are other reasons we will need it (maybe it
>>         helps when full computation is needed). When incremental
>>         processing takes effect, we can still avoid the costly full
>>         table comparison by maintaining a list of the changes while
>>         continuing to update the desired flow table.
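>>
>>         (A sketch of the M:N idea, with hypothetical types: each
>>         installed OVS flow keeps the list of logical flows that
>>         produced it, and each logical flow keeps the list of OVS
>>         flows it produced, via shared reference nodes:)
>>
>>             struct lflow_ref {
>>                 struct ovs_list in_flow;    /* In installed_flow's refs. */
>>                 struct ovs_list in_lflow;   /* In logical flow's refs. */
>>                 struct logical_flow *lflow;
>>                 struct installed_flow *flow;
>>             };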
>>
>>
>>     My thinking behind removing the persistent desired flows table was
>>     mainly just removing something that didn't need to be there. For
>>     full computation, we can do what OVS master currently does: build
>>     the desired flows from scratch and then delete them at the end of
>>     the iteration. Thinking about it more, I think you're right that the
>>     main savings we would get would be memory consumption.
>>
>>
>>              Stage 2: If the code ever reaches a point where
>>              incremental processing is guaranteed to be performed on
>>              every loop iteration of ovn-controller, then
>>              ovn-controller could become stateless with regards to
>>              installed flows. There would be no need to maintain the
>>              installed flows, since we could just generate
>>              ofputil_flow_mods based on how the southbound database
>>              has changed. We could then pass those modifications down
>>              to vswitchd.
>>
>>         This would be the ideal situation, but it seems to me not
>>         very realistic for ovn-controller, since there are just too
>>         many dependencies. Probably I am just not convinced of the
>>         value vs. effort to achieve that. Maybe it is OK to recompute
>>         for changes that don't happen very often?
>>
>>
>>     I agree that it is not very realistic, but it would be nice if it
>>     were possible to do. I agree that for infrequent changes, performing
>>     a recompute is probably fine.
>>
>>
>>              I think similar changes could be implemented for the
>>              group and meter extend tables, but they are not nearly
>>              as important as the flow table.
>>
>>              Do my suggestions make sense to you? Are they in line
>>              with what you had been thinking about?
>>
>>         Of course! Thanks for sharing your good thoughts. I am glad
>>         this work helps; let's improve it further :)
>>
>>              Thanks,
>>              Mark
>>
>>              [1] Currently I'm looking at
>>              https://github.com/hzhou8/ovs/tree/ip7 . I noticed that
>>              there is an ip8 branch as well, but since I have been
>>              using ip7 in tests, I decided to look at it as a
>>              reference.
>>
>>
>>         I will keep the branch ip7 untouched and let you know when I get
>>         ip8 ready.
>>
>>
>>     Thanks. Can you give a brief summary of what is different in ip8?
>>     Just a sentence or two is fine.
>>
>>
>> All new updates will go to the ip8 branch, until a new one is needed.
>> It will at least include another rebase on master, and some corner-case
>> handling. It will probably also include address set and port-group
>> incremental processing later.
>>
>>
>>
>