This is huge guys. I just re-tweeted it from the `ApacheMesos` twitter account as well.
Thanks Dmitry, Michael, Yan and Ben for driving these changes. On Mon, Dec 11, 2017 at 1:16 PM, Benjamin Mahler <bmah...@apache.org> wrote: > I was going to send this out to the dev@ list but I'll reply here instead > :) > > This is a report on the progress we've made recently in the performance > working group, specifically highlighting the improvements to master > failover performance. Special thanks to Dmitry Zhuk, Michael Park and Yan > Xu for the work in this area. > > I didn't get time to write up something for the libprocess improvements, so > I decided to split that out into its own blog post. > > You can see it on the website here: > http://mesos.apache.org/blog/performance-working-group-progress-report/ > > And I tweeted about it here if you'd like to help share it: > https://twitter.com/bmahler/status/940325681312940033 > > On Mon, Dec 11, 2017 at 12:14 PM, Jie Yu <yujie....@gmail.com> wrote: > > > This is AWESOME! > > > > - Jie > > > > On Mon, Dec 11, 2017 at 11:54 AM, <bmah...@apache.org> wrote: > > > > > Repository: mesos > > > Updated Branches: > > > refs/heads/master 83f81b7b2 -> e7244ae1e > > > > > > > > > Added a performance working group December 2017 blog post. > > > > > > This blog post discusses the master failover performance improvements > > > that were made in the past few months. > > > > > > > > > Project: http://git-wip-us.apache.org/repos/asf/mesos/repo > > > Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/e7244ae1 > > > Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/e7244ae1 > > > Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/e7244ae1 > > > > > > Branch: refs/heads/master > > > Commit: e7244ae1eb84a8bfcbe2940107c7f97a53832cf2 > > > Parents: 83f81b7 > > > Author: Benjamin Mahler <bmah...@apache.org> > > > Authored: Mon Dec 11 11:52:43 2017 -0800 > > > Committer: Benjamin Mahler <bmah...@apache.org> > > > Committed: Mon Dec 11 11:52:43 2017 -0800 > > > > > > ---------------------------------------------------------------------- > > > .../1.3-1.5_master_failover_no_history.png | Bin 0 -> 115884 > bytes > > > .../1.3-1.5_master_failover_with_history.png | Bin 0 -> 140716 > bytes > > > ...performance-working-group-progress-report.md | 90 > > +++++++++++++++++++ > > > 3 files changed, 90 insertions(+) > > > ---------------------------------------------------------------------- > > > > > > > > > http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/ > > > docs/images/1.3-1.5_master_failover_no_history.png > > > ---------------------------------------------------------------------- > > > diff --git a/docs/images/1.3-1.5_master_failover_no_history.png > > > b/docs/images/1.3-1.5_master_failover_no_history.png > > > new file mode 100644 > > > index 0000000..c1dca64 > > > Binary files /dev/null and b/docs/images/1.3-1.5_master_ > > failover_no_history.png > > > differ > > > > > > http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/ > > > docs/images/1.3-1.5_master_failover_with_history.png > > > ---------------------------------------------------------------------- > > > diff --git a/docs/images/1.3-1.5_master_failover_with_history.png > > > b/docs/images/1.3-1.5_master_failover_with_history.png > > > new file mode 100644 > > > index 0000000..4f3deec > > > Binary files /dev/null and b/docs/images/1.3-1.5_master_ > > failover_with_history.png > > > differ > > > > > > http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/ > > > site/source/blog/2017-12-7-performance-working-group- > progress-report.md > > > ---------------------------------------------------------------------- > > > diff --git a/site/source/blog/2017-12-7-performance-working-group- > > > progress-report.md b/site/source/blog/2017-12-7- > > performance-working-group- > > > progress-report.md > > > new file mode 100644 > > > index 0000000..7e42642 > > > --- /dev/null > > > +++ b/site/source/blog/2017-12-7-performance-working-group- > > > progress-report.md > > > @@ -0,0 +1,90 @@ > > > +--- > > > +layout: post > > > +title: December 2017 Performance Working Group Progress Report > > > +published: true > > > +post_author: > > > + display_name: Benjamin Mahler > > > + gravatar: fb43656d4d45f940160c3226c53309f5 > > > + twitter: bmahler > > > +tags: Performance > > > +--- > > > + > > > +**Scalability and performance are key features for Mesos. Some users > of > > > Mesos already run production clusters that consist of more than 35,000+ > > > agents and 100,000+ active tasks.** However, there remains a lot of > room > > > for improvement across a variety of areas of the system. > > > + > > > +The performance working group was created in order to focus on some of > > > these areas. The group's charter is to improve scalability / > throughput / > > > latency across the system, and in order to measure our improvements and > > > prevent performance regressions we will write benchmarks and automate > > them. > > > + > > > +In the past few months, we've focused on making improvements to the > > > following areas: > > > + > > > +* **Master failover time-to-completion**: Achieved a 450-600% > > improvement > > > in throughput, which reduces the time-to-completion by 80-85%. > > > +* **[Libprocess](https://github.com/apache/mesos/tree/master/ > > > 3rdparty/libprocess) message passing throughput**: These improvements > > > will be covered in a separate blog post. > > > + > > > +Before we dive into the master failover improvements, I would like to > > > recognize and thank the following contributors: > > > + > > > +* **Dmitry Zhuk**: for writing *a lot* of patches for improving the > > > master failover performance. > > > +* **Michael Park**: for reviewing and shipping many of Dmitry's more > > > challenging patches. > > > +* **Yan Xu**: for writing the master failover benchmark that was the > > > basis for measuring the improvements. > > > + > > > +## Master Failover Time-To-Completion > > > + > > > +Our first area of focus was to improve the time it takes for a master > > > failover to complete, where completion is defined as all of the agents > > > successfully re-registering. Mesos is architected to use a centralized > > > master with standby masters that participate in a quorum for high > > > availability. For scalability reasons, the leading master stores the > > state > > > of the cluster in-memory. During a master failover, the leading master > > > needs to therefore re-build the in-memory state from all of the agents > > that > > > re-register. During this time, the master is available to process other > > > requests, but will be exposing only partial state to API consumers. > > > + > > > +The rebuilding of the master’s in-memory state can be expensive for > > > larger clusters, and so the focus of this effort was to improve the > > > efficiency of this. Improvements were made via several areas, and only > > the > > > highest-impact changes are listed below: > > > + > > > +### Protobuf 3.5.0 Move Support > > > + > > > +We upgraded to protobuf 3.5.0 in order to gain move support. When we > > > profiled the master, we found that it spent a lot of time copying > > protobuf > > > messages during agent re-registration. This support allowed us to > > eliminate > > > copies of protobuf messages while retaining value semantics. > > > + > > > +### Move Support and Copy Elimination in Libprocess `dispatch` / > `defer` > > > / `install` > > > + > > > +Libprocess provides several primitives for message passing: > > > + > > > +* `dispatch`: Provides the ability to post a messages to a local > > `Process` > > > +* `defer`: Provides a deferred `dispatch`. i.e. a function object that > > > when invoked will issue a `dispatch`. > > > +* `install`: Installs a handler for receiving a protobuf message. > > > + > > > +These primitives did not have move support, as they were originally > > added > > > prior to the addition of C++11 support to the code-base. In order to > > > eliminate copies, we enhanced these primitives to support moving > > arguments > > > in and out. > > > + > > > +This required introducing a new C++ utility, because `defer` takes on > > the > > > same API as `std::bind` (e.g., placeholders). Specifically, the > function > > > object returned by `std::bind` does not move the bound arguments into > the > > > stored callable. In order to enable this, `defer` now uses a utility we > > > introduced called `lambda::partial` rather than `std::bind`. > > > `lambda::partial` performs partial function application similar to > > > `std::bind` except the returned function object moves the bound > arguments > > > into the stored callable if the invocation is performed on an r-value > > > function object. > > > + > > > +### Copy Elimination in the Master > > > + > > > +With these previous enhancements in place, we were able to eliminate > > many > > > of the expensive copies of protobuf messages performed by the master. > > > + > > > +### Benchmark and Results > > > + > > > +We wrote a synthetic benchmark to simulate a master failover. This > > > benchmark prepares all the messages that would be sent to the master by > > the > > > agents that need to re-register: > > > + > > > +* The benchmark uses synthetic agents in that they are just an actor > > that > > > knows how to re-register with the master. > > > +* Each "agent" will send a configurable number of active and completed > > > tasks belonging to a configurable number of active and completed > > frameworks. > > > +* Each task has 10 small labels to introduce metadata overhead. > > > + > > > +The benchmark has a few caveats: > > > + > > > +* It does not use executors (this would show improved results over > what > > > is shown below, but for simplicity the benchmark omits them) > > > +* It uses local message passing, whereas a real cluster would be > passing > > > messages over HTTP. > > > +* It uses a quorum size of 1, so writes to the master’s registry occur > > > only on single local log replica. > > > +* The synthetic agents do not retry their re-registration, whereas > > > typically agents will retry with a backoff. > > > + > > > +This was tested on a 2015 Macbook Pro with 2.8 GHz Intel Core i7 > > > processor. Mesos was configured using: `Apple LLVM version 9.0.0 > > > (clang-900.0.38)`, with `-O2` enabled in 1.5.0. > > > + > > > +The first results represent a cluster with 10 active tasks per agent > > > across 5 frameworks, with no completed tasks. The results from 1,000 - > > > 40,000 agents with 10,000 - 400,000 active tasks: > > > + > > > + > > > + > > > +There was a reduction in the time-to-completion of ~80% due to a > > 450-500% > > > improvement in throughput across 1.3.0 to 1.5.0. > > > + > > > +The second results add task history: each agent also now contains 100 > > > completed tasks across 5 completed frameworks. The results from 1,000 - > > > 40,000 agents with 10,000 - 400,000 active tasks and 100,000 - > 4,000,000 > > > completed tasks are shown below: > > > + > > > + > > > + > > > +This represents a reduction in time-to-completion of ~85% due to a > > > 550-700% improvement in throughput across 1.3.0 to 1.5.0. > > > + > > > +## Performance Working Group Roadmap > > > + > > > +We're currently targeting the following areas for improvements: > > > + > > > +* **Performance of the v1 API**: Currently the v1 API can be > > > significantly slower than the v0 API. We would like to reach parity, > and > > > ideally surpass the performance of the v0 API. > > > + * **[Libprocess](https://github.com/apache/mesos/tree/master/ > > > 3rdparty/libprocess) HTTP performance**: This will be undertaken as > part > > > of improving the v1 API performance, since it is HTTP-based. > > > +* **Master state API performance**: Currently, API queries of the > > > master's state are serviced by the same master actor that processes all > > of > > > the messages from schedulers and agents. Since the query processing can > > > block the master from processing other events, users need to be careful > > not > > > to query the master excessively. In practice, the master gets queried > > quite > > > heavily due to the presence of several tools that rely on the master's > > > state (e.g. DNS tooling, UIs, CLIs, etc) and so this is a critical > > problem > > > for users. This effort will leverage the state streaming API to stream > > the > > > state to a different actor that can serve the state API requests. This > > will > > > ensure that expensive state queries do not affect the master's ability > to > > > process events. > > > + > > > +If you are a user and would like to suggest some areas for performance > > > improvement, please let us know by emailing <d...@apache.mesos.org>. > > > > > > > > >