This is AWESOME! - Jie
On Mon, Dec 11, 2017 at 11:54 AM, <bmah...@apache.org> wrote: > Repository: mesos > Updated Branches: > refs/heads/master 83f81b7b2 -> e7244ae1e > > > Added a performance working group December 2017 blog post. > > This blog post discusses the master failover performance improvements > that were made in the past few months. > > > Project: http://git-wip-us.apache.org/repos/asf/mesos/repo > Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/e7244ae1 > Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/e7244ae1 > Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/e7244ae1 > > Branch: refs/heads/master > Commit: e7244ae1eb84a8bfcbe2940107c7f97a53832cf2 > Parents: 83f81b7 > Author: Benjamin Mahler <bmah...@apache.org> > Authored: Mon Dec 11 11:52:43 2017 -0800 > Committer: Benjamin Mahler <bmah...@apache.org> > Committed: Mon Dec 11 11:52:43 2017 -0800 > > ---------------------------------------------------------------------- > .../1.3-1.5_master_failover_no_history.png | Bin 0 -> 115884 bytes > .../1.3-1.5_master_failover_with_history.png | Bin 0 -> 140716 bytes > ...performance-working-group-progress-report.md | 90 +++++++++++++++++++ > 3 files changed, 90 insertions(+) > ---------------------------------------------------------------------- > > > http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/ > docs/images/1.3-1.5_master_failover_no_history.png > ---------------------------------------------------------------------- > diff --git a/docs/images/1.3-1.5_master_failover_no_history.png > b/docs/images/1.3-1.5_master_failover_no_history.png > new file mode 100644 > index 0000000..c1dca64 > Binary files /dev/null and > b/docs/images/1.3-1.5_master_failover_no_history.png > differ > > http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/ > docs/images/1.3-1.5_master_failover_with_history.png > ---------------------------------------------------------------------- > diff --git a/docs/images/1.3-1.5_master_failover_with_history.png > b/docs/images/1.3-1.5_master_failover_with_history.png > new file mode 100644 > index 0000000..4f3deec > Binary files /dev/null and > b/docs/images/1.3-1.5_master_failover_with_history.png > differ > > http://git-wip-us.apache.org/repos/asf/mesos/blob/e7244ae1/ > site/source/blog/2017-12-7-performance-working-group-progress-report.md > ---------------------------------------------------------------------- > diff --git a/site/source/blog/2017-12-7-performance-working-group- > progress-report.md b/site/source/blog/2017-12-7-performance-working-group- > progress-report.md > new file mode 100644 > index 0000000..7e42642 > --- /dev/null > +++ b/site/source/blog/2017-12-7-performance-working-group- > progress-report.md > @@ -0,0 +1,90 @@ > +--- > +layout: post > +title: December 2017 Performance Working Group Progress Report > +published: true > +post_author: > + display_name: Benjamin Mahler > + gravatar: fb43656d4d45f940160c3226c53309f5 > + twitter: bmahler > +tags: Performance > +--- > + > +**Scalability and performance are key features for Mesos. Some users of > Mesos already run production clusters that consist of more than 35,000+ > agents and 100,000+ active tasks.** However, there remains a lot of room > for improvement across a variety of areas of the system. > + > +The performance working group was created in order to focus on some of > these areas. The group's charter is to improve scalability / throughput / > latency across the system, and in order to measure our improvements and > prevent performance regressions we will write benchmarks and automate them. > + > +In the past few months, we've focused on making improvements to the > following areas: > + > +* **Master failover time-to-completion**: Achieved a 450-600% improvement > in throughput, which reduces the time-to-completion by 80-85%. > +* **[Libprocess](https://github.com/apache/mesos/tree/master/ > 3rdparty/libprocess) message passing throughput**: These improvements > will be covered in a separate blog post. > + > +Before we dive into the master failover improvements, I would like to > recognize and thank the following contributors: > + > +* **Dmitry Zhuk**: for writing *a lot* of patches for improving the > master failover performance. > +* **Michael Park**: for reviewing and shipping many of Dmitry's more > challenging patches. > +* **Yan Xu**: for writing the master failover benchmark that was the > basis for measuring the improvements. > + > +## Master Failover Time-To-Completion > + > +Our first area of focus was to improve the time it takes for a master > failover to complete, where completion is defined as all of the agents > successfully re-registering. Mesos is architected to use a centralized > master with standby masters that participate in a quorum for high > availability. For scalability reasons, the leading master stores the state > of the cluster in-memory. During a master failover, the leading master > needs to therefore re-build the in-memory state from all of the agents that > re-register. During this time, the master is available to process other > requests, but will be exposing only partial state to API consumers. > + > +The rebuilding of the master’s in-memory state can be expensive for > larger clusters, and so the focus of this effort was to improve the > efficiency of this. Improvements were made via several areas, and only the > highest-impact changes are listed below: > + > +### Protobuf 3.5.0 Move Support > + > +We upgraded to protobuf 3.5.0 in order to gain move support. When we > profiled the master, we found that it spent a lot of time copying protobuf > messages during agent re-registration. This support allowed us to eliminate > copies of protobuf messages while retaining value semantics. > + > +### Move Support and Copy Elimination in Libprocess `dispatch` / `defer` > / `install` > + > +Libprocess provides several primitives for message passing: > + > +* `dispatch`: Provides the ability to post a messages to a local `Process` > +* `defer`: Provides a deferred `dispatch`. i.e. a function object that > when invoked will issue a `dispatch`. > +* `install`: Installs a handler for receiving a protobuf message. > + > +These primitives did not have move support, as they were originally added > prior to the addition of C++11 support to the code-base. In order to > eliminate copies, we enhanced these primitives to support moving arguments > in and out. > + > +This required introducing a new C++ utility, because `defer` takes on the > same API as `std::bind` (e.g., placeholders). Specifically, the function > object returned by `std::bind` does not move the bound arguments into the > stored callable. In order to enable this, `defer` now uses a utility we > introduced called `lambda::partial` rather than `std::bind`. > `lambda::partial` performs partial function application similar to > `std::bind` except the returned function object moves the bound arguments > into the stored callable if the invocation is performed on an r-value > function object. > + > +### Copy Elimination in the Master > + > +With these previous enhancements in place, we were able to eliminate many > of the expensive copies of protobuf messages performed by the master. > + > +### Benchmark and Results > + > +We wrote a synthetic benchmark to simulate a master failover. This > benchmark prepares all the messages that would be sent to the master by the > agents that need to re-register: > + > +* The benchmark uses synthetic agents in that they are just an actor that > knows how to re-register with the master. > +* Each "agent" will send a configurable number of active and completed > tasks belonging to a configurable number of active and completed frameworks. > +* Each task has 10 small labels to introduce metadata overhead. > + > +The benchmark has a few caveats: > + > +* It does not use executors (this would show improved results over what > is shown below, but for simplicity the benchmark omits them) > +* It uses local message passing, whereas a real cluster would be passing > messages over HTTP. > +* It uses a quorum size of 1, so writes to the master’s registry occur > only on single local log replica. > +* The synthetic agents do not retry their re-registration, whereas > typically agents will retry with a backoff. > + > +This was tested on a 2015 Macbook Pro with 2.8 GHz Intel Core i7 > processor. Mesos was configured using: `Apple LLVM version 9.0.0 > (clang-900.0.38)`, with `-O2` enabled in 1.5.0. > + > +The first results represent a cluster with 10 active tasks per agent > across 5 frameworks, with no completed tasks. The results from 1,000 - > 40,000 agents with 10,000 - 400,000 active tasks: > + > + > + > +There was a reduction in the time-to-completion of ~80% due to a 450-500% > improvement in throughput across 1.3.0 to 1.5.0. > + > +The second results add task history: each agent also now contains 100 > completed tasks across 5 completed frameworks. The results from 1,000 - > 40,000 agents with 10,000 - 400,000 active tasks and 100,000 - 4,000,000 > completed tasks are shown below: > + > + > + > +This represents a reduction in time-to-completion of ~85% due to a > 550-700% improvement in throughput across 1.3.0 to 1.5.0. > + > +## Performance Working Group Roadmap > + > +We're currently targeting the following areas for improvements: > + > +* **Performance of the v1 API**: Currently the v1 API can be > significantly slower than the v0 API. We would like to reach parity, and > ideally surpass the performance of the v0 API. > + * **[Libprocess](https://github.com/apache/mesos/tree/master/ > 3rdparty/libprocess) HTTP performance**: This will be undertaken as part > of improving the v1 API performance, since it is HTTP-based. > +* **Master state API performance**: Currently, API queries of the > master's state are serviced by the same master actor that processes all of > the messages from schedulers and agents. Since the query processing can > block the master from processing other events, users need to be careful not > to query the master excessively. In practice, the master gets queried quite > heavily due to the presence of several tools that rely on the master's > state (e.g. DNS tooling, UIs, CLIs, etc) and so this is a critical problem > for users. This effort will leverage the state streaming API to stream the > state to a different actor that can serve the state API requests. This will > ensure that expensive state queries do not affect the master's ability to > process events. > + > +If you are a user and would like to suggest some areas for performance > improvement, please let us know by emailing <d...@apache.mesos.org>. > >