[ 
https://issues.apache.org/jira/browse/MESOS-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068196#comment-16068196
 ] 

Dmitry Zhuk commented on MESOS-7713:
------------------------------------

https://docs.google.com/spreadsheets/d/1xqFxcWxOyjbozro0SkshTIKkaGgShRMN8bqBsdtnl8k/edit?usp=sharing
 - this spreadsheet shows the performance improvement for master failover with 
the patches applied. Reregistration time dropped from 1:20 to 1:00 (not 
including the time to recover the registry).

Test environment: scale test cluster simulating ~40K agents and ~100K tasks, 
dedicated master hosts, {{--reregistration_backoff_factor=45secs}} on agents.

Versions tested:
1.2.0 - Mesos 1.2.0 + https://reviews.apache.org/r/58355/
1.2.0-fix - same as above + https://reviews.apache.org/r/60002/, 
https://reviews.apache.org/r/60003/ + https://reviews.apache.org/r/60472/, 
https://reviews.apache.org/r/60473/, https://reviews.apache.org/r/60474/ + 
changes that install the {{Master::reregisterSlave}} handler using the 
{{mutable_}} versions of the protobuf message field accessors, take parameters 
by value, and {{std::move}} them into {{defer}}.
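
To illustrate why taking parameters by value and moving them matters, here is a minimal sketch. It is not the actual Mesos or libprocess code: {{Payload}}, {{handler}}, {{bindByConstRef}}, {{bindByValue}} and {{countCopies}} are hypothetical names, and {{std::bind}} stands in for {{defer}}. {{Payload}} plays the role of a protobuf message and only counts how often it is copied.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a protobuf message; it only counts copies.
struct Payload {
  static int copies;
  std::vector<std::string> tasks;

  Payload() = default;
  Payload(const Payload& that) : tasks(that.tasks) { ++copies; }
  Payload(Payload&& that) noexcept : tasks(std::move(that.tasks)) {}
};

int Payload::copies = 0;

// Hypothetical handler that the deferred call eventually invokes.
std::size_t seen = 0;
void handler(const Payload& payload) { seen = payload.tasks.size(); }

// Const-reference parameter: binding has to copy the argument.
std::function<void()> bindByConstRef(const Payload& payload) {
  return std::bind(&handler, payload);
}

// By-value parameter: the caller moves in, and we move into the binding.
std::function<void()> bindByValue(Payload payload) {
  return std::bind(&handler, std::move(payload));
}

// Returns the number of Payload copies made for one deferred call.
int countCopies(bool byValue) {
  Payload payload;
  payload.tasks = {"task-1", "task-2"};

  Payload::copies = 0;
  seen = 0;
  std::function<void()> deferred =
    byValue ? bindByValue(std::move(payload)) : bindByConstRef(payload);
  deferred();

  assert(seen == 2);  // The handler ran with the full payload.
  return Payload::copies;
}
```

In this sketch {{countCopies(true)}} performs no copies of {{Payload}}, while {{countCopies(false)}} performs at least one. The real {{defer}}/{{dispatch}} path has more layers, so the savings there depend on each layer forwarding or moving its arguments.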

Each version was tested 3 times by killing the leading master and collecting 
metrics from the logs of the newly elected master.
Metrics are calculated by counting the occurrences of the following messages 
in the logs:
{{reregistering}} - "Re-registering agent ..."
{{ignoring}} - "Ignoring re-register agent message from agent ... as 
readmission is already in progress"
{{reregistered}} - "Re-registered agent ..."
{{sending}} - "Sending updated checkpointed resources ... to agent ..."
{{update}} - "Received update of agent ... with total oversubscribed resources 
..."
{{pending}} = {{reregistering}} - {{sending}} (the number of in-progress 
reregistrations)
{{offers}} - "Sending ... offers to framework ..."
{{applied_cnt}}, {{applied}} - "Applied ... operations in ...; attempting to 
update the registry" (the number of such messages and the total number of 
operations, respectively)
{{reg_updated}} - "Successfully updated the registry in ..." (the duration is 
extracted from the message).
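
The counting described above can be sketched roughly as follows. This is not the script used for the test: {{countMetrics}} is a hypothetical helper, plain substring matching is an assumption, and only the agent-related messages are covered. It counts matches over a given set of log lines and derives {{pending}} from the totals.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Counts, per metric, how many log lines contain the corresponding message.
std::map<std::string, int> countMetrics(const std::vector<std::string>& lines) {
  // Metric names and message prefixes come from the list above;
  // matching by substring is an assumption of this sketch.
  const std::vector<std::pair<std::string, std::string>> patterns = {
    {"reregistering", "Re-registering agent"},
    {"ignoring", "Ignoring re-register agent message from agent"},
    {"reregistered", "Re-registered agent"},
    {"sending", "Sending updated checkpointed resources"},
    {"update", "Received update of agent"},
  };

  std::map<std::string, int> counts;
  for (const auto& pattern : patterns) {
    counts[pattern.first] = 0;
  }

  for (const std::string& line : lines) {
    for (const auto& pattern : patterns) {
      if (line.find(pattern.second) != std::string::npos) {
        ++counts[pattern.first];
      }
    }
  }

  // Derived metric: reregistrations started but not yet completed.
  counts["pending"] = counts["reregistering"] - counts["sending"];
  return counts;
}
```

Feeding the lines in timestamp order and sampling the counts at intervals would give the per-interval view of {{pending}} rather than a single total.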

> Optimize number of copies made in dispatch/defer mechanism
> ----------------------------------------------------------
>
>                 Key: MESOS-7713
>                 URL: https://issues.apache.org/jira/browse/MESOS-7713
>             Project: Mesos
>          Issue Type: Task
>          Components: libprocess
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Dmitry Zhuk
>            Assignee: Dmitry Zhuk
>
> Profiling agent reregistration for a large cluster shows that many CPU 
> cycles are spent on copying protobuf objects. This is partially due to copies 
> made by code like this:
> {code}
> future.then(defer(self(), &Process::method, param));
> {code}
> {{param}} could be copied 8-10 times before it reaches {{method}}. 
> Specifically, {{reregisterSlave}} accepts vectors of rather complex objects, 
> which are passed to {{defer}}.
> Currently there are several places in the {{defer}}, {{dispatch}} and 
> {{Future}} code that could use {{std::move}} and {{std::forward}} to avoid 
> some of the copies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
