[ https://issues.apache.org/jira/browse/MESOS-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068196#comment-16068196 ]
Dmitry Zhuk commented on MESOS-7713:
------------------------------------

https://docs.google.com/spreadsheets/d/1xqFxcWxOyjbozro0SkshTIKkaGgShRMN8bqBsdtnl8k/edit?usp=sharing - this demonstrates performance improvements for master failover with patches applied. Reregistration time was reduced from 1:20 to 1:00 (not including time to recover the registry).

Test environment: scale test cluster simulating ~40K agents and ~100K tasks, dedicated master hosts, {{--reregistration_backoff_factor=45secs}} on agents.

Versions tested:
1.2.0 - Mesos 1.2.0 + https://reviews.apache.org/r/58355/
1.2.0-fix - same as above
+ https://reviews.apache.org/r/60002/, https://reviews.apache.org/r/60003/
+ https://reviews.apache.org/r/60472/, https://reviews.apache.org/r/60473/, https://reviews.apache.org/r/60474/
+ changes to install the {{Master::reregisterSlave}} handler with {{mutable_}} versions of protobuf message field accessors, take parameters by value and {{std::move}} them to {{defer}}.

Each version was tested 3 times by killing the leading master and collecting metrics from the newly elected master's logs.

Metrics are calculated by counting the number of different messages appearing in the logs:
{{reregistering}} - "Re-registering agent ..."
{{ignoring}} - "Ignoring re-register agent message from agent ... as readmission is already in progress"
{{reregistered}} - "Re-registered agent ..."
{{sending}} - "Sending updated checkpointed resources ... to agent ..."
{{update}} - "Received update of agent ... with total oversubscribed resources ..."
{{pending}} = {{reregistering}} - {{sending}} - indicates the number of in-progress reregistrations.
{{offers}} - "Sending ... offers to framework ..."
{{applied_cnt}}, {{applied}} - "Applied ... operations in ...; attempting to update the registry" (correspond to the number of messages and the total number of operations, respectively)
{{reg_updated}} - "Successfully updated the registry in ..." (duration extracted from the message).
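
For illustration only, the effect of taking parameters by value and {{std::move}}-ing them onward can be sketched in standalone C++14. This is not the Mesos or libprocess code: {{Payload}}, {{handler}}, {{bindByCopy}}, {{bindByMove}} and the two counting helpers are hypothetical names, and {{std::function}} merely stands in for the deferred callable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a large protobuf message; counts its own copies.
struct Payload {
  static int copies;
  std::vector<int> data;

  Payload() : data(1000, 1) {}
  Payload(const Payload& other) : data(other.data) { ++copies; }
  Payload(Payload&& other) noexcept : data(std::move(other.data)) {}
};

int Payload::copies = 0;

// Hypothetical handler that takes its argument by value.
std::size_t handler(Payload p) { return p.data.size(); }

// Copying style: const reference in, copy into the closure, copy on call.
// The captured member is const, so moving the closure around copies it again.
std::function<std::size_t()> bindByCopy(const Payload& p) {
  return [p]() { return handler(p); };
}

// Moving style: by value in, std::move into the closure and on to the handler.
// The returned callable leaves its payload moved-from, so call it only once.
std::function<std::size_t()> bindByMove(Payload p) {
  return [p = std::move(p)]() mutable { return handler(std::move(p)); };
}

// Number of Payload copies made by one bind-and-call round trip (copying style).
int copiesWithCopying() {
  Payload payload;
  Payload::copies = 0;
  auto f = bindByCopy(payload);
  std::size_t n = f();
  assert(n == 1000);
  (void)n;
  return Payload::copies;
}

// Number of Payload copies made by one bind-and-call round trip (moving style).
int copiesWithMoving() {
  Payload payload;
  Payload::copies = 0;
  auto f = bindByMove(std::move(payload));
  std::size_t n = f();
  assert(n == 1000);
  (void)n;
  return Payload::copies;
}
```

In this sketch the copying style makes at least two copies per call (the exact count depends on the standard library), while the moving style makes none. The trade-off is that the moved-from callable can only be invoked once.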
> Optimize number of copies made in dispatch/defer mechanism
> ----------------------------------------------------------
>
>                 Key: MESOS-7713
>                 URL: https://issues.apache.org/jira/browse/MESOS-7713
>             Project: Mesos
>          Issue Type: Task
>          Components: libprocess
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Dmitry Zhuk
>            Assignee: Dmitry Zhuk
>
> Profiling agent reregistration for a large cluster shows that many CPU
> cycles are spent on copying protobuf objects. This is partially due to copies
> made by code like this:
> {code}
> future.then(defer(self(), &Process::method, param));
> {code}
> {{param}} could be copied 8-10 times before it reaches {{method}}.
> Specifically, {{reregisterSlave}} accepts vectors of rather complex objects,
> which are passed to {{defer}}.
> Currently there are some places in the {{defer}}, {{dispatch}} and {{Future}}
> code that could use {{std::move}} and {{std::forward}} to avoid some of the
> copies.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)