[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558507#comment-14558507 ]

Joris Van Remoortere commented on MESOS-2254:
---------------------------------------------

[~marco-mesos] The endpoint is already rate-limited using a 
{{process::RateLimiter}} that permits 2 calls per second. The main concern is 
that even a single call to this API is expensive: each of the N executors ends 
up scanning all P processes on the system, so the cost is O(N*P) per call.
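
For reference, this is roughly how that limit is expressed with a 
{{process::RateLimiter}} (the handler name and body below are illustrative, 
not the actual endpoint code):

{code}
#include <process/future.hpp>
#include <process/limiter.hpp>

#include <stout/duration.hpp>
#include <stout/nothing.hpp>

using process::Future;
using process::RateLimiter;

// Allow at most 2 permits per second; each statistics request acquires a
// permit before being served, so callers are throttled to 2 calls/second.
RateLimiter limiter(2, Seconds(1));

Future<Nothing> handleStatisticsRequest()
{
  return limiter.acquire()
    .then([]() {
      // ... compute and return the usage snapshot (elided) ...
      return Nothing();
    });
}
{code}

The limiter only caps the request rate, though; it does nothing about the cost 
of each individual call.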

There are opportunities to cache; however, caching introduces decisions about 
when to clear the cache (do we do it on a time-based interval? after some 
number of requests?) as well as the risk of serving stale data. Since the 
intent of this call is to get a current snapshot of usage data, I would prefer 
to avoid introducing explicit caching, and instead pass along enough 
"information" to allow re-use of the data within the same "call" (batching).

In this particular case, the reason we are doing the (N*P) work is that the 
containerizer calls the usage function on the isolator once per container. In 
my opinion this is the cleanest place to "cache", although I would prefer to 
call it "batch". The isolator loses the "information" that we are asking for a 
snapshot of all containers; instead, it sees N independent requests for 
single-container snapshots.
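
In simplified pseudocode, the current shape of the problem (not the actual 
containerizer code) is:

{code}
// Simplified illustration of the N*P behaviour: the containerizer asks the
// isolator for usage once per container, and every call rebuilds the full
// process tree from scratch.
foreach (const ContainerID& containerId, containers) {
  // PosixCpuIsolatorProcess::usage() ends up in os::pstree(), which calls
  // os::processes() and therefore inspects all P processes on the host --
  // repeated N times, once per container.
  isolator->usage(containerId);
}
{code}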

My proposal would be to modify the interface to allow a batched version of the 
call, so that the usage call can re-use any data it collects. I think this is 
the cleanest way to control when we recompute / invalidate the data.
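
For illustration only, a batched variant could look something like the 
following; the name {{usages()}} and the exact signature are hypothetical, not 
an agreed design:

{code}
// Hypothetical batched isolator call: one request covers a set of
// containers, so the implementation can take a single process-table
// snapshot and reuse it for every container in the set.
virtual process::Future<hashmap<ContainerID, ResourceStatistics>> usages(
    const std::list<ContainerID>& containerIds) = 0;
{code}

The posix implementation would then call {{os::processes()}} once per batched 
request and derive each container's subtree (and its {{ResourceStatistics}}) 
from that shared snapshot, instead of calling {{os::pstree()}} separately for 
every container.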

There is also the opportunity to reduce the full stats parsing to just the 
subset of pids that we are interested in. That alone would already provide a 
~30x improvement.
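
A sketch of that narrower approach, assuming we already know the executor's 
pid and only need its descendants (the {{ProcessInfo}} struct and the 
{{descendants()}} helper are illustrative, not stout's actual types):

{code}
#include <sys/types.h>

#include <set>
#include <vector>

// Illustrative stand-in for one row of a process-table snapshot.
struct ProcessInfo
{
  pid_t pid;
  pid_t ppid;
};

// Return the executor pid plus all of its transitive descendants, given a
// single snapshot of the process table. Detailed per-process stats then only
// need to be parsed for this subset rather than for every pid on the host.
std::set<pid_t> descendants(
    pid_t executor,
    const std::vector<ProcessInfo>& snapshot)
{
  std::set<pid_t> result = {executor};

  bool changed = true;
  while (changed) {
    changed = false;
    for (const ProcessInfo& process : snapshot) {
      if (result.count(process.ppid) > 0 &&
          result.insert(process.pid).second) {
        changed = true;
      }
    }
  }

  return result;
}
{code}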

P.S. This problem can also be completely avoided by calling into a kernel 
module that exposes the right information efficiently ;-)

> Posix CPU isolator usage call introduces high CPU load
> ------------------------------------------------------
>
>                 Key: MESOS-2254
>                 URL: https://issues.apache.org/jira/browse/MESOS-2254
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we 
> have seen a very high CPU load (over 200%).
> Profiling one of the two threads that were consuming all of the CPU time 
> (the total CPU time was over 200%) gives:
> {code}
> Running Time  Self            Symbol Name
> 27133.0ms   47.8%     0.0             _pthread_body  0x1adb50
> 27133.0ms   47.8%     0.0              thread_start
> 27133.0ms   47.8%     0.0               _pthread_start
> 27133.0ms   47.8%     0.0                _pthread_body
> 27133.0ms   47.8%     0.0                 process::schedule(void*)
> 27133.0ms   47.8%     2.0                  process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms   47.8%     1.0                   process::ProcessBase::serve(process::Event const&)
> 27125.0ms   47.8%     0.0                    process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms   47.8%     0.0                     process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms   47.8%     0.0                      std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms   47.8%     0.0                       std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*&&)
> 27124.0ms   47.8%     1.0                        process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
> 27060.0ms   47.7%     1.0                         mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
> 27046.0ms   47.7%     2.0                          mesos::internal::usage(int, bool, bool)
> 27023.0ms   47.6%     2.0                           os::pstree(Option<int>)
> 26748.0ms   47.1%     23.0                           os::processes()
> 24809.0ms   43.7%     349.0                           os::process(int)
> 8199.0ms   14.4%      47.0                             os::sysctl::string() const
> 7562.0ms   13.3%      7562.0                            __sysctl
> {code}
> We could see that usage() in usage/usage.cpp is causing this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
