[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-26 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559411#comment-14559411
 ] 

Timothy Chen commented on MESOS-2254:
-

I just opened MESOS-2767 for the DockerContainerizer, which should use cgroups
instead.

> Posix CPU isolator usage call introduce high cpu load
> -
>
> Key: MESOS-2254
> URL: https://issues.apache.org/jira/browse/MESOS-2254
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we 
> have seen a very high cpu load (over 200%).
> From profiling one of the threads (there were two, together taking up all
> the CPU time; the total CPU time was over 200%):
> {code}
> Running Time   Self     Symbol Name
> 27133.0ms   47.8%     0.0  _pthread_body  0x1adb50
> 27133.0ms   47.8%     0.0   thread_start
> 27133.0ms   47.8%     0.0    _pthread_start
> 27133.0ms   47.8%     0.0     _pthread_body
> 27133.0ms   47.8%     0.0      process::schedule(void*)
> 27133.0ms   47.8%     2.0       process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms   47.8%     1.0        process::ProcessBase::serve(process::Event const&)
> 27125.0ms   47.8%     0.0         process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms   47.8%     0.0          process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms   47.8%     0.0           std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms   47.8%     0.0            std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*&&)
> 27124.0ms   47.8%     1.0             process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
> 27060.0ms   47.7%     1.0              mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
> 27046.0ms   47.7%     2.0               mesos::internal::usage(int, bool, bool)
> 27023.0ms   47.6%     2.0                os::pstree(Option<pid_t>)
> 26748.0ms   47.1%    23.0                 os::processes()
> 24809.0ms   43.7%   349.0                  os::process(int)
>  8199.0ms   14.4%    47.0                   os::sysctl::string() const
>  7562.0ms   13.3%  7562.0                    __sysctl
> {code}
> We could see that usage() in usage/usage.cpp is causing this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-26 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559364#comment-14559364
 ] 

Ian Downes commented on MESOS-2254:
---

I'm against caching of results: if you request usage() you should get the
current usage; there should not be any ambiguity. Expensive queries should be
rate-limited instead. I'm also reluctant to change the interface to support
batching, because it's a band-aid for a non-production edge case (people should
use cgroups).

I'll expand on what Joris noted earlier:

From looking at the code, the query is expensive because it walks the entire
process table, constructing full Process objects, to obtain the process trees
in a container. This would indeed lead to a lot of system calls. Once the tree
is obtained it sums the required metric. Only a small subset of those system
calls are ultimately useful.

I suggest we implement a lighter-weight way to determine the processes
belonging to a container, using only getppid() and getsid() to determine the
parent-child relationships. Once the trees are constructed, construct full
Process objects for only those pids and then sum the desired metric.



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-25 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558507#comment-14558507
 ] 

Joris Van Remoortere commented on MESOS-2254:
-

[~marco-mesos] The endpoint is already rate-limited using a
{{process::RateLimiter}} that permits 2 calls per second. The main concern is
that even a single call to this API is expensive: N executors each scan all P
processes on the system, i.e. O(N*P) work per call.

There are opportunities to cache; however, caching introduces decisions about
when to invalidate the cache (on a time-based interval? after some number of
requests?) as well as the possibility of stale data. Since the intent of this
call is to get a current snapshot of usage data, I would prefer to avoid
introducing explicit caching, and instead pass along enough "information" to
allow re-use of the data within the same "call" (batching).

In this particular case, the reason we are doing the (N*P) work is that the
containerizer calls the usage function on the isolator once for each container.
In my opinion this is the cleanest place to "cache", although I would prefer to
call it "batch": the isolator loses the "information" that we are asking for a
snapshot of all containers; instead it thinks we are asking for N independent
snapshots.

My proposal would be to modify the interface to allow a batched version of the 
call, so that the usage call can re-use any data it collects. I think this is 
the cleanest way to control when we recompute / invalidate the data.

There is also the opportunity to reduce the full stats parsing to just the
subset of pids that we are interested in. That alone would provide a ~30x
improvement.

P.S. This problem could also be avoided entirely by calling into a kernel
module that exposes the right information efficiently ;-)



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-22 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557072#comment-14557072
 ] 

Marco Massenzio commented on MESOS-2254:


Is there any caching strategy that we could employ? That would seem a
relatively simple (and extensible) approach.
We could also "throttle" the execution of the system calls: no matter how many
times per second the stats endpoint gets hit, it would only "refresh" its state
at most every x seconds (configurable, ideally, via a
{{--stat-refresh-interval}} flag or some such).

In any event, this is something that we should consider fixing, so I have moved
it to the "Accepted" state.



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-22 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557020#comment-14557020
 ] 

Joris Van Remoortere commented on MESOS-2254:
-

There are 2 issues here:
1) For a given usage() call, we grab all running processes and parse the entire
/proc/[pid]/stat file for each. This is very CPU intensive. We parse it because
we want to compute the pid hierarchy for a subtree, and so we need to extract
the parent pid. When the OS has many running processes, we end up doing a lot
of extra parsing of process stat files that we don't care about.
2) Every update() call in the containerizer calls update() on the isolator for
each container. This means that if we are running N executors, we end up doing
the work in 1) N times.

In short, we are parsing /proc/[pid]/stat for all running processes N times
every M seconds, where M is the monitoring interval.

There are a couple of approaches to fixing this. One is to parse only what is
necessary to build the subtree, and then parse the full information only for
the processes we actually care about.



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-22 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556418#comment-14556418
 ] 

Niklas Quarfot Nielsen commented on MESOS-2254:
---

[~marco-mesos] It is still an issue if you hit the statistics endpoint too
often, but it won't cause the high load if folks don't have external monitoring
tooling. I chatted with [~jvanremoortere] about this and we can probably
introduce a higher-level 'get cpu time for a process tree' abstraction to
replace the current 'grab everything and sort afterwards' approach, which seems
to cause high load through many calls to sysctl().



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-22 Thread Marco Massenzio (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556276#comment-14556276
 ] 

Marco Massenzio commented on MESOS-2254:


Doing some quick backlog grooming - it's not clear whether this has been
duplicated and/or resolved, or whether we should still do work on it.

[~nnielsen] can you please update the status?

Thanks!



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-19 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551142#comment-14551142
 ] 

Ian Downes commented on MESOS-2254:
---

I presume he's referring to the slave flag {{--resource_monitoring_interval}} 
which currently defaults to {{RESOURCE_MONITORING_INTERVAL = Seconds(1)}} but 
which [~nnielsen] has marked as soon to be deprecated.
{noformat}
  // TODO(nnielsen): Deprecate resource_monitoring_interval flag after
  // Mesos 0.23.0.
  Duration resource_monitoring_interval;
{noformat}
In the meantime, if this is causing performance issues, you could set 
{{--resource_monitoring_interval}} to something longer than the default.
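As a sketch of that workaround (the flag name is from the comment above; the master address and the chosen interval value are illustrative assumptions, not recommendations):

```shell
# Sketch: lengthen the resource monitoring interval so the per-container
# usage() polling runs less often. Master URL and interval are examples.
mesos-slave \
  --master=zk://localhost:2181/mesos \
  --resource_monitoring_interval=15secs
```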



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-19 Thread Daniel Nugent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551127#comment-14551127
 ] 

Daniel Nugent commented on MESOS-2254:
--

[~idownes] In that case, do you know what the rate limiting is that [~nnielsen] 
referred to?



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-19 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551118#comment-14551118
 ] 

Ian Downes commented on MESOS-2254:
---

[~nugend] No, --perf_interval is just for the perf isolator which uses a 
perf_event cgroup to efficiently run perf against a container. Unrelated to 
this.



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-14 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544499#comment-14544499
 ] 

Niklas Quarfot Nielsen commented on MESOS-2254:
---

Should be in 0.23.0 :)



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-14 Thread Daniel Nugent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544484#comment-14544484
 ] 

Daniel Nugent commented on MESOS-2254:
--

[~nnielsen] Neat. Which version is that changed in? And I presume you mean 
--perf_interval as the way to rate limit?



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-14 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544481#comment-14544481
 ] 

Niklas Quarfot Nielsen commented on MESOS-2254:
---

The collect loop (defaulting to 1 second) has recently been removed, but the 
statistics endpoint will trigger a collection (which can be rate limited).
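For context, a typical probe of that slave statistics endpoint looks like the following (the host is an assumption; 5051 is the usual slave port):

```shell
# Query the slave's per-container statistics endpoint. Each request
# triggers a usage() collection, which is why rate limiting matters.
curl -s http://localhost:5051/monitor/statistics.json
```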



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-14 Thread Daniel Nugent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544367#comment-14544367
 ] 

Daniel Nugent commented on MESOS-2254:
--

Could the frequency of the invocation of the usage function be reduced somehow 
for the time being to mitigate the issue?



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-04-24 Thread Harry Metske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511373#comment-14511373
 ] 

Harry Metske commented on MESOS-2254:
-

I noticed the same issue. Running only about 10 tasks on a slave (version 
0.21.0), the mesos-slave eats about 25% of an i5 core.



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-02-19 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328205#comment-14328205
 ] 

Cosmin Lehene commented on MESOS-2254:
--

[~nnielsen] What version is this happening on?



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-02-17 Thread Dario Rexin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324331#comment-14324331
 ] 

Dario Rexin commented on MESOS-2254:


Did you also try it with the cgroups one?



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-02-17 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324319#comment-14324319
 ] 

Niklas Quarfot Nielsen commented on MESOS-2254:
---

Correct :)



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-02-17 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324209#comment-14324209
 ] 

Timothy St. Clair commented on MESOS-2254:
--

This is just the POSIX isolator... right?



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-02-03 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304137#comment-14304137
 ] 

Niklas Quarfot Nielsen commented on MESOS-2254:
---

Reproduced on Linux too.

Joris and I suspected that it boils down to an excessive number of sysctl calls 
or reads from /proc. From the traces I have from Mac OS X runs, it definitely 
looks like those calls are introducing the load.
We could come up with a lighter-weight usage(), or separate user_time(pid_t), 
system_time(pid_t) and rss(pid_t) calls in stout which can grab the data once 
and aggregate it.



[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-01-23 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289873#comment-14289873
 ] 

Timothy Chen commented on MESOS-2254:
-

Can you also try on Linux to see if we get the same CPU perf? It's specifically 
running the OS X sysctl.
