[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559411#comment-14559411 ] Timothy Chen commented on MESOS-2254:

I just opened MESOS-2767 for the DockerContainerizer, which should use cgroups instead.

> Posix CPU isolator usage call introduce high cpu load
> -----------------------------------------------------
>
>         Key: MESOS-2254
>         URL: https://issues.apache.org/jira/browse/MESOS-2254
>     Project: Mesos
>  Issue Type: Bug
>    Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we
> have seen a very high cpu load (over 200%).
> From profiling one thread (there were two, together taking up all the CPU
> time; the total CPU time was over 200%):
> {code}
> Running Time      Self     Symbol Name
> 27133.0ms  47.8%  0.0      _pthread_body 0x1adb50
> 27133.0ms  47.8%  0.0       thread_start
> 27133.0ms  47.8%  0.0        _pthread_start
> 27133.0ms  47.8%  0.0         _pthread_body
> 27133.0ms  47.8%  0.0          process::schedule(void*)
> 27133.0ms  47.8%  2.0           process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms  47.8%  1.0            process::ProcessBase::serve(process::Event const&)
> 27125.0ms  47.8%  0.0             process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms  47.8%  0.0              process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms  47.8%  0.0               std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms  47.8%  0.0                std::__1::__function::__func<process::dispatch<...>(...)::'lambda'(process::ProcessBase*), ...>::operator()(process::ProcessBase*&&)
> 27124.0ms  47.8%  1.0                 process::dispatch<mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<...> const&, process::Future<...> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
> 27060.0ms  47.7%  1.0                  mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
> 27046.0ms  47.7%  2.0                   mesos::internal::usage(int, bool, bool)
> 27023.0ms  47.6%  2.0                    os::pstree(Option<pid_t>)
> 26748.0ms  47.1%  23.0                    os::processes()
> 24809.0ms  43.7%  349.0                    os::process(int)
> 8199.0ms   14.4%  47.0                      os::sysctl::string() const
> 7562.0ms   13.3%  7562.0                      __sysctl
> {code}
> We could see that usage() in usage/usage.cpp is causing this.
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559364#comment-14559364 ] Ian Downes commented on MESOS-2254:

I'm against caching of results: if you request usage() you should get the current usage; expensive queries should be rate-limited, and there should not be any ambiguity. I'm also reluctant to change the interface to support batching because it's a band-aid for a non-production edge case (people should use cgroups).

I'll expand on what Joris noted earlier: from looking at the code, the query is expensive because it walks the entire process table, constructing full Process objects, to obtain the process trees in a container. This leads to a lot of system calls. Once a tree is obtained, we sum the required metric over it, so only a small subset of those system calls is ultimately useful.

I suggest we implement a lighter-weight way to determine the processes belonging to a container, where we only use getppid() and getsid() to determine the parent-child relationships. Once the trees are constructed, construct full Process objects for only those pids and then sum the desired metric.
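A minimal sketch of this idea, assuming a Linux-style /proc filesystem (the trace above is from an OS X host, where the same approach would go through sysctl instead); {{parentOf}}, {{childrenMap}}, and {{subtree}} are illustrative helper names, not Mesos/stout APIs:

{code}
#include <dirent.h>
#include <sys/types.h>

#include <cstdlib>
#include <fstream>
#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// Cheap parent lookup: parse only the ppid field of /proc/<pid>/stat
// instead of materializing a full Process object for every pid.
static pid_t parentOf(pid_t pid) {
  std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
  std::string line;
  if (!std::getline(in, line)) {
    return -1;
  }
  // The command name is parenthesized and may contain spaces; skip past it.
  std::istringstream rest(line.substr(line.rfind(')') + 2));
  char state = '?';
  pid_t ppid = -1;
  rest >> state >> ppid;  // fields after comm: state, ppid, pgrp, session, ...
  return ppid;
}

// One pass over /proc: record a parent -> children edge for every pid.
static std::multimap<pid_t, pid_t> childrenMap() {
  std::multimap<pid_t, pid_t> children;
  if (DIR* dir = opendir("/proc")) {
    while (dirent* entry = readdir(dir)) {
      char* end = nullptr;
      pid_t pid = static_cast<pid_t>(strtol(entry->d_name, &end, 10));
      if (end != entry->d_name && *end == '\0' && pid > 0) {
        children.insert({parentOf(pid), pid});
      }
    }
    closedir(dir);
  }
  return children;
}

// The container's process tree: BFS from its root pid. Full (expensive)
// per-process statistics would then be read for these pids only.
static std::vector<pid_t> subtree(pid_t root) {
  const std::multimap<pid_t, pid_t> children = childrenMap();
  std::vector<pid_t> pids;
  std::queue<pid_t> frontier;
  frontier.push(root);
  while (!frontier.empty()) {
    pid_t pid = frontier.front();
    frontier.pop();
    pids.push_back(pid);
    auto range = children.equal_range(pid);
    for (auto it = range.first; it != range.second; ++it) {
      frontier.push(it->second);
    }
  }
  return pids;
}
{code}

Only the pids returned by subtree() would then pay the full per-process query cost, i.e. the small subset of system calls that are ultimately useful.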
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558507#comment-14558507 ] Joris Van Remoortere commented on MESOS-2254:

[~marco-mesos] The endpoint is already rate-limited using a {{process::RateLimiter}} that permits 2 calls per second. The main concern is that even a single call to this API gets more expensive as N executors scan all P processes on the system (N*P work per call).

There are opportunities to cache; however, caching introduces decisions about when to clear the cache (do we do it on a time-based interval? after some number of requests?) as well as stale data. Since the intent of this call is to get a current snapshot of usage data, I would prefer to avoid introducing explicit caching, and instead pass along enough "information" to allow re-use of the data for the same "call" (batching).

In this particular case, the reason we are performing the (N*P) work is that the containerizer calls the usage function on the isolator for each container. In my opinion this is the cleanest place to "cache", although I would prefer to call it "batch". The isolator loses the "information" that we are asking for a snapshot of all containers; rather, it thinks we are asking for N independent snapshots.

My proposal would be to modify the interface to allow a batched version of the call, so that the usage call can re-use any data it collects. I think this is the cleanest way to control when we recompute / invalidate the data. There is also the opportunity to reduce the full stats parsing to just the subset of pids that we are interested in; this alone would provide a ~30x improvement.

P.S. this problem can also be completely avoided by calling into a kernel module that exposes the right information efficiently ;-)
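A sketch of what the batched interface could look like, with hypothetical names ({{batchedUsage}}, {{ProcessSnapshot}}, and the string {{ContainerID}} alias are illustrations, not the actual isolator API): the expensive process-table scan is performed once and shared by all N containers, turning N*P into roughly P plus the container subtree sizes.

{code}
#include <sys/types.h>

#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins: the real interface deals in mesos::ContainerID
// and mesos::ResourceStatistics.
using ContainerID = std::string;

// One snapshot of the process table, built once per batch.
struct ProcessSnapshot {
  std::multimap<pid_t, pid_t> children;  // parent pid -> child pids
  std::map<pid_t, double> cpuSecs;       // pid -> user+system CPU seconds
};

// Stub for illustration: a real implementation would scan /proc (or walk
// sysctl output) exactly once to fill in both maps.
ProcessSnapshot snapshotProcesses() {
  return ProcessSnapshot();
}

// Sum CPU seconds over the subtree rooted at `root`, using only the
// already-collected snapshot (no further system calls).
double subtreeCpu(const ProcessSnapshot& snapshot, pid_t root) {
  auto it = snapshot.cpuSecs.find(root);
  double total = (it == snapshot.cpuSecs.end()) ? 0.0 : it->second;
  auto range = snapshot.children.equal_range(root);
  for (auto child = range.first; child != range.second; ++child) {
    total += subtreeCpu(snapshot, child->second);
  }
  return total;
}

// Hypothetical batched entry point: the containerizer passes all container
// root pids at once, so the O(P) scan happens once per batch instead of
// once per container.
std::map<ContainerID, double> batchedUsage(
    const std::map<ContainerID, pid_t>& containerRoots) {
  const ProcessSnapshot snapshot = snapshotProcesses();
  std::map<ContainerID, double> usage;
  for (const auto& entry : containerRoots) {
    usage[entry.first] = subtreeCpu(snapshot, entry.second);
  }
  return usage;
}
{code}

The batching preserves the "current snapshot" semantics: the data is never older than the call that requested it, which sidesteps the cache-invalidation questions above.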
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557072#comment-14557072 ] Marco Massenzio commented on MESOS-2254:

Is there any caching strategy that we could employ? That would seem a relatively simple (and extensible) approach. We could also "throttle" the execution of the system calls: no matter how many times per second the stat endpoint gets hit, it would only "refresh" its state no more often than every x seconds (configurable, ideally, via a {{--stat-refresh-interval}} flag or some such).

In any event, this is something that we should consider fixing, so I have moved it to the accepted state.
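A minimal sketch of the time-based throttle Marco suggests ({{Throttled}} and its {{collect}} callback are hypothetical illustrations, not stout/libprocess types, and {{--stat-refresh-interval}} is his proposed flag): callers always get the last snapshot, and the expensive collection runs at most once per interval.

{code}
#include <chrono>
#include <mutex>

// Generic time-throttled getter: recompute at most once per `interval`,
// otherwise serve the cached value. T is the (expensive) snapshot type.
template <typename T>
class Throttled {
 public:
  Throttled(std::chrono::milliseconds interval, T (*collect)())
    : interval_(interval), collect_(collect) {}

  T get() {
    std::lock_guard<std::mutex> lock(mutex_);
    auto now = std::chrono::steady_clock::now();
    if (!valid_ || now - last_ >= interval_) {
      cached_ = collect_();  // the expensive, system-call-heavy refresh
      last_ = now;
      valid_ = true;
    }
    return cached_;  // possibly up to `interval` stale, by design
  }

 private:
  const std::chrono::milliseconds interval_;
  T (*collect_)();
  std::mutex mutex_;
  std::chrono::steady_clock::time_point last_;
  T cached_{};
  bool valid_ = false;
};
{code}

Hypothetical wiring: {{Throttled<Snapshot> stats(std::chrono::seconds(5), &collectSnapshot);}} with each endpoint hit calling {{stats.get()}}. Note this trades freshness for load, which is exactly the ambiguity Ian objects to above.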
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557020#comment-14557020 ] Joris Van Remoortere commented on MESOS-2254:

There are 2 issues here:

1) For a given usage call, we grab all running processes and parse the entire /proc/[pid]/stat file for each one. This is very CPU intensive. We parse it because we want to compute the pid hierarchy for a subtree, and so we need to extract the parent pid. When the OS has many running processes, we end up doing a lot of extra parsing of process stat files that we don't end up caring about.

2) Every update cycle in the containerizer calls usage() on the isolator for each container. This means that if we are running N executors, we end up doing the work in 1) N times: we parse /proc/[pid]/stat for all running processes N times every M seconds, where M is the monitoring interval.

There are a couple of approaches to fixing this. One is to parse only what is necessary to build the subtree, and then parse the full information for just the processes we actually care about.
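For reference, on Linux /proc/[pid]/stat packs everything onto one line, and only a couple of early fields are needed to build the hierarchy (the pid values below are made up for illustration; the profile above is from an OS X host, where os::process() goes through sysctl instead):

{noformat}
$ cat /proc/4242/stat
4242 (mesos-executor) S 4200 4242 4200 ...
     fields: (1) pid, (2) comm, (3) state, (4) ppid, (5) pgrp, (6) session,
     followed by roughly 45 more, including (14) utime and (15) stime, which
     are only needed for the pids that end up inside a container's subtree.
{noformat}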
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556418#comment-14556418 ] Niklas Quarfot Nielsen commented on MESOS-2254:

[~marco-mesos] It is still an issue if you hit the statistics endpoint too often, but it won't cause the high load if folks don't have external monitoring tooling. I chatted with [~jvanremoortere] about this, and we can probably introduce a higher-level 'get cpu time for a process tree' abstraction to replace the current 'grab everything and sort afterwards' approach, which seems to cause high load through many calls to sysctl().
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556276#comment-14556276 ] Marco Massenzio commented on MESOS-2254:

Doing some quick backlog grooming: it's not clear whether this is duplicated and/or resolved, or whether we should do work on this. [~nnielsen] can you please update the status? Thanks!
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551142#comment-14551142 ] Ian Downes commented on MESOS-2254:

I presume he's referring to the slave flag {{--resource_monitoring_interval}}, which currently defaults to {{RESOURCE_MONITORING_INTERVAL = Seconds(1)}} but which [~nnielsen] has marked as soon to be deprecated.

{noformat}
// TODO(nnielsen): Deprecate resource_monitoring_interval flag after
// Mesos 0.23.0.
Duration resource_monitoring_interval;
{noformat}

In the meantime, if this is causing performance issues, you could set {{--resource_monitoring_interval}} to something longer than the default.
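For example (a hypothetical invocation; the master address and interval value are illustrative):

{noformat}
mesos-slave --master=master.example.com:5050 --resource_monitoring_interval=30secs
{noformat}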
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551127#comment-14551127 ] Daniel Nugent commented on MESOS-2254:

[~idownes] In that case, do you know what the rate limiting is that [~nnielsen] referred to?
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551118#comment-14551118 ] Ian Downes commented on MESOS-2254:

[~nugend] No, {{--perf_interval}} is just for the perf isolator, which uses a perf_event cgroup to efficiently run perf against a container. It is unrelated to this issue.
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544499#comment-14544499 ] Niklas Quarfot Nielsen commented on MESOS-2254:

Should be in 0.23.0 :)
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544484#comment-14544484 ] Daniel Nugent commented on MESOS-2254:

[~nnielsen] Neat. Which version did that change land in? And I presume you mean {{--perf_interval}} as the way to rate limit?
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544481#comment-14544481 ] Niklas Quarfot Nielsen commented on MESOS-2254:

The collect loop (defaulting to 1 second) has recently been removed, but the statistics endpoint will still trigger a collection (which can be rate limited).
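For reference, a collection is triggered by querying the slave's monitoring endpoint, e.g. (host and port are illustrative; the endpoint path matches the slave monitoring endpoint of this era, to the best of our knowledge):

{noformat}
curl http://slave.example.com:5051/monitor/statistics.json
{noformat}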
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544367#comment-14544367 ] Daniel Nugent commented on MESOS-2254:

Could the frequency of the invocation of the usage function be reduced somehow for the time being to mitigate the issue?
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511373#comment-14511373 ] Harry Metske commented on MESOS-2254:

I noticed the same issue: running with only about 10 tasks on a slave (version 0.21.0), the mesos-slave eats about 25% of an i5 core.
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328205#comment-14328205 ] Cosmin Lehene commented on MESOS-2254: -- [~nnielsen] What version is this happening on?
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324331#comment-14324331 ] Dario Rexin commented on MESOS-2254: Did you also try it with the cgroups one?
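For reference, on Linux the cgroups-based isolators can be selected with the slave's --isolation flag; a sketch of the invocation (flag values per the Mesos configuration docs of this era, with placeholders left as-is; verify against the deployed version):

{code}
mesos-slave --master=zk://<master>/mesos \
            --isolation="cgroups/cpu,cgroups/mem"
{code}

With cgroups, usage statistics come from the container's cgroup accounting files rather than from walking every process on the host, which sidesteps the os::processes()/os::process(int) cost dominating the trace above.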
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324319#comment-14324319 ] Niklas Quarfot Nielsen commented on MESOS-2254: --- Correct :)
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324209#comment-14324209 ] Timothy St. Clair commented on MESOS-2254: -- This is just the POSIX isolator... right?
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304137#comment-14304137 ] Niklas Quarfot Nielsen commented on MESOS-2254: --- Reproduced on Linux too. Joris and I suspected that it boils down to an excessive number of sysctl calls or reads from /proc. From the traces I have from Mac OS X runs, it definitely looks like those calls are introducing the load. We could come up with a lighter-weight usage(), or separate user_time(pid_t), system_time(pid_t) and rss(pid_t) calls in stout, which can grab the data once and aggregate it.
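To make that last suggestion concrete, here is a sketch of what a single-read helper could look like on Linux (hypothetical; procStats and ProcStats are illustrative names, not stout's actual API): one pass over /proc/<pid>/stat yields utime, stime and rss together, instead of the repeated per-field lookups that usage() currently triggers for every process.

{code}
#include <sys/types.h>  // pid_t

#include <fstream>
#include <sstream>
#include <string>

struct ProcStats
{
  unsigned long utime = 0;  // User-mode jiffies (field 14 of proc(5)).
  unsigned long stime = 0;  // Kernel-mode jiffies (field 15).
  long rss = 0;             // Resident set size in pages (field 24).
};

// Reads /proc/<pid>/stat once and extracts the three fields the isolator
// needs. Returns false if the file is missing or malformed.
bool procStats(pid_t pid, ProcStats* stats)
{
  std::ifstream in("/proc/" + std::to_string(pid) + "/stat");

  std::string line;
  if (!std::getline(in, line)) {
    return false;
  }

  // The comm field (2) may itself contain spaces and parentheses, so skip
  // past the last ')' before tokenizing the remaining space-separated fields.
  size_t offset = line.rfind(')');
  if (offset == std::string::npos) {
    return false;
  }
  std::istringstream fields(line.substr(offset + 2));

  std::string token;
  for (int field = 3; field <= 24; field++) {
    if (!(fields >> token)) {
      return false;
    }
    if (field == 14) stats->utime = std::stoul(token);
    if (field == 15) stats->stime = std::stoul(token);
    if (field == 24) stats->rss = std::stol(token);
  }

  return true;
}
{code}

Per-container aggregation would then sum these over the executor's process tree in one pass, rather than issuing separate queries per statistic.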
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289873#comment-14289873 ] Timothy Chen commented on MESOS-2254: - Can you also try it on Linux to see if we get the same CPU perf? The trace above is specifically exercising the OS X sysctl path.