[jira] [Created] (MESOS-3036) Slave timeout options have drastically different behaviors. Interaction unclear/non-obvious from documentation.
Daniel Nugent created MESOS-3036:
---------------------------------

Summary: Slave timeout options have drastically different behaviors. Interaction unclear/non-obvious from documentation.
Key: MESOS-3036
URL: https://issues.apache.org/jira/browse/MESOS-3036
Project: Mesos
Issue Type: Documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent

The documentation for the Slave's recovery_timeout option would seem to indicate that a recovery of up to 15 minutes is possible. However, because of the Master's slave_ping_timeout and max_slave_ping_timeouts options, any recovery that takes longer than 75 seconds (plus, apparently, some time between when the slave dies and when the Master registers the slave as disconnected; I can't find a flag that determines this) will result in all tasks running under the slave being stopped and then restarted, even if the Slave can recover, because the master has moved the tasks into the TASK_LOST state.

The documentation should clearly state that tasks will be stopped upon Slave recovery, even within the recovery_timeout period, if the ping_timeout options have caused the master to shut down the slave.

Also, maybe explain what the project intends the recovery_timeout setting to actually be used for? I'm a little unclear on that point now myself. Presumably some fudge factor to allow tasks time to restart on another slave if the slave is out of commission?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
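The interaction the report describes comes down to simple arithmetic on the flag values. A minimal sketch, assuming the Mesos defaults of slave_ping_timeout=15secs, max_slave_ping_timeouts=5, and recovery_timeout=15mins (the 75-second window is just the product of the first two):

```python
# Sketch of the timeout interaction described above, assuming the
# default flag values: slave_ping_timeout=15s, max_slave_ping_timeouts=5,
# recovery_timeout=15min.
slave_ping_timeout = 15.0      # master: seconds between health-check pings
max_slave_ping_timeouts = 5    # master: missed pings tolerated
recovery_timeout = 15 * 60.0   # slave: its own recovery window, in seconds

# Window after which the master treats the slave as gone and moves its
# tasks to TASK_LOST, independently of the slave's recovery_timeout:
master_window = slave_ping_timeout * max_slave_ping_timeouts
print(master_window)  # 75.0

# Any recovery longer than the master's window loses the tasks, even
# though it is well inside the slave's own 15-minute recovery_timeout:
assert master_window < recovery_timeout
```

So with defaults, only roughly the first 75 seconds of the advertised 15-minute recovery window are actually safe, which is the non-obvious interaction the documentation should call out.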
[jira] [Updated] (MESOS-2931) Add explanation of master bootstrap process to Operational Guide
[ https://issues.apache.org/jira/browse/MESOS-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Nugent updated MESOS-2931:
---------------------------------
Description:
When Mesos starts up, masters come up in an empty bootstrap state. End users may find that they experience failures without an apparent cause. The documentation for the operational guide should lay out the requirements for the bootstrap process and its behavior.

See MESOS-2148 for an example of an issue filed when someone wasn't aware of the bootstrap process.

was:
When Mesos starts up, masters come up in an empty bootstrap state. End users may find that they experience failures without an apparent cause. The documentation for the operational guide should lay out the requirements for the bootstrap process and its behavior.

See MESOS-2148 for an example of an issue filed when someone wasn't aware of the problem.

Summary: Add explanation of master bootstrap process to Operational Guide
Key: MESOS-2931
URL: https://issues.apache.org/jira/browse/MESOS-2931
Project: Mesos
Issue Type: Documentation
Components: documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent
Priority: Minor
[jira] [Created] (MESOS-2931) Add explanation of master bootstrap process to Operational Guide
Daniel Nugent created MESOS-2931:
---------------------------------

Summary: Add explanation of master bootstrap process to Operational Guide
Key: MESOS-2931
URL: https://issues.apache.org/jira/browse/MESOS-2931
Project: Mesos
Issue Type: Documentation
Components: documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent
Priority: Minor

When Mesos starts up, masters come up in an empty bootstrap state. End users may find that they experience failures without an apparent cause. The documentation for the operational guide should lay out the requirements for the bootstrap process and its behavior.

See MESOS-2148 for an example of an issue filed when someone wasn't aware of the problem.
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551127#comment-14551127 ]

Daniel Nugent commented on MESOS-2254:
--------------------------------------

[~idownes] In that case, do you know what the rate limiting is that [~nnielsen] referred to?

Posix CPU isolator usage call introduce high cpu load

Key: MESOS-2254
URL: https://issues.apache.org/jira/browse/MESOS-2254
Project: Mesos
Issue Type: Bug
Reporter: Niklas Quarfot Nielsen

With more than 20 executors running on a slave with the posix isolator, we have seen a very high CPU load (over 200%). Profiling one of the two threads that were taking up all the CPU time (the total CPU time was over 200%) gave:

{code}
Running Time    Self    Symbol Name
27133.0ms 47.8%    0.0  _pthread_body  0x1adb50
27133.0ms 47.8%    0.0   thread_start
27133.0ms 47.8%    0.0    _pthread_start
27133.0ms 47.8%    0.0     _pthread_body
27133.0ms 47.8%    0.0      process::schedule(void*)
27133.0ms 47.8%    2.0       process::ProcessManager::resume(process::ProcessBase*)
27126.0ms 47.8%    1.0        process::ProcessBase::serve(process::Event const&)
27125.0ms 47.8%    0.0         process::DispatchEvent::visit(process::EventVisitor*) const
27125.0ms 47.8%    0.0          process::ProcessBase::visit(process::DispatchEvent const&)
27125.0ms 47.8%    0.0           std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
27124.0ms 47.8%    0.0            std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*)
27124.0ms 47.8%    1.0             process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
27060.0ms 47.7%    1.0              mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
27046.0ms 47.7%    2.0               mesos::internal::usage(int, bool, bool)
27023.0ms 47.6%    2.0                os::pstree(Option<int>)
26748.0ms 47.1%   23.0                 os::processes()
24809.0ms 43.7%  349.0                  os::process(int)
 8199.0ms 14.4%   47.0                   os::sysctl::string() const
 7562.0ms 13.3% 7562.0                    __sysctl
{code}

We could see that usage() in usage/usage.cpp is causing this.
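The hot path in the trace bottoms out in os::processes(), which walks the entire host process table (one sysctl per process) on every usage() call, and usage() is invoked once per container. A rough cost model of why this blows up with many executors; this is a hypothetical illustration of the scaling, not Mesos code, and the helper name and figures are invented:

```python
# Hypothetical cost model for the trace above, not Mesos code.
# Each usage() call runs os::pstree -> os::processes(), which issues
# roughly one sysctl per process on the host. With N executors all
# polled each interval, per-interval work scales as N * total_processes.
def scans_per_interval(num_executors, total_processes):
    # One full process-table walk per executor's usage() call.
    return num_executors * total_processes

# e.g. 20 executors (as in the report) on a host with 500 processes:
print(scans_per_interval(20, 500))  # 10000
```

This is why the load grows multiplicatively rather than linearly as executors are added: every additional executor pays the full process-table scan again.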
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544484#comment-14544484 ]

Daniel Nugent commented on MESOS-2254:
--------------------------------------

[~nnielsen] Neat. Which version is that changed in? And I presume you mean --perf_interval as the way to rate limit?

Posix CPU isolator usage call introduce high cpu load

Key: MESOS-2254
URL: https://issues.apache.org/jira/browse/MESOS-2254
Project: Mesos
Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
[ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544367#comment-14544367 ]

Daniel Nugent commented on MESOS-2254:
--------------------------------------

Could the frequency of the invocation of the usage function be reduced somehow for the time being to mitigate the issue?

Posix CPU isolator usage call introduce high cpu load

Key: MESOS-2254
URL: https://issues.apache.org/jira/browse/MESOS-2254
Project: Mesos
Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
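One general way to realize the mitigation asked about here is to cache the result of the expensive poll and refuse to recompute it more often than some minimum interval. A minimal sketch of that idea; this is a hypothetical illustration, not the Mesos API, and the class and parameter names are invented:

```python
import time

# Hypothetical sketch of rate-limiting an expensive usage() poll by
# caching its result for a minimum interval. Not actual Mesos code.
class ThrottledUsage:
    def __init__(self, expensive_usage, min_interval=5.0, clock=time.monotonic):
        self._usage = expensive_usage      # the costly collection function
        self._min_interval = min_interval  # seconds between real polls
        self._clock = clock
        self._last_time = None
        self._last_value = None

    def __call__(self):
        now = self._clock()
        # Recompute only when the cached value is older than min_interval;
        # otherwise return the stale-but-cheap cached statistics.
        if self._last_time is None or now - self._last_time >= self._min_interval:
            self._last_value = self._usage()
            self._last_time = now
        return self._last_value

# Demo with a controllable clock so the behaviour is deterministic:
t = [0.0]
calls = []
poll = ThrottledUsage(lambda: calls.append(1) or len(calls),
                      min_interval=5.0, clock=lambda: t[0])
poll(); poll()         # second call within 5s returns the cached value
t[0] = 6.0
poll()                 # interval elapsed, so the poll runs again
print(len(calls))      # 2
```

The trade-off is staleness: resource statistics can lag by up to min_interval, which is usually acceptable for monitoring but matters if anything enforces limits off these numbers.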