[jira] [Created] (MESOS-3036) Slave timeout options have drastically different behaviors. Interaction unclear/non-obvious from documentation.

2015-07-13 Thread Daniel Nugent (JIRA)
Daniel Nugent created MESOS-3036:


 Summary: Slave timeout options have drastically different 
behaviors. Interaction unclear/non-obvious from documentation.
 Key: MESOS-3036
 URL: https://issues.apache.org/jira/browse/MESOS-3036
 Project: Mesos
  Issue Type: Documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent


The documentation for the Slave's recovery_timeout option would seem to
indicate that a recovery taking up to 15 minutes is possible. However, because
of the Master's slave_ping_timeout and max_slave_ping_timeouts options, any
recovery that takes longer than 75 seconds (plus, apparently, some time between
the moment the slave dies and the moment the Master registers the slave as
disconnected; I can't find a flag that determines this) will result in all
tasks running under the slave being stopped, and then restarted if the Slave
does recover, because the master has already moved the tasks into the TASK_LOST
state.
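
For illustration, here is the arithmetic with the defaults spelled out (a rough
sketch, not Mesos code; 15 seconds and 5 are the documented defaults for
slave_ping_timeout and max_slave_ping_timeouts, and 15 minutes for
recovery_timeout):

{code}
// Worked example of the flag interaction described above, using the assumed
// defaults: slave_ping_timeout=15secs, max_slave_ping_timeouts=5,
// recovery_timeout=15mins.
#include <algorithm>
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;

  const seconds slavePingTimeout(15);   // master --slave_ping_timeout
  const int maxSlavePingTimeouts = 5;   // master --max_slave_ping_timeouts
  const minutes recoveryTimeout(15);    // slave --recovery_timeout

  // Window after which the master stops waiting for pings and marks the
  // slave's tasks TASK_LOST, regardless of recovery_timeout.
  const seconds masterWindow = slavePingTimeout * maxSlavePingTimeouts;
  const seconds recovery = duration_cast<seconds>(recoveryTimeout);

  std::cout << "master gives up after:     " << masterWindow.count() << "s\n"  // 75s
            << "slave recovery_timeout:    " << recovery.count() << "s\n"      // 900s
            << "effective recovery window: "
            << std::min(masterWindow, recovery).count() << "s\n";              // 75s
  return 0;
}
{code}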

The documentation should clearly state that tasks will be stopped upon Slave
recovery, even within the recovery_timeout period, if the ping timeout options
have already caused the master to shut down the slave.

Also, it would help to explain what the project intends the recovery_timeout
setting to actually be used for; I'm a little unclear on that point myself now.
Presumably it is a fudge factor that gives tasks time to restart on another
slave if the original slave is out of commission?





[jira] [Updated] (MESOS-2931) Add explanation of master bootstrap process to Operational Guide

2015-06-24 Thread Daniel Nugent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Nugent updated MESOS-2931:
-
Description: 
When Mesos starts up, masters come up in an empty bootstrap state. End users 
may find that they experience failures without an apparent cause. The 
documentation for the operational guide should lay out the requirements for the 
bootstrap process and its behavior.

See MESOS-2148 for an example of an issue filed when someone wasn't aware of 
the bootstrap process.

  was:
When Mesos starts up, masters come up in an empty bootstrap state. End users 
may find that they experience failures without an apparent cause. The 
documentation for the operational guide should lay out the requirements for the 
bootstrap process and its behavior.

See MESOS-2148 for an example of an issue filed when someone wasn't aware of 
the problem.


 Add explanation of master bootstrap process to Operational Guide
 

 Key: MESOS-2931
 URL: https://issues.apache.org/jira/browse/MESOS-2931
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent
Priority: Minor

 When Mesos starts up, masters come up in an empty bootstrap state. End users 
 may find that they experience failures without an apparent cause. The 
 documentation for the operational guide should lay out the requirements for 
 the bootstrap process and its behavior.
 See MESOS-2148 for an example of an issue filed when someone wasn't aware of 
 the bootstrap process.





[jira] [Created] (MESOS-2931) Add explanation of master bootstrap process to Operational Guide

2015-06-24 Thread Daniel Nugent (JIRA)
Daniel Nugent created MESOS-2931:


 Summary: Add explanation of master bootstrap process to 
Operational Guide
 Key: MESOS-2931
 URL: https://issues.apache.org/jira/browse/MESOS-2931
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent
Priority: Minor


When Mesos starts up, masters come up in an empty bootstrap state. End users 
may find that they experience failures without an apparent cause. The 
documentation for the operational guide should lay out the requirements for the 
bootstrap process and its behavior.

See MESOS-2148 for an example of an issue filed when someone wasn't aware of 
the problem.





[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-19 Thread Daniel Nugent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551127#comment-14551127
 ] 

Daniel Nugent commented on MESOS-2254:
--

[~idownes] In that case, do you know what the rate limiting is that [~nnielsen] 
referred to?

 Posix CPU isolator usage call introduce high cpu load
 -

 Key: MESOS-2254
 URL: https://issues.apache.org/jira/browse/MESOS-2254
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen

 With more than 20 executors running on a slave with the posix isolator, we 
 have seen a very high CPU load (over 200%).
 From profiling one thread (there were two threads consuming all the CPU time; 
 the total CPU time was over 200%):
 {code}
 Running Time    Self     Symbol Name
 27133.0ms   47.8%    0.0  _pthread_body  0x1adb50
 27133.0ms   47.8%    0.0   thread_start
 27133.0ms   47.8%    0.0    _pthread_start
 27133.0ms   47.8%    0.0     _pthread_body
 27133.0ms   47.8%    0.0      process::schedule(void*)
 27133.0ms   47.8%    2.0       process::ProcessManager::resume(process::ProcessBase*)
 27126.0ms   47.8%    1.0        process::ProcessBase::serve(process::Event const&)
 27125.0ms   47.8%    0.0         process::DispatchEvent::visit(process::EventVisitor*) const
 27125.0ms   47.8%    0.0          process::ProcessBase::visit(process::DispatchEvent const&)
 27125.0ms   47.8%    0.0           std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
 27124.0ms   47.8%    0.0            std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*)
 27124.0ms   47.8%    1.0             process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
 27060.0ms   47.7%    1.0              mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
 27046.0ms   47.7%    2.0               mesos::internal::usage(int, bool, bool)
 27023.0ms   47.6%    2.0                os::pstree(Option<int>)
 26748.0ms   47.1%   23.0                 os::processes()
 24809.0ms   43.7%  349.0                  os::process(int)
  8199.0ms   14.4%   47.0                   os::sysctl::string() const
  7562.0ms   13.3% 7562.0                    __sysctl
 {code}
 We could see that usage() in usage/usage.cpp is causing this.
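
For context on why this path is so costly: the profile shows each usage() call
descending through os::pstree() -> os::processes() -> os::process() into
sysctl, i.e. a scan of the entire host process table per container per poll. A
rough sketch of that cost pattern (hypothetical code, not the actual Mesos
sources; allProcesses() and usage() below are stand-ins):

{code}
#include <iostream>
#include <sys/types.h>
#include <vector>

struct ProcessInfo { pid_t pid; pid_t ppid; double cpuSeconds; };

// Stand-in for os::processes(): walks the entire host process table via
// sysctl/procfs -- the expensive step visible in the profile above.
std::vector<ProcessInfo> allProcesses() {
  return {};  // imagine every process on the host here
}

// Stand-in for one per-container usage() call: finds the executor's process
// subtree by re-reading the full table and sums its CPU time.
double usage(pid_t executorPid) {
  double total = 0.0;
  for (const ProcessInfo& p : allProcesses()) {  // full scan on every call
    if (p.pid == executorPid || p.ppid == executorPid) {
      total += p.cpuSeconds;
    }
  }
  return total;
}

int main() {
  std::vector<pid_t> executors = {};  // 20+ executor PIDs on a busy slave
  // The monitor polls every executor each interval, so the host-wide scan
  // repeats once per executor per interval: cost ~ O(executors * processes).
  for (pid_t pid : executors) {
    std::cout << pid << ": " << usage(pid) << " cpu-seconds\n";
  }
  return 0;
}
{code}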





[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-14 Thread Daniel Nugent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544484#comment-14544484
 ] 

Daniel Nugent commented on MESOS-2254:
--

[~nnielsen] Neat. Which version is that changed in? And I presume you mean 
--perf_interval as the way to rate limit?

 Posix CPU isolator usage call introduce high cpu load
 -

 Key: MESOS-2254
 URL: https://issues.apache.org/jira/browse/MESOS-2254
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen






[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-14 Thread Daniel Nugent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544367#comment-14544367
 ] 

Daniel Nugent commented on MESOS-2254:
--

Could the frequency of the invocation of the usage function be reduced somehow 
for the time being to mitigate the issue?
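
One possible stopgap, purely hypothetical (this is not an existing Mesos flag
or API), would be to cache the last sample per container and only recompute
after a minimum interval, so nothing can force the full process-table walk more
often than that:

{code}
#include <chrono>
#include <map>
#include <string>

// Placeholder for the statistics returned by the expensive collection.
struct ResourceStatistics { double cpuSeconds = 0.0; };

// Stand-in for the costly per-container collection (the full process-table
// walk shown in the profile).
ResourceStatistics collectUsage(const std::string& containerId) {
  (void)containerId;
  return ResourceStatistics{};
}

// Hypothetical wrapper: serve a cached sample when the previous one is recent
// enough, recompute otherwise.
class CachedUsage {
public:
  explicit CachedUsage(std::chrono::seconds minInterval)
    : minInterval_(minInterval) {}

  ResourceStatistics usage(const std::string& containerId) {
    const auto now = std::chrono::steady_clock::now();
    auto it = cache_.find(containerId);
    if (it != cache_.end() && now - it->second.when < minInterval_) {
      return it->second.stats;  // fresh enough: skip the expensive path
    }
    Entry entry{collectUsage(containerId), now};
    cache_[containerId] = entry;
    return entry.stats;
  }

private:
  struct Entry {
    ResourceStatistics stats;
    std::chrono::steady_clock::time_point when;
  };

  std::chrono::seconds minInterval_;
  std::map<std::string, Entry> cache_;
};

int main() {
  CachedUsage cached(std::chrono::seconds(10));
  cached.usage("executor-1");  // first call does the real collection
  cached.usage("executor-1");  // within 10 seconds: served from the cache
  return 0;
}
{code}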

 Posix CPU isolator usage call introduce high cpu load
 -

 Key: MESOS-2254
 URL: https://issues.apache.org/jira/browse/MESOS-2254
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen



