[jira] [Updated] (MESOS-2759) Could not create MesosContainerizer: Could not create isolator network/port_mapping: Routing library check failed: Capability ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE is not available

2015-05-25 Thread Anuj Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuj Gupta updated MESOS-2759:
--
Target Version/s: 0.22.1, 0.22.0  (was: 0.22.0, 0.22.1)

 Could not create MesosContainerizer: Could not create isolator 
 network/port_mapping: Routing library check failed: Capability 
 ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE is not available
 -

 Key: MESOS-2759
 URL: https://issues.apache.org/jira/browse/MESOS-2759
 Project: Mesos
  Issue Type: Bug
Reporter: Anuj Gupta
Priority: Critical

 [root@localhost ~]# mesos-slave --master=127.0.0.1:5050 
 --log_dir=/var/log/mesos --work_dir=/var/lib/mesos 
 --isolation=cgroups/cpu,cgroups/mem,network/port_mapping 
 --resources=ephemeral_ports:[32768-57344] --ephemeral_ports_per_container=1024
 mesos-slave: /usr/lib64/libnl-3.so.200: no version information available 
 (required by /usr/local/lib/libmesos-0.22.0.so)
 mesos-slave: /usr/lib64/libnl-route-3.so.200: no version information 
 available (required by /usr/local/lib/libmesos-0.22.0.so)
 mesos-slave: /usr/lib64/libnl-3.so.200: no version information available 
 (required by /lib/libnl-idiag-3.so.200)
 I0521 14:10:16.727126 13214 logging.cpp:172] INFO level logging started!
 I0521 14:10:16.727409 13214 main.cpp:156] Build: 2015-05-21 13:21:45 by root
 I0521 14:10:16.727432 13214 main.cpp:158] Version: 0.22.0
 I0521 14:10:16.727727 13214 containerizer.cpp:110] Using isolation: 
 cgroups/cpu,cgroups/mem,network/port_mapping
 Failed to create a containerizer: Could not create MesosContainerizer: Could 
 not create isolator network/port_mapping: Routing library check failed: 
 Capability ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE is not available
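 A quick way to confirm which libnl the loader is actually picking up is to query the capability directly (a minimal diagnostic sketch, assuming the libnl3 development headers are installed; this is not Mesos code):
 {code}
 // check_veth_cap.cpp -- build with something like:
 //   g++ check_veth_cap.cpp $(pkg-config --cflags --libs libnl-3.0)
 #include <iostream>
 #include <netlink/utils.h>  // nl_has_capability(), NL_CAPABILITY_* enum

 int main()
 {
   // Older libnl builds (for example a stale /usr/lib64/libnl-3.so.200
   // shadowing a newer install, as the "no version information" warnings
   // above suggest) report 0 here, and the port_mapping isolator then
   // refuses to start.
   if (nl_has_capability(NL_CAPABILITY_ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE)) {
     std::cout << "ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE: available" << std::endl;
   } else {
     std::cout << "ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE: NOT available" << std::endl;
   }
   return 0;
 }
 {code}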



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2748) /help generated links point to wrong URLs

2015-05-25 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558436#comment-14558436
 ] 

haosdent commented on MESOS-2748:
-

Patch: https://reviews.apache.org/r/34655/

 /help generated links point to wrong URLs
 -

 Key: MESOS-2748
 URL: https://issues.apache.org/jira/browse/MESOS-2748
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.22.1
Reporter: Marco Massenzio
Assignee: haosdent
Priority: Minor

 As reported by Michael Lunøe mlu...@mesosphere.io (see also MESOS-329 and 
 MESOS-913 for background):
 {quote}
 In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, 
 which is then converted to HTML through a JavaScript library. 
 All endpoints point to {{/help/...}}, but they need to work dynamically so 
 that a reverse proxy can do its thing. {{/mesos/help}} works and displays the 
 endpoints, but each of them needs to link to its respective {{/help/...}} 
 endpoint. 
 Note that this needs to work both for masters and for slaves. I think the 
 route to a slave's help is something like 
 {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please 
 double-check this.
 {quote}
 The fix appears to be not too complex (it would simply require manipulating 
 the generated URLs), but a quick skim of the code suggests that something 
 more substantial may be desirable too.
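 One possible shape of the fix (a hypothetical sketch only; the actual change is in the review linked above): emit the endpoint links relative to the help page itself instead of hard-coding the absolute {{/help/...}} prefix, so that whatever prefix the reverse proxy adds (e.g. {{/mesos}}) is preserved.
 {code}
 #include <string>

 // Hypothetical helper: build a markdown link for an endpoint's help page
 // that resolves relative to the page it appears on. Served at "/mesos/help"
 // (behind the proxy) the link resolves to "/mesos/help/<id>/<name>"; served
 // directly at "/help" it resolves to "/help/<id>/<name>". Assumes the help
 // page URL has no trailing slash.
 std::string helpLink(const std::string& id, const std::string& name)
 {
   return "[/" + id + "/" + name + "](help/" + id + "/" + name + ")";
 }
 {code}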



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-25 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558506#comment-14558506
 ] 

Joris Van Remoortere commented on MESOS-2254:
-

[~marco-mesos] The endpoint is already rate-limited using a 
{{process::RateLimiter}} that permits 2 calls per second. The main concern is 
that even a single call to this API gets more expensive as N executors scan all 
P processes on the system (N*P) per call.

There are opportunities to cache; however, caching introduces decisions about 
when to clear the cache (do we do it on a time-based interval? after some 
number of requests?) as well as stale data. Since the intent of this call is 
to get a current snapshot of usage data, I would prefer to avoid introducing 
explicit caching, and instead pass along enough information to allow re-use 
of the data within the same call (batching).

In this particular case, the reason we are performing the (N*P) scan is that 
the containerizer calls the usage function on the isolator once per container. 
In my opinion this is the cleanest place to cache, although I would prefer to 
call it batching. The isolator loses the information that we are asking for a 
snapshot of all containers; instead it thinks we are asking for N independent 
snapshots.

My proposal would be to modify the interface to allow a batched version of the 
call, so that the usage call can re-use any data it collects. I think this is 
the cleanest way to control when we recompute / invalidate the data.

There is also the opportunity to reduce the full stats parsing to just the 
subset of pids that we are interested in. That alone would already provide a 
~30x improvement.

P.S. This problem could also be avoided entirely by calling into a kernel 
module that exposes the right information efficiently ;-)
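A rough sketch of what such a batched call could look like (hypothetical and heavily simplified, with stand-in types instead of the real Isolator/libprocess API):
{code}
#include <map>
#include <string>
#include <vector>
#include <sys/types.h>

// Simplified stand-ins for mesos::ContainerID / mesos::ResourceStatistics.
using ContainerID = std::string;
struct ResourceStatistics { double cpusUserTimeSecs = 0; double cpusSystemTimeSecs = 0; };

// The expensive part: one full pass over /proc, cost ~P processes.
std::map<pid_t, ResourceStatistics> scanAllProcesses();

// Sum the snapshot entries belonging to one container's process tree.
ResourceStatistics aggregate(
    const ContainerID& id,
    const std::map<pid_t, ResourceStatistics>& snapshot);

class PosixCpuIsolator
{
public:
  // Current shape: called once per container, and each call rescans every
  // process on the host, so N containers cost N*P.
  ResourceStatistics usage(const ContainerID& containerId);

  // Batched shape: the containerizer passes all containers at once, the
  // /proc scan happens once, and its result is re-used across containers,
  // bringing the cost back to ~P per collection.
  std::map<ContainerID, ResourceStatistics> usage(
      const std::vector<ContainerID>& containerIds)
  {
    const std::map<pid_t, ResourceStatistics> snapshot = scanAllProcesses();

    std::map<ContainerID, ResourceStatistics> results;
    for (const ContainerID& id : containerIds) {
      results[id] = aggregate(id, snapshot);
    }
    return results;
  }
};
{code}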

 Posix CPU isolator usage call introduce high cpu load
 -

 Key: MESOS-2254
 URL: https://issues.apache.org/jira/browse/MESOS-2254
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen

 With more than 20 executors running on a slave with the posix isolator, we 
 have seen a very high CPU load (over 200%).
 Profile of one of the two threads that were consuming all the CPU time 
 (together they accounted for over 200% CPU):
 {code}
 Running Time      Self    Symbol Name
 27133.0ms   47.8% 0.0     _pthread_body  0x1adb50
 27133.0ms   47.8% 0.0      thread_start
 27133.0ms   47.8% 0.0       _pthread_start
 27133.0ms   47.8% 0.0        _pthread_body
 27133.0ms   47.8% 0.0         process::schedule(void*)
 27133.0ms   47.8% 2.0          process::ProcessManager::resume(process::ProcessBase*)
 27126.0ms   47.8% 1.0           process::ProcessBase::serve(process::Event const&)
 27125.0ms   47.8% 0.0            process::DispatchEvent::visit(process::EventVisitor*) const
 27125.0ms   47.8% 0.0             process::ProcessBase::visit(process::DispatchEvent const&)
 27125.0ms   47.8% 0.0              std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
 27124.0ms   47.8% 0.0               std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*)
 27124.0ms   47.8% 1.0                process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
 27060.0ms   47.7% 1.0                 mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
 27046.0ms   47.7% 2.0  

[jira] [Issue Comment Deleted] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-25 Thread Joris Van Remoortere (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van Remoortere updated MESOS-2254:

Comment: was deleted

(was: [~marco-mesos] The endpoint is already rate-limited using a 
{{process::RateLimiter}} that permits 2 calls per second. The main concern is 
that even a single call to this API gets more expensive as N executors scan all 
P processes on the system (N*P) per call.

There are opportunities to cache; however, caching introduces decisions about 
when to clear the cache (do we do it on a time-based interval? after some 
number of requests?) as well as stale data. Since the intent of this call is 
to get a current snapshot of usage data, I would prefer to avoid introducing 
explicit caching, and instead pass along enough information to allow re-use 
of the data within the same call (batching).

In this particular case, the reason we are performing the (N*P) scan is that 
the containerizer calls the usage function on the isolator once per container. 
In my opinion this is the cleanest place to cache, although I would prefer to 
call it batching. The isolator loses the information that we are asking for a 
snapshot of all containers; instead it thinks we are asking for N independent 
snapshots.

My proposal would be to modify the interface to allow a batched version of the 
call, so that the usage call can re-use any data it collects. I think this is 
the cleanest way to control when we recompute / invalidate the data.

There is also the opportunity to reduce the full stats parsing to just the 
subset of pids that we are interested in. That alone would already provide a 
~30x improvement.

P.S. This problem could also be avoided entirely by calling into a kernel 
module that exposes the right information efficiently ;-))

 Posix CPU isolator usage call introduce high cpu load
 -

 Key: MESOS-2254
 URL: https://issues.apache.org/jira/browse/MESOS-2254
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen

 With more than 20 executors running on a slave with the posix isolator, we 
 have seen a very high CPU load (over 200%).
 Profile of one of the two threads that were consuming all the CPU time 
 (together they accounted for over 200% CPU):
 {code}
 Running Time      Self    Symbol Name
 27133.0ms   47.8% 0.0     _pthread_body  0x1adb50
 27133.0ms   47.8% 0.0      thread_start
 27133.0ms   47.8% 0.0       _pthread_start
 27133.0ms   47.8% 0.0        _pthread_body
 27133.0ms   47.8% 0.0         process::schedule(void*)
 27133.0ms   47.8% 2.0          process::ProcessManager::resume(process::ProcessBase*)
 27126.0ms   47.8% 1.0           process::ProcessBase::serve(process::Event const&)
 27125.0ms   47.8% 0.0            process::DispatchEvent::visit(process::EventVisitor*) const
 27125.0ms   47.8% 0.0             process::ProcessBase::visit(process::DispatchEvent const&)
 27125.0ms   47.8% 0.0              std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
 27124.0ms   47.8% 0.0               std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*)
 27124.0ms   47.8% 1.0                process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
 27060.0ms   47.7% 1.0                 mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
 27046.0ms   47.7% 2.0  
 

[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load

2015-05-25 Thread Joris Van Remoortere (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558507#comment-14558507
 ] 

Joris Van Remoortere commented on MESOS-2254:
-

[~marco-mesos] The endpoint is already rate-limited using a 
{{process::RateLimiter}} that permits 2 calls per second. The main concern is 
that even a single call to this API gets more expensive as N executors scan all 
P processes on the system (N*P) per call.

There are opportunities to cache; however, caching introduces decisions about 
when to clear the cache (do we do it on a time-based interval? after some 
number of requests?) as well as stale data. Since the intent of this call is 
to get a current snapshot of usage data, I would prefer to avoid introducing 
explicit caching, and instead pass along enough information to allow re-use 
of the data within the same call (batching).

In this particular case, the reason we are performing the (N*P) scan is that 
the containerizer calls the usage function on the isolator once per container. 
In my opinion this is the cleanest place to cache, although I would prefer to 
call it batching. The isolator loses the information that we are asking for a 
snapshot of all containers; instead it thinks we are asking for N independent 
snapshots.

My proposal would be to modify the interface to allow a batched version of the 
call, so that the usage call can re-use any data it collects. I think this is 
the cleanest way to control when we recompute / invalidate the data.

There is also the opportunity to reduce the full stats parsing to just the 
subset of pids that we are interested in. That alone would already provide a 
~30x improvement.

P.S. This problem could also be avoided entirely by calling into a kernel 
module that exposes the right information efficiently ;-)

 Posix CPU isolator usage call introduce high cpu load
 -

 Key: MESOS-2254
 URL: https://issues.apache.org/jira/browse/MESOS-2254
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen

 With more than 20 executors running on a slave with the posix isolator, we 
 have seen a very high CPU load (over 200%).
 Profile of one of the two threads that were consuming all the CPU time 
 (together they accounted for over 200% CPU):
 {code}
 Running Time      Self    Symbol Name
 27133.0ms   47.8% 0.0     _pthread_body  0x1adb50
 27133.0ms   47.8% 0.0      thread_start
 27133.0ms   47.8% 0.0       _pthread_start
 27133.0ms   47.8% 0.0        _pthread_body
 27133.0ms   47.8% 0.0         process::schedule(void*)
 27133.0ms   47.8% 2.0          process::ProcessManager::resume(process::ProcessBase*)
 27126.0ms   47.8% 1.0           process::ProcessBase::serve(process::Event const&)
 27125.0ms   47.8% 0.0            process::DispatchEvent::visit(process::EventVisitor*) const
 27125.0ms   47.8% 0.0             process::ProcessBase::visit(process::DispatchEvent const&)
 27125.0ms   47.8% 0.0              std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
 27124.0ms   47.8% 0.0               std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*)
 27124.0ms   47.8% 1.0                process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
 27060.0ms   47.7% 1.0                 mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
 27046.0ms   47.7% 2.0  

[jira] [Resolved] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.

2015-05-25 Thread Timothy Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen resolved MESOS-2215.
-
Resolution: Fixed

 The Docker containerizer attempts to recover any task when checkpointing is 
 enabled, not just docker tasks.
 ---

 Key: MESOS-2215
 URL: https://issues.apache.org/jira/browse/MESOS-2215
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 0.21.0
Reporter: Steve Niemitz
Assignee: Timothy Chen

 Once the slave restarts and recovers its tasks, I see this error in the log 
 every second or so for every recovered task. Note that these were NOT 
 docker tasks:
 W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage 
 for  container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
 thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
  of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
 inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited 
 with status 1 stderr = Error: No such image or container: 
 mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
 However, the tasks themselves are still healthy and running.
 The slave was launched with --containerizers=mesos,docker.
 -
 More info: it looks like the docker containerizer is a little too ambitious 
 about recovering containers; again, this was not a docker task:
 I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
 '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
  of framework 20150109-161713-715350282-5050-290797-
 Looking into the source, it looks like the problem is that the 
 ComposingContainerizer runs recover in parallel, but neither the docker 
 containerizer nor the mesos containerizer checks whether it should recover 
 the task (i.e., whether it was the one that launched it). Perhaps this needs 
 to be written into the checkpoint somewhere?
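 One way to express that direction (a hypothetical sketch only, not the fix that actually closed this issue): checkpoint which containerizer launched each container, and have every containerizer skip entries it does not own during recovery.
 {code}
 #include <list>
 #include <string>

 // Simplified stand-in for the checkpointed per-container recovery state;
 // the "containerizer" field is hypothetical -- the idea is to persist which
 // containerizer launched the container.
 struct ContainerState
 {
   std::string containerId;
   std::string containerizer;  // "docker" or "mesos"
 };

 // Sketch of the docker containerizer's recovery loop: only touch containers
 // this containerizer actually launched, so recovering a mesos-launched
 // container never triggers a failing 'docker inspect'.
 void recoverDockerContainers(const std::list<ContainerState>& states)
 {
   for (const ContainerState& state : states) {
     if (state.containerizer != "docker") {
       continue;  // owned by the Mesos containerizer; skip it
     }
     // ... inspect and re-attach to the docker container here.
   }
 }
 {code}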



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2650) Modularize the Resource Estimator

2015-05-25 Thread Bartek Plotka (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bartek Plotka reassigned MESOS-2650:


Assignee: Bartek Plotka  (was: Niklas Quarfot Nielsen)

 Modularize the Resource Estimator
 -

 Key: MESOS-2650
 URL: https://issues.apache.org/jira/browse/MESOS-2650
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Bartek Plotka
  Labels: mesosphere

 Modularizing the resource estimator opens the door to org-specific 
 implementations.
 Test the estimator module.
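 For context, a resource estimator module would roughly take this shape (a simplified sketch with stand-in types, not the actual Mesos module headers): the slave hands the estimator a callback for current resource usage, and periodically asks how many revocable resources can be offered for oversubscription.
 {code}
 #include <functional>

 // Stand-ins for the Mesos types; real modules use mesos::Resources etc.
 struct Resources { double cpus = 0; double memMB = 0; };
 struct ResourceUsage {};  // stand-in for the usage snapshot message

 // Sketch of an org-specific estimator: receives a usage callback at
 // initialization and later reports how much can safely be oversubscribed.
 class ResourceEstimator
 {
 public:
   virtual ~ResourceEstimator() {}

   // Called once by the slave; 'usage' yields a fresh usage snapshot.
   virtual void initialize(const std::function<ResourceUsage()>& usage) = 0;

   // Called periodically; returns resources to offer as revocable.
   virtual Resources oversubscribable() = 0;
 };
 {code}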



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2650) Modularize the Resource Estimator

2015-05-25 Thread Bartek Plotka (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558668#comment-14558668
 ] 

Bartek Plotka commented on MESOS-2650:
--

https://reviews.apache.org/r/34662/

 Modularize the Resource Estimator
 -

 Key: MESOS-2650
 URL: https://issues.apache.org/jira/browse/MESOS-2650
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone
Assignee: Bartek Plotka
  Labels: mesosphere

 Modularizing the resource estimator opens the door to org-specific 
 implementations.
 Test the estimator module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)