[jira] [Updated] (MESOS-2759) Could not create MesosContainerizer: Could not create isolator network/port_mapping: Routing library check failed: Capability ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE is not available
     [ https://issues.apache.org/jira/browse/MESOS-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anuj Gupta updated MESOS-2759:
------------------------------
    Target Version/s: 0.22.1, 0.22.0  (was: 0.22.0, 0.22.1)

> Could not create MesosContainerizer: Could not create isolator network/port_mapping: Routing library check failed: Capability ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE is not available
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>             Key: MESOS-2759
>             URL: https://issues.apache.org/jira/browse/MESOS-2759
>         Project: Mesos
>      Issue Type: Bug
>        Reporter: Anuj Gupta
>        Priority: Critical
>
> {code}
> [root@localhost ~]# mesos-slave --master=127.0.0.1:5050 --log_dir=/var/log/mesos --work_dir=/var/lib/mesos --isolation=cgroups/cpu,cgroups/mem,network/port_mapping --resources=ephemeral_ports:[32768-57344] --ephemeral_ports_per_container=1024
> mesos-slave: /usr/lib64/libnl-3.so.200: no version information available (required by /usr/local/lib/libmesos-0.22.0.so)
> mesos-slave: /usr/lib64/libnl-route-3.so.200: no version information available (required by /usr/local/lib/libmesos-0.22.0.so)
> mesos-slave: /usr/lib64/libnl-3.so.200: no version information available (required by /lib/libnl-idiag-3.so.200)
> I0521 14:10:16.727126 13214 logging.cpp:172] INFO level logging started!
> I0521 14:10:16.727409 13214 main.cpp:156] Build: 2015-05-21 13:21:45 by root
> I0521 14:10:16.727432 13214 main.cpp:158] Version: 0.22.0
> I0521 14:10:16.727727 13214 containerizer.cpp:110] Using isolation: cgroups/cpu,cgroups/mem,network/port_mapping
> Failed to create a containerizer: Could not create MesosContainerizer: Could not create isolator network/port_mapping: Routing library check failed: Capability ROUTE_LINK_VETH_GET_PEER_OWN_REFERENCE is not available
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-2748) /help generated links point to wrong URLs
     [ https://issues.apache.org/jira/browse/MESOS-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558436#comment-14558436 ]

haosdent commented on MESOS-2748:
---------------------------------

Patch: https://reviews.apache.org/r/34655/

> /help generated links point to wrong URLs
> -----------------------------------------
>
>              Key: MESOS-2748
>              URL: https://issues.apache.org/jira/browse/MESOS-2748
>          Project: Mesos
>       Issue Type: Bug
> Affects Versions: 0.22.1
>         Reporter: Marco Massenzio
>         Assignee: haosdent
>         Priority: Minor
>
> As reported by Michael Lunøe mlu...@mesosphere.io (see also MESOS-329 and MESOS-913 for background):
> {quote}
> In {{mesos/3rdparty/libprocess/src/help.cpp}} a markdown file is created, which is then converted to HTML through a JavaScript library.
> All endpoints point to {{/help/...}}; they need to work dynamically for a reverse proxy to do its thing. {{/mesos/help}} works, and displays the endpoints, but they each need to go to their respective {{/help/...}} endpoint.
> Note that this needs to work both for the master and for slaves. I think the route to slave help is something like {{/mesos/slaves/20150518-210216-1695027628-5050-1366-S0/help}}, but please double-check this.
> {quote}
> The fix appears not too complex (it would simply require manipulating the generated URL), but a quick skim of the code suggests that something more substantial may be desirable too.
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
     [ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558436#comment-14558436 ]

Joris Van Remoortere commented on MESOS-2254:
---------------------------------------------

[~marco-mesos] The endpoint is already rate-limited using a {{process::RateLimiter}} that permits 2 calls per second. The main concern is that even a single call to this API is expensive: with N executors, each call scans all P processes on the system, i.e. N*P work per call.

There are opportunities to cache; however, caching introduces decisions about when to invalidate (on a time-based interval? after some number of requests?) as well as stale data. Since the intent of this call is to get a current snapshot of usage data, I would prefer to avoid introducing explicit caching, and instead pass along enough information to allow re-use of the data within the same call (batching).

In this particular case, the reason we perform the N*P scan is that the containerizer calls the usage function on the isolator once per container. In my opinion this is the cleanest place to cache, although I would prefer to call it batching. The isolator loses the information that we are asking for a snapshot of all containers; rather, it thinks we are asking for N independent snapshots. My proposal would be to modify the interface to allow a batched version of the call, so that the usage call can re-use any data it collects. I think this is the cleanest way to control when we recompute / invalidate the data.

There is also the opportunity to reduce the full stats parsing to just the subset of pids we are interested in. This alone would already provide a ~30x improvement.

P.S. this problem can also be completely avoided by calling into a kernel module that exposes the right information efficiently ;-)

> Posix CPU isolator usage call introduce high cpu load
> -----------------------------------------------------
>
>             Key: MESOS-2254
>             URL: https://issues.apache.org/jira/browse/MESOS-2254
>         Project: Mesos
>      Issue Type: Bug
>        Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we have seen a very high cpu load (over 200%). From profiling one thread (there were two, taking up all the cpu time; the total CPU time was over 200%):
> {code}
> Running Time      Self   Symbol Name
> 27133.0ms   47.8%  0.0   _pthread_body  0x1adb50
> 27133.0ms   47.8%  0.0    thread_start
> 27133.0ms   47.8%  0.0     _pthread_start
> 27133.0ms   47.8%  0.0      _pthread_body
> 27133.0ms   47.8%  0.0       process::schedule(void*)
> 27133.0ms   47.8%  2.0        process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms   47.8%  1.0         process::ProcessBase::serve(process::Event const&)
> 27125.0ms   47.8%  0.0          process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms   47.8%  0.0           process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms   47.8%  0.0            std::__1::function<void (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms   47.8%  0.0             std::__1::__function::__func<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*), std::__1::allocator<process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void (process::ProcessBase*)>::operator()(process::ProcessBase*)
> 27124.0ms   47.8%  1.0              process::Future<mesos::ResourceStatistics> process::dispatch<mesos::ResourceStatistics, mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> const&, process::Future<mesos::ResourceStatistics> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*) const
> 27060.0ms   47.7%  1.0               mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
> 27046.0ms   47.7%  2.0
> {code}
[jira] [Issue Comment Deleted] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
     [ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van Remoortere updated MESOS-2254:
----------------------------------------
    Comment: was deleted

(was: [~marco-mesos] The endpoint is already rate-limited using a {{process::RateLimiter}} that permits 2 calls per second. The main concern is that even a single call to this API is expensive: with N executors, each call scans all P processes on the system, i.e. N*P work per call. There are opportunities to cache; however, caching introduces decisions about when to invalidate (on a time-based interval? after some number of requests?) as well as stale data. Since the intent of this call is to get a current snapshot of usage data, I would prefer to avoid introducing explicit caching, and instead pass along enough information to allow re-use of the data within the same call (batching). In this particular case, the reason we perform the N*P scan is that the containerizer calls the usage function on the isolator once per container. In my opinion this is the cleanest place to cache, although I would prefer to call it batching. The isolator loses the information that we are asking for a snapshot of all containers; rather, it thinks we are asking for N independent snapshots. My proposal would be to modify the interface to allow a batched version of the call, so that the usage call can re-use any data it collects. I think this is the cleanest way to control when we recompute / invalidate the data. There is also the opportunity to reduce the full stats parsing to just the subset of pids we are interested in. This alone would already provide a ~30x improvement. P.S. this problem can also be completely avoided by calling into a kernel module that exposes the right information efficiently ;-))
[jira] [Commented] (MESOS-2254) Posix CPU isolator usage call introduce high cpu load
     [ https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558507#comment-14558507 ]

Joris Van Remoortere commented on MESOS-2254:
---------------------------------------------

[~marco-mesos] The endpoint is already rate-limited using a {{process::RateLimiter}} that permits 2 calls per second. The main concern is that even a single call to this API is expensive: with N executors, each call scans all P processes on the system, i.e. N*P work per call.

There are opportunities to cache; however, caching introduces decisions about when to invalidate (on a time-based interval? after some number of requests?) as well as stale data. Since the intent of this call is to get a current snapshot of usage data, I would prefer to avoid introducing explicit caching, and instead pass along enough information to allow re-use of the data within the same call (batching).

In this particular case, the reason we perform the N*P scan is that the containerizer calls the usage function on the isolator once per container. In my opinion this is the cleanest place to cache, although I would prefer to call it batching. The isolator loses the information that we are asking for a snapshot of all containers; rather, it thinks we are asking for N independent snapshots. My proposal would be to modify the interface to allow a batched version of the call, so that the usage call can re-use any data it collects. I think this is the cleanest way to control when we recompute / invalidate the data.

There is also the opportunity to reduce the full stats parsing to just the subset of pids we are interested in. This alone would already provide a ~30x improvement.

P.S. this problem can also be completely avoided by calling into a kernel module that exposes the right information efficiently ;-)
[jira] [Resolved] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
     [ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen resolved MESOS-2215.
---------------------------------
    Resolution: Fixed

> The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
> ------------------------------------------------------------------------------------------------------------
>
>             Key: MESOS-2215
>             URL: https://issues.apache.org/jira/browse/MESOS-2215
>         Project: Mesos
>      Issue Type: Bug
>      Components: docker
> Affects Versions: 0.21.0
>        Reporter: Steve Niemitz
>        Assignee: Timothy Chen
>
> Once the slave restarts and recovers the task, I see this error in the log every second or so for all tasks that were recovered. Note that these were NOT docker tasks:
> {code}
> W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
> {code}
> However, the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker.
>
> More info: it looks like the docker containerizer is a little too ambitious about recovering containers; again, this was not a docker task:
> {code}
> I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797-
> {code}
> Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover the task (i.e. whether it was the one that launched it). Perhaps this needs to be written into the checkpoint somewhere?
[jira] [Assigned] (MESOS-2650) Modularize the Resource Estimator
     [ https://issues.apache.org/jira/browse/MESOS-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bartek Plotka reassigned MESOS-2650:
------------------------------------
    Assignee: Bartek Plotka  (was: Niklas Quarfot Nielsen)

> Modularize the Resource Estimator
> ---------------------------------
>
>             Key: MESOS-2650
>             URL: https://issues.apache.org/jira/browse/MESOS-2650
>         Project: Mesos
>      Issue Type: Task
>        Reporter: Vinod Kone
>        Assignee: Bartek Plotka
>          Labels: mesosphere
>
> Modularizing the resource estimator opens the door for org-specific implementations. Test the estimator module.
[jira] [Commented] (MESOS-2650) Modularize the Resource Estimator
     [ https://issues.apache.org/jira/browse/MESOS-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558668#comment-14558668 ]

Bartek Plotka commented on MESOS-2650:
--------------------------------------

https://reviews.apache.org/r/34662/