[jira] [Commented] (MESOS-2449) Support group of tasks (Pod) constructs and API in Mesos
[ https://issues.apache.org/jira/browse/MESOS-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386819#comment-14386819 ] Timothy St. Clair commented on MESOS-2449: -- It would be ideal in this use case to handle the network + OVS abstraction first, as it's crucial to the pods. Support group of tasks (Pod) constructs and API in Mesos Key: MESOS-2449 URL: https://issues.apache.org/jira/browse/MESOS-2449 Project: Mesos Issue Type: Epic Reporter: Timothy Chen There is a common need among different frameworks that want to start a group of tasks that are either dependent on or co-located with each other. Although a framework can schedule individual tasks within the same offer and slave id, it doesn't have a way to describe dependencies, failure policies (if one of the tasks fails), network setup, group container information, etc. Creating an epic to start the discussion around the requirements folks need, and to see where we can take this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2571) Expose Memory Pressure in MemeIsolator
Chi Zhang created MESOS-2571: Summary: Expose Memory Pressure in MemeIsolator Key: MESOS-2571 URL: https://issues.apache.org/jira/browse/MESOS-2571 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2571) Expose Memory Pressure in MemeIsolator
[ https://issues.apache.org/jira/browse/MESOS-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387090#comment-14387090 ] Chi Zhang commented on MESOS-2571: -- https://reviews.apache.org/r/30546 Expose Memory Pressure in MemeIsolator -- Key: MESOS-2571 URL: https://issues.apache.org/jira/browse/MESOS-2571 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
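For background on what "memory pressure" means here: the cgroups-v1 memory controller can deliver low/medium/critical pressure notifications through an eventfd registered via cgroup.event_control, which is the kernel interface a pressure counter in the isolator would sit on top of. A minimal standalone sketch of that interface follows (not code from the review above; the cgroup path is an assumption):

{code}
// Standalone sketch of the cgroups-v1 memory.pressure_level interface.
// The cgroup path is an assumption for illustration only.
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#include <cstdio>
#include <string>

int main()
{
  const std::string cgroup = "/sys/fs/cgroup/memory/mesos/some_container";

  int efd = eventfd(0, 0);
  int pfd = open((cgroup + "/memory.pressure_level").c_str(), O_RDONLY);
  int cfd = open((cgroup + "/cgroup.event_control").c_str(), O_WRONLY);
  if (efd < 0 || pfd < 0 || cfd < 0) {
    perror("setup");
    return 1;
  }

  // Arm the notification: "<eventfd> <fd of memory.pressure_level> <level>".
  std::string control =
    std::to_string(efd) + " " + std::to_string(pfd) + " medium";
  if (write(cfd, control.c_str(), control.size()) < 0) {
    perror("write cgroup.event_control");
    return 1;
  }

  // Each read returns the number of pressure events since the last read.
  uint64_t count = 0;
  while (read(efd, &count, sizeof(count)) == sizeof(count)) {
    printf("medium memory pressure events: %llu\n",
           (unsigned long long) count);
  }
  return 0;
}
{code}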
[jira] [Commented] (MESOS-1790) Add chown option to CommandInfo.URI
[ https://issues.apache.org/jira/browse/MESOS-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386946#comment-14386946 ] Jim Klucar commented on MESOS-1790: --- Forcing the Mesos slave to be run as root to get this working is probably a non-starter for many users. I'm going to add a skip chown option and see what people think. Add chown option to CommandInfo.URI - Key: MESOS-1790 URL: https://issues.apache.org/jira/browse/MESOS-1790 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Labels: mesosphere, newbie The Mesos fetcher always chown()s the extracted executor URIs to the executor user, but sometimes this is not desirable, e.g., the setuid bit gets lost during chown() if the slave/fetcher is running as root. It would be nice to give frameworks the ability to skip the chown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
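For anyone unfamiliar with the setuid problem mentioned above: on Linux, chown() on an executable strips the setuid/setgid bits, which is exactly what the fetcher triggers when it chowns extracted URIs. A small standalone demonstration (not Mesos code; the path is made up):

{code}
// Standalone demonstration (not Mesos code) of the setuid bit being dropped
// by chown() on Linux, as described in the ticket. The file path is made up.
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  const char* path = "/tmp/fetcher_chown_demo";

  int fd = open(path, O_CREAT | O_WRONLY, 0755);
  close(fd);
  chmod(path, 04755);  // Mark the "executor binary" setuid.

  struct stat s;
  stat(path, &s);
  printf("before chown: setuid %s\n", (s.st_mode & S_ISUID) ? "set" : "clear");

  // The fetcher does the equivalent of this for every extracted URI.
  chown(path, getuid(), getgid());

  stat(path, &s);
  printf("after chown:  setuid %s\n", (s.st_mode & S_ISUID) ? "set" : "clear");

  unlink(path);
  return 0;
}
{code}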
[jira] [Created] (MESOS-2572) Add memory statistics tests.
Chi Zhang created MESOS-2572: Summary: Add memory statistics tests. Key: MESOS-2572 URL: https://issues.apache.org/jira/browse/MESOS-2572 Project: Mesos Issue Type: Task Reporter: Chi Zhang Assignee: Chi Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2570) webuiUrl doesn't get updated when a framework re-registers
[ https://issues.apache.org/jira/browse/MESOS-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14386999#comment-14386999 ] Benjamin Mahler commented on MESOS-2570: Looks like a duplicate of MESOS-703 (no support for FrameworkInfo updates). webuiUrl doesn't get updated when a framework re-registers -- Key: MESOS-2570 URL: https://issues.apache.org/jira/browse/MESOS-2570 Project: Mesos Issue Type: Bug Affects Versions: 0.22.0 Reporter: Robert Stupp Priority: Minor The webuiUrl attribute doesn't get updated when a framework re-registers. I tried to set the webuiUrl for example here: https://github.com/mesosphere/cassandra-mesos/blob/rewrite/cassandra-framework/src/main/java/io/mesosphere/mesos/frameworks/cassandra/Main.java#L165 After the first startup, the correct URL is linked in the Mesos web UI. But when the scheduler is stopped, the webuiUrl field is changed, and the framework is restarted, the old webuiUrl is still shown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2404) Add an example framework to test persistent volumes.
[ https://issues.apache.org/jira/browse/MESOS-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2404: -- Sprint: Twitter Mesos Q1 Sprint 6 Assignee: Jie Yu Story Points: 3 Add an example framework to test persistent volumes. Key: MESOS-2404 URL: https://issues.apache.org/jira/browse/MESOS-2404 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Jie Yu This serves two purposes: 1) testing the new persistence feature 2) serving as an example for others to use the new feature -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2353) Improve performance of the master's state.json endpoint for large clusters.
[ https://issues.apache.org/jira/browse/MESOS-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2353: -- Sprint: Twitter Mesos Q1 Sprint 5, Twitter Mesos Q1 Sprint 6 (was: Twitter Mesos Q1 Sprint 5) Improve performance of the master's state.json endpoint for large clusters. --- Key: MESOS-2353 URL: https://issues.apache.org/jira/browse/MESOS-2353 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: newbie, scalability, twitter The master's state.json endpoint consistently takes a long time to compute the JSON result, for large clusters: {noformat} $ time curl -s -o /dev/null localhost:5050/master/state.json Mon Jan 26 22:38:50 UTC 2015 real 0m13.174s user 0m0.003s sys 0m0.022s {noformat} This can cause the master to get backlogged if there are many state.json requests in flight. Looking at {{perf}} data, it seems most of the time is spent doing memory allocation / de-allocation. This ticket will try to capture any low hanging fruit to speed this up. Possibly we can leverage moves if they are not already being used by the compiler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
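Not from the ticket, just to make the "leverage moves" remark concrete: a generic illustration of the kind of per-element string copy that std::move eliminates when assembling a large response (the Response type below is invented for illustration, not the actual state.json code):

{code}
// Generic illustration (not Mesos code) of the copies that std::move avoids
// when assembling a large JSON-like response object.
#include <string>
#include <utility>
#include <vector>

struct Response
{
  std::vector<std::string> entries;
};

Response build(size_t n)
{
  Response response;
  response.entries.reserve(n);
  for (size_t i = 0; i < n; i++) {
    std::string entry(1024, 'x');                   // E.g. one serialized task.
    response.entries.push_back(std::move(entry));   // Move: buffer is transferred.
    // response.entries.push_back(entry);           // Copy: allocates and copies 1KB.
  }
  return response;  // NRVO / move: the whole vector is not copied.
}

int main()
{
  Response r = build(100000);
  return r.entries.empty() ? 1 : 0;
}
{code}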
[jira] [Updated] (MESOS-2571) Expose Memory Pressure in MemIsolator
[ https://issues.apache.org/jira/browse/MESOS-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-2571: -- Summary: Expose Memory Pressure in MemIsolator (was: Expose Memory Pressure in MemeIsolator) Expose Memory Pressure in MemIsolator - Key: MESOS-2571 URL: https://issues.apache.org/jira/browse/MESOS-2571 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2461) Slave should provide details on processes running in its cgroups
[ https://issues.apache.org/jira/browse/MESOS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2461: -- Story Points: 1 Slave should provide details on processes running in its cgroups Key: MESOS-2461 URL: https://issues.apache.org/jira/browse/MESOS-2461 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.21.1 Reporter: Ian Downes Assignee: Jie Yu Priority: Minor Labels: twitter The slave can optionally be put into its own cgroups for a list of subsystems, e.g., for monitoring of memory and cpu. See the slave flag: --slave_subsystems It currently refuses to start if there are any processes in its cgroups - this could be another slave or some subprocess started by a previous slave - and only logs the pids of those processes. Improve this to log details about the processes: suggest at least the process command, uid running it, and perhaps its start time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
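A sketch of where the requested details can come from, namely /proc (this is not the actual patch; the helpers and field handling are illustrative only):

{code}
// Sketch (not the actual patch) of logging details for a pid found in the
// slave's cgroup, using /proc. Helper names and parsing are illustrative.
#include <sys/types.h>
#include <unistd.h>

#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

// Read /proc/<pid>/cmdline, where arguments are NUL-separated.
static std::string commandOf(pid_t pid)
{
  std::ifstream in("/proc/" + std::to_string(pid) + "/cmdline");
  std::string raw((std::istreambuf_iterator<char>(in)),
                  std::istreambuf_iterator<char>());
  for (char& c : raw) {
    if (c == '\0') c = ' ';
  }
  return raw;
}

// Read the real uid from /proc/<pid>/status ("Uid: <real> <effective> ...").
static std::string uidOf(pid_t pid)
{
  std::ifstream in("/proc/" + std::to_string(pid) + "/status");
  std::string line;
  while (std::getline(in, line)) {
    if (line.compare(0, 4, "Uid:") == 0) {
      std::istringstream fields(line.substr(4));
      std::string uid;
      fields >> uid;
      return uid;
    }
  }
  return "?";
}

int main(int argc, char** argv)
{
  pid_t pid = argc > 1 ? std::stoi(argv[1]) : getpid();
  std::cout << "pid " << pid
            << " uid " << uidOf(pid)
            << " cmd '" << commandOf(pid) << "'" << std::endl;
  return 0;
}
{code}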
[jira] [Updated] (MESOS-2461) Slave should provide details on processes running in its cgroups
[ https://issues.apache.org/jira/browse/MESOS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2461: -- Assignee: Jie Yu Slave should provide details on processes running in its cgroups Key: MESOS-2461 URL: https://issues.apache.org/jira/browse/MESOS-2461 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.21.1 Reporter: Ian Downes Assignee: Jie Yu Priority: Minor Labels: twitter The slave can optionally be put into its own cgroups for a list of subsystems, e.g., for monitoring of memory and cpu. See the slave flag: --slave_subsystems It currently refuses to start if there are any processes in its cgroups - this could be another slave or some subprocess started by a previous slave - and only logs the pids of those processes. Improve this to log details about the processes: suggest at least the process command, uid running it, and perhaps its start time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2461) Slave should provide details on processes running in its cgroups
[ https://issues.apache.org/jira/browse/MESOS-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2461: -- Sprint: Twitter Mesos Q1 Sprint 6 Slave should provide details on processes running in its cgroups Key: MESOS-2461 URL: https://issues.apache.org/jira/browse/MESOS-2461 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.21.1 Reporter: Ian Downes Priority: Minor Labels: twitter The slave can optionally be put into its own cgroups for a list of subsystems, e.g., for monitoring of memory and cpu. See the slave flag: --slave_subsystems It currently refuses to start if there are any processes in its cgroups - this could be another slave or some subprocess started by a previous slave - and only logs the pids of those processes. Improve this to log details about the processes: suggest at least the process command, uid running it, and perhaps its start time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2350) Add support for MesosContainerizerLaunch to chroot to a specified path
[ https://issues.apache.org/jira/browse/MESOS-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2350: -- Sprint: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5, Twitter Mesos Q1 Sprint 6 (was: Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5) Add support for MesosContainerizerLaunch to chroot to a specified path -- Key: MESOS-2350 URL: https://issues.apache.org/jira/browse/MESOS-2350 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.21.1, 0.22.0 Reporter: Ian Downes Assignee: Ian Downes Labels: twitter In preparation for the MesosContainerizer to support a filesystem isolator the MesosContainerizerLauncher must support chrooting. Optionally, it should also configure the chroot environment by (re-)mounting special filesystems such as /proc and /sys and making device nodes such as /dev/zero, etc., such that the chroot environment is functional. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2332) Report per-container metrics for network bandwidth throttling
[ https://issues.apache.org/jira/browse/MESOS-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2332: -- Sprint: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5, Twitter Mesos Q1 Sprint 6 (was: Twitter Mesos Q1 Sprint 2, Twitter Mesos Q1 Sprint 3, Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5) Report per-container metrics for network bandwidth throttling - Key: MESOS-2332 URL: https://issues.apache.org/jira/browse/MESOS-2332 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Paul Brett Assignee: Paul Brett Labels: features, twitter Export metrics from the network isolation to identify scope and duration of container throttling. Packet loss can be identified from the overlimits and requeues fields of the htb qdisc report for the virtual interface, e.g. {noformat} $ tc -s -d qdisc show dev mesos19223 qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 qdisc ingress : parent :fff1 Sent 119381747824 bytes 1144549901 pkt (dropped 2044879, overlimits 0 requeues 0) backlog 0b 0p requeues 0 {noformat} Note that since a packet can be examined multiple times before transmission, overlimits can exceed total packets sent. Add to the port_mapping isolator usage() and the container statistics protobuf. Carefully consider the naming (esp tx/rx) + commenting of the protobuf fields so it's clear what these represent and how they are different to the existing dropped packet counts from the network stack. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
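A rough sketch of how those counters could be scraped for a container's virtual interface by shelling out to tc (the real isolator may do this differently; the interface name and the parsing here are simplifying assumptions):

{code}
// Rough sketch (not the isolator's actual code) of scraping dropped,
// overlimits and requeues counters for a container's virtual interface.
#include <cstdio>
#include <cstring>
#include <string>

int main()
{
  const std::string iface = "mesos19223";  // Assumed interface name.
  const std::string command = "tc -s -d qdisc show dev " + iface;

  FILE* out = popen(command.c_str(), "r");
  if (out == nullptr) {
    perror("popen");
    return 1;
  }

  char line[512];
  while (fgets(line, sizeof(line), out) != nullptr) {
    // Lines of interest look like:
    //   "Sent 158213287452 bytes 1030876393 pkt (dropped 0, overlimits 0 requeues 0)"
    unsigned long long dropped = 0, overlimits = 0, requeues = 0;
    const char* paren = strchr(line, '(');
    if (paren != nullptr &&
        sscanf(paren, "(dropped %llu, overlimits %llu requeues %llu",
               &dropped, &overlimits, &requeues) == 3) {
      printf("dropped=%llu overlimits=%llu requeues=%llu\n",
             dropped, overlimits, requeues);
    }
  }

  pclose(out);
  return 0;
}
{code}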
[jira] [Updated] (MESOS-1127) Implement the protobufs for the scheduler API
[ https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-1127: -- Sprint: Twitter Mesos Q1 Sprint 5, Twitter Mesos Q1 Sprint 6 (was: Twitter Mesos Q1 Sprint 5) Implement the protobufs for the scheduler API - Key: MESOS-1127 URL: https://issues.apache.org/jira/browse/MESOS-1127 Project: Mesos Issue Type: Task Components: framework Reporter: Benjamin Hindman Assignee: Vinod Kone Labels: twitter The default scheduler/executor interface and implementation in Mesos have a few drawbacks: (1) The interface is fairly high-level which makes it hard to do certain things, for example, handle events (callbacks) in batch. This can have a big impact on the performance of schedulers (for example, writing task updates that need to be persisted). (2) The implementation requires writing a lot of boilerplate JNI and native Python wrappers when adding additional API components. The plan is to provide a lower-level API that can easily be used to implement the higher-level API that is currently provided. This will also open the door to more easily building native-language Mesos libraries (i.e., not needing the C++ shim layer) and building new higher-level abstractions on top of the lower-level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2438) Improve support for streaming HTTP Responses in libprocess.
[ https://issues.apache.org/jira/browse/MESOS-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2438: -- Sprint: Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5, Twitter Mesos Q1 Sprint 6 (was: Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5) Improve support for streaming HTTP Responses in libprocess. --- Key: MESOS-2438 URL: https://issues.apache.org/jira/browse/MESOS-2438 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: twitter Currently libprocess' HTTP::Response supports a PIPE construct for doing streaming responses: {code} struct Response { ... // Either provide a body, an absolute path to a file, or a // pipe for streaming a response. Distinguish between the cases // using 'type' below. // // BODY: Uses 'body' as the body of the response. These may be // encoded using gzip for efficiency, if 'Content-Encoding' is not // already specified. // // PATH: Attempts to perform a 'sendfile' operation on the file // found at 'path'. // // PIPE: Splices data from 'pipe' using 'Transfer-Encoding=chunked'. // Note that the read end of the pipe will be closed by libprocess // either after the write end has been closed or if the socket the // data is being spliced to has been closed (i.e., nobody is // listening any longer). This can cause writes to the pipe to // generate a SIGPIPE (which will terminate your program unless you // explicitly ignore them or handle them). // // In all cases (BODY, PATH, PIPE), you are expected to properly // specify the 'Content-Type' header, but the 'Content-Length' and // or 'Transfer-Encoding' headers will be filled in for you. enum { NONE, BODY, PATH, PIPE } type; ... }; {code} This interface is too low level and difficult to program against: * Connection closure is signaled with SIGPIPE, which is difficult for callers to deal with (must suppress SIGPIPE locally or globally in order to get EPIPE instead). * Pipes are generally for inter-process communication, and the pipe has finite size. With a blocking pipe the caller must deal with blocking when the pipe's buffer limit is exceeded. With a non-blocking pipe, the caller must deal with retrying the write. We'll want to consider a few use cases: # Sending an HTTP::Response with streaming data. # Making a request with http::get and http::post in which the data is returned in a streaming manner. # Making a request in which the request content is streaming. This ticket will focus on 1 as it is required for the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
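To make the SIGPIPE point concrete: a write to a pipe whose read end has been closed delivers SIGPIPE, which terminates the process by default; only after suppressing the signal does the caller get EPIPE from write(). A standalone illustration (not libprocess code):

{code}
// Standalone illustration (not libprocess code) of the SIGPIPE behavior the
// ticket describes: writing to a pipe whose read end has been closed kills
// the process unless SIGPIPE is suppressed, in which case write() fails
// with EPIPE instead.
#include <unistd.h>

#include <cerrno>
#include <csignal>
#include <cstdio>

int main()
{
  // Without this, the write below would terminate the process with SIGPIPE.
  signal(SIGPIPE, SIG_IGN);

  int fds[2];
  if (pipe(fds) != 0) {
    perror("pipe");
    return 1;
  }

  close(fds[0]);  // Simulate the reader (the socket side) going away.

  ssize_t n = write(fds[1], "data", 4);
  if (n == -1 && errno == EPIPE) {
    printf("write failed with EPIPE as expected\n");
  }

  close(fds[1]);
  return 0;
}
{code}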
[jira] [Updated] (MESOS-2462) Add option for Subprocess to set a death signal for the forked child
[ https://issues.apache.org/jira/browse/MESOS-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2462: -- Sprint: Twitter Mesos Q1 Sprint 6 Add option for Subprocess to set a death signal for the forked child Key: MESOS-2462 URL: https://issues.apache.org/jira/browse/MESOS-2462 Project: Mesos Issue Type: Improvement Components: isolation Affects Versions: 0.21.1 Reporter: Ian Downes Assignee: Jie Yu Priority: Minor Labels: twitter Currently, children forked by the slave, including those through Subprocess, will continue running if the slave exits. For some processes, including helper processes like the fetcher, du, or perf, we'd like them to be terminated when the slave exits. Add support to Subprocess to optionally set a death signal for the child, e.g., setting SIGTERM would mean the child would get SIGTERM when the slave terminates. This can be done (*after forking*) with PR_SET_PDEATHSIG. See man prctl. It is preserved through an exec call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
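For reference, the prctl call in question goes in the child between fork() and exec(). A standalone sketch, not the actual Subprocess change (the exec'd command is arbitrary):

{code}
// Standalone sketch (not the actual Subprocess change) of arming the parent
// death signal in the child after fork(), so the child receives SIGTERM if
// the parent (e.g. the slave) exits.
#include <signal.h>
#include <sys/prctl.h>
#include <unistd.h>

#include <cstdio>

int main()
{
  pid_t pid = fork();
  if (pid == -1) {
    perror("fork");
    return 1;
  }

  if (pid == 0) {
    // Child: must be set after the fork; it is preserved across execve()
    // (unless exec'ing a set-user-ID binary).
    if (prctl(PR_SET_PDEATHSIG, SIGTERM) == -1) {
      perror("prctl");
      _exit(1);
    }

    // Guard against the parent having already exited before prctl() ran.
    if (getppid() == 1) {
      _exit(1);
    }

    execlp("sleep", "sleep", "1000", (char*) NULL);
    _exit(1);  // Only reached if exec failed.
  }

  // Parent: when this process exits, the child gets SIGTERM.
  return 0;
}
{code}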
[jira] [Updated] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2367: -- Sprint: Twitter Mesos Q1 Sprint 5 (was: Twitter Mesos Q1 Sprint 5, Twitter Mesos Q1 Sprint 6) Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Assignee: Jie Yu Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-713) Support for adding subsystems to existing cgroup hierarchies.
[ https://issues.apache.org/jira/browse/MESOS-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes resolved MESOS-713. -- Resolution: Won't Fix Remounting cgroups is not really recommended and can cause significant confusion to the kernel. Support for adding subsystems to existing cgroup hierarchies. - Key: MESOS-713 URL: https://issues.apache.org/jira/browse/MESOS-713 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Benjamin Mahler Priority: Minor Labels: newbie, twitter Currently if a slave is restarted with additional subsystems, it will refuse to proceed if those subsystems are not attached to the existing hierarchy. It's possible to add subsystems to existing hierarchies via re-mounting: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-Attaching_Subsystems_to_and_Detaching_Them_From_an_Existing_Hierarchy.html We can add support for this by calling mount with the MS_REMOUNT option. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
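For the record, the remount the ticket refers to would have looked roughly like this (illustrative only; as the resolution above notes, the approach was declined, and the hierarchy path and subsystem list are assumptions):

{code}
// Illustrative only (the ticket was resolved "Won't Fix"): adding the memory
// subsystem to an existing cpu,cpuacct hierarchy by remounting it, which is
// what the linked Red Hat document describes. Paths are assumptions.
#include <sys/mount.h>

#include <cstdio>

int main()
{
  const char* hierarchy = "/sys/fs/cgroup/cpu,cpuacct";

  // Equivalent of: mount -o remount,cpu,cpuacct,memory cgroup <hierarchy>
  if (mount("cgroup", hierarchy, "cgroup",
            MS_REMOUNT, "cpu,cpuacct,memory") == -1) {
    perror("mount");
    return 1;
  }

  printf("remounted %s with cpu,cpuacct,memory\n", hierarchy);
  return 0;
}
{code}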
[jira] [Commented] (MESOS-2200) bogus docker images result in bad error message to scheduler
[ https://issues.apache.org/jira/browse/MESOS-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387337#comment-14387337 ] Steven Borrelli commented on MESOS-2200: Wanted to +1 this issue. We have a deployment pipeline where there are times a user tries to deploy a nonexistent image. Right now we repeatedly get a TASK_FAILED in mesos with no output, so an admin has to look through docker logs to see what happened. bogus docker images result in bad error message to scheduler Key: MESOS-2200 URL: https://issues.apache.org/jira/browse/MESOS-2200 Project: Mesos Issue Type: Bug Components: containerization, docker Reporter: Jay Buffington Assignee: Joerg Schad Labels: mesosphere When a scheduler specifies a bogus image in ContainerInfo mesos doesn't tell the scheduler that the docker pull failed or why. This error is logged in the mesos-slave log, but it isn't given to the scheduler (as far as I can tell): {noformat} E1218 23:50:55.406230 8123 slave.cpp:2730] Container '8f70784c-3e40-4072-9ca2-9daed23f15ff' for executor 'thermos-1418946354013-xxx-xxx-curl-0-f500cc41-dd0a-4338-8cbc-d631cb588bb1' of framework '20140522-213145-1749004561-5050-29512-' failed to start: Failed to 'docker pull docker-registry.example.com/doesntexist/hello1.1:latest': exit status = exited with status 1 stderr = 2014/12/18 23:50:55 Error: image doesntexist/hello1.1 not found {noformat} If the docker image is not in the registry, the scheduler should give the user an error message. If docker pull failed because of networking issues, it should be retried. Mesos should give the scheduler enough information to be able to make that decision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2233) Run ASF CI mesos builds inside docker
[ https://issues.apache.org/jira/browse/MESOS-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2233: -- Sprint: Twitter Mesos Q1 Sprint 6 Story Points: 5 Run ASF CI mesos builds inside docker - Key: MESOS-2233 URL: https://issues.apache.org/jira/browse/MESOS-2233 Project: Mesos Issue Type: Task Reporter: Vinod Kone There are several limitations to the Mesos project's current state of CI, which is run on builds.a.o -- Only runs on Ubuntu -- Doesn't run any tests that deal with cgroups -- Doesn't run any tests that need root permissions Now that ASF CI supports docker (https://issues.apache.org/jira/browse/BUILDS-25), it would be great for the Mesos project to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MESOS-2508) Slave recovering a docker container results in Unknown container error
[ https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jay Buffington resolved MESOS-2508. --- Resolution: Duplicate Closing as dup of https://issues.apache.org/jira/browse/MESOS-2215 Slave recovering a docker container results in Unknown container error --- Key: MESOS-2508 URL: https://issues.apache.org/jira/browse/MESOS-2508 Project: Mesos Issue Type: Bug Components: containerization, docker, slave Affects Versions: 0.21.1 Environment: Ubuntu 14.04.2 LTS Docker 1.5.0 (same error with 1.4.1) Mesos 0.21.1 installed from mesosphere ubuntu repo Marathon 0.8.0 installed from mesosphere ubuntu repo Reporter: Geoffroy Jabouley Priority: Minor I'm seeing some error logs occurring during a slave recovery of a Mesos task running in a docker container. It does not impede the slave recovery process, as the mesos task is still active and running on the slave after the recovery. But there is something not working properly when the slave is recovering my docker container. The slave detects my container as an Unknown container. Cluster status: - 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host. - checkpointing is activated on both slave and framework - use native docker containerizer - 1 mesos task, started using marathon, is running inside a docker container and is monitored by the mesos-slave Action: - restart the mesos-slave process (sudo restart mesos-slave) Expected: - docker container still running - mesos task still running - no error in the mesos slave log regarding recovery process Seen: - docker container still running - mesos task still running - {color:red}Several errors *Unknown container* in the mesos slave log during recovery process{color} --- For what it's worth, here are my investigations: 1) The mesos task starts fine in the docker container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. Docker container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by the Mesos docker containerizer _I guess_... 
{code} I0317 09:56:14.300439 2784 slave.cpp:1083] Got assigned task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860- I0317 09:56:14.380702 2784 slave.cpp:1193] Launching task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860- I0317 09:56:14.384466 2784 slave.cpp:3997] Launching executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860- in work directory '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad' I0317 09:56:14.390207 2784 slave.cpp:1316] Queuing task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework '20150311-150951-3982541578-5050-50860- I0317 09:56:14.421787 2782 docker.cpp:927] Starting container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework '20150311-150951-3982541578-5050-50860-' I0317 09:56:15.784143 2781 docker.cpp:633] Checkpointing pid 27080 to '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid' I0317 09:56:15.789443 2784 slave.cpp:2840] Monitoring executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework '20150311-150951-3982541578-5050-50860-' in container 'adb71dc4-c07d-42a9-8fed-264c241668ad' I0317 09:56:15.862642 2784 slave.cpp:1860] Got registration for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860- from executor(1)@10.195.96.237:36021 I0317 09:56:15.865319 2784 slave.cpp:1979] Flushing queued task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860- I0317 09:56:15.885414 2787 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860- from executor(1)@10.195.96.237:36021 I0317 09:56:15.885902 2787
[jira] [Updated] (MESOS-2571) Expose Memory Pressure in MemeIsolator
[ https://issues.apache.org/jira/browse/MESOS-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2571: -- Sprint: Twitter Mesos Q1 Sprint 6 Expose Memory Pressure in MemeIsolator -- Key: MESOS-2571 URL: https://issues.apache.org/jira/browse/MESOS-2571 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2402) MesosContainerizerDestroyTest.LauncherDestroyFailure is flaky
[ https://issues.apache.org/jira/browse/MESOS-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378974#comment-14378974 ] Vinod Kone edited comment on MESOS-2402 at 3/30/15 10:30 PM: - commit f98f26fa50e31ab399d156f942f4fb92edcd926e Author: Vinod Kone vinodk...@gmail.com Date: Tue Mar 24 14:45:44 2015 -0700 Fixed flaky MesosContainerizerDestroyTest tests. Review: https://reviews.apache.org/r/32454 was (Author: vinodkone): commit 0c19d17eb8d24af5db45efb6e5e05de7bdfeb41b Author: Vinod Kone vinodk...@gmail.com Date: Tue Mar 24 14:45:44 2015 -0700 Fixed flaky MesosContainerizerDestroyTest tests. Review: https://reviews.apache.org/r/32454 MesosContainerizerDestroyTest.LauncherDestroyFailure is flaky - Key: MESOS-2402 URL: https://issues.apache.org/jira/browse/MESOS-2402 Project: Mesos Issue Type: Bug Affects Versions: 0.23.0 Reporter: Vinod Kone Assignee: Vinod Kone Fix For: 0.23.0 Failed to os::execvpe in childMain. Never seen this one before. {code} [ RUN ] MesosContainerizerDestroyTest.LauncherDestroyFailure Using temporary directory '/tmp/MesosContainerizerDestroyTest_LauncherDestroyFailure_QpjQEn' I0224 18:55:49.326912 21391 containerizer.cpp:461] Starting container 'test_container' for executor 'executor' of framework '' I0224 18:55:49.332252 21391 launcher.cpp:130] Forked child with pid '23496' for container 'test_container' ABORT: (src/subprocess.cpp:165): Failed to os::execvpe in childMain *** Aborted at 1424832949 (unix time) try date -d @1424832949 if you are using GNU date *** PC: @ 0x2b178c5db0d5 (unknown) I0224 18:55:49.340955 21392 process.cpp:2117] Dropped / Lost event for PID: scheduler-509d37ac-296f-4429-b101-af433c1800e9@127.0.1.1:39647 I0224 18:55:49.342300 21386 containerizer.cpp:911] Destroying container 'test_container' *** SIGABRT (@0x3e85bc8) received by PID 23496 (TID 0x2b178f9f0700) from PID 23496; stack trace: *** @ 0x2b178c397cb0 (unknown) @ 0x2b178c5db0d5 (unknown) @ 0x2b178c5de83b (unknown) @ 0x87a945 _Abort() @ 0x2b1789f610b9 process::childMain() I0224 18:55:49.391793 21386 containerizer.cpp:1120] Executor for container 'test_container' has exited I0224 18:55:49.400478 21391 process.cpp:2770] Handling HTTP event for process 'metrics' with path: '/metrics/snapshot' tests/containerizer_tests.cpp:485: Failure Value of: metrics.values[containerizer/mesos/container_destroy_errors] Actual: 16-byte object 02-00 00-00 17-2B 00-00 E0-86 0E-04 00-00 00-00 Expected: 1u Which is: 1 [ FAILED ] MesosContainerizerDestroyTest.LauncherDestroyFailure (89 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2573) Use Memory Test Helper to improve some test code.
Chi Zhang created MESOS-2573: Summary: Use Memory Test Helper to improve some test code. Key: MESOS-2573 URL: https://issues.apache.org/jira/browse/MESOS-2573 Project: Mesos Issue Type: Improvement Reporter: Chi Zhang Assignee: Chi Zhang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2572) Add memory statistics tests.
[ https://issues.apache.org/jira/browse/MESOS-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chi Zhang updated MESOS-2572: - Labels: twitter (was: iso twitter) Add memory statistics tests. Key: MESOS-2572 URL: https://issues.apache.org/jira/browse/MESOS-2572 Project: Mesos Issue Type: Task Reporter: Chi Zhang Assignee: Chi Zhang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2572) Add memory statistics tests.
[ https://issues.apache.org/jira/browse/MESOS-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387387#comment-14387387 ] Chi Zhang commented on MESOS-2572: -- [~jieyu], i've broken up the diff into 5 small patches. Let me know when you have time to take a look at them. I will re-rebase, test and post them then. Add memory statistics tests. Key: MESOS-2572 URL: https://issues.apache.org/jira/browse/MESOS-2572 Project: Mesos Issue Type: Task Reporter: Chi Zhang Assignee: Chi Zhang Labels: iso, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2574) Namespace handle symlinks in port_mapping isolator should not be under /var/run/netns
Jie Yu created MESOS-2574: - Summary: Namespace handle symlinks in port_mapping isolator should not be under /var/run/netns Key: MESOS-2574 URL: https://issues.apache.org/jira/browse/MESOS-2574 Project: Mesos Issue Type: Bug Reporter: Jie Yu Consider putting symlinks under /var/run/mesos/netns. This is because the 'ip' command assumes all files under /var/run/netns are valid namespaces without duplication, and it has commands like 'ip -all netns exec ip link' to list all links for each network namespace. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2564) Kill superfluous forward declaration comments.
[ https://issues.apache.org/jira/browse/MESOS-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387545#comment-14387545 ] Benjamin Mahler commented on MESOS-2564: I'm a +1 but I believe [~benjaminhindman] enforced this style from the beginning. Kill superfluous forward declaration comments. -- Key: MESOS-2564 URL: https://issues.apache.org/jira/browse/MESOS-2564 Project: Mesos Issue Type: Improvement Reporter: Alexander Rukletsov Priority: Minor Labels: easyfix, newbie We often prepend forward declarations with a comment, which is pretty useless, e.g.: {code} // Forward declarations. class LogStorageProcess; {code} or {code} // Forward declarations. namespace registry { class Slaves; } class Authorizer; class WhitelistWatcher; {code} This JIRA aims to clean up such comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2572) Add memory statistics tests.
[ https://issues.apache.org/jira/browse/MESOS-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chi Zhang updated MESOS-2572: - Labels: iso twitter (was: twitter) Add memory statistics tests. Key: MESOS-2572 URL: https://issues.apache.org/jira/browse/MESOS-2572 Project: Mesos Issue Type: Task Reporter: Chi Zhang Assignee: Chi Zhang Labels: iso, twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2191) Add ContainerId to the TaskStatus message
[ https://issues.apache.org/jira/browse/MESOS-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387724#comment-14387724 ] Marcel Neuhausler commented on MESOS-2191: -- Trying to answer your questions in reverse order: 4. Why is the internal mesos container id sufficient? I need to correlate the mesos task id with the container-name in Docker. mesos- + mesos container id == name of container in Docker. That name becomes visible to the user if you run cAdvisor for example on your mesos slaves. Obviously it would be even nicer if Mesos would return the full container-name. 3. As it is in the code, there's one Container per Executor, so you could theoretically use the ExecutorID for the correlation you mention. Why is that not enough? It is my understanding that the ExecutorId is equal to the TaskId, so that wouldn't help in figuring out what the corresponding container name in Docker would be. 2. Are you attempting to extract, ultimately, the docker container ID? If so, how would you do it? No, I want to get the name of the container in Docker. 1. What exactly is your goal, i.e. if you had the mesos Container ID how would you use it? We have two use-cases in which our own Mesos Framework needs to know about the container name in Docker: a) We use cAdvisor to automatically collect metrics from running docker containers. The collector process talks to our Framework to get the container name for a corresponding mesos task. b) Networking with Project Calico: When you set up a network-group in Calico you have to pass the Docker container name to Calico. Our Framework interacts with Calico and has to be able to correlate the Mesos TaskID with the Docker container name. In general, I still have a hard time understanding why you try to hide the Mesos id even though that id becomes visible to the user, for example in cAdvisor as part of the docker container name. I also would expect that for completeness reasons you would return all of the IDs in the Task message protobuf data-structure (messages.proto). Btw, ideally you would return the full container-name (mesos-683fd6a3-9dbc-4180-bb64-bf8b961cc50e) in the task message. Add ContainerId to the TaskStatus message - Key: MESOS-2191 URL: https://issues.apache.org/jira/browse/MESOS-2191 Project: Mesos Issue Type: Wish Components: containerization Reporter: Marcel Neuhausler Assignee: Alexander Rojas Labels: mesosphere {{TaskStatus}} provides the frameworks with certain information ({{executorId}}, {{slaveId}}, etc.) which is useful when collecting statistics about cluster performance; however, it is difficult to associate tasks with the container they are executed in, since this information stays within mesos itself. Therefore it would be good to provide the framework scheduler with this information by adding a new field in the {{TaskStatus}} message. See comments for a use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2564) Kill superfluous forward declaration comments.
[ https://issues.apache.org/jira/browse/MESOS-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387802#comment-14387802 ] Benjamin Hindman commented on MESOS-2564: - The intention was to force people to capture all of their forward declarations at the top of the file, and by explicitly calling them out with a comment it made it more clear that is where they belong (where as it's already common convention to always put your includes at the beginning of the file, although not necessary). I don't really see a ton of value of removing these, I don't see how/why they are negatively impacting the code base? Kill superfluous forward declaration comments. -- Key: MESOS-2564 URL: https://issues.apache.org/jira/browse/MESOS-2564 Project: Mesos Issue Type: Improvement Reporter: Alexander Rukletsov Priority: Minor Labels: easyfix, newbie We often prepend forward declarations with a comment, which is pretty useless, e.g.: {code} // Forward declarations. class LogStorageProcess; {code} or {code} // Forward declarations. namespace registry { class Slaves; } class Authorizer; class WhitelistWatcher; {code} This JIRA aims to clean up such comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2542) mesos containerizer should not allow tasks to run as root inside scheduler specified rootfs
[ https://issues.apache.org/jira/browse/MESOS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14387897#comment-14387897 ] Jay Buffington commented on MESOS-2542: --- Also we should consider no_new_privs. From https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt {quote} With no_new_privs set, execve promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set {quote} mesos containerizer should not allow tasks to run as root inside scheduler specified rootfs --- Key: MESOS-2542 URL: https://issues.apache.org/jira/browse/MESOS-2542 Project: Mesos Issue Type: Technical task Components: containerization Reporter: Jay Buffington If a task has root in the container it’s fairly well documented how to break out of the chroot and get root privs outside the container. Therefore, when the mesos containerizer specifies an arbitrary rootfs to chroot into we need to be careful to not allow the task to get root access. There are likely at least two options to consider here. One is user namespaces[1] wherein the user has “root” inside the container, but outside the container that root user is mapped to an unprivileged user. Another option is to mount all user specified rootfs with a nosetuid flag and strictly control /etc/passwd. [1] https://lwn.net/Articles/532593/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
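The knob itself is a single prctl, set in the task's context before exec. A sketch, not a proposed patch (the fallback #define is only for older libc headers):

{code}
// Sketch (not a proposed patch) of setting no_new_privs before exec'ing the
// task, so that setuid/setgid bits and file capabilities in the scheduler
// supplied rootfs cannot grant additional privileges.
#include <sys/prctl.h>
#include <unistd.h>

#include <cstdio>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 38  // Fallback for older libc headers.
#endif

int main(int argc, char** argv)
{
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) {
    perror("prctl(PR_SET_NO_NEW_PRIVS)");
    return 1;
  }

  // From here on, execve() cannot grant privileges the process does not
  // already have: setuid binaries run with the caller's uid, and file
  // capabilities are not added to the permitted set.
  if (argc > 1) {
    execvp(argv[1], argv + 1);
    perror("execvp");
    return 1;
  }
  return 0;
}
{code}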